| {{machine learning bar}}
| {{Redirect|Perceptrons|the book of that title|Perceptrons (book)}}
| |
| | |
| In [[machine learning]], the '''perceptron''' is an algorithm for [[supervised classification|supervised]] [[classification (machine learning)|classification]] of an input into one of two classes; a generalization to more than two classes is described in the Multiclass perceptron section below. It is a type of [[linear classifier]], i.e. a classification algorithm that makes its predictions based on a [[linear predictor function]] combining a set of weights with the [[feature vector]]. The algorithm allows for [[online algorithm|online learning]], in that it processes elements in the training set one at a time.
| |
| | |
| The perceptron algorithm was invented in 1957 at the [[Cornell Aeronautical Laboratory]] by [[Frank Rosenblatt]].<ref>Rosenblatt, Frank (1957), The Perceptron--a perceiving and recognizing automaton. Report 85-460-1, Cornell Aeronautical Laboratory.</ref>
| |
| | |
| == Definition ==
| |
| The perceptron is a binary classifier which maps its input <math>x</math> (a real-valued [[Vector space|vector]]) to an output value <math>f(x)</math> (a single [[Binary function|binary]] value):
| |
| | |
| :<math>
| |
| f(x) = \begin{cases}1 & \text{if }w \cdot x + b > 0\\0 & \text{otherwise}\end{cases}
| |
| </math>
| |
| | |
| where <math>w</math> is a vector of real-valued weights, <math>w \cdot x</math> is the [[dot product]] (which here computes a weighted sum), and <math>b</math> is the 'bias', a constant term that does not depend on any input value.
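The definition above can be sketched in a few lines of Python. The function name <code>perceptron_output</code> and the sample weights are illustrative assumptions, not part of the definition itself.

```python
def perceptron_output(w, x, b):
    """Return 1 if w . x + b > 0, otherwise 0 (the step rule defined above)."""
    weighted_sum = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return 1 if weighted_sum + b > 0 else 0


# Hypothetical weights and bias, chosen only for illustration.
w = [1.0, -2.0]
b = 0.5
positive = perceptron_output(w, [1.0, 0.0], b)  # w . x + b = 1.5
negative = perceptron_output(w, [0.0, 1.0], b)  # w . x + b = -1.5
print(positive, negative)
```

Note that an input lying exactly on the decision boundary (<math>w \cdot x + b = 0</math>) falls into the "otherwise" branch and is classified as 0.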
| |
| | |
| The value of <math>f(x)</math> (0 or 1) is used to classify <math>x</math> as either a positive or a negative instance, in the case of a binary classification problem. If <math>b</math> is negative, then the weighted combination of inputs must produce a positive value greater than <math>|b|</math> in order to push the classifier neuron over the 0 threshold. Spatially, the bias alters the position (though not the orientation) of the [[decision boundary]]. The perceptron learning algorithm does not terminate if the learning set is not [[linearly separable]]: in that case learning never reaches a point where all vectors are classified properly. The most famous example of the perceptron's inability to solve problems with linearly nonseparable vectors is the Boolean exclusive-or problem. The solution spaces of decision boundaries for all binary functions, and the corresponding learning behaviors, are studied by Liou et al.<ref>{{cite book | last=Liou | first=D.-R. | last2=Liou | first2=J.-W. | last3=Liou | first3=C.-Y. | title=Learning Behaviors of Perceptron | publisher=iConcept Press | isbn=978-1-477554-73-9 | year=2013 |url=http://www.iconceptpress.com}}</ref>
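A short derivation, sketched here from the definition of <math>f</math> with <math>w = (w_1, w_2)</math>, shows why exclusive-or is out of reach. A perceptron that outputs 0 on the inputs (0, 0) and (1, 1) and 1 on the inputs (0, 1) and (1, 0) would have to satisfy

:<math>b \le 0, \qquad w_1 + b > 0, \qquad w_2 + b > 0, \qquad w_1 + w_2 + b \le 0.</math>

Adding the two middle inequalities gives <math>w_1 + w_2 + 2b > 0</math>, hence <math>w_1 + w_2 + b > -b \ge 0</math>, which contradicts the last inequality. No choice of <math>w</math> and <math>b</math> can therefore satisfy all four conditions.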
| |
| | |
| In the context of [[artificial neural network]]s, a perceptron is an [[artificial neuron]] using the [[Heaviside step function]] as the activation function. The perceptron algorithm is also termed the '''single-layer perceptron''', to distinguish it from a [[multilayer perceptron]], which is a misnomer for a more complicated neural network. As a linear classifier, the single-layer perceptron is the simplest [[feedforward neural network]].
| |
| | |
| ==Learning algorithm==
| |
| Below is an example of a learning algorithm for a (single-layer) perceptron. For [[multilayer perceptron]]s, where a hidden layer exists, more sophisticated algorithms such as [[backpropagation]] must be used. Alternatively, methods such as the [[delta rule]] can be used if the activation function is non-linear and differentiable, although the algorithm below will work as well.
| |
| | |
| When multiple perceptrons are combined in an artificial neural network, each output neuron operates independently of all the others; thus, learning each output can be considered in isolation.
| |
| | |
| === Definitions ===
| |
| | |
| We first define some variables:
| |
| *<math>y = f(\mathbf{z}) \,</math> denotes the ''output'' from the perceptron for an input vector <math>\mathbf{z}</math>.
| |
| *<math>b \,</math> is the ''bias'' term, which in the example below we take to be 0.
| |
| *<math>D = \{(\mathbf{x}_1,d_1),\dots,(\mathbf{x}_s,d_s)\} \,</math> is the ''training set'' of <math>s</math> samples, where:
| |
| ** <math>\mathbf{x}_j</math> is the <math>n</math>-dimensional input vector.
| |
| ** <math>d_j \,</math> is the desired output value of the perceptron for that input.
| |
| We show the values of the nodes as follows:
| |
| *<math>x_{j,i} \,</math> is the value of the <math>i</math>th node of the <math>j</math>th training ''input vector''.
| |
| *<math>x_{j,0} = 1 \,</math>.
| |
| To represent the weights:
| |
| *<math>w_i \,</math> is the <math>i</math>th value in the ''weight vector'', to be multiplied by the value of the <math>i</math>th input node.
| |
| *Because <math>x_{j,0} = 1 \,</math>, the <math>w_0 \,</math> effectively replaces the bias term.
| |
| To show the time-dependence of <math>\mathbf{w}</math>, we use:
| |
| *<math>w_i(t) \,</math> is the weight <math>i</math> at time <math>t</math>.
| |
| *<math>\alpha \,</math> is the ''learning rate'', where <math>0 < \alpha \leq 1</math>.
| |
| Too high a learning rate makes the perceptron periodically oscillate around the solution unless [[Perceptron#Variants|additional steps]] are taken.
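The convention <math>x_{j,0} = 1</math> used above can be checked numerically. The following is a minimal sketch with made-up numbers; the function names are hypothetical.

```python
def weighted_sum_with_bias(w, x, b):
    """Compute w . x + b with an explicit bias term."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


def weighted_sum_augmented(w_aug, x):
    """Compute the same sum with the bias stored as w_0 and a constant input x_0 = 1."""
    x_aug = [1] + list(x)
    return sum(w_i * x_i for w_i, x_i in zip(w_aug, x_aug))


# Illustrative values only.
w, b, x = [2, -3], 4, [5, 6]
explicit = weighted_sum_with_bias(w, x, b)
augmented = weighted_sum_augmented([b] + w, x)
print(explicit == augmented)
```

Both forms give the same weighted sum, which is why the steps below can treat the bias as just another weight.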
| |
| | |
| [[Image:Perceptron.svg|right|thumb|150px|The appropriate weights are applied to the inputs, and the resulting weighted sum passed to a function that produces the output y.]]
| |
| | |
| === Steps===
| |
| 1. Initialise the weights and the threshold. Weights may be initialised to 0 or to a small random value. In the example below, we use 0.
| |
| | |
| 2. For each example <math>j \,</math> in our training set <math>D \,</math>, perform the following steps over the input <math>\mathbf{x}_j \,</math> and desired output <math>d_j \,</math>:
| |
| :2a. Calculate the actual output:
| |
| ::<math>y_j(t) = f[\mathbf{w}(t)\cdot\mathbf{x}_j] = f[w_0(t) + w_1(t)x_{j,1} + w_2(t)x_{j,2} + \dotsb + w_n(t)x_{j,n}]</math>
| |
| :2b. Update the weights:
| |
| ::<math>w_i(t+1) = w_i(t) + \alpha (d_j - y_j(t)) x_{j,i} \,</math>, for all nodes <math>0 \leq i \leq n</math>.
| |
| | |
| 3. For [[offline learning]], step 2 may be repeated until the iteration error <math>\frac{1}{s} \sum_{j=1}^s |d_j - y_j(t)| \,</math> is less than a user-specified error threshold <math>\gamma \,</math>, or a predetermined number of iterations have been completed.
| |
| | |
| The algorithm updates the weights after steps 2a and 2b. These weights are immediately applied to a pair in the training set, and subsequently updated, rather than waiting until all pairs in the training set have undergone these steps.
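Steps 1 to 3 could be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the function names, the default parameter values, and the choice to measure the error as the mean absolute error at the end of each pass are all assumptions, and every input vector is expected to already contain the constant <math>x_{j,0} = 1</math>.

```python
def predict(weights, x):
    """Step 2a: output 1 if the weighted sum is positive, otherwise 0."""
    return 1 if sum(w * v for w, v in zip(weights, x)) > 0 else 0


def train_perceptron(training_set, learning_rate=1.0,
                     error_threshold=1e-9, max_epochs=100):
    n = len(training_set[0][0])
    weights = [0.0] * n                          # step 1: initialise to 0
    for _ in range(max_epochs):                  # step 3: repeat step 2
        for x, d in training_set:                # step 2: one sample at a time
            y = predict(weights, x)              # step 2a
            for i in range(n):                   # step 2b
                weights[i] += learning_rate * (d - y) * x[i]
        # Mean absolute error of the current weights over the whole set.
        mean_error = sum(abs(d - predict(weights, x))
                         for x, d in training_set) / len(training_set)
        if mean_error < error_threshold:
            break
    return weights


# Logical AND, with x_0 = 1 prepended to each input (illustrative data).
and_set = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
weights = train_perceptron(and_set)
print(weights, [predict(weights, x) for x, _ in and_set])
```

Because the weights change inside the inner loop, each sample is evaluated with the weights left by the previous sample, as described above.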
| |
| | |
| ===Convergence===
| |
| The perceptron is a [[linear classifier]], therefore it will never get to the state with all the input vectors classified correctly if the training set <math>D</math> is not [[linearly separable]], i.e. if the positive examples can not be separated from the negative examples by a hyperplane.
| |
| | |
| But if the training set ''is'' linearly separable, then the perceptron is guaranteed to converge, and there is an upper bound on the number of times the perceptron will adjust its weights during the training.
| |
| | |
| Suppose that the input vectors from the two classes can be separated by a hyperplane with a margin <math>\gamma </math>, i.e. there exists a weight vector <math>\mathbf{w}, ||\mathbf{w}||=1</math>, and a bias term <math>b</math> such that <math>\mathbf{w}\cdot\mathbf{x}_j + b > \gamma </math> for all <math>j: d_j=1</math> and <math>\mathbf{w}\cdot\mathbf{x}_j + b < -\gamma </math> for all <math>j: d_j=0</math>. Let <math>R</math> denote the maximum norm of an input vector. Novikoff (1962) proved that in this case the perceptron algorithm converges after making <math>O(R^2/\gamma^2)</math> updates. The idea of the proof is that the weight vector is always adjusted by a bounded amount in a direction with which it has a non-positive [[dot product]], so its norm can be bounded above by <math>O(\sqrt{t})</math>, where ''t'' is the number of changes to the weight vector. But its norm can also be bounded below by a quantity that grows linearly in ''t'', because if there exists an (unknown) satisfactory weight vector, then every change makes progress in this (unknown) direction by a positive amount that depends only on the input vector. The two bounds are compatible only for a limited number of changes ''t''.
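The bound can be checked on a small example. The sketch below makes several assumptions: the bias is folded into the weight vector through a constant first input of 1, the learning rate is 1, the weights start at 0, and the separating vector <code>u</code> is picked by hand for the NAND data. Under those assumptions the number of updates should not exceed <math>R^2/\gamma^2</math> computed in the augmented space.

```python
import math


def count_updates(training_set, max_epochs=1000):
    """Train with learning rate 1 from zero weights and count the weight changes."""
    weights = [0] * len(training_set[0][0])
    updates = 0
    for _ in range(max_epochs):
        changed = False
        for x, d in training_set:
            y = 1 if sum(w * v for w, v in zip(weights, x)) > 0 else 0
            if y != d:
                weights = [w + (d - y) * v for w, v in zip(weights, x)]
                updates += 1
                changed = True
        if not changed:          # a full pass without mistakes: converged
            break
    return weights, updates


nand = [((1, 0, 0), 1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)]

# Hand-picked separating vector (an assumption for this illustration).
u = (1.5, -1.0, -1.0)
norm_u = math.sqrt(sum(c * c for c in u))
# Margin: smallest signed distance of any sample from the hyperplane defined by u.
gamma = min((1 if d == 1 else -1) * sum(a * v for a, v in zip(u, x)) / norm_u
            for x, d in nand)
R = max(math.sqrt(sum(v * v for v in x)) for x, _ in nand)
bound = (R / gamma) ** 2

weights, updates = count_updates(nand)
print(updates, "updates, bound", bound)
```

The bound is usually loose: it depends on the margin of the chosen <code>u</code>, and a vector with a larger margin would give a tighter figure.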
| |
| | |
| The decision boundary of a perceptron is invariant with respect to scaling of the weight vector; that is, a perceptron trained with initial weight vector <math>\mathbf{w}</math> and learning rate <math>\alpha \,</math> behaves identically to a perceptron trained with initial weight vector <math>\mathbf{w}/\alpha \,</math> and learning rate 1. Thus, since the initial weights become irrelevant with increasing number of iterations, the learning rate does not matter in the case of the perceptron and is usually just set to 1.
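The scaling argument can be illustrated with exact arithmetic. This sketch assumes zero initial weights and a threshold of 0, so that scaling the weights by <math>\alpha</math> never changes a prediction; <code>Fraction</code> is used only to avoid floating-point rounding, and the function name is hypothetical.

```python
from fractions import Fraction


def train(training_set, learning_rate, epochs=20):
    """Run a fixed number of passes of the perceptron rule from zero weights."""
    weights = [Fraction(0)] * len(training_set[0][0])
    for _ in range(epochs):
        for x, d in training_set:
            y = 1 if sum(w * v for w, v in zip(weights, x)) > 0 else 0
            weights = [w + learning_rate * (d - y) * v
                       for w, v in zip(weights, x)]
    return weights


nand = [((1, 0, 0), 1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)]
alpha = Fraction(1, 10)
w_unit = train(nand, Fraction(1))
w_small = train(nand, alpha)

# The two runs make the same sequence of predictions, so the weights
# differ only by the factor alpha.
same_direction = w_small == [alpha * w for w in w_unit]
print(same_direction)
```

With a non-zero threshold, as in the worked example further below, this equivalence no longer holds exactly, because the threshold is not rescaled together with the weights.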
| |
| | |
| == Variants ==
| |
| The pocket algorithm with ratchet (Gallant, 1990) solves the stability problem of perceptron learning by keeping the best solution seen so far "in its pocket". The pocket algorithm then returns the solution in the pocket, rather than the last solution. It can be used also for non-separable data sets, where the aim is to find a perceptron with a small number of misclassifications.
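A simplified sketch of the pocket idea follows. It is not Gallant's exact procedure: as an assumption for brevity, it cycles through the data in a fixed order and re-counts the training errors after every weight change, keeping a candidate only if it misclassifies strictly fewer samples (the ratchet). The data set is a made-up one-dimensional example with one mislabelled point.

```python
def predict(weights, x):
    return 1 if sum(w * v for w, v in zip(weights, x)) > 0 else 0


def count_errors(weights, data):
    return sum(1 for x, d in data if predict(weights, x) != d)


def pocket_train(data, epochs=20):
    weights = [0] * len(data[0][0])
    pocket, pocket_errors = list(weights), count_errors(weights, data)
    for _ in range(epochs):
        for x, d in data:
            y = predict(weights, x)
            if y != d:
                # Ordinary perceptron update (learning rate 1).
                weights = [w + (d - y) * v for w, v in zip(weights, x)]
                errors = count_errors(weights, data)
                if errors < pocket_errors:       # ratchet: keep only improvements
                    pocket, pocket_errors = list(weights), errors
    return pocket, pocket_errors


# Inputs are (1, v) with the constant 1 first; the last point is label noise,
# so no linear classifier gets all five samples right.
data = [((1, -2), 0), ((1, -1), 0), ((1, 1), 1), ((1, 2), 1), ((1, 3), 0)]
pocket, errors = pocket_train(data)
print(pocket, errors)
```

The plain perceptron keeps changing its weights on this data, while the pocket retains the best weights seen, which here misclassify a single sample.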
| |
| | |
| In separable problems, perceptron training can also aim at finding the largest separating margin between the classes. The so-called perceptron of optimal stability can be determined by means of iterative training and optimization schemes, such as the Min-Over algorithm (Krauth and Mezard, 1987)<ref>W. Krauth and M. Mezard. Learning algorithms with optimal stability in neural networks. J. of Physics A: Math. Gen. 20: L745-L752 (1987)</ref> or the AdaTron (Anlauf and Biehl, 1989).<ref>J.K. Anlauf and M. Biehl. The AdaTron: an Adaptive Perceptron algorithm. Europhysics Letters 10: 687-692 (1989)</ref> AdaTron uses the fact that the corresponding quadratic optimization problem is convex. The perceptron of optimal stability, together with the [[kernel trick]], are the conceptual foundations of the [[support vector machine]].
| |
| | |
| The <math>\alpha</math>-perceptron further used a pre-processing layer of fixed random weights, with thresholded output units. This enabled the perceptron to classify [[:wiktionary:analogue|analogue]] patterns, by projecting them into a [[Binary Space Partition|binary space]]. In fact, for a projection space of sufficiently high dimension, patterns can become linearly separable.
| |
| | |
| For example, consider the case of having to classify data into two classes. Here is a small such data set, consisting of points drawn from two [[Gaussian distribution]]s.
| |
| <gallery>
| |
| Image:Two_class_Gaussian_data.png|Two-class Gaussian data
| |
| Image:Linear_classifier_on_Gaussian_data.png|A linear classifier operating on the original space
| |
| Image:Hidden_space_linear_classifier_on_Gaussian_data.png|A linear classifier operating on a high-dimensional projection
| |
| </gallery>
| |
| | |
| A linear classifier can only separate points with a [[hyperplane]], so no linear classifier can classify all the points here perfectly. On the other hand, the data can be projected into a large number of dimensions. In our example, a [[random matrix]] was used to project the data linearly to a 1000-dimensional space; then each resulting data point was transformed through the [[hyperbolic function|hyperbolic tangent function]]. A linear classifier can then separate the data, as shown in the third figure. However, the data may still not be completely separable in this space, in which case the perceptron algorithm would not converge. In the example shown, [[stochastic gradient descent|stochastic steepest gradient descent]] was used to adapt the parameters.
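The projection step described above could be sketched as follows. The Gaussian entries of the random matrix, the fixed seed, and the function name <code>random_tanh_features</code> are assumptions; the text does not say how its matrix was drawn.

```python
import math
import random


def random_tanh_features(points, n_features, seed=0):
    """Project each point with a fixed random matrix, then apply tanh elementwise."""
    rng = random.Random(seed)
    dim = len(points[0])
    # One random row per output feature (Gaussian entries are an assumption).
    matrix = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
              for _ in range(n_features)]
    return [[math.tanh(sum(m * v for m, v in zip(row, point)))
             for row in matrix]
            for point in points]


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
features = random_tanh_features(points, n_features=1000)
print(len(features), len(features[0]))
```

The resulting 1000-dimensional vectors can then be passed to any linear classifier in place of the original two-dimensional points.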
| |
| | |
| Another way to solve nonlinear problems without using multiple layers is to use higher order networks ([[sigma-pi unit]]). In this type of network, each element in the input vector is extended with each pairwise combination of multiplied inputs (second order). This can be extended to an ''n''-order network.
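As an illustration of the second-order case, the sketch below appends the pairwise products of the inputs to each input vector and trains an ordinary perceptron on the result. The function names are hypothetical, and leaving out the squared terms is an assumption. With the extra feature <math>x_1 x_2</math>, exclusive-or becomes linearly separable.

```python
from itertools import combinations


def second_order(x):
    """Return [1, x_1, ..., x_n] followed by all products x_i * x_j with i < j."""
    return [1] + list(x) + [a * b for a, b in combinations(x, 2)]


def predict(weights, x):
    return 1 if sum(w * v for w, v in zip(weights, x)) > 0 else 0


def train(data, max_epochs=1000):
    weights = [0] * len(data[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, d in data:
            y = predict(weights, x)
            if y != d:
                weights = [w + (d - y) * v for w, v in zip(weights, x)]
                mistakes += 1
        if mistakes == 0:        # a full pass without errors
            break
    return weights


xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
expanded = [(second_order(x), d) for x, d in xor]
weights = train(expanded)
print([predict(weights, x) for x, _ in expanded])
```

The classifier is still linear in the expanded features; only the feature map is non-linear in the original inputs.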
| |
| | |
| It should be kept in mind, however, that the best classifier is not necessarily that which classifies all the training data perfectly. Indeed, if we had the prior constraint that the data come from equi-variant Gaussian distributions, the linear separation in the input space is optimal.
| |
| | |
| Other linear classification algorithms include [[Winnow (algorithm)|Winnow]], [[support vector machine]] and [[logistic regression]].
| |
| | |
| == Example ==
| |
| A perceptron learns to perform a binary [[Sheffer stroke|NAND]] function on inputs <math>x_1 \,</math> and <math>x_2 \,</math>.
| |
| | |
| Inputs: <math>x_0 \,</math>, <math>x_1 \,</math>, <math>x_2 \,</math>, with input <math>x_0 \,</math> held constant at 1.
| |
| | |
| Threshold (<math>t</math>): 0.5
| |
| | |
| Bias (<math>b</math>): 0
| |
| | |
| Learning rate (<math>r</math>): 0.1
| |
| | |
| Training set, consisting of four samples:
| |
| <math>\{((0, 0), 1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)\} \,</math>
| |
| | |
| In the following, the final weights of one iteration become the initial weights of the next. Each cycle over all the samples in the training set is demarcated with heavy lines.
| |
| | |
| {| class="wikitable" style="text-align: center;"
| |
| |-
| |
| | colspan="4" | Input
| |
| | rowspan="2" colspan="3" | Initial weights
| |
| | colspan="5" | Output
| |
| | rowspan="2" | Error
| |
| | rowspan="2" | Correction
| |
| | rowspan="2" colspan="3" | Final weights
| |
| |-
| |
| | colspan="3" | Sensor values
| |
| | Desired output
| |
| | colspan="3" | Per sensor
| |
| | Sum
| |
| | Network
| |
| |-
| |
| | <math>x_0</math> || <math>x_1</math> || <math>x_2</math> || <math>z</math> || <math>w_0</math> || <math>w_1</math> || <math>w_2</math> || <math>c_0</math> || <math>c_1</math> || <math>c_2</math> || <math>s</math> || <math>n</math> || <math>e</math> || <math>d</math> || <math>w_0</math> || <math>w_1</math> || <math>w_2</math>
| |
| |-
| |
| | || || || || || || || <math>x_0*w_0</math> || <math>x_1*w_1</math> || <math>x_2*w_2</math> || <math>c_0+c_1+c_2</math> || if <math>s>t</math> then 1, else 0 || <math>z-n</math> || <math>r * e</math> || <math>\Delta(x_0*d)</math> ||<math>\Delta(x_1*d)</math> ||<math>\Delta(x_2*d)</math>
| |
| |- style = "border-top: 4px solid darkgray"
| |
| | 1 || 0 || 0 || 1 || 0 || 0 || 0 || 0 || 0 || 0 || 0 || 0 || 1 || +0.1 || 0.1 || 0 || 0
| |
| |-
| |
| | 1 || 0 || 1 || 1 || 0.1 || 0 || 0 || 0.1 || 0 || 0 || 0.1 || 0 || 1 || +0.1 || 0.2 || 0 || 0.1
| |
| |-
| |
| | 1 || 1 || 0 || 1 || 0.2 || 0 || 0.1 || 0.2 || 0 || 0 || 0.2 || 0 || 1 || +0.1 || 0.3 || 0.1 || 0.1
| |
| |-
| |
| | 1 || 1 || 1 || 0 || 0.3 || 0.1 || 0.1 || 0.3 || 0.1 || 0.1 || 0.5 || 0 || 0 || 0 || 0.3 || 0.1 || 0.1
| |
| |- style = "border-top: 4px solid darkgray"
| |
| | 1 || 0 || 0 || 1 || 0.3 || 0.1 || 0.1 || 0.3 || 0 || 0 || 0.3 || 0 || 1 || +0.1 || 0.4 || 0.1 || 0.1
| |
| |-
| |
| | 1 || 0 || 1 || 1 || 0.4 || 0.1 || 0.1 || 0.4 || 0 || 0.1 || 0.5 || 0 || 1 || +0.1 || 0.5 || 0.1 || 0.2
| |
| |-
| |
| | 1 || 1 || 0 || 1 || 0.5 || 0.1 || 0.2 || 0.5 || 0.1 || 0 || 0.6 || 1 || 0 || 0 || 0.5 || 0.1 || 0.2
| |
| |-
| |
| | 1 || 1 || 1 || 0 || 0.5 || 0.1 || 0.2 || 0.5 || 0.1 || 0.2 || 0.8 || 1 || -1 || -0.1 || 0.4 || 0 || 0.1
| |
| |- style = "border-top: 4px solid darkgray"
| |
| | 1 || 0 || 0 || 1 || 0.4 || 0 || 0.1 || 0.4 || 0 || 0 || 0.4 || 0 || 1 || +0.1 || 0.5 || 0 || 0.1
| |
| |-
| |
| | 1 || 0 || 1 || 1 || 0.5 || 0 || 0.1 || 0.5 || 0 || 0.1 || 0.6 || 1 || 0 || 0 || 0.5 || 0 || 0.1
| |
| |-
| |
| | 1 || 1 || 0 || 1 || 0.5 || 0 || 0.1 || 0.5 || 0 || 0 || 0.5 || 0 || 1 || +0.1 || 0.6 || 0.1 || 0.1
| |
| |-
| |
| | 1 || 1 || 1 || 0 || 0.6 || 0.1 || 0.1 || 0.6 || 0.1 || 0.1 || 0.8 || 1 || -1 || -0.1 || 0.5 || 0 || 0
| |
| |- style = "border-top: 4px solid darkgray"
| |
| | 1 || 0 || 0 || 1 || 0.5 || 0 || 0 || 0.5 || 0 || 0 || 0.5 || 0 || 1 || +0.1 || 0.6 || 0 || 0
| |
| |-
| |
| | 1 || 0 || 1 || 1 || 0.6 || 0 || 0 || 0.6 || 0 || 0 || 0.6 || 1 || 0 || 0 || 0.6 || 0 || 0
| |
| |-
| |
| | 1 || 1 || 0 || 1 || 0.6 || 0 || 0 || 0.6 || 0 || 0 || 0.6 || 1 || 0 || 0 || 0.6 || 0 || 0
| |
| |-
| |
| | 1 || 1 || 1 || 0 || 0.6 || 0 || 0 || 0.6 || 0 || 0 || 0.6 || 1 || -1 || -0.1 || 0.5 || -0.1 || -0.1
| |
| |- style = "border-top: 4px solid darkgray"
| |
| | 1 || 0 || 0 || 1 || 0.5 || -0.1 || -0.1 || 0.5 || 0 || 0 || 0.5 || 0 || 1 || +0.1 || 0.6 || -0.1 || -0.1
| |
| |-
| |
| | 1 || 0 || 1 || 1 || 0.6 || -0.1 || -0.1 || 0.6 || 0 || -0.1 || 0.5 || 0 || 1 || +0.1 || 0.7 || -0.1 || 0
| |
| |-
| |
| | 1 || 1 || 0 || 1 || 0.7 || -0.1 || 0 || 0.7 || -0.1 || 0 || 0.6 || 1 || 0 || 0 || 0.7 || -0.1 || 0
| |
| |-
| |
| | 1 || 1 || 1 || 0 || 0.7 || -0.1 || 0 || 0.7 || -0.1 || 0 || 0.6 || 1 || -1 || -0.1 || 0.6 || -0.2 || -0.1
| |
| |- style = "border-top: 4px solid darkgray"
| |
| | 1 || 0 || 0 || 1 || 0.6 || -0.2 || -0.1 || 0.6 || 0 || 0 || 0.6 || 1 || 0 || 0 || 0.6 || -0.2 || -0.1
| |
| |-
| |
| | 1 || 0 || 1 || 1 || 0.6 || -0.2 || -0.1 || 0.6 || 0 || -0.1 || 0.5 || 0 || 1 || +0.1 || 0.7 || -0.2 || 0
| |
| |-
| |
| | 1 || 1 || 0 || 1 || 0.7 || -0.2 || 0 || 0.7 || -0.2 || 0 || 0.5 || 0 || 1 || +0.1 || 0.8 || -0.1 || 0
| |
| |-
| |
| | 1 || 1 || 1 || 0 || 0.8 || -0.1 || 0 || 0.8 || -0.1 || 0 || 0.7 || 1 || -1 || -0.1 || 0.7 || -0.2 || -0.1
| |
| |- style = "border-top: 4px solid darkgray"
| |
| | 1 || 0 || 0 || 1 || 0.7 || -0.2 || -0.1 || 0.7 || 0 || 0 || 0.7 || 1 || 0 || 0 || 0.7 || -0.2 || -0.1
| |
| |-
| |
| | 1 || 0 || 1 || 1 || 0.7 || -0.2 || -0.1 || 0.7 || 0 || -0.1 || 0.6 || 1 || 0 || 0 || 0.7 || -0.2 || -0.1
| |
| |-
| |
| | 1 || 1 || 0 || 1 || 0.7 || -0.2 || -0.1 || 0.7 || -0.2 || 0 || 0.5 || 0 || 1 || +0.1 || 0.8 || -0.1 || -0.1
| |
| |-
| |
| | 1 || 1 || 1 || 0 || 0.8 || -0.1 || -0.1 || 0.8 || -0.1 || -0.1 || 0.6 || 1 || -1 || -0.1 || 0.7 || -0.2 || -0.2
| |
| |- style = "border-top: 4px solid darkgray"
| |
| | 1 || 0 || 0 || 1 || 0.7 || -0.2 || -0.2 || 0.7 || 0 || 0 || 0.7 || 1 || 0 || 0 || 0.7 || -0.2 || -0.2
| |
| |-
| |
| | 1 || 0 || 1 || 1 || 0.7 || -0.2 || -0.2 || 0.7 || 0 || -0.2 || 0.5 || 0 || 1 || +0.1 || 0.8 || -0.2 || -0.1
| |
| |-
| |
| | '''1''' || '''1''' || '''0''' || '''1''' || '''0.8''' || '''-0.2''' || '''-0.1''' || '''0.8''' || '''-0.2''' || '''0''' || '''0.6''' || '''1''' || '''0''' || '''0''' || '''0.8''' || '''-0.2''' || '''-0.1'''
| |
| |-
| |
| | '''1''' || '''1''' || '''1''' || '''0''' || '''0.8''' || '''-0.2''' || '''-0.1''' || '''0.8''' || '''-0.2''' || '''-0.1''' || '''0.5''' || '''0''' || '''0''' || '''0''' || '''0.8''' || '''-0.2''' || '''-0.1'''
| |
| |- style = "border-top: 4px solid darkgray"
| |
| | '''1''' || '''0''' || '''0''' || '''1''' || '''0.8''' || '''-0.2''' || '''-0.1''' || '''0.8''' || '''0''' || '''0''' || '''0.8''' || '''1''' || '''0''' || '''0''' || '''0.8''' || '''-0.2''' || '''-0.1'''
| |
| |-
| |
| | '''1''' || '''0''' || '''1''' || '''1''' || '''0.8''' || '''-0.2''' || '''-0.1''' || '''0.8''' || '''0''' || '''-0.1''' || '''0.7''' || '''1''' || '''0''' || '''0''' || '''0.8''' || '''-0.2''' || '''-0.1'''
| |
| |}
| |
| | |
| This example can be implemented in the following [[Python (programming language)|Python]] code.
| |
| | |
<source lang="python">
threshold = 0.5
learning_rate = 0.1
weights = [0, 0, 0]
# Each sample is ((x0, x1, x2), desired output); x0 is held constant at 1.
training_set = [((1, 0, 0), 1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)]


def dot_product(values, weights):
    return sum(value * weight for value, weight in zip(values, weights))


# Cycle over the training set until a full pass produces no errors.
while True:
    print('-' * 60)
    error_count = 0
    for input_vector, desired_output in training_set:
        print(weights)
        # The comparison yields a bool, which Python treats as 0 or 1.
        result = dot_product(input_vector, weights) > threshold
        error = desired_output - result
        if error != 0:
            error_count += 1
            # Adjust each weight in proportion to its input value.
            for index, value in enumerate(input_vector):
                weights[index] += learning_rate * error * value
    if error_count == 0:
        break
</source>
| |
| | |
| == Multiclass perceptron ==
| |
| Like most other techniques for training linear classifiers, the perceptron generalizes naturally to [[multiclass classification]]. Here, the input <math>x</math> and the output <math>y</math> are drawn from arbitrary sets. A feature representation function <math>f(x,y)</math> maps each possible input/output pair to a finite-dimensional real-valued feature vector. As before, the feature vector is multiplied by a weight vector <math>w</math>, but now the resulting score is used to choose among many possible outputs:
| |
| | |
| :<math>\hat y = \mathrm{argmax}_y f(x,y) \cdot w.</math>
| |
| | |
| Learning again iterates over the examples, predicting an output for each, leaving the weights unchanged when the predicted output matches the target, and changing them when it does not. The update becomes:
| |
| | |
| :<math> w_{t+1} = w_t + f(x, y) - f(x,\hat y).</math>
| |
| | |
| This multiclass formulation reduces to the original perceptron when <math>x</math> is a real-valued vector, <math>y</math> is chosen from <math>\{0,1\}</math>, and <math>f(x,y) = y x</math>.
| |
| | |
| For certain problems, input/output representations and features can be chosen so that <math>\mathrm{argmax}_y f(x,y) \cdot w</math> can be found efficiently even though <math>y</math> is chosen from a very large or even infinite set.
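A minimal sketch of the multiclass update is given below. It assumes one particular feature function: <math>f(x,y)</math> copies <math>x</math> into the block of the feature vector reserved for class <math>y</math>, which is equivalent to keeping one weight vector per class. The function names and the toy data are hypothetical.

```python
def predict(weights, x):
    """Return the class whose weight vector gives the highest score for x."""
    scores = [sum(w * v for w, v in zip(class_w, x)) for class_w in weights]
    # max() returns the first maximal element, so ties go to the lowest class index.
    return max(range(len(scores)), key=lambda y: scores[y])


def train_multiclass(data, n_classes, max_epochs=100):
    dim = len(data[0][0])
    weights = [[0] * dim for _ in range(n_classes)]
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            y_hat = predict(weights, x)
            if y_hat != y:
                mistakes += 1
                # w <- w + f(x, y) - f(x, y_hat), written per class block.
                weights[y] = [w + v for w, v in zip(weights[y], x)]
                weights[y_hat] = [w - v for w, v in zip(weights[y_hat], x)]
        if mistakes == 0:
            break
    return weights


# Toy data: three classes, one example each.
data = [((1, 0, 0), 0), ((0, 1, 0), 1), ((0, 0, 1), 2)]
weights = train_multiclass(data, n_classes=3)
print([predict(weights, x) for x, _ in data])
```

For structured outputs the loop over classes in <code>predict</code> would be replaced by a problem-specific search for the best-scoring output.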
| |
| | |
| Since the early 2000s, perceptron training has become popular in the field of [[natural language processing]] for such tasks as [[part-of-speech tagging]] and [[syntactic parsing]] (Collins, 2002).
| |
| | |
| == History ==
| |
| :''See also: [[History of artificial intelligence#Perceptrons and the dark age of connectionism|History of artificial intelligence]], [[AI winter#The abandonment of connectionism in 1969|AI winter]] and [[Frank Rosenblatt]]''
| |
| Although the perceptron initially seemed promising, it was eventually proved that perceptrons could not be trained to recognise many classes of patterns. This led to the field of neural network research stagnating for many years, before it was recognised that a feedforward neural network with two or more layers (also called a [[Feedforward_neural_network#Multi-layer_perceptron|multilayer perceptron]]) had far greater processing power than perceptrons with one layer (also called a [[Feedforward_neural_network#Single-layer_perceptron|single layer perceptron]]).
| |
| Single layer perceptrons are only capable of learning [[linearly separable]] patterns; in 1969 a famous book entitled ''[[Perceptrons (book)|Perceptrons]]'' by [[Marvin Minsky]] and [[Seymour Papert]] showed that it was impossible for these classes of network to learn an [[XOR]] function. It is often believed that they also conjectured (incorrectly) that a similar result would hold for a multi-layer perceptron network. However, this is not true, as both Minsky and Papert already knew that multi-layer perceptrons were capable of producing an XOR function. (See the page on ''[[Perceptrons (book)|Perceptrons]]'' for more information.) Three years later [[Stephen Grossberg]] published a series of papers introducing networks capable of modelling differential, contrast-enhancing and XOR functions. (The papers were published in 1972 and 1973, see e.g.: Grossberg, Contour enhancement, short-term memory, and constancies in reverberating neural networks. Studies in Applied Mathematics, 52 (1973), 213-257, online [http://cns.bu.edu/Profiles/Grossberg/Gro1973StudiesAppliedMath.pdf]). Nevertheless, the often-miscited Minsky/Papert text caused a significant decline in interest and funding of neural network research. It took ten more years until [[neural network]] research experienced a resurgence in the 1980s. The Minsky/Papert text was reprinted in 1987 as "Perceptrons - Expanded Edition", in which some errors in the original text are shown and corrected.
| |
| | |
| The kernel perceptron algorithm was introduced as early as 1964 by Aizerman et al.<ref>M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964</ref> Margin-bound guarantees for the perceptron algorithm in the general non-separable case were first given by [[Yoav Freund|Freund]] and [[Robert Schapire|Schapire]] (1998),<ref>Freund, Y. and Schapire, R. E. 1998. Large margin classification using the perceptron algorithm. In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT' 98). ACM Press.</ref> and more recently by [[Mehryar Mohri|Mohri]] and Rostamizadeh (2013), who extend previous results and give new L1 bounds.<ref>Mohri, Mehryar and Rostamizadeh, Afshin (2013). [http://arxiv.org/pdf/1305.0208.pdf Perceptron Mistake Bounds] arXiv:1305.0208, 2013.</ref>
| |
| | |
| ==References==
| |
| {{Reflist}}
| |
| * Aizerman, M. A., Braverman, E. M. and Rozonoer, L. I. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837.
| |
| * Rosenblatt, Frank (1958), The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, Cornell Aeronautical Laboratory, Psychological Review, v65, No. 6, pp. 386–408. {{doi|10.1037/h0042519}}.
| |
| * Rosenblatt, Frank (1962), Principles of Neurodynamics. Washington, DC:Spartan Books.
| |
| * Minsky M. L. and Papert S. A. 1969. ''Perceptrons''. Cambridge, MA: MIT Press.
| |
| * Freund, Y. and Schapire, R. E. 1998. Large margin classification using the perceptron algorithm. In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT' 98). ACM Press.
| |
| * Freund, Y. and Schapire, R. E. 1999. [http://www.cs.ucsd.edu/~yfreund/papers/LargeMarginsUsingPerceptron.pdf Large margin classification using the perceptron algorithm.] In Machine Learning 37(3):277-296, 1999.
| |
| * Gallant, S. I. (1990). [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=80230 Perceptron-based learning algorithms.] IEEE Transactions on Neural Networks, vol. 1, no. 2, pp. 179–191.
| |
| * Mohri, Mehryar and Rostamizadeh, Afshin (2013). [http://arxiv.org/pdf/1305.0208.pdf Perceptron Mistake Bounds] arXiv:1305.0208, 2013.
| |
| * Novikoff, A. B. (1962). On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata, 12, 615-622. Polytechnic Institute of Brooklyn.
| |
| * [[Bernard Widrow|Widrow, B.]], Lehr, M.A., "30 years of Adaptive Neural Networks: Perceptron, Madaline, and Backpropagation," ''Proc. IEEE'', vol 78, no 9, pp. 1415–1442, (1990).
| |
| * [[Michael Collins (computational linguist)|Collins, M.]] 2002. Discriminative training methods for hidden Markov models: Theory and experiments with the perceptron algorithm in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '02).
| |
| * Yin, Hongfeng (1996), Perceptron-Based Algorithms and Analysis, Spectrum Library, Concordia University, Canada
| |
| | |
| == External links ==
| |
| * [http://www.mathworks.com/matlabcentral/fileexchange/32949-a-perceptron-learns-to-perform-a-binary-nand-function/content/PerceptronImpl.m A Perceptron implemented in MATLAB to learn binary NAND function]
| |
| * Chapter 3 [http://page.mi.fu-berlin.de/rojas/neural/chapter/K3.pdf Weighted networks - the perceptron] and chapter 4 [http://page.mi.fu-berlin.de/rojas/neural/chapter/K4.pdf Perceptron learning] of [http://page.mi.fu-berlin.de/rojas/neural/index.html.html ''Neural Networks - A Systematic Introduction''] by [[Raúl Rojas]] (ISBN 978-3-540-60505-8)
| |
| * [http://www-cse.ucsd.edu/users/elkan/250B/perceptron.pdf Explanation of the update rule] by Charles Elkan
| |
| * [http://www.csulb.edu/~cwallis/artificialn/History.htm History of perceptrons]
| |
| * [http://www.cis.hut.fi/ahonkela/dippa/node41.html Mathematics of perceptrons]
| |
| * [http://library.thinkquest.org/18242/perceptron.shtml Perceptron demo applet and an introduction by examples]
| |
| | |
| [[Category:Classification algorithms]]
| |
| [[Category:Neural networks]]
| |
| [[Category:Articles with example Python code]]
| |
| | |
| {{Link FA|ru}}
| |