{{seealso|Unsupervised learning}}
{{Machine learning bar}}
{{more footnotes|date=January 2013}}
'''Supervised learning''' is the [[machine learning]] task of inferring a function from labeled training data.<ref>[[Mehryar Mohri]], Afshin Rostamizadeh, Ameet Talwalkar (2012) ''Foundations of Machine Learning'', The MIT Press ISBN 9780262018258.</ref> The [[training set|training data]] consist of a set of ''training examples''. In supervised learning, each example is a ''pair'' consisting of an input object (typically a vector) and a desired output value (also called the ''supervisory signal''). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way (see [[inductive bias]]).

The parallel task in human and animal psychology is often referred to as [[concept learning]].
==Overview==
In order to solve a given problem of supervised learning, one has to perform the following steps (a minimal code sketch of this workflow follows the list):
# Determine the type of training examples. Before doing anything else, the user should decide what kind of data is to be used as a training set. In the case of handwriting analysis, for example, this might be a single handwritten character, an entire handwritten word, or an entire line of handwriting.
# Gather a training set. The training set needs to be representative of the real-world use of the function. Thus, a set of input objects is gathered and corresponding outputs are also gathered, either from human experts or from measurements.
# Determine the input feature representation of the learned function. The accuracy of the learned function depends strongly on how the input object is represented. Typically, the input object is transformed into a feature vector, which contains a number of features that are descriptive of the object. The number of features should not be too large, because of the [[curse of dimensionality]]; but the representation should contain enough information to accurately predict the output.
# Determine the structure of the learned function and corresponding learning algorithm. For example, the engineer may choose to use [[support vector machine]]s or [[Decision tree learning|decision tree]]s.
# Complete the design. Run the learning algorithm on the gathered training set. Some supervised learning algorithms require the user to determine certain control parameters. These parameters may be adjusted by optimizing performance on a subset (called a ''validation'' set) of the training set, or via [[cross-validation (statistics)|cross-validation]].
# Evaluate the accuracy of the learned function. After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.
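The workflow can be illustrated with a short sketch (not part of the standard description; it assumes the Python library scikit-learn is available, and the dataset and algorithm are arbitrary choices):

<syntaxhighlight lang="python">
# Illustrative sketch of the supervised learning workflow described above.
# Assumes scikit-learn is installed; the dataset and classifier are arbitrary
# example choices, not prescribed by the article.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Steps 1-3: gather labeled examples and represent each input as a feature vector.
X, y = load_digits(return_X_y=True)          # 8x8 handwritten digits, flattened to 64 features

# Hold out a test set that is never used for training or tuning (needed for step 6).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 4: choose the structure of the learned function and the learning algorithm.
model = DecisionTreeClassifier(max_depth=10, random_state=0)  # max_depth is a control parameter

# Step 5: run the learning algorithm on the gathered training set.
model.fit(X_train, y_train)

# Step 6: evaluate the accuracy of the learned function on the separate test set.
print("test accuracy:", model.score(X_test, y_test))
</syntaxhighlight>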
A wide range of supervised learning algorithms is available, each with its strengths and weaknesses. There is no single learning algorithm that works best on all supervised learning problems (see the [[No free lunch in search and optimization|No free lunch theorem]]).

There are four major issues to consider in supervised learning:
===Bias-variance tradeoff===
{{Main|Bias-variance dilemma}}
A first issue is the tradeoff between ''bias'' and ''variance''.<ref>S. Geman, E. Bienenstock, and R. Doursat (1992). Neural networks and the bias/variance dilemma. Neural Computation 4, 1–58.</ref> Imagine that we have available several different, but equally good, training data sets. A learning algorithm is biased for a particular input <math>x</math> if, when trained on each of these data sets, it is systematically incorrect when predicting the correct output for <math>x</math>. A learning algorithm has high variance for a particular input <math>x</math> if it predicts different output values when trained on different training sets. The prediction error of a learned classifier is related to the sum of the bias and the variance of the learning algorithm.<ref>G. James (2003) Variance and Bias for General Loss Functions, Machine Learning 51, 115-135. (http://www-bcf.usc.edu/~gareth/research/bv.pdf)</ref> Generally, there is a tradeoff between bias and variance. A learning algorithm with low bias must be "flexible" so that it can fit the data well. But if the learning algorithm is too flexible, it will fit each training data set differently, and hence have high variance. A key aspect of many supervised learning methods is that they are able to adjust this tradeoff between bias and variance (either automatically or by providing a bias/variance parameter that the user can adjust).
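The tradeoff can be illustrated empirically (an added sketch with arbitrary choices of target function, noise level, and model classes): train the same algorithm on many independently drawn training sets and examine its predictions at one fixed input.

<syntaxhighlight lang="python">
# Illustrative sketch: estimate bias and variance at a fixed input x0 by
# training the same model class on many independently drawn training sets.
# The target function, noise level, and model classes are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
true_f = np.sin                      # the "true" function
x0 = 1.0                             # the input at which bias/variance are examined

def sample_training_set(n=20):
    x = rng.uniform(0, np.pi, n)
    y = true_f(x) + rng.normal(scale=0.3, size=n)   # noisy supervisory targets
    return x, y

def predictions_at_x0(degree, trials=500):
    """Fit a polynomial of the given degree on each training set, predict at x0."""
    preds = []
    for _ in range(trials):
        x, y = sample_training_set()
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x0))
    return np.array(preds)

for degree in (1, 9):                # degree 1: rigid (high bias); degree 9: flexible (high variance)
    p = predictions_at_x0(degree)
    bias = p.mean() - true_f(x0)     # systematic error averaged over training sets
    variance = p.var()               # spread of predictions across training sets
    print(f"degree {degree}: bias {bias:+.3f}, variance {variance:.3f}")
</syntaxhighlight>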
===Function complexity and amount of training data===
The second issue is the amount of training data available relative to the complexity of the "true" function (classifier or regression function). If the true function is simple, then an "inflexible" learning algorithm with high bias and low variance will be able to learn it from a small amount of data. But if the true function is highly complex (e.g., because it involves complex interactions among many different input features and behaves differently in different parts of the input space), then the function will only be learnable from a very large amount of training data and using a "flexible" learning algorithm with low bias and high variance. Good learning algorithms therefore automatically adjust the bias/variance tradeoff based on the amount of data available and the apparent complexity of the function to be learned.
===Dimensionality of the input space===
A third issue is the dimensionality of the input space. If the input feature vectors have very high dimension, the learning problem can be difficult even if the true function only depends on a small number of those features. This is because the many "extra" dimensions can confuse the learning algorithm and cause it to have high variance. Hence, high input dimensionality typically requires tuning the classifier to have low variance and high bias. In practice, if the engineer can manually remove irrelevant features from the input data, this is likely to improve the accuracy of the learned function. In addition, there are many algorithms for [[feature selection]] that seek to identify the relevant features and discard the irrelevant ones. This is an instance of the more general strategy of [[dimensionality reduction]], which seeks to map the input data into a lower dimensional space prior to running the supervised learning algorithm.
===Noise in the output values===
A fourth issue is the degree of noise in the desired output values (the supervisory targets). If the desired output values are often incorrect (because of human error or sensor errors), then the learning algorithm should not attempt to find a function that exactly matches the training examples. Attempting to fit the data too carefully leads to [[overfitting]]. Overfitting can occur even when there are no measurement errors (stochastic noise) if the function to be learned is too complex for the learning model. In such a situation, the part of the target function that cannot be modeled "corrupts" the training data; this phenomenon has been called [[deterministic noise]]. When either type of noise is present, it is better to go with a higher bias, lower variance estimator.

In practice, there are several approaches to alleviate noise in the output values, such as [[early stopping]] to prevent [[overfitting]], as well as [[anomaly detection|detecting]] and removing noisy training examples prior to training the supervised learning algorithm. Several algorithms identify noisy training examples, and removing the suspected noisy examples prior to training has been shown to decrease [[generalization error]] with [[statistical significance]].<ref>C.E. Brodley and M.A. Friedl (1999). Identifying and Eliminating Mislabeled Training Instances, Journal of Artificial Intelligence Research 11, 131-167. (http://jair.org/media/606/live-606-1803-jair.pdf)</ref><ref>{{cite conference |author=M.R. Smith and T. Martinez |title=Improving Classification Accuracy by Identifying and Removing Instances that Should Be Misclassified |booktitle=Proceedings of International Joint Conference on Neural Networks (IJCNN 2011) |pages=2690–2697 |year=2011 |url=http://dx.doi.org/10.1109/IJCNN.2011.6033571 }}</ref>
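A simple filtering heuristic in this spirit (an illustrative sketch only, not the exact procedure of the cited references) flags training examples whose label disagrees with a cross-validated prediction and retrains without them:

<syntaxhighlight lang="python">
# Illustrative sketch: flag training examples whose label disagrees with a
# cross-validated prediction, then retrain without them. This is one simple
# filtering heuristic, not the exact method of the cited references.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)

base = LogisticRegression(max_iter=1000)
predicted = cross_val_predict(base, X, y, cv=5)   # each example predicted by a model not trained on it

keep = predicted == y                             # keep examples whose labels look consistent
print("examples flagged as suspect:", int((~keep).sum()))

cleaned_model = LogisticRegression(max_iter=1000).fit(X[keep], y[keep])
</syntaxhighlight>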
===Other factors to consider===
Other factors to consider when choosing and applying a learning algorithm include the following:
# Heterogeneity of the data. If the feature vectors include features of many different kinds (discrete, discrete ordered, counts, continuous values), some algorithms are easier to apply than others. Many algorithms, including [[Support Vector Machines]], [[linear regression]], [[logistic regression]], [[Artificial neural network|neural networks]], and [[k-nearest neighbor algorithm|nearest neighbor methods]], require that the input features be numerical and scaled to similar ranges (e.g., to the [-1,1] interval). Methods that employ a distance function, such as [[k-nearest neighbor algorithm|nearest neighbor methods]] and [[Support Vector Machines|support vector machines with Gaussian kernels]], are particularly sensitive to this. An advantage of [[Decision tree learning|decision trees]] is that they easily handle heterogeneous data.
# Redundancy in the data. If the input features contain redundant information (e.g., highly correlated features), some learning algorithms (e.g., [[linear regression]], [[logistic regression]], and [[k-nearest neighbor algorithm|distance based methods]]) will perform poorly because of numerical instabilities. These problems can often be solved by imposing some form of [[Regularization (mathematics)|regularization]].
# Presence of interactions and non-linearities. If each of the features makes an independent contribution to the output, then algorithms based on linear functions (e.g., [[linear regression]], [[logistic regression]], [[Support Vector Machines]], [[Naive Bayes classifier|naive Bayes]]) and distance functions (e.g., [[k-nearest neighbor algorithm|nearest neighbor methods]], [[Support Vector Machines|support vector machines with Gaussian kernels]]) generally perform well. However, if there are complex interactions among features, then algorithms such as [[Decision tree learning|decision trees]] and [[Artificial neural network|neural networks]] work better, because they are specifically designed to discover these interactions. Linear methods can also be applied, but the engineer must manually specify the interactions when using them.

When considering a new application, the engineer can compare multiple learning algorithms and experimentally determine which one works best on the problem at hand (see [[Cross-validation (statistics)|cross validation]]). Tuning the performance of a learning algorithm can be very time-consuming. Given fixed resources, it is often better to spend more time collecting additional training data and more informative features than it is to spend extra time tuning the learning algorithms.
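Such a comparison might look like the following sketch (assuming scikit-learn; the dataset and the candidate algorithms are arbitrary examples):

<syntaxhighlight lang="python">
# Illustrative sketch: compare several learning algorithms on one problem by
# cross-validated accuracy. The dataset and candidate models are arbitrary.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM (Gaussian kernel)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validated accuracy
    print(f"{name:22s} {scores.mean():.3f} +/- {scores.std():.3f}")
</syntaxhighlight>

Note that the distance-based and linear models are wrapped with feature scaling, reflecting the sensitivity to feature ranges mentioned in the first point above.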
The most widely used learning algorithms are [[Support Vector Machines]], [[linear regression]], [[logistic regression]], [[Naive Bayes classifier|naive Bayes]], [[linear discriminant analysis]], [[Decision tree learning|decision trees]], [[k-nearest neighbor algorithm]], and [[Artificial neural network|neural networks]] ([[multilayer perceptron]]).
==How supervised learning algorithms work==
Given a set of <math>N</math> training examples of the form <math>\{(x_1, y_1), ..., (x_N,\; y_N)\}</math> such that <math>x_i</math> is the [[feature vector]] of the ''i''-th example and <math>y_i</math> is its label (i.e., class), a learning algorithm seeks a function <math>g: X \to Y</math>, where <math>X</math> is the input space and <math>Y</math> is the output space. The function <math>g</math> is an element of some space of possible functions <math>G</math>, usually called the ''hypothesis space''. It is sometimes convenient to represent <math>g</math> using a scoring function <math>f: X \times Y \to \Bbb{R}</math> such that <math>g</math> is defined as returning the <math>y</math> value that gives the highest score: <math>g(x) = \arg \max_y \; f(x,y)</math>. Let <math>F</math> denote the space of scoring functions.

Although <math>G</math> and <math>F</math> can be any space of functions, many learning algorithms are probabilistic models where <math>g</math> takes the form of a conditional probability model <math>g(x) = P(y|x)</math>, or <math>f</math> takes the form of a joint probability model <math>f(x,y) = P(x,y)</math>. For example, [[Naive Bayes classifier|naive Bayes]] and [[linear discriminant analysis]] are joint probability models, whereas [[logistic regression]] is a conditional probability model.
There are two basic approaches to choosing <math>f</math> or <math>g</math>: [[empirical risk minimization]] and [[structural risk minimization]].<ref>Vapnik, V. N. The Nature of Statistical Learning Theory (2nd Ed.), Springer Verlag, 2000.</ref> Empirical risk minimization seeks the function that best fits the training data. Structural risk minimization includes a ''penalty function'' that controls the bias/variance tradeoff.
In both cases, it is assumed that the training set consists of a sample of [[independent and identically-distributed random variables|independent and identically distributed pairs]], <math>(x_i, \;y_i)</math>. In order to measure how well a function fits the training data, a [[loss function]] <math>L: Y \times Y \to \Bbb{R}^{\ge 0}</math> is defined. For training example <math>(x_i,\;y_i)</math>, the loss of predicting the value <math>\hat{y}</math> is <math>L(y_i,\hat{y})</math>.

The ''risk'' <math>R(g)</math> of function <math>g</math> is defined as the expected loss of <math>g</math>. This can be estimated from the training data as
:<math>R_{emp}(g) = \frac{1}{N} \sum_i L(y_i, g(x_i))</math>.
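Written as code, the empirical risk is simply the average loss over the training set (an illustrative sketch; the squared-error loss and the tiny dataset are arbitrary choices):

<syntaxhighlight lang="python">
# Illustrative sketch: the empirical risk R_emp(g) is the average loss of g
# over the N training examples. Squared error is used here as an example loss.
import numpy as np

def empirical_risk(g, xs, ys, loss=lambda y, y_hat: (y - y_hat) ** 2):
    """Average loss of the function g over training pairs (x_i, y_i)."""
    return np.mean([loss(y, g(x)) for x, y in zip(xs, ys)])

# Example: evaluate a candidate hypothesis on a tiny made-up training set.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.1, 1.9, 4.2, 5.8])
g = lambda x: 2.0 * x                       # a candidate hypothesis
print("R_emp(g) =", empirical_risk(g, xs, ys))
</syntaxhighlight>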
===Empirical risk minimization===
{{main|Empirical risk minimization}}
In empirical risk minimization, the supervised learning algorithm seeks the function <math>g</math> that minimizes the empirical risk <math>R_{emp}(g)</math>. Hence, a supervised learning algorithm can be constructed by applying an [[Optimization (mathematics)|optimization algorithm]] to find <math>g</math>.
When <math>g</math> is a conditional probability distribution <math>P(y|x)</math> and the loss function is the negative log likelihood: <math>L(y, \hat{y}) = -\log P(y | x)</math>, then empirical risk minimization is equivalent to [[Maximum likelihood|maximum likelihood estimation]].
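For example, the following sketch (with made-up binary data and a hand-rolled gradient descent, added here for illustration) minimizes the empirical risk under the negative log-likelihood loss for a logistic conditional model, which is exactly maximum likelihood estimation of its weights:

<syntaxhighlight lang="python">
# Illustrative sketch: with the negative log-likelihood loss and a logistic
# conditional model P(y=1 | x) = sigmoid(w.x), minimizing the empirical risk
# by gradient descent is maximum likelihood estimation. Data are made up.
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))
true_w = np.array([1.5, -2.0, 0.5])
y = (rng.uniform(size=N) < 1 / (1 + np.exp(-X @ true_w))).astype(float)

def empirical_risk(w):
    """Average negative log-likelihood of the labels under the logistic model."""
    p = 1 / (1 + np.exp(-X @ w))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

w = np.zeros(d)
for _ in range(2000):                       # plain gradient descent on the empirical risk
    p = 1 / (1 + np.exp(-X @ w))
    grad = X.T @ (p - y) / N
    w -= 0.5 * grad

print("estimated weights:", np.round(w, 2), " empirical risk:", round(empirical_risk(w), 3))
</syntaxhighlight>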
When <math>G</math> contains many candidate functions or the training set is not sufficiently large, empirical risk minimization leads to high variance and poor generalization. The learning algorithm is able to memorize the training examples without generalizing well. This is called [[overfitting]].
===Structural risk minimization===
[[Structural risk minimization]] seeks to prevent overfitting by incorporating a [[Regularization (mathematics)|regularization penalty]] into the optimization. The regularization penalty can be viewed as implementing a form of [[Occam's razor]] that prefers simpler functions over more complex ones.

A wide variety of penalties have been employed that correspond to different definitions of complexity. For example, consider the case where the function <math>g</math> is a linear function of the form
:<math> g(x) = \sum_{j=1}^d \beta_j x_j</math>.
A popular regularization penalty is <math>\sum_j \beta_j^2</math>, which is the squared [[Euclidean norm]] of the weights, also known as the <math>L_2</math> norm. Other norms include the <math>L_1</math> norm, <math>\sum_j |\beta_j|</math>, and the <math>L_0</math> norm, which is the number of non-zero <math>\beta_j</math>s. The penalty will be denoted by <math>C(g)</math>.

The supervised learning optimization problem is to find the function <math>g</math> that minimizes
:<math> J(g) = R_{emp}(g) + \lambda C(g).</math>
The parameter <math>\lambda</math> controls the bias-variance tradeoff. When <math>\lambda = 0</math>, this gives empirical risk minimization with low bias and high variance. When <math>\lambda</math> is large, the learning algorithm will have high bias and low variance. The value of <math>\lambda</math> can be chosen empirically via [[cross-validation (statistics)|cross validation]].
The complexity penalty has a Bayesian interpretation as the negative log prior probability of <math>g</math>, <math>-\log P(g)</math>, in which case <math>J(g)</math> is the negative log posterior probability of <math>g</math> (up to an additive constant).
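For the linear function above with the squared Euclidean norm penalty and a squared-error empirical risk (i.e., ridge regression), the minimizer of <math>J(g)</math> has a closed form. The following sketch (with made-up data) shows how increasing <math>\lambda</math> shrinks the weights:

<syntaxhighlight lang="python">
# Illustrative sketch: structural risk minimization for a linear model with a
# squared-error empirical risk and an L2 penalty (ridge regression).
# Minimizing J(beta) = (1/N)*||y - X beta||^2 + lambda*||beta||^2 gives the
# closed form beta = (X^T X + N*lambda*I)^{-1} X^T y. Data here are made up.
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 5
X = rng.normal(size=(N, d))
true_beta = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.5, size=N)

def ridge_fit(X, y, lam):
    """Minimize the regularized objective J(beta) in closed form."""
    N, d = X.shape
    return np.linalg.solve(X.T @ X + N * lam * np.eye(d), X.T @ y)

for lam in (0.0, 0.1, 10.0):        # lambda = 0 is plain empirical risk minimization
    beta = ridge_fit(X, y, lam)
    print(f"lambda={lam:5.1f}  squared norm of beta = {np.sum(beta**2):.3f}")
</syntaxhighlight>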
==Generative training==
The training methods described above are ''discriminative training'' methods, because they seek to find a function <math>g</math> that discriminates well between the different output values (see [[discriminative model]]). For the special case where <math>f(x,y) = P(x,y)</math> is a joint probability distribution and the loss function is the negative log likelihood <math>- \sum_i \log P(x_i, y_i),</math> a risk minimization algorithm is said to perform ''generative training'', because <math>f</math> can be regarded as a [[generative model]] that explains how the data were generated. Generative training algorithms are often simpler and more computationally efficient than discriminative training algorithms. In some cases, the solution can be computed in closed form as in [[Naive Bayes classifier|naive Bayes]] and [[linear discriminant analysis]].
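As an illustration of a closed-form generative solution, the following sketch assumes a Gaussian naive Bayes model (independent features within each class); the model and dataset are arbitrary example choices:

<syntaxhighlight lang="python">
# Illustrative sketch of generative training: fit a joint model
# P(x, y) = P(y) * P(x | y) in closed form and classify with arg max_y P(x, y).
# A Gaussian naive Bayes model is assumed as an example; the dataset is arbitrary.
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Closed-form "training": class priors, per-class feature means and variances.
priors = np.array([np.mean(y == c) for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])
vars_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])

def log_joint(x):
    """log P(x, y=c) for every class c under the naive Bayes model."""
    log_lik = -0.5 * np.sum(np.log(2 * np.pi * vars_) + (x - means) ** 2 / vars_, axis=1)
    return np.log(priors) + log_lik

predictions = np.array([classes[np.argmax(log_joint(x))] for x in X])
print("training accuracy:", np.mean(predictions == y))
</syntaxhighlight>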
==Generalizations of supervised learning==
There are several ways in which the standard supervised learning problem can be generalized:
# [[Semi-supervised learning]]: In this setting, the desired output values are provided only for a subset of the training data. The remaining data is unlabeled.
# [[Active learning (machine learning)|Active learning]]: Instead of assuming that all of the training examples are given at the start, active learning algorithms interactively collect new examples, typically by making queries to a human user. Often, the queries are based on unlabeled data, which is a scenario that combines semi-supervised learning with active learning.
# [[Structured prediction]]: When the desired output value is a complex object, such as a parse tree or a labeled graph, then standard methods must be extended.
# [[Learning to rank]]: When the input is a set of objects and the desired output is a ranking of those objects, then again the standard methods must be extended.
== Approaches and algorithms ==
* [[Analytical learning]]
* [[Artificial neural network]]
* [[Backpropagation]]
* [[Boosting (meta-algorithm)]]
* [[Bayesian statistics]]
* [[Case-based reasoning]]
* [[Decision tree learning]]
* [[Inductive logic programming]]
* [[Gaussian process regression]]
* [[Group method of data handling]]
* [[Variable kernel density estimation#Use for statistical classification|Kernel estimators]]
* [[Learning Automata]]
* [[Minimum message length]] ([[decision tree]]s, decision graphs, etc.)
* [[Multilinear subspace learning]]
* [[Naive Bayes classifier]]
* [[Nearest neighbor (pattern recognition)|Nearest Neighbor Algorithm]]
* [[Probably approximately correct learning]] (PAC) learning
* [[Ripple down rules]], a knowledge acquisition methodology
* [[Symbolic machine learning]] algorithms
* [[Subsymbolic machine learning]] algorithms
* [[Support vector machine]]s
* [[Random forest|Random Forests]]
* [[Ensembles of Classifiers]]
* [[Ordinal classification]]
* [[Data Pre-processing]]
* [[Handling imbalanced datasets]]
* [[Statistical relational learning]]
* [[Proaftn]], a multicriteria classification algorithm
== Applications ==
* [[Bioinformatics]]
* [[Cheminformatics]]
** [[Quantitative structure–activity relationship]]
* [[Database marketing]]
* [[Handwriting recognition]]
* [[Information retrieval]]
** [[Learning to rank]]
* Object recognition in [[computer vision]]
* [[Optical character recognition]]
* [[Spamming|Spam detection]]
* [[Pattern recognition]]
* [[Speech recognition]]
== General issues ==
* [[Computational learning theory]]
* [[Inductive bias]]
* [[Overfitting (machine learning)]]
* (Uncalibrated) [[Class membership probabilities]]
* [[Version space]]s
==References==
{{reflist}}

==External links==
* [http://www.mloss.org mloss.org]: a directory of open source machine learning software.

{{DEFAULTSORT:Supervised Learning}}
[[Category:Machine learning]]