Bias–variance dilemma


In machine learning, the bias–variance dilemma or bias–variance tradeoff is the problem of simultaneously minimizing two sources of error: the bias (how far the model's average prediction, taken across different training sets, deviates from the true relationship) and the variance of the model error (how sensitive the model is to small changes in the training set). This tradeoff applies to all forms of supervised learning: classification, function fitting,[1][2] and structured output learning.

The bias–variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that captures the regularities in its training data but also generalizes well to unseen data. Models with high bias are intuitively simple models: they impose restrictions on the kind of regularities that can be learned (linear classifiers are one example). The problem with these models is that they underfit, i.e. they fail to learn the relationship between the features and the target variables. Models with high variance are those that can learn many kinds of complex regularities, but this includes the possibility of learning the noise in the training data, i.e. overfitting. To achieve good performance on data outside the training set, a tradeoff must be made.
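Both failure modes can be illustrated with a minimal sketch (our own example, not taken from the article's references): polynomials of increasing degree fitted with NumPy to noisy samples of a sine curve. A degree-1 fit underfits (high bias), while a high-degree fit chases the noise (high variance).

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy training samples of a sine curve (illustrative setup).
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)

# A held-out grid with noiseless targets for measuring generalisation error.
x_test = np.linspace(0, 1, 200)
y_true = np.sin(2 * np.pi * x_test)

test_mse = {}
for degree in (1, 5, 12):
    coeffs = np.polyfit(x, y, degree)      # least-squares polynomial fit
    pred = np.polyval(coeffs, x_test)
    test_mse[degree] = float(np.mean((pred - y_true) ** 2))
    print(f"degree {degree:2d}: test MSE = {test_mse[degree]:.3f}")
```

The degree-1 model cannot represent the sine's curvature at all, so its test error stays high regardless of the noise; the intermediate degree typically performs best.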

Relevance

This trade-off is closely related to overfitting and underfitting. Models that consistently achieve small deviations from the training data are typically poor predictors on non-training inputs and change significantly when the training set changes (making them sensitive to outliers and noise). Conversely, models that systematically deviate from the training data can resist noise and generalise well, although too strong a deviation degrades performance.

Typically the goal is to find an optimal trade-off between bias and variance. A common model selection criterion is that the decrease of bias with increasing model complexity becomes equal to the increase of variance. There are however other possible considerations, such as error losses or complexity costs, that may lead to other trade-offs. The choice of model may also introduce biases that correspond to useful prior information, for example that the output must be within a given interval.

Bias–variance decomposition

(After [3])

Suppose we have data ${\displaystyle D=\{(x_{1},y_{1}),(x_{2},y_{2}),\ldots ,(x_{N},y_{N})\}}$ derived from the true model ${\displaystyle y=f(x)+\epsilon }$, where ${\displaystyle \epsilon }$ is a random variable with ${\displaystyle E[\epsilon ]=0}$.

Given the data, we train a model ${\displaystyle g(x|D)}$ to approximate ${\displaystyle f}$. For brevity and clarity ${\displaystyle g(x_{i}|D)}$ will be written as ${\displaystyle g_{i}}$ below. Similarly ${\displaystyle f_{i}=f(x_{i})}$.

The value of interest is the expectation of the MSE across different realisations of data:

${\displaystyle E[MSE]=(1/N)\sum _{i=1}^{N}E[(y_{i}-g_{i})^{2}]}$

Each summand can be split apart by adding and subtracting ${\displaystyle f_{i}}$:

${\displaystyle E[(y_{i}-g_{i})^{2}]=E[(y_{i}-f_{i}+f_{i}-g_{i})^{2}]}$
${\displaystyle =E[(y_{i}-f_{i})^{2}]+E[(f_{i}-g_{i})^{2}]+2E[(f_{i}-g_{i})(y_{i}-f_{i})]}$
${\displaystyle =E[\epsilon ^{2}]+E[(f_{i}-g_{i})^{2}]+2\left(E[f_{i}y_{i}]-E[f_{i}^{2}]-E[g_{i}y_{i}]+E[g_{i}f_{i}]\right)}$

Since ${\displaystyle f_{i}}$ is deterministic, ${\displaystyle E[f_{i}y_{i}]=f_{i}E[f_{i}+\epsilon ]=E[f_{i}^{2}]}$, and assuming the noise ${\displaystyle \epsilon }$ is independent of the trained model, ${\displaystyle E[g_{i}y_{i}]=E[g_{i}f_{i}]}$, so the cross term vanishes:

${\displaystyle E[(y_{i}-g_{i})^{2}]=E[\epsilon ^{2}]+E[(f_{i}-g_{i})^{2}]}$

The second term can be decomposed with the same trick, by adding and subtracting ${\displaystyle E[g_{i}]}$; the cross term vanishes because ${\displaystyle E[E[g_{i}]-g_{i}]=0}$:

${\displaystyle E[(f_{i}-g_{i})^{2}]=E[(f_{i}-E[g_{i}])^{2}]+E[(E[g_{i}]-g_{i})^{2}]+0}$

Hence

${\displaystyle E[(y_{i}-g_{i})^{2}]=E[\epsilon ^{2}]+E[(f_{i}-E[g_{i}])^{2}]+E[(E[g_{i}]-g_{i})^{2}]}$
${\displaystyle =\mathrm {Var} (\epsilon )+(\mathrm {bias} )^{2}+\mathrm {Var} (g_{i})}$

The expected MSE is hence composed of an irreducible error due to the noise (the first term), a squared bias due to the model systematically deviating from ${\displaystyle f}$ (the second term), and a variance due to the model fluctuating around its own average across different realisations of the data (the third term).
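The decomposition can be checked numerically. The sketch below uses an illustrative setup of our own choosing: ${\displaystyle f(x)=x^{2}}$, Gaussian noise, and a straight-line fit for ${\displaystyle g}$, which is biased because a line cannot represent the curvature. Fresh noise is drawn for the evaluation targets so that the independence assumption behind the vanishing cross term holds.

```python
import numpy as np

rng = np.random.default_rng(1)

f = lambda x: x ** 2                 # true model
sigma = 0.5                          # noise standard deviation
x = np.linspace(-1, 1, 20)           # fixed design points x_1..x_N
R = 2000                             # number of data realisations D

preds = np.empty((R, x.size))
mses = np.empty(R)
for r in range(R):
    y = f(x) + rng.normal(0, sigma, size=x.size)
    coeffs = np.polyfit(x, y, 1)     # train g (a line) on this realisation
    g = np.polyval(coeffs, x)
    preds[r] = g
    # fresh noisy targets play the role of y_i, independent of g
    y_new = f(x) + rng.normal(0, sigma, size=x.size)
    mses[r] = np.mean((y_new - g) ** 2)

g_mean = preds.mean(axis=0)                     # estimate of E[g_i]
bias2 = np.mean((f(x) - g_mean) ** 2)           # averaged over the N points
variance = np.mean(preds.var(axis=0))
noise = sigma ** 2

print(f"E[MSE]                ≈ {mses.mean():.3f}")
print(f"noise + bias^2 + var  ≈ {noise + bias2 + variance:.3f}")
```

The two printed quantities agree up to Monte Carlo error, term-by-term matching the decomposition above.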

In the above, the data were written as scalar values, but the decomposition carries over to vector-valued data as well.

Approaches

Cross-validation is widely used to estimate model error by testing on data held out from the training set. Averaging multiple rounds with randomly selected test sets reduces the variability of the estimate; a model with high variance will produce high average errors on the test sets.
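A minimal k-fold cross-validation loop (our own sketch; the fold count and the polynomial models are arbitrary illustrative choices) might look as follows:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data: noisy samples of sin(3x).
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(0, 0.2, size=x.size)

def cv_mse(x, y, degree, k=5):
    """Estimate test MSE of a degree-`degree` polynomial fit via k-fold CV."""
    idx = rng.permutation(x.size)
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2))
    return float(np.mean(errs))      # averaging over folds reduces variability

for degree in (1, 3, 9):
    print(f"degree {degree}: CV MSE = {cv_mse(x, y, degree):.3f}")
```

The high-bias degree-1 model shows a large cross-validated error; the cross-validated estimate can then guide model selection.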

Dimensionality reduction and feature selection can decrease variance by simplifying models. Similarly, a larger training set tends to decrease variance. Adding features (predictors) tends to decrease bias, at the expense of introducing additional variance. Learning algorithms also typically have tunable parameters that control bias and variance, such as the amount of regularization or the number of neighbours k in k-nearest neighbours.
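One such tunable parameter can be illustrated concretely (our own choice of example): the neighbourhood size k in k-nearest-neighbour regression. Re-estimating the prediction at a fixed query point across many training-set realisations shows small k giving high variance and large k giving higher bias.

```python
import numpy as np

rng = np.random.default_rng(3)

def knn_predict(x_train, y_train, x0, k):
    """Average the targets of the k training points nearest to x0."""
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()

f = lambda x: np.sin(2 * np.pi * x)
x0 = 0.25                            # fixed query point (f peaks here)
R = 1000                             # number of training-set realisations

results = {}
for k in (1, 15):
    preds = np.empty(R)
    for r in range(R):
        x = rng.uniform(0, 1, 40)
        y = f(x) + rng.normal(0, 0.3, size=x.size)
        preds[r] = knn_predict(x, y, x0, k)
    bias2 = (f(x0) - preds.mean()) ** 2
    var = preds.var()
    results[k] = (bias2, var)
    print(f"k={k:2d}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```

With k=1 the prediction inherits the full noise of a single training point (high variance, low bias); with k=15 the averaging suppresses noise but smears the peak of f (low variance, higher bias).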

One way of resolving the trade-off is to use mixture models and ensemble learning.[6][7] For example, boosting combines many "weak" (high bias) models in an ensemble that has lower bias than the individual models, while bagging combines "strong" learners in a way that reduces their variance.
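Bagging's variance reduction can be sketched as follows (an illustration of our own; 1-nearest-neighbour regression stands in as the unstable "strong" base learner, and bagging averages B copies of it trained on bootstrap resamples of the data):

```python
import numpy as np

rng = np.random.default_rng(4)

f = lambda x: np.sin(2 * np.pi * x)

def nn1_predict(x_train, y_train, x0):
    """1-nearest-neighbour regression: an unstable, high-variance learner."""
    return y_train[np.argmin(np.abs(x_train - x0))]

def bagged_predict(x_train, y_train, x0, B=25):
    """Average B base learners, each trained on a bootstrap resample."""
    n = x_train.size
    preds = []
    for _ in range(B):
        idx = rng.integers(0, n, n)          # bootstrap resample of the data
        preds.append(nn1_predict(x_train[idx], y_train[idx], x0))
    return float(np.mean(preds))

x0, R = 0.3, 500
single = np.empty(R)
bagged = np.empty(R)
for r in range(R):
    x = rng.uniform(0, 1, 40)
    y = f(x) + rng.normal(0, 0.3, size=x.size)
    single[r] = nn1_predict(x, y, x0)
    bagged[r] = bagged_predict(x, y, x0)

print(f"var(single 1-NN) = {single.var():.4f}")
print(f"var(bagged 1-NN) = {bagged.var():.4f}")
```

Measured across training-set realisations, the bagged predictor fluctuates noticeably less than a single 1-NN predictor, because the bootstrap averaging smooths out the base learner's sensitivity to individual noisy points.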