# Empirical risk minimization

Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on the performance of learning algorithms.

## Background

Consider the standard setting of supervised learning: training examples $(x,y)$ are drawn from an unknown joint probability distribution $P(x,y)$ over an input space $X$ and an output space $Y$, and the goal is to choose a hypothesis $h\colon X\to Y$ from a fixed class of functions ${\mathcal {H}}$. We also assume that we are given a non-negative, real-valued loss function $L({\hat {y}},y)$ which measures how different the prediction ${\hat {y}}$ of a hypothesis is from the true outcome $y$. The risk associated with a hypothesis $h$ is then defined as the expectation of the loss function:

$R(h)=\mathbf {E} [L(h(x),y)]=\int L(h(x),y)\,dP(x,y).$

The ultimate goal of a learning algorithm is to find a hypothesis $h^{*}$ among the class ${\mathcal {H}}$ for which the risk $R(h)$ is minimal:

$h^{*}=\arg \min _{h\in {\mathcal {H}}}R(h).$

## Empirical risk minimization

In general, the risk $R(h)$ cannot be computed because the distribution $P(x,y)$ is unknown to the learning algorithm (this situation is referred to as agnostic learning). However, we can compute an approximation, called empirical risk, by averaging the loss function on the training set:

$R_{\mbox{emp}}(h)={\frac {1}{m}}\sum _{i=1}^{m}L(h(x_{i}),y_{i}),$

where $m$ is the number of training examples $(x_{i},y_{i})$. The empirical risk minimization principle states that the learning algorithm should choose a hypothesis ${\hat {h}}$ which minimizes the empirical risk:

${\hat {h}}=\arg \min _{h\in {\mathcal {H}}}R_{\mbox{emp}}(h).$

Thus the learning algorithm defined by the ERM principle consists of solving the above optimization problem.
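The ERM principle can be sketched directly for a finite hypothesis class: compute the empirical risk of every hypothesis on a training set and return a minimizer. The threshold classifiers and the toy data below are illustrative assumptions, not part of any standard setup.

```python
def empirical_risk(h, data, loss):
    """Average loss of hypothesis h over the training sample."""
    return sum(loss(h(x), y) for x, y in data) / len(data)

def zero_one_loss(y_hat, y):
    """0-1 loss: 1 for a misclassification, 0 otherwise."""
    return 0 if y_hat == y else 1

# Toy 1-D training set: the label is 1 exactly when x >= 3.
data = [(1, 0), (2, 0), (3, 1), (4, 1), (5, 1)]

# Assumed finite hypothesis class: threshold classifiers h_t(x) = 1[x >= t].
H = [lambda x, t=t: int(x >= t) for t in range(6)]

# ERM: pick the hypothesis with the smallest empirical risk.
h_hat = min(H, key=lambda h: empirical_risk(h, data, zero_one_loss))
```

Since the toy data is perfectly fit by the threshold $t=3$, the minimizer here attains zero empirical risk.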

## Properties

### Computational complexity

Empirical risk minimization for a classification problem with the 0-1 loss function is known to be an NP-hard problem, even for such a relatively simple class of functions as linear classifiers. However, it can be solved efficiently when the minimal empirical risk is zero, i.e. when the data is linearly separable.
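For the linearly separable case, the classic perceptron algorithm is one way to reach zero empirical risk efficiently. The sketch below is a minimal version on an assumed toy 2-D dataset with labels in $\{-1,+1\}$, not a general-purpose solver.

```python
def perceptron(data, max_epochs=100):
    """Find (w, b) with sign(w.x + b) matching labels in {-1, +1}."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            # A non-positive score y*(w.x + b) means x is misclassified.
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]  # move w toward y*x
                b += y
                mistakes += 1
        if mistakes == 0:  # zero empirical risk reached
            break
    return w, b

# Assumed linearly separable toy data.
data = [((0.0, 0.0), -1), ((1.0, 0.0), -1), ((0.0, 3.0), 1), ((1.0, 3.0), 1)]
w, b = perceptron(data)
```

On separable data the perceptron is guaranteed to converge, after which every training point has a positive margin and the empirical 0-1 risk is zero.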

In practice, machine learning algorithms cope with this either by employing a convex approximation to the 0-1 loss function (such as the hinge loss for SVM), which is easier to optimize, or by imposing assumptions on the distribution $P(x,y)$ (and thus ceasing to be agnostic learning algorithms, to which the above hardness result applies).
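As an illustration of the convex-surrogate approach, the sketch below minimizes the empirical hinge loss of a linear classifier by subgradient descent. The toy data, step size, and iteration count are illustrative assumptions; a practical SVM would also include a regularization term.

```python
def hinge_risk(w, b, data):
    """Empirical hinge loss: a convex upper bound on the 0-1 empirical risk."""
    return sum(max(0.0, 1 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
               for x, y in data) / len(data)

def subgradient_step(w, b, data, lr):
    """One subgradient-descent step on the averaged hinge loss."""
    gw, gb = [0.0] * len(w), 0.0
    for x, y in data:
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        if margin < 1:  # loss is active here: subgradient is -y*x (and -y for b)
            gw = [gi - y * xi for gi, xi in zip(gw, x)]
            gb -= y
    n = len(data)
    return [wi - lr * gi / n for wi, gi in zip(w, gw)], b - lr * gb / n

# Assumed linearly separable toy data with labels in {-1, +1}.
data = [((0.0, 0.0), -1), ((1.0, 0.0), -1), ((0.0, 3.0), 1), ((1.0, 3.0), 1)]
w, b = [0.0, 0.0], 0.0
for _ in range(200):
    w, b = subgradient_step(w, b, data, lr=0.1)
```

Because the hinge loss is convex in $(w,b)$, subgradient descent converges here to a classifier with zero hinge loss, which in particular classifies all training points correctly.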