In [[Computational learning theory|statistical learning theory]], a '''representer theorem''' is any of several related results stating that a minimizer <math>f^{*}</math> of a regularized [[Empirical risk minimization|empirical risk function]] defined over a [[reproducing kernel Hilbert space]] can be represented as a finite linear combination of kernel products evaluated on the input points in the training set data.


==Formal Statement==
The following Representer Theorem and its proof are due to [[Bernhard Schölkopf|Schölkopf]], Herbrich, and Smola:
 
'''Theorem:''' Let <math>\mathcal{X}</math> be a nonempty set and <math>k</math> a positive-definite real-valued kernel on <math>\mathcal{X} \times \mathcal{X}</math> with corresponding reproducing kernel Hilbert space <math>H_k</math>.  Given a training sample <math>(x_1, y_1), \dotsc, (x_n, y_n) \in \mathcal{X} \times \R</math>, a strictly monotonically increasing real-valued function <math>g \colon [0, \infty) \to \R</math>, and an arbitrary empirical risk function <math>E \colon (\mathcal{X} \times \R^2)^n \to \R \cup \lbrace \infty \rbrace</math>, for any <math>f^{*} \in H_k</math> satisfying
 
:<math>
f^{*} = \operatorname{arg min}_{f \in H_k} \left\lbrace E\left( (x_1, y_1, f(x_1)), ..., (x_n, y_n, f(x_n)) \right) + g\left( \lVert f \rVert \right) \right \rbrace, \quad (*)
</math>
 
<math>f^{*}</math> admits a representation of the form:
 
:<math>
f^{*}(\cdot) = \sum_{i = 1}^n \alpha_i k(\cdot, x_i),
</math>
 
where <math>\alpha_i \in \R</math> for all <math>1 \le i \le n</math>.
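A minimal numerical sketch of the representer form (the Gaussian kernel, data, and names below are illustrative choices, not taken from the cited sources): given coefficients <math>\alpha_i</math> and the training inputs, <math>f^{*}</math> can be evaluated anywhere on <math>\mathcal{X}</math> using only kernel products.

```python
import numpy as np

def rbf_kernel(x, xp, gamma=1.0):
    # Gaussian (RBF) kernel k(x, x') = exp(-gamma * (x - x')^2),
    # a positive-definite kernel on R x R.
    return np.exp(-gamma * (x - xp) ** 2)

def representer(alpha, X_train, kernel):
    # f*(.) = sum_i alpha_i k(., x_i): determined entirely by the n
    # coefficients alpha_i and the training inputs x_i.
    return lambda x: sum(a * kernel(x, xi) for a, xi in zip(alpha, X_train))

X_train = np.array([0.0, 1.0, 2.0])
alpha = np.array([0.5, -1.0, 0.25])
f_star = representer(alpha, X_train, rbf_kernel)
value_at_zero = f_star(0.0)  # = 0.5*k(0,0) - 1.0*k(0,1) + 0.25*k(0,2)
```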
 
'''Proof:'''
Define a mapping
 
:<math>
\begin{align}
\varphi \colon \mathcal{X} &\to \R^{\mathcal{X}} \\
\varphi(x) &= k(\cdot, x)
\end{align}
</math>
 
(so that <math>\varphi(x) = k(\cdot, x)</math> is itself a map <math>\mathcal{X} \to \R</math>). Since <math>k</math> is a reproducing kernel, we have
 
:<math>
\varphi(x)(x') = k(x', x) = \langle \varphi(x'), \varphi(x) \rangle,
</math>
where <math>\langle \cdot, \cdot \rangle</math> is the inner product on <math>H_k</math>.
 
Given any <math>x_1, ..., x_n</math>, one can use orthogonal projection to decompose any <math>f \in H_k</math> into a sum of two functions, one lying in <math>\operatorname{span} \left \lbrace \varphi(x_1), ..., \varphi(x_n) \right \rbrace</math>, and the other lying in the orthogonal complement:
 
:<math>
f = \sum_{i = 1}^n \alpha_i \varphi(x_i) + v,
</math>
where <math>\langle v, \varphi(x_i) \rangle = 0</math> for all <math>i</math>.
 
The above orthogonal decomposition and the [[Reproducing kernel Hilbert space#The Reproducing Property|reproducing property]] together show that applying <math>f</math> to any training point <math>x_j</math> produces
 
:<math>
f(x_j) = \left \langle \sum_{i = 1}^n \alpha_i \varphi(x_i) + v, \varphi(x_j) \right \rangle = \sum_{i = 1}^n \alpha_i \langle \varphi(x_i), \varphi(x_j) \rangle,
</math>
 
which we observe is independent of <math>v</math>.  Consequently, the value of the empirical risk <math>E</math> in (*) is likewise independent of <math>v</math>. For the second term (the regularization term), since <math>v</math> is orthogonal to <math>\sum_{i = 1}^n \alpha_i \varphi(x_i)</math> and <math>g</math> is strictly monotonic, we have
 
:<math>
\begin{align}
g\left( \lVert f \rVert \right) &= g \left(  \lVert \sum_{i = 1}^n \alpha_i \varphi(x_i) + v \rVert \right) \\
&= g \left( \sqrt{  \lVert \sum_{i = 1}^n \alpha_i \varphi(x_i)  \rVert^2 + \lVert v \rVert^2} \right) \\
&\ge g \left(  \lVert \sum_{i = 1}^n \alpha_i \varphi(x_i) \rVert \right).
\end{align}
</math>
 
Therefore, setting <math>v = 0</math> does not affect the first term of (*), while it strictly decreases the second term whenever <math>v \neq 0</math>. Consequently, any minimizer <math>f^{*}</math> in (*) must have <math>v = 0</math>, i.e., it must be of the form
 
:<math>
f^{*}(\cdot) = \sum_{i = 1}^n \alpha_i \varphi(x_i) = \sum_{i = 1}^n \alpha_i k(\cdot, x_i),
</math>
 
which is the desired result.
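The decomposition at the heart of the proof can be checked numerically. In the sketch below (an illustration with an arbitrary Gaussian kernel, not part of the original proof), <math>f = \varphi(x_0)</math> for a held-out point <math>x_0</math> is split into its projection onto the span of the training features plus a remainder <math>v</math>; all RKHS inner products reduce to kernel evaluations via <math>\langle \varphi(a), \varphi(b) \rangle = k(a, b)</math>.

```python
import numpy as np

def k(a, b, gamma=1.0):
    # Gaussian kernel, an arbitrary positive-definite choice.
    return np.exp(-gamma * (a - b) ** 2)

X = np.array([0.0, 1.0, 2.0])      # training inputs x_1, ..., x_n
x0 = 0.7                           # held-out point defining f = phi(x0)
K = k(X[:, None], X[None, :])      # Gram matrix K_ij = k(x_i, x_j)
k0 = k(X, x0)                      # vector of k(x_i, x0)

c = np.linalg.solve(K, k0)         # coefficients of the projection sum_i c_i phi(x_i)

# <v, phi(x_j)> = k(x_j, x0) - (K c)_j: zero at every training point, so
# dropping v leaves f(x_j) = <f, phi(x_j)> unchanged.
inner_v_phi = k0 - K @ c

# Pythagoras: ||f||^2 = ||projection||^2 + ||v||^2, with ||v||^2 > 0 here,
# so removing v strictly shrinks the norm, as in the proof.
norm_f_sq = k(x0, x0)              # = 1.0 for the Gaussian kernel
norm_proj_sq = c @ K @ c
norm_v_sq = norm_f_sq - c @ k0
```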
 
==Generalizations: Variations on a theme by Kimeldorf and Wahba==
The Theorem stated above is a particular example of a family of results that are collectively referred to as "Representer Theorems"; here we describe several of them.
 
The first statement of a Representer Theorem was due to Kimeldorf and Wahba for the special case in which
 
:<math>
\begin{align}
E\left( (x_1, y_1, f(x_1)), ...,  (x_n, y_n, f(x_n)) \right) &= \frac{1}{n} \sum_{i = 1}^n (f(x_i) - y_i)^2, \\
g(\lVert f \rVert) &= \lambda \lVert f \rVert^2
\end{align}
</math>
 
for <math>\lambda > 0</math>.  Schölkopf, Herbrich, and Smola generalized this result by relaxing the assumption of the squared-loss cost and allowing the regularizer to be any strictly monotonically increasing function <math>g(\cdot)</math> of the Hilbert space norm.
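For this squared-loss special case, substituting the representer form turns (*) into a finite linear problem: <math>f(x_j) = (K\alpha)_j</math> and <math>\lVert f \rVert^2 = \alpha^\top K \alpha</math>, and the resulting normal equations are solved by <math>\alpha = (K + n\lambda I)^{-1} y</math>. The sketch below illustrates this with an arbitrary Gaussian kernel and made-up data (all names and parameter values are our own choices).

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    # Gram matrix K_ij = k(x_i, x_j) for the Gaussian kernel.
    d = X[:, None] - X[None, :]
    return np.exp(-gamma * d ** 2)

def krr_coefficients(X, y, lam, gamma=1.0):
    # Minimizing (1/n) sum_i (f(x_i) - y_i)^2 + lam * ||f||^2 over the
    # representer form reduces to the linear system (K + n*lam*I) alpha = y.
    n = len(X)
    K = rbf_kernel_matrix(X, gamma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

X = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.sin(X)
lam = 1e-3
alpha = krr_coefficients(X, y, lam)
fitted = rbf_kernel_matrix(X) @ alpha  # f*(x_j) = (K alpha)_j
```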
 
It is possible to generalize further by augmenting the regularized empirical risk function through the addition of unpenalized offset terms.  For example, Schölkopf, Herbrich, and Smola also consider the minimization
 
:<math>
\tilde{f}^{*} = \operatorname{arg min} \left\lbrace E\left( (x_1, y_1, \tilde{f}(x_1)),  ...,  (x_n, y_n, \tilde{f}(x_n)) \right) + g\left( \lVert f \rVert \right) \mid \tilde{f} = f  + h \in H_k \oplus  \operatorname{span} \lbrace \psi_p \mid 1 \le p \le M \rbrace  \right \rbrace, \quad (\dagger)
</math>
 
i.e., we consider functions of the form <math>\tilde{f} = f + h</math>, where <math>f \in H_k</math> and <math>h</math> is an unpenalized function lying in the span of a finite set of real-valued functions <math>\lbrace \psi_p \colon \mathcal{X} \to \R \mid 1 \le p \le M \rbrace</math>. Under the assumption that the <math>n \times M</math> matrix <math>\left( \psi_p(x_i) \right)_{ip}</math> has rank <math>M</math>, they show that the minimizer <math>\tilde{f}^{*}</math> in <math>(\dagger)</math>
admits a representation of the form
 
:<math>
\tilde{f}^{*}(\cdot) = \sum_{i = 1}^n \alpha_i k(\cdot, x_i) + \sum_{p = 1}^M \beta_p \psi_p(\cdot)
</math>
 
where <math>\alpha_i, \beta_p \in \R</math> and the <math>\beta_p</math> are all uniquely determined.
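As a sketch of the semiparametric case, take <math>M = 1</math> with the unpenalized offset <math>\psi_1(x) \equiv 1</math> (a bias term), squared loss, and <math>g(\lVert f \rVert) = \lambda \lVert f \rVert^2</math>. Writing <math>\tilde{f} = \sum_i \alpha_i k(\cdot, x_i) + \beta</math> and setting the gradient of the objective in <math>(\alpha, \beta)</math> to zero yields a linear system (assuming the Gram matrix <math>K</math> is invertible); kernel, data, and <math>\lambda</math> below are illustrative choices.

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    d = X[:, None] - X[None, :]
    return np.exp(-gamma * d ** 2)

X = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.sin(X) + 5.0                 # data with a large constant offset
n, lam = len(X), 1e-2
K = rbf_gram(X)
Psi = np.ones((n, 1))               # the n x M design column (psi_1(x_i)), rank M = 1

# Stationarity conditions of the regularized objective:
#   (K + n*lam*I) alpha + Psi beta     = y
#   Psi^T K alpha + Psi^T Psi beta     = Psi^T y
A = np.block([[K + n * lam * np.eye(n), Psi],
              [Psi.T @ K,               Psi.T @ Psi]])
sol = np.linalg.solve(A, np.concatenate([y, Psi.T @ y]))
alpha, beta = sol[:n], sol[n]

fitted = K @ alpha + beta           # values of f~* at the training points
```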
 
The conditions under which a Representer Theorem exists were investigated by Argyriou, Micchelli, and Pontil, who proved the following:
 
'''Theorem:''' Let <math>\mathcal{X}</math> be a nonempty set, <math>k</math> a positive-definite real-valued kernel on <math>\mathcal{X} \times \mathcal{X}</math> with corresponding reproducing kernel Hilbert space <math>H_k</math>, and let <math>R \colon H_k \to \R</math> be a differentiable regularization function. Then given a training sample <math>(x_1, y_1), ..., (x_n, y_n) \in \mathcal{X} \times \R</math> and an arbitrary empirical risk function <math>E \colon (\mathcal{X} \times \R^2)^n \to \R \cup \lbrace \infty \rbrace</math>, a minimizer
 
:<math>
f^{*} =  \operatorname{arg min}_{f \in H_k} \left\lbrace E\left( (x_1, y_1, f(x_1)), ..., (x_n, y_n, f(x_n)) \right) + R(f) \right \rbrace \quad (\ddagger)
</math>
 
of the regularized empirical risk minimization problem admits a representation of the form
 
:<math>
f^{*}(\cdot) = \sum_{i = 1}^n \alpha_i k(\cdot, x_i),
</math>
 
where <math>\alpha_i \in \R</math> for all <math>1 \le i \le n</math>, if and only if there exists a nondecreasing function <math>h \colon [0, \infty) \to \R</math> for which
 
:<math>
R(f) = h(\lVert f \rVert).
</math>
 
Effectively, this result provides a necessary and sufficient condition on a differentiable regularizer <math>R(\cdot)</math> under which the corresponding regularized empirical risk minimization <math>(\ddagger)</math> will have a Representer Theorem. In particular, this shows that a broad class of regularized risk minimizations (much broader than those originally considered by Kimeldorf and Wahba) have Representer Theorems.
 
==Applications==
Representer theorems are useful from a practical standpoint because they dramatically simplify the regularized empirical risk minimization problem <math>(\ddagger)</math>. In most interesting applications, the search domain <math>H_k</math> for the minimization will be an infinite-dimensional subspace of <math>L^2(\mathcal{X})</math>, and therefore the search (as written) does not admit implementation on finite-memory and finite-precision computers. In contrast, the representation of <math>f^{*}(\cdot)</math> afforded by a representer theorem reduces the original (infinite-dimensional) minimization problem to a search for the optimal <math>n</math>-dimensional vector of coefficients <math>\alpha = (\alpha_1, ..., \alpha_n) \in \R^n</math>; <math>\alpha</math> can then be obtained by applying any standard function minimization algorithm.  Consequently, representer theorems provide the theoretical basis for the reduction of the general machine learning problem to algorithms that can actually be implemented on computers in practice.
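The reduction can be sketched concretely (an illustration with assumed choices throughout, not a method from the references): with <math>f = \sum_i \alpha_i k(\cdot, x_i)</math>, both the training-point values <math>f(x_j) = (K\alpha)_j</math> and the regularizer <math>\lambda \lVert f \rVert^2 = \lambda \alpha^\top K \alpha</math> depend only on <math>\alpha \in \R^n</math>, so any standard optimizer applies. Here the empirical risk is the logistic loss for labels in <math>\lbrace -1, +1 \rbrace</math>, minimized by plain gradient descent over the coefficients.

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    d = X[:, None] - X[None, :]
    return np.exp(-gamma * d ** 2)

def objective_grad(alpha, K, y, lam):
    # Gradient of (1/n) sum_j log(1 + exp(-y_j f(x_j))) + lam * alpha^T K alpha
    # with respect to alpha, using f(x_j) = (K alpha)_j.
    f = K @ alpha
    p = 1.0 / (1.0 + np.exp(y * f))      # -y*p is d/df of log(1 + exp(-y*f))
    return K @ (-y * p) / len(y) + 2.0 * lam * (K @ alpha)

X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])
K = rbf_gram(X)
alpha = np.zeros(len(X))
for _ in range(2000):                    # gradient descent over R^n only
    alpha -= 0.5 * objective_grad(alpha, K, y, lam=1e-2)

margins = y * (K @ alpha)                # positive iff x_j is classified correctly
```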
 
 
==See also==
* [[Mercer's theorem]]
 
==References==
{{reflist}}
*{{cite journal
|first1=Andreas |last1=Argyriou
|first2=Charles A. |last2=Micchelli
|first3=Massimiliano |last3=Pontil
|title=When Is There a Representer Theorem? Vector Versus Matrix Regularizers
|journal=Journal of Machine Learning Research
|volume=10 |issue=Dec |pages=2507&ndash;2529  |year=2009
}}
*{{cite journal
|first1=Felipe |last1=Cucker
|first2=Steve |last2=Smale
|title=On the Mathematical Foundations of Learning
|journal=[[Bulletin of the American Mathematical Society]]
|volume=39 |issue=1 |pages=1&ndash;49 |year=2002
|doi=10.1090/S0273-0979-01-00923-5
|mr=1864085
}}
*{{cite journal
|first1=George S. |last1=Kimeldorf
|first2=Grace |last2=Wahba
|title=A correspondence between Bayesian estimation on stochastic processes and smoothing by splines
|journal=The Annals of Mathematical Statistics
|volume=41 |issue=2 |pages=495&ndash;502 |year=1970
|doi=10.1214/aoms/1177697089
}}
*{{cite journal
|first1=Bernhard |last1=Schölkopf
|first2=Ralf |last2=Herbrich
|first3=Alex J. |last3=Smola
|title=A Generalized Representer Theorem
|journal=Computational Learning Theory
|volume=2111 |pages=416&ndash;426 |year=2001
|doi=10.1007/3-540-44581-1_27
|series=Lecture Notes in Computer Science
|isbn=978-3-540-42343-0
}}
 
[[Category:Computational learning theory]]
[[Category:Theoretical computer science]]
[[Category:Machine learning]]
[[Category:Hilbert space]]
