| {{Distinguish2|[[divergence]] in [[vector calculus|calculus]]}}
| In [[probability theory]] and [[information theory]], the '''Kullback–Leibler divergence'''<ref>{{cite journal
| |
| |last1=Kullback |first1=S. |authorlink1=Solomon Kullback
| |
| |last2=Leibler |first2=R.A. |authorlink2=Richard Leibler
| |
| |year = 1951
| |
| |title = On Information and Sufficiency
| |
| |journal = [[Annals of Mathematical Statistics]]
| |
| |volume = 22 |issue=1
| |
| |pages=79–86
| |
| |doi=10.1214/aoms/1177729694 |mr=39968
| |
| }}</ref><ref>S. Kullback (1959) ''Information theory and statistics'' (John Wiley and Sons, NY).</ref><ref>{{cite journal
| |
| |first=S. |last=Kullback |authorlink=Solomon Kullback
| |
| |year=1987
| |
| |title=Letter to the Editor: The Kullback–Leibler distance
| |
| |journal=[[The American Statistician]]
| |
| |volume=41 |issue=4 |pages=340–341
| |
| |jstor=2684769
| |
| }}</ref> (also '''information divergence''', '''[[Information gain in decision trees|information gain]]''', '''relative entropy''', or '''KLIC''') is a non-symmetric measure of the difference between two probability distributions ''P'' and ''Q''. Specifically, the Kullback–Leibler divergence of ''Q'' from ''P'', denoted ''D''<sub>KL</sub>(''P''||''Q''), is a measure of the information lost when ''Q'' is used to approximate ''P'':<ref>Kenneth P. Burnham, David R. Anderson (2002), ''Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach''. Springer. (2nd ed), p.[http://books.google.co.uk/books?id=fT1Iu-h6E-oC&pg=PA51#v=onepage&q&f=false 51]</ref> KL measures the expected number of extra bits required to [[Huffman coding|code]] samples from ''P'' when using a code based on ''Q'', rather than using a code based on ''P''. Typically ''P'' represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution. The measure ''Q'' typically represents a theory, model, description, or approximation of ''P''.
| |
|
| |
|
| Although it is often intuited as a [[Metric (mathematics)|metric or distance]], the KL divergence is not a true [[Metric (mathematics)|metric]] — for example, it is not symmetric: the KL from ''P'' to ''Q'' is generally not the same as the KL from ''Q'' to ''P''. However, its infinitesimal form, specifically its [[Hessian matrix|Hessian]], is a [[metric tensor]]: it is the [[Fisher information metric]].
| | |
| KL divergence is a special case of a broader class of [[statistical distance|divergences]] called [[f-divergence|''f''-divergences]].
| |
| It was originally introduced by [[Solomon Kullback]] and [[Richard Leibler]] in 1951 as the '''directed divergence''' between two distributions.
| |
| It can be derived from a [[Bregman divergence]].
| |
| | |
| ==Definition==
| |
| For [[discrete probability distribution]]s ''P'' and ''Q'',
| |
| the K–L divergence of ''Q'' from ''P'' is defined to be
| | |
| :<math>D_{\mathrm{KL}}(P\|Q) = \sum_i \ln\left(\frac{P(i)}{Q(i)}\right) P(i).\!</math>
| |
| | |
| In words, it is the expectation of the logarithmic difference between the probabilities ''P'' and ''Q'', where the expectation is taken using the probabilities ''P''. The K–L divergence is only defined if ''P'' and ''Q'' both sum to 1 and if <math>Q(i)=0</math> implies <math>P(i)=0</math> for all i (absolute continuity). If the quantity <math>0 \ln 0</math> appears in the formula, it is interpreted as zero because <math>\lim_{x \to 0} x \ln(x) = 0</math>.
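The discrete definition and its conventions can be sketched in a few lines of Python. This is an illustrative sketch only: the function name <code>kl_divergence</code> and the example distributions are hypothetical, and returning infinity when absolute continuity fails is one common convention rather than part of the definition above (which leaves the divergence undefined in that case).

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions given as sequences."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue          # the term 0 * ln 0 is interpreted as zero
        if qi == 0.0:
            return math.inf   # assumption: report infinity when Q(i) = 0 but P(i) > 0
        total += pi * math.log(pi / qi)
    return total

# Hypothetical example distributions over three outcomes.
p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]

d_pq = kl_divergence(p, q)   # finite: Q is positive wherever P is
d_qp = kl_divergence(q, p)   # infinite under the convention above: P(3) = 0 but Q(3) > 0
print(d_pq, d_qp)
```

The two directions give different answers here, which illustrates the asymmetry of the divergence.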
| |
| | |
| For distributions ''P'' and ''Q'' of a [[continuous random variable]], KL-divergence is defined to be the integral:<ref>C. Bishop (2006). Pattern Recognition and Machine Learning. p. 55.</ref>
| |
| | |
| : <math>D_{\mathrm{KL}}(P\|Q) = \int_{-\infty}^\infty \ln\left(\frac{p(x)}{q(x)}\right) p(x) \, {\rm d}x, \!</math>
| |
| | |
| where ''p'' and ''q'' denote the densities of ''P'' and ''Q''.
| |
| | |
| More generally, if ''P'' and ''Q'' are probability
| |
| [[measure (mathematics)|measure]]s over a set ''X'', and ''P''
| |
| is [[Absolutely continuous measure|absolutely continuous]] with respect to ''Q'', then
| |
| the Kullback–Leibler
| |
| divergence from ''P'' to ''Q'' is defined as
| |
| | |
| :<math> D_{\mathrm{KL}}(P\|Q) = \int_X \ln\left(\frac{{\rm d}P}{{\rm d}Q}\right) \,{\rm d}P, \!</math>
| |
| | |
| where
| |
| <math>\frac{{\rm d}P}{{\rm d}Q} </math> is the [[Radon–Nikodym derivative]] of ''P''
| |
| with respect to ''Q,''
| |
| and provided the expression on the right-hand side exists. Equivalently, this
| |
| can be written as
| |
| | |
| :<math> D_{\mathrm{KL}}(P\|Q) = \int_X \ln\left(\frac{{\rm d}P}{{\rm d}Q}\right) \frac{{\rm d}P}{{\rm d}Q} \,{\rm d}Q,</math>
| |
| | |
| which we recognize as the entropy of ''P'' relative to ''Q''.
| |
| Continuing in this case, if <math>\mu</math> is any measure on ''X'' for which
| |
| <math>p = \frac{{\rm d}P}{{\rm d}\mu}</math> and <math>q = \frac{{\rm d}Q}{{\rm d}\mu}</math> exist, then the Kullback–Leibler divergence from ''P'' to ''Q'' is given as
| |
| | |
| :<math> D_{\mathrm{KL}}(P\|Q) = \int_X p \ln \frac{p}{q} \,{\rm d}\mu.
| |
| \!</math>
| |
| | |
| The logarithms in these formulae are taken to base 2 if information is measured in units of [[bit]]s, or to base ''e'' if information is measured in [[nat (information)|nat]]s. Most formulas involving the KL divergence hold irrespective of log base.
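As a small illustration of the choice of logarithm base, the following sketch computes the same divergence in nats and in bits. The helper name and the <code>base</code> parameter are hypothetical, and the code assumes ''Q''(''i'') > 0 wherever ''P''(''i'') > 0.

```python
import math

def kl_divergence(p, q, base=math.e):
    """D_KL(P || Q) with logarithms taken to the given base (assumes q > 0 where p > 0)."""
    nats = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return nats / math.log(base)   # changing base only rescales the result

p = [0.75, 0.25]
q = [0.5, 0.5]

nats = kl_divergence(p, q)           # base e
bits = kl_divergence(p, q, base=2)   # base 2
print(nats, bits)
```

The two values differ only by the constant factor ln 2, which is why most identities involving the divergence do not depend on the base.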
| |
| | |
| Various conventions exist for referring to ''D''<sub>KL</sub>(''P''||''Q'') in words. Often it is referred to as the divergence ''between'' ''P'' and ''Q''; however this fails to convey the fundamental asymmetry in the relation. Sometimes it may be found described as the divergence of ''P'' from, or with respect to ''Q'' (often in the context of relative entropy, or information gain). However, in the present article the divergence of ''Q'' from ''P'' will be the language used, as this best relates to the idea that it is ''P'' that is considered the underlying "true" or "best guess" distribution, that expectations will be calculated with reference to, while ''Q'' is some divergent, less good, approximate distribution.
| |
| | |
| ==Motivation==
| |
| | |
| [[File:KL-Gauss-Example.png|thumb|right|320px|Illustration of the Kullback–Leibler (KL) divergence for two [[Normal distribution|normal]] distributions. The asymmetry typical of the KL divergence is clearly visible.]]
| |
| | |
| In information theory, the [[Kraft's inequality|Kraft–McMillan theorem]] establishes that any directly decodable coding scheme for coding a message to identify one value <math>x_i</math> out of a set of possibilities <math>X</math> can be seen as representing an implicit probability distribution <math>q(x_i)=2^{-l_i}</math> over <math>X</math>, where <math>l_i</math> is the length of the code for <math>x_i</math> in bits. Therefore, KL divergence can be interpreted as the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution <math>Q</math> is used, compared to using a code based on the true distribution <math>P</math>.
| |
| | |
| :<math>
| |
| \begin{matrix}
| |
| D_{\mathrm{KL}}(P\|Q) & = & -\sum_x p(x) \log q(x)& + & \sum_x p(x) \log p(x) \\
| |
| & = & H(P,Q) & - & H(P)\, \!
| |
| \end{matrix}</math>
| |
| | |
| where ''H''(''P'',''Q'') is called the [[cross entropy]] of ''P'' and ''Q'', and ''H''(''P'') is the [[information entropy|entropy]] of ''P''.
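The coding interpretation can be checked on a small example. In the sketch below (function names and distributions are illustrative assumptions), all probabilities are powers of 1/2, so the implied code lengths <math>l_i = -\log_2 q(x_i)</math> are whole numbers of bits.

```python
import math

def entropy(p):
    """H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) in bits: expected code length when coding P with a code built for Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # true distribution; optimal code lengths 1, 2, 2 bits
q = [0.25, 0.25, 0.5]   # wrong model;       implied code lengths 2, 2, 1 bits

h_p = entropy(p)             # 1.5 bits per symbol with the right code
h_pq = cross_entropy(p, q)   # 1.75 bits per symbol with the wrong code
extra = kl_divergence(p, q)  # 0.25 extra bits per symbol
print(h_p, h_pq, extra)
```

Using the code designed for ''Q'' costs 1.75 bits per symbol on average instead of 1.5, and the difference of 0.25 bits is exactly the divergence.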
| |
| | |
| Note also that there is a relation between the Kullback–Leibler divergence and the "[[rate function]]" in the theory of [[large deviations]].<ref name="Sanov">Sanov I.N. (1957) "On the probability of large deviations of random magnitudes". ''Matem. Sbornik'', v. 42 (84), 11–44.</ref><ref name="Novak">Novak S.Y. (2011) ch. 14.5, "Extreme value methods with applications to finance". Chapman & Hall/CRC Press. ISBN 978-1-4398-3574-6.</ref>
| |
| | |
| Kullback brings together all notions of information in his historic text, ''Information Theory and Statistics''. For instance, he shows that the mean discriminating information between two hypotheses is the basis for all of the various measures of information, from Shannon to Fisher. Shannon's rate is the mean information between the hypotheses of dependence and independence of processes. Fisher's information is the second-order, dominant term in the Taylor approximation of the discriminating information between two models of the same parametric family.<ref name="Kulback">Kullback (1959), ''Information Theory and Statistics'', Dover Press. ISBN 0-486-69684-7.</ref>
| |
| | |
| ==Computing the closed form==
| |
| For many common families of distributions, the KL-divergence between two distributions in the family can be derived in closed form. This can often be done most easily using the form of the KL-divergence in terms of [[expected value]]s or in terms of [[information entropy]]:
| |
| | |
| :<math>
| |
| \begin{align}
| |
| D_{\mathrm{KL}}(P\|Q) & = - \operatorname{E}(\ln q(x)) + \operatorname{E}(\ln p(x)) \\
| |
| & = H(P,Q) - H(P)
| |
| \end{align}
| |
| </math>
| |
| | |
| where <math>H(P) = -\operatorname{E}(\ln p(x))</math> is the information entropy of <math>P ,</math> and <math>H(P,Q)</math> is the [[cross entropy]] of <math>P</math> and <math>Q .</math>
| |
| | |
| ==Properties==
| |
| | |
| The Kullback–Leibler divergence is always non-negative,
| |
| :<math>D_{\mathrm{KL}}(P\|Q) \geq 0, \,</math>
| |
| a result known as [[Gibbs' inequality]], with ''D''<sub>KL</sub>(''P''||''Q'') zero if and only if ''P'' = ''Q'' [[almost everywhere]]. The entropy ''H(P)'' thus sets a minimum value for the cross-entropy ''H''(''P'',''Q''), the expected number of bits required when using a code based on ''Q'' rather than ''P''; and the KL divergence therefore represents the expected number of extra bits that must be transmitted to identify a value ''x'' drawn from ''X'', if a code is used corresponding to the probability distribution ''Q'', rather than the "true" distribution ''P''.
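For the discrete case, the inequality can be sketched from the elementary bound <math>\ln x \leq x - 1</math>, which holds for all <math>x > 0</math> with equality only at <math>x = 1</math>. Summing over the indices with <math>P(i) > 0</math>:

:<math>-D_{\mathrm{KL}}(P\|Q) = \sum_{i:\,P(i)>0} P(i) \ln\frac{Q(i)}{P(i)} \leq \sum_{i:\,P(i)>0} P(i)\left(\frac{Q(i)}{P(i)} - 1\right) = \sum_{i:\,P(i)>0} Q(i) - 1 \leq 0.</math>

Equality throughout requires <math>Q(i) = P(i)</math> wherever <math>P(i) > 0</math>, which forces the two distributions to coincide.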
| |
| | |
| The Kullback–Leibler divergence remains well-defined for continuous distributions, and furthermore is invariant under parameter transformations. For example, if a transformation is made from variable ''x'' to variable ''y''(''x''), then, since ''P''(''x'') ''dx'' = ''P''(''y'') ''dy'' and ''Q''(''x'') ''dx'' = ''Q''(''y'') ''dy'' the Kullback–Leibler divergence may be rewritten:
| |
| | |
| :<math>
| |
| \begin{align}
| |
| D_{\mathrm{KL}}(P\|Q)
| |
| & = \int_{x_a}^{x_b}P(x)\log\left(\frac{P(x)}{Q(x)}\right)\,dx
| |
| = \int_{y_a}^{y_b}P(y)\log\left(\frac{P(y)dy/dx}{Q(y)dy/dx}\right)\,dy \\
| |
| & = \int_{y_a}^{y_b}P(y)\log\left(\frac{P(y)}{Q(y)}\right)\,dy
| |
| \end{align}
| |
| </math>
| |
| | |
| where <math>y_a=y(x_a)</math> and <math>y_b=y(x_b)</math>. Although it was assumed that the transformation was continuous, this need not be the case. This also shows that the Kullback–Leibler divergence produces a [[Dimensional analysis|dimensionally consistent]] quantity, since if ''x'' is a dimensioned variable, ''P''(''x'') and ''Q''(''x'') are also dimensioned, since e.g. ''P''(''x'') ''dx'' is dimensionless. The argument of the logarithmic term is and remains dimensionless, as it must. It can therefore be seen as in some ways a more fundamental quantity than some other quantities in information theory<ref name="VerduLecture">See the section "differential entropy - 4" in [http://videolectures.net/nips09_verdu_re/ Relative Entropy] video lecture by [[Sergio Verdú]] [[NIPS]] 2009</ref> (such as [[self-information]] or [[Shannon entropy]]), which can become undefined or negative for non-discrete probabilities.
| |
| | |
| The Kullback–Leibler divergence is additive for independent distributions in much the same way as Shannon entropy.
| |
| If <math>P_1, P_2</math> are independent distributions, with the joint distribution <math>P(x,y) = P_1(x)P_2(y)</math>, and <math>Q, Q_1, Q_2</math> likewise, then
| |
| | |
| :<math> D_{\mathrm{KL}}(P \| Q) = D_{\mathrm{KL}}(P_1 \| Q_1) + D_{\mathrm{KL}}(P_2 \| Q_2).</math>
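Additivity can be verified numerically on small product distributions. The following sketch uses made-up marginals; the helper <code>kl</code> is hypothetical and assumes strictly positive ''Q''.

```python
import math
from itertools import product

def kl(p, q):
    """D_KL(P || Q) in nats (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical marginals for two independent components.
p1, p2 = [0.2, 0.8], [0.5, 0.3, 0.2]
q1, q2 = [0.5, 0.5], [0.1, 0.6, 0.3]

# Joint distributions P(x, y) = P1(x) P2(y) and Q(x, y) = Q1(x) Q2(y), flattened.
joint_p = [a * b for a, b in product(p1, p2)]
joint_q = [a * b for a, b in product(q1, q2)]

lhs = kl(joint_p, joint_q)
rhs = kl(p1, q1) + kl(p2, q2)
print(lhs, rhs)   # the two values agree up to rounding
```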
| |
| | |
| ==KL divergence for the normal distributions==
| |
| The Kullback–Leibler divergence between two [[multivariate normal distribution]]s of the dimension <math>k</math> with the means <math>\mu_0, \mu_1</math> and their corresponding nonsingular [[covariance matrix|covariance matrices]] <math>\Sigma_0, \Sigma_1</math> is:
| |
| | |
| :<math>
| |
| D_\text{KL}(\mathcal{N}_0 \| \mathcal{N}_1) = { 1 \over 2 } \left( \mathrm{tr} \left( \Sigma_1^{-1} \Sigma_0 \right) + \left( \mu_1 - \mu_0\right)^\top \Sigma_1^{-1} ( \mu_1 - \mu_0 ) - k - \ln \left( { \det \Sigma_0 \over \det \Sigma_1 } \right) \right).
| |
| </math><ref>Penny & Roberts, PARG-00-12, (2000) [http://www.allisons.org/ll/MML/KL/Normal]. pp. 18</ref>
| |
| | |
| The [[logarithm]] in the last term must be taken to base ''[[e (mathematical constant)|e]]'' since all terms apart from the last are base-''e'' logarithms of expressions that are either factors of the density function or otherwise arise naturally. The equation therefore gives a result measured in [[nat (information)|nats]]. Dividing the entire expression above by log<sub>''e''</sub> 2 yields the divergence in [[bit]]s.
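For ''k'' = 1 the formula reduces to <math>\ln(\sigma_1/\sigma_0) + \frac{\sigma_0^2 + (\mu_0-\mu_1)^2}{2\sigma_1^2} - \tfrac{1}{2}</math>. The sketch below compares this univariate closed form with a direct numerical integration of the defining integral. The function names, the midpoint rule, and the integration range are illustrative choices, not part of the cited derivation.

```python
import math

def kl_normal(mu0, sigma0, mu1, sigma1):
    """Closed-form D_KL(N(mu0, sigma0^2) || N(mu1, sigma1^2)) in nats (univariate case)."""
    return (math.log(sigma1 / sigma0)
            + (sigma0 ** 2 + (mu0 - mu1) ** 2) / (2 * sigma1 ** 2)
            - 0.5)

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def kl_numeric(mu0, sigma0, mu1, sigma1, half_width=10.0, steps=20000):
    """Midpoint-rule approximation of the integral of p ln(p/q) over mu0 +/- half_width*sigma0."""
    a = mu0 - half_width * sigma0
    h = 2 * half_width * sigma0 / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * h
        p = normal_pdf(x, mu0, sigma0)
        q = normal_pdf(x, mu1, sigma1)
        total += p * math.log(p / q) * h
    return total

closed = kl_normal(0.0, 1.0, 1.0, 2.0)
numeric = kl_numeric(0.0, 1.0, 1.0, 2.0)
reverse = kl_normal(1.0, 2.0, 0.0, 1.0)   # swapping the arguments gives a different value
bits = closed / math.log(2)               # the same divergence expressed in bits
print(closed, numeric, reverse, bits)
```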
| |
| | |
| ==Relation to metrics==
| |
| | |
| One might be tempted to call it a "[[metric space|distance metric]]" on the space of probability distributions, but this would not be correct: the Kullback–Leibler divergence is not [[symmetric]] (in general, <math>D_{\mathrm{KL}}(P\|Q) \neq D_{\mathrm{KL}}(Q\|P)</math>), nor does it satisfy the [[triangle inequality]]. Still, being a [[premetric]], it generates a [[topology#Mathematical definition|topology]] on the space of [[generalized probability distributions]], of which [[probability distributions]] proper are a special case. More concretely, if <math>\{P_1,P_2,\cdots\}</math> is a sequence of distributions such that
| |
| | |
| :<math>\lim_{n \rightarrow \infty} D_{\mathrm{KL}}(P_n\|Q) = 0</math>
| |
| | |
| then it is said that <math>P_n \xrightarrow{\mathrm{D}} Q</math>. [[Pinsker's inequality]] entails that <math>P_n \xrightarrow{\mathrm{D}} Q \Rightarrow P_n \xrightarrow{\mathrm{TV}} Q</math>, where the latter stands for the usual convergence in [[total variation]].
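Pinsker's inequality bounds the total variation distance by <math>\sqrt{D_{\mathrm{KL}}/2}</math> when the divergence is measured in nats. The following sketch checks this on a hypothetical sequence of perturbed distributions converging to a fixed limit; all names and numbers are illustrative.

```python
import math

def kl(p, q):
    """D_KL(P || Q) in nats (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    """Total variation distance: half the L1 distance between the two distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

limit = [0.5, 0.3, 0.2]
rows = []
for n in (1, 10, 100, 1000):
    eps = 0.1 / n                                  # perturbation shrinking with n
    p_n = [0.5 + eps, 0.3 - eps, 0.2]
    d = kl(p_n, limit)
    tv = total_variation(p_n, limit)
    rows.append((n, d, tv, math.sqrt(d / 2)))      # last entry is the Pinsker bound

for n, d, tv, bound in rows:
    print(n, d, tv, bound)
```

As the divergence tends to zero, the bound forces the total variation distance to zero as well.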
| |
| | |
| Following [[Alfréd Rényi|Rényi]] (1970, 1961)<ref>{{cite book | author=A. Rényi |title =Probability Theory|year=1970|pages=Appendix, Sec.4|nopp = y|publisher=Elsevier|place=New York|isbn = 0-486-45867-9}}</ref><ref>{{cite conference | author=A. Rényi | title=On measures of information and entropy | booktitle=Proceedings of the 4th Berkeley Symposium on Mathematics, Statistics and Probability 1960 | year=1961 | pages=547–561 | url = http://digitalassets.lib.berkeley.edu/math/ucb/text/math_s4_v1_article-27.pdf }}
| |
| </ref> the term is sometimes also called the '''information gain''' about ''X'' achieved if ''P'' can be used instead of ''Q''. It is also called the '''relative entropy''', for using ''Q'' instead of ''P''.
| |
| | |
| ===Fisher information metric===
| |
| However, the Kullback–Leibler divergence is rather directly related to a metric, specifically, the [[Fisher information metric]]. This can be made explicit as follows. Assume that the probability distributions ''P'' and ''Q'' are both parameterized by some (possibly multi-dimensional) parameter <math>\theta</math>. Consider then two nearby values <math>P = P(\theta)</math> and <math>Q = P(\theta_0)</math>, so that the parameter <math>\theta</math> differs by only a small amount from the parameter value <math>\theta_0</math>. Specifically, up to first order one has (using the [[Einstein summation convention]])
| |
| | |
| :<math>P(\theta) = P(\theta_0) + \Delta\theta^jP_j(\theta_0) + \cdots</math>
| |
| | |
| with <math>\Delta\theta^j = (\theta - \theta_0)^j</math> a small change of <math>\theta</math> in the ''j'' direction, and <math>P_{j}(\theta_0) = \frac{\partial P}{\partial \theta^j}(\theta_0)</math> the corresponding rate of change in the probability distribution. Since the KL divergence has an absolute minimum 0 for ''P'' = ''Q'', i.e. <math> \theta = \theta_0 </math>, it changes only to ''second'' order in the small parameters <math>\Delta\theta^j</math>. More formally, as for any minimum, the first derivatives of the divergence vanish
| |
| | |
| :<math> \left.\frac{\partial}{\partial \theta^j}\right|_{\theta = \theta_0} D_{KL}(P(\theta) \| P(\theta_0)) = 0,</math>
| |
|
| |
| and by the [[Taylor series|Taylor expansion]] one has up to second order
| |
|
| |
| :<math>D_{\mathrm{KL}}(P(\theta)\|P(\theta_0)) = \frac{1}{2} \Delta\theta^j\Delta\theta^k g_{jk}(\theta_0) + \cdots</math>
| |
| | |
| where the [[Hessian matrix]] of the divergence
| |
| | |
| :<math>g_{jk}(\theta_0) = \left.\frac{\partial^2}{\partial \theta^j\partial \theta^k}\right|_{\theta = \theta_0} D_{KL}(P(\theta)\|P(\theta_0))</math>
| |
| | |
| must be [[Positive-definite matrix|positive semidefinite]]. Letting <math>\theta_0</math> vary (and dropping the subindex 0) the Hessian <math>g_{jk}(\theta)</math> defines a (possibly degenerate) [[Riemannian metric]] on the <math>\theta</math> parameter space, called the Fisher information metric.
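A one-parameter example makes the local behaviour concrete. For a Bernoulli distribution with success probability <math>\theta</math>, the second derivative of the divergence at <math>\theta = \theta_0</math> is <math>1/(\theta_0(1-\theta_0))</math>, which is the familiar Fisher information of that family. Since the second-order Taylor term of a function at a minimum is one half of the Hessian quadratic form, the divergence should approach <math>\tfrac{1}{2}\Delta\theta^2 g(\theta_0)</math> for small steps. The sketch below (function names are hypothetical) checks this numerically.

```python
import math

def kl_bernoulli(theta, theta0):
    """D_KL(Bernoulli(theta) || Bernoulli(theta0)) in nats, for 0 < theta, theta0 < 1."""
    return (theta * math.log(theta / theta0)
            + (1 - theta) * math.log((1 - theta) / (1 - theta0)))

def fisher_bernoulli(theta0):
    """Hessian of the divergence at theta = theta0 for the Bernoulli family."""
    return 1.0 / (theta0 * (1 - theta0))

theta0 = 0.3
ratios = []
for delta in (1e-1, 1e-2, 1e-3):
    exact = kl_bernoulli(theta0 + delta, theta0)
    quadratic = 0.5 * delta ** 2 * fisher_bernoulli(theta0)   # second-order Taylor term
    ratios.append(exact / quadratic)

print(ratios)   # the ratios approach 1 as the step shrinks
```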
| |
| | |
| ==Relation to other quantities of information theory==
| |
| Many of the other quantities of information theory can be interpreted as applications of the KL divergence to specific cases.
| |
| | |
| The [[self-information]],
| |
| | |
| :<math>I(m) = D_{\mathrm{KL}}(\delta_{im} \| \{ p_i \}), </math>
| |
| | |
| is the KL divergence of the probability distribution ''P''(''i'') from a [[Kronecker delta]] representing certainty that ''i'' = ''m'' — i.e. the number of extra bits that must be transmitted to identify ''i'' if only the probability distribution ''P''(''i'') is available to the receiver, not the fact that ''i'' = ''m''.
| |
| | |
| The [[mutual information]],
| |
| | |
| :<math>\begin{align}I(X;Y) & = D_{\mathrm{KL}}(P(X,Y) \| P(X)P(Y) ) \\
| |
| & = \mathbb{E}_X \{D_{\mathrm{KL}}(P(Y|X) \| P(Y) ) \} \\
| |
| & = \mathbb{E}_Y \{D_{\mathrm{KL}}(P(X|Y) \| P(X) ) \}\end{align} </math>
| |
| | |
| is the KL divergence of the product ''P''(''X'')''P''(''Y'') of the two [[marginal probability]] distributions from the [[joint probability distribution]] ''P''(''X'',''Y'') — i.e. the expected number of extra bits that must be transmitted to identify ''X'' and ''Y'' if they are coded using only their marginal distributions instead of the joint distribution. Equivalently, if the joint probability ''P''(''X'',''Y'') ''is'' known, it is the expected number of extra bits that must on average be sent to identify ''Y'' if the value of ''X'' is not already known to the receiver.
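The first two expressions for the mutual information can be compared on a small joint table. The sketch below uses a made-up 2×2 joint distribution; the helper name and data are assumptions for illustration.

```python
import math

def kl_bits(p, q):
    """D_KL(P || Q) in bits (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical joint distribution, indexed as joint[x][y].
joint = [[0.4, 0.1],
         [0.1, 0.4]]
p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

# Form 1: divergence of the product of marginals from the joint distribution.
flat_joint = [joint[x][y] for x in range(2) for y in range(2)]
flat_product = [p_x[x] * p_y[y] for x in range(2) for y in range(2)]
mi_joint = kl_bits(flat_joint, flat_product)

# Form 2: expectation over X of D_KL(P(Y | X = x) || P(Y)).
mi_expect = sum(
    p_x[x] * kl_bits([joint[x][y] / p_x[x] for y in range(2)], p_y)
    for x in range(2)
)

# For an independent pair the mutual information vanishes.
mi_independent = kl_bits([0.25] * 4, [0.5 * 0.5] * 4)
print(mi_joint, mi_expect, mi_independent)
```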
| |
| | |
| The [[Shannon entropy]],
| |
| | |
| :<math>\begin{align}H(X) & = \mathrm{(i)} \, \mathbb{E}_x \{I(x)\} \\
| |
| & = \mathrm{(ii)} \log N - D_{\mathrm{KL}}(P(X) \| P_U(X) )\end{align}</math>
| |
| | |
| is the number of bits which would have to be transmitted to identify ''X'' from ''N'' equally likely possibilities, ''less'' the KL divergence of the uniform distribution ''P''<sub>U</sub>''(X)'' from the true distribution ''P''(''X'') — i.e. ''less'' the expected number of bits saved, which would have had to be sent if the value of ''X'' were coded according to the uniform distribution ''P''<sub>U</sub>(''X'') rather than the true distribution ''P''(''X'').
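Identity (ii) can be checked directly. In the following sketch the distribution is a hypothetical example whose probabilities are powers of 1/2, so every quantity is an exact multiple of a quarter bit.

```python
import math

def entropy_bits(p):
    """Shannon entropy H(X) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_bits(p, q):
    """D_KL(P || Q) in bits (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]
n = len(p)
uniform = [1.0 / n] * n

h = entropy_bits(p)            # 1.75 bits
saved = kl_bits(p, uniform)    # 0.25 bits saved relative to the uniform code
print(h, math.log2(n) - saved)
```

Coding with the uniform distribution costs log<sub>2</sub> 4 = 2 bits per symbol; knowing the true distribution saves 0.25 bits, leaving the entropy of 1.75 bits.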
| |
| | |
| The [[conditional entropy]],
| |
| | |
| :<math>\begin{align}H(X\mid Y) & = \log N - D_{\mathrm{KL}}(P(X,Y) \| P_U(X) P(Y) ) \\
| |
| & = \mathrm{(i)} \,\, \log N - D_{\mathrm{KL}}(P(X,Y) \| P(X) P(Y) ) - D_{\mathrm{KL}}(P(X) \| P_U(X)) \\
| |
| & = H(X) - I(X;Y) \\
| |
| & = \mathrm{(ii)} \, \log N - \mathbb{E}_Y \{ D_{\mathrm{KL}}(P(X|Y) \| P_U(X)) \}\end{align}</math>
| |
| | |
| is the number of bits which would have to be transmitted to identify ''X'' from ''N'' equally likely possibilities, ''less'' the KL divergence of the product distribution ''P''<sub>U</sub>(''X'') ''P''(''Y'') from the true joint distribution ''P''(''X'',''Y''), i.e. ''less'' the expected number of bits saved which would have had to be sent if the value of ''X'' were coded according to the uniform distribution ''P''<sub>U</sub>(''X'') rather than the conditional distribution ''P''(''X'' | ''Y'') of ''X'' given ''Y''.
| |
| | |
| The [[cross entropy]] between two [[probability distribution]]s measures the average number of [[bit]]s needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution <math>q</math>, rather than the "true" distribution <math>p</math>.
| |
| The cross entropy for two distributions <math>p</math> and <math>q</math> over the same [[probability space]] is thus defined as follows:
| |
| | |
| :<math>\mathrm{H}(p, q) = \mathrm{E}_p[-\log q] = \mathrm{H}(p) + D_{\mathrm{KL}}(p \| q).\!</math>
| |
| | |
| ==KL divergence and Bayesian updating==
| |
| In [[Bayesian statistics]] the KL divergence can be used as a measure of the information gain in moving from a [[prior distribution]] to a [[posterior distribution]]. If some new fact ''Y'' = ''y'' is discovered, it can be used to update the probability distribution for ''X'' from ''p''(''x'' | I) to a new posterior probability distribution ''p''(''x'' | ''y'',I) using [[Bayes' theorem]]:
| |
| :<math>p(x\mid y,I) = \frac{p(y\mid x,I) p(x\mid I)}{p(y\mid I)}</math>
| |
| | |
| This distribution has a new entropy
| |
| | |
| :<math> H\big( p(\cdot\mid y,I) \big) = -\sum_x p(x\mid y,I) \log p(x\mid y,I),</math>
| |
| | |
| which may be less than or greater than the original entropy ''H''(''p''(· | ''I'')).
| |
| However, from the standpoint of the new probability distribution one can estimate that to have used the original code based on ''p''(''x'' | I) instead of a new code based on ''p''(''x'' | ''y'',I) would have added an expected number of bits
| |
| | |
| :<math> D_{\mathrm{KL}}\big(p(\cdot\mid y,I) \big\| p(\cdot\mid I) \big) = \sum_x p(x\mid y,I) \log \frac{p(x\mid y,I)}{p(x\mid I)}</math>
| |
| | |
| to the message length. This therefore represents the amount of useful information, or information gain, about ''X'', that we can estimate has been learned by discovering ''Y'' = ''y''.
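A toy update shows how the information gain is computed. In the sketch below, ''X'' is which of two coins is being tossed (fair, or biased with probability 0.9 of heads) and the observation ''y'' is a single head. The scenario, the numbers and the function names are all illustrative assumptions.

```python
import math

def kl_bits(p, q):
    """D_KL(P || Q) in bits (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def bayes_update(prior, likelihood):
    """Posterior p(x | y, I) from the prior p(x | I) and the likelihood p(y | x, I)."""
    unnormalised = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalised)            # p(y | I)
    return [u / evidence for u in unnormalised]

prior = [0.5, 0.5]               # p(x | I) for x in (fair coin, biased coin)
likelihood_heads = [0.5, 0.9]    # p(y = heads | x, I)

posterior = bayes_update(prior, likelihood_heads)
info_gain = kl_bits(posterior, prior)       # divergence of the prior from the posterior
print(posterior, info_gain)
```

The expectation is taken under the posterior, in line with the convention that the first argument is the distribution regarded as the current best guess.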
| |
| | |
| If a further piece of data, ''Y''<sub>2</sub> = ''y''<sub>2</sub>, subsequently comes in, the probability distribution for ''x'' can be updated further, to give a new best guess ''p''(''x''|''y''<sub>1</sub>,''y''<sub>2</sub>,I). If one reinvestigates the information gain for using ''p''(''x''|''y''<sub>1</sub>,I) rather than ''p''(''x''|''I''), it turns out that it may be either greater or less than previously estimated:
| |
| | |
| :<math>\sum_x p(x\mid y_1,y_2,I) \log \frac{p(x\mid y_1,y_2,I)}{p(x\mid I)}</math> may be ≤ or > <math>\displaystyle\sum_x p(x\mid y_1,I) \log \frac{p(x\mid y_1,I)}{p(x\mid I)}</math>
| |
| | |
| and so the combined information gain does ''not'' obey the triangle inequality:
| |
| | |
| :<math>D_{\mathrm{KL}} \big( p(\cdot\mid y_1,y_2,I) \big\| p(\cdot\mid I) \big)</math> may be <, = or > <math>D_{\mathrm{KL}} \big( p(\cdot\mid y_1,y_2,I)\big\| p(\cdot\mid y_1,I) \big) + D_{\mathrm{KL}} \big( p(\cdot \mid y_1,I) \big\| p(\cdot\mid I) \big)</math>
| |
| | |
| All one can say is that on ''average'', averaging using ''p''(''y''<sub>2</sub> | ''y''<sub>1</sub>,''x'',''I''), the two sides will average out.
| |
| | |
| ===Bayesian experimental design===
| |
| A common goal in [[Bayesian experimental design]] is to maximise the expected KL divergence between the prior and the posterior.<ref>Chaloner K. and Verdinelli I. (1995) Bayesian Experimental Design: A Review. ''[[Statistical Science]]'' '''10''' (3): 273–304. {{doi|10.1214/ss/1177009939}}</ref> When posteriors are approximated to be Gaussian distributions, a design maximising the expected KL divergence is called [[Optimal design#D-optimality|Bayes d-optimal]].
| |
| | |
| == Discrimination information ==
| |
| The Kullback–Leibler divergence ''D''<sub>KL</sub>( ''p''(''x''|''H''<sub>1</sub>) || ''p''(''x''|''H''<sub>0</sub>) ) can also be interpreted as the expected '''discrimination information''' for ''H''<sub>1</sub> over ''H''<sub>0</sub>: the mean information per sample for discriminating in favor of a hypothesis ''H''<sub>1</sub> against a hypothesis ''H''<sub>0</sub>, when hypothesis ''H''<sub>1</sub> is true.<ref>{{Cite book | last1=Press | first1=WH | last2=Teukolsky | first2=SA | last3=Vetterling | first3=WT | last4=Flannery | first4=BP | year=2007 | title=Numerical Recipes: The Art of Scientific Computing | edition=3rd | publisher=Cambridge University Press | publication-place=New York | isbn=978-0-521-88068-8 | chapter=Section 14.7.2. Kullback–Leibler Distance | chapter-url=http://apps.nrbook.com/empanel/index.html#pg=756 | postscript=<!-- Bot inserted parameter. Either remove it; or change its value to "." for the cite to end in a ".", as necessary. -->{{inconsistent citations}}}}</ref> Another name for this quantity, given to it by [[I.J. Good]], is the expected [[weight of evidence]] for ''H''<sub>1</sub> over ''H''<sub>0</sub> to be expected from each sample.
| |
| | |
| The expected weight of evidence for ''H''<sub>1</sub> over ''H''<sub>0</sub> is '''not''' the same as the information gain expected per sample about the probability distribution ''p''(''H'') of the hypotheses,
| |
| | |
| :''D''<sub>KL</sub>( ''p''(''x''|''H''<sub>1</sub>) || ''p''(''x''|''H''<sub>0</sub>) ) <math>\neq</math> ''IG'' = ''D''<sub>KL</sub>( ''p''(''H''|x) || ''p''(''H''|I) ).
| |
| | |
| Either of the two quantities can be used as a [[utility function]] in Bayesian experimental design, to choose an optimal next question to investigate: but they will in general lead to rather different experimental strategies.
| |
| | |
| On the entropy scale of ''information gain'' there is very little difference between near certainty and absolute certainty—coding according to a near certainty requires hardly any more bits than coding according to an absolute certainty. On the other hand, on the [[logit]] scale implied by weight of evidence, the difference between the two is enormous – infinite perhaps; this might reflect the difference between being almost sure (on a probabilistic level) that, say, the [[Riemann hypothesis]] is correct, compared to being certain that it is correct because one has a mathematical proof. These two different scales of [[loss function]] for uncertainty are ''both'' useful, according to how well each reflects the particular circumstances of the problem in question.
| |
| | |
| ===Principle of minimum discrimination information===
| |
| The idea of Kullback–Leibler divergence as discrimination information led Kullback to propose the Principle of '''Minimum Discrimination Information''' (MDI): given new facts, a new distribution ''f'' should be chosen which is as hard to discriminate from the original distribution ''f''<sub>0</sub> as possible; so that the new data produces as small an information gain ''D''<sub>KL</sub>( ''f'' || ''f''<sub>0</sub> ) as possible.
| |
| | |
| For example, if one had a prior distribution ''p''(''x'',''a'') over ''x'' and ''a'', and subsequently learnt the true distribution of ''a'' was ''u''(''a''), the Kullback–Leibler divergence between the new joint distribution for ''x'' and ''a'', ''q''(''x''|''a'') ''u''(''a''), and the earlier prior distribution would be:
| |
| | |
| :<math>D_\mathrm{KL}(q(x|a)u(a)||p(x,a)) = \mathbb{E}_{u(a)}\{D_\mathrm{KL}(q(x|a)||p(x|a))\} + D_\mathrm{KL}(u(a)||p(a)),</math>
| |
| | |
| i.e. the sum of the KL divergence of ''p''(''a''), the prior distribution for ''a'', from the updated distribution ''u''(''a''), plus the expected value (using the probability distribution ''u''(''a'')) of the KL divergence of the prior conditional distribution ''p''(''x''|''a'') from the new conditional distribution ''q''(''x''|''a''). (Note that the latter expected value is often called the ''conditional KL divergence'' (or ''conditional relative entropy'') and denoted by ''D''<sub>KL</sub>(''q''(''x''|''a'')||''p''(''x''|''a'')).<ref>Thomas M. Cover, Joy A. Thomas (1991) ''Elements of Information Theory'' (John Wiley and Sons, New York, NY), p.22</ref>) This is minimised if ''q''(''x''|''a'') = ''p''(''x''|''a'') over the whole support of ''u''(''a''); and we note that this result incorporates Bayes' theorem, if the new distribution ''u''(''a'') is in fact a δ function representing certainty that ''a'' has one particular value.
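The decomposition, and the fact that keeping the prior conditional minimises the divergence, can be checked on a small table. Everything in the sketch below (the prior joint, the new marginal ''u'', the trial conditional ''q'') is made-up illustrative data.

```python
import math

def kl(p, q):
    """D_KL(P || Q) in nats (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical prior joint p(x, a), indexed as p_joint[a][x].
p_joint = [[0.30, 0.20],
           [0.10, 0.40]]
p_a = [sum(row) for row in p_joint]                                   # prior marginal p(a)
p_x_given_a = [[v / pa for v in row] for row, pa in zip(p_joint, p_a)]

u = [0.7, 0.3]                          # newly learnt distribution of a
q_x_given_a = [[0.5, 0.5],              # an arbitrary trial conditional q(x | a)
               [0.3, 0.7]]

def decomposition(q_cond):
    """Return (left-hand side, right-hand side) of the chain-rule identity."""
    new_joint = [q_cond[a][x] * u[a] for a in range(2) for x in range(2)]
    old_joint = [p_joint[a][x] for a in range(2) for x in range(2)]
    lhs = kl(new_joint, old_joint)
    rhs = sum(u[a] * kl(q_cond[a], p_x_given_a[a]) for a in range(2)) + kl(u, p_a)
    return lhs, rhs

lhs, rhs = decomposition(q_x_given_a)
lhs_min, rhs_min = decomposition(p_x_given_a)   # keep the prior conditional unchanged
print(lhs, rhs, lhs_min)
```

With the prior conditional retained, the first term vanishes and only the divergence between ''u''(''a'') and ''p''(''a'') remains.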
| |
| | |
| MDI can be seen as an extension of [[Laplace]]'s [[Principle of Insufficient Reason]], and the [[Principle of Maximum Entropy]] of [[E.T. Jaynes]]. In particular, it is the natural extension of the principle of maximum entropy from discrete to continuous distributions, for which Shannon entropy ceases to be so useful (see ''[[differential entropy]]''), but the KL divergence continues to be just as relevant.
| |
| | |
| In the engineering literature, MDI is sometimes called the '''Principle of Minimum Cross-Entropy''' (MCE) or '''Minxent''' for short. Minimising the KL divergence of ''m'' from ''p'' with respect to ''m'' is equivalent to minimizing the cross-entropy of ''p'' and ''m'', since
| |
| | |
| :<math>H(p,m) = H(p) + D_{\mathrm{KL}}(p\|m),</math>
| |
| | |
| which is appropriate if one is trying to choose an adequate approximation to ''p''. However, this is just as often ''not'' the task at hand. Frequently it is instead ''m'' that is a fixed prior reference measure, and ''p'' that one is attempting to optimise by minimising ''D''<sub>KL</sub>(''p''||''m'') subject to some constraint. This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be ''D''<sub>KL</sub>(''p''||''m''), rather than ''H''(''p'',''m'').
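The identity ''H''(''p'',''m'') = ''H''(''p'') + ''D''<sub>KL</sub>(''p''||''m'') can be illustrated with a short numerical sketch. The function names and the example distributions below are assumptions made for illustration; logarithms are natural, so all quantities are in nats.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, m):
    """Cross-entropy H(p, m) in nats; assumes m > 0 wherever p > 0."""
    return -sum(pi * math.log(mi) for pi, mi in zip(p, m) if pi > 0)

def kl(p, m):
    """D_KL(p || m) in nats; assumes m > 0 wherever p > 0."""
    return sum(pi * math.log(pi / mi) for pi, mi in zip(p, m) if pi > 0)

# Illustrative (made-up) distributions.
p = [0.5, 0.3, 0.2]
m = [0.4, 0.4, 0.2]

# Since H(p) does not depend on m, minimising H(p, m) over m
# is the same as minimising D_KL(p || m) over m.
print(cross_entropy(p, m), entropy(p) + kl(p, m))
```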
| |
| | |
| ==Relationship to available work==
| |
| [[File:ArgonKLdivergence.png|thumb|220px|right|Pressure versus volume plot of available work from a mole of Argon gas relative to ambient, calculated as <math>T_o</math> times KL divergence.]]
| |
| [[Surprisal]]s<ref>Myron Tribus (1961) ''Thermodynamics and thermostatics'' (D. Van Nostrand, New York)</ref> add where probabilities multiply. The surprisal for an event of probability <math>p</math> is defined as <math>s=k \ln(1 / p)</math>. If <math>k</math> is <math>\{ 1, 1/\ln 2, 1.38\times 10^{-23}\}</math> then surprisal is in <math>\{</math>nats, bits, or <math>J/K\}</math> so that, for instance, there are <math>N</math> bits of surprisal for landing all "heads" on a toss of <math>N</math> coins.
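As a small sketch of the definition above, the following computes surprisal for the three choices of <math>k</math> given in the text and illustrates that surprisals add where probabilities multiply. The function and constant names are hypothetical.

```python
import math

K_NATS = 1.0
K_BITS = 1.0 / math.log(2)
K_JOULES_PER_KELVIN = 1.38e-23  # value quoted in the text

def surprisal(p, k=K_BITS):
    """Surprisal s = k ln(1/p) of an event with probability p."""
    return k * math.log(1.0 / p)

# All heads on N fair coins has probability (1/2)**N.
n_coins = 8
print(surprisal(0.5 ** n_coins))  # approximately n_coins bits

# Independent events: probabilities multiply, surprisals add.
p1, p2 = 0.25, 0.1
print(surprisal(p1 * p2), surprisal(p1) + surprisal(p2))
```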
| |
| | |
| Best-guess states (e.g. for atoms in a gas) are inferred by maximizing the ''average surprisal'' <math>S</math> ([[entropy]]) for a given set of control parameters (like pressure <math>P</math> or volume <math>V</math>). This constrained [[entropy maximization]], both classically<ref>E. T. Jaynes (1957) [http://bayes.wustl.edu/etj/articles/theory.1.pdf Information theory and statistical mechanics], ''Physical Review'' '''106''':620</ref> and quantum mechanically,<ref>E. T. Jaynes (1957) [http://bayes.wustl.edu/etj/articles/theory.2.pdf Information theory and statistical mechanics II], ''Physical Review'' '''108''':171</ref> minimizes [[Josiah Willard Gibbs|Gibbs]] availability in entropy units<ref>J.W. Gibbs (1873) A method of geometrical representation of thermodynamic properties of substances by means of surfaces, reprinted in ''The Collected Works of J. W. Gibbs, Volume I Thermodynamics'', ed. W. R. Longley and R. G. Van Name (New York: Longmans, Green, 1931) footnote page 52.</ref> <math>A\equiv -k \ln Z</math> where <math>Z</math> is a constrained multiplicity or [[partition function (mathematics)|partition function]].
| |
| | |
| When temperature <math>T</math> is fixed, free energy (<math>T \times A</math>) is also minimized. Thus if <math>T, V</math> and number of molecules <math>N</math> are constant, the [[Helmholtz free energy]] <math>F\equiv U-TS</math> (where <math>U</math> is energy) is minimized as a system "equilibrates." If <math>T</math> and <math>P</math> are held constant (say during processes in your body), the [[Gibbs free energy]] <math>G=U+PV-TS</math> is minimized instead. The change in free energy under these conditions is a measure of available [[Work (thermodynamics)|work]] that might be done in the process. Thus available work for an ideal gas at constant temperature <math>T_o</math> and pressure <math>P_o</math> is <math>W = \Delta G =NkT_o \Theta(V/V_o)</math> where <math>V_o = NkT_o/P_o</math> and <math>\Theta(x)=x-1-\ln x\ge 0</math> (see also [[Gibbs inequality]]).
| |
| | |
| More generally<ref>M. Tribus and E. C. McIrvine (1971) Energy and information, ''Scientific American'' '''224''':179–186.</ref> the [[Exergy|work available]] relative to some ambient is obtained by multiplying ambient temperature <math>T_o</math> by KL-divergence or ''net surprisal'' <math>\Delta I\ge 0</math>, defined as the average value of <math>k\ln(p/p_o)</math> where <math>p_o</math> is the probability of a given state under ambient conditions. For instance, the work available in equilibrating a monatomic ideal gas to ambient values of <math>V_o</math> and <math>T_o</math> is thus <math>W=T_o \Delta I</math>, where KL-divergence <math>\Delta I = Nk[\Theta(V/V_o)+\frac{3}{2}\Theta(T/T_o)]</math>. The resulting contours of constant KL-divergence, shown at right for a mole of Argon at standard temperature and pressure, put limits, for example, on the conversion of hot to cold, as in flame-powered air-conditioning or in the unpowered device to convert boiling water to ice water discussed here.<ref>P. Fraundorf (2007) [http://www3.interscience.wiley.com/cgi-bin/abstract/117861985/ABSTRACT Thermal roots of correlation-based complexity], ''Complexity'' '''13''':3, 18–26</ref> Thus KL-divergence measures thermodynamic availability in bits.
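The monatomic ideal gas expressions above can be evaluated directly. This is only a sketch under stated assumptions: the function names are hypothetical, Boltzmann's constant is the rounded value quoted earlier in this section, Avogadro's number is an assumed rounded value, and the example state (twice the ambient volume, 1.2 times the ambient temperature, ambient at 298 K) is made up.

```python
import math

K_BOLTZMANN = 1.38e-23  # J/K, rounded value quoted in the text
N_AVOGADRO = 6.022e23   # molecules per mole (assumed rounded value)

def theta(x):
    """Theta(x) = x - 1 - ln x, which is >= 0 with equality at x = 1."""
    return x - 1.0 - math.log(x)

def net_surprisal(n, v_ratio, t_ratio, k=K_BOLTZMANN):
    """Delta I = N k [Theta(V/V_o) + (3/2) Theta(T/T_o)], in units of k."""
    return n * k * (theta(v_ratio) + 1.5 * theta(t_ratio))

def available_work(t_ambient, n, v_ratio, t_ratio):
    """W = T_o * Delta I, in joules when k is in J/K."""
    return t_ambient * net_surprisal(n, v_ratio, t_ratio)

# One mole at twice ambient volume and 1.2 times ambient temperature.
print(available_work(298.0, N_AVOGADRO, 2.0, 1.2))
```

At ambient conditions both ratios equal 1, <code>theta</code> returns zero, and no work is available.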
| |
| | |
| ==Quantum information theory==
| |
| For [[density matrix|density matrices]] ''P'' and ''Q'' on a Hilbert space, the K–L divergence (or [[quantum relative entropy]] as it is often called in this case) from ''P'' to ''Q'' is defined to be
| |
| | |
| :<math> D_{\mathrm{KL}}(P\|Q) = \operatorname{Tr}(P( \log(P) - \log(Q))). \!</math>
| |
| | |
| In [[quantum information science]] the minimum of <math> D_{\mathrm{KL}}(P\|Q)</math> over all separable states Q can also be used as a measure of [[quantum entanglement|entanglement]] in the state P.
| |
| | |
| ==Relationship between models and reality==
| |
| Just as KL-divergence of "ambient from actual" measures thermodynamic availability, KL-divergence of "model from reality" is also useful even if the only clues we have about reality are some experimental measurements. In the former case KL-divergence describes ''distance to equilibrium'' or (when multiplied by ambient temperature) the amount of ''available work'', while in the latter case it tells you about surprises that reality has up its sleeve or, in other words, ''how much the model has yet to learn''.
| |
| | |
| Although this tool for evaluating models against systems that are accessible experimentally may be applied in any field, its application to models in ecology via the [[Akaike information criterion]] is particularly well described in papers<ref>Kenneth P. Burnham and David R. Anderson (2001) [http://www.publish.csiro.au/paper/WR99107.htm Kullback–Leibler information as a basis for strong inference in ecological studies], ''Wildlife Research'' '''28''':111–119.</ref> and a book<ref>Burnham, K. P. and Anderson D. R. (2002) ''Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, Second Edition'' (Springer Science, New York) ISBN 978-0-387-95364-9.</ref> by Burnham and Anderson. In a nutshell, the KL-divergence of a model from reality may be estimated, to within a constant additive term, by a function (such as the sum of squares) of the deviations observed between data and the model's predictions. Estimates of such divergence for models that share the same additive term can in turn be used to choose between models.
| |
| | |
| When trying to fit parametrized models to data there are various estimators which attempt to minimize Kullback–Leibler divergence, such as [[maximum likelihood]] and [[maximum spacing estimation|maximum spacing]] estimators.
| |
| | |
| == Symmetrised divergence ==
| |
| Kullback and Leibler themselves actually defined the divergence as:
| |
| | |
| :<math> D_{\mathrm{KL}}(P\|Q) + D_{\mathrm{KL}}(Q\|P)\, \!</math>
| |
| | |
| which is symmetric and nonnegative. This quantity has sometimes been used for feature selection in [[statistical classification|classification]] problems, where ''P'' and ''Q'' are the conditional [[Probability density function|pdfs]] of a feature under two different classes.
| |
| | |
| An alternative is given via the λ divergence,
| |
| | |
| :<math> D_{\lambda}(P\|Q) = \lambda D_{\mathrm{KL}}(P\|\lambda P + (1-\lambda)Q) + (1-\lambda) D_{\mathrm{KL}}(Q\|\lambda P + (1-\lambda)Q),\, \!</math>
| |
| | |
| which can be interpreted as the expected information gain about ''X'' from discovering which probability distribution ''X'' is drawn from, ''P'' or ''Q'', if they currently have probabilities λ and (1 − λ) respectively.
| |
| | |
| The value λ = 0.5 gives the [[Jensen–Shannon divergence]], defined by
| |
| | |
| :<math> D_{\mathrm{JS}} = \tfrac{1}{2} D_{\mathrm{KL}} \left (P \| M \right ) + \tfrac{1}{2} D_{\mathrm{KL}}\left (Q \| M \right )\, \!</math>
| |
| | |
| where ''M'' is the average of the two distributions,
| |
| :<math> M = \tfrac{1}{2}(P+Q). \, </math>
| |
| | |
| ''D''<sub>JS</sub> can also be interpreted as the capacity of a noisy information channel with two inputs giving the output distributions ''P'' and ''Q''. The Jensen–Shannon divergence, like all f-divergences, is ''locally'' proportional to the [[Fisher information metric]]. It is similar to the Hellinger metric (in the sense that it induces the same affine connection on a statistical manifold), and equal to one-half the so-called ''Jeffreys divergence'' (Rubner et al., 2000; Jeffreys 1946).
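The symmetrised divergence, the λ divergence and the Jensen–Shannon divergence defined in this section can be compared in a short sketch for discrete distributions. The function names and example distributions are hypothetical; natural logarithms are used, and the symmetrised form assumes both distributions are strictly positive.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetrised(p, q):
    """D_KL(P || Q) + D_KL(Q || P)."""
    return kl(p, q) + kl(q, p)

def lambda_divergence(p, q, lam):
    """lam * D_KL(P || M) + (1 - lam) * D_KL(Q || M), M = lam P + (1 - lam) Q."""
    mix = [lam * pi + (1.0 - lam) * qi for pi, qi in zip(p, q)]
    return lam * kl(p, mix) + (1.0 - lam) * kl(q, mix)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence with M = (P + Q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Illustrative (made-up) distributions.
p = [0.5, 0.3, 0.2]
q = [0.1, 0.6, 0.3]

print(symmetrised(p, q))
print(lambda_divergence(p, q, 0.5), jensen_shannon(p, q))  # should agree
```

Unlike the symmetrised sum, the λ divergence stays finite when one distribution has zeros where the other does not, because each distribution is compared with a mixture that contains it.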
| |
| | |
| ==Relationship to Hellinger distance==
| |
| If ''P'' and ''Q'' are two probability measures, both absolutely continuous with respect to a measure ''λ'', then the squared [[Hellinger distance]] is the quantity given by
| |
| | |
| :<math>H^2(P,Q) = \frac{1}{2}\displaystyle \int \left(\sqrt{\frac{{\rm d}P}{{\rm d}\lambda}} - \sqrt{\frac{{\rm d}Q}{{\rm d}\lambda}}\right)^2 {\rm d}\lambda. </math>
| |
| | |
| The Kullback–Leibler divergence can be lower bounded in terms of the [[Hellinger distance]]<ref>[http://www.stat.yale.edu/~pollard/Books/Asymptopia/Metrics.pdf]</ref>
| |
| | |
| :<math> D_{\mathrm{KL}}(Q\|P) \geq 2 H^2(P,Q). </math>
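For discrete distributions the bound can be spot-checked numerically. The sketch below uses hypothetical function names and made-up distributions, takes the counting measure as the dominating measure, and assumes the KL divergence is measured in nats (natural logarithm).

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hellinger_squared(p, q):
    """H^2(P, Q) = (1/2) * sum (sqrt(p_i) - sqrt(q_i))^2."""
    return 0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                     for pi, qi in zip(p, q))

# Illustrative (made-up) pairs of distributions.
pairs = [
    ([0.5, 0.5], [0.9, 0.1]),
    ([0.2, 0.3, 0.5], [0.4, 0.4, 0.2]),
    ([0.25, 0.25, 0.25, 0.25], [0.7, 0.1, 0.1, 0.1]),
]

for p, q in pairs:
    # Compare D_KL(Q || P) with the lower bound 2 H^2(P, Q).
    print(kl(q, p), 2.0 * hellinger_squared(p, q))
```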
| |
| | |
| ==Other probability-distance measures==
| |
| Other measures of [[Statistical distance|probability distance]] are the ''histogram intersection'', ''[[chi-squared test|Chi-squared statistic]]'', ''quadratic form distance'', ''[[matching distance|match distance]]'', ''[[Kolmogorov–Smirnov test|Kolmogorov–Smirnov distance]]'', and ''[[Earth Mover's Distance|earth mover's distance]]''.<ref name="earth">Rubner, Y., Tomasi, C., and [[Leonidas J. Guibas|Guibas, L. J.]], 2000. The Earth Mover's distance as a metric for image retrieval. ''International Journal of Computer Vision'', '''40'''(2): 99–121.</ref>
| |
| | |
| ==Data differencing==
| |
| {{Main|Data differencing}}
| |
| Just as ''absolute'' entropy serves as theoretical background for [[data compression|data ''compression'']], ''relative'' entropy serves as theoretical background for [[data differencing|data ''differencing'']] – the absolute entropy of a set of data in this sense being the data required to reconstruct it (minimum compressed size), while the relative entropy of a target set of data, given a source set of data, is the data required to reconstruct the target ''given'' the source (minimum size of a [[patch (computing)|patch]]).
| |
| | |
| ==See also==
| |
| *[[Bregman divergence]]
| |
| *[[Jensen–Shannon divergence]]
| |
| *[[Deviance information criterion]]
| |
| *[[Bayesian information criterion]]
| |
| *[[Quantum relative entropy]]
| |
| *[[Information gain in decision trees]]
| |
| *[[Solomon Kullback]] and [[Richard Leibler]]
| |
| *[[Information theory and measure theory]]
| |
| *[[Entropy power inequality]]
| |
| *[[Information gain ratio]]
| |
| *[[Entropic value at risk]]
| |
| | |
| ==References==
| |
| {{Reflist}}
| |
| | |
| ==External links==
| |
| * [https://bitbucket.org/szzoli/ite/ Information Theoretical Estimators Toolbox]
| |
| * [https://github.com/evansenter/diverge Ruby gem for calculating KL divergence]
| |
| * [http://www.snl.salk.edu/~shlens/kl.pdf Jon Shlens' tutorial on Kullback–Leibler divergence and likelihood theory]
| |
| * [http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=13089&objectType=file Matlab code for calculating KL divergence for discrete distributions]
| |
| * [[Sergio Verdú]], [http://videolectures.net/nips09_verdu_re/ Relative Entropy], [[NIPS]] 2009. One-hour video lecture.
| |
| * [http://arxiv.org/abs/math/0604246 A modern summary of info-theoretic divergence measures]
| |
| | |
| {{DEFAULTSORT:Kullback-Leibler Divergence}}
| |
| [[Category:Statistical theory]]
| |
| [[Category:Entropy and information]]
| |
| [[Category:F-divergences]]
| |
| [[Category:Thermodynamics]]
| |