| {{Distinguish2|the [[exponential distribution]]}}
| :''"Natural parameter" links here. For the usage of this term in differential geometry, see [[differential geometry of curves]].''
| |
| In [[theory of probability|probability]] and [[statistics]], the '''exponential family''' is an important class of [[probability distribution]]s sharing a certain form, specified below. This special form is chosen for mathematical convenience, on account of some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural distributions to consider. The concept of exponential families is credited to<ref>{{cite journal
| |
| | last = Andersen
| |
| | first = Erling
| |
| |date=September 1970
| |
| | title = Sufficiency and Exponential Families for Discrete Sample Spaces
| |
| | journal = [[Journal of the American Statistical Association]]
| |
| | volume = 65
| |
| | issue = 331
| |
| | pages = 1248–1255
| |
| |mr=268992
| |
| | doi = 10.2307/2284291
| |
| | jstor = 2284291
| |
| | publisher = Journal of the American Statistical Association, Vol. 65, No. 331
| |
| }}</ref> [[E. J. G. Pitman]],<ref>{{cite journal
| |
| | last = Pitman | first = E. |authorlink=E. J. G. Pitman
| |
| | year = 1936
| |
| | title = Sufficient statistics and intrinsic accuracy
| |
| | journal = [[Mathematical Proceedings of the Cambridge Philosophical Society]]
| |
| | volume = 32
| |
| | pages = 567–579
| |
| | doi = 10.1017/S0305004100019307
| |
| | last2 = Wishart
| |
| | first2 = J.
| |
| | issue = 4
| |
| }}</ref> [[Georges Darmois|G. Darmois]],<ref>{{cite journal
| |
| | last = Darmois
| |
| | first = G.
| |
| | year = 1935
| |
| | title = Sur les lois de probabilites a estimation exhaustive
| |
| | journal = C.R. Acad. Sci. Paris
| |
| | volume = 200
| |
| | pages = 1265–1266
| |
| | language = French
| |
| }}</ref> and [[Bernard Koopman|B. O. Koopman]]<ref>{{cite journal
| |
| | last = Koopman | first = B |authorlink=Bernard Koopman
| |
| | year = 1936
| |
| | title = On distribution admitting a sufficient statistic
| |
| | journal = [[Transactions of the American Mathematical Society]]
| |
| | volume = 39 |issue=3
| |
| | pages = 399–409
| |
| |mr=1501854
| |
| | doi = 10.2307/1989758
| |
| | jstor = 1989758
| |
| | publisher = Transactions of the American Mathematical Society, Vol. 39, No. 3
| |
| }}</ref> in 1935–36. The term '''exponential class''' is sometimes used in place of "exponential family".<ref>Kupperman, M. (1958) "Probabilities of Hypotheses and Information-Statistics in Sampling from Exponential-Class Populations", [[Annals of Mathematical Statistics]], 9 (2), 571–575 {{JSTOR|2237349}}</ref>
| |
| | |
| The exponential families include many of the most common distributions, including the [[normal distribution|normal]], [[exponential distribution|exponential]], [[gamma distribution|gamma]], [[chi-squared distribution|chi-squared]], [[beta distribution|beta]], [[Dirichlet distribution|Dirichlet]], [[Bernoulli distribution|Bernoulli]], [[categorical distribution|categorical]], [[Poisson distribution|Poisson]], [[Wishart distribution|Wishart]], [[Inverse Wishart distribution|Inverse Wishart]] and many others. A number of common distributions are exponential families only when certain parameters are considered fixed and known, e.g. [[binomial distribution|binomial]] (with fixed number of trials), [[multinomial distribution|multinomial]] (with fixed number of trials), and [[negative binomial distribution|negative binomial]] (with fixed number of failures). Examples of common distributions that are not exponential families are [[Student's t distribution|Student's t]], most [[mixture distribution]]s, and even the family of [[uniform distribution (continuous)|uniform distribution]]s with unknown bounds. See the section below on [[#Examples|examples]] for more discussion.
| |
| | |
| Consideration of exponential-family distributions provides a general framework for selecting a possible alternative parameterisation of the distribution, in terms of '''natural parameters''', and for defining useful [[sample statistic]]s, called the '''natural sufficient statistics''' of the family. See below for more information.
| |
| | |
| == Definition ==
| |
| The following is a sequence of increasingly general definitions of an exponential family. A casual reader may wish to restrict attention to the first and simplest definition, which corresponds to a single-parameter family of [[discrete probability distribution|discrete]] or [[continuous probability distribution|continuous]] probability distributions.
| |
| | |
| === Scalar parameter ===
| |
| A single-parameter exponential family is a set of probability distributions whose [[probability density function]] (or [[probability mass function]], for the case of a [[discrete distribution]]) can be expressed in the form
| |
| :<math> f_X(x|\theta) = h(x) \exp \left (\eta(\theta) \cdot T(x) -A(\theta)\right )</math>
| |
| where ''T''(''x''), ''h''(''x''), η(θ), and ''A''(θ) are known functions.
| |
| | |
| An alternative, equivalent form often given is
| |
| :<math> f_X(x|\theta) = h(x) g(\theta) \exp \left ( \eta(\theta) \cdot T(x) \right )</math>
| |
| or equivalently
| |
| :<math> f_X(x|\theta) = \exp \left (\eta(\theta) \cdot T(x) - A(\theta) + B(x) \right )</math>
| |
| | |
| The value θ is called the parameter of the family.
| |
| | |
| Note that ''x'' is often a vector of measurements, in which case ''T''(''x'') is a function from the space of possible values of ''x'' to the real numbers.
| |
| | |
| If η(θ) = θ, then the exponential family is said to be in ''[[canonical form]]''. By defining a transformed parameter η = η(θ), it is always possible to convert an exponential family to canonical form. The canonical form is non-unique, since η(θ) can be multiplied by any nonzero constant, provided that ''T''(''x'') is multiplied by that constant's reciprocal.
| |
| | |
| Even when ''x'' is a scalar, and there is only a single parameter, the functions η(θ) and ''T''(''x'') can still be vectors, as described below.
| |
| | |
| Note also that the function ''A''(θ) or equivalently ''g''(θ) is automatically determined once the other functions have been chosen, and assumes a form that causes the distribution to be [[normalizing constant|normalized]] (sum or integrate to one over the entire domain). Furthermore, both of these functions can always be written as functions of η, even when η(θ) is not a [[bijection|one-to-one]] function, i.e. two or more different values of θ map to the same value of η(θ), and hence η(θ) cannot be inverted. In such a case, all values of θ mapping to the same η(θ) will also have the same value for ''A''(θ) and ''g''(θ).
| |
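As a concrete illustration of how ''A'' is forced by normalization, the following numerical sketch (in [[Python (programming language)|Python]], assuming [[SciPy]] is available; the Poisson parameterization used here is the one listed in the table further down the page) recovers the log-partition function of the [[Poisson distribution]] by summing the kernel, and checks that the exponential-family form reproduces the usual probability mass function:

<syntaxhighlight lang="python">
from math import exp, factorial, log
from scipy.stats import poisson

lam = 3.7          # the usual parameter: theta = lambda
eta = log(lam)     # natural parameter eta(theta) = ln(lambda)

h = lambda x: 1.0 / factorial(x)   # base measure h(x) = 1/x!
T = lambda x: x                    # sufficient statistic T(x) = x

# A(eta) is forced by normalization: exp(A) = sum_x h(x) exp(eta*T(x))
Z = sum(h(x) * exp(eta * T(x)) for x in range(100))  # truncated series
A = log(Z)
print(A, exp(eta))   # both are (approximately) lambda = 3.7

# the exponential-family form reproduces the usual Poisson pmf
x = 5
print(h(x) * exp(eta * T(x) - A), poisson.pmf(x, lam))
</syntaxhighlight>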
| | |
| Further down the page is the example of [[#Normal_distribution:_Unknown_mean.2C_known_variance|a normal distribution with unknown mean and known variance]].
| |
| | |
| === Factorization of the variables involved ===
| |
| What is important to note, and what characterizes all exponential family variants, is that the parameter(s) and the observation variable(s) must [[factorize]] (i.e. be separable into factors, each of which involves only one type of variable), either directly or within either part (the base or exponent) of an [[exponentiation]] operation. Generally, this means that all of the factors constituting the density or mass function must be of one of the following forms:
| |
| | |
| :<math>f(x), g(\theta), c^{f(x)}, c^{g(\theta)}, {[f(x)]}^c, {[g(\theta)]}^c, {[f(x)]}^{g(\theta)}, {[g(\theta)]}^{f(x)}, {[f(x)]}^{h(x)g(\theta)}, \text{ or } {[g(\theta)]}^{h(x)j(\theta)},</math>
| |
| | |
| where ''f'' and ''h'' are arbitrary functions of ''x''; ''g'' and ''j'' are arbitrary functions of θ; and ''c'' is an arbitrary "constant" expression (i.e. an expression not involving ''x'' or θ).
| |
| | |
| There are further restrictions on how many such factors can occur. For example, the two expressions:
| |
| | |
| :<math>{[f(x) g(\theta)]}^{h(x)j(\theta)}, \qquad {[f(x)]}^{h(x)j(\theta)} [g(\theta)]^{h(x)j(\theta)},</math>
| |
| | |
| are the same, i.e. a product of two "allowed" factors. However, when rewritten into the factorized form,
| |
| | |
| :<math>{[f(x) g(\theta)]}^{h(x)j(\theta)} = {[f(x)]}^{h(x)j(\theta)} [g(\theta)]^{h(x)j(\theta)} = e^{[h(x) \ln f(x)] j(\theta) + h(x) [j(\theta) \ln g(\theta)]},</math>
| |
| | |
| it can be seen that it cannot be expressed in the required form. (However, a form of this sort is a member of a ''curved exponential family'', which allows multiple factorized terms in the exponent.{{citation needed|date=June 2011}})
| |
| | |
| To see why an expression of the form
| |
| | |
| :<math>{[f(x)]}^{g(\theta)}</math>
| |
| | |
| qualifies, note that
| |
| | |
| :<math>{[f(x)]}^{g(\theta)} = e^{g(\theta) \ln f(x)}</math>
| |
| | |
| and hence factorizes inside of the exponent. Similarly,
| |
| | |
| :<math>{[f(x)]}^{h(x)g(\theta)} = e^{h(x)g(\theta)\ln f(x)} = e^{[h(x) \ln f(x)] g(\theta)}</math>
| |
| | |
| and again factorizes inside of the exponent.
| |
| | |
| Note also that a factor consisting of a sum in which both types of variables are involved (e.g. a factor of the form <math>1+f(x)g(\theta)</math>) cannot be factorized in this fashion (except in some cases where it occurs directly in an exponent); this is why, for example, the [[Cauchy distribution]] and [[Student's t distribution]] are not exponential families.
| |
| | |
| === Vector parameter ===
| |
| The definition in terms of one ''real-number'' parameter can be extended to one ''real-vector'' parameter
| |
| | |
| :<math>{\boldsymbol \theta} = \left(\theta_1, \theta_2, \cdots, \theta_s \right )^T.</math>
| |
| | |
| A family of distributions is said to belong to a vector exponential family if the probability density function (or probability mass function, for discrete distributions) can be written as
| |
| | |
| :<math> f_X(x|\boldsymbol \theta) = h(x) \exp\left(\sum_{i=1}^s \eta_i({\boldsymbol \theta}) T_i(x) - A({\boldsymbol \theta}) \right)</math>
| |
| | |
| Or in a more compact form,
| |
| | |
| :<math> f_X(x|\boldsymbol \theta) = h(x) \exp\Big(\boldsymbol\eta({\boldsymbol \theta}) \cdot \mathbf{T}(x) - A({\boldsymbol \theta}) \Big) </math>
| |
| | |
| This form writes the sum as a [[dot product]] of vector-valued functions <math>\boldsymbol\eta({\boldsymbol \theta})</math> and <math>\mathbf{T}(x)</math>.
| |
| | |
| An alternative, equivalent form often seen is
| |
| | |
| :<math> f_X(x|\boldsymbol \theta) = h(x) g(\boldsymbol \theta) \exp\Big(\boldsymbol\eta({\boldsymbol \theta}) \cdot \mathbf{T}(x)\Big)</math>
| |
| | |
| As in the scalar valued case, the exponential family is said to be in canonical form if
| |
| | |
| :<math>\forall i: \quad \eta_i({\boldsymbol \theta}) = \theta_i.</math>
| |
| | |
| A vector exponential family is said to be ''curved'' if the dimension of
| |
| | |
| :<math>{\boldsymbol \theta} = \left (\theta_1, \theta_2, \ldots, \theta_d \right )^T</math>
| |
| | |
| is less than the dimension of the vector
| |
| | |
| :<math>{\boldsymbol \eta}(\boldsymbol \theta) = \left (\eta_1(\boldsymbol \theta), \eta_2(\boldsymbol \theta), \ldots, \eta_s(\boldsymbol \theta) \right )^T.</math>
| |
| | |
| That is, if the ''dimension'' of the parameter vector is less than the ''number of functions'' of the parameter vector in the above representation of the probability density function. Note that most common distributions in the exponential family are ''not'' curved, and many algorithms designed to work with any member of the exponential family implicitly or explicitly assume that the distribution is not curved.
| |
| | |
| Note that, as in the above case of a scalar-valued parameter, the function <math>A(\boldsymbol \theta)</math> or equivalently <math>g(\boldsymbol \theta)</math> is automatically determined once the other functions have been chosen, so that the entire distribution is normalized. In addition, as above, both of these functions can always be written as functions of <math>\boldsymbol\eta</math>, regardless of the form of the transformation that generates <math>\boldsymbol\eta</math> from <math>\boldsymbol\theta</math>. Hence an exponential family in its "natural form" (parametrized by its natural parameter) looks like
| |
| | |
| :<math> f_X(x|\boldsymbol \eta) = h(x) \exp\Big(\boldsymbol\eta \cdot \mathbf{T}(x) - A({\boldsymbol \eta})\Big)</math>
| |
| | |
| or equivalently
| |
| | |
| :<math> f_X(x|\boldsymbol \eta) = h(x) g(\boldsymbol \eta) \exp\Big(\boldsymbol\eta \cdot \mathbf{T}(x)\Big)</math>
| |
| | |
| Note that the above forms may sometimes be seen with <math>\boldsymbol\eta^T \mathbf{T}(x)</math> in place of <math>\boldsymbol\eta \cdot \mathbf{T}(x)</math>. These are exactly equivalent formulations, merely using different notation for the [[dot product]].
| |
| | |
| Further down the page is the example of [[#Normal distribution: Unknown mean and unknown variance|a normal distribution with unknown mean and variance]].
| |
| | |
| === Vector parameter, vector variable ===
| |
| The vector-parameter form over a single scalar-valued random variable can be trivially expanded to cover a joint distribution over a vector of random variables. The resulting distribution is simply the same as the above distribution for a scalar-valued random variable with each occurrence of the scalar ''x'' replaced by the vector
| |
| | |
| :<math>\mathbf{x} = \left (x_1, x_2, \cdots, x_k \right).</math>
| |
| | |
| Note that the dimension ''k'' of the random variable need not match the dimension ''d'' of the parameter vector, nor (in the case of a curved exponential family) the dimension ''s'' of the [[natural parameter]] <math>\boldsymbol\eta</math> and [[sufficient statistic]] ''T''('''x''').
| |
| | |
| The distribution in this case is written as
| |
| | |
| :<math>f_X(\mathbf{x}|\boldsymbol \theta) = h(\mathbf{x})\exp\left(\sum_{i=1}^s \eta_i({\boldsymbol \theta}) T_i(\mathbf{x}) - A({\boldsymbol \theta}) \right)</math>
| |
| | |
| Or more compactly as
| |
| | |
| :<math> f_X(\mathbf{x}|\boldsymbol \theta) = h(\mathbf{x}) \exp\Big(\boldsymbol\eta({\boldsymbol \theta}) \cdot \mathbf{T}(\mathbf{x}) - A({\boldsymbol \theta})\Big)</math>
| |
| | |
| Or alternatively as
| |
| | |
| :<math> f_X(\mathbf{x}|\boldsymbol \theta) = h(\mathbf{x})\ g(\boldsymbol \theta)\ \exp\Big(\boldsymbol\eta({\boldsymbol \theta}) \cdot \mathbf{T}(\mathbf{x})\Big)</math>
| |
| | |
| === Measure-theoretic formulation ===
| |
| We use [[cumulative distribution function]]s (cdf) in order to encompass both discrete and continuous distributions.
| |
| | |
| Suppose ''H'' is a non-decreasing function of a real variable. Then [[Lebesgue–Stieltjes integral]]s with respect to ''dH''(''x'') are integrals with respect to the "reference measure" of the exponential family generated by ''H''.
| |
| | |
| Any member of that exponential family has cumulative distribution function
| |
| | |
| :<math>dF(\mathbf{x}|\boldsymbol\eta) = e^{\boldsymbol\eta^{\rm T} \mathbf{T}(\mathbf{x}) - A(\boldsymbol\eta)} dH(\mathbf{x}).</math>
| |
| | |
| If ''F'' is a continuous distribution with a density, one can write ''dF''(''x'') = ''f''(''x'') ''dx''.
| |
| | |
| ''H''(''x'') is a [[Lebesgue–Stieltjes integral|Lebesgue–Stieltjes integrator]] for the ''reference measure''. When the reference measure is finite, it can be normalized and ''H'' is actually the [[cumulative distribution function]] of a probability distribution. If ''F'' is absolutely continuous with a density, then so is ''H'', which can then be written ''dH''(''x'') = ''h''(''x'') ''dx''. If ''F'' is discrete, then ''H'' is a [[step function]] (with steps on the [[support (mathematics)|support]] of ''F'').
| |
| | |
| == Interpretation ==
| |
| In the definitions above, the functions ''T''(''x''), η(θ) and ''A''(η) may appear to have been chosen arbitrarily. However, these functions play a significant role in the resulting probability distribution.
| |
| | |
| * ''T''(''x'') is a ''[[sufficiency (statistics)|sufficient statistic]]'' of the distribution. For exponential families, the sufficient statistic is a function of the data that fully summarizes the data ''x'' within the density function. This means that, for any two data sets ''x'' and ''y'' with ''T''(''x'') = ''T''(''y''), the likelihood functions of θ are proportional to one another (they differ only by a factor that does not involve θ). This is true even if ''x'' and ''y'' are quite different, that is, <math> d(x,y)>0 </math>. The dimension of ''T''(''x'') equals the number of parameters of θ and encompasses all of the information regarding the data related to the parameter θ. The sufficient statistic of a set of [[independent identically distributed]] data observations is simply the sum of the individual sufficient statistics, and encapsulates all the information needed to describe the [[posterior distribution]] of the parameters, given the data (and hence to derive any desired estimate of the parameters). This important property is further discussed [[#Classical estimation: sufficiency|below]]; a small numerical illustration is given after this list.
| |
| | |
| * η is called the ''natural parameter''. The set of values of η for which the function <math>f_X(x;\theta)</math> is finite is called the ''natural parameter space''. It can be shown that the natural parameter space is always [[convex set|convex]].
| |
| | |
| * ''A''(η) is called the ''log-[[partition function (mathematics)|partition function]]'' because it is the [[logarithm]] of a [[normalization factor]], without which <math>f_X(x;\theta)</math> would not be a probability distribution ("partition function" is often used in statistics as a synonym of "normalization factor"):
| |
| ::<math> A(\eta) = \ln\left ( \int_x h(x) \exp (\eta(\theta) \cdot T(x)) \operatorname{d}x \right )</math>
| |
| The function ''A'' is important in its own right, because the [[mean]], [[variance]] and other [[moment (mathematics)|moment]]s of the sufficient statistic ''T''(''x'') can be derived simply by differentiating ''A''(η). For example, because ln(''x'') is one of the components of the sufficient statistic of the [[gamma distribution]], <math>\mathbb{E}[\ln x]</math> can be easily determined for this distribution using ''A''(η). Technically, this is true because
| |
| ::<math>K(u|\eta) = A(\eta+u) - A(\eta),</math>
| |
| is the [[cumulant generating function]] of the sufficient statistic.
| |
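As a small illustration of the sufficiency property described in the first bullet above, the following sketch (in Python, assuming [[NumPy]]) uses a normal distribution with known variance, whose sufficient statistic for an i.i.d. sample is the sum of the observations: two different data sets with the same sum produce log-likelihoods for μ that differ only by a constant not involving μ, so they support exactly the same inferences about μ.

<syntaxhighlight lang="python">
import numpy as np

sigma = 1.0
x = np.array([1.0, 2.0, 6.0])   # two different data sets ...
y = np.array([2.0, 3.0, 4.0])   # ... with the same sufficient statistic: sum = 9

def log_likelihood(mu, data):
    # sum of log N(data_i | mu, sigma^2), written so that the dependence on mu
    # enters only through sum(data), the sufficient statistic
    n = len(data)
    return (mu * data.sum() / sigma**2
            - n * mu**2 / (2 * sigma**2)
            - (data**2).sum() / (2 * sigma**2)
            - 0.5 * n * np.log(2 * np.pi * sigma**2))

for mu in [0.0, 1.5, 3.0]:
    # the difference is the same constant for every mu: the two samples
    # carry identical information about the unknown mean
    print(mu, log_likelihood(mu, x) - log_likelihood(mu, y))
</syntaxhighlight>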
| | |
| == Properties ==
| |
| Exponential families have a large number of properties that make them extremely useful for statistical analysis. For many of these properties it can be shown that, apart from a few exceptional cases, ''only'' exponential families possess them. Examples:
| |
| *Exponential families have [[sufficient statistic]]s that can summarize arbitrary amounts of [[independent identically distributed]] data using a fixed number of values.
| |
| *Exponential families have [[conjugate prior]]s, an important property in [[Bayesian statistics]].
| |
| *The [[posterior predictive distribution]] of an exponential-family random variable with a conjugate prior can always be written in closed form (provided that the [[normalizing factor]] of the exponential-family distribution can itself be written in closed form). Note that these distributions are often not themselves exponential families. Common examples of non-exponential families arising from exponential ones are the [[Student's t-distribution]], [[beta-binomial distribution]] and [[Dirichlet-multinomial distribution]].
| |
| *In the mean-field approximation in [[variational Bayes]] (used for approximating the [[posterior distribution]] in large [[Bayesian network]]s), the best approximating posterior distribution of an exponential-family node (a node is a random variable in the context of Bayesian networks) with a conjugate prior is in the same family as the node. {{citation needed|date=July 2012}}
| |
| | |
| == Examples ==
| |
| When considering the examples in this section, it is important to remember the discussion above about what it means to say that a "distribution" is an exponential family, and in particular to keep in mind that the set of parameters that are allowed to vary determines whether a "distribution" is or is not an exponential family.
| |
| | |
| The [[normal distribution|normal]], [[exponential distribution|exponential]], [[log-normal distribution|log-normal]], [[gamma distribution|gamma]], [[chi-squared distribution|chi-squared]], [[beta distribution|beta]], [[Dirichlet distribution|Dirichlet]], [[Bernoulli distribution|Bernoulli]], [[categorical distribution|categorical]], [[Poisson distribution|Poisson]], [[geometric distribution|geometric]], [[inverse Gaussian distribution|inverse Gaussian]], [[von Mises distribution|von Mises]] and [[von Mises-Fisher distribution|von Mises-Fisher]] distributions are all exponential families.
| |
| | |
| Some distributions are exponential families only if some of their parameters are held fixed. The family of [[Pareto distribution]]s with a fixed minimum bound ''x''<sub>m</sub> form an exponential family. The families of [[binomial distribution|binomial]] and [[multinomial distribution|multinomial]] distributions with fixed number of trials ''n'' but unknown probability parameter(s) are exponential families. The family of [[negative binomial distribution]]s with fixed number of failures (a.k.a. stopping-time parameter) ''r'' is an exponential family. However, when any of the above-mentioned fixed parameters are allowed to vary, the resulting family is not an exponential family.
| |
| | |
| As a general rule, the [[support (mathematics)|support]] of an exponential family must remain the same across all parameter settings in the family. This is why the above cases (e.g. binomial with varying number of trials, Pareto with varying minimum bound) are not exponential families: in each case, the parameter in question affects the support (particularly, changing the minimum or maximum possible value). For similar reasons, neither the [[discrete uniform distribution]] nor the [[continuous uniform distribution]] is an exponential family regardless of whether one of the bounds is held fixed. (If both bounds are held fixed, the result is a single distribution, not a family at all.)
| |
| | |
| The [[Weibull distribution]] with fixed shape parameter ''k'' is an exponential family. Unlike in the previous examples, the shape parameter does not affect the support; the fact that allowing it to vary makes the Weibull non-exponential is due rather to the particular form of the Weibull's [[probability density function]] (''k'' appears in the exponent of an exponent).
| |
| | |
| In general, distributions that result from a finite or infinite [[mixture distribution|mixture]] of other distributions, e.g. [[mixture model]] densities and [[compound probability distribution]]s, are ''not'' exponential families. Examples are typical Gaussian [[mixture model]]s as well as many [[heavy-tailed distribution]]s that result from [[compound probability distribution|compounding]] (i.e. infinitely mixing) a distribution with a [[prior distribution]] over one of its parameters, e.g. the [[Student's t-distribution]] (compounding a [[normal distribution]] over a [[gamma distribution|gamma-distributed]] precision prior), and the [[beta-binomial distribution|beta-binomial]] and [[Dirichlet-multinomial distribution|Dirichlet-multinomial]] distributions. Other examples of distributions that are not exponential families are the [[F-distribution]], [[Cauchy distribution]], [[hypergeometric distribution]] and [[logistic distribution]].
| |
| | |
| The following are some detailed examples of the representation of some useful distributions as exponential families.
| |
| | |
| === Normal distribution: Unknown mean, known variance ===
| |
| As a first example, consider a random variable distributed normally with unknown mean μ and ''known'' variance σ<sup>2</sup>. The probability density function is then
| |
| | |
| :<math>f_\sigma(x;\mu) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}.</math>
| |
| | |
| This is a single-parameter exponential family, as can be seen by setting
| |
| | |
| :<math>\begin{align}
| |
| h_\sigma(x) &= \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{x^2}{2\sigma^2}} \\
| |
| T_\sigma(x) &= \frac{x}{\sigma} \\
| |
| A_\sigma(\mu) &= \frac{\mu^2}{2\sigma^2}\\
| |
| \eta_\sigma(\mu) &= \frac{\mu}{\sigma}.
| |
| \end{align}</math>
| |
| | |
| If σ = 1 this is in canonical form, as then η(μ) = μ.
| |
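This factorization is easy to verify numerically; a minimal sketch (in Python, assuming NumPy and SciPy):

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

mu, sigma, x = 1.3, 2.0, 0.7

h   = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
T   = x / sigma                  # sufficient statistic
eta = mu / sigma                 # natural parameter
A   = mu**2 / (2 * sigma**2)     # log-partition function

# h(x) * exp(eta*T(x) - A) reproduces the normal density
print(h * np.exp(eta * T - A), norm.pdf(x, loc=mu, scale=sigma))
</syntaxhighlight>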
| | |
| === Normal distribution: Unknown mean and unknown variance ===
| |
| Next, consider the case of a normal distribution with unknown mean and unknown variance. The probability density function is then
| |
| | |
| :<math>f(x;\mu,\sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}}.</math>
| |
| | |
| This is an exponential family which can be written in canonical form by defining
| |
| | |
| :<math>\begin{align}
| |
| \boldsymbol {\eta} &= \left(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2} \right)^{\rm T} \\
| |
| h(x) &= \frac{1}{\sqrt{2 \pi}} \\
| |
| T(x) &= \left( x, x^2 \right)^{\rm T} \\
| |
| A({\boldsymbol \eta}) &= \frac{\mu^2}{2 \sigma^2} + \ln |\sigma| = -\frac{\eta_1^2}{4\eta_2} + \frac{1}{2}\ln\left|\frac{1}{2\eta_2} \right|
| |
| \end{align}</math>
| |
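As in the previous example, these identities can be checked numerically; a short sketch (in Python, assuming NumPy and SciPy) confirms that the two expressions for ''A'' agree and that the canonical form reproduces the density:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

mu, sigma, x = -0.4, 1.7, 2.1

eta = np.array([mu / sigma**2, -1.0 / (2 * sigma**2)])  # natural parameters
T   = np.array([x, x**2])                               # sufficient statistics
h   = 1.0 / np.sqrt(2 * np.pi)

A_theta = mu**2 / (2 * sigma**2) + np.log(abs(sigma))
A_eta   = -eta[0]**2 / (4 * eta[1]) + 0.5 * np.log(abs(1.0 / (2 * eta[1])))
print(A_theta, A_eta)                                   # identical values

print(h * np.exp(eta @ T - A_eta), norm.pdf(x, loc=mu, scale=sigma))
</syntaxhighlight>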
| | |
| === Binomial distribution ===
| |
| As an example of a discrete exponential family, consider the [[binomial distribution]] with ''known'' number of trials ''n''. The [[probability mass function]] for this distribution is
| |
| :<math>f(x)={n \choose x}p^x (1-p)^{n-x}, \quad x \in \{0, 1, 2, \ldots, n\}.</math>
| |
| This can equivalently be written as
| |
| :<math>f(x)={n \choose x}\exp\left(x \log\left(\frac{p}{1-p}\right) + n \log(1-p)\right),</math>
| |
| which shows that the binomial distribution is an exponential family, whose natural parameter is
| |
| :<math>\eta = \log\frac{p}{1-p}.</math>
| |
| This function of ''p'' is known as the [[logit]].
| |
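A quick numerical check of this rewriting (in Python, assuming NumPy and SciPy):

<syntaxhighlight lang="python">
import numpy as np
from scipy.special import comb
from scipy.stats import binom

n, p, x = 10, 0.3, 4

eta = np.log(p / (1 - p))      # natural parameter: the logit of p
h   = comb(n, x)               # base measure: the binomial coefficient
A   = -n * np.log(1 - p)       # log-partition function, written in terms of p

print(h * np.exp(eta * x - A), binom.pmf(x, n, p))   # identical values
</syntaxhighlight>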
| | |
| == Table of distributions ==
| |
| The following table shows how to rewrite a number of common distributions as exponential-family distributions with natural parameters. Refer to the flashcards<ref>{{cite journal
| |
| | last1 = Nielsen
| |
| | first1 = Frank
| |
| | last2 = Garcia
| |
| | first2 = Vincent
| |
| | year = 2009
| |
| | title = Statistical exponential families: A digest with flash cards
| |
| | journal = [[arxiv]]
| |
| | number = 0911.4863
| |
| | url = http://arxiv.org/abs/0911.4863
| |
| }}</ref> for the main exponential families.
| |
| | |
| For a scalar variable and scalar parameter, the form is as follows:
| |
| | |
| :<math> f_X(x|\theta) = h(x) \exp\Big(\eta(\theta) T(x) - A(\eta)\Big) </math>
| |
| | |
| For a scalar variable and vector parameter:
| |
| | |
| :<math> f_X(x|\boldsymbol \theta) = h(x) \exp\Big(\boldsymbol\eta({\boldsymbol \theta}) \cdot \mathbf{T}(x) - A({\boldsymbol \theta})\Big)</math>
| |
| :<math> f_X(x|\boldsymbol \theta) = h(x) g(\boldsymbol \theta) \exp\Big(\boldsymbol\eta({\boldsymbol \theta}) \cdot \mathbf{T}(x)\Big)</math>
| |
| | |
| For a vector variable and vector parameter:
| |
| | |
| :<math> f_X(\mathbf{x}|\boldsymbol \theta) = h(\mathbf{x}) \exp\Big(\boldsymbol\eta({\boldsymbol \theta}) \cdot \mathbf{T}(\mathbf{x}) - A({\boldsymbol \eta})\Big)</math>
| |
| | |
| The above formulas choose the functional form of the exponential family with a log-partition function <math>A({\boldsymbol \eta})</math>. This choice is made so that the [[#Moments and cumulants of the sufficient statistic|moments of the sufficient statistics]] can be calculated easily, simply by differentiating this function. Alternative forms involve either parameterizing this function in terms of the normal parameter <math>\boldsymbol\theta</math> instead of the natural parameter, and/or using a factor <math>g(\boldsymbol\eta)</math> outside of the exponential. The relation between the latter and the former is:
| |
| :<math>A(\boldsymbol\eta) = -\ln g(\boldsymbol\eta)</math>
| |
| :<math>g(\boldsymbol\eta) = e^{-A(\boldsymbol\eta)}</math>
| |
| To convert between the representations involving the two types of parameter, use the formulas below for writing one type of parameter in terms of the other.
| |
| | |
| {|class="wikitable"
| |
| ! Distribution !! Parameter(s) !! Natural parameter(s) !! Inverse parameter mapping !! Base measure <math>h(x)</math> !! Sufficient statistic <math>T(x)</math> !! Log-partition <math>A(\boldsymbol\eta)</math> !! Log-partition <math>A(\boldsymbol\theta)</math>
| |
| |-
| |
| | [[Bernoulli distribution]] || p
| |
| | <math>\ln\frac{p}{1-p}</math>
| |
| *This is the [[logit function]].
| |
| | <math>\frac{1}{1+e^{-\eta}} = \frac{e^\eta}{1+e^{\eta}}</math>
| |
| *This is the [[logistic function]].
| |
| | <math> 1 </math>
| |
| | <math> x </math>
| |
| | <math> \ln (1+e^{\eta})</math>
| |
| | <math> -\ln (1-p)</math>
| |
| |-
| |
| | [[binomial distribution]]<br />with known number of trials ''n'' || p
| |
| | <math>\ln\frac{p}{1-p}</math>
| |
| | <math>\frac{1}{1+e^{-\eta}} = \frac{e^\eta}{1+e^{\eta}}</math>
| |
| | <math> {n \choose x} </math>
| |
| | <math> x </math>
| |
| | <math> n \ln (1+e^{\eta})</math>
| |
| | <math> -n \ln (1-p)</math>
| |
| |-
| |
| | [[Poisson distribution]] || λ
| |
| | <math>\ln\lambda</math>
| |
| | <math>e^\eta</math>
| |
| | <math> \frac{1}{x!} </math>
| |
| | <math> x </math>
| |
| | <math> e^{\eta}</math>
| |
| | <math> \lambda</math>
| |
| |-
| |
| | [[negative binomial distribution]]<br />with known number of failures ''r'' || p
| |
| | <math>\ln p</math>
| |
| | <math>e^\eta</math>
| |
| | <math> {x+r-1 \choose x} </math>
| |
| | <math> x </math>
| |
| | <math> -r \ln (1-e^{\eta})</math>
| |
| | <math> -r \ln (1-p)</math>
| |
| |-
| |
| | [[exponential distribution]] || λ
| |
| | <math>-\lambda </math>
| |
| | <math>-\eta </math>
| |
| | <math> 1 </math>
| |
| | <math> x </math>
| |
| | <math> -\ln(-\eta)</math>
| |
| | <math> -\ln\lambda</math>
| |
| |-
| |
| | [[Pareto distribution]]<br />with known minimum value ''x''<sub>m</sub> || α
| |
| | <math>-\alpha-1</math>
| |
| | <math>-1-\eta</math>
| |
| | <math> 1 </math>
| |
| | <math> \ln x </math>
| |
| | <math> -\ln (-1-\eta) + (1+\eta) \ln x_{\mathrm m}</math>
| |
| | <math> -\ln \alpha - \alpha \ln x_{\mathrm m}</math>
| |
| |-
| |
| | [[Weibull distribution]]<br />with known shape ''k'' || λ
| |
| | <math>-\frac{1}{\lambda^k}</math>
| |
| | <math>(-\eta)^{\frac{1}{k}}</math>
| |
| | <math> x^{k-1} </math>
| |
| | <math> x^k </math>
| |
| | <math> \ln(-\eta) -\ln k</math>
| |
| | <math> k\ln\lambda -\ln k</math>
| |
| |-
| |
| | [[Laplace distribution]]<br />with known mean ''μ'' || b
| |
| | <math>-\frac{1}{b}</math>
| |
| | <math>-\frac{1}{\eta}</math>
| |
| | <math> 1 </math>
| |
| | <math> |x-\mu| </math>
| |
| | <math> \ln\left(-\frac{2}{\eta}\right)</math>
| |
| | <math> \ln 2b</math>
| |
| |-
| |
| | [[chi-squared distribution]] || ν
| |
| | <math>\frac{\nu}{2}-1 </math>
| |
| | <math>2(\eta+1) </math>
| |
| | <math> e^{-\frac{x}{2}} </math>
| |
| | <math> \ln x </math>
| |
| | <math> \ln \Gamma(\eta+1)+(\eta+1)\ln 2</math>
| |
| | <math> \ln \Gamma\left(\frac{\nu}{2}\right)+\frac{\nu}{2}\ln 2</math>
| |
| |-
| |
| | [[normal distribution]]<br />known variance || μ
| |
| | <math>\frac{\mu}{\sigma} </math>
| |
| | <math>\sigma\eta </math>
| |
| | <math> \frac{e^{-\frac{x^2}{2\sigma^2}}}{\sqrt{2\pi}\sigma} </math>
| |
| | <math> \frac{x}{\sigma} </math>
| |
| | <math> \frac{\eta^2}{2}</math>
| |
| | <math> \frac{\mu^2}{2\sigma^2}</math>
| |
| |-
| |
| | [[normal distribution]] || μ,σ<sup>2</sup>
| |
| | <math>\begin{bmatrix} \dfrac{\mu}{\sigma^2} \\[10pt] -\dfrac{1}{2\sigma^2} \end{bmatrix} </math>
| |
| | <math>\begin{bmatrix} -\dfrac{\eta_1}{2\eta_2} \\[15pt] -\dfrac{1}{2\eta_2} \end{bmatrix} </math>
| |
| | <math> \frac{1}{\sqrt{2\pi}} </math>
| |
| | <math> \begin{bmatrix} x \\ x^2 \end{bmatrix} </math>
| |
| | <math> -\frac{\eta_1^2}{4\eta_2} - \frac12\ln(-2\eta_2)</math>
| |
| | <math> \frac{\mu^2}{2\sigma^2} + \ln \sigma</math>
| |
| |-
| |
| | [[lognormal distribution]] || μ,σ<sup>2</sup>
| |
| | <math>\begin{bmatrix} \dfrac{\mu}{\sigma^2} \\[10pt] -\dfrac{1}{2\sigma^2} \end{bmatrix} </math>
| |
| | <math>\begin{bmatrix} -\dfrac{\eta_1}{2\eta_2} \\[15pt] -\dfrac{1}{2\eta_2} \end{bmatrix} </math>
| |
| | <math> \frac{1}{\sqrt{2\pi}x} </math>
| |
| | <math> \begin{bmatrix} \ln x \\ (\ln x)^2 \end{bmatrix} </math>
| |
| | <math> -\frac{\eta_1^2}{4\eta_2} - \frac12\ln(-2\eta_2)</math>
| |
| | <math> \frac{\mu^2}{2\sigma^2} + \ln \sigma</math>
| |
| |-
| |
| [[inverse Gaussian distribution]] || μ,λ
| |
| | <math>\begin{bmatrix} -\dfrac{\lambda}{2\mu^2} \\[15pt] -\dfrac{\lambda}{2} \end{bmatrix} </math>
| |
| | <math>\begin{bmatrix} \sqrt{\dfrac{\eta_2}{\eta_1}} \\[15pt] -2\eta_2 \end{bmatrix} </math>
| |
| | <math> \frac{1}{\sqrt{2\pi}x^{\frac{3}{2}}} </math>
| |
| | <math> \begin{bmatrix} x \\[5pt] \dfrac{1}{x} \end{bmatrix} </math>
| |
| | <math> -2\sqrt{\eta_1\eta_2} -\frac12\ln(-2\eta_2)</math>
| |
| | <math> -\frac{\lambda}{\mu} -\frac12\ln\lambda</math>
| |
| |-
| |
| | rowspan=2|[[gamma distribution]] || α,β
| |
| | <math>\begin{bmatrix} \alpha-1 \\ -\beta \end{bmatrix} </math>
| |
| | <math>\begin{bmatrix} \eta_1+1 \\ -\eta_2 \end{bmatrix} </math>
| |
| | rowspan=2|<math> 1 </math>
| |
| | rowspan=2|<math> \begin{bmatrix} \ln x \\ x \end{bmatrix} </math>
| |
| | rowspan=2|<math> \ln \Gamma(\eta_1+1)-(\eta_1+1)\ln(-\eta_2)</math>
| |
| | <math> \ln \Gamma(\alpha)-\alpha\ln\beta</math>
| |
| |-
| |
| | ''k'', θ
| |
| | <math>\begin{bmatrix} k-1 \\[5pt] -\dfrac{1}{\theta} \end{bmatrix} </math>
| |
| | <math>\begin{bmatrix} \eta_1+1 \\[5pt] -\dfrac{1}{\eta_2} \end{bmatrix} </math>
| |
| | <math> \ln \Gamma(k)+k\ln\theta</math>
| |
| |-
| |
| | [[inverse gamma distribution]] || α,β
| |
| | <math>\begin{bmatrix} -\alpha-1 \\ -\beta \end{bmatrix} </math>
| |
| | <math>\begin{bmatrix} -\eta_1-1 \\ -\eta_2 \end{bmatrix} </math>
| |
| | <math> 1 </math>
| |
| | <math> \begin{bmatrix} \ln x \\ \frac{1}{x} \end{bmatrix} </math>
| |
| | <math> \ln \Gamma(-\eta_1-1)-(-\eta_1-1)\ln(-\eta_2)</math>
| |
| | <math> \ln \Gamma(\alpha)-\alpha\ln\beta</math>
| |
| |-
| |
| | [[scaled inverse chi-squared distribution]] || ν,σ<sup>2</sup>
| |
| | <math>\begin{bmatrix} -\dfrac{\nu}{2}-1 \\[10pt] -\dfrac{\nu\sigma^2}{2} \end{bmatrix} </math>
| |
| | <math>\begin{bmatrix} -2(\eta_1+1) \\[10pt] \dfrac{\eta_2}{\eta_1+1} \end{bmatrix} </math>
| |
| | <math> 1 </math>
| |
| | <math>\begin{bmatrix} \ln x \\ \frac{1}{x} \end{bmatrix} </math>
| |
| | <math> \ln \Gamma(-\eta_1-1)-(-\eta_1-1)\ln(-\eta_2)</math>
| |
| | <math> \ln \Gamma\left(\frac{\nu}{2}\right)-\frac{\nu}{2}\ln\frac{\nu\sigma^2}{2}</math>
| |
| |-
| |
| | [[beta distribution]] || α,β
| |
| | <math>\begin{bmatrix} \alpha \\ \beta \end{bmatrix} </math>
| |
| | <math>\begin{bmatrix} \eta_1 \\ \eta_2 \end{bmatrix} </math>
| |
| | <math> \frac{1}{x(1-x)} </math>
| |
| | <math> \begin{bmatrix} \ln x \\ \ln (1-x) \end{bmatrix} </math>
| |
| | <math> \ln \Gamma(\eta_1) + \ln \Gamma(\eta_2) - \ln \Gamma(\eta_1+\eta_2)</math>
| |
| | <math> \ln \Gamma(\alpha) + \ln \Gamma(\beta) - \ln \Gamma(\alpha+\beta)</math>
| |
| |-
| |
| | [[multivariate normal distribution]] || '''μ''','''Σ'''
| |
| | <math>\begin{bmatrix} \boldsymbol\Sigma^{-1}\boldsymbol\mu \\[5pt] -\frac12\boldsymbol\Sigma^{-1} \end{bmatrix}</math>
| |
| | <math>\begin{bmatrix} -\frac12\boldsymbol\eta_2^{-1}\boldsymbol\eta_1 \\[5pt] -\frac12\boldsymbol\eta_2^{-1} \end{bmatrix}</math>
| |
| | <math>(2\pi)^{-\frac{k}{2}}</math>
| |
| | <math>\begin{bmatrix} \mathbf{x} \\[5pt] \mathbf{x}\mathbf{x}^\mathrm{T} \end{bmatrix}</math>
| |
| | <math> -\frac{1}{4}\boldsymbol\eta_1^{\rm T}\boldsymbol\eta_2^{-1}\boldsymbol\eta_1 - \frac12\ln\left|-2\boldsymbol\eta_2\right|</math>
| |
| | <math> \frac12\boldsymbol\mu^{\rm T}\boldsymbol\Sigma^{-1}\boldsymbol\mu + \frac12 \ln |\boldsymbol\Sigma|</math>
| |
| |-
| |
| | [[categorical distribution]] (variant 1) || p<sub>1</sub>,...,p<sub>k</sub><br /><br />where <math>\textstyle\sum_{i=1}^k p_i=1</math>
| |
| | <math>\begin{bmatrix} \ln p_1 \\ \vdots \\ \ln p_k \end{bmatrix}</math>
| |
| | <math>\begin{bmatrix} e^{\eta_1} \\ \vdots \\ e^{\eta_k} \end{bmatrix}</math><br /><br />where <math>\textstyle\sum_{i=1}^k e^{\eta_i}=1</math>
| |
| | <math> 1 </math>
| |
| | <math>\begin{bmatrix} [x=1] \\ \vdots \\ {[x=k]} \end{bmatrix} </math>
| |
| *<math>[x=i]</math> is the [[Iverson bracket]] (1 if <math>x=i</math>, 0 otherwise).
| |
| | <math> 0</math>
| |
| | <math> 0</math>
| |
| |-
| |
| | [[categorical distribution]] (variant 2) || p<sub>1</sub>,...,p<sub>k</sub><br /><br />where <math>\textstyle\sum_{i=1}^k p_i=1</math>
| |
| | <math>\begin{bmatrix} \ln p_1+C \\ \vdots \\ \ln p_k+C \end{bmatrix}</math>
| |
| | <math>\begin{bmatrix} \dfrac{1}{C}e^{\eta_1} \\ \vdots \\ \dfrac{1}{C}e^{\eta_k} \end{bmatrix} =</math><br />
| |
| <math>\begin{bmatrix} \dfrac{e^{\eta_1}}{\sum_{i=1}^{k}e^{\eta_i}} \\[10pt] \vdots \\[5pt] \dfrac{e^{\eta_k}}{\sum_{i=1}^{k}e^{\eta_i}} \end{bmatrix}</math><br /><br />where <math>\textstyle\sum_{i=1}^k e^{\eta_i}=C</math>
| |
| | <math> 1 </math>
| |
| | <math>\begin{bmatrix} [x=1] \\ \vdots \\ {[x=k]} \end{bmatrix} </math>
| |
| *<math>[x=i]</math> is the [[Iverson bracket]] (1 if <math>x=i</math>, 0 otherwise).
| |
| | <math> 0</math>
| |
| | <math> 0</math>
| |
| |-
| |
| | [[categorical distribution]] (variant 3) || p<sub>1</sub>,...,p<sub>k</sub><br /><br />where <math>p_k = 1 - \textstyle\sum_{i=1}^{k-1} p_i</math>
| |
| | <math>\begin{bmatrix} \ln \dfrac{p_1}{p_k} \\[10pt] \vdots \\[5pt] \ln \dfrac{p_{k-1}}{p_k} \\[15pt] 0 \end{bmatrix} =</math><br /><br /><math>\begin{bmatrix} \ln \dfrac{p_1}{1-\sum_{i=1}^{k-1}p_i} \\[10pt] \vdots \\[5pt] \ln \dfrac{p_{k-1}}{1-\sum_{i=1}^{k-1}p_i} \\[15pt] 0 \end{bmatrix}</math>
| |
| *This is the inverse [[softmax function]], a generalization of the [[logit function]].
| |
| | <math>\begin{bmatrix} \dfrac{e^{\eta_1}}{\sum_{i=1}^{k}e^{\eta_i}} \\[10pt] \vdots \\[5pt] \dfrac{e^{\eta_k}}{\sum_{i=1}^{k}e^{\eta_i}} \end{bmatrix} =</math><br /><br />
| |
| <math>\begin{bmatrix} \dfrac{e^{\eta_1}}{1+\sum_{i=1}^{k-1}e^{\eta_i}} \\[10pt] \vdots \\[5pt] \dfrac{e^{\eta_{k-1}}}{1+\sum_{i=1}^{k-1}e^{\eta_i}} \\[15pt] \dfrac{1}{1+\sum_{i=1}^{k-1}e^{\eta_i}} \end{bmatrix}</math>
| |
| *This is the [[softmax function]], a generalization of the [[logistic function]].
| |
| | <math> 1 </math>
| |
| | <math>\begin{bmatrix} [x=1] \\ \vdots \\ {[x=k]} \end{bmatrix} </math>
| |
| *<math>[x=i]</math> is the [[Iverson bracket]] (1 if <math>x=i</math>, 0 otherwise).
| |
| | <math> \ln \left(\sum_{i=1}^{k} e^{\eta_i}\right) = \ln \left(1+\sum_{i=1}^{k-1} e^{\eta_i}\right)
| |
| </math>
| |
| | <math> -\ln p_k = -\ln \left(1 - \sum_{i=1}^{k-1} p_i\right)</math>
| |
| |-
| |
| | [[multinomial distribution]] (variant 1)<br />with known number of trials ''n'' || p<sub>1</sub>,...,p<sub>k</sub><br /><br />where <math>\textstyle\sum_{i=1}^k p_i=1</math>
| |
| | <math>\begin{bmatrix} \ln p_1 \\ \vdots \\ \ln p_k \end{bmatrix}</math>
| |
| | <math>\begin{bmatrix} e^{\eta_1} \\ \vdots \\ e^{\eta_k} \end{bmatrix}</math><br /><br />where <math>\textstyle\sum_{i=1}^k e^{\eta_i}=1</math>
| |
| | <math> \frac{n!}{\prod_{i=1}^{k} x_i!} </math>
| |
| | <math>\begin{bmatrix} x_1 \\ \vdots \\ x_k \end{bmatrix} </math>
| |
| | <math> 0</math>
| |
| | <math> 0</math>
| |
| |-
| |
| | [[multinomial distribution]] (variant 2)<br />with known number of trials ''n'' || p<sub>1</sub>,...,p<sub>k</sub><br /><br />where <math>\textstyle\sum_{i=1}^k p_i=1</math>
| |
| | <math>\begin{bmatrix} \ln p_1+C \\ \vdots \\ \ln p_k+C \end{bmatrix}</math>
| |
| | <math>\begin{bmatrix} \dfrac{1}{C}e^{\eta_1} \\ \vdots \\ \dfrac{1}{C}e^{\eta_k} \end{bmatrix} =</math><br />
| |
| <math>\begin{bmatrix} \dfrac{e^{\eta_1}}{\sum_{i=1}^{k}e^{\eta_i}} \\[10pt] \vdots \\[5pt] \dfrac{e^{\eta_k}}{\sum_{i=1}^{k}e^{\eta_i}} \end{bmatrix}</math><br /><br />where <math>\textstyle\sum_{i=1}^k e^{\eta_i}=C</math>
| |
| | <math> \frac{n!}{\prod_{i=1}^{k} x_i!} </math>
| |
| | <math>\begin{bmatrix} x_1 \\ \vdots \\ x_k \end{bmatrix} </math>
| |
| | <math> 0</math>
| |
| | <math> 0</math>
| |
| |-
| |
| | [[multinomial distribution]] (variant 3)<br />with known number of trials ''n''
| |
| | p<sub>1</sub>,...,p<sub>k</sub><br /><br />where <math>p_k = 1 - \textstyle\sum_{i=1}^{k-1} p_i</math>
| |
| | <math>\begin{bmatrix} \ln \dfrac{p_1}{p_k} \\[10pt] \vdots \\[5pt] \ln \dfrac{p_{k-1}}{p_k} \\[15pt] 0 \end{bmatrix} =</math><br /><br /><math>\begin{bmatrix} \ln \dfrac{p_1}{1-\sum_{i=1}^{k-1}p_i} \\[10pt] \vdots \\[5pt] \ln \dfrac{p_{k-1}}{1-\sum_{i=1}^{k-1}p_i} \\[15pt] 0 \end{bmatrix}</math>
| |
| | <math>\begin{bmatrix} \dfrac{e^{\eta_1}}{\sum_{i=1}^{k}e^{\eta_i}} \\[10pt] \vdots \\[5pt] \dfrac{e^{\eta_k}}{\sum_{i=1}^{k}e^{\eta_i}} \end{bmatrix} =</math><br /><br />
| |
| <math>\begin{bmatrix} \dfrac{e^{\eta_1}}{1+\sum_{i=1}^{k-1}e^{\eta_i}} \\[10pt] \vdots \\[5pt] \dfrac{e^{\eta_{k-1}}}{1+\sum_{i=1}^{k-1}e^{\eta_i}} \\[15pt] \dfrac{1}{1+\sum_{i=1}^{k-1}e^{\eta_i}} \end{bmatrix}</math>
| |
| | <math> \frac{n!}{\prod_{i=1}^{k} x_i!} </math>
| |
| | <math>\begin{bmatrix} x_1 \\ \vdots \\ x_k \end{bmatrix} </math>
| |
| | <math> \ln \left(\sum_{i=1}^{k} e^{\eta_i}\right) = \ln \left(1+\sum_{i=1}^{k-1} e^{\eta_i}\right)</math>
| |
| | <math> -\ln p_k = -\ln \left(1 - \sum_{i=1}^{k-1} p_i\right)</math>
| |
| |-
| |
| | [[Dirichlet distribution]] || α<sub>1</sub>,...,α<sub>k</sub>
| |
| | <math>\begin{bmatrix} \alpha_1-1 \\ \vdots \\ \alpha_k-1 \end{bmatrix}</math>
| |
| | <math>\begin{bmatrix} \eta_1+1 \\ \vdots \\ \eta_k+1 \end{bmatrix}</math>
| |
| | <math> 1 </math>
| |
| | <math> \begin{bmatrix} \ln x_1 \\ \vdots \\ \ln x_k \end{bmatrix} </math>
| |
| | <math> \sum_{i=1}^k \ln \Gamma(\eta_i+1) - \ln \Gamma\left(\sum_{i=1}^k\Big(\eta_i+1\Big)\right)
| |
| </math>
| |
| | <math> \sum_{i=1}^k \ln \Gamma(\alpha_i) - \ln \Gamma\left(\sum_{i=1}^k\alpha_i\right)
| |
| </math>
| |
| |-
| |
| rowspan=2|[[Wishart distribution]] || '''V''',n
| |
| | <math>\begin{bmatrix} -\frac12\mathbf{V}^{-1} \\[5pt] \dfrac{n-p-1}{2} \end{bmatrix}</math>
| |
| | <math>\begin{bmatrix} -\frac12{\boldsymbol\eta_1}^{-1} \\[5pt] 2\eta_2+p+1 \end{bmatrix}</math>
| |
| | <math> 1 </math>
| |
| | <math> \begin{bmatrix} \mathbf{X} \\ \ln|\mathbf{X}| \end{bmatrix} </math>
| |
| | rowspan=2|<math>-\left(\eta_2+\frac{p+1}{2}\right)\ln|-\boldsymbol\eta_1|</math><br />
| |
| <math>+ \ln\Gamma_p\left(\eta_2+\frac{p+1}{2}\right) =</math><br />
| |
| <math>-\frac{n}{2}\ln|-\boldsymbol\eta_1| + \ln\Gamma_p\left(\frac{n}{2}\right) =</math><br />
| |
| <math>\left(\eta_2+\frac{p+1}{2}\right)(p\ln 2 + \ln|\mathbf{V}|)</math><br />
| |
| <math>+ \ln\Gamma_p\left(\eta_2+\frac{p+1}{2}\right)</math>
| |
| *Three variants with different parameterizations are given, to facilitate computing moments of the sufficient statistics.
| |
| |rowspan=2|<math> \frac{n}{2}(p\ln 2 + \ln|\mathbf{V}|) + \ln\Gamma_p\left(\frac{n}{2}\right)</math>
| |
| |-
| |
| | colspan=5|'''NOTE''': Uses the fact that <math>{\rm tr}(\mathbf{A}^{\rm T}\mathbf{B}) = \operatorname{vec}(\mathbf{A}) \cdot \operatorname{vec}(\mathbf{B}),</math> i.e. the [[trace (linear algebra)|trace]] of a [[matrix product]] is much like a [[dot product]]. The matrix parameters are assumed to be [[vectorization (mathematics)|vectorized]] (laid out in a vector) when inserted into the exponential form. Also, '''V''' and '''X''' are symmetric, so e.g. <math>\mathbf{V}^{\rm T} = \mathbf{V}.</math>
| |
| |-
| |
| [[inverse Wishart distribution]] || '''Ψ''',m
| |
| | <math>\begin{bmatrix} -\frac12\boldsymbol\Psi \\[5pt] -\dfrac{m+p+1}{2} \end{bmatrix}</math>
| |
| | <math>\begin{bmatrix} -2\boldsymbol\eta_1 \\[5pt] -(2\eta_2+p+1) \end{bmatrix}</math>
| |
| | <math> 1 </math>
| |
| | <math> \begin{bmatrix} \mathbf{X}^{-1} \\ \ln|\mathbf{X}| \end{bmatrix} </math>
| |
| | <math> \left(\eta_2 + \frac{p + 1}{2}\right)\ln|-\boldsymbol\eta_1|</math><br />
| |
| <math> + \ln\Gamma_p\left(-\Big(\eta_2 + \frac{p + 1}{2}\Big)\right) =</math><br />
| |
| <math> -\frac{m}{2}\ln|-\boldsymbol\eta_1| + \ln\Gamma_p\left(\frac{m}{2}\right) =</math><br />
| |
| <math> -\left(\eta_2 + \frac{p + 1}{2}\right)(p\ln 2 - \ln|\boldsymbol\Psi|)</math><br />
| |
| <math> + \ln\Gamma_p\left(-\Big(\eta_2 + \frac{p + 1}{2}\Big)\right)</math>
| |
| |<math>\frac{m}{2}(p\ln 2 - \ln|\boldsymbol\Psi|) + \ln\Gamma_p\left(\frac{m}{2}\right)</math>
| |
| |-
| |
| | [[normal-gamma distribution]] || α,β,μ,λ
| |
| | <math>\begin{bmatrix} \alpha-\frac12 \\ -\beta-\dfrac{\lambda\mu^2}{2} \\ \lambda\mu \\ -\dfrac{\lambda}{2}\end{bmatrix} </math>
| |
| | <math>\begin{bmatrix} \eta_1+\frac12 \\ -\eta_2 + \dfrac{\eta_3^2}{4\eta_4} \\ -\dfrac{\eta_3}{2\eta_4} \\ -2\eta_4 \end{bmatrix} </math>
| |
| | <math> \dfrac{1}{\sqrt{2\pi}} </math>
| |
| | <math> \begin{bmatrix} \ln \tau \\ \tau \\ \tau x \\ \tau x^2 \end{bmatrix} </math>
| |
| <math> \ln \Gamma\left(\eta_1+\frac12\right) - \frac12\ln\left(-2\eta_4\right)</math><br />
| |
| <math> - \left(\eta_1+\frac12\right)\ln\left(-\eta_2 + \dfrac{\eta_3^2}{4\eta_4}\right)</math>
| |
| | <math> \ln \Gamma\left(\alpha\right)-\alpha\ln\beta-\frac12\ln\lambda</math>
| |
| |}
| |
| | |
| The three variants of the [[categorical distribution]] and [[multinomial distribution]] are due to the fact that the parameters <math>p_i</math> are constrained, such that
| |
| | |
| :<math>\sum_{i=1}^{k} p_i = 1.</math>
| |
| | |
| Thus, there are only ''k''−1 independent parameters.
| |
| *Variant 1 uses ''k'' natural parameters with a simple relation between the standard and natural parameters; however, only ''k''−1 of the natural parameters are independent, and the set of ''k'' natural parameters is [[nonidentifiable]]. The constraint on the usual parameters translates to a similar constraint on the natural parameters.
| |
| *Variant 2 demonstrates the fact that the entire set of natural parameters is nonidentifiable: Adding any constant value to the natural parameters has no effect on the resulting distribution. However, by using the constraint on the natural parameters, the formula for the normal parameters in terms of the natural parameters can be written in a way that is independent of the constant that is added.
| |
| *Variant 3 shows how to make the parameters identifiable in a convenient way by setting <math>C = -\ln p_k .</math> This effectively "pivots" around ''p<sub>k</sub>'' and causes the last natural parameter to have the constant value of 0. All the remaining formulas are written in a way that does not access ''p<sub>k</sub>'', so that effectively the model has only ''k''−1 parameters, both of the usual and natural kind.
| |
| Note also that variants 1 and 2 are not actually standard exponential families at all. Rather they are ''curved exponential families'', i.e. there are ''k''−1 independent parameters embedded in a ''k''-dimensional parameter space. Many of the standard results for exponential families do not apply to curved exponential families. An example is the log-partition function ''A''(η), which has the value of 0 in the curved cases. In standard exponential families, the derivatives of this function correspond to the moments (more technically, the [[cumulant]]s) of the sufficient statistics, e.g. the mean and variance. However, a value of 0 suggests that the mean and variance of all the sufficient statistics are uniformly 0, whereas in fact the mean of the ''i''th sufficient statistic should be ''p<sub>i</sub>''. (This does emerge correctly when using the form of ''A''(η) in variant 3.)
| |
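The difference between the identifiable parameterization of variant 3 and the non-identifiable parameterizations of variants 1 and 2 can be seen directly in a few lines of code (Python, assuming NumPy):

<syntaxhighlight lang="python">
import numpy as np

p = np.array([0.2, 0.5, 0.3])     # k = 3 category probabilities

# variant 3: natural parameters eta_i = ln(p_i / p_k), so eta_k is pinned to 0
eta = np.log(p / p[-1])
print(eta)                        # last entry is exactly 0

# the inverse mapping is the softmax function, which recovers p
print(np.exp(eta) / np.exp(eta).sum())

# variants 1 and 2 are non-identifiable: adding any constant to all of the
# natural parameters leaves the resulting distribution unchanged
shifted = eta + 7.0
print(np.exp(shifted) / np.exp(shifted).sum())   # still [0.2, 0.5, 0.3]
</syntaxhighlight>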
| | |
| == Moments and cumulants of the sufficient statistic ==
| |
| === Normalization of the distribution ===
| |
| | |
| We start with the normalization of the probability distribution. In general, an arbitrary function ''f''(''x'') that serves as the [[kernel (statistics)|kernel]] of a probability distribution (the part encoding all dependence on ''x'') can be made into a proper distribution by [[normalization factor|normalizing]]: i.e.
| |
| | |
| :<math>p(x) = \frac{1}{Z} f(x)</math>
| |
| | |
| where
| |
| | |
| :<math>Z = \int_x f(x) dx.</math>
| |
| | |
| The factor ''Z'' is sometimes termed the ''normalizer'' or ''[[partition function (mathematics)|partition function]]'', based on an analogy to [[statistical physics]].
| |
| | |
| In the case of an exponential family where
| |
| :<math>p(x; \boldsymbol\eta) = g(\boldsymbol\eta) h(x) e^{\boldsymbol\eta \cdot \mathbf{T}(x)},</math>
| |
| | |
| the kernel is
| |
| :<math>K(x) = h(x) e^{\boldsymbol\eta \cdot \mathbf{T}(x)}</math>
| |
| and the partition function is
| |
| :<math>Z = \int_x h(x) e^{\boldsymbol\eta \cdot \mathbf{T}(x)} dx.</math>
| |
| | |
| Since the distribution must be normalized, we have
| |
| | |
| :<math>1 = \int_x g(\boldsymbol\eta) h(x) e^{\boldsymbol\eta \cdot \mathbf{T}(x)} dx = g(\boldsymbol\eta) \int_x h(x) e^{\boldsymbol\eta \cdot \mathbf{T}(x)} dx = g(\boldsymbol\eta) Z.</math>
| |
| | |
| In other words,
| |
| :<math>g(\boldsymbol\eta) = \frac{1}{Z}</math>
| |
| or equivalently
| |
| :<math>A(\boldsymbol\eta) = - \ln g(\boldsymbol\eta) = \ln Z.</math>
| |
| | |
| This justifies calling ''A'' the ''log-normalizer'' or ''log-partition function''.
| |
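For a concrete example, the partition function of the [[exponential distribution]] (for which ''h''(''x'') = 1, ''T''(''x'') = ''x'' and η = −λ, as in the table above) can be computed by numerical integration; a sketch in Python, assuming SciPy:

<syntaxhighlight lang="python">
import numpy as np
from scipy.integrate import quad

lam = 2.5
eta = -lam            # natural parameter of the exponential distribution

# kernel K(x) = h(x) exp(eta*T(x)) with h(x) = 1 and T(x) = x on x >= 0
Z, _ = quad(lambda x: np.exp(eta * x), 0, np.inf)

print(np.log(Z), -np.log(-eta))   # A(eta) = ln Z = -ln(-eta) = -ln(lambda)
</syntaxhighlight>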
| | |
| === Moment generating function of the sufficient statistic ===
| |
| Now, the [[moment generating function]] of ''T''(''x'') is
| |
| | |
| :<math>M_T(u) \equiv E[e^{u^{\rm T} T(x)}|\eta] = \int_x h(x) e^{(\eta+u)^{\rm T} T(x)-A(\eta)} dx = e^{A(\eta + u)-A(\eta)}</math>
| |
| | |
| proving the earlier statement that
| |
| | |
| :<math>K(u|\eta) = A(\eta+u) - A(\eta)</math>
| |
| | |
| is the [[cumulant generating function]] for ''T''.
| |
| | |
| An important subclass of the exponential family, the [[natural exponential family]], has a similar form for the moment generating function of the distribution of ''x'' itself.
| |
| | |
| ==== Differential identities for cumulants ====
| |
| In particular, using the properties of the cumulant generating function,
| |
| | |
| :<math> E(T_{j}) = \frac{ \partial A(\eta) }{ \partial \eta_{j} } </math>
| |
| | |
| and
| |
| | |
| :<math> \mathrm{cov}\left (T_i,T_j \right) = \frac{ \partial^2 A(\eta) }{ \partial \eta_{i} \partial \eta_{j} }. </math>
| |
| | |
| The first two raw moments and all mixed second moments can be recovered from these two identities. Higher order moments and cumulants are obtained by higher derivatives. This technique is often useful when ''T'' is a complicated function of the data, whose moments are difficult to calculate by integration.
| |
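These identities are easy to verify numerically. The following sketch (Python, assuming NumPy) approximates the derivatives of ''A'' for the Poisson case, where ''A''(η) = ''e''<sup>η</sup> and ''T''(''x'') = ''x'', by finite differences:

<syntaxhighlight lang="python">
import numpy as np

A   = np.exp          # Poisson log-partition function A(eta) = exp(eta)
lam = 4.2
eta = np.log(lam)
h   = 1e-5            # finite-difference step

dA  = (A(eta + h) - A(eta - h)) / (2 * h)             # dA/deta
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2   # d^2A/deta^2

print(dA,  lam)   # E[T(x)]   = E[x]   = lambda
print(d2A, lam)   # var(T(x)) = var(x) = lambda
</syntaxhighlight>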
| | |
| Another way to see this that does not rely on the theory of [[cumulant]]s is to begin from the fact that the distribution of an exponential family must be normalized, and differentiate. We illustrate using the simple case of a one-dimensional parameter, but an analogous derivation holds more generally.
| |
| | |
| In the one-dimensional case, we have
| |
| :<math>p(x) = g(\eta) h(x) e^{\eta T(x)} .</math>
| |
| | |
| This must be normalized, so
| |
| :<math>1 = \int_x p(x) dx = \int_x g(\eta) h(x) e^{\eta T(x)} dx = g(\eta) \int_x h(x) e^{\eta T(x)} dx .</math>
| |
| Take the [[derivative]] of both sides with respect to η:
| |
| | |
| :<math>\begin{align}
| |
| 0 &= g(\eta) \frac{d}{d\eta} \int_x h(x) e^{\eta T(x)} dx + g'(\eta)\int_x h(x) e^{\eta T(x)} dx \\
| |
| &= g(\eta) \int_x h(x) \left(\frac{d}{d\eta} e^{\eta T(x)}\right) dx + g'(\eta)\int_x h(x) e^{\eta T(x)} dx \\
| |
| &= g(\eta) \int_x h(x) e^{\eta T(x)} T(x) dx + g'(\eta)\int_x h(x) e^{\eta T(x)} dx \\
| |
| &= \int_x T(x) g(\eta) h(x) e^{\eta T(x)} dx + \frac{g'(\eta)}{g(\eta)}\int_x g(\eta) h(x) e^{\eta T(x)} dx \\
| |
| &= \int_x T(x) p(x) dx + \frac{g'(\eta)}{g(\eta)}\int_x p(x) dx \\
| |
| &= \mathbb{E}[T(x)] + \frac{g'(\eta)}{g(\eta)} \\
| |
| &= \mathbb{E}[T(x)] + \frac{d}{d\eta} \ln g(\eta)
| |
| \end{align}</math>
| |
| | |
| Therefore,
| |
| :<math>\mathbb{E}[T(x)] = - \frac{d}{d\eta} \ln g(\eta) = \frac{d}{d\eta} A(\eta).</math>
| |
| | |
| ==== Example 1 ====
| |
| | |
| As an introductory example, consider the [[gamma distribution]], whose distribution is defined by
| |
| | |
| :<math>p(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1}e^{-\beta x}.</math>
| |
| | |
| Referring to the above table, we can see that the natural parameter is given by
| |
| | |
| :<math>\eta_1 = \alpha-1,</math>
| |
| :<math>\eta_2 = -\beta,</math>
| |
| | |
| the reverse substitutions are
| |
| | |
| :<math>\alpha = \eta_1+1,</math>
| |
| :<math>\beta = -\eta_2,</math>
| |
| | |
| the sufficient statistics are <math>(\ln x, x),</math> and the log-partition function is
| |
| | |
| :<math>A(\eta_1,\eta_2) = \ln \Gamma(\eta_1+1)-(\eta_1+1)\ln(-\eta_2).</math>
| |
| | |
| We can find the mean of the sufficient statistics as follows. First, for η<sub>1</sub>:
| |
| | |
| :<math>\begin{align}
| |
| \mathbb{E}[\ln x] &= \frac{ \partial A(\eta_1,\eta_2) }{ \partial \eta_1 } = \frac{ \partial }{ \partial \eta_1 } \left(\ln\Gamma(\eta_1+1) - (\eta_1+1) \ln(-\eta_2)\right) \\
| |
| &= \psi(\eta_1+1) - \ln(-\eta_2) \\
| |
| &= \psi(\alpha) - \ln \beta,
| |
| \end{align}</math>
| |
| | |
| where <math>\psi(x)</math> is the [[digamma function]] (the derivative of the log of the gamma function), and we used the reverse substitutions in the last step.
| |
| | |
| Now, for η<sub>2</sub>:
| |
| | |
| :<math>\begin{align}
| |
| \mathbb{E}[x] &= \frac{ \partial A(\eta_1,\eta_2) }{ \partial \eta_2 } = \frac{ \partial }{ \partial \eta_2 } \left(\ln \Gamma(\eta_1+1)-(\eta_1+1)\ln(-\eta_2)\right) \\
| |
| &= -(\eta_1+1)\frac{1}{-\eta_2}(-1) = \frac{\eta_1+1}{-\eta_2} \\
| |
| &= \frac{\alpha}{\beta},
| |
| \end{align}</math>
| |
| | |
| again making the reverse substitution in the last step.
| |
| | |
| To compute the variance of ''x'', we just differentiate again:
| |
| | |
| :<math>\begin{align}
| |
| \operatorname{Var}(x) &= \frac{\partial^2 A\left(\eta_1,\eta_2 \right)}{\partial \eta_2^2} = \frac{\partial}{\partial \eta_2} \frac{\eta_1+1}{-\eta_2} \\
| |
| &= \frac{\eta_1+1}{\eta_2^2} \\
| |
| &= \frac{\alpha}{\beta^2}.
| |
| \end{align}</math>
| |
| | |
| All of these calculations can be done using integration, making use of various properties of the [[gamma function]], but this requires significantly more work.
| |
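These closed-form results can also be checked against a [[Monte Carlo method|Monte Carlo]] simulation; a sketch in Python, assuming NumPy and SciPy (note that SciPy parameterizes the gamma distribution by a scale, which is 1/β here):

<syntaxhighlight lang="python">
import numpy as np
from scipy.special import digamma
from scipy.stats import gamma

alpha, beta = 3.0, 2.0
rng = np.random.default_rng(0)
samples = gamma.rvs(alpha, scale=1.0 / beta, size=1_000_000, random_state=rng)

print(np.log(samples).mean(), digamma(alpha) - np.log(beta))  # E[ln x]
print(samples.mean(), alpha / beta)                           # E[x]
print(samples.var(), alpha / beta**2)                         # var(x)
</syntaxhighlight>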
| | |
| ==== Example 2 ====
| |
| As another example consider a real valued random variable ''X'' with density
| |
| | |
| :<math> p_\theta (x) = \frac{ \theta e^{-x} }{\left(1 + e^{-x} \right)^{\theta + 1} } </math>
| |
| | |
| indexed by shape parameter <math> \theta \in (0,\infty) </math> (this is called the [[skew-logistic distribution]]). The density can be rewritten as
| |
| | |
| :<math> \frac{ e^{-x} } { 1 + e^{-x} } \exp\left( -\theta \log\left(1 + e^{-x} \right) + \log(\theta)\right) </math>
| |
| | |
| Notice this is an exponential family with natural parameter
| |
| | |
| :<math> \eta = -\theta,</math>
| |
| | |
| sufficient statistic
| |
| | |
| :<math> T = \log\left (1 + e^{-x} \right),</math>
| |
| | |
| and log-partition function
| |
| | |
| :<math> A(\eta) = -\log(\theta) = -\log(-\eta)</math>
| |
| | |
| So using the first identity,
| |
| | |
| :<math> E(\log(1 + e^{-X})) = E(T) = \frac{ \partial A(\eta) }{ \partial \eta } = \frac{ \partial }{ \partial \eta } [-\log(-\eta)] = \frac{1}{-\eta} = \frac{1}{\theta}, </math>
| |
| | |
| and using the second identity
| |
| | |
| :<math> \mathrm{var}(\log\left(1 + e^{-X} \right)) = \frac{ \partial^2 A(\eta) }{ \partial \eta^2 } = \frac{ \partial }{ \partial \eta } \left[\frac{1}{-\eta}\right] = \frac{1}{(-\eta)^2} = \frac{1}{\theta^2}.</math>
| |
| | |
| This example illustrates a case where using this method is very simple, while the direct calculation would be considerably more laborious.
| |
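Although the direct integral is awkward, the two results are easy to confirm by simulation. The sketch below (Python, assuming NumPy) samples from the skew-logistic distribution by inverting its cumulative distribution function, <math>F(x) = (1+e^{-x})^{-\theta}</math>:

<syntaxhighlight lang="python">
import numpy as np

theta = 2.5
rng = np.random.default_rng(1)
u = rng.uniform(size=1_000_000)

# inverse-CDF sampling: solve (1 + exp(-x))^(-theta) = u for x
x = -np.log(u ** (-1.0 / theta) - 1.0)

t = np.log1p(np.exp(-x))          # sufficient statistic T = log(1 + e^-x)
print(t.mean(), 1.0 / theta)      # E[T]   = 1/theta
print(t.var(), 1.0 / theta**2)    # var(T) = 1/theta^2
</syntaxhighlight>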
| | |
| ==== Example 3 ====
| |
| The final example is one where integration would be extremely difficult. This is the case of the [[Wishart distribution]], which is defined over matrices. Even taking derivatives is a bit tricky, as it involves [[matrix calculus]], but the respective identities are listed in that article.
| |
| | |
| From the above table, we can see that the natural parameter is given by
| |
| | |
| :<math>\boldsymbol\eta_1 = -\frac12\mathbf{V}^{-1},</math>
| |
| :<math>\eta_2 = \frac{n-p-1}{2},</math>
| |
| | |
| the reverse substitutions are
| |
| | |
| :<math>\mathbf{V} = -\frac12{\boldsymbol\eta_1}^{-1},</math>
| |
| :<math>n = 2\eta_2+p+1,</math>
| |
| | |
| and the sufficient statistics are <math>(\mathbf{X}, \ln|\mathbf{X}|).</math>
| |
| | |
| The log-partition function is written in various forms in the table, to facilitate differentiation and back-substitution. We use the following forms:
| |
| | |
| :<math>A(\boldsymbol\eta_1, n) = -\frac{n}{2}\ln|-\boldsymbol\eta_1| + \ln\Gamma_p\left(\frac{n}{2}\right),</math>
| |
| :<math>A(\mathbf{V},\eta_2) = \left(\eta_2+\frac{p+1}{2}\right)(p\ln 2 + \ln|\mathbf{V}|) + \ln\Gamma_p\left(\eta_2+\frac{p+1}{2}\right).</math>
| |
| | |
| ;Expectation of '''X''' (associated with '''η'''<sub>1</sub>)
| |
| To differentiate with respect to '''η'''<sub>1</sub>, we need the following [[matrix calculus]] identity:
| |
| | |
| : <math>\frac{\partial \ln |a\mathbf{X}|}{\partial \mathbf{X}} =(\mathbf{X}^{-1})^{\rm T}</math>
| |
| | |
| Then:
| |
| | |
| :<math>\begin{align}
| |
| \mathbb{E}[\mathbf{X}] &= \frac{ \partial A\left(\boldsymbol\eta_1,\cdots \right) }{ \partial \boldsymbol\eta_1 } \\
| |
| &= \frac{ \partial }{ \partial \boldsymbol\eta_1 } \left[-\frac{n}{2}\ln|-\boldsymbol\eta_1| + \ln\Gamma_p\left(\frac{n}{2}\right) \right] \\
| |
| &= -\frac{n}{2}(\boldsymbol\eta_1^{-1})^{\rm T} \\
| |
| &= \frac{n}{2}(-\boldsymbol\eta_1^{-1})^{\rm T} \\
| |
| &= n(\mathbf{V})^{\rm T} \\
| |
| &= n\mathbf{V}
| |
| \end{align}</math>
| |
| | |
| The last line uses the fact that '''V''' is symmetric, and therefore it is the same when transposed.

;Expectation of ln |'''X'''| (associated with η<sub>2</sub>)
Now, for η<sub>2</sub>, we first need to expand the part of the log-partition function that involves the [[multivariate gamma function]]:

:<math> \ln \Gamma_p(a)= \ln \left(\pi^{\frac{p(p-1)}{4}}\prod_{j=1}^p \Gamma\left(a+\frac{1-j}{2}\right)\right) = \frac{p(p-1)}{4} \ln \pi + \sum_{j=1}^p \ln \Gamma\left[ a+\frac{1-j}{2}\right] </math>

We also need the [[digamma function]]:

:<math>\psi(x) = \frac{d}{dx} \ln \Gamma(x).</math>

Then:

:<math>\begin{align}
\mathbb{E}[\ln |\mathbf{X}|] &= \frac{\partial A\left (\cdots,\eta_2 \right)}{\partial \eta_2} \\
&= \frac{\partial}{\partial \eta_2} \left[\left(\eta_2+\frac{p+1}{2}\right)(p\ln 2 + \ln|\mathbf{V}|) + \ln\Gamma_p\left(\eta_2+\frac{p+1}{2}\right) \right] \\
&= \frac{\partial}{\partial \eta_2} \left[ \left(\eta_2+\frac{p+1}{2}\right)(p\ln 2 + \ln|\mathbf{V}|) + \frac{p(p-1)}{4} \ln \pi + \sum_{j=1}^p \ln \Gamma\left(\eta_2+\frac{p+1}{2}+\frac{1-j}{2}\right) \right] \\
&= p\ln 2 + \ln|\mathbf{V}| + \sum_{j=1}^p \psi\left(\eta_2+\frac{p+1}{2}+\frac{1-j}{2}\right) \\
&= p\ln 2 + \ln|\mathbf{V}| + \sum_{j=1}^p \psi\left(\frac{n-p-1}{2}+\frac{p+1}{2}+\frac{1-j}{2}\right) \\
&= p\ln 2 + \ln|\mathbf{V}| + \sum_{j=1}^p \psi\left(\frac{n+1-j}{2}\right)
\end{align}</math>

This latter formula is listed in the [[Wishart distribution#Log-expectation|Wishart distribution]] article. Both of these expectations are needed when deriving the [[variational Bayes]] update equations in a [[Bayesian network]] involving a Wishart distribution (which is the [[conjugate prior]] of the [[multivariate normal distribution]]).

Computing these formulas using integration would be much more difficult. The first one, for example, would require matrix integration.
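
For readers who want to verify the two Wishart expectations empirically, the following Monte Carlo sketch uses SciPy's <code>wishart</code> distribution with an arbitrary positive-definite scale matrix '''V''' and degrees of freedom ''n''; it is an illustration, not a derivation.

<syntaxhighlight lang="python">
# Monte Carlo check of E[X] = n V and
# E[ln|X|] = p ln 2 + ln|V| + sum_{j=1}^p psi((n + 1 - j)/2) for the Wishart.
import numpy as np
from scipy.stats import wishart
from scipy.special import digamma

p, n = 3, 7
V = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 0.5]])              # positive-definite scale matrix

samples = wishart.rvs(df=n, scale=V, size=200_000, random_state=0)

print(samples.mean(axis=0).round(2))         # should be close to n * V
print(n * V)

logdets = np.linalg.slogdet(samples)[1]
expected = p * np.log(2) + np.linalg.slogdet(V)[1] \
    + sum(digamma((n + 1 - j) / 2) for j in range(1, p + 1))
print(logdets.mean(), expected)              # E[ln|X|] vs. closed form
</syntaxhighlight>
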

== Maximum entropy derivation ==
The exponential family arises naturally as the answer to the following question: what is the [[principle of maximum entropy|maximum-entropy]] distribution consistent with given constraints on expected values?

The [[information entropy]] of a probability distribution ''dF''(''x'') can only be computed with respect to some other probability distribution (or, more generally, a positive measure), and both [[measure (mathematics)|measure]]s must be mutually [[absolutely continuous]]. Accordingly, we need to pick a ''reference measure'' ''dH''(''x'') with the same support as ''dF''(''x'').

The entropy of ''dF''(''x'') relative to ''dH''(''x'') is

:<math>S[dF|dH]=-\int \frac{dF}{dH}\ln\frac{dF}{dH}\,dH</math>

or

:<math>S[dF|dH]=\int\ln\frac{dH}{dF}\,dF</math>

where ''dF''/''dH'' and ''dH''/''dF'' are [[Radon–Nikodym derivative]]s. Note that the ordinary definition of entropy for a discrete distribution supported on a set ''I'', namely

:<math>S=-\sum_{i\in I} p_i\ln p_i</math>

''assumes'', though this is seldom pointed out, that ''dH'' is chosen to be the [[counting measure]] on ''I''.

Consider now a collection of observable quantities (random variables) ''T<sub>i</sub>''. The probability distribution ''dF'' whose entropy with respect to ''dH'' is greatest, subject to the conditions that the expected value of ''T<sub>i</sub>'' be equal to ''t<sub>i</sub>'', is a member of the exponential family with ''dH'' as reference measure and (''T''<sub>1</sub>, ..., ''T<sub>n</sub>'') as sufficient statistic.

The derivation is a simple [[calculus of variations|variational calculation]] using [[Lagrange multipliers]]. Normalization is imposed by letting ''T''<sub>0</sub> = 1 be one of the constraints. The natural parameters of the distribution are the Lagrange multipliers, and the normalization factor is the Lagrange multiplier associated to ''T''<sub>0</sub>.
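
Concretely (a sketch of the variational step, writing ''f'' = ''dF''/''dH'' and multipliers η<sub>0</sub>, η<sub>1</sub>, ..., η<sub>''n''</sub>), stationarity of the Lagrangian requires

:<math>\frac{\delta}{\delta f(x)}\left[-\int f\ln f\,dH + \eta_0\left(\int f\,dH - 1\right) + \sum_{i=1}^n \eta_i\left(\int T_i f\,dH - t_i\right)\right] = -\ln f(x) - 1 + \eta_0 + \sum_{i=1}^n \eta_i T_i(x) = 0,</math>

so that

:<math>\frac{dF}{dH}(x) = \exp\left(\sum_{i=1}^n \eta_i T_i(x) + \eta_0 - 1\right) \propto \exp\left(\sum_{i=1}^n \eta_i T_i(x)\right),</math>

which is exactly the exponential-family form with ''dH'' as reference measure, with η<sub>0</sub> determined by the normalization constraint and playing the role of the log-partition function (up to sign and an additive constant).
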

For examples of such derivations, see [[Maximum entropy probability distribution]].

== Role in statistics ==

=== Classical estimation: sufficiency ===
According to the '''[[E. J. G. Pitman|Pitman]]–[[Bernard Koopman|Koopman]]–[[Georges Darmois|Darmois]] theorem''', among families of probability distributions whose domain does not vary with the parameter being estimated, only in exponential families is there a [[sufficient statistic]] whose dimension remains bounded as sample size increases.

Less tersely, suppose ''X<sub>k</sub>'' (where ''k'' = 1, 2, 3, ..., ''n'') are [[statistical independence|independent]], identically distributed random variables. Only if their distribution is one of the ''exponential family'' of distributions is there a [[sufficient statistic]] '''''T'''''(''X''<sub>1</sub>, ..., ''X<sub>n</sub>'') whose [[dimension|number]] of [[Random variable#Introduction|scalar components]] does not increase as the sample size ''n'' increases; the statistic '''''T''''' may be a [[Multivariate random variable|vector]] or a [[Random variable#Introduction|single scalar number]], but whatever it is, its [[dimension|size]] will neither grow nor shrink when more data are obtained.
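
As a small illustration (a sketch with simulated data, assuming NumPy), for i.i.d. normal observations with unknown mean and variance the statistic (Σ''x<sub>i</sub>'', Σ''x<sub>i</sub>''<sup>2</sup>) is sufficient and always has two components, regardless of the sample size:

<syntaxhighlight lang="python">
# The sufficient statistic for i.i.d. normal data with unknown mean and variance
# is T(X) = (sum x_i, sum x_i^2): two components whether n = 10 or n = 10_000.
import numpy as np

rng = np.random.default_rng(0)
for n in (10, 10_000):
    x = rng.normal(loc=1.0, scale=2.0, size=n)
    T = (x.sum(), (x ** 2).sum())            # always a 2-component statistic
    print(n, T)
</syntaxhighlight>
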

=== Bayesian estimation: conjugate distributions ===
Exponential families are also important in [[Bayesian statistics]]. In Bayesian statistics a [[prior distribution]] is multiplied by a [[likelihood function]] and then normalised to produce a [[posterior distribution]]. In the case of a likelihood which belongs to the exponential family there exists a [[conjugate prior]], which is often also in the exponential family. A conjugate prior π for the parameter <math>\boldsymbol\eta</math> of an exponential family is given by

:<math>p_\pi(\boldsymbol\eta|\boldsymbol\chi,\nu) = f(\boldsymbol\chi,\nu) \exp \left (\boldsymbol\eta^{\rm T} \boldsymbol\chi - \nu A(\boldsymbol\eta) \right ),</math>

or equivalently

:<math>p_\pi(\boldsymbol\eta|\boldsymbol\chi,\nu) = f(\boldsymbol\chi,\nu) g(\boldsymbol\eta)^\nu \exp \left (\boldsymbol\eta^{\rm T} \boldsymbol\chi \right ), \qquad \boldsymbol\chi \in \mathbb{R}^s,</math>

where ''s'' is the dimension of <math>\boldsymbol\eta</math>, and <math>\boldsymbol\chi</math> and ν > 0 are [[hyperparameter]]s (parameters controlling parameters). ν corresponds to the effective number of observations that the prior distribution contributes, and <math>\boldsymbol\chi</math> corresponds to the total amount that these pseudo-observations contribute to the [[sufficient statistic]] over all observations and pseudo-observations. <math>f(\boldsymbol\chi,\nu)</math> is a [[normalization constant]] that is automatically determined by the remaining functions and serves to ensure that the given function is a [[probability density function]] (i.e. it is [[Normalization (statistics)|normalized]]). <math>A(\boldsymbol\eta)</math> and equivalently <math>g(\boldsymbol\eta)</math> are the same functions as in the definition of the distribution over which π is the conjugate prior.
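
For instance (a short check of this parameterization, not taken from the definitions above), for the Bernoulli distribution written in natural form one has η = logit ''p'', ''A''(η) = ln(1 + ''e''<sup>η</sup>) and hence ''g''(η) = (1 + ''e''<sup>η</sup>)<sup>−1</sup>, so the prior above reads

:<math>p_\pi(\eta|\chi,\nu) \propto (1+e^\eta)^{-\nu} e^{\eta\chi};</math>

changing variables to ''p'' = ''e''<sup>η</sup>/(1 + ''e''<sup>η</sup>), with ''d''η/''dp'' = 1/(''p''(1 − ''p'')), gives a density proportional to ''p''<sup>χ−1</sup>(1 − ''p'')<sup>ν−χ−1</sup>, i.e. a Beta(χ, ν − χ) prior (proper when 0 < χ < ν). Here ν behaves as a count of pseudo-observations and χ as their total contribution to the sufficient statistic, as described above.
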

A conjugate prior is one which, when combined with the likelihood and normalised, produces a posterior distribution of the same type as the prior. For example, if one is estimating the success probability of a binomial distribution, then choosing a beta distribution as the prior yields another beta distribution as the posterior. This makes the computation of the posterior particularly simple. Similarly, if one is estimating the parameter of a [[Poisson distribution]], a gamma prior leads to another gamma posterior. Conjugate priors are often very flexible and can be very convenient. However, if one's belief about the likely value of the θ parameter of a binomial is represented by (say) a bimodal (two-humped) prior distribution, then this cannot be represented by a beta distribution. It can, however, be represented by using a [[mixture density]] as the prior, here a combination of two beta distributions; this is a form of [[hyperprior]].

An arbitrary likelihood will not belong to the exponential family, and thus in general no conjugate prior exists. The posterior will then have to be computed by numerical methods.

To show that the above prior distribution is a conjugate prior, we derive the posterior.

First, assume that the probability of a single observation follows an exponential family, parameterized using its natural parameter:

:<math> p_F(x|\boldsymbol \eta) = h(x) g(\boldsymbol\eta) \exp\left(\boldsymbol\eta^{\rm T} \mathbf{T}(x)\right)</math>

Then, for data <math>\mathbf{X} = (x_1,\ldots,x_n)</math>, the likelihood is computed as follows:

:<math>p(\mathbf{X}|\boldsymbol\eta) =\left(\prod_{i=1}^n h(x_i) \right) g(\boldsymbol\eta)^n \exp\left(\boldsymbol\eta^{\rm T}\sum_{i=1}^n \mathbf{T}(x_i) \right)</math>

Then, for the above conjugate prior:

:<math> p_\pi(\boldsymbol\eta|\boldsymbol\chi,\nu) = f(\boldsymbol\chi,\nu) g(\boldsymbol\eta)^\nu \exp(\boldsymbol\eta^{\rm T} \boldsymbol\chi) \propto g(\boldsymbol\eta)^\nu \exp(\boldsymbol\eta^{\rm T} \boldsymbol\chi)</math>

We can then compute the posterior as follows:

:<math>\begin{align}
p(\boldsymbol\eta|\mathbf{X},\boldsymbol\chi,\nu)& \propto p(\mathbf{X}|\boldsymbol\eta) p_\pi(\boldsymbol\eta|\boldsymbol\chi,\nu) \\
&= \left(\prod_{i=1}^n h(x_i) \right) g(\boldsymbol\eta)^n \exp\left(\boldsymbol\eta^{\rm T} \sum_{i=1}^n \mathbf{T}(x_i)\right) f(\boldsymbol\chi,\nu) g(\boldsymbol\eta)^\nu \exp(\boldsymbol\eta^{\rm T} \boldsymbol\chi) \\
&\propto g(\boldsymbol\eta)^n \exp\left(\boldsymbol\eta^{\rm T}\sum_{i=1}^n \mathbf{T}(x_i)\right) g(\boldsymbol\eta)^\nu \exp(\boldsymbol\eta^{\rm T} \boldsymbol\chi) \\
&\propto g(\boldsymbol\eta)^{\nu + n} \exp\left(\boldsymbol\eta^{\rm T} \left(\boldsymbol\chi + \sum_{i=1}^n \mathbf{T}(x_i)\right)\right)
\end{align}</math>

The last line has the same form as the kernel of the prior distribution, i.e.

:<math>p(\boldsymbol\eta|\mathbf{X},\boldsymbol\chi,\nu) = p_\pi\left(\boldsymbol\eta|\boldsymbol\chi + \sum_{i=1}^n \mathbf{T}(x_i), \nu + n \right)</math>

This shows that the posterior has the same form as the prior.

Note in particular that the data '''X''' enters into this equation ''only'' in the expression

:<math>\mathbf{T}(\mathbf{X}) = \sum_{i=1}^n \mathbf{T}(x_i),</math>

which is termed the [[sufficient statistic]] of the data. That is, the value of the sufficient statistic is sufficient to completely determine the posterior distribution. The actual data points themselves are not needed, and all sets of data points with the same sufficient statistic will have the same distribution. This is important because the dimension of the sufficient statistic does not grow with the data size: it has only as many components as the components of <math>\boldsymbol\eta</math> (equivalently, the number of parameters of the distribution of a single data point).

The update equations are as follows:

:<math>\begin{align}
\boldsymbol\chi' &= \boldsymbol\chi + \mathbf{T}(\mathbf{X}) \\
&= \boldsymbol\chi + \sum_{i=1}^n \mathbf{T}(x_i) \\
\nu' &= \nu + n
\end{align} </math>

This shows that the update equations can be written simply in terms of the number of data points and the [[sufficient statistic]] of the data. This can be seen clearly in the various examples of update equations shown in the [[conjugate prior]] page. Note also that, because of the way the sufficient statistic is computed, it necessarily involves sums of components of the data (in some cases disguised as products or other forms, since a product can be written in terms of a sum of [[logarithm]]s). The cases where the update equations for particular distributions do not exactly match the above forms are cases where the conjugate prior has been expressed using a different [[parameterization]] than the one that produces a conjugate prior of the above form, often specifically because the above form is defined over the natural parameter <math>\boldsymbol\eta</math>, while conjugate priors are usually defined over the actual parameter <math>\boldsymbol\theta .</math>
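
A minimal sketch of this update in code (assuming NumPy, simulated Bernoulli data and arbitrary prior hyperparameters; for the Bernoulli likelihood the prior above corresponds to a Beta(χ, ν − χ) distribution on the success probability, as noted earlier):

<syntaxhighlight lang="python">
# Generic conjugate update chi' = chi + sum_i T(x_i), nu' = nu + n,
# specialized to Bernoulli observations, where T(x) = x.
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(n=1, p=0.3, size=50)     # 50 Bernoulli(0.3) observations

chi, nu = 2.0, 5.0                           # prior hyperparameters (0 < chi < nu)
chi_new = chi + data.sum()                   # chi' = chi + sum T(x_i)
nu_new = nu + data.size                      # nu' = nu + n

# Equivalent posterior in the usual Beta(a, b) parameterization:
a, b = chi_new, nu_new - chi_new             # Beta(chi + sum x, nu - chi + n - sum x)
print(chi_new, nu_new, (a, b))
</syntaxhighlight>
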

=== Hypothesis testing: uniformly most powerful tests ===
{{further2|[[Uniformly most powerful test#Important case: The exponential family|Uniformly most powerful test]]}}
The one-parameter exponential family has a monotone non-decreasing likelihood ratio in the [[Sufficiency (statistics)|sufficient statistic]] ''T''(''x''), provided that η(θ) is non-decreasing. As a consequence, there exists a [[uniformly most powerful test]] for [[hypothesis testing|testing the hypothesis]] ''H''<sub>0</sub>: θ ≥ θ<sub>0</sub> ''vs.'' ''H''<sub>1</sub>: θ < θ<sub>0</sub>.
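
As a concrete sketch (using the Poisson family, for which η(θ) = ln θ is increasing, and ignoring the randomization needed for exact size with discrete data), such a test rejects ''H''<sub>0</sub> when the sufficient statistic Σ''x<sub>i</sub>'' is small; the critical value comes from its Poisson(''n''θ<sub>0</sub>) distribution under θ = θ<sub>0</sub>. The code below assumes NumPy and SciPy.

<syntaxhighlight lang="python">
# Level-alpha test of H0: theta >= theta0 vs H1: theta < theta0 for Poisson data:
# reject when T = sum(x) <= c, where c is the largest integer with
# P_{theta0}(T <= c) <= alpha (T ~ Poisson(n * theta0) under theta0).
import numpy as np
from scipy.stats import poisson

def reject_h0(x, theta0, alpha=0.05):
    n = len(x)
    T = int(np.sum(x))
    c = int(poisson.ppf(alpha, n * theta0))
    if poisson.cdf(c, n * theta0) > alpha:   # ppf can overshoot; step back if needed
        c -= 1
    return T <= c

rng = np.random.default_rng(0)
x = rng.poisson(lam=0.4, size=30)            # true theta = 0.4
print(reject_h0(x, theta0=1.0))              # likely True: evidence that theta < 1
</syntaxhighlight>
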

=== Generalized linear models ===
The exponential family forms the basis for the distribution function used in [[generalized linear models]], a class of models that encompasses many of the commonly used regression models in statistics.

== See also ==
* [[Natural exponential family]]
* [[Exponential dispersion model]]
* [[Gibbs measure]]

{{more footnotes|date=November 2010}}

== References ==
{{reflist}}

== Further reading ==
* {{cite book |last=Lehmann |first=E. L. |last2=Casella |first2=G. |title=Theory of Point Estimation |year=1998 |edition=2nd |at=sec. 1.5}}
* {{cite book |last=Keener |first=Robert W. |title=Statistical Theory: Notes for a Course in Theoretical Statistics |publisher=Springer |year=2006 |pages=27–28, 32–33}}
* {{cite book |last=Fahrmeir |first=Ludwig |last2=Tutz |first2=G. |title=Multivariate statistical modelling based on generalized linear models |publisher=Springer |year=1994 |pages=18–22, 345–349}}
== External links ==
* [http://www.casact.org/pubs/dpp/dpp04/04dpp117.pdf A primer on the exponential family of distributions]
* [http://jeff560.tripod.com/e.html Exponential family of distributions] on the [http://jeff560.tripod.com/mathword.html Earliest known uses of some of the words of mathematics]
* [http://vincentfpgarcia.github.com/jMEF/ jMEF: A Java library for exponential families]

{{ProbDistributions|families}}

{{DEFAULTSORT:Exponential Family}}
[[Category:Exponentials]]
[[Category:Continuous distributions]]
[[Category:Discrete distributions]]
[[Category:Types of probability distributions]]