Talk:Beta distribution



Ohanian 01:26, 2005 Apr 8 (UTC)

why the difference in notation?

Consider the expression

$\binom{n}{i}\, p^{i} (1-p)^{n-i}$

Fixing (n,p) it is the binomial distribution of i. Fixing (n,i) it is the (unnormalized) beta distribution of p. The article does not clarify this.

Bo Jacoby 10:02, 15 September 2005 (UTC)

This is mentioned only implicitly in the current version, which describes the beta distribution as the conjugate prior for the binomial. You could add a section on occurrence and uses of the beta distribution that would clarify this point further. --MarkSweep 12:50, 15 September 2005 (UTC)

I don't see what makes you think the article is not explicit about this point. You wrote this on September 15th, when the version of September 6th was there, and that version is perfectly explicit about it. It says the density f(x) is defined on the interval [0, 1], and x where it appears in that formula is the same as what you're calling p above. How explicit can you get? Michael Hardy 23:06, 16 December 2005 (UTC)

... or did you mean it fails to clarify that the same expression defines both functions? OK, maybe you did mean that ... Michael Hardy 23:08, 16 December 2005 (UTC)
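A small numerical sketch of the point raised in this thread, assuming the expression under discussion is the binomial term C(n, i) p^i (1 - p)^(n - i). The function name `kernel` and the values of n, p and i are illustrative choices, not taken from the article.

```python
from math import comb


def kernel(n: int, i: int, p: float) -> float:
    """The shared expression C(n, i) * p**i * (1 - p)**(n - i)."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)


n = 10

# Fix (n, p) and vary i: a binomial probability mass function, which sums to 1.
p = 0.3
pmf_total = sum(kernel(n, i, p) for i in range(n + 1))

# Fix (n, i) and vary p: an unnormalized Beta(i + 1, n - i + 1) density.
# Its integral over [0, 1] equals 1 / (n + 1); a midpoint rule approximates it.
i = 4
steps = 100_000
area = sum(kernel(n, i, (k + 0.5) / steps) for k in range(steps)) / steps

print(pmf_total)       # close to 1
print(area * (n + 1))  # close to 1 after rescaling by (n + 1)
```

Read as a function of i the expression is already normalized; read as a function of p it needs the extra factor (n + 1) to become a proper density.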

Parameter Estimation

Should the paragraph on computing α and β from moments be moved from the main section to a new section on Parameter Estimation? When looking for MOM parameter estimates, I missed that paragraph, and other distributions have their own parameter estimation section (e.g. gamma).

What is there now is not about parameter estimation, but it would be a short step from there to estimation by the method of moments. If there is to be a section on parameter estimation, then certainly MLEs should be there too. Also, the initial "e" in "estimation" should not be capitalized in the section heading unless it's the first letter in the section heading. Michael Hardy 22:43, 16 December 2005 (UTC)
The method of moments calculation was moved back to the intro, without any discussion. I've moved it back to the parameter estimation section. Please don't undo edits that have been discussed and agreed upon without discussing it first. User:Hilgerdenaar December 5 2006
I added a section on maximum likelihood estimators. I also changed the parameters to conform with the terminology in the section on "parametrization, four parameters", at the end of the article. Dr. J. Rodal (talk) 20:37, 17 August 2012 (UTC)
In the section on maximum likelihood estimators that I added, concerning the maximum likelihood estimator, for the case of known , with unknown parameter , the correct equation is: . The classic reference by N.L.Johnson and S.Kotz, in their (1970) first edition of "Continuous Univariate Distributions Vol. 2" , Wiley, Chapter 21:Beta Distributions, page 46, contains an error: they have the incorrect sign for this equation. That's why I added: "Recall that the beta distribution has support [0,1], therefore , and hence , and therefore :"Dr. J. Rodal (talk) 20:08, 26 August 2012 (UTC)
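For readers looking for the method-of-moments step mentioned in this section, here is a minimal sketch for the two-parameter case on [0, 1]. It assumes the standard moment identities mean = α/(α+β) and variance = αβ/((α+β)²(α+β+1)); the function names are hypothetical and the simulated sample is only for illustration.

```python
import random


def beta_from_mean_var(mean: float, var: float) -> tuple[float, float]:
    """Solve the two moment equations of Beta(alpha, beta) for alpha and beta."""
    if not 0.0 < mean < 1.0 or not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("need 0 < mean < 1 and 0 < var < mean * (1 - mean)")
    common = mean * (1.0 - mean) / var - 1.0  # this quantity equals alpha + beta
    return mean * common, (1.0 - mean) * common


def beta_method_of_moments(sample: list[float]) -> tuple[float, float]:
    """Plug the sample mean and unbiased sample variance into the identities."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return beta_from_mean_var(mean, var)


# Illustration on simulated data with known parameters (2, 5).
rng = random.Random(0)
sample = [rng.betavariate(2.0, 5.0) for _ in range(20_000)]
alpha_hat, beta_hat = beta_method_of_moments(sample)
print(alpha_hat, beta_hat)  # estimates near 2 and 5
```

These estimates are often used as starting values for an iterative maximum likelihood fit, which this sketch does not attempt.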

Method of moments - four parameter estimation case - another error in first edition of Johnson and Kotz

I added a section for the four parameter estimation case using the method of moments. Elderton, (see section titled "History"), in his 1906 monograph "Frequency curves and correlation," fully discusses the four parameter case and contains the correct equations. These equations have been repeated by other authors in other books. It is curious that the classic reference by N.L.Johnson and S.Kotz, in their (1970) first edition of "Continuous Univariate Distributions Vol. 2" , Wiley, Chapter 21:Beta Distributions, page 44, equation (15) contains an important error. The support interval range is given as follows:

Where Johnson and Kotz use the identical nomenclature used by Elderton in 1906: for the sample kurtosis, for sample variance and "r" for

This is incorrect. The correct equation is given by Elderton in 1906:

This range can also be expressed in terms of the excess kurtosis or the kurtosis, as I have done in the article (but it will read very differently than in Johnson and Kotz, whether one uses the kurtosis or the excess kurtosis). Dr. J. Rodal (talk) 19:15, 7 September 2012 (UTC)


What is the entropy of a beta distribution? This document: has a formula (and it refers to "Cover and Thomas (1991)"). Can someone verify it? They say psi is the derivative of the gamma function, but usually psi represents the digamma function which is the derivative of the log of the gamma function. So I'm wondering if they have a typo. A5 01:13, 22 May 2006 (UTC)

Sorry the link was bad, I've updated it. The relevant formula is on p. 9 and is where . A5 17:32, 23 May 2006 (UTC)
psi really is the digamma function. You can write the entropy in terms of expected value of ln x and ln (1-x), and those can be written in terms of psi(a), psi(b), and psi(a+b). Aaron Denney (talk) 18:19, 16 March 2010 (UTC)
The formula for the entropy evaluates to 0 when alpha=beta=1, which is very odd, because those parameters give the maximum-entropy (flat) form of the distribution. For alpha>1, beta>1, it evaluates to a negative quantity. Entropy, as I understand it (Shannon entropy) is always positive. RandyGallistel (talk) 01:07, 22 September 2010 (UTC)
This equation is clearly wrong. The entropy of X~Be(1,1) is given as 0, when it should be the max (as Be(1,1) is the standard uniform dist.) I haven't been able to find the paper linked above or the reference to "Cover and Thomas (1991)" so until then I think it would be wise to remove a clearly incorrect equation. Hawkmp4 (talk) 18:27, 16 September 2011 (UTC)
No, the equation is correct. The entropy of X~Be(1,1) is 0, which is the maximum entropy for a distribution on [0, 1] (all other distributions have negative differential entropy, unlike in the discrete case). — Preceding unsigned comment added by (talk) 22:17, 7 June 2012 (UTC)

Great question, great comments and great response. Yes, it is counter-intuitive that the differential entropy of the beta distribution is negative (for all values of α and β except α=β=1).

  1. To explain this, I added <<The differential entropy of the beta distribution is negative for all values of α and β greater than zero, except at α = β = 1 (for which values the beta distribution is the same as the uniform distribution), where the differential entropy reaches its maximum value of zero. It is to be expected that the maximum entropy should take place when the beta distribution becomes equal to the uniform distribution, since uncertainty is maximal when all possible events are equiprobable. The (continuous case) differential entropy was introduced by Shannon in his original paper (where he named it the "entropy of a continuous distribution"), as the concluding part [1] of the same paper where he defined the discrete entropy. It is known since then that the differential entropy may differ from the infinitesimal limit of the discrete entropy by an infinite offset, therefore the differential entropy can be negative (as it is for the beta distribution). What really matters is the relative value of entropy......The relative entropy, or Kullback–Leibler divergence, is always non-negative>>
  2. I added <<(measured in nats)>>
  3. The comment by is correct: the equation : for the differential entropy is correct. I have separately verified this by using Mathematica and computing the differential entropy by integration (as per the original definition by Shannon).
  4. I added the original reference by Shannon (Shannon, Claude E., "A Mathematical Theory of Communication," Bell System Technical Journal, 27 (4):623–656, 1948) concerning the (continuous) differential entropy.
  5. I added charts showing the behavior of the differential entropy as a function of α and β
  6. I added a new subsection titled Alternative parametrizations -Two parameters -Mean and Variance, that includes charts of the differential entropy as a function of the mean and the variance.
  7. Concerning "Cover and Thomas (1991)," this is their book (Wiley-Interscience, 2nd edition, July 18, 2006, hardcover, 776 pages, ISBN-10: 0471241954, ISBN-13: 978-0471241959), which actually quotes A. C. G. Verdugo Lazo and P. N. Rathie, "On the entropy of continuous probability distributions," IEEE Trans. Inf. Theory, IT-24:120–122, 1978, for the equation for the differential entropy. Dr. J. Rodal (talk) 19:41, 4 August 2012 (UTC)
There is an interesting thread on, discussing the meandering history of the cross-entropy H. One of the writers in this thread correctly states that "T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, 2006" does not (explicitly) use the term "cross entropy": it cannot be found explicitly in the index or in the text. However, Cover and Thomas do discuss it implicitly as the addition of differential entropy and Kullback divergence (both with negative signs). See for example problem #11.17 on page 405 of Cover and Thomas, where the relationship between cross-entropy and maximum likelihood is (implicitly) discussed. Dr. J. Rodal (talk) 18:07, 4 September 2012 (UTC)


I was asked the following question: I've been trying to derive the expression for the differential entropy of the beta distribution. According to, you derived it using mathematica. I have just installed mathematica, but I cannot reproduce your result (my attempt is below). Please can you send me your 'code'. I tried

Expectation[-q*Log[q], q \[Distributed] BetaDistribution[a, b]]

and got

(a (HarmonicNumber[a] - HarmonicNumber[a + b]))/(a + b)

ANSWER: The following explicit integration gives the answer in terms of the PolyGamma functions

In[1]:= FullSimplify[ Integrate[-Expand[(((1 - x)^(-1 + beta) x^(-1 + alpha) ) ((-1 + beta) Log[1 - x] + (-1 + alpha) Log[x] - Log[Beta[alpha, beta]]))/Beta[alpha, beta]], {x, 0, 1}]]

Out[1]= ConditionalExpression[ Log[Beta[alpha, beta]] - (-1 + alpha) PolyGamma[0, alpha] - (-1 + beta) PolyGamma[0, beta] + (-2 + alpha + beta) PolyGamma[0, alpha + beta], Re[beta] > 0 && Re[alpha] > 0]

Dr. J. Rodal (talk) 14:45, 6 November 2013 (UTC)
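The closed form in Out[1] above can be cross-checked without Mathematica. The sketch below is an independent check in Python, not the code used in the thread: it compares the PolyGamma expression with a plain midpoint-rule integration of -f ln f. The `digamma` helper is a hand-written approximation (recurrence plus asymptotic series), and the numerical integral is only intended for α, β ≥ 1.

```python
from math import exp, lgamma, log


def digamma(x: float) -> float:
    """Approximate psi(x) for x > 0: shift x upward, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x  # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + log(x) - 0.5 / x - series


def log_beta(a: float, b: float) -> float:
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def entropy_closed_form(a: float, b: float) -> float:
    """ln B(a,b) - (a-1) psi(a) - (b-1) psi(b) + (a+b-2) psi(a+b), in nats."""
    return (log_beta(a, b) - (a - 1.0) * digamma(a)
            - (b - 1.0) * digamma(b) + (a + b - 2.0) * digamma(a + b))


def entropy_numeric(a: float, b: float, steps: int = 200_000) -> float:
    """Midpoint-rule estimate of the integral of -f ln f over (0, 1)."""
    lb = log_beta(a, b)
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) / steps
        log_f = (a - 1.0) * log(x) + (b - 1.0) * log(1.0 - x) - lb
        total -= exp(log_f) * log_f
    return total / steps


print(entropy_closed_form(1.0, 1.0))  # uniform case: zero
print(entropy_closed_form(2.0, 3.0), entropy_numeric(2.0, 3.0))  # both negative
```

The two values for Beta(2, 3) agree closely and are negative, consistent with the discussion above that zero is the maximum differential entropy on [0, 1].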


1) Following Claude E. Shannon, the differential entropy is h = E[-ln f(q)], where f is the probability density of q, and not h = E[-q ln(q)] (with q \[Distributed] BetaDistribution[a, b]).

2) One has to guide Mathematica for this integration. There are issues as x approaches 0 and 1, and for alpha and beta approaching 0. The Mathematica (Versions 8 and 9) Expectation function is too general to use without conditioning the expression variables. Therefore it is best to use the Mathematica Integration function and perform an explicit integration term by term as:

instead of directly using the Mathematica (Versions 8 and 9) Expectation function as:

Dr. J. Rodal (talk) 13:49, 7 November 2013 (UTC)

UPDATE (Nov 11, 2013): I sent the above to Wolfram Support and received the following answer (bold added for emphasis): "thank you for your feedback . Mathematica is not able to compute the Expectation (of the differential entropy of the beta distribution) even under the condition {alpha>0, beta>0} but is able to do so by the command you applied, will file a suggestion on this for Mathematica to have better support on computing the differential entropy of beta distribution (on future versions). Please let us know if you have any other comments or questions to our product and we will be glad to help you."

Dr. J. Rodal (talk) 00:47, 12 November 2013 (UTC)



It would be better to move some of the application section to the introduction, to give people an idea of why this is useful rather than only its mathematical definition.

Agreed Shae 18:27, 6 June 2007 (UTC)

I agree with both comments, that discussing actual applications in the introduction would be appealing to a wide range of users, therefore:

  1. I added the following paragraph to the introduction: <<The beta distribution has been applied to model the behavior of random variables limited to intervals of finite length. It has been used in population genetics for a statistical description of the allele frequencies in the components of a sub-divided population. It has also been used extensively in PERT, critical path method (CPM) and other project management / control systems to describe the statistical distributions of the time to completion and the cost of a task. It has also been applied in acoustic analysis to assess damage to gears, as the kurtosis of the beta distribution has been reported as a good indicator of the condition of gears[2] . It has also been used to model sunshine data for application to solar renewable energy utilization[3]. It has also been used for parametrizing variability of soil properties at the regional level for crop yield estimation, modeling crop response over the area of the association[4] . It has also been used to determine well-log shale parameters, to describe the proportions of the mineralogical components existing in a certain stratigraphic interval[5] . The model allows the calculation of well-logging parameters, such as GRma, GRsh, and shale density, without having to introduce them by "eye." It also allows the probabilistic calculation of the rock composition at each depth when there are more mineralogical components than logs: that is, there is a shortage of equations. In addition to this, the beta model can be used to test the hypothesis that the relationship between any two components can be regarded as random, which should have applications in reservoir characterization. It is used extensively in Bayesian inference, since beta distributions provide a family of conjugate prior distributions for binomial and geometric distributions. 
For example, the beta distribution can be used in Bayesian analysis to describe initial knowledge concerning probability of success such as the probability that a space vehicle will successfully complete a specified mission. The beta distribution is a suitable model for the random behavior of percentages.>>
  2. I added corresponding references (quoted in the above paragraph) to the list of references. Dr. J. Rodal (talk) 19:56, 4 August 2012 (UTC)
Hi Dr. J. Rodal. The two comments above date from 2007, and in the five years since, the introduction became like this, which I think is a much less cluttered introduction that survived five years of revisions. I would revert to it. Your additions belong in the Applications section.

I also think that you have added far too many images to the article; they do not add much information and instead reduce its readability. Their number should definitely be reduced. Also, captions should be removed from the images themselves and placed as text to improve readability. One could generate an infinite number of plots for each formula, with all possible combinations of the parameters, but what's the purpose?--Mpaa (talk) 21:01, 4 August 2012 (UTC)

Distribution Function

I don't know the correct formula, but in the current formula, the summand does not depend on j. So, I assume it is wrong.

Beta distribution of the second kind

There are two forms for the Beta distribution. At present only the so-called 'Beta distribution of the first kind' is discussed. The Beta distribution of the second kind does not seem to be discussed in Wikipedia as I write. Rwb001 06:26, 30 September 2006 (UTC)


Why is there SO much blank, and therefore wasted, space on this page? —The preceding unsigned comment was added by Algebra man (talkcontribs) 20:09, 8 December 2006 (UTC).

The infobox on the right hand side can sometimes cause those problems. Widen your browser window, and the blank space should go away. Baccyak4H (Yak!) 20:11, 8 December 2006 (UTC)

median of beta distribution?

Is there a simple closed form for the median of the beta distribution?

(apart from quantile(half)?)

Paul A Bristow 13:33, 21 December 2006 (UTC) Paul A. Bristow

Only for very specialized cases, such as α or β equal to one, or equal to each other, or both integer and summing to less than 6. The median is a root of $F(x) - \tfrac{1}{2}$, with F being the distribution function. The Abel–Ruffini theorem precludes algebraic solutions to general polynomials with degree greater than 4, and, when α + β ≥ 6, the distribution function will be a polynomial with degree five or higher. Lovibond (talk) 02:13, 23 January 2012 (UTC)

$$$$$$$$$$$$$$$$$$$$$ Very good question by Paul A Bristow and an excellent answer by Lovibond (talk). I added a new section as follows [the exact solutions for α=3 and β=2 and vice-versa are lengthy expressions containing cubic and square roots, so I included it as follows to save space] (Dr. J. Rodal (talk) 19:12, 6 August 2012 (UTC)) :

(*NEW Section Begins


The median of the beta distribution is the unique real number x for which the regularized incomplete beta function satisfies $I_x(\alpha, \beta) = \tfrac{1}{2}$. There is no general closed-form expression for the median of the beta distribution for arbitrary values of α and β. Closed-form expressions for particular values of the parameters α and β follow:

[Figure: Median for Beta distribution for 0≤α≤5 and 0≤β≤5]

A reasonable approximation, in the range α ≥ 1 and β ≥ 1, of the value of the median ν is given by the formula[6]

For α ≥ 1 and β ≥ 1, the relative error (the absolute error divided by the median) in this approximation is less than 4% and for both α ≥ 2 and β ≥ 2 it is less than 1%. The absolute error divided by the difference between the mean and the mode is similarly small:

[Figure: Abs[(Median-Appr.)/Median] for Beta distribution for 1≤α≤5 and 1≤β≤5]

[Figure: Abs[(Median-Appr.)/(Mean-Mode)] for Beta distribution for 1≤α≤5 and 1≤β≤5]

NEW Section Ends*) $$$$$$$$$$$$$$$$$$$$$
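Since no general closed form exists, the median is usually found numerically. The sketch below solves I_x(α, β) = 1/2 by bisection, with the distribution function approximated by a midpoint rule (intended for α, β ≥ 1). The approximation formula itself did not survive on this page, so the comparison assumes the commonly quoted form (α - 1/3)/(α + β - 2/3); treat that as an assumption, and all function names as hypothetical.

```python
from math import exp, lgamma, log


def beta_cdf(x: float, a: float, b: float, steps: int = 4000) -> float:
    """Midpoint-rule approximation of the regularized incomplete beta function."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    h = x / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += exp(log_norm + (a - 1.0) * log(t) + (b - 1.0) * log(1.0 - t))
    return total * h


def beta_median(a: float, b: float, tol: float = 1e-8) -> float:
    """Bisection on [0, 1] for the root of beta_cdf(x) - 1/2."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, a, b) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def median_approx(a: float, b: float) -> float:
    """Assumed approximation for a, b >= 1: (a - 1/3) / (a + b - 2/3)."""
    return (a - 1.0 / 3.0) / (a + b - 2.0 / 3.0)


exact = beta_median(2.0, 3.0)
approx = median_approx(2.0, 3.0)
print(exact, approx, abs(exact - approx) / exact)
```

For the power function case β = 1 the bisection result can be compared with the closed form 2^(-1/α), which follows from the distribution function x^α.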

Characteristic Function (Sum of independent betas)

I don't know their distribution. What is F1 in the Char Function? —The preceding unsigned comment was added by (talk) 18:40, 12 February 2007 (UTC).

I agree with this question, what is in the characteristic function. I can find nothing on it at Mathworld, nor in Peebles, nor in Papoulis, nor in Zwillinger, and that's all my references. Anybody know?--Phays 22:14, 17 August 2007 (UTC)

Retraction, I found this on Wolfram as the Kummer confluent hypergeometric function of the first kind. Not too helpful, but if this is silent for awhile or if someone wants, I'll add a note and link to Confluent_hypergeometric_functions--Phays 22:25, 17 August 2007 (UTC)

Anybody still interested in the characteristic function of the beta distribution, the confluent hypergeometric function of the first kind, may also find section 9.2 of Gradshteyn, I. S., and I. M. Ryzhik, Table of Integrals, Series, and Products, year=2000, Academic Press; 6th edition, isbn=78-0122947575, of benefit. Presently, this classic reference is not cited among the references in Wikipedia's article on Confluent hypergeometric functions. — Preceding unsigned comment added by Dr. J. Rodal (talkcontribs) 17:23, 8 August 2012 (UTC)

I added a new section on the characteristic function of the beta distribution: the confluent hypergeometric function (of the first kind) where I point out that in the symmetric case α = β it simplifies to a Bessel function using Kummer's second transformation as follows:

I also included accompanying plots, the real part (Re) of the characteristic function of the beta distribution displayed for symmetric (α = β) and skewed (α ≠ β) cases. Dr. J. Rodal (talk) 11:39, 16 August 2012 (UTC)

Generating beta-distributed random variates

Is anyone able to add a section how you would draw random samples from the beta distribution? Is there a direct method like a transform from uniform variates, or do you have to use rejection sampling? 08:50, 22 January 2007 (UTC) 18:45, 12 February 2007 (UTC)CRIstinaGH

To draw deviates from a Beta(a,b) random variable, where a and b are strictly positive, sample x1 from Gamma(a) and x2 from Gamma(b); the deviate is then x1/(x1+x2). You can find this in "Numerical Analysis for Statisticians" by Kenneth Lange 1999, chapter 20. The method I described is for a general Dirichlet (the N dimensional extension of the Beta); however, Lange also gives a short description of rejection methods appropriate for Beta distributions with parameters > 1. 02:18, 13 March 2007 (UTC)

For another solution that involves rejection sampling, see Cheng "Generating Beta Variates with Nonintegral Shape Parameters", Communications of the ACM 1978. This is the algorithm cited by the R implementation of draws from a standard beta distribution. (talk) 01:38, 9 September 2010 (UTC)

This article gives the formula for a random number generator that produces results that fit within the beta distribution. I was wondering if that could be included somewhere in the article, as well. (talk) 05:08, 21 May 2008 (UTC)

Note that we need to assume independence of X and Y in order to conclude that X/(X+Y) generates a beta distribution when X and Y each follow a gamma distribution with the same scale parameter. For example X/(X+X) = 1/2 with probability one, which is not a proper Beta distribution. — Preceding unsigned comment added by Tthrall (talkcontribs) 17:33, 24 November 2011 (UTC)
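The gamma-ratio recipe discussed in this thread can be sketched with Python's standard library. This is an illustration only: it assumes independent gamma draws with a common scale of 1, the function name `beta_from_gammas` is hypothetical, and the seed and parameters are arbitrary.

```python
import random


def beta_from_gammas(a: float, b: float, rng: random.Random) -> float:
    """Draw one Beta(a, b) deviate as X / (X + Y) with independent gamma X, Y."""
    x = rng.gammavariate(a, 1.0)  # shape a, scale 1
    y = rng.gammavariate(b, 1.0)  # shape b, same scale, drawn independently
    return x / (x + y)


rng = random.Random(12345)
a, b = 2.0, 3.0
draws = [beta_from_gammas(a, b, rng) for _ in range(50_000)]

sample_mean = sum(draws) / len(draws)
print(sample_mean, a / (a + b))  # sample mean should be near a / (a + b)
```

The independence caveat matters: reusing the same gamma draw for both x and y would always return 1/2 rather than a beta deviate.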

The name of a special case?

For the beta distribution:

has the form:

Has this form also a special name? —Preceding unsigned comment added by (talk) 09:13, 3 October 2008 (UTC)

Standard power function —Preceding unsigned comment added by Mochan Shrestha (talkcontribs) 15:59, 9 March 2010 (UTC)
Almost, but not quite. For me a standard power function distribution is when , so with pdf of the form . That is similar but reversed. --Rumping (talk) 15:39, 12 March 2010 (UTC)

Good point to bring up, and good answers as well. I added the β=1, α>1 case as the power function distribution among the "Shapes" cases. I also added the reverse case β>1, α=1 as the "reverse (mirror-image) power function distribution." This power function distribution is one of the few cases in which there is a closed-form solution for the median, and therefore I also added it in that section. Dr. J. Rodal (talk) 21:33, 8 August 2012 (UTC)

related distributions?

Is it worth mentioning that for large values of α + β the beta distribution converges towards a normal distribution?

in terms of a weighted-coin toss: if heads are worth 1 and tails are worth 0:

Dividing the sample-variance by the number of samples gives the expected variance of our sample's mean. —Preceding unsigned comment added by Sukisuki (talkcontribs) 22:41, 10 April 2010 (UTC)

Range of Beta function

The interpretation of the domain of the beta function is defined in the first paragraph, but the interpretation of the range is not clearly defined. If this is a probability distribution, why can the range values go over 1.0? (I understand that the cumulative distribution reaches a range of 1.0 at a domain value of 1.0.) Please can some clarification be given? —Preceding unsigned comment added by (talk) 20:21, 24 October 2010 (UTC)

It is not quite clear what you are saying or asking. The following points may help (or not) --Rumping (talk) 11:26, 25 October 2010 (UTC)
  • There is a distinction between the Beta function and Beta distributions.
  • For a probability distribution defined on (part of) the real line, the cumulative distribution function increases from 0 to 1. This is true of Beta distributions.
  • There is no requirement for probability densities of continuous random variables to be always below 1. Most Beta distributions have some densities greater than 1.
  • Typical (two-parameter) Beta distributions have support on the interval [0,1] and so a range of 1. This can easily be generalised to four-parameter Beta distributions to give different support and ranges (see Beta_distribution#Four_parameters).
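A quick numerical illustration of the third point above, as a sketch with an arbitrarily chosen Beta(5, 5): the density exceeds 1 near its peak while the total area under it is still 1. The helper name `beta_pdf` is hypothetical.

```python
from math import exp, lgamma, log


def beta_pdf(x: float, a: float, b: float) -> float:
    """Density of Beta(a, b) at a point x strictly inside (0, 1)."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1.0) * log(x) + (b - 1.0) * log(1.0 - x))


a, b = 5.0, 5.0
peak = beta_pdf(0.5, a, b)  # equals 630/256, about 2.46

# Midpoint-rule estimate of the area under the density over (0, 1).
steps = 100_000
area = sum(beta_pdf((k + 0.5) / steps, a, b) for k in range(steps)) / steps

print(peak)  # a density value, not a probability, so it may exceed 1
print(area)  # close to 1
```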


The present version has a supposed formula for the "kurtosis excess" that starts off with the expression for the 4th central moment only, so that part is wrong. But is the rest of the expression correct for either the excess kurtosis or the 4th moment? It doesn't coincide with any formula I can conveniently find. .... and it would be good to provide explicit citations for this stuff. Melcombe (talk) 13:03, 25 October 2010 (UTC)

Addressing the helpful points raised by User Melcombe:

  1. I have provided explicit citations for the Kurtosis of the Beta Distribution;
  2. I have provided some interesting applications of the Kurtosis of the Beta Distribution,
  3. I have provided charts showing the behavior of the excess Kurtosis as a function of α and β, and also, (in the new section that I added titled "Alternative parametrizations - Two parameters - Mean and Variance") as a function of the mean and the variance
  4. Both formulas for the excess kurtosis are correct: and

The formula is the formula found in most books, and it can also be obtained, for example, from mathematical software like Mathematica. However, the formula : is also correct (one can verify this by expansion, or by using Mathematica, for example), and it has the advantages of being more compact and of having the numerator expressed in terms of the difference and the sum of α and β; the term in the numerator proportional to (α - β) vanishes for the (important) symmetric case α = β.

Dr. J. Rodal (talk) 16:54, 4 August 2012 (UTC)
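The two formulas did not survive on this page, so the sketch below rests on an assumption about what they are: the expanded form commonly printed in references and a compact form whose numerator uses (α - β)² and α + β. Both are checked against the excess kurtosis computed directly from the raw moments E[X^k] = ∏_{r=0}^{k-1} (α + r)/(α + β + r). Function names are hypothetical.

```python
def raw_moment(k: int, a: float, b: float) -> float:
    """E[X**k] for X ~ Beta(a, b)."""
    m = 1.0
    for r in range(k):
        m *= (a + r) / (a + b + r)
    return m


def excess_kurtosis_from_moments(a: float, b: float) -> float:
    m1, m2, m3, m4 = (raw_moment(k, a, b) for k in (1, 2, 3, 4))
    var = m2 - m1**2
    mu4 = m4 - 4.0 * m1 * m3 + 6.0 * m1**2 * m2 - 3.0 * m1**4  # 4th central moment
    return mu4 / var**2 - 3.0


def excess_kurtosis_compact(a: float, b: float) -> float:
    """Assumed compact form: numerator in terms of (a - b) and (a + b)."""
    s = a + b
    return 6.0 * ((a - b) ** 2 * (s + 1.0) - a * b * (s + 2.0)) / (
        a * b * (s + 2.0) * (s + 3.0))


def excess_kurtosis_expanded(a: float, b: float) -> float:
    """Assumed expanded form, as commonly printed in references."""
    num = a**3 - a**2 * (2.0 * b - 1.0) + b**2 * (b + 1.0) - 2.0 * a * b * (b + 2.0)
    return 6.0 * num / (a * b * (a + b + 2.0) * (a + b + 3.0))


for a, b in [(1.0, 1.0), (2.0, 5.0), (0.5, 0.5)]:
    print(a, b, excess_kurtosis_from_moments(a, b),
          excess_kurtosis_compact(a, b), excess_kurtosis_expanded(a, b))
```

For α = β = 1 all three give -6/5, the excess kurtosis of the uniform distribution.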

Difference between two beta-distributed RVs

In contingency table analysis, and I'm sure in many other situations, the need to test the difference between two beta-distributed random variables arises. The posterior distribution of the difference is the convolution of the two original beta distributions. However I have seen unsubstantiated claims in a couple of places that the convolution of two beta distributions has no analytical form. Has it been proved that an analytical form doesn't exist, or is there just no known analytical form? I'm assuming you can still estimate the mean of the posterior as the difference of the means of the two beta distributions, and the variance of the posterior as the sum of the variances (if independent), or the sum of the variance plus twice the covariance (if the covariance is non-zero)? Is there a closed form for covariance between two beta distributions? I *think* that the 1st order Taylor series approximation of the covariance is given by as shown in Eq (4) of "An algorithm for generating positively correlated Beta-distributed random variables with known marginal distributions and a specified correlation", but I'm not sure if it's OK to drop the and variables.

It would be great if there could be some treatment of these topics in the main article. LukeH (talk) 01:34, 27 October 2010 (UTC)

There is a paper on this very topic by Pham-Gia and Turkkan (1993), "Bayesian analysis of the difference of two proportions", Commun. Statist.-Theory Meth, 22(6), 1755-1771. There is an expression in the paper for the posterior. IapetusWave (talk) 18:08, 29 January 2012 (UTC)
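On the moment part of the question above, a Monte Carlo sketch can confirm that, for independent beta variables, the mean of the difference is the difference of the means and the variance is the sum of the variances. It says nothing about a closed-form density for the difference. The parameter values and names below are arbitrary illustrations.

```python
import random


def beta_mean(a: float, b: float) -> float:
    return a / (a + b)


def beta_var(a: float, b: float) -> float:
    return a * b / ((a + b) ** 2 * (a + b + 1.0))


a1, b1 = 3.0, 5.0
a2, b2 = 2.0, 6.0

# Analytic moments of D = X - Y for independent X ~ Beta(a1, b1), Y ~ Beta(a2, b2).
mean_diff = beta_mean(a1, b1) - beta_mean(a2, b2)
var_diff = beta_var(a1, b1) + beta_var(a2, b2)

# Monte Carlo check using independent draws.
rng = random.Random(2024)
n = 50_000
diffs = [rng.betavariate(a1, b1) - rng.betavariate(a2, b2) for _ in range(n)]
mc_mean = sum(diffs) / n
mc_var = sum((d - mc_mean) ** 2 for d in diffs) / (n - 1)

print(mean_diff, mc_mean)
print(var_diff, mc_var)
```

With correlated variables the variance would also need the covariance term, which this sketch does not cover.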

Alternative Parameterization in terms of "sample size" (in Kruschke's book)

A useful parameterization of the Beta distribution is in terms of its mean and sample size. This is useful for Bayesian estimation... for example, one would typically place a uniform(0, 1) prior over the mean of a Beta distribution, and a vague prior over the sample size. This is much easier than specifying an equivalent prior over alpha and beta. I'm not up on wiki math code, so perhaps someone could add this information. The two parameter sets are related via: alpha = (mean) x (sample size), beta = (1 - mean) x (sample size). The text "Doing Bayesian Data Analysis" by Kruschke provides a reference for this (p. 83), I am sure there are others. (talk) 17:54, 16 February 2011 (UTC)

Shouldn't the alpha and beta be 1 more than those you wrote? E.g. alpha = (mean) x (sample size) + 1; beta = (1 - mean) x (sample size) + 1 (talk) 03:30, 2 April 2012 (UTC)
(Dr. Rodal responds:) I looked into this new book by Kruschke to trace the origin of this nomenclature (sample size). I find the wording in page 81-83 of Kruschke's book to be confusing, as he precedes the use of his nomenclature by motivating the shape parameters α ("a" in Kruschke's book) and β ("b" in Kruschke's book) as "heads" and "tails" in a total number of flips. So far so good, as "heads" and "tails" corresponds to a Bernoulli distribution Beta(0,0). But then Kruschke writes "that's tantamount to having previously observed one head and one tail, which corresponds to a=1 and b=1... the uniform distribution." That's why user quoted this sample size as being equivalent to the use of a prior uniform distribution. It is not. It is instead due to the Haldane prior Beta(0,0) (which indeed corresponds to a Bernoulli distribution). See the section titled "6.3 Bayesian inference" for further details. I re-wrote the section on Alternative Parametrization to correct it as follows:
Denoting by αPosterior and βPosterior the shape parameters of the posterior beta distribution resulting from applying Bayes theorem to a binomial likelihood function and a prior probability, the interpretation of the addition of both shape parameter to be sample size = ν = αPosterior + βPosterior is only correct for the Haldane prior probability Beta(0,0). Specifically, for the Bayes (uniform) prior Beta(1,1) the correct interpretation would be sample size= αPosterior + βPosterior - 2, or ν=(sample size)+2. Of course, for sample size much larger than 2, the difference between these two priors becomes negligible. (See section titled "Bayesian inference" for further details.) In the rest of this article ν = α + β will be referred to as "sample size", but one should remember that it is, strictly speaking, the "sample size" only when using a Haldane Beta(0,0) prior.
Bottom line: the equations in the beta distribution article and in Kruschke's book in terms of this parametrization were correct (with ν=α+β as a definition, just meaning the summation of the shape parameters in the beta distribution) but the interpretation of ν=α+β as corresponding to a "sample size" originating from a binomial distribution is ONLY correct for the Haldane prior Beta(0,0).
Thanks to user for pointing out this problem. I find Wikipedia's "Talk Page" a really great feature because it enables exchanges such as this. Dr. J. Rodal (talk) 00:32, 25 September 2012 (UTC)
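The conversion discussed in this section is short enough to sketch directly. The function names are hypothetical; the last lines illustrate, for a made-up count of 3 successes and 7 failures, that ν = α + β of the posterior equals the number of observations under a Haldane Beta(0,0) prior and the number of observations plus 2 under a uniform Beta(1,1) prior.

```python
def shapes_from_mean_nu(mean: float, nu: float) -> tuple[float, float]:
    """alpha = mean * nu, beta = (1 - mean) * nu, with nu = alpha + beta."""
    return mean * nu, (1.0 - mean) * nu


def mean_nu_from_shapes(alpha: float, beta: float) -> tuple[float, float]:
    nu = alpha + beta
    return alpha / nu, nu


alpha, beta = shapes_from_mean_nu(0.25, 8.0)
print(alpha, beta)                       # 2.0 6.0
print(mean_nu_from_shapes(alpha, beta))  # (0.25, 8.0)

# Posterior shape parameters after 3 successes and 7 failures (n = 10).
successes, failures = 3, 7
nu_haldane = (0 + successes) + (0 + failures)  # Beta(0,0) prior: nu = n
nu_uniform = (1 + successes) + (1 + failures)  # Beta(1,1) prior: nu = n + 2
print(nu_haldane, nu_uniform)                  # 10 12
```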

Intuitive interpretation?

To grasp the meaning of the beta distribution, an intuitive interpretation and/or an illustrative example of application from everyday life would be very helpful for the layperson. --Typofier 08:23, 22 June 2011 (UTC) — Preceding unsigned comment added by Typofier (talkcontribs)

The applications section lists a couple good ones, although framed in mathematical language. The Bayesian interpretation can be described in concrete terms as follows. Suppose you have an Internet ad you place on your webpages and keep track of click-throughs. Each occurrence can be thought of as a Bernoulli trial, and a click-through can be considered a "success". After gathering some amount of observations/data, e.g. 1000 trials and 50 click throughs, you'd like to know what the probability p of a click-through is. This probability p is itself distributed probabilistically as a beta distribution, if you assume that your prior knowledge was zero (mathematically, the prior distribution of p is assumed to be uniform on (0,1) ). --C S (talk) 18:53, 4 July 2012 (UTC)
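The click-through example can be made concrete with a few lines, assuming the uniform prior stated above, i.e. Beta(1, 1), and the standard conjugate update in which successes are added to α and failures to β. The function name `posterior_beta` is hypothetical.

```python
from math import sqrt


def posterior_beta(successes: int, trials: int,
                   prior_alpha: float = 1.0,
                   prior_beta: float = 1.0) -> tuple[float, float]:
    """Conjugate update of a Beta prior with binomial data."""
    failures = trials - successes
    return prior_alpha + successes, prior_beta + failures


# 1000 ad impressions, 50 click-throughs, uniform prior on p.
alpha_post, beta_post = posterior_beta(50, 1000)  # Beta(51, 951)

total = alpha_post + beta_post
post_mean = alpha_post / total
post_sd = sqrt(alpha_post * beta_post / (total**2 * (total + 1.0)))

print(alpha_post, beta_post)
print(post_mean, post_sd)  # posterior mean 51/1002, with a small spread around it
```

The posterior mean is close to the raw rate 50/1000, and the posterior standard deviation quantifies how uncertain p remains after 1000 trials.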

Multivariate beta distribution and Dirichlet

The article currently defines the Dirichlet distribution as "the multivariate generalization of the beta distribution." Certainly, there are other multivariate distributions, such as those for random matrices. Refer, for example, to C. G. Khatri, "On the mutual independence of certain statistics." Annals of Mathematical Statistics, 30 : 4 : 1258--1262 (1959). Accordingly, I have changed "the" to "a" in the quoted portion of the article. Lovibond (talk) 03:54, 22 September 2011 (UTC)

Should the Mean and the Variance based on Ln[X] (rather than X) be included in the statistical property box?

Currently the Mean and the Variance based on Ln[X] are in the property box for the Beta Distribution:

I presume that these have been added because the Ln[X] transformation extends the distribution from a bounded [0,1] domain for X to a semi-infinite Ln[X] domain (for X approaching zero)?

QUESTION: Should we have these Ln[X] properties in the statistical property box rather than just in the text?

The same could be done for other statistical distributions on a bounded domain. However, I could not find such an addition in the property box for other distributions. I don't feel strongly about this one way or another, but it would be nice to arrive at some convention as to what properties should be included in the statistical property box for distributions in Wikipedia.

Also, the Ln[X] transformation naturally leads to a discussion of Johnson distributions, which I have not found presently in Wikipedia.


Dr. J. Rodal (talk) 14:53, 11 August 2012 (UTC)

It's quite useful for statisticians to have logarithmic expectations in some kind of look-up table, as statistical procedures (e.g. EM, variational Bayes) sometimes involve taking expectations of log-likelihoods or log-pdfs. That said, as you say, most of the standard positive-valued continuous distributions don't have logarithmic expectations listed. Perhaps they should be added? There may be less of a case to include logarithmic variances: is there a case for a separate article for a logarithmic-beta distribution? What is a Johnson distribution? Wjastle (talk) 19:54, 11 August 2012 (UTC)
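For reference, the standard logarithmic moments of X ~ Beta(α, β) under discussion can be written as below, with ψ the digamma function and ψ₁ the trigamma function. This is a summary of well-known results, not a quotation of the property box.

```latex
\begin{align}
\operatorname{E}[\ln X] &= \psi(\alpha) - \psi(\alpha+\beta), \\
\operatorname{E}[\ln(1-X)] &= \psi(\beta) - \psi(\alpha+\beta), \\
\operatorname{var}[\ln X] &= \psi_1(\alpha) - \psi_1(\alpha+\beta).
\end{align}
```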

Great response, Wjastle, you make very good points, and clearly stated. Johnson distributions are a family of distributions proposed by Johnson in 1949 (Biometrika Vol 36 p.148) that are based on the transformation of a standard normal variate. He proposed logarithmic and hyperbolic sine transformations with varying number of parameters. One of them, the Johnson SB transformation, overlaps the beta distribution in the Pearson (skewness^2,kurtosis) plane, and actually covers a greater area than the beta distribution. Another, the SU distribution, is able to cover regions of the (skewness^2,kurtosis) plane that are not covered by any of the Pearson distributions. I just made a cursory search with Google (only looking at the first few entries) and I did not find any comprehensive article, but the following one has an amusing description:

Thanks Dr. J. Rodal (talk) 22:26, 11 August 2012 (UTC)

Kurtosis bounded by the square of the skewness

I added a new section titled "Kurtosis bounded by the square of the skewness," which contains the following equation:


The region occupied by the beta distribution is bounded by the following two lines in the (skewness^2,kurtosis) plane, or the (skewness^2,excess kurtosis) plane:


I am aware that the (otherwise excellent) reference "Gupta (Editor), Arjun K. (2004). Handbook of Beta Distribution and Its Applications. CRC Press. pp. 42. ISBN 978-0824753962." quoted in the rest of this Wikipedia article instead has this equation on page 42 (in section VII of the chapter "Mathematical properties of the Beta distribution" by Gupta and Nadarajah):

Gupta quotes Karian (1996) as the source of this equation. The lower bound for this equation is correct: it is the "impossible region" previously found by K. Pearson. However, the upper bound equation in Gupta:

is incorrect. The correct equation is:

as correctly found by K. Pearson practically a century ago (one can also verify this by numerical examples, as I included in the Wikipedia article). Dr. J. Rodal (talk) 22:20, 12 August 2012 (UTC)
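A numerical check of the kind mentioned above might look like the sketch below. It assumes the standard closed-form skewness and excess kurtosis of Beta(α, β), and it takes the bounds to be skewness^2 + 1 < kurtosis < (3/2) skewness^2 + 3, which is the well-known form of Pearson's limits; the equations themselves are not reproduced in this comment, so that form is an assumption here. The function names and the grid of shape values are illustrative.

```python
import math


def beta_skewness(a, b):
    """Skewness of Beta(a, b)."""
    return 2 * (b - a) * math.sqrt(a + b + 1) / ((a + b + 2) * math.sqrt(a * b))


def beta_excess_kurtosis(a, b):
    """Excess kurtosis of Beta(a, b)."""
    num = 6 * ((a - b) ** 2 * (a + b + 1) - a * b * (a + b + 2))
    den = a * b * (a + b + 2) * (a + b + 3)
    return num / den


def within_pearson_bounds(a, b):
    """Check skewness^2 + 1 < kurtosis < 1.5 * skewness^2 + 3."""
    s2 = beta_skewness(a, b) ** 2
    kurtosis = beta_excess_kurtosis(a, b) + 3
    return s2 + 1 < kurtosis < 1.5 * s2 + 3


shapes = [0.1, 0.5, 1.0, 2.0, 5.0, 50.0]  # illustrative grid of shape parameters
violations = [(a, b) for a in shapes for b in shapes if not within_pearson_bounds(a, b)]
print("violations:", violations)
```

A grid check like this is only a sanity test on the chosen points, not a proof of the bounds.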

Mean absolute deviation around the mean

I added a new section titled "Mean absolute deviation around the mean". A few notes:

  • At the time of this writing, Wikipedia "Mean deviation" redirects to the article Absolute deviation which very succinctly discusses only the discrete case for the mean absolute deviation, and does not contain formulas for the continuous case.
  • The term "absolute deviation" does not uniquely identify a measure of statistical dispersion, as there are several measures that can be used to measure absolute deviations, and there are several measures of central tendency that can be used as well. Thus, to uniquely identify the absolute deviation it is necessary to specify both the measure of deviation and the measure of central tendency. Unfortunately, the statistical literature has not yet adopted a standard notation, as both the mean absolute deviation around the mean and the median absolute deviation around the median have been denoted by their initials "MAD" in the literature, which may lead to confusion, since in general, they may have values considerably different from each other.
  • The classic reference by N.L.Johnson and S.Kotz, in their (1970) first edition of "Continuous Univariate Distributions Vol. 2" , Wiley, Chapter 21:Beta Distributions, page 40, Eq. (10) contains an error for the mean absolute deviation: the exponent in the denominator should be (p+q+1) instead of (p+q). That is why I quote Gupta instead of Johnson and Kotz for this formula.
  • Weisstein, Eric W. "Mean Deviation." From MathWorld--A Wolfram Web Resource, contains expressions for the mean absolute deviation of a number of distributions. Unfortunately, as of this writing, the expression given there for the beta distribution is unnecessarily lengthy and complicated: it involves the computation of three different expressions for the beta function (instead of just one). As of this writing, Weisstein's "Mean Deviation" article does not cite either Johnson and Kotz or Gupta. It can be shown that Weisstein's expression (although unnecessarily complicated) is identical to Gupta's shorter expression and to Johnson and Kotz's (after correcting their above-mentioned error). The expression by Weisstein was apparently obtained by using Mathematica (as he includes a Mathematica notebook with the derivation). As of this writing, Mathematica version 8 does not "know" the beta function identities necessary to transform Weisstein's expression into the simpler expression when using Mathematica's standard "FullSimplify" function. However, it is easy to use Mathematica to derive the simpler expression (by including the relevant beta function identity).

Dr. J. Rodal (talk) 15:36, 23 August 2012 (UTC)
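The exponent correction noted above can be checked with a short sketch. It assumes the closed form 2 α^α β^β / (B(α, β) (α+β)^(α+β+1)) for the mean absolute deviation around the mean, with Johnson and Kotz's p, q read as α, β. The function names, the test values α = 2, β = 3 and the midpoint-rule integration are illustrative choices, not taken from any of the cited references.

```python
import math


def beta_function(a, b):
    """B(a, b) computed through log-gamma for numerical stability."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def mad_closed_form(a, b, exponent_offset=1):
    """2 a^a b^b / (B(a,b) (a+b)^(a+b+offset)); offset=1 is the corrected exponent."""
    return 2 * a**a * b**b / (beta_function(a, b) * (a + b) ** (a + b + exponent_offset))


def mad_numeric(a, b, steps=100_000):
    """Midpoint-rule estimate of E|X - mean|; assumes a, b >= 1 (bounded density)."""
    mean = a / (a + b)
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += abs(x - mean) * x ** (a - 1) * (1 - x) ** (b - 1)
    return total * h / beta_function(a, b)


a, b = 2.0, 3.0
print("exponent (p+q+1):", mad_closed_form(a, b))
print("exponent (p+q):  ", mad_closed_form(a, b, exponent_offset=0))
print("numerical:       ", mad_numeric(a, b))
```

For these values the (p+q) exponent gives a result above 1/2, which is not possible for the mean absolute deviation of a variable confined to [0, 1].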

Fisher information matrix

I wrote a section on the Fisher information matrix for the beta distribution. I framed its derivation in terms of the log likelihood function (as done for example by E.T. Jaynes in "Probability theory, the logic of science", A.W.F. Edwards in "Likelihood", and several others) instead of the probability density function conditional on the value of a parameter, as done in the Wikipedia article on Fisher Information, to emphasize its main role in parameter estimation. These two ways to frame it are equivalent, and whether to choose one or the other is a matter of preference and of the context in which one is writing. For the four parameter case I quote the recent article by Aryal and Nadarajah because it may be more readily available to an Internet audience, since its ".pdf" is presently freely available through the Internet. This article by Aryal and Nadarajah contains an error for one Fisher information component: the component given by the log likelihood function's second order derivative with respect to the minimum "a" (or "c" in Aryal and Nadarajah's nomenclature). The expression in their article is incorrect in several ways, and it appears incorrectly twice (first as the second order partial derivative of the likelihood function and then again as the expected value for the Fisher information component). I have corrected their error and placed the correct equation in the Wikipedia article. One can verify this by differentiating the log likelihood function and carrying out the integration for the expected value. One can also readily see from symmetry arguments that this Fisher information component is incorrect in Aryal and Nadarajah, since the component given by second order differentiation with respect to the minimum "a" and the component given by second order differentiation with respect to the maximum "c" (or "d" in Aryal and Nadarajah's nomenclature) should be symmetric.
It is curious that Aryal and Nadarajah do not use the trigamma function in their expression for the first three Fisher information matrix components, and instead express it in terms of the (more lengthy expressions for the) derivatives of the gamma function. Dr. J. Rodal (talk) 18:11, 13 September 2012 (UTC)
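To illustrate the remark about the trigamma function, here is the standard two-parameter case written out, as a sketch for a single observation x on (0, 1). The four-parameter components discussed above are not reproduced here. ψ₁ denotes the trigamma function.

```latex
\begin{align}
\ln \mathcal{L}(\alpha,\beta \mid x) &= (\alpha-1)\ln x + (\beta-1)\ln(1-x) - \ln \mathrm{B}(\alpha,\beta), \\
\mathcal{I}_{\alpha\alpha} &= -\operatorname{E}\!\left[\frac{\partial^2 \ln\mathcal{L}}{\partial\alpha^2}\right] = \psi_1(\alpha) - \psi_1(\alpha+\beta), \\
\mathcal{I}_{\beta\beta} &= -\operatorname{E}\!\left[\frac{\partial^2 \ln\mathcal{L}}{\partial\beta^2}\right] = \psi_1(\beta) - \psi_1(\alpha+\beta), \\
\mathcal{I}_{\alpha\beta} &= -\operatorname{E}\!\left[\frac{\partial^2 \ln\mathcal{L}}{\partial\alpha\,\partial\beta}\right] = -\psi_1(\alpha+\beta), \\
\psi_1(z) &= \frac{d^2 \ln\Gamma(z)}{dz^2} = \frac{\Gamma''(z)}{\Gamma(z)} - \left(\frac{\Gamma'(z)}{\Gamma(z)}\right)^{2}.
\end{align}
```

The last line shows why an expression in derivatives of the gamma function is equivalent but longer than the trigamma form.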

Dr. J. Rodal, in this section, is your expression for the log likelihood correct? You have: log likelihood(p|H) = H*log(p) - (1-H)*log(1-p). Except, shouldn't the likelihood be: log likelihood(p|H) = log[Pr(H|p)] = log[p^H * (1-p)^(1-H)] = log[p^H] + log[(1-p)^(1-H)] = H*log(p) + (1-H)*log(1-p)? Sorry if I am missing something. Thanks for the help! — Preceding unsigned comment added by (talk) 22:57, 17 October 2012 (UTC)

Dear User:, thanks for catching this in the section "Jeffreys' prior probability ( Beta(1/2,1/2) for a Bernoulli or for a binomial distribution )". You are correct: the (-) should have been a (+), and it should have read log likelihood(p|H) = H*log(p) + (1-H)*log(1-p). I made this correction in the text as well. Dr. J. Rodal (talk) 02:56, 19 October 2012 (UTC)
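For completeness, the corrected log likelihood leads to the Jeffreys prior as sketched below, for a single Bernoulli trial with outcome H in {0, 1} and E[H] = p.

```latex
\begin{align}
\ln \mathcal{L}(p \mid H) &= H \ln p + (1-H)\ln(1-p), \\
\frac{\partial^2 \ln \mathcal{L}}{\partial p^2} &= -\frac{H}{p^2} - \frac{1-H}{(1-p)^2}, \\
\mathcal{I}(p) &= -\operatorname{E}\!\left[\frac{\partial^2 \ln \mathcal{L}}{\partial p^2}\right] = \frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)}, \\
\sqrt{\mathcal{I}(p)} &= p^{-1/2}(1-p)^{-1/2} \;\propto\; \operatorname{Beta}\!\left(\tfrac12,\tfrac12\right).
\end{align}
```

With the (-) sign, the two terms of the expected second derivative would not combine into 1/(p(1-p)), so the sign matters for the result.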

"It has also been" excessively used

"It has also been" is used 5 times in a row in the lead. Boooring.... :-) I guess the intro is too loong and descriptive and most of those uses (all of them?) could go into a (new and first) section of their own. - Nabla (talk) 14:02, 18 September 2012 (UTC)

Introduction "has also been" :-) modified to accommodate your request to eliminate repetition of "has also been". Dr. J. Rodal (talk) 14:12, 19 September 2012 (UTC)

Length of article

Discounting lists, this article is currently one of the longest on Wikipedia. The recent additions by User:Dr. J. Rodal have been fantastic but it is extremely disorienting to attempt to read through the article in its current state. I thought I would create a section to discuss ways in which it can be pruned. Some initial ideas:

  • The four-parameter form could be restricted to a separate article.
  • The parameter estimation section is in general so substantial that I would suggest only giving an overview within this article, with a section link to a separate article.
  • Bayesian estimation using the beta distribution is extremely large for a subsection, and due to its importance as an application, probably warrants being migrated to a separate article where it can be dealt with in a more structured way.

I don't have any experience with writing articles on probability distributions on Wikipedia (but plenty of experience of reading them) so I thought I'd see what others think. --Iae (talk) 12:31, 10 October 2012 (UTC)

Thank you for the feedback. My comments follow:
1) The length of the article on the beta distribution being <<twice the size of the (already substantial) Normal distribution article>> may be justified by the fact that the Beta distribution is a family of statistical distributions that contains a great number of distributions (for example the arcsine distribution, the uniform distribution, the Wigner semi-elliptical distribution, the parabolic distribution, the power distribution, etc). Also, in Pearson's space of squared-skewness vs. kurtosis, the normal distribution is just a POINT, while the Beta distribution occupies a large region of infinite extent. Hence there are many more points to be covered concerning the Beta distribution.
2) As part of the Wikipedia process, we would need to reach a consensus (including the feedback from Mathematics and Statistics editors) as to whether it is necessary to reduce the length of the article on the Beta distribution. My point of view is that Statistics readers come to Wikipedia for specific information, and the more information the better. I agree however, that the more organization, the better. I think that this is accomplished by sections and subsections in the Wikipedia articles.
3) It is not clear that migrating sections into separate articles makes the reading and understanding better. On the contrary, it makes it more difficult to refer to previous equations and graphs, as one would have to jump from webpage to webpage to compare them instead of simply scrolling up and down on the same page. What one may gain in reading speed (by shortening the article), one loses in integrity (in the ability to see the connections and coupling between different sections). For example, the most straightforward division would be to have one article on the two-parameter case and another one on the four-parameter case. However, this would affect the integrity of the discussion of the Fisher information matrix components and the maximum likelihood estimation, for example, as the four-parameter Fisher information case is shown in the article to be intimately linked to the two-parameter case (for the αα and ββ components).
4) If there is a consensus on the length of articles in Wikipedia and that this article on the beta distribution needs to be shortened, User:Iae's ideas are very reasonable, particularly the suggestion to divide the article into the two-parameter case and the four-parameter case (as this would cut by half the sections on parameter estimation, etc.). Dr. J. Rodal (talk) 14:34, 10 October 2012 (UTC)
The Normal distribution article was probably a bad example: I didn't mean to put it forward as a like-for-like comparison, but more as an instance of a potentially longer article that has been condensed using sub-article links, although it doesn't do this as well as I thought.
I hadn't realised the two parametrisations of the distribution were linked so closely but, on the contrary, perhaps this is as much an argument for a split as it is against a split. Because discussion of the four-parameter version is spread sporadically throughout the body of the article (and only really gets introduced towards the end) it's difficult to get a feel for the overall nature of this distribution, whereas, if it were confined to its own article, its relationship to the 2-parameter form could be more fully expounded and emphasised, and specific information relating to it would be more easily navigable.
In any case, as Jason Quinn says, there's no need to rush any decision (particularly as you still appear to be in the process of adding content), but it's worth bearing in mind and getting feedback / ideas in the meantime. --Iae (talk) 11:45, 11 October 2012 (UTC)
Dr. J. Rodal (talk) 14:39, 11 October 2012 (UTC) response: Iae makes very coherent and reasonable arguments. I would also add that other benefits of a new webpage titled "Beta distribution with four parameters" would be that it would be able to have its own "statistical box" at the upper right hand corner (although the length of some of the expressions -for example the Fisher information components- may preclude their inclusion) and we could make a better connection with Pearson's type I distribution (as it can be shown that the four parameter case is really Pearson's Type I, as Karl Pearson himself showed in some of his papers) QUESTIONS: If we were to start new Wikipedia webpage(s) associated with the Beta distribution,
1) Where should we start the process (as initially it would be a webpage "under construction" -as it will take some time to build the new page to the point where it is satisfactory- in a new webpage for example titled "Beta distribution with four parameters" or in the sandbox of the person attempting to start the new webpage)?
2) Should we rename the existing webpage to "Beta distribution with two parameters" (to differentiate it from the new webpage "Beta distribution with four parameters")? If so, how does one edit the name of an existing webpage?
3) Do we need to have this banner "This article may be too long to read and navigate comfortably. Please consider splitting content into sub-articles and/or condensing it. (October 2012)" at the top of the present article? - I don't understand the purpose of these banners :-), and I wonder whether other new Wikipedia users may also feel the same, as the banner with the broom somehow looks like a warning to readers that there is something wrong with the article (particularly the "condensing" part of it may be taken as an invitation to users to come and delete parts of the article). Thanks Dr. J. Rodal (talk) 14:39, 11 October 2012 (UTC)
1) I think it'd be best to go straight ahead with the article. There is already a lot of content / references that can be ported over so I think it'd be fine. I'm sure there's some sort of 'Under construction' banner to stop people closing it until it's in a satisfactory initial state.
Do Johnson and Kotz refer to the four parameter version by any more descriptive name? For instance see "A generalisation of the beta distribution with applications" by McDonald and Xu, for whom a 4-parameter Beta is called a generalised Beta of either the first or second kind and for which we have an (orphaned and largely copied verbatim from the paper) article at Generalized Beta distribution. But how to unite all the different versions of the Beta in a clean way can be discussed in a new section.
2) "Beta distribution" without qualification is nearly always in reference to the two-parameter version so I don't think this article needs renaming (perhaps just clarification in the lede that it is the "standard beta of the first kind" or however else it is referred to).
3) The banner is essentially just to let people know there is a discussion about the article length here, and to chip in with suggestions. It is a shame the banner is so large and ugly, but I only really used it because as far as I was aware it's 'standard procedure'. --Iae (talk) 15:10, 11 October 2012 (UTC)
1) I do not quite agree with the statement "Beta distribution without qualification is nearly always in reference to the two-parameter version" as the answer to the question "Do Johnson and Kotz refer to the four parameter version by any more descriptive name? " is that this most important reference (Johnson and Kotz) refers to the four-parameter Beta distribution as THE Beta distribution! (see page 37 of Johnson and Kotz's first edition). The chapter on the Beta distribution starts with the four-parameter case as the definition of the Beta distribution and they deal with the two-parameter case as a special case.
2) Also, McDonald and Xu's paper reveals the perils of referring to these distributions as simply "Beta". It is curious that their paper does not refer to the early (different) generalizations by Johnson (Johnson, N.L. (1949). "Systems of frequency curves generated by methods of translation". Biometrika 36: 149–176) on Johnson distributions (which in my opinion are deserving of their own Wikipedia webpage, yet there doesn't seem to be one at the moment); McDonald and Xu's distribution is a different generalization (departing from Pearson's original parametrizations). In any case, I do not like undescriptive names (like Generalized Beta distribution, as there are many generalizations). Beta distribution with four parameters is more descriptive, and it is used by a number of references, going back all the way to the beginning of the 20th century.
3) There will be issues of disambiguation with a page simply titled Beta distribution. Perhaps some other experienced user can help establish whether and how it is possible to modify the existing title page from Beta distribution to Beta distribution with two parameters.
1 & 3) I've sadly not had the pleasure of reading (or more accurately, affording ;) the Johnson and Kotz book. I still think despite their influence most readers referring to Wikipedia will expect a search for "Beta distribution" to refer to the simple two parameter form. My approach would be to add a link at the top of this article to a disambiguation page Beta distribution (disambiguation) which in turn links to anything that might conceivably be called a "Beta distribution". May I suggest creating a new talk section for this matter? It may attract feedback more qualified than my own, buried inside this discussion on article length :)
4) You're right, my mistake. I'll remove the banner, seeing as discussion is ongoing here. It would be interesting to get feedback from other editors however. --Iae (talk) 11:41, 12 October 2012 (UTC)
Thank you for your further comments and further suggestions. They make a lot of sense and have been very constructive. In summary, we are in agreement regarding decreasing the length of this article and that the first step should be to split the 4 parameter case into a separate article. At the moment I am leaning towards moving the 2-parameter Beta distribution case to a new page titled "Beta distribution with two parameters" and the 4 parameter case to a new page titled "Beta distribution with four parameters", and changing the Beta distribution page to a page with a brief summary about beta distributions that will link to a number of Beta distributions (including the 2 and 4 parameter cases, the beta prime distribution, the McDonald generalization, the Pearson distribution, the -yet to be constructed- Johnson distributions, etc.). I also look forward to further comments. Dr. J. Rodal (talk) 14:27, 12 October 2012 (UTC)

I tend to be suspicious of complaints of this kind ("This article is too (1) technical; (2) long; (3) unimportant; (4) whatever."). I can begin to see why someone might think this one is too long. Before opining further, I'll have to look at it further. Michael Hardy (talk) 17:45, 10 October 2012 (UTC)

Comment The relevant guideline is Wikipedia:Article size. I agree that this article is getting too long because the loadtime is noticeably slow and my browser experiences lag when editing it. I think that Iae's ideas are good ones. There is no reason for haste in such matters so splitting the article does not need to be right away. Jason Quinn (talk) 18:23, 10 October 2012 (UTC)


revision 527168631 by placed the Mean and Sample Size Parametrization in the introductory paragraph. This can be justified by the fact that this parametrization is discussed throughout the text in different sections. However, other more (historically) important parametrizations (for example the four parameter parametrization) are also discussed throughout the text. Is there a reason why (only) the Mean and Sample Size should be discussed in the introductory paragraph ? Should all parametrizations be discussed in the introductory paragraph? Isn't it adequate to discuss the parametrizations under their own section? Alternatively, should the parametrization section be moved near the top of the article? Dr. J. Rodal (talk) 17:56, 9 December 2012 (UTC)

Beta regression

Ferrari and Cribari-Neto (2004) propose a Beta regression algorithm. This approach is not very popular (doesn't have a wikipedia page) but I think it's worth mentioning it here. — Preceding unsigned comment added by (talk) 08:29, 19 February 2013 (UTC)

For anybody interested in "Beta Regression", see: Ferrari and Cribari-Neto, and Cribari-Neto and Zeileis. Dr. J. Rodal (talk) 14:18, 19 February 2013 (UTC)

What is the Beta Distribution?

Unless you already have enough grounding in mathematics to know what the Beta Distribution is, this article never actually explains it. The first sentence (which is supposed to be accessible to lay people) is just a jargon-filled way of saying "it's an equation with some variables in it". — Preceding unsigned comment added by (talk) 21:02, 20 February 2013 (UTC)

Length and intelligibility of the article

From one point of view, the article is great as a quite detailed description of the beta distribution and its properties. However, considering that Wikipedia is an encyclopedia, I have serious doubts about the article's usefulness. If I need to find something quite general, I skim through an encyclopedia (e.g. Wikipedia). For details, I never do this, because it's not the purpose of any encyclopedia to be exhaustive in any entry. I believe that focused books/papers are appropriate for that. In this respect, the beta distribution article on Wikipedia fails to deliver the "encyclopedic message". I honestly respect the great effort of the author and I'm not sure about any wiki-consistent modification. Maybe move the detailed content to another article like "Beta distribution (details)"? — Preceding unsigned comment added by (talk) 09:23, 7 September 2013 (UTC)

This is a philosophical point. Should an encyclopedia be encyclopedic (far reaching and encompassing as much as possible of a topic) or merely a simple reference work outlining the main points of a topic? My personal bias is towards the former. WP allows for the aggregation of information in a way that was impossible in earlier encyclopedias. Another question that might better address what I think the OP was suggesting here would be: could this article be divided into multiple sub-articles? Again I am not sure. I actually like long articles, as I may find material in them that I would not have thought connected to the original matter. But I do know others have different opinions on these matters, so additional opinions would be useful here. DrMicro (talk) 12:21, 17 September 2013 (UTC)

Sensitivity of posterior different priors

This section is unnecessarily wordy, and ends up repeating a lot of the same information. While I appreciate that it's worth pointing out the difference between the Jeffreys, Bayes and Haldane priors, is it really necessary to specify their posterior distributions individually? And after that there is a huge block of text which deals with the s=0 and n=0 cases at length while saying very little. Most of this could go. (talk) 20:04, 24 October 2013 (UTC)
