# Talk:Likelihood principle

Template:WikiProject Statistics

## Untitled

The likelihood link ( http://www.cimat.mx/reportes/enlinea/D-99-10.html ) given at the end is 404. -- 20050120 03:15.

Hello. Recently the following statement was added -- "By contrast, a likelihood-ratio test is based on the principle." This is not clear to me -- while forming a likelihood ratio is entirely consistent with the likelihood principle, appealing to the usual logic of null-hypothesis tests (rejecting the null hypothesis if the LR is too small) is not. A LR test appears to be similar to other null-hypothesis tests in that events that didn't happen have an effect on the inference, thus it appears to be inconsistent with the likelihood principle. -- I'm inclined to remove this new assertion; perhaps someone would like to argue in its favor? Happy editing, Wile E. Heresiarch 15:05, 18 Jun 2004 (UTC)

- Being a Bayesian at heart, I don't personally like standard likelihood-ratio tests any more than I like maximum likelihood as a method, but the argument might go like this: the evidence may point more to the null hypothesis or more to the alternative hypothesis. The degree to which the evidence points at one hypothesis rather than another is (thanks to the likelihood principle) expressed in the likelihood ratio. Therefore it makes sense to accept the null hypothesis if the likelihood ratio is "high enough" and to reject it if not. The value of "high enough" is a matter of choice; one approach might be to use 1 as the critical value, but for a Bayesian looking at point hypotheses the figure would best be a combination of the (inverse) ratio of the priors and the relative costs of Type I errors and Type II errors. --Henrygb 22:15, 23 Jun 2004 (UTC)

- Henry, I went ahead and removed "By contrast, a likelihood-ratio test can be based on the principle." from the article. A standard LR test, as described for example in the likelihood-ratio test article, does involve unrealized events and so it is not consistent with the likelihood principle. It might be possible to construct an unconventional LR test as described above, but that's not what is generally understood by the term, so I think it's beside the point. Regards & happy editing, Wile E. Heresiarch 14:20, 4 Aug 2004 (UTC)

- I think you need to consider what you are saying: *"[the likelihood ratio] is the degree to which the observation x supports parameter value or hypothesis a against b. If this ratio is 1, the evidence is indifferent, and if greater or less than 1, the evidence supports a against b or vice versa."* But you think that this does not provide any justification for a likelihood ratio test which in effect says: *"If the likelihood ratio is less than some value κ, then we can decide to prefer b to a."* I find that very odd; I suspect that in fact you object to how frequentists calculate *κ*, but that is not the point about likelihood ratio tests in general. --Henrygb 17:33, 13 Aug 2004 (UTC)

- I'm willing to consider a compromise of the form "The conventional likelihood-ratio test is not consistent with the likelihood principle, although there is an unconventional LR test which is". That would make it necessary to explain just what an unconventional LR test is, which might be worthwhile. Comments? Wile E. Heresiarch 02:17, 15 Aug 2004 (UTC)

- I agree. We used an unconventional LR test in footnote 22 on page 79 of this paper (2004). We had to, because the comparison was between non-nested models (of the same data, of course). Our reference to Edwards should be to page 76 rather than to page 31. Arie ten Cate 13:40, 3 August 2005 (UTC)

I've largely rewritten the article. It still needs work, in particular, it needs some non-Bayesian argument under "Arguments in favor of the likelihood principle". I've tried to clarify the article by separating the general principle from particular applications. It could also use some links to topics of inference in general, maybe Hume, Popper, epistemology etc if we want to get wonky about it. Wile E. Heresiarch 17:51, 2 Jan 2004 (UTC)

I've added the voltmeter story as an argument in favor of the likelihood principle (and forgot to note this in the Edit summary). The story is taken from the 1976 reprint of the first edition of *Likelihood*. I trust it is in the second edition as well.

Although I find this argument very convincing, I am not sure if I would apply the likelihood principle in clinical trials. Maybe the whole point here is a subtle difference between two types of inferential statistics: "What can we learn from this particular experiment?" (in ordinary scientific research, where the likelihood principle should be applied) and "What would happen if we did it again?" (in commercial product research).

I tried to translate the voltmeter story to the story of Adam, Bill, and Charlotte, by assuming that Adam could have chosen the other stopping rule, with some probability. But then it loses its attractive simplicity, and I decided not to present this.

- Arie ten Cate 18:23, 7 August 2005 (UTC)

The remainder of the talk page here could probably be archived under a suitable title.

In my opinion, it is not true that, if the designs produce proportional likelihood functions, one should make an identical inference about a parameter from the data irrespective of the design which generated the data (likelihood principle: LP).

The situation is usually illustrated by means of the following well-known example. Consider a sequence of independent Bernoulli trials in which there is a constant probability of success p for each trial. The observation of x successes on n trials could arise in two ways: either by taking n trials yielding x successes, or by sampling until x successes occur, which happens to require n trials. According to the LP, the distinction is irrelevant. In fact, the likelihood is proportional to the same expression in each case, and the inferences about p would be the same.
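The proportionality asserted in this example is easy to verify numerically. The following Python sketch uses x = 3 successes in n = 12 trials (the particular numbers are illustrative); the two sampling distributions differ only by a constant factor that does not involve p:

```python
from math import comb

def lik_fixed_n(p, n, x):
    # Direct sampling, n fixed in advance: P(X = x) = C(n, x) p^x (1-p)^(n-x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

def lik_fixed_x(p, n, x):
    # Inverse sampling, x fixed in advance; the x-th success occurs on trial n:
    # P(N = n) = C(n-1, x-1) p^x (1-p)^(n-x)
    return comb(n - 1, x - 1) * p**x * (1 - p)**(n - x)

n, x = 12, 3
ratios = [lik_fixed_n(p, n, x) / lik_fixed_x(p, n, x)
          for p in (0.1, 0.25, 0.5, 0.9)]
# The ratio C(12,3)/C(11,2) = 220/55 = 4 is the same for every p, so the
# two likelihood functions are proportional and the LP treats them as
# carrying the same evidence about p.
assert all(abs(r - 4.0) < 1e-9 for r in ratios)
```

Since the two designs differ only by a constant factor, any inference that uses the likelihood only through its shape as a function of p comes out the same under both.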

Nevertheless, this point is questionable.

In particular, following the logical approach, the probability of a hypothesis h is conditional upon, or relative to, given evidence (cf. Carnap, 1962, p. 31). Quoting Carnap's own words, "the omission of any reference to evidence is often harmless". That means that probability is conditional on that which is known. Now, apart from other information, the design d is actually known. Therefore, the evidence (e) comprises not only that which is known to the statistician before the survey is performed (e*), but also the piece of information about d. Suppose now that i (which stands for information) is our experimental observation and h is one of the competing hypotheses; we can use the premise above to correctly formulate the probability of i as follows:

(1) p(i|h, e*, d)

Notice that this probability is not defined without a reference to d. Thus, the probability of x successes on n Bernoulli trials differs depending on whether n or x is fixed before the experiment is performed. That is, the design always enters into the inference through its occurrence in the probability of i.

- So far so good. Note that p(i|h, e*, d) immediately simplifies to p(i|h, e*). Why? Because asserting that p(i|h, e*, d) != p(i|h, e*) is equivalent to asserting that p(d|i, h, e*) != p(d|h, e*) -- that is, knowing the experimental outcome must tell you something about the design. That's not so: I tell you that I tossed a coin 10 times and got 7 heads. Do you have any reason to believe one way or the other that I resolved to toss 10 times exactly, or to toss until getting 7 heads? No, you don't. Therefore p(i|h, e*, d) = p(i|h, e*), and the computation of p(h|i, e*) goes through as usual. Wile E. Heresiarch 17:51, 2 Jan 2004 (UTC)
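The claim that the computation of p(h|i, e*) "goes through as usual" can be illustrated with a small numerical sketch in Python (the flat prior and the grid of p values are illustrative choices; the data, 7 heads in 10 tosses, come from the coin example above):

```python
from math import comb

# Posterior over p on a grid, under the same (flat) prior, for both designs.
ps = [i / 100 for i in range(1, 100)]
n, x = 10, 7  # 10 tosses, 7 heads

def posterior(lik):
    w = [lik(p) for p in ps]
    z = sum(w)
    return [v / z for v in w]

# Design 1: toss exactly 10 times.  Design 2: toss until the 7th head.
post_fixed_n = posterior(lambda p: comb(n, x) * p**x * (1 - p)**(n - x))
post_fixed_x = posterior(lambda p: comb(n - 1, x - 1) * p**x * (1 - p)**(n - x))

# The design-dependent constant cancels on normalization:
# the posteriors over p are identical.
assert all(abs(a - b) < 1e-12 for a, b in zip(post_fixed_n, post_fixed_x))
```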

The simplified manner in which Bayes' formula has been, and still is, presented in statistics (i.e. without specifying the evidence e) has caused rather serious errors of interpretation. As a matter of fact, the correct expression of Bayes' formula is of the form:

(2) p(h|i, e*, d) proportional to p(h|e*, d) p(i|h, e*, d)

in which it is apparent that the prior depends on d. That is, in general, the prior is influenced by the knowledge available about the design.

Consequently, contrary to a widely held opinion, the likelihood principle is not a direct consequence of Bayes' theorem. In particular, the piece of information about the design is part of the evidence, and it is therefore relevant for the prior.

REFERENCES:

CARNAP R. (1962). Logical Foundations of Probability. The University of Chicago Press.

DE CRISTOFARO R. (1992). The Inductive Reasoning in Statistical Inference, Communications in Statistics, Theory and Methods, v. 31, issue 7, pp. 1079-1089.

- I agree with the point that likelihood principle is not a direct consequence of Bayes' theorem. It is, however, a consequence of the sufficiency and conditionality principles, as proved by Birnbaum. I have added a paragraph pointing out this fact and briefly describing these two principles. Bill Jefferys 22:49, 15 October 2005 (UTC)

- However, I doubt your second point, that knowledge of the design is relevant for the prior. For example, in the coin-tossing example I fail to see how my prior on whether the coin is fair or not would depend on the design. That is, ISTM that in this case and the design is irrelevant. I can certainly imagine that one would design an experiment such that it is likely to tell us interesting things about h, and we would certainly use our prior information about h in deciding on the design, but in my view it's obvious that even in this case the prior on h comes first, and the design comes afterwards, and in no way is the prior on h affected by the design that we choose.

- Am I missing something here? Bill Jefferys 23:10, 15 October 2005 (UTC)

Dear Bill Jefferys, I saw in Wikipedia your comment on my remarks about the likelihood principle. I suggest you read my paper "On the Foundations of Likelihood Principle" in Journal of Statistical Planning and Inference (2004) 126 (2), 401-411, and my communication to the Symposium held in June at Virginia Tech (Foundations of the 'Objective Bayesian Inference'): http://www.error06.econ.vt.edu/ (PDF of Paper). About your note, I would like to observe that the discussion about the LP is related to the example of an experiment designed to elicit a coin's physical probabilities of landing heads and tails. The conclusion is as follows: it does not matter whether the experimenter intended to stop after n tosses of the coin or after r heads appeared in a sample; the inference about ф [the probability of landing heads] is exactly the same in both cases. This conclusion is based on the assumption that the prior for ф is the same in both cases being investigated. In reality, inverse sampling (with r fixed in advance) always stops when we observe the last head. On the contrary, the last observation of direct sampling (with n fixed in advance) may be a head or not. That is, inverse sampling favours the chance of landing heads in the set of tosses. This circumstance brings into question the assumption of the same prior in the two different kinds of experiment. I think that this is a concrete example where p(h|e*, d) is different from p(h|e*). Of course, the likelihood principle (LP) holds that the inference should be the same in case the results of the experiments are the same. However, this thesis too appears equally dubious: the fact of considering only a particular set of results does not change the different nature of the experiments and their influence on the prior probabilities. As a further example, the assignment of an equal probability to each face of a die is based on the assumption that the casting of the die is fair.
In the same way, apart from other information, in order to assign the same probability to every admissible hypothesis, the design should be 'fair' or 'impartial', in the sense of ensuring the same support to all hypotheses. On the other hand, a general principle that concerns any inquiry is as follows: we can assign a uniform distribution over a partition of hypotheses where there is no reason to believe one more likely to be true than any other, in the sense of both irrelevance of prior information and impartiality of the method of inquiry. I do not see why the design (or the method of inquiry) should not be relevant in statistics. In reality, it is relevant in all fields of research. Best regards, Rodolfo de Cristofaro. See my home page at www.ds.unifi.it

- I fail to see why the fact that inverse binomial sampling always ends in a "heads" whereas binomial sampling may end in "tails" favours the chance of landing heads in the set of tosses under inverse sampling. This seems to me to be a mere unsupported assertion. As it seems to violate Birnbaum's proof of the LP, it would also appear to be false.

- To convince me otherwise, you would have to provide an effective method of calculating the prior under each design (inverse, direct) and a demonstration that this is the right thing to do.

- No one is saying that the design of an experiment is not relevant in statistics. Obviously it *is* relevant in many situations, e.g., censored or truncated data. But it doesn't seem to be relevant in this case (i.e., inverse versus direct binomial sampling), and you'll need more than you've written here to convince me otherwise.

- In the meantime, thank you for sending me a copy of your paper, which I will read with interest. Bill Jefferys 16:41, 18 September 2006 (UTC)

MY ANSWER

- Birnbaum's proof is not right because, if the LP is false, then the property of sufficiency also no longer has exactly the same meaning.

For instance, knowledge of the number of successes x in a Bernoulli process is not 'sufficient' for inference purposes. In fact, information about the design used is ancillary to x (in the same way as n), and it cannot be ignored (not even under the same likelihood).

- The method of calculating the prior is as follows:

the prior for a parameter may be assumed proportional to the corresponding maximum value of the likelihood over all possible likelihood functions obtainable from the projected design. This method is consistent with the uniform prior and related transformations.

- Regarding your remarks, I would like to note that p(h|e*) is unknown if d is unspecified. On the contrary, p(h|e*, d) is well-determined. This is the difference between them.

Rodolfo de Cristofaro decrist@ds.unifi.it

- Bald claims that contradict everything I know about inference and about the LP. You have not convinced me. Sorry.

- As far as I am concerned, this discussion is at an end. Bill Jefferys 12:16, 20 September 2006 (UTC)

It is a pity you are closed to newness. I can only invite you to read my papers. Rodolfo de Cristofaro, 25 September 2006

- No one is closed to newness. I am certainly not. But mere newness is not appropriate for an encyclopedia. It must be backed up by a substantial body of published research. Please read the WikiPedia policy on no original research. The policy is that WikiPedia is to present the scholarly consensus, plus any *significant* opinions that vary from it, as presented (in cases like this) in the scholarly literature. BUT, the opinions of a single person, even if published in the scholarly literature, do not necessarily represent an opinion that should be included in this encyclopedia.

- As I have said, I will read the paper you sent me, when I have time. But what you are flogging here contradicts everything I know about the LP, published by numerous scholars of excellent reputation, many of whom I know personally. Your singular assertions are insufficient to overcome my objections. Perhaps I will change my mind when I read your paper; but until then, you will have to post your objections without further comment from me. Bill Jefferys 01:07, 26 September 2006 (UTC)

## Accept a hypothesis

Hello everyone. Well, two people have reverted my edits about accepting an hypothesis. The LP implies one can accept an hypothesis on the grounds that it cannot readily be improved. Edwards uses the two units of likelihood criterion for this, and surely his opinion should carry some weight. Perhaps this observation should appear somewhere else in the article. Comments? Robinh 08:10, 12 December 2006 (UTC)

## On design not mattering

Just a note to accompany the removal of a misleading rendition of the LP early on this page. Differing experimental designs typically lead to different likelihood functions. The classic example given here, and as a motivating example in Berger's book on the LP, is a rather special case designed to make a counterintuitive point. Consequently it is unwise to state that the LP implies that experimental design 'doesn't matter', even colloquially. It usually does. There's a good discussion emphasizing the relevance of design to Bayesian inference in chapter 7 of Gelman et al. 1995 'Bayesian Data Analysis'.

I find the example somewhat confusing; the two observations are different in more than just the design of the experiment. When observing X = 3 the 3 successes can happen anywhere within the 12 trials, whereas with Y = 12 the last trial must be a success. This is of course why the likelihood functions are different. How is it valid to say that those two outcomes are the same? K. Sperling (talk) 03:55, 19 August 2008 (UTC)

- The *sampling distributions* for the data are different, but their dependences on the states of nature are proportional, so the likelihoods are proportional to each other for the fixed data observed. Since the likelihood is actually an equivalence class of functions proportional to each other, where the proportionality constant is independent of the state of nature (but may depend on the data), it is actually the case, in the Berger example, that the likelihoods are the *same*. For them to be different, the experimental design has to result in likelihood functions that are *not* proportional, but in this example the different designs of the experiment do not result in this outcome. Bill Jefferys (talk) 14:58, 19 August 2008 (UTC)

- I do not think this is correct. What is the likelihood under your second design of a sample ending in a tail? According to your likelihood expression it is p^3*(1-p)^9 >> 0 if 0<p<1. Doesn't this trouble you a bit? Consider an even simpler case. Two experiment designs: (1) sample N observations from a Bernoulli distribution; (2) sample from the BD until the first success. Suppose p is the probability of success and S is the number of successes in the first experiment. According to your logic both should have identical likelihoods. This is not so. The likelihood in experiment 1 is p^S*(1-p)^(N-S). The likelihood in experiment 2 is (1-p)^N*p. The ML estimator for p in experiment 1 is p*=S/N. The ML estimator for p in experiment 2 is p*=1/N. In identical samples both experiments produce identical estimates. It doesn't mean that you have the same inference. The sample spaces for the experiments are completely different. —Preceding unsigned comment added by VladP (talk • contribs) 00:59, 5 January 2010 (UTC)

- The likelihoods are the same even in your example, because they only depend on the data *actually observed*. If you sampled N observations from a binomial, and observed 1 success and N-1 failures, it doesn't matter whether the sampling was binomial (you planned to sample N times and happened to get just one success in that sequence) or inverse binomial (you planned to sample until you observed the first success, and it just happened that the first success occurred on the N'th sample). The *data* actually observed (S=1, N-S=N-1) are the same, so the likelihoods are indeed proportional.

- BTW, you wrote your likelihood in the second case incorrectly. There were exactly N samples in all, with one success, so you should have written (1-p)^(N-1)*p. Bill Jefferys (talk) 14:58, 3 September 2010 (UTC)
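The point made here can be checked directly: with identical data (one success in N trials), the binomial and inverse binomial likelihoods are proportional, so they peak at the same p. A short Python sketch (N = 10 and the search grid are illustrative choices):

```python
# One success observed in N = 10 trials, under either design.
N = 10
grid = [i / 1000 for i in range(1, 1000)]  # candidate values of p

def lik_binomial(p):
    # Planned N trials, got S = 1 success: C(N,1) p (1-p)^(N-1)
    return N * p * (1 - p)**(N - 1)

def lik_inverse(p):
    # Sampled until the first success, which came on trial N: (1-p)^(N-1) p
    return (1 - p)**(N - 1) * p

mle_b = max(grid, key=lik_binomial)
mle_i = max(grid, key=lik_inverse)
# Both likelihoods are maximized at the same value, p = 1/N = 0.1:
# the constant factor N does not move the maximum.
assert mle_b == mle_i == 0.1
```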

- Addendum: I failed to comment on one point made in the original comment of 5 January 2010:

- According to your likelihood expression it is p^3*(1-p)^9 >> 0 if 0<p<1. Doesn't this trouble you a bit?

- No, it would not trouble me even a little bit, even if it were true (which it is not, since the factors you display are all less than 1). The likelihood is *not* a probability density, and does not have to be normalized. The likelihood is not a function in the usual sense; it is an *equivalence class* of functions which are proportional when considered only as a function of the states of nature (here p). The proportionality constant can be any positive constant (which may depend on the data but not the states of nature), and someone else with a different experimental design (as here, binomial vs. negative binomial) can have a different proportionality constant (the two different designs give different binomial coefficients in the sampling distributions). The point of the example is that in this particular case, the likelihoods are the same (i.e., the same equivalence class of functions) even though the designs are different. It is a feature of this particular example. In general, the experimental design can affect the likelihood; no one is disputing that. Bill Jefferys (talk) 22:37, 3 October 2011 (UTC)

- OMG, I misread this even more:

- According to your likelihood expression it is p^3*(1-p)^9 >> 0 if 0<p<1. Doesn't this trouble you a bit?

- I misread ">>1", not ">>0" as you wrote. Any positive real number is ">>0". Even "epsilon" is >>0, because you can always find a number >0 that is a million, a billion, a googol, a googolplex smaller than epsilon. Bill Jefferys (talk) 03:01, 4 October 2011 (UTC)

- The likelihoods are not independent of the experiment design. The sample space in the second experiment is: {F,S}, {F,F,S}, etc. The probability of each point is (1-p)^N*p. Read up on the literature on censoring. If you were correct, censoring wouldn't matter. — Preceding unsigned comment added by VladP (talk • contribs) 07:23, 25 March 2011 (UTC)

- There is no censoring in this example. You are mistaken. Bill Jefferys (talk) 14:07, 28 September 2011 (UTC)

- About the above example. This is about the observations being the same and the designs being different. Since the observations are the same, we have S=1 in both cases and the two likelihood functions are proportional, with L(p)=(1-p)^(N-1)*p. Arie ten Cate (talk) 16:20, 3 October 2011 (UTC)

- Also, unlike stated above at 25 March 2011, the sample space in the second experiment starts with immediate success: {S}, {F,S}, {F,F,S}, etc. The probability of each point is (1-p)^(N-1)*p, starting with (1-p)^(1-1)*p = p, as it should be. The sum of this series from N=1 to infinity is unity, using the rule for infinite geometric series. (The series as stated at 25 March 2011 starts at N=2 with (1-p)^N*p = (1-p)^2*p and hence the sum to infinity is not unity but (1-p)^2.) Arie ten Cate (talk) 21:52, 3 October 2011 (UTC)
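The normalization claim in the comment above is easy to reproduce numerically: the partial sums of the geometric probabilities converge to unity. A Python sketch (p = 0.3 is an arbitrary illustrative value):

```python
# P(first success on trial N) = (1-p)^(N-1) * p, for N = 1, 2, 3, ...
p = 0.3
total = sum((1 - p)**(N - 1) * p for N in range(1, 200))
# The infinite geometric series sums to 1; by N = 200 the remaining
# tail, (1-p)^199, is negligibly small.
assert abs(total - 1.0) < 1e-12
```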

## Voltmeter story

I have trouble seeing the point in the "Voltmeter Story." If the voltmeter goes up to 100, and all measurements were strictly less than that, in what sense is the population 'censored?' If it is indeed 'censored' wouldn't it also be 'censored' if the voltmeter goes up to 1000 volts? —Preceding unsigned comment added by 170.223.221.198 (talk) 19:15, 16 October 2008 (UTC)

- Practically, we can consider 1000 volts as infinity, hence uncensored. Quoting from the section "Likelihood principle reference" below, the story is about whether or not "having observed a specific value of x, other possible observations do not come into play". Arie ten Cate (talk) 12:13, 31 May 2009 (UTC)

I have a problem with the revision of 08:39, 12 January 2010 (http://en.wikipedia.org/w/index.php?title=Likelihood_principle&oldid=337352365), described as *The voltmeter story: clarified the description to clearly identify the seemingly paradoxical nature of the claim*.
This revision replaced the phrase "The distribution of the measurements depends on this probability" by "The likelihood theory claims that the distribution of the voltage measurements depends on the probability that an instrument not used in this experiment was broken at the time". I think a simple, clear phrase was replaced by an obscure phrase: the dependence discussed here is due to the measurement procedure, and not due to the likelihood theory. Also, it is unclear to me what is meant here with "likelihood theory": is it the likelihood function or the likelihood principle or something else? Also, there is no ambiguity in the word "measurements" in the original phrase, nor in "this probability". Hence I suggest to return to the original phrase. If the relevance of the dependency discussed here is unclear then something might be added like "and hence for the orthodox statistician this probability must be taken into account in the inference". (My favourite is the probability that Napoleon did win the Battle of Waterloo, changing the economic development of Europe. Hence the probability distribution of all European macro-economic time series depends on this probability and with a similar reasoning as in the voltmeter story, this probability must be taken into account in all European macro-economic research.) Arie ten Cate (talk) 14:07, 8 October 2011 (UTC)

## likelihood principle reference

The article doesn't give detailed references for where the information is coming from. In particular, I'd like to see which specific reference this definition of the likelihood principle is based on, because I don't fully understand it. AFAIK, the likelihood principle just says that having observed a specific value of x, other possible observations do not come into play (i.e. I only need p(x=x_1|theta), not p(x|theta)).--Dreiche2 (talk) 15:17, 13 February 2009 (UTC)

- This is indeed what it (presently) says in the article, though in formal language: *A likelihood function arises from a conditional probability distribution considered as a function of its distributional parameterization argument, conditioned on the data argument. [...] all information from the data relevant to inferences about the value of θ is found in the equivalence class.*
- Somewhat less precise, but more plain: *A likelihood function is a probability distribution considered as a function of the unknown parameter, given the observed data [...] all information from the data relevant to inferences about the parameter is found in the likelihood function.* That is, in L(theta|x_1)=p(x=x_1|theta) and not in p(x|theta).
- I propose to add a translation of the formal definition into less precise but more plain language, for instance such as the one given by Dreiche2. In a consistent notation, of course. Arie ten Cate (talk) 13:14, 31 May 2009 (UTC)

## Mayo article

The article claims:

- In fact Birnbaum's alleged proof of the likelihood principle, however celebrated, has been shown to be invalid by Deborah G. Mayo in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science (Mayo and Spanos Eds, Cambridge, 2010,) chapter 7, pp. 305-314

I have asked an expert on the likelihood principle about this article. He responded that the article is nonsense, appears only in a book edited by the author of the article, and that there is apparently no peer-reviewed article by the author making that claim.

I will therefore remove the comment as not being properly sourced. Bill Jefferys (talk) 18:29, 3 September 2010 (UTC)

## Optional stopping in clinical trials

I just removed this section, since it is patently wrong, has offered no sources, reliable or otherwise, since 2011, and the subject is only tangentially related to the main article.

This section said things like:

- "Furthermore, as mentioned above, frequentist analysis is open to unscrupulous manipulation if the experimenter is allowed to choose the stopping point, whereas Bayesian methods are **immune** to such manipulation."

This is absolutely false, since **any** statistical procedure, Bayesian or otherwise, may suffer from confirmation bias. Viraltux (talk) 09:54, 27 July 2013 (UTC)