Lévy's continuity theorem: Difference between revisions

'''Sample size determination''' is the act of choosing the number of observations or [[Replication (statistics)|replicates]] to include in a [[statistical sample]]. The sample size is an important feature of any empirical study in which the goal is to make [[statistical inference|inferences]] about a [[statistical population|population]] from a sample. In practice, the sample size used in a study is determined by the expense of data collection and the need to have sufficient [[statistical power]]. In complicated studies there may be several different sample sizes involved: for example, in [[survey sampling]] involving [[stratified sampling]] there would be a different sample size for each stratum. In a [[census]], data are collected on the entire population, hence the sample size is equal to the population size. In [[experimental design]], where a study may be divided into different [[treatment group]]s, there may be different sample sizes for each group.
 
Sample sizes may be chosen in several different ways:
*expedience -  For example, include those items readily available or convenient to collect.  A choice of small sample sizes, though sometimes necessary, can result in wide [[confidence interval]]s or risks of errors in [[statistical hypothesis testing]].
*using a target variance for an estimate to be derived from the sample eventually obtained
*using a target for the power of a [[statistical hypothesis testing|statistical test]] to be applied once the sample is collected.
 
How samples are collected is discussed in [[sampling (statistics)]] and [[survey data collection]].
 
==Introduction==
 
Larger sample sizes generally lead to increased [[Accuracy and precision|precision]] when [[statistical estimation|estimating]] unknown parameters. For example, if we wish to know the proportion of a certain species of fish that is infected with a pathogen, we would generally have a more accurate estimate of this proportion if we sampled and examined 200 rather than 100 fish.  Several fundamental facts of mathematical statistics describe this phenomenon, including the [[law of large numbers]] and the [[central limit theorem]].
 
In some situations, the increase in accuracy for larger sample sizes is minimal, or even non-existent. This can result from the presence of [[systematic error]]s or strong [[correlation and dependence|dependence]] in the data, or if the data follow a heavy-tailed distribution.
 
Sample sizes are judged based on the quality of the resulting estimates. For example, if a proportion is being estimated, one may wish to have the 95% [[confidence interval]] be less than 0.06 units wide. Alternatively, sample size may be assessed based on the [[statistical power|power]] of a hypothesis test. For example, if we are comparing the support for a certain political candidate among women with the support for that candidate among men, we may wish to have 80% power to detect a difference in the support levels of 0.04 units.
 
==Estimating proportions and means==
 
A relatively simple situation is estimation of a [[Proportionality (mathematics)|proportion]]. For example, we may wish to estimate the proportion of residents in a community who are at least 65 years old.
 
The [[estimator]] of a [[Proportionality (mathematics)|proportion]] is <math> \hat p = X/n</math>, where ''X'' is the number of 'positive' observations (e.g. the number of people out of the ''n'' sampled people who are at least 65 years old). When the observations are [[independent (statistics)|independent]], this estimator has a (scaled) [[binomial distribution]] (and is also the [[Sample (statistics)|sample]] [[arithmetic mean|mean]] of data from a [[Bernoulli distribution]]). The variance of this estimator is ''p''(1&nbsp;−&nbsp;''p'')/''n'', so the maximum [[variance]] is 0.25/''n'', which occurs when the true [[parameter]] is ''p'' = 0.5. In practice, since ''p'' is unknown, the maximum variance is often used for sample size assessments.
 
For sufficiently large ''n'', the distribution of <math>\hat{p}</math> will be closely approximated by a [[normal distribution]].<ref>[[NIST]]/[[SEMATECH]], [http://www.itl.nist.gov/div898/handbook/prc/section2/prc242.htm "7.2.4.2. Sample sizes required"], ''e-Handbook of Statistical Methods.''</ref> Using this approximation, it can be shown that around 95% of this distribution's probability lies within 2 standard deviations of the mean. Using the
[[Binomial_distribution#Confidence_intervals | Wald method for the binomial distribution]],
an interval of the form
 
:<math>(\hat p -2\sqrt{0.25/n}, \hat p +2\sqrt{0.25/n})</math>
 
will form a 95% confidence interval for the true proportion. If this interval needs to be no more than ''W'' units wide, the equation
 
:<math>4\sqrt{0.25/n} = W</math>
 
can be solved for ''n'', yielding<ref>[http://www.utdallas.edu/~ammann/stat3355/node25.html "Large Sample Estimation of a Population Proportion"]</ref><ref>[http://nebula.deanza.fhda.edu/~bloom/Math10/M10ConfIntNotes.pdf "Confidence Interval for a Proportion"]</ref> ''n''&nbsp;=&nbsp;4/''W''<sup>2</sup>&nbsp;=&nbsp;1/''B''<sup>2</sup> where ''B'' is the error bound on the estimate, i.e., the estimate is usually given as ''within ± B''. So, for ''B'' = 10% one requires ''n'' = 100, for ''B'' = 5% one needs ''n'' = 400, for ''B'' = 3% the requirement approximates to ''n'' = 1000, while for ''B'' = 1% a sample size of ''n'' = 10000 is required. These numbers are quoted often in news reports of [[opinion poll]]s and other [[sample survey]]s.
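
The rule ''n''&nbsp;=&nbsp;1/''B''<sup>2</sup> can be checked with a short calculation. The following is a minimal sketch, not a reference implementation; the function name <code>sample_size_for_proportion</code> is hypothetical, and it assumes the worst-case variance (''p'' = 0.5) and the "2 standard deviations" approximation used above.

```python
import math


def sample_size_for_proportion(error_bound: float) -> int:
    """Conservative sample size so that an approximate 95% Wald interval
    for a proportion lies within +/- error_bound of the estimate.

    Assumes the worst case p = 0.5 (variance 0.25/n) and the
    two-standard-deviation rule, which together give n = 1 / B^2.
    """
    if not 0 < error_bound < 1:
        raise ValueError("error_bound must lie strictly between 0 and 1")
    # round() removes floating-point noise before taking the ceiling
    return math.ceil(round(1.0 / error_bound ** 2, 6))


for b in (0.10, 0.05, 0.03, 0.01):
    print(f"B = {b:.2f}: n = {sample_size_for_proportion(b)}")
```

For ''B'' = 3% the exact value is about 1111, which the function rounds up to 1112; news reports usually quote this as roughly 1000. For a target interval width ''W'' rather than an error bound, pass ''B'' = ''W''/2.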
 
===Estimation of means===
 
A proportion is a special case of a mean. When estimating the population mean using an independent and identically distributed (iid) sample of size ''n'', where each data value has variance ''σ''<sup>2</sup>, the [[standard error (statistics)|standard error]] of the sample mean is:
 
::<math>\sigma/\sqrt{n}.</math>
 
This expression describes quantitatively how the estimate becomes more precise as the sample size increases.  Using the [[central limit theorem]] to justify approximating the sample mean with a normal distribution yields an approximate 95% confidence interval of the form
 
:<math>(\bar x - 2\sigma/\sqrt{n},\bar x + 2\sigma/\sqrt{n}).</math>
 
If we wish to have a confidence interval that is ''W'' units in width, we would solve
 
:<math>
4\sigma/\sqrt{n} = W
</math>
 
for ''n'', yielding the sample size ''n''&nbsp;=&nbsp;16''σ''<sup>2</sup>/''W''<sup>2</sup>.
 
For example, if we are interested in estimating the amount by which a drug lowers a subject's blood pressure with a confidence interval that is six units wide, and we know that the standard deviation of blood pressure in the population is 15, then the required sample size is 100.
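
The blood pressure example can be reproduced numerically. This is a small sketch under the same assumptions as the text (known ''σ'', normal approximation, interval of ± 2 standard errors); the function name <code>sample_size_for_mean</code> is illustrative only.

```python
import math


def sample_size_for_mean(sigma: float, width: float) -> int:
    """Sample size so that the approximate 95% interval
    (mean - 2*sigma/sqrt(n), mean + 2*sigma/sqrt(n)) is at most
    `width` units wide, i.e. n = 16 * sigma^2 / width^2.
    """
    if sigma <= 0 or width <= 0:
        raise ValueError("sigma and width must be positive")
    # round() removes floating-point noise before taking the ceiling
    return math.ceil(round(16.0 * sigma ** 2 / width ** 2, 6))


# Standard deviation 15, interval six units wide
print(sample_size_for_mean(sigma=15, width=6))
```

Halving the target width quadruples the required sample size, because ''n'' grows with 1/''W''<sup>2</sup>.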
 
==Required sample sizes for hypothesis tests {{anchor|Estimating sample sizes}}==
A common problem faced by statisticians is calculating the sample size required to yield a certain [[Statistical power|power]] for a test, given a predetermined [[Type I error]] rate α. This can be estimated from pre-determined tables for certain values, by Mead's resource equation, or, more generally, by the [[cumulative distribution function]], as described below:
 
===By tables===
{|class="wikitable" align="right"
!rowspan=2|<ref name=Kenny1987/><br>&nbsp;<br> [[statistical power|Power]] !!colspan=3| [[Cohen's d]]
|-
! 0.2 !! 0.5 !! 0.8
|-
! 0.25
| 84 || 14 || 6
|-
! 0.50
| 193 || 32 || 13
|-
! 0.60
| 246 || 40 || 16
|-
! 0.70
| 310 || 50 || 20
|-
! 0.80
| 393 || 64 || 26
|-
! 0.90
| 526 || 85 || 34
|-
! 0.95
| 651 || 105 || 42
|-
! 0.99
| 920 || 148 || 58
|}
The table at right can be used for a [[two-sample t-test]] to estimate the sample sizes of an [[experimental group]] and a [[control group]] of equal size when the desired [[significance level]] is 0.05; the total number of individuals in the trial is twice the number given.<ref name=Kenny1987>[http://davidakenny.net/doc/statbook/chapter_13.pdf Chapter 13], page 215, in: {{cite book |author=Kenny, David A. |title=Statistics for the social and behavioral sciences |publisher=Little, Brown |location=Boston |year=1987 |pages= |isbn=0-316-48915-8 |oclc= |doi= |accessdate=}}</ref> The parameters used are:
*The desired [[statistical power]] of the trial, shown in the column to the left.
*[[Cohen's d]] (=effect size), which is the expected difference between the [[mean]]s of the target values between the experimental group and the [[control group]], divided by the expected [[standard deviation]].
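
Values close to those in the table can be obtained from a normal approximation. The sketch below is not the method used to build the table; it assumes a two-sided test and replaces the ''t'' distribution by the standard normal, giving roughly 2(''z''<sub>α/2</sub> + ''z''<sub>power</sub>)<sup>2</sup>/''d''<sup>2</sup> individuals per group. The function name <code>approx_n_per_group</code> is hypothetical.

```python
import math
from statistics import NormalDist


def approx_n_per_group(cohens_d: float, power: float, alpha: float = 0.05) -> int:
    """Approximate per-group sample size for a two-sample comparison of
    means, using the normal approximation
    n = 2 * ((z_{alpha/2} + z_{power}) / d)^2.

    Assumption: two-sided test at level alpha, equal group sizes.
    """
    if cohens_d <= 0:
        raise ValueError("cohens_d must be positive")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = std_normal.inv_cdf(power)          # quantile for the target power
    return math.ceil(2 * ((z_alpha + z_power) / cohens_d) ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {approx_n_per_group(d, power=0.80)} per group")
```

For a power of 0.80 this gives 393, 63 and 25 per group, against 393, 64 and 26 in the table; the small shortfall at larger effect sizes reflects the use of the normal rather than the ''t'' distribution.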
 
===Mead's resource equation===
Mead's resource equation is often used for estimating sample sizes of [[laboratory animal]]s, as well as in many other laboratory experiments. It may not be as accurate as using other methods in estimating sample size, but gives a hint of what is the appropriate sample size where parameters such as expected standard deviations or expected differences in values between groups are unknown or very hard to estimate.<ref name=Hubrecht&Kirkwood2010>{{cite book |author=Kirkwood, James; Robert Hubrecht |title=The UFAW Handbook on the Care and Management of Laboratory and Other Research Animals |publisher=Wiley-Blackwell |location= |year=2010 |pages=29 |isbn=1-4051-7523-0 |oclc= |doi= |accessdate=}} [http://books.google.se/books?id=Wjr9u1AAhsdsdsdsdsdssdsdddddddddddddddddddst4C&pg=PA29 online Page 29]</ref>
 
All the parameters in the equation are [[Degrees of freedom (statistics)|degrees of freedom]]: each is the count of the corresponding quantity minus 1, so 1 is subtracted from each count before it is inserted into the equation.
 
The equation is:<ref name=Hubrecht&Kirkwood2010/>
 
:<math> E = N - B - T,</math>
where:
*''N'' is the total number of individuals or units in the study (minus 1)
*''B'' is the ''blocking component'', representing environmental effects allowed for in the design (minus 1)
*''T'' is the ''treatment component'', corresponding to the number of [[treatment groups]] (including [[control group]]) being used, or the number of questions being asked (minus 1)
*''E'' is the degrees of freedom of the ''error component'', and should be somewhere between 10 and 20.
 
For example, if a study using laboratory animals is planned with four treatment groups (''T''=3), with  eight animals per group, making 32 animals total (''N''=31), without any further [[Stratified sampling|stratification]] (''B''=0), then ''E'' would equal 28, which is above the cutoff of 20, indicating that sample size may be a bit too large, and six animals per group might be more appropriate.<ref>[http://www.isogenic.info/html/resource_equation.html Isogenic.info > Resource equation] by Michael FW Festing. Updated Sept. 2006</ref>
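
The worked example can be written as a few lines of code. This is an illustrative sketch; the function name <code>mead_error_df</code> and its parameters are hypothetical, and a design without blocking is represented here as a single block so that ''B'' = 0.

```python
def mead_error_df(total_units: int, treatment_groups: int, blocks: int = 1) -> int:
    """Error degrees of freedom E = N - B - T from Mead's resource equation.

    Each count is converted to degrees of freedom by subtracting 1.
    blocks=1 (the default) means no blocking, so B = 0.
    """
    n_df = total_units - 1       # N
    t_df = treatment_groups - 1  # T
    b_df = blocks - 1            # B
    return n_df - b_df - t_df


# Four treatment groups with eight animals each, no blocking
print(mead_error_df(total_units=32, treatment_groups=4))  # E = 28

# Six animals per group instead
print(mead_error_df(total_units=24, treatment_groups=4))  # E = 20
```

With six animals per group ''E'' falls to 20, the upper end of the recommended range of 10 to 20.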
 
===By cumulative distribution function===
Let ''X<sub>i</sub>'', ''i'' = 1, 2, ..., ''n'' be independent observations taken from a [[normal distribution]] with unknown mean μ and known variance σ<sup>2</sup>. Let us consider two hypotheses, a [[null hypothesis]]:
 
: <math> H_0:\mu=0 </math>
 
and an alternative hypothesis:
 
: <math> H_a:\mu=\mu^* </math>
 
for some 'smallest significant difference' μ<sup>*</sup> >0. This is the smallest value for which we care about observing a difference. Now, if we wish to (1) reject ''H''<sub>0</sub> with a probability of at least 1-β when
''H''<sub>a</sub> is true (i.e. a [[Statistical power|power]] of 1-β), and (2) reject ''H''<sub>0</sub> with probability α when ''H''<sub>0</sub> is true, then we need the following:
 
If ''z''<sub>α</sub> is the upper α percentage point of the standard normal distribution, then
 
: <math> \Pr(\bar x >z_{\alpha}\sigma/\sqrt{n}|H_0 \text{ true})=\alpha </math>
 
and so
 
: 'Reject ''H''<sub>0</sub> if our sample average (<math>\bar x</math>) is more than <math>z_{\alpha}\sigma/\sqrt{n}</math>'
 
is a [[decision rule]] which satisfies (2). (Note that this is a one-tailed test.)
 
Now we wish for this to happen with a probability at least 1-β when
''H''<sub>a</sub> is true. In this case, our sample average will come from a Normal distribution with mean μ<sup>*</sup>. Therefore we require
 
: <math> \Pr(\bar x >z_{\alpha}\sigma/\sqrt{n}|H_a \text{ true})\geq 1-\beta </math>
 
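Under ''H''<sub>a</sub>, <math>\sqrt{n}(\bar x-\mu^{*})/\sigma</math> has a standard normal distribution, so the requirement is equivalent to <math>\Phi\left(z_{\alpha}-\sqrt{n}\mu^{*}/\sigma\right)\leq\beta</math>. Rearranging, this can be shown to happen when

: <math> n \geq \left(\frac{z_{\alpha}+\Phi^{-1}(1-\beta)}{\mu^{*}/\sigma}\right)^2 </math>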
 
where <math>\Phi</math> is the normal [[cumulative distribution function]].
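
The required ''n'' can also be computed and checked numerically. The sketch below solves the power requirement <math>\Pr(\bar x > z_{\alpha}\sigma/\sqrt{n} \mid H_a) \geq 1-\beta</math> directly, which gives <math>\sqrt{n} \geq (z_{\alpha}+z_{\beta})\sigma/\mu^{*}</math> with <math>z_{\beta}=\Phi^{-1}(1-\beta)</math>, and then verifies the achieved power. The function names and the example values (α = 0.05, β = 0.20, μ<sup>*</sup> = 0.5, σ = 1) are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def n_for_one_sided_z_test(mu_star: float, sigma: float,
                           alpha: float = 0.05, beta: float = 0.20) -> int:
    """Smallest n for a one-tailed z-test of H0: mu = 0 against
    Ha: mu = mu_star > 0 with level alpha and power at least 1 - beta."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)  # upper alpha point
    z_beta = STD_NORMAL.inv_cdf(1 - beta)    # upper beta point
    return math.ceil(((z_alpha + z_beta) * sigma / mu_star) ** 2)


def achieved_power(n: int, mu_star: float, sigma: float,
                   alpha: float = 0.05) -> float:
    """Probability that the sample mean exceeds z_alpha * sigma / sqrt(n)
    when the true mean is mu_star."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(z_alpha - math.sqrt(n) * mu_star / sigma)


n = n_for_one_sided_z_test(mu_star=0.5, sigma=1.0)
print(n, round(achieved_power(n, 0.5, 1.0), 4))
print(n - 1, round(achieved_power(n - 1, 0.5, 1.0), 4))
```

In this example the computed ''n'' reaches the target power of 0.80, while one observation fewer falls just short of it.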
 
==Stratified sample size==
With more complicated sampling techniques, such as [[stratified sampling]], the sample can often be split up into sub-samples. Typically, if there are ''H'' such sub-samples (from ''H'' different strata) then each of them will have a sample size ''n<sub>h</sub>'', ''h'' = 1, 2, ..., ''H''. These ''n<sub>h</sub>'' must conform to the rule that ''n''<sub>1</sub> + ''n''<sub>2</sub> + ... + ''n''<sub>''H''</sub> = ''n'' (i.e. that the total sample size is given by the sum of the sub-sample sizes). Selecting these ''n<sub>h</sub>'' optimally can be done in various ways, using (for example) Neyman's optimal allocation.
 
There are many reasons to use stratified sampling:<ref>Kish (1965, Section 3.1)</ref> to decrease variances of sample estimates, to use partly non-random methods, or to study strata individually.
A useful, partly non-random method would be to sample individuals where easily accessible, but, where not, sample clusters to save travel costs.<ref>Kish (1965), p.148.</ref>
 
In general, for ''H'' strata, a weighted sample mean is
: <math> \bar x_w  = \sum_{h=1}^H W_h \bar x_h, </math>
with
 
: <math> \operatorname{Var}(\bar x_w) = \sum_{h=1}^H W_h^2 \,\operatorname{Var}(\bar x_h). </math><ref>Kish (1965), p.78.</ref>
 
The weights, <math>W_h</math>, frequently, but not always, represent the proportions of
the population elements in the strata, so that <math>W_h=N_h/N</math>, where <math> N = \sum{N_h} </math> is the population size.  For a fixed sample
size, that is <math> n = \sum{n_h} </math>,
 
: <math>  \operatorname{Var}(\bar x_w) = \sum_{h=1}^H W_h^2 \,Var_h \left(\frac{1}{n_h} - \frac{1}{N_h}\right), </math><ref>Kish (1965), p.81.</ref>
 
which can be made a minimum if the [[sampling rate]] within each stratum is made
proportional to the standard deviation within each stratum: <math> n_h/N_h=k S_h </math>, where <math> S_h = \sqrt{Var_h} </math> and <math>k</math> is a constant such that <math> \sum{n_h} = n </math>.
 
An "optimum allocation" is reached when the sampling rates within the strata
are made directly proportional to the standard deviations within the strata
and inversely proportional to the square root of the sampling cost per element
within the strata, <math>C_h</math>:
: <math> \frac{n_h}{N_h} = \frac{K S_h}{\sqrt{C_h}}, </math><ref>Kish (1965), p.93.</ref>
 
where <math>K</math> is a constant such that <math> \sum{n_h} = n </math>, or, more generally, when
 
: <math> n_h = \frac{K' W_h S_h}{\sqrt{C_h}}. </math><ref>Kish (1965), p.94.</ref>
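
The allocation rule and the variance formula can be illustrated together. This is a sketch under stated assumptions: the function names are hypothetical, the weights are taken to be <math>W_h=N_h/N</math>, the stratum sizes and standard deviations are made-up example values, and the resulting <math>n_h</math> are left as real numbers rather than rounded to integers.

```python
import math


def optimum_allocation(n, pop_sizes, std_devs, costs=None):
    """Allocate a total sample of size n across strata with
    n_h proportional to W_h * S_h / sqrt(C_h), where W_h = N_h / N.

    With equal costs (the default) this reduces to allocation
    proportional to N_h * S_h.
    """
    if costs is None:
        costs = [1.0] * len(pop_sizes)
    total = sum(pop_sizes)
    scores = [(N_h / total) * S_h / math.sqrt(C_h)
              for N_h, S_h, C_h in zip(pop_sizes, std_devs, costs)]
    scale = n / sum(scores)  # the constant K', chosen so that sum(n_h) = n
    return [scale * s for s in scores]


def stratified_variance(sample_sizes, pop_sizes, std_devs):
    """Var of the weighted mean: sum of W_h^2 * S_h^2 * (1/n_h - 1/N_h)."""
    total = sum(pop_sizes)
    return sum((N_h / total) ** 2 * S_h ** 2 * (1.0 / n_h - 1.0 / N_h)
               for n_h, N_h, S_h in zip(sample_sizes, pop_sizes, std_devs))


pop_sizes = [500, 300, 200]  # N_h (example values)
std_devs = [10, 20, 30]      # S_h (example values)

optimal = optimum_allocation(100, pop_sizes, std_devs)
proportional = [100 * N_h / sum(pop_sizes) for N_h in pop_sizes]

print([round(x, 2) for x in optimal])
print(stratified_variance(optimal, pop_sizes, std_devs))
print(stratified_variance(proportional, pop_sizes, std_devs))
```

In this example the allocation weighted by standard deviation gives a smaller variance for the weighted mean than allocating the same 100 observations in proportion to stratum size.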
 
== Too-large sample size problem ==
An excessively large sample size may lead to the rejection of a [[null hypothesis]] even when the actual effect is too small to be of practical importance. See [[Null_hypothesis#Too-large_sample_size_problem]].
 
==Software for sample size calculations==
 
See [[Statistical power#Software for Power and Sample Size Calculations|Software for Power and Sample Size Calculations]].
 
==See also==
{{Portal|Statistics}}
*[[Degrees of freedom (statistics)]]
*[[Design of experiments]]
*[[Replication (statistics)]]
*[[Sampling (statistics)]]
*[[Statistical power]]
*[[Stratified sampling]]
*Engineering response surface example under [[Stepwise regression]].
 
==Notes==
{{Reflist}}
 
==References==
*{{cite journal |last=Bartlett |first=J. E., II |last2=Kotrlik |first2=J. W. |last3=Higgins |first3=C. |year=2001 |url=http://www.osra.org/itlpj/bartlettkotrlikhiggins.pdf |title=Organizational research: Determining appropriate sample size for survey research |journal=Information Technology, Learning, and Performance Journal |volume=19 |issue=1 |pages=43–50 |doi= }}
*{{cite book |authorlink=Leslie Kish |last=Kish |first=L. |year=1965 |title=Survey Sampling |publisher=Wiley |isbn=0-471-48900-X }}
 
==Further reading==
* [http://www.itl.nist.gov/div898/handbook/ppc/section3/ppc333.htm NIST: Selecting Sample Sizes]
* [http://ravenanalytics.com/Articles/Sample_Size_Calculations.htm Raven Analytics: Sample Size Calculations]
* [[ASTM]] E122-07: Standard Practice for Calculating Sample Size to Estimate, With Specified Precision, the Average for a Characteristic of a Lot or Process
 
==External links==
* [http://www.nss.gov.au/nss/home.NSF/pages/Sample+size+calculator Sample size calculator] from the Australian [[National Statistical Service]]
* [http://www.raosoft.com/samplesize.html Sample Size Calculator by Raosoft, Inc.]
 
{{Statistics|collection|state=expanded}}
 
{{DEFAULTSORT:Sample Size}}
[[Category:Sampling (statistics)]]
 
[[de:Zufallsstichprobe#Stichprobenumfang]]

Latest revision as of 16:32, 28 December 2014
