Lift (data mining): Difference between revisions

{{no footnotes|date=November 2013}}
[[Image:standard deviation diagram.svg|thumb|350px|Dark blue is less than one standard deviation from the mean. For the [[normal distribution]], this accounts for 68.27% of the set; while two standard deviations from the mean (medium and dark blue) account for 95.45%; and three standard deviations (light, medium, and dark blue) account for 99.73%.]]
[[File:Standard score and prediction interval.png|thumb|250px|right|Prediction interval (on the [[y-axis]]) given from the [[standard score]] (on the [[x-axis]]). The y-axis is logarithmically scaled (but the values on it are not modified).]]
 
In [[statistics]], the '''68–95–99.7 rule''', also known as the '''three-sigma rule''' or '''empirical rule''', states that nearly all values lie within three [[standard deviation]]s of the [[Arithmetic mean|mean]] in a [[normal distribution]].
 
About 68.27% of the values lie within one standard deviation of the mean.  Similarly, about 95.45% of the values lie within two standard deviations of the mean.  Nearly all (99.73%) of the values lie within three standard deviations of the mean.
 
In mathematical notation, these facts can be expressed as follows, where <span class="texhtml">x</span> is an observation from a normally distributed [[random variable]], <span class="texhtml">μ</span> is the mean of the distribution, and <span class="texhtml">σ</span> is its standard deviation:
:<math>\begin{align}
  \Pr(\mu-\;\,\sigma \le x \le \mu+\;\,\sigma) &\approx 0.6827 \\
  \Pr(\mu-2\sigma \le x \le \mu+2\sigma)      &\approx 0.9545 \\
  \Pr(\mu-3\sigma \le x \le \mu+3\sigma)      &\approx 0.9973
\end{align}
</math>
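As a quick numerical check, the three probabilities above can be reproduced with the error function, using the identity Pr(μ − kσ ≤ x ≤ μ + kσ) = erf(k/√2) for a normal distribution. This is only a minimal sketch; the function name <code>prob_within</code> is an illustrative choice, not part of any library.

```python
import math


def prob_within(k: float) -> float:
    """Probability that a normally distributed observation lies
    within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


# Print the coverage for one, two, and three standard deviations.
for k in (1, 2, 3):
    print(f"within {k} sigma: {prob_within(k):.4f}")
```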
 
==Derivation==
[[File:Cumulative distribution function for normal distribution, mean 0 and sd 1.png|270px|thumb|left|Diagram showing the [[cumulative distribution function]] for the normal distribution with mean (''µ'') 0 and variance (''σ''<sup>2</sup>)&nbsp;1. The prediction interval for any standard score corresponds numerically to (1-(1-<span style="font-size:100%;">Φ</span><sub>''µ'',''σ''<sup>2</sup></sub>(standard score))&middot;2). For example, a standard score of ''x''&nbsp;=&nbsp;2 gives <span style="font-size:100%;">Φ</span><sub>''µ'',''σ''<sup>2</sup></sub>(2) =&nbsp;0.97725 corresponding to a prediction interval of (1&nbsp;−&nbsp;(1&nbsp;−&nbsp;0.97725)&middot;2) =&nbsp;0.9545 =&nbsp;95.45%.]]
 
These numerical values come from the [[Normal_distribution#Cumulative_distribution|cumulative distribution function of the normal distribution]].  For example, <span class="texhtml">Φ(2) ≈ 0.9772</span>, or <span class="texhtml">Pr(x ≤ μ + 2σ) ≈ 0.9772</span>.  Note that this is not a symmetrical interval – this is merely the probability that an observation is less than <span class="texhtml">μ + 2σ</span>.  To compute the probability that an observation is within two standard deviations of the mean (small differences due to rounding):
:<math>\Pr(\mu-2\sigma \le x \le \mu+2\sigma)
= \Phi(2) - \Phi(-2)
\approx 0.9772 - (1 - 0.9772)
\approx 0.9545
</math>
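The same calculation can be sketched with the standard normal cumulative distribution function from Python's standard library (<code>statistics.NormalDist</code>). The helper name <code>central_probability</code> is an assumption made for illustration; working at full precision avoids the small rounding differences mentioned above.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)


def central_probability(z: float) -> float:
    """Probability of falling within z standard deviations of the mean,
    computed as Phi(z) - Phi(-z)."""
    return std_normal.cdf(z) - std_normal.cdf(-z)


phi_2 = std_normal.cdf(2)          # one-sided probability Pr(x <= mu + 2 sigma)
two_sided = central_probability(2)  # symmetric interval around the mean

print(f"Phi(2) = {phi_2:.5f}")
print(f"Pr(within 2 sigma) = {two_sided:.4f}")
```

By symmetry, Φ(−z) = 1 − Φ(z), so the result equals 2Φ(z) − 1.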
 
This is related to the [[confidence interval]] as used in statistics: <math>\scriptstyle \bar{x} \pm 2\sigma/\sqrt{n}</math> is approximately a 95% confidence interval for the population mean when <math>\bar{x}</math> is the average of a sample of size <math>n</math>.
 
== Uses ==
This rule is often used to get a quick, rough probability estimate for an event, given its standard deviation, when the population is assumed normal. It also serves as a simple test for [[outliers]] (if the population is assumed normal) and as a [[normality test]] (if the population is potentially not normal).
 
Recall that to pass from a sample to a number of standard deviations, one first computes the [[deviation (statistics)|deviation]], which is either the [[Errors and residuals in statistics|error or residual]] (according to whether one knows the population mean or only estimates it). One then either uses [[standardizing]] (dividing by the population standard deviation), if the population parameters are known, or [[studentizing]] (dividing by an estimate of the standard deviation), if the parameters are unknown and only estimated.
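The two cases can be sketched as follows. The function names are hypothetical, and the studentized version is a deliberate simplification: it divides by the sample standard deviation only and ignores the leverage corrections used for studentized residuals in regression.

```python
import statistics
from typing import List, Sequence


def standardize(values: Sequence[float],
                population_mean: float,
                population_sd: float) -> List[float]:
    """Errors divided by the known population standard deviation."""
    return [(v - population_mean) / population_sd for v in values]


def studentize(values: Sequence[float]) -> List[float]:
    """Residuals divided by an estimate of the standard deviation
    (simplified: sample mean and sample standard deviation only)."""
    sample_mean = statistics.fmean(values)
    sample_sd = statistics.stdev(values)  # uses the n - 1 denominator
    return [(v - sample_mean) / sample_sd for v in values]


print(standardize([10.0, 12.0], population_mean=10.0, population_sd=2.0))
print(studentize([1.0, 2.0, 3.0]))
```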
 
To use the rule as a test for outliers or as a normality test, one computes the size of deviations in terms of standard deviations and compares this to the expected frequency. Given a sample set, compute the [[studentized residual]]s and compare these to the expected frequency: points that fall more than 3 standard deviations from the mean are likely outliers (unless the [[sample size]] is large enough that one expects a sample this extreme), and if there are many points more than 3 standard deviations from the mean, one likely has reason to question the assumed normality of the distribution. This holds ever more strongly for moves of 4 or more standard deviations.
 
One can compute more precisely by approximating the number of extreme moves of a given magnitude or greater by a [[Poisson distribution]]. Put simply, however, if one observes multiple 4 standard deviation moves in a sample of size 1,000, one has strong reason to consider these outliers or to question the assumed normality of the distribution.
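A rough sketch of the Poisson approximation, assuming independent draws from a normal distribution: the expected number of moves of at least ''k'' standard deviations in a sample of size ''n'' is λ = ''n''·(1 − erf(''k''/√2)), and the count of such moves is treated as Poisson with mean λ. The function names are illustrative only.

```python
import math


def tail_probability(k: float) -> float:
    """Two-sided probability of a deviation of at least k standard
    deviations under normality, i.e. 1 - erf(k / sqrt(2))."""
    return math.erfc(k / math.sqrt(2))


def prob_at_least(m: int, sample_size: int, k: float) -> float:
    """Poisson approximation to the probability of seeing at least m
    moves of k or more standard deviations in a sample."""
    lam = sample_size * tail_probability(k)  # expected number of such moves
    # Probability of seeing fewer than m events under Poisson(lam).
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(m))
    return 1.0 - below


expected = 1000 * tail_probability(4)
p_multiple = prob_at_least(2, sample_size=1000, k=4)
print(f"expected 4-sigma moves in 1,000 draws: {expected:.4f}")
print(f"probability of two or more: {p_multiple:.4%}")
```

Under these assumptions the probability of two or more 4 standard deviation moves in 1,000 draws comes out well below 1%, which is why multiple such moves cast doubt on the normal model.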
 
==Higher deviations==
Because the tails of the normal distribution decay faster than exponentially, the odds of higher deviations decrease very quickly. From the [[Standard deviation#Rules for normally distributed data|rules for normally distributed data]]:
{| class="wikitable" style="text-align:center"
|- bgcolor="#CCCCCC"
! Range !! Population in range !! Expected frequency outside range !! Approx. frequency for daily event
|-
|μ ± 1σ || {{gaps|0.682|689|492|137|086}} || 1 in 3 || Twice a week
|-
|μ ± 1.5σ || {{gaps|0.866|385|597|462|284}} || 1 in 7 || Weekly
|-
|μ ± 2σ || {{gaps|0.954|499|736|103|642}} || 1 in 22 || Every three weeks
|-
|μ ± 2.5σ || {{gaps|0.987|580|669|348|448}} || 1 in 81 || Quarterly
|-
|μ ± 3σ || {{gaps|0.997|300|203|936|740}} || 1 in 370 || Yearly
|-
|μ ± 3.5σ || {{gaps|0.999|534|741|841|929}} || 1 in 2149 || Every six years
|-
|μ ± 4σ || {{gaps|0.999|936|657|516|334}} || 1 in {{val|15787}} || Every 43 years (twice in a lifetime)
|-
|μ ± 4.5σ || {{gaps|0.999|993|204|653|751}} || 1 in {{val|147160}} || Every 403 years
|-
|μ ± 5σ || {{gaps|0.999|999|426|696|856}} || 1 in {{val|1744278}} || Every {{val|4776}} years (once in recorded history)
|-
|μ ± 5.5σ || {{gaps|0.999|999|962|020|875}} || 1 in {{val|26330254}} || Every {{val|72090}} years
|-
|μ ± 6σ || {{gaps|0.999|999|998|026|825}} || 1 in {{val|506797346}} || Every 1.38 million years (history of [[Homo Sapiens|humankind]])
|-
|μ ± 6.5σ || {{gaps|0.999|999|999|919|680}} || 1 in {{val|12450197393}} || Every 34 million years
|-
|μ ± 7σ || {{gaps|0.999|999|999|997|440}} || 1 in {{val|390682215445}} || Every 1.07 billion years
|-
|μ ± {{math|<var>x</var>}}σ || [[Error function|<math>\textstyle\operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)</math>]]  || 1 in <math>\textstyle \frac{1}{1-\operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)}</math> || Every <math>\textstyle \frac{1}{1-\operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)}</math> days
|}
Thus for a daily process, a 6''σ'' event is expected to happen less than once in a million years. This gives a [[normality_test#Back_of_the_envelope_test|simple normality test]]: if one witnesses a 6''σ'' event in daily data and significantly fewer than 1 million years have passed, then a normal distribution most likely does not provide a good model for the magnitude or frequency of large deviations in this respect.

In ''[[The Black Swan (Taleb book)|The Black Swan]]'', [[Nassim Nicholas Taleb]] gives the example of risk models for which the [[Black Monday (1987)|Black Monday]] crash was a 36-sigma event: the occurrence of such an event should instantly suggest a catastrophic flaw in the model. However, such models were created before there was a proper understanding of [[stochastic volatility]], and the recitation of such calculations, which no modern practitioner would take seriously at all, is somewhat akin to a [[straw man]] argument.

In such discussions it is important to be aware that nothing in the process of drawing with replacement specifies the order in which unlikely events should occur, merely their relative frequency, and one must take care when reasoning from sequential draws. It is a [[corollary]] of the [[gambler's fallacy]] to suggest that just because a rare event has been observed, that rare event was not rare. It is the observation of a multitude of purportedly rare events that undermines the hypothesis that they are actually rare.
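The general formula in the last row of the table can be evaluated numerically, as in the minimal sketch below. The constant <code>DAYS_PER_YEAR</code> (365.25) is an assumption used to convert a daily frequency into years, and the function names are illustrative.

```python
import math

DAYS_PER_YEAR = 365.25  # assumed conversion factor for a daily event


def one_in(k: float) -> float:
    """Expected number of observations per event outside mu +/- k sigma,
    i.e. 1 / (1 - erf(k / sqrt(2)))."""
    # erfc(x) = 1 - erf(x), but stays accurate far out in the tails.
    return 1.0 / math.erfc(k / math.sqrt(2))


def years_between(k: float) -> float:
    """Approximate years between such events for a daily process."""
    return one_in(k) / DAYS_PER_YEAR


for k in (3, 4, 5, 6):
    print(f"{k} sigma: 1 in {one_in(k):,.0f}, "
          f"about every {years_between(k):,.1f} years")
```

Using <code>math.erfc</code> rather than <code>1 - math.erf(...)</code> avoids the loss of precision that occurs when subtracting two nearly equal numbers for large ''k''.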
 
== See also ==
* [[Standard score]]
* [[t-statistic]]
 
== External links ==
* "[http://www-stat.stanford.edu/~naras/jsm/NormalDensity/NormalDensity.html The Normal Distribution]" by Balasubramanian Narasimhan
* "[http://www.wolframalpha.com/input/?i=erf%28x%2Fsqrt%282%29%29 Calculate percentage proportion within ''x'' sigmas]" at WolframAlpha
 
{{ProbDistributions|Normal distribution}}
 
{{DEFAULTSORT:68-95-99.7 rule}}
[[Category:Data analysis]]
[[Category:Statistical approximations]]
 
[[pl:Odchylenie standardowe#Dla rozkładu normalnego]]

Latest revision as of 19:34, 22 October 2014
