In [[robust statistics]], '''Peirce's criterion'''  is a rule for eliminating [[outlier]]s from data sets, which was devised by [[Benjamin Peirce]].
 
==Outliers removed by Peirce's criterion==
 
===The problem of outliers===
{{Main|outlier}}
{{See also|robust statistic}}
In [[data set]]s containing real-numbered measurements, the suspected [[outlier]]s are the measured values that appear to lie outside the cluster of most of the other data values. Including the outliers greatly changes the estimate of location when the arithmetic mean is used as a summary statistic: the arithmetic mean is very sensitive to outliers, and in statistical terminology it is not [[robust statistic|robust]].
 
In the presence of outliers, the statistician has two options: remove the suspected [[outlier]]s from the data set and then use the arithmetic mean to estimate the location parameter, or use a robust statistic, such as the [[median]].
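For instance (a minimal numeric sketch; the values here are invented for illustration), a single gross outlier can shift the arithmetic mean far from the bulk of the data while the median barely moves:

<syntaxhighlight lang="python">
import numpy

# Illustrative values only (not from Peirce or Gould): five measurements
# clustered near 10, then the same sample with one gross outlier appended.
clean = numpy.array([9.8, 9.9, 10.0, 10.1, 10.2])
dirty = numpy.append(clean, 100.0)

print(numpy.mean(clean), numpy.median(clean))  # 10.0 10.0
print(numpy.mean(dirty), numpy.median(dirty))  # 25.0 10.05 -> mean is not robust
</syntaxhighlight>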
 
Peirce's criterion is a statistical procedure for eliminating outliers.
 
===Uses of Peirce's criterion===
The statistician and historian of statistics [[Stephen M. Stigler]] wrote the following about [[Benjamin Peirce]]:<ref name=stig246>S.M. Stigler, "Mathematical statistics in the early states," The Annals of Statistics, vol. 6, no. 2, p. 246, 1978. Available online: http://www.jstor.org/stable/2958876</ref>
<blockquote>
"In 1852 he published the first [[significance test]] designed to tell an investigator whether an outlier should be rejected (Peirce 1852, 1878). The test, based on a [[likelihood ratio]] type of argument, had the distinction of producing an international debate on the wisdom of such actions ([[Francis J. Anscombe|Anscombe]], 1960, Rider, 1933, [[Stephen Stigler|Stigler]], 1973a)."
</blockquote>
 
Peirce's criterion is derived from a statistical analysis of the [[Gaussian distribution]]. Unlike some other criteria for removing outliers, Peirce's method can be applied to identify two or more outliers.
 
<blockquote>
"It is proposed to determine in a series of <math>m</math> observations the limit of error, beyond which all observations involving so great an error may be rejected, provided there are as many as <math>n</math> such observations. The principle upon which it is proposed to solve this problem is, that the proposed observations should be rejected when the probability of the system of errors obtained by retaining them is less than that of the system of errors obtained by their rejection multiplied by the probability of making so many, and no more, abnormal observations."<ref name=Pierce516>Quoted in the editorial note on page 516 of the ''Collected Writings'' of Peirce (1982 edition). The quotation cites ''A Manual of Astronomy'' (2:558) by Chauvenet.</ref>
</blockquote>
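In the quotation's notation (a series of <math>m</math> observations, of which <math>n</math> are suspect), the principle can be restated as an inequality; the symbols below are ours, not Peirce's. Reject the <math>n</math> suspect observations when

:<math>P_{\text{retain}} < P_{\text{reject}} \cdot P(n \text{ abnormal observations}),</math>

where <math>P_{\text{retain}}</math> is the probability of the system of errors obtained by retaining the suspect observations, and <math>P_{\text{reject}}</math> is that obtained by rejecting them.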
Hawkins<ref>D.M. Hawkins (1980). "Brief early history in outlier rejection," Identification of Outliers (Monographs on Applied Probability and Statistics). Chapman & Hall, page 10.</ref> provides a formula for the criterion.
 
Peirce's criterion was used for decades at the [[United States Coast Survey]].<ref>Peirce (1878)</ref>
<blockquote>
"From 1852 to 1867 he served as the director of the longitude determinations of the U. S. Coast Survey and from 1867 to 1874 as superintendent of the Survey. During these years his test was consistently employed by all the clerks of this, the most active and mathematically inclined statistical organization of the era."<ref name=stig246/>
</blockquote>
 
Peirce's criterion was discussed in [[William Chauvenet]]'s book.<ref name=Pierce516/>
 
==Applications==
One application of Peirce's criterion is removing poor data points from observation pairs in order to perform a regression between the two observations (e.g., a linear regression).  Peirce's criterion does not depend on the observation data themselves (only on characteristics of the observation data), which makes it a highly repeatable process that can be calculated independently of other processes.  This feature makes Peirce's criterion ideal for computer applications because it can be written as a function call.
 
===Previous attempts===
In 1855, B.A. Gould attempted to make Peirce's criterion easier to apply by creating tables of values computed from Peirce's equations.<ref name=":0">Gould, B.A., "On Peirce's criterion for the rejection of doubtful observations, with tables for facilitating its application," Astronomical Journal, iss. 83, vol. 4, no. 11, pp. 81--87, 1855. DOI: 10.1086/100480. Available online at http://adsabs.harvard.edu/full/1855AJ......4...81G</ref>  Unfortunately, a disconnect remains between Gould's algorithm and the practical application of Peirce's criterion.
 
In 2003, S.M. Ross (University of New Haven) re-presented Gould's algorithm (now called "Peirce's method") with a new example data set and a work-through of the algorithm.  Unfortunately, this methodology still relies on look-up tables, which were updated in that work (Peirce's criterion table).<ref>Ross, S.M., "Peirce's criterion for the elimination of suspect experimental data," Journal of Engineering Technology, vol. 2, no. 2, pp. 1-12, 2003.  Available online: http://www.eol.ucar.edu/system/files/piercescriterion.pdf</ref>
 
In 2008, the Danish geologist K. Thomsen attempted to write pseudocode for the criterion.<ref>Thomsen, K., "Topic: Computing tables for use with Peirce's Criterion - in 1855 and 2008", The Math Forum @ Drexel, posted 5 Oct. 2008. Available online at http://mathforum.org/kb/message.jspa?messageID=6449606.  Accessed 15 Jul. 2013.</ref>  While this code provided some framework for Gould's algorithm, users were unsuccessful in calculating values reported by either Peirce or Gould.
 
In 2012, C. Dardis released the R package "Peirce" with various methodologies (Peirce's criterion and the Chauvenet method) and comparisons of outlier removals. Dardis and fellow contributor Simon Muller successfully implemented Thomsen's pseudocode into a function called "findx".  The code is presented in the R implementation section below.  References for the R package are available online<ref>C. Dardis, "Package: Peirce," R-forge, accessed online: https://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/Peirce/Peirce-manual.pdf?root=peirce</ref> as well as an unpublished review of the R package results.<ref>C. Dardis, "Peirce's criterion for the rejection of non-normal outliers; defining the range of applicability," Journal of Statistical Software (unpublished). Available online: https://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/Peirce/PeirceSub.pdf?root=peirce</ref>
 
In 2013, a re-examination of Gould's algorithm and the use of advanced Python programming modules (i.e., numpy and scipy) made it possible to calculate the squared-error threshold values for identifying outliers.
 
===Python implementation===
In order to use Peirce's criterion, one must first understand its input and return values.  Regression analysis (or the fitting of curves to data) results in residual errors (the differences between the fitted curve and the observation points), so each observation point has a residual error associated with the fitted curve.  By squaring (i.e., raising each residual error to the power of two), residual errors are expressed as positive values.  If a squared error is too large (i.e., due to a poor observation), it can distort the regression parameters (e.g., slope and intercept for a linear curve) retrieved from the curve fitting.
 
It was Peirce's idea to statistically identify what constitutes an error "too large", and therefore an "outlier" that can be removed from the observations to improve the fit between the observations and the curve.  K. Thomsen identified three parameters needed to perform the calculation: the number of observation pairs (N), the number of outliers to be removed (n), and the number of regression parameters (e.g., coefficients) used in the curve-fitting to get the residuals (m).  The end result of this process is a threshold value (of squared error): observations with a squared error smaller than this threshold should be kept, and observations with a larger squared error should be removed (i.e., as outliers).
 
Because Peirce's criterion does not take the observations, fitting parameters, or residual errors as input, its output must be re-associated with the data.  Taking the average of all the squared errors (i.e., the mean-squared error) and multiplying it by the threshold squared error (i.e., the output of this function) yields the data-specific threshold value used to identify outliers.
 
The following Python code returns x-squared values for a given N (first column) and n (top row) in Table 1 (m = 1) and Table 2 (m = 2) of Gould (1855).<ref name=":0" />  Thanks to the Newton's-method iteration, look-up tables, such as N versus log Q (Table III in Gould, 1855) and x versus log R (Table III in Peirce, 1852 and Table IV in Gould, 1855), are no longer necessary.
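Written out, the relationships iterated by the code below (labelled with the equation letters from Gould that appear in the code comments) are

:<math>Q^N = \frac{n^n (N-n)^{N-n}}{N^N}</math> (equation B),
:<math>\lambda = \left(\frac{Q^N}{R^n}\right)^{1/(N-n)}</math> (equation A&prime;),
:<math>x^2 = 1 + \frac{N-m-n}{n}\left(1 - \lambda^2\right)</math> (equation C),
:<math>R = e^{(x^2-1)/2} \operatorname{erfc}\!\left(\frac{x}{\sqrt{2}}\right)</math> (equation D),

with equations A&prime;, C, and D applied in turn until successive values of <math>R</math> converge.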
 
====Python code====
<syntaxhighlight lang="python">
#!/usr/bin/python
#
# peirce_dev.py
# created 16 July 2013
#
#
#### MODULES ####
import numpy
import scipy.special
 
#### FUNCTION ####
def peirce_dev(N, n, m):
  # N :: number of observations
  # n :: number of outliers to be removed
  # m :: number of model unknowns (e.g., regression parameters)
  #
  # Assign floats to input variables:
  N = float(N)
  n = float(n)
  m = float(m)
  #
  # Check number of observations:
  if N > 1:
      # Calculate Q (Nth root of Gould's equation B):
      Q = (n**(n/N)*(N-n)**((N-n)/N))/N
      #
      # Initialize R values (as floats)
      Rnew = 1.0 
      Rold = 0.0 # <- Necessary to prompt while loop
      #
      # Start iteration to converge on R:
      while ( abs(Rnew-Rold) > (N*2.0e-16) ):
        # Calculate Lamda
        # (1/(N-n)th root of Gould's equation A'):
        ldiv = Rnew**n
        if ldiv == 0:
            ldiv = 1.0e-6
        Lamda = ((Q**N)/(ldiv))**(1.0/(N-n))
        #
        # Calculate x-squared (Gould's equation C):
        x2 = 1.0 + (N-m-n)/n * (1.0-Lamda**2.0)
        #
        # If x2 goes negative, return 0:
        if x2 < 0:
            x2 = 0.0
            Rold = Rnew
        else:
            # Use x-squared to update R (Gould's equation D):
            Rold = Rnew
            Rnew = numpy.exp((x2-1)/2.0) * scipy.special.erfc(
              numpy.sqrt(x2)/numpy.sqrt(2.0)
              )
        #
  else:
      x2 = 0.0
  return x2
</syntaxhighlight>
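As a usage sketch of the re-association step described above (the data and fit here are invented for illustration; "peirce_dev" is assumed to be the function defined in the code block above):

<syntaxhighlight lang="python">
import numpy

# Illustrative data: a straight line with one planted gross outlier.
x = numpy.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[4] += 10.0

# Linear fit (m = 2 parameters: slope and intercept) and squared residuals:
coef = numpy.polyfit(x, y, 1)
resid_sq = (y - numpy.polyval(coef, x))**2

# Re-associate the criterion with the data: multiply the returned x-squared
# by the mean-squared error to get the data-specific threshold (here n = 1):
threshold = peirce_dev(len(x), 1, 2) * numpy.mean(resid_sq)
print(numpy.where(resid_sq > threshold)[0])  # expected to flag index 4
</syntaxhighlight>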
 
===R implementation===
Thomsen's code was successfully written into the following function, "findx", by C. Dardis and S. Muller in 2012; it returns the maximum error deviation, <math>x</math>.  To complement the Python code presented in the previous section, the R equivalent of "peirce_dev" is also presented here; it returns the squared maximum error deviation, <math>x^2</math>.  The two functions return equivalent values: square the value returned by "findx", or take the square root of the value returned by "peirce_dev".  Differences occur in error handling.  For example, "findx" returns NaN for invalid data, while "peirce_dev" returns 0 (which allows computations to continue without additional NA handling).  Also, "findx" does not support any error handling when the number of potential outliers approaches the number of observations (it throws a missing-value error and a NaN warning).
 
Just as with the Python version, the squared error (i.e., <math>x^2</math>) returned by "peirce_dev" must be multiplied by the mean-squared error of the model fit to get the squared-delta value (i.e., <math>\Delta^2</math>).  Use <math>\Delta^2</math> to compare the squared-error values of the model fit: any observation pairs with a squared error greater than <math>\Delta^2</math> are considered outliers and can be removed from the model.  An iterator should be written to test increasing values of n until the number of outliers identified (comparing <math>\Delta^2</math> to the model-fit squared errors) is less than those assumed (i.e., Peirce's n).
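One way to write that iterator is sketched below, in Python for consistency with the earlier section; the loop structure and the n &lt; N &minus; m guard are ours, and "peirce_dev" is assumed to be the function defined above:

<syntaxhighlight lang="python">
import numpy

def peirce_outliers(resid_sq, m):
    # resid_sq :: squared residuals of a fit with m model parameters
    N = len(resid_sq)
    mse = numpy.mean(resid_sq)
    outliers = numpy.array([], dtype=int)
    n = 1
    while n < N - m:
        # Data-specific threshold (delta-squared) for n assumed outliers:
        delta_sq = peirce_dev(N, n, m) * mse
        flagged = numpy.where(resid_sq > delta_sq)[0]
        if len(flagged) < n:
            # Fewer outliers identified than assumed: stop iterating.
            break
        outliers = flagged
        n += 1
    return outliers
</syntaxhighlight>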
 
====R code====
<syntaxhighlight lang="rsplus">
findx <- function(N,k,m){
  # method by K. Thomsen (2008)
  # written by C. Dardis and S. Muller (2012)
  # Available online: https://r-forge.r-project.org/R/?group_id=1473
  #
  # Variable definitions:
  # N :: number of observations
  # k :: number of potential outliers to be removed
  # m :: number of unknown quantities
  #
  # Requires the complementary error function, erfc:
  erfc <- function(x) 2 * pnorm(x * sqrt(2), lower.tail = FALSE)
  #
  x <- 1
  if ((N - m - k) <= 0) {
    return(NaN)
  } else {
    x    <- min(x, sqrt((N - m)/k) - 1e-10)
    #
    # Log of Gould's equation B:
    LnQN <- k * log(k) + (N - k) * log(N - k) - N * log(N)
    #
    # Gould's equation D:
    R1  <- exp((x^2 - 1)/2) * erfc(x/sqrt(2))
    #
    # Gould's equation A' solved for R w/ Lambda substitution:
    R2  <- exp( (LnQN - 0.5 * (N - k) * log((N-m-k*x^2)/(N-m-k)) )/k )
    #
    # Equate the two R equations:
    R1d  <- x * R1 - sqrt(2/pi/exp(1))
    R2d  <- x * (N - k)/(N - m - k * x^2) * R2
    #
    # Update x:
    oldx <- x
    x    <- oldx - (R1 - R2)/(R1d - R2d)
    #
    # Loop until convergence:
    while (abs(x - oldx) >= N * 2e-16){
      R1  <- exp((x^2 - 1)/2) * erfc(x/sqrt(2))
      R2  <- exp( (LnQN - 0.5 * (N - k) * log((N-m-k*x^2)/(N-m-k)) )/k )
      R1d  <- x * R1 - sqrt(2/pi/exp(1))
      R2d  <- x * (N - k)/(N - m - k * x^2) * R2
      oldx <- x
      x    <- oldx - (R1 - R2)/(R1d - R2d)
    }
  }
  return(x)
}
</syntaxhighlight>
 
<syntaxhighlight lang="rsplus">
peirce_dev <- function(N, n, m){
    # N :: total number of observations
    # n :: number of outliers to be removed
    # m :: number of model unknowns (e.g., regression parameters)
    #
    # Check number of observations:
    if (N > 1) {
      # Calculate Q (Nth root of Gould's equation B):
      Q = (n^(n/N) * (N-n)^((N-n)/N))/N
      #
      # Initialize R values:
      Rnew = 1.0
      Rold = 0.0  # <- Necessary to prompt while loop
      #
      while(abs(Rnew-Rold) > (N*2.0e-16)){
          # Calculate Lamda (1/(N-n)th root of Gould's equation A'):
          ldiv = Rnew^n
          if (ldiv == 0){
              ldiv = 1.0e-6
          }
          Lamda = ((Q^N)/(ldiv))^(1.0/(N-n))
          #
          # Calculate x-squared (Gould's equation C):
          x2 = 1.0 + (N-m-n)/n * (1.0-Lamda^2.0)
          #
          # If x2 goes negative, set equal to zero:
          if (x2 < 0){
              x2 = 0
              Rold = Rnew
          } else {
              #
              # Use x-squared to update R (Gould's equation D):
              # NOTE: error function (erfc) is replaced with pnorm (Rbasic):
              # source:
              # http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Normal.html
              Rold = Rnew
              Rnew = exp((x2-1)/2.0)*(2*pnorm(sqrt(x2)/sqrt(2)*sqrt(2), lower.tail = FALSE))
          }
      }
    } else {
      x2 = 0
    }
    x2
}
</syntaxhighlight>
 
==Notes==
<references/>
 
==References==
* [[Benjamin Peirce|Peirce, Benjamin]], [http://articles.adsabs.harvard.edu/cgi-bin/nph-iarticle_query?1852AJ......2..161P;data_type=PDF_HIGH "Criterion for the Rejection of Doubtful Observations"], ''Astronomical Journal'' II 45 (1852) and [http://articles.adsabs.harvard.edu/cgi-bin/nph-iarticle_query?1852AJ......2..176P;data_type=PDF_HIGH Errata to the original paper].
 
* {{cite journal
|title=On Peirce's criterion
|authorlink=Benjamin Peirce
|first=Benjamin
|last=Peirce
|journal=Proceedings of the [[American Academy of Arts and Sciences]]
|volume=13
|month=May
|year=1877&ndash;1878
|pages=348–351
|jstor=25138498
|doi=10.2307/25138498
}}
 
* {{cite journal
|first=Charles Sanders
|last=Peirce
|authorlink=Charles Sanders Peirce
|title=Appendix No. 21. On the Theory of Errors of Observation
|journal=Report of the Superintendent of the United States [[Coast Survey]] Showing the Progress of the Survey During the Year 1870
|year=1870 [published 1873]
|pages=200–224
}}. NOAA [http://docs.lib.noaa.gov/rescue/cgs/001_pdf/CSC-0019.PDF#page=215 PDF Eprint] (goes to Report p.&nbsp;200, PDF's p.&nbsp;215). U.S. Coast and Geodetic Survey Annual Reports [http://docs.lib.noaa.gov/rescue/cgs/data_rescue_cgs_annual_reports.html links for years 1837&ndash;1965].
 
* {{cite book
|first=Charles Sanders
|last=Peirce
|authorlink=Charles Sanders Peirce
|contribution=On the Theory of Errors of Observation
|title=Writings of Charles S. Peirce: A Chronological Edition
|volume=Volume 3, 1872&ndash;1878
|editor=Kloesel, Christian J. W., ''et alia''
|publisher=Indiana University Press
|location=Bloomington, Indiana <!-- |copyright=1986 -->
|year=1982 [1986 copyright]<!-- copyright=1986, but publication is listed as 1982 -->
|pages=140–160
|isbn=0-253-37201-1
}}
 
* Ross, Stephen, "Peirce's Criterion for the Elimination of Suspect Experimental Data", ''J. Engr. Technology'', vol. 20 no.2, Fall, 2003. [http://newton.newhaven.edu/sross/piercescriterion.pdf]
 
* {{cite journal
|authorlink=Stephen Stigler
|last=Stigler
|first=Stephen M.
|title=Mathematical Statistics in the Early States
|journal=Annals of Statistics
|date=March 1978
|volume=6
|pages=239–265
|url=http://projecteuclid.org/euclid.aos/1176344123
|doi=10.1214/aos/1176344123
|jstor = 2958876
|mr=483118
|issue=2
}}
*{{cite book
|authorlink=Stephen Stigler
|last=Stigler
|first=Stephen M.
|chapter=Mathematical Statistics in the Early States
|editor=[[Stephen M. Stigler]]
|title=American Contributions to Mathematical Statistics in the Nineteenth Century, Volumes I & II
|volume=I
|publisher=Arno Press
|location=New York
|year=1980}}
 
*{{cite book
|authorlink=Stephen Stigler
|last=Stigler
|first=Stephen M.
|chapter=Mathematical Statistics in the Early States
|editor=Peter Duren
|title=A Century of Mathematics in America  <!-- Part III -->
|volume=III
|publisher=American Mathematical Society
|location=Providence, RI
|year=1989
|pages=537–564
}}
 
* Hawkins, D.M. (1980). ''Identification of outliers''. [[Chapman and Hall]], London. ISBN 0-412-21900-X
 
* Chauvenet, W. (1876) ''A Manual of Spherical and Practical Astronomy''. J.B.Lippincott, Philadelphia. (reprints of various editions: Dover, 1960; Peter Smith Pub, 2000, ISBN 0-8446-1845-4; Adamant Media Corporation (2 Volumes), 2001, ISBN 1-4021-7283-4, ISBN 1-4212-7259-8; BiblioBazaar, 2009, ISBN 1-103-92942-9 )
 
[[Category:Statistical theory]]
[[Category:Statistical outliers]]
