| In [[statistics]], '''M-estimators''' are a broad class of [[estimator]]s, which are obtained as the minima of sums of functions of the data. Least-squares estimators are M-estimators. The definition of M-estimators was motivated by [[robust statistics]], which contributed new types of M-estimators. The statistical procedure of evaluating an M-estimator on a data set is called '''M-estimation'''.
| | |
| More generally, an M-estimator may be defined to be a zero of an [[estimating equations|estimating function]].<ref>V. P. Godambe, editor. ''Estimating functions'', volume 7 of Oxford Statistical Science Series. The Clarendon Press Oxford University Press, New York, 1991.</ref><ref>Christopher C. Heyde. ''Quasi-likelihood and its application: A general approach to optimal parameter estimation''. Springer Series in Statistics. Springer-Verlag, New York, 1997.</ref><ref>D. L. McLeish and Christopher G. Small. ''The theory and applications of statistical inference functions'', volume 44 of Lecture Notes in Statistics. Springer-Verlag, New York, 1988.</ref><ref>Parimal Mukhopadhyay. ''An Introduction to Estimating Functions''. Alpha Science International, Ltd, 2004.</ref><ref>Christopher G. Small and Jinfang Wang. ''Numerical methods for nonlinear estimating equations'', volume 29 of Oxford Statistical Science Series. The Clarendon Press Oxford University Press, New York, 2003.</ref><ref>Sara A. van de Geer. ''Empirical Processes in M-estimation: Applications of empirical process theory,'' volume 6 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2000.</ref> This estimating function is often the derivative of another statistical function. For example, a maximum-likelihood estimate is often defined to be a zero of the derivative of the likelihood function with respect to the parameter; thus, a maximum-likelihood estimator is often a [[Critical_point_(mathematics)|critical point]] of the likelihood function, that is, a zero of the [[Score (statistics)|score]] function.<ref>{{cite journal | last = Ferguson | first = Thomas S. | title = An inconsistent maximum likelihood estimate | journal = Journal of the American Statistical Association | volume = 77 | issue = 380 | year = 1982 | pages = 831–834 | jstor=2287314 }}</ref> In many applications, such M-estimators can be thought of as estimating characteristics of the population.
| |
| | |
| ==Historical motivation==
| |
| The method of [[least squares]] is a prototypical M-estimator, since the estimator is defined as a minimum of the sum of squares of the residuals.
| |
| | |
| Another popular M-estimator is maximum-likelihood estimation. For a family of [[probability density function]]s ''f'' parameterized by ''θ'', a [[maximum likelihood]] estimator of ''θ'' is computed for each set of data by maximizing the [[likelihood function]] over the parameter space {''θ''}. When the observations are independent and identically distributed, an ML-estimate <math>\hat{\theta}</math> satisfies
| |
| | |
| :<math>\widehat{\theta} = \arg\max_{\displaystyle\theta}{ \left( \prod_{i=1}^n f(x_i, \theta) \right) }\,\!</math>
| |
| | |
| or, equivalently,
| |
| | |
| :<math>\widehat{\theta} = \arg\min_{\displaystyle\theta}{ \left( -\sum_{i=1}^n \log{( f(x_i, \theta) ) }\right) }.\,\!</math>
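As a minimal sketch of the second form (not taken from the cited sources), the following Python snippet minimizes the negative log-likelihood of an exponential model numerically and compares the result with the known closed-form MLE, which is the reciprocal of the sample mean. The data, the helper name <code>argmin_unimodal</code> and the search interval are illustrative assumptions.

```python
import math

def neg_log_likelihood(rate, xs):
    """Negative log-likelihood of an i.i.d. exponential(rate) sample."""
    return -sum(math.log(rate) - rate * x for x in xs)

def argmin_unimodal(f, lo, hi, tol=1e-9):
    """Ternary search for the minimizer of a unimodal function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2  # the minimizer cannot lie to the right of m2
        else:
            lo = m1  # the minimizer cannot lie to the left of m1
    return (lo + hi) / 2

xs = [0.5, 1.2, 0.3, 2.1, 0.9]  # made-up observations
rate_hat = argmin_unimodal(lambda r: neg_log_likelihood(r, xs), 1e-6, 50.0)
closed_form = len(xs) / sum(xs)  # closed-form MLE of the exponential rate
print(rate_hat, closed_form)
```

The two values agree up to the numerical tolerance of the search, which illustrates that maximizing the product of densities and minimizing the sum of negative log-densities pick out the same estimate.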
| |
| | |
| Maximum-likelihood estimators are often inefficient and biased for finite samples. For many regular problems, maximum-likelihood estimation performs well for "large samples", being an approximation of a [[MAP estimator|posterior mode]]. If the problem is "regular", then any bias of the MLE (or posterior mode) decreases to zero when the sample-size increases to infinity. The performance of maximum-likelihood (and posterior-mode) estimators drops when the parametric family is mis-specified.
| |
| | |
| ==Definition==
| |
| In 1964, [[Peter Huber]] proposed generalizing maximum likelihood estimation to the minimization of
| |
| | |
| :<math>\sum_{i=1}^n\rho(x_i, \theta),\,\!</math>
| |
| | |
| where ρ is a function with certain properties (see below). The solutions
| |
| | |
| :<math>\hat{\theta} = \arg\min_{\displaystyle\theta}\left(\sum_{i=1}^n\rho(x_i, \theta)\right) \,\!</math>
| |
| | |
| are called '''M-estimators''' ("M" for "maximum likelihood-type"; Huber, 1981, page 43). Other types of robust estimator include [[L-estimator]]s, [[R-estimator]]s and [[S-estimator]]s. Maximum likelihood estimators (MLE) are thus a special case of M-estimators. With suitable rescaling, M-estimators are special cases of [[extremum estimator]]s (in which more general functions of the observations can be used).
| |
| | |
| The function ρ, or its derivative, ψ, can be chosen so as to give the estimator desirable properties (in terms of bias and efficiency) when the data are truly from the assumed distribution, and 'not bad' behaviour when the data are generated from a model that is, in some sense, ''close'' to the assumed distribution.
| |
| | |
| ==Types of M-estimators==
| |
| | |
| M-estimators are solutions, ''θ'', which minimize
| |
| | |
| :<math>\sum_{i=1}^n\rho(x_i,\theta).\,\!</math>
| |
| | |
| This minimization can always be done directly. Often it is simpler to differentiate with respect to ''θ'' and solve for the root of the derivative. When this differentiation is possible, the M-estimator is said to be of '''ψ-type'''. Otherwise, the M-estimator is said to be of '''ρ-type'''.
| |
| | |
| In most practical cases, the M-estimators are of ψ-type.
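The two routes can be compared on a small example. The sketch below is an illustration only: it assumes Huber's ρ function for a location parameter with the conventional tuning constant 1.345 and unit scale, and the function names and data are hypothetical. It computes the estimate once by direct minimization of the sum of ρ and once by solving the estimating equation for ψ.

```python
def huber_rho(x, theta, k=1.345):
    """Huber's rho: quadratic near theta, linear in the tails."""
    u = x - theta
    return 0.5 * u * u if abs(u) <= k else k * abs(u) - 0.5 * k * k

def huber_psi(x, theta, k=1.345):
    """Derivative of huber_rho with respect to theta (a clipped theta - x)."""
    return max(-k, min(k, theta - x))

def rho_type_estimate(xs, tol=1e-9):
    """Minimize the convex objective sum(rho) by ternary search."""
    lo, hi = min(xs), max(xs)
    objective = lambda t: sum(huber_rho(x, t) for x in xs)
    while hi - lo > tol:
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if objective(m1) < objective(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

def psi_type_estimate(xs, tol=1e-12):
    """Solve sum(psi) = 0 by bisection; the sum is nondecreasing in theta."""
    lo, hi = min(xs), max(xs)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sum(huber_psi(x, mid) for x in xs) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

xs = [2.1, 1.9, 2.0, 2.3, 1.8, 2.2, 15.0]  # made-up data with one outlier
theta_rho = rho_type_estimate(xs)
theta_psi = psi_type_estimate(xs)
print(theta_rho, theta_psi, sum(xs) / len(xs))
```

Both routes give the same estimate up to numerical tolerance, and it stays close to the bulk of the data, whereas the sample mean is pulled toward the outlier. Root finding is used for the ψ route here because the sum of ψ is monotone for this particular choice; that is not true of every ψ.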
| |
| | |
| ===ρ-type===
| |
| | |
| For positive integer ''r'', let <math>(\mathcal{X},\Sigma)</math> and <math>(\Theta\subset\mathbb{R}^r,S)</math> be measurable spaces. <math>\theta\in\Theta</math> is a vector of parameters. An M-estimator of ρ-type ''T'' is defined through a [[measurable function]] <math>\rho:\mathcal{X}\times\Theta\rightarrow\mathbb{R}</math>. It maps a probability distribution ''F'' on <math>\mathcal{X}</math> to the value <math>T(F)\in\Theta</math> (if it exists) that minimizes
| |
| <math>\int_{\mathcal{X}}\rho(x,\theta)dF(x)</math>:
| |
| | |
| : <math>T(F):=\arg\min_{\theta\in\Theta}\int_{\mathcal{X}}\rho(x,\theta)dF(x)</math>
| |
| | |
| For example, for the [[maximum likelihood]] estimator, <math>\rho(x,\theta)=-\log(f(x,\theta))</math>, where <math>f(x,\theta)=\frac{\partial F(x,\theta)}{\partial x}</math>.
| |
| | |
| ===ψ-type===
| |
| If <math>\rho</math> is differentiable, the computation of <math>\widehat{\theta}</math> is usually much easier. An M-estimator of ψ-type ''T'' is defined through a measurable function <math>\psi:\mathcal{X}\times\Theta\rightarrow\mathbb{R}^r</math>. It maps a probability distribution ''F'' on <math>\mathcal{X}</math> to the value <math>T(F)\in\Theta</math> (if it exists) that solves the vector equation:
| |
| | |
| : <math>\int_{\mathcal{X}}\psi(x,\theta) \, dF(x)=0</math>
| |
| | |
| : <math>\int_{\mathcal{X}}\psi(x,T(F)) \, dF(x)=0</math>
| |
| | |
| For example, for the [[maximum likelihood]] estimator, <math>\psi(x,\theta)=\left(\frac{\partial\log(f(x,\theta))}{\partial \theta^1},\dots,\frac{\partial\log(f(x,\theta))}{\partial \theta^r}\right)^\mathrm{T}</math>, where <math>u^\mathrm{T}</math> denotes the transpose of vector ''u'' and <math>f(x,\theta)=\frac{\partial F(x,\theta)}{\partial x}</math>.
| |
| | |
| Such an estimator is not necessarily an M-estimator of ρ-type, but if ρ has a continuous first derivative with respect to <math>\theta</math>, then a necessary condition for an M-estimator of ψ-type to be an M-estimator of ρ-type is <math>\psi(x,\theta)=\nabla_\theta\rho(x,\theta)</math>. The previous definitions can easily be extended to finite samples.
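One way to make the finite-sample extension explicit (a sketch, writing <math>F_n</math> for the empirical distribution that puts mass 1/''n'' on each observation <math>x_1,\dots,x_n</math>; the symbol <math>F_n</math> is introduced here for illustration) is to substitute <math>F_n</math> for ''F'' in the two definitions:

:<math>T(F_n)=\arg\min_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n\rho(x_i,\theta), \qquad \frac{1}{n}\sum_{i=1}^n\psi(x_i,T(F_n))=0.</math>

Since the factor 1/''n'' does not change the minimizer or the root, these agree with the sample definitions given earlier in the article.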
| |
| | |
| If the function ψ decreases to zero as <math>x \rightarrow \pm \infty</math>, the estimator is called [[redescending M-estimator|redescending]]. Such estimators have some additional desirable properties, such as complete rejection of gross outliers.
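To illustrate the difference, the sketch below compares a monotone ψ (Huber's) with a redescending one (Tukey's biweight), both written as functions of the residual ''u'' = ''x'' − ''θ''. These two functions and the tuning constants 1.345 and 4.685 are conventional choices assumed here for illustration; they are not prescribed by the text above.

```python
def psi_huber(u, k=1.345):
    """Monotone psi: the residual clipped to [-k, k]."""
    return max(-k, min(k, u))

def psi_biweight(u, c=4.685):
    """Redescending psi (Tukey's biweight): exactly zero for |u| > c."""
    if abs(u) > c:
        return 0.0
    t = 1.0 - (u / c) ** 2
    return u * t * t

for u in (0.5, 2.0, 4.0, 10.0):
    print(u, psi_huber(u), psi_biweight(u))
```

For a residual of 10 the Huber ψ still contributes its maximum value ''k'' to the estimating equation, while the biweight ψ contributes nothing, which is what is meant by complete rejection of gross outliers.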
| |
| | |
| ==Computation==
| |
| For many choices of ρ or ψ, no closed form solution exists and an iterative approach to computation is required. It is possible to use standard function optimization algorithms, such as Newton–Raphson. However, in most cases an [[iteratively re-weighted least squares]] fitting algorithm can be used; this is typically the preferred method.
| |
| | |
| For some choices of ψ, specifically, ''[[Redescending M-estimator|redescending]]'' functions, the solution may not be unique. The issue is particularly relevant in multivariate and regression problems. Thus, some care is needed to ensure that good starting points are chosen. [[Robust statistics|Robust]] starting points, such as the [[median]] as an estimate of location and the [[median absolute deviation]] as a univariate estimate of scale, are common.
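A minimal sketch of iteratively re-weighted least squares for a location parameter is given below. It assumes Huber weights with tuning constant 1.345, starts from the median, and holds the scale fixed at the normalized median absolute deviation (the factor 1.4826 makes it consistent for normal data). The function name and data are hypothetical, and practical implementations may also update the scale.

```python
from statistics import median

def huber_location_irls(xs, k=1.345, tol=1e-10, max_iter=100):
    """Huber location estimate by iteratively re-weighted averaging."""
    theta = median(xs)                      # robust starting point
    mad = median([abs(x - theta) for x in xs])
    scale = 1.4826 * mad                    # fixed robust scale estimate
    if scale == 0:
        raise ValueError("MAD is zero; a different scale estimate is needed")
    cutoff = k * scale
    for _ in range(max_iter):
        # Weight 1 inside the cutoff, cutoff/|residual| outside it.
        weights = [1.0 if abs(x - theta) <= cutoff else cutoff / abs(x - theta)
                   for x in xs]
        new_theta = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta

xs = [2.1, 1.9, 2.0, 2.3, 1.8, 2.2, 15.0]  # made-up data with one outlier
est = huber_location_irls(xs)
print(est, sum(xs) / len(xs))
```

Each step is a weighted least-squares fit (here simply a weighted mean), and at convergence the weighted residuals sum to zero, which is the ψ-type estimating equation for the Huber ψ.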
| |
| | |
| ==Properties==
| |
| ===Distribution===
| |
| It can be shown that M-estimators are asymptotically normally distributed. As such, [[Wald-type test|Wald-type approaches]] to constructing confidence intervals and hypothesis tests can be used. However, since the theory is asymptotic, it will frequently be sensible to check the distribution, perhaps by examining the permutation or [[bootstrap (statistics)|bootstrap]] distribution.
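As a hedged illustration of such a check, the sketch below takes the sample mean (the simplest M-estimator) on simulated normal data and compares the Wald-type standard error with the standard deviation of a bootstrap distribution. The sample size, number of resamples and random seed are arbitrary assumptions.

```python
import random
from statistics import stdev

random.seed(0)
n = 200
xs = [random.gauss(0.0, 1.0) for _ in range(n)]  # simulated data

# Wald-type standard error of the mean from asymptotic normality.
wald_se = stdev(xs) / n ** 0.5

# Bootstrap: recompute the estimator on resamples drawn with replacement.
boot_estimates = []
for _ in range(2000):
    resample = random.choices(xs, k=n)
    boot_estimates.append(sum(resample) / n)
boot_se = stdev(boot_estimates)

print(wald_se, boot_se)
```

When the asymptotic approximation is adequate the two standard errors are close; a marked discrepancy, or a visibly skewed bootstrap distribution, would suggest caution with Wald-type intervals.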
| |
| | |
| ===Influence function===
| |
| The influence function of an M-estimator of <math>\psi</math>-type is proportional to its defining <math>\psi</math> function.
| |
| | |
| Let ''T'' be an M-estimator of ψ-type, and ''G'' be a probability distribution for which <math>T(G)</math> is defined.<!-- and let <math>x\in\mathcal{X}</math>.--> Its influence function IF is
| |
| <!--
| |
| what's the variable of integration in the denominator?
| |
| Asssuming it is "y"...
| |
| -->
| |
| :<math>\operatorname{IF}(x;T,G) = -\frac{\psi(x,T(G))}{\int\left[\frac{\partial\psi(y,\theta)}{\partial\theta}\right]_{\theta=T(G)} \mathrm{d}G(y)}</math>
| |
| | |
| A proof of this property of M-estimators can be found in Huber (1981, Section 3.2).
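As a worked instance (a sketch using the mean, whose ψ function is <math>\psi(x,\theta)=\theta-x</math>, and reading the denominator as the integral with respect to ''G'' of <math>\partial\psi/\partial\theta</math> evaluated at <math>\theta=T(G)</math>):

:<math>\frac{\partial\psi(y,\theta)}{\partial\theta}=1, \qquad \operatorname{IF}(x;T,G) = -\frac{T(G)-x}{\int 1\,\mathrm{d}G(y)} = x - T(G).</math>

The influence of a single observation on the mean is therefore unbounded in ''x'', whereas a bounded ψ gives a bounded influence function.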
| |
| <!--
| |
| Proof:
| |
| | |
| By definition, <math>\forall G\in\mbox{dom}(T),\int\psi(x,T(G))dG(x)=0</math>. Let
| |
| <math>c(0)=G</math> and <math>c'(0)=\Delta_x-G</math>, for example <math>c(t)=G+t(\Delta_x-G)</math>. Then
| |
| | |
| :<math>\forall t\in\mathcal{X},\int\psi(y,T(c(t)))d(c(t)(y))=0</math>
| |
| | |
| Differentiating yields
| |
| | |
| :<math>\forall
| |
| t\in\mathcal{X},\frac{\partial}{\partial t}\int\psi(y,T(c(t)))d(c(t)(y))=0</math>
| |
| | |
| We know that <math>dc(t)=td(\Delta_x-G)+dF</math>. Therefore,
| |
| | |
| :<math>\forall
| |
| t\in\mathcal{X},\frac{\partial}{\partial t}\int\psi(y,T(c(t)))td(\Delta_x-G)(y)+\frac{\partial}{\partial t}\int\psi(x,T(c(t)))dG(y)=0</math>
| |
| | |
| Supposing differentiation and integration can be interchanged,
| |
| | |
| :<math>\forall
| |
| t\in\mathcal{X},t\int\frac{\partial\psi(y,T(c(t)))}{\partial t}d(\Delta_x-G)(y)+\int\psi(y,T(c(t)))d(\Delta_x-G)(y)+\int\frac{\partial\psi(x,T(c(t)))}{\partial t}dG(x)=0</math>
| |
| | |
| As
| |
| <math>\int\psi(y,T(c(t)))d(\Delta_x-G)(y)</math>
| |
| <math>=\int\psi(y,T(c(t)))d(\Delta_x)(y)-\int\psi(y,T(c(t)))dG(y)=\psi(x,T(c(t)))-0</math>
| |
| | |
| we can write:
| |
| :<math>\forall
| |
| t\in\mathcal{X}, \psi(x,c(t))+t\int\frac{\partial\psi(y,T(c(t)))}{\partial t}d(\Delta_x-G)(y)+\int\frac{\partial\psi(x,T(c(t)))}{\partial t}dG(x)=0</math>
| |
| | |
| Now,
| |
| <math>\frac{\partial\psi(y,T(c(t)))}{\partial t}=\left[\frac{\partial\psi(x,\theta)}{\partial\theta}\right]_{T(c(t))}</math>
| |
| | |
| Therefore,
| |
| <math>\forall
| |
| t\in\mathcal{X},\psi(x,c(t))+t\int\frac{\partial\psi(y,T(c(t)))}{\partial t}d(\Delta_x-G)(y)+</math>
| |
| <math>\int\left[\frac{\partial\psi(x,\theta)}{\partial\theta}\right]_{T(c(t))}dG(x)\frac{\partial T(c(t))}{\partial t}=0</math>
| |
| | |
| As this equation is valid for all ''t'' in <math>\mathcal{X}</math>, we can take <math>t=0</math>:
| |
| | |
| <math>\psi(x,T(G))+\int\left[\frac{\partial\psi(x,\theta)}{\partial\theta}\right]_{T(G)}dG(x)\left[\frac{\partial T(c(t))}{\partial t}\right]_{t=0}=0</math>
| |
| | |
| By definition,
| |
| <math>\left[\frac{\partial T(c(t))}{\partial t}\right]_{t=0}=d_GT(\Delta_x-G)=IF(x;T,G)</math>, hence
| |
| | |
| <math>IF(x;T,G)=-\frac{\psi(x,T(G))}{\int\left[\frac{\partial\psi(y,\theta)}{\partial\theta}\right]}</math>
| |
| | |
| which completes the proof.-->
| |
| | |
| ==Applications==
| |
| M-estimators can be constructed for location parameters and scale parameters in univariate and multivariate settings, as well as being used in robust regression.
| |
| | |
| ==Examples==
| |
| ===Mean===
| |
| Let (''X''<sub>1</sub>, ..., ''X''<sub>''n''</sub>) be a set of [[iid|independent, identically distributed]] random variables, with distribution ''F''.
| |
| | |
| If we define
| |
| | |
| :<math>\rho(x, \theta)=\frac{(x - \theta)^2}{2},\,\!</math>
| |
| | |
| we note that <math>\sum_{i=1}^n\rho(X_i,\theta)</math> is minimized when ''θ'' is the [[arithmetic mean|mean]] of the ''X''s. Thus the mean is an M-estimator of ρ-type, with this ρ function.
| |
| | |
| As this ρ function is continuously differentiable in ''θ'', the mean is thus also an M-estimator of ψ-type for ψ(''x'', ''θ'') = ''θ'' − ''x''.
| |
| | |
| ==See also==
| |
| *[[Robust statistics]]
| |
| *[[Robust regression]]
| |
| *[[Redescending M-estimator]]
| |
| | |
| ==References==
| |
| {{Reflist}}
| |
| | |
| ==Further reading==
| |
| * {{cite book | last = Andersen | first = Robert | title = Modern Methods for Robust Regression | publisher = Sage Publications | location = Los Angeles, CA | series = Quantitative Applications in the Social Sciences | volume = 152| year = 2008 | isbn = 978-1-4129-4072-6}}
| |
| * {{cite book | last = Godambe | first = V. P. | title = Estimating functions | publisher = Clarendon Press | location=New York | series= Oxford Statistical Science Series | volume = 7 | year = 1991 | isbn = 978-0-19-852228-7 }}
| |
| * {{cite book | last = Heyde | first = Christopher C. | title = Quasi-likelihood and its application: A general approach to optimal parameter estimation | publisher = Springer | location=New York | year = 1997 | series= Springer Series in Statistics | isbn = 978-0-387-98225-0 | doi = 10.1007/b98823 }}
| |
| * {{cite book | last = Huber | first = Peter J. | title = Robust Statistics | edition= 2nd | publisher = John Wiley & Sons Inc. | location = Hoboken, NJ | year = 2009 | isbn = 978-0-470-12990-6 }}
| |
| * {{cite book | last = Hoaglin | first = David C. | coauthors = Frederick Mosteller and John W. Tukey | title = Understanding Robust and Exploratory Data Analysis | publisher = John Wiley & Sons Inc. | location = Hoboken, NJ | year = 1983 | isbn = 0-471-09777-2 }}
| |
| * {{cite book | last = McLeish | first = D.L. | coauthors = Christopher G. Small | title = The theory and applications of statistical inference functions | publisher = Springer | location=New York | year = 1989 | series= Lecture Notes in Statistics | volume = 44 | isbn = 978-0-387-96720-2 }}
| |
| * {{cite book | last = Mukhopadhyay | first = Parimal | title = An Introduction to Estimating Functions | publisher = Alpha Science International, Ltd | location = Harrow, UK | year = 2004 | isbn = 978-1-84265-163-6 }}
| |
| *{{Citation | last1=Press | first1=WH | last2=Teukolsky | first2=SA | last3=Vetterling | first3=WT | last4=Flannery | first4=BP | year=2007 | title=Numerical Recipes: The Art of Scientific Computing | edition=3rd | publisher=Cambridge University Press | publication-place=New York | isbn=978-0-521-88068-8 | chapter=Section 15.7. Robust Estimation | chapter-url=http://apps.nrbook.com/empanel/index.html#pg=818}}
| |
| * {{cite book | last = Serfling | first = Robert J. | title = Approximation theorems of mathematical statistics | publisher = John Wiley & Sons Inc. | location = Hoboken, NJ | year = 2002 | series = Wiley Series in Probability and Mathematical Statistics | isbn = 978-0-471-21927-9 }}
| |
| * {{cite journal | last=Shapiro | first=Alexander | title=On the asymptotics of constrained local ''M''-estimators | journal= Annals of Statistics | volume=28 | issue=3 | year=2000 |pages=948–960 | doi=10.1214/aos/1015952006 | mr=1792795 | jstor = 2674061}}
| |
| * {{cite book | last = Small | first = Christopher G. | coauthors = Jinfang Wang | title = Numerical methods for nonlinear estimating equations | publisher = Oxford University Press | location = New York | year = 2003 | series= Oxford Statistical Science Series | volume = 29 | isbn = 978-0-19-850688-1 }}
| |
| * {{cite book | last = van de Geer | first = Sara A. | title = Empirical Processes in M-estimation: Applications of empirical process theory | publisher = Cambridge University Press | location = Cambridge, UK | year = 2000 | series = Cambridge Series in Statistical and Probabilistic Mathematics | volume = 6 | isbn = 978-0-521-65002-1 | doi = 10.2277/052165002X }}
| |
| * {{cite book | last = Wilcox | first = R. R. | title = Applying contemporary statistical techniques | publisher = Academic Press | location = San Diego, CA | year = 2003 | pages = 55–79 }}
| |
| * {{cite book | last = Wilcox | first = R. R. | title = Introduction to Robust Estimation and Hypothesis Testing | edition = 3rd | publisher = Academic Press | location = San Diego, CA | year = 2012 }}
| |
| | |
| ==External links==
| |
| * [http://research.microsoft.com/en-us/um/people/zhang/INRIA/Publis/Tutorial-Estim/node24.html#SECTION000104000000000000000 M-estimators] — an introduction to the subject by Zhengyou Zhang
| |
| * [http://www.dinamistics.com/post/2012/M-estimators/ M-estimators] - an interactive demonstration of Huber's M-estimator
| |
| | |
| {{DEFAULTSORT:M-Estimator}}
| |
| [[Category:M-estimators]]
| |
| [[Category:Estimation theory]]
| |
| [[Category:Robust regression]]
| |
| [[Category:Robust statistics]]
| |