Richard Schroeppel: Difference between revisions

{{About|the statistical properties of unweighted linear regression analysis|more general regression analysis|regression analysis|linear regression on a single variable|simple linear regression|the computation of least squares curve fits|numerical methods for linear least squares}}
{{Regression bar}}
 
[[File:Okuns law quarterly differences.svg|300px|thumb|[[Okun's law]] in [[macroeconomics]] states that in an economy the GDP growth should depend linearly on the changes in the unemployment rate. Here the '''ordinary least squares''' method is used to construct the regression line describing this law.]]
In [[statistics]], '''ordinary least squares (OLS)''' or '''linear least squares''' is a method for estimating the unknown parameters in a [[linear regression model]]. This method minimizes the sum of squared vertical distances between the observed responses in the [[dataset]] and the responses predicted by the linear approximation. The resulting [[statistical estimation|estimator]] can be expressed by a simple formula, especially in the case of a [[simple linear regression|single regressor]] on the right-hand side.
 
The OLS estimator is [[consistent estimator|consistent]] when the regressors are [[exogenous]] and there is no perfect [[multicollinearity]], and optimal in the class of linear unbiased estimators when the [[statistical error|error]]s are [[homoscedastic]] and [[autocorrelation|serially uncorrelated]]. Under these conditions, the method of OLS provides [[UMVU|minimum-variance mean-unbiased]] estimation when the errors have finite variances. Under the additional assumption that the errors be [[normal distribution|normally distributed]], OLS is the [[maximum likelihood estimator]]. OLS is used in economics ([[econometrics]]), political science and electrical engineering ([[control theory]] and [[signal processing]]), among many areas of application.
 
== Linear model ==
{{main|Linear regression model}}
 
Suppose the data consists of ''n'' [[statistical unit|observations]] {&thinsp;''y{{su|b=i}},&thinsp;x{{su|b=i}}''&thinsp;}{{su|p=''n''|b=''i''=1}}. Each observation includes a scalar response ''y<sub>i</sub>'' and a vector of ''p'' predictors (or regressors) ''x<sub>i</sub>''. In a [[linear regression model]] the response variable is a linear function of the regressors:
: <math>
    y_i = x'_i\beta + \varepsilon_i, \,
  </math>
where ''β'' is a ''p×''1 vector of unknown parameters; ''ε<sub>i</sub>'''s are unobserved scalar random variables ([[errors and residuals in statistics|errors]]) which account for the discrepancy between the actually observed responses ''y<sub>i</sub>'' and the "predicted outcomes" ''x′<sub style="position:relative;left:-.2em">i</sub>β''; and ′ denotes [[matrix transpose]], so that {{nowrap|''x′&thinsp;β''}} is the [[dot product]] between the vectors ''x'' and ''β''. This model can also be written in matrix notation as
: <math>
    y = X\beta + \varepsilon, \,
  </math>
where ''y'' and ''ε'' are ''n×''1 vectors, and ''X'' is an ''n×p'' matrix of regressors, which is also sometimes called the [[design matrix]].
 
As a rule, the constant term is always included in the set of regressors ''X'', say, by taking ''x''<sub>''i''1</sub>&nbsp;=&nbsp;1 for all {{nowrap|1=''i'' = 1, …, ''n''}}. The coefficient ''β''<sub>1</sub> corresponding to this regressor is called the ''intercept''.
 
There may be some relationship between the regressors. For instance, the third regressor may be the square of the second regressor. In this case (assuming that the first regressor is constant) we have a quadratic model in the second regressor. But this is still considered a linear model because it is linear in the ''β''s.
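
As a minimal sketch of this point (in Python with NumPy; the data and variable names are made up for illustration), the design matrix below contains a constant, a regressor, and the square of that regressor, and the coefficients are still obtained from a linear least-squares solve:

```python
import numpy as np

# Hypothetical noise-free data generated from y = 1 + 2*x + 3*x**2.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x + 3.0 * x**2

# Columns: constant, second regressor, third regressor = square of the second.
X = np.column_stack([np.ones_like(x), x, x**2])

# The model is quadratic in x but linear in beta, so ordinary
# linear least squares applies unchanged.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to [1, 2, 3] up to rounding
```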
 
=== Assumptions ===
There are several different frameworks in which the [[linear regression model]] can be cast in order to make the OLS technique applicable. Each of these settings produces the same formulas and same results. The only difference is the interpretation and the assumptions which have to be imposed in order for the method to give meaningful results. The choice of the applicable framework depends mostly on the nature of data in hand, and on the inference task which has to be performed.
 
One of the lines of difference in interpretation is whether to treat the regressors as random variables, or as predefined constants. In the first case ('''random design''') the regressors ''x<sub>i</sub>'' are random and sampled together with the ''y<sub>i</sub>''&#39;s from some [[statistical population|population]], as in an [[observational study]]. This approach allows for a more natural study of the [[asymptotic theory (statistics)|asymptotic properties]] of the estimators. In the other interpretation ('''fixed design'''), the regressors ''X'' are treated as known constants set by a [[design of experiments|design]], and ''y'' is sampled conditionally on the values of ''X'' as in an [[experiment]]. For practical purposes, this distinction is often unimportant, since estimation and inference are carried out while conditioning on ''X''. All results stated in this article are within the random design framework.
 
'''The primary assumption of OLS is that there are zero or negligible errors in the independent variable, since this method only attempts to minimise the mean squared error in the dependent variable.'''
 
==== Classical linear regression model ====
The classical model focuses on the "finite sample" estimation and inference, meaning that the number of observations ''n'' is fixed. This contrasts with the other approaches, which study the [[asymptotic theory (statistics)|asymptotic behavior]] of OLS, and in which the number of observations is allowed to grow to infinity.
 
<ul>
<li> '''Correct specification'''. The linear functional form is correctly specified.
 
<li> '''Strict exogeneity'''. The errors in the regression should have [[conditional expectation|conditional mean]] zero:<ref>{{harvtxt|Hayashi|2000|loc=page 7}}</ref>
: <math>
    \operatorname{E}[\,\varepsilon|X\,] = 0.
  </math>
The immediate consequence of the exogeneity assumption is that the errors have mean zero: {{nowrap|1=E[''ε''] = 0}}, and that the regressors are uncorrelated with the errors: {{nowrap|1=E[''X′ε''] = 0}}.<br>
The exogeneity assumption is critical for the OLS theory. If it holds then the regressor variables are called ''exogenous''. If it doesn't, then those regressors that are correlated with the error term are called ''endogenous'',<ref>{{harvtxt|Hayashi|2000|loc=page 187}}</ref> and then the OLS estimates become invalid. In such case the [[instrumental variable|method of instrumental variables]] may be used to carry out inference.
 
<li> '''No linear dependence'''. The regressors in ''X'' must all be [[linearly independent]]. Mathematically it means that the matrix ''X'' must have full [[column rank]] almost surely:<ref name="Hayashi 2000 loc=page 10">{{harvtxt|Hayashi|2000|loc=page 10}}</ref>
: <math>
    \Pr\!\big[\,\operatorname{rank}(X) = p\,\big] = 1
  </math>
Usually, it is also assumed that the regressors have finite moments up to at least second. In such case the matrix {{nowrap|1=''Q<sub>xx</sub>'' = E[''X′X&thinsp;/&thinsp;n'']}} will be finite and positive semi-definite.<br>
When this assumption is violated the regressors are called linearly dependent or [[multicollinearity|perfectly multicollinear]]. In such case the value of the regression coefficient ''β'' cannot be learned, although prediction of ''y'' values is still possible for new values of the regressors that lie in the same linearly dependent subspace.
 
<li> '''Spherical errors''':<ref name="Hayashi 2000 loc=page 10"/>
: <math>
    \operatorname{Var}[\,\varepsilon|X\,] = \sigma^2 I_n,
  </math>
where ''I<sub>n</sub>'' is an ''n×n'' [[identity matrix]], and ''σ''<sup>2</sup> is a parameter which determines the variance of each observation. This ''σ''<sup>2</sup> is considered a [[nuisance parameter]] in the model, although usually it is also estimated. If this assumption is violated then the OLS estimates are still valid, but no longer efficient.<br>
It is customary to split this assumption into two parts:
* '''[[Homoscedasticity]]''': {{nowrap|1=E[&thinsp;''ε<sub>i</sub>''<sup>2</sup>&thinsp;{{!}}&thinsp;''X''&thinsp;] = ''σ''<sup>2</sup>}}, which means that the error term has the same variance ''σ''<sup>2</sup> in each observation. A violation of this requirement is called [[heteroscedasticity]]; in such a case a more efficient estimator would be [[weighted least squares]]. If the errors have infinite variance then the OLS estimates will also have infinite variance (although by the [[law of large numbers]] they will nonetheless tend toward the true values so long as the errors have zero mean). In this case, [[robust regression|robust estimation]] techniques are recommended.
* '''Nonautocorrelation''': the errors are [[correlation|uncorrelated]] between observations: {{nowrap|1=E[&thinsp;''ε<sub>i</sub>ε<sub>j</sub>''&thinsp;{{!}}&thinsp;''X''&thinsp;] = 0}} for {{nowrap|''i'' ≠ ''j''}}. This assumption may be violated in the context of [[time series]] data, [[panel data]], cluster samples, hierarchical data, repeated measures data, longitudinal data, and other data with dependencies. In such cases [[generalized least squares]] provides a better alternative than the OLS.
 
<li> '''Normality'''. It is sometimes additionally assumed that the errors have [[multivariate normal distribution|normal distribution]] conditional on the regressors:<ref>{{harvtxt|Hayashi|2000|loc=page 34}}</ref>
: <math>
    \varepsilon\,|\,X\ \sim\ \mathcal{N}(0,\, \sigma^2I_n).
  </math>
This assumption is not needed for the validity of the OLS method, although certain additional finite-sample properties can be established when it holds (especially in the area of hypothesis testing). Also, when the errors are normal, the OLS estimator is equivalent to the [[maximum likelihood estimator]] (MLE), and therefore it is asymptotically efficient in the class of all [[regular estimator]]s.
</ul>
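
The "no linear dependence" assumption above can be checked numerically. The following is a minimal sketch in Python with NumPy; the helper name <code>has_full_column_rank</code> and the toy matrices are hypothetical, and the rank is computed with NumPy's default numerical tolerance rather than exactly.

```python
import numpy as np

# Hypothetical design matrices. In X_bad the third column is exactly twice
# the second, so the columns are linearly dependent.
x = np.array([1.0, 2.0, 3.0, 4.0])
X_bad = np.column_stack([np.ones_like(x), x, 2.0 * x])
X_ok = np.column_stack([np.ones_like(x), x, x**2])

def has_full_column_rank(X):
    """Return True when rank(X) equals the number of columns p."""
    return np.linalg.matrix_rank(X) == X.shape[1]

print(has_full_column_rank(X_bad))  # False: perfectly multicollinear
print(has_full_column_rank(X_ok))   # True
```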
 
==== Independent and identically distributed ====
In some applications, especially with [[cross-sectional data]], an additional assumption is imposed — that all observations are [[independent and identically distributed]] (iid). This means that all observations are taken from a [[random sample]] which makes all the assumptions listed earlier simpler and easier to interpret. Also this framework allows one to state asymptotic results (as the sample size {{nowrap|''n''&thinsp;→&thinsp;∞}}), which are understood as a theoretical possibility of fetching new independent observations from the [[data generating process]]. The list of assumptions in this case is:
* '''iid observations''': (''x<sub>i</sub>'', ''y<sub>i</sub>'') is [[independent random variables|independent]] from, and has the same [[Probability distribution|distribution]] as, (''x<sub>j</sub>'', ''y<sub>j</sub>'') for all {{nowrap|''i ≠ j''}};
* '''no perfect multicollinearity''': ''Q<sub>xx</sub>'' = E[&thinsp;''x<sub>i</sub>x′<sub style="position:relative;left:-.2em">i</sub>''&thinsp;] is a [[positive-definite matrix]];
* '''exogeneity''': E[&thinsp;''ε<sub>i</sub>''&thinsp;|&thinsp;''x<sub>i</sub>''&thinsp;] = 0;
* '''homoscedasticity''': Var[&thinsp;''ε<sub>i</sub>''&thinsp;|&thinsp;''x<sub>i</sub>''&thinsp;] = ''σ''<sup>2</sup>.
 
==== Time series model ====
* The [[stochastic process]] {''x<sub>i</sub>'', ''y<sub>i</sub>''} is [[stationary process|stationary]] and [[ergodic process|ergodic]];
* The regressors are ''predetermined'': E[''x<sub>i</sub>ε<sub>i</sub>''] = 0 for all ''i'' = 1, …, ''n'';
* The ''p×p'' matrix ''Q<sub>xx</sub>'' = E[&thinsp;''x<sub>i</sub>x′<sub style="position:relative;left:-.2em">i</sub>''&thinsp;] is of full rank, and hence [[Positive-definite matrix|positive-definite]];
* {''x<sub>i</sub>ε<sub>i</sub>''} is a [[martingale difference sequence]], with a finite matrix of second moments ''Q<sub>xxε²</sub>'' = E[&thinsp;''ε<sub>i</sub><sup>2</sup>x<sub>i</sub>x′<sub style="position:relative;left:-.2em">i</sub>''&thinsp;].
 
== Estimation ==
Suppose ''b'' is a "candidate" value for the parameter ''β''. The quantity {{nowrap|''y<sub>i</sub>'' − ''x<sub>i</sub>''′''b''}} is called the '''[[errors and residuals in statistics|residual]]''' for the ''i''-th observation; it measures the vertical distance between the data point {{nowrap|(''x<sub>i</sub>'', ''y<sub>i</sub>'')}} and the hyperplane {{nowrap|1=''y = x′b''}}, and thus assesses the degree of fit between the actual data and the model. The '''sum of squared residuals (SSR)''' (also called the '''error sum of squares (ESS)''' or '''residual sum of squares (RSS)''')<ref>{{harvtxt|Hayashi|2000|loc=page 15}}</ref> is a measure of the overall model fit:
: <math>
    S(b) = \sum_{i=1}^n (y_i - x'_ib)^2 = (y-Xb)^T(y-Xb),
  </math>
where ''T'' denotes the matrix [[transpose]].  The value of ''b'' which minimizes this sum is called the '''OLS estimator for ''β'''''. The function ''S''(''b'') is quadratic in ''b'' with positive-definite [[Hessian matrix|Hessian]], and therefore this function possesses a unique global minimum at <math>b =\hat\beta</math>, which can be given by the explicit formula:<ref>{{harvtxt|Hayashi|2000|loc=page 18}}</ref><sup>[[Proofs involving ordinary least squares#Least_squares_estimator_for_.CE.B2|[proof]]]</sup>
: <math>
    \hat\beta = {\rm arg}\min_{b\in\mathbb{R}^p} S(b) =  \bigg(\frac{1}{n}\sum_{i=1}^n x_ix'_i\bigg)^{\!-1} \!\!\cdot\, \frac{1}{n}\sum_{i=1}^n x_iy_i
  </math>
 
or equivalently in matrix form,
 
:<math>\hat\beta = (X^TX)^{-1}X^Ty\ . </math>
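
The matrix formula can be evaluated directly. The sketch below (Python with NumPy, on made-up toy data) solves the normal equations ''X<sup>T</sup>Xb'' = ''X<sup>T</sup>y'' instead of forming the inverse explicitly, which is one common numerical choice rather than the only one, and cross-checks the result against NumPy's own least-squares routine.

```python
import numpy as np

# Hypothetical toy data: n = 5 observations, a constant plus one regressor.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
X = np.column_stack([np.ones_like(x), x])   # n x p design matrix, p = 2

# Solve the normal equations (X'X) b = X'y for b.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)  # intercept and slope
```

For these data the intercept is about 0.05 and the slope about 1.99.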
 
After we have estimated ''β'', the '''fitted values''' (or '''predicted values''') from the regression will be
: <math>
    \hat{y} = X\hat\beta = Py,
  </math>
where ''P'' = ''X''(''X<sup>T</sup>X'')<sup>−1</sup>''X<sup>T</sup>'' is the [[projection matrix]] onto the space spanned by the columns of ''X''. This matrix ''P'' is also sometimes called the [[hat matrix]] because it "puts a hat" onto the variable ''y''. Another matrix, closely related to ''P'', is the ''annihilator'' matrix {{nowrap|1=''M'' = ''I<sub>n</sub>'' − ''P''}}, which is a projection matrix onto the space orthogonal to the columns of ''X''. Both matrices ''P'' and ''M'' are [[symmetric matrix|symmetric]] and [[idempotent matrix|idempotent]] (meaning that {{nowrap|1=''P''<sup>2</sup> = ''P''}}), and relate to the data matrix ''X'' via identities {{nowrap|1=''PX = X''}} and {{nowrap|1=''MX'' = 0}}.<ref name="Hayashi 2000 loc=page 19">{{harvtxt|Hayashi|2000|loc=page 19}}</ref> Matrix ''M'' creates the '''residuals''' from the regression:
: <math>
    \hat\varepsilon = y - X\hat\beta = My = M\varepsilon.
  </math>
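
The stated properties of ''P'' and ''M'' can be verified numerically. This is a minimal sketch in Python with NumPy on made-up data; forming the full ''n×n'' matrices is done here only for illustration and would be wasteful for large ''n''.

```python
import numpy as np

# Hypothetical toy data: a constant plus one regressor.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
X = np.column_stack([np.ones_like(x), x])
n = X.shape[0]

P = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix X (X'X)^{-1} X'
M = np.eye(n) - P                       # annihilator matrix

y_hat = P @ y    # fitted values
resid = M @ y    # residuals

checks = {
    "P symmetric": np.allclose(P, P.T),
    "P idempotent": np.allclose(P @ P, P),
    "M idempotent": np.allclose(M @ M, M),
    "PX = X": np.allclose(P @ X, X),
    "MX = 0": np.allclose(M @ X, 0.0),
    "fitted + residual = y": np.allclose(y_hat + resid, y),
}
print(checks)
```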
 
Using these residuals we can estimate the value of ''σ''<sup>2</sup>:
: <math>
    s^2 = \frac{\hat\varepsilon'\hat\varepsilon}{n-p} = \frac{y'My}{n-p} = \frac{S(\hat\beta)}{n-p},\qquad
    \hat\sigma^2 = \frac{n-p}{n}\;s^2
  </math>
The denominator, ''n''−''p'', is the [[Degrees of freedom (statistics)|statistical degrees of freedom]]. The first quantity, ''s''<sup>2</sup>, is the OLS estimate for ''σ''<sup>2</sup>, whereas the second, <math style="vertical-align:0">\scriptstyle\hat\sigma^2</math>, is the MLE estimate for ''σ''<sup>2</sup>. The two estimators are quite similar in large samples; the first one is always [[estimator bias|unbiased]], while the second is biased but can have a smaller [[mean squared error]]. In practice ''s''<sup>2</sup> is used more often, since it is more convenient for hypothesis testing. The square root of ''s''<sup>2</sup> is called the '''standard error of the regression (SER)''', or '''standard error of the equation (SEE)'''.<ref name="Hayashi 2000 loc=page 19"/>
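
The two variance estimates differ only in the divisor applied to the sum of squared residuals. A minimal sketch in Python with NumPy (toy data and variable names are illustrative assumptions):

```python
import numpy as np

# Hypothetical toy data: a constant plus one regressor.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
ssr = resid @ resid        # S(beta_hat), the sum of squared residuals

s2 = ssr / (n - p)         # OLS estimate of sigma^2 (divides by n - p)
sigma2_mle = ssr / n       # ML estimate, equal to (n - p) / n * s2
ser = np.sqrt(s2)          # standard error of the regression

print(s2, sigma2_mle, ser)
```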
 
It is common to assess the goodness-of-fit of the OLS regression by comparing how much the initial variation in the sample can be reduced by regressing onto ''X''. The '''[[coefficient of determination]] ''R''<sup>2</sup>''' is defined as a ratio of "explained" variance to the "total" variance of the dependent variable ''y'':<ref>{{harvtxt|Hayashi|2000|loc=page 20}}</ref>
: <math>
    R^2 = \frac{\sum(\hat y_i-\overline{y})^2}{\sum(y_i-\overline{y})^2} = \frac{y'P'LPy}{y'Ly} = 1 - \frac{y'My}{y'Ly} = 1 - \frac{\rm SSR}{\rm TSS}
  </math>
where TSS is the '''total sum of squares''' for the dependent variable, ''L'' = ''I<sub>n</sub>'' − '''11'''′/''n'', and '''1''' is an ''n''×1 vector of ones. (''L'' is a "centering matrix" which is equivalent to regression on a constant; it simply subtracts the mean from a variable.) In order for ''R''<sup>2</sup> to be meaningful, the matrix ''X'' of data on regressors must contain a column vector of ones to represent the constant whose coefficient is the regression intercept. In that case, ''R''<sup>2</sup> will always be a number between 0 and 1, with values close to 1 indicating a good degree of fit.
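
The equality of the "explained over total" and "one minus residual over total" forms of ''R''<sup>2</sup> holds when the regression includes a constant. The sketch below (Python with NumPy, made-up data) computes both forms so they can be compared.

```python
import numpy as np

# Hypothetical toy data; the first column of X is the constant.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
X = np.column_stack([np.ones_like(x), x])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
y_bar = y.mean()

tss = np.sum((y - y_bar) ** 2)        # total sum of squares
ssr = np.sum((y - y_hat) ** 2)        # sum of squared residuals
explained = np.sum((y_hat - y_bar) ** 2)

r2_explained = explained / tss
r2_residual = 1.0 - ssr / tss

print(r2_explained, r2_residual)      # the two forms agree
```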
 
=== Simple regression model ===
{{main|Simple linear regression}}
If the data matrix ''X'' contains only two variables, a constant and a scalar regressor ''x<sub>i</sub>'', then this is called the "simple regression model".<ref>{{harvtxt|Hayashi|2000|loc=page 5}}</ref> This case is often considered in introductory statistics classes, as it provides much simpler formulas that are even suitable for manual calculation. The vector of parameters in such a model is 2-dimensional, and is commonly denoted as {{nowrap|(''α'', ''β'')}}:
: <math>
    y_i = \alpha + \beta x_i + \varepsilon_i.
  </math>
The least squares estimates in this case are given by simple formulas
: <math>
    \hat\beta = \frac{ \sum{x_iy_i} - \frac{1}{n}\sum{x_i}\sum{y_i} }
                    { \sum{x_i^2} - \frac{1}{n}(\sum{x_i})^2 } = \frac{ \mathrm{Cov}[x,y] }{ \mathrm{Var}[x] } , \quad
    \hat\alpha = \overline{y} - \hat\beta\,\overline{x}\ .
  </math>
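
These formulas need only running sums, so they can be sketched in plain Python without any matrix library. The function name <code>simple_ols</code> and the data are hypothetical.

```python
def simple_ols(xs, ys):
    """Return (alpha_hat, beta_hat) for the model y = alpha + beta * x + error."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    # Slope: (sum x*y - sum x * sum y / n) / (sum x^2 - (sum x)^2 / n)
    beta_hat = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
    # Intercept: mean of y minus slope times mean of x
    alpha_hat = sum_y / n - beta_hat * sum_x / n
    return alpha_hat, beta_hat


# Hypothetical toy data.
alpha_hat, beta_hat = simple_ols([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(alpha_hat, beta_hat)
```

For these data the slope is about 1.99 and the intercept about 0.05. The denominator is zero when all ''x<sub>i</sub>'' are equal, which is the simple-regression form of perfect multicollinearity.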
 
== Alternative derivations ==
In the previous section the least squares estimator <math style="vertical-align:-.3em">\scriptstyle\hat\beta</math> was obtained as a value that minimizes the sum of squared residuals of the model. However it is also possible to derive the same estimator from other approaches. In all cases the formula for OLS estimator remains the same: {{nowrap|1=''<sup style="position:relative;left:.6em;top:-.2em">^</sup>β'' = (''X′X'')<sup>−1</sup>''X′y''}}, the only difference is in how we interpret this result.
 
=== Geometric approach ===
[[File:OLS geometric interpretation.svg|thumb|250px|OLS estimation can be viewed as a projection onto the linear space spanned by the regressors.]]
{{main|Linear least squares (mathematics)}}
For mathematicians, OLS is an approximate solution to an overdetermined system of linear equations {{nowrap|''Xβ'' ≈ ''y''}}, where ''β'' is the unknown. Assuming the system cannot be solved exactly (the number of equations ''n'' is much larger than the number of unknowns ''p''), we are looking for a solution that provides the smallest discrepancy between the right- and left-hand sides. In other words, we are looking for the solution that satisfies
: <math>
    \hat\beta = {\rm arg}\min_\beta\,\lVert y - X\beta \rVert,
  </math>
where ||·|| is the standard [[Norm (mathematics)#Euclidean norm|''L''<sup>2</sup>&nbsp;norm]] in the ''n''-dimensional [[Euclidean space]] '''R'''<sup>''n''</sup>. The predicted quantity ''Xβ'' is just a certain linear combination of the vectors of regressors. Thus, the residual vector {{nowrap|''y − Xβ''}} will have the smallest length when ''y'' is [[projection (linear algebra)|projected orthogonally]] onto the [[linear subspace]] [[linear span|spanned]] by the columns of ''X''. The OLS estimator <math style="vertical-align:-.3em">\scriptstyle\hat\beta</math> in this case can be interpreted as the coefficients of [[vector decomposition]] of {{nowrap|1=''<sup style="position:relative;left:.5em;">^</sup>y = Py''}} along the basis of ''X''.
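
As a short worked step using only the notation above: the residual of an orthogonal projection is perpendicular to every column of ''X'', and writing this condition out gives the normal equations and hence the usual formula,
: <math>
    X^T(y - X\hat\beta) = 0 \quad\Longrightarrow\quad X^TX\hat\beta = X^Ty \quad\Longrightarrow\quad \hat\beta = (X^TX)^{-1}X^Ty,
  </math>
where the last step assumes that ''X'' has full column rank, so that ''X<sup>T</sup>X'' is invertible.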
 
=== Maximum likelihood ===
The OLS estimator is identical to the [[maximum likelihood estimator]] (MLE) under the normality assumption for the error terms.<ref>{{harvtxt|Hayashi|2000|loc=page 49}}</ref><sup>[[Proofs involving ordinary least squares#Maximum_likelihood_approach|[proof]]]</sup> This normality assumption has historical importance, as it provided the basis for the early work in linear regression analysis by [[Udny Yule|Yule]] and [[Karl Pearson|Pearson]].{{Citation needed|date=February 2010}} From the properties of MLE, we can infer that the OLS estimator is asymptotically efficient (in the sense of attaining the [[Cramér-Rao bound]] for variance) if the normality assumption is satisfied.<ref name="Hayashi 2000 loc=page 52">{{harvtxt|Hayashi|2000|loc=page 52}}</ref>
 
=== Generalized method of moments ===
In [[iid]] case the OLS estimator can also be viewed as a [[Generalized method of moments|GMM]] estimator arising from the moment conditions
: <math>
    \mathrm{E}\big[\, x_i(y_i - x_i'\beta) \,\big] = 0.
  </math>
These moment conditions state that the regressors should be uncorrelated with the errors. Since ''x<sub>i</sub>'' is a ''p''-vector, the number of moment conditions is equal to the dimension of the parameter vector ''β'', and thus the system is exactly identified. This is the so-called classical GMM case, when the estimator does not depend on the choice of the weighting matrix.
 
Note that the original strict exogeneity assumption {{nowrap|E[''ε<sub>i</sub>''&thinsp;{{!}}&thinsp;''x<sub>i</sub>''] {{=}} 0}} implies a far richer set of moment conditions than stated above. In particular, this assumption implies that for any vector-function ''ƒ'', the moment condition {{nowrap|E[''ƒ''(''x<sub>i</sub>'')·''ε<sub>i</sub>''] {{=}} 0}} will hold. However it can be shown using the [[Gauss–Markov theorem]] that the optimal choice of function ''ƒ'' is to take {{nowrap|''ƒ''(''x'') {{=}} ''x''}}, which results in the moment equation posted above.
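
As a sketch of why these moment conditions reproduce the OLS formula, replace the expectation by the sample average and solve for the candidate value ''b'':
: <math>
    \frac{1}{n}\sum_{i=1}^n x_i(y_i - x_i'b) = 0 \quad\Longrightarrow\quad b = \bigg(\frac{1}{n}\sum_{i=1}^n x_ix'_i\bigg)^{\!-1} \frac{1}{n}\sum_{i=1}^n x_iy_i = \hat\beta ,
  </math>
provided the sample matrix of second moments of the regressors is invertible.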
 
== Finite sample properties ==
First of all, under the ''strict exogeneity'' assumption the OLS estimators <math style="vertical-align:-.3em">\scriptstyle\hat\beta</math> and ''s''<sup>2</sup> are [[Bias of an estimator|unbiased]], meaning that their expected values coincide with the true values of the parameters:<ref>{{harvtxt|Hayashi|2000|loc=pages 27, 30}}</ref><sup>[[Proofs involving ordinary least squares#Unbiasedness_of_.CE.B2.CC.82|[proof]]]</sup>
: <math>
    \operatorname{E}[\, \hat\beta \,| X \,] = \beta, \quad \operatorname{E}[\,s^2\,|X\,] = \sigma^2.
  </math>
If the strict exogeneity does not hold (as is the case with many [[time series]] models, where exogeneity is assumed only with respect to the past shocks but not the future ones), then these estimators will be biased in finite samples.
 
The [[variance-covariance matrix]] of <math style="vertical-align:-.3em">\scriptstyle\hat\beta</math> is equal to <ref name="HayashiFSP">{{harvtxt|Hayashi|2000|loc=page 27}}</ref>
: <math>
    \operatorname{Var}[\, \hat\beta \,| X \,] = \sigma^2(X'X)^{-1}.
  </math>
In particular, the standard error of each coefficient <math style="vertical-align:-.4em">\scriptstyle\hat\beta_j</math> is equal to square root of the ''j''-th diagonal element of this matrix. The estimate of this standard error is obtained by replacing the unknown quantity ''σ''<sup>2</sup> with its estimate ''s''<sup>2</sup>. Thus,
: <math>
    \widehat{\operatorname{s.\!e}}(\hat{\beta}_j) = \sqrt{s^2 (X'X)^{-1}_{jj}}
  </math>
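
A minimal sketch of this computation in Python with NumPy (toy data and variable names are illustrative assumptions): the estimated covariance matrix is ''s''<sup>2</sup>(''X′X'')<sup>−1</sup>, and the standard errors are the square roots of its diagonal.

```python
import numpy as np

# Hypothetical toy data: a constant plus one regressor.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)

cov_hat = s2 * XtX_inv                  # estimated Var[beta_hat | X]
std_err = np.sqrt(np.diag(cov_hat))     # one standard error per coefficient

print(std_err)
```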
 
It can also be easily shown that the estimator <math style="vertical-align:-.3em">\scriptstyle\hat\beta</math> is uncorrelated with the residuals from the model:<ref name="HayashiFSP"/>
: <math>
    \operatorname{Cov}[\, \hat\beta,\hat\varepsilon \,|X\,] = 0.
  </math>
 
The '''[[Gauss–Markov theorem]]''' states that under the ''spherical errors'' assumption (that is, the errors should be [[uncorrelated]] and [[homoscedastic]]) the estimator <math style="vertical-align:-.3em">\scriptstyle\hat\beta</math> is efficient in the class of linear unbiased estimators. This is called the '''best linear unbiased estimator (BLUE)'''. Efficiency should be understood as follows: if we were to find some other estimator <math style="vertical-align:-.3em">\scriptstyle\tilde\beta</math> which is linear in ''y'' and unbiased, then <ref name="HayashiFSP"/>
: <math>
    \operatorname{Var}[\, \tilde\beta \,| X \,] - \operatorname{Var}[\, \hat\beta \,| X \,] \geq 0
  </math>
in the sense that this is a [[nonnegative-definite matrix]]. This theorem establishes optimality only in the class of linear unbiased estimators, which is quite restrictive. Depending on the distribution of the error terms ''ε'', other, non-linear estimators may provide better results than OLS.
 
=== Assuming normality ===
The properties listed so far are all valid regardless of the underlying distribution of the error terms. However if you are willing to assume that the ''normality assumption'' holds (that is, that {{nowrap|''ε'' ~ ''N''(0, ''σ''<sup>2</sup>''I<sub>n</sub>'')}}), then additional properties of the OLS estimators can be stated.
 
The estimator <math style="vertical-align:-.3em">\scriptstyle\hat\beta</math> is normally distributed, with mean and variance as given before:<ref>{{harvtxt|Amemiya|1985|loc=page 13}}</ref>
: <math>
    \hat\beta\ \sim\ \mathcal{N}\big(\beta,\ \sigma^2(X'X)^{-1}\big)
  </math>
This estimator reaches the [[Cramér–Rao bound]] for the model, and thus is optimal in the class of all unbiased estimators.<ref name="Hayashi 2000 loc=page 52"/> Note that unlike the [[Gauss–Markov theorem]], this result establishes optimality among both linear and non-linear estimators, but only in the case of normally distributed error terms.
 
The estimator ''s''<sup>2</sup> will be proportional to the [[chi-squared distribution]]:<ref>{{harvtxt|Amemiya|1985|loc=page 14}}</ref>
: <math>
    s^2\ \sim\ \frac{\sigma^2}{n-p} \cdot \chi^2_{n-p}
  </math>
The variance of this estimator is equal to {{nowrap|2''σ''<sup>4</sup>/(''n&thinsp;−&thinsp;p'')}}, which does not attain the [[Cramér–Rao bound]] of 2''σ''<sup>4</sup>/''n''. However it was shown that there are no unbiased estimators of ''σ''<sup>2</sup> with variance smaller than that of the estimator ''s''<sup>2</sup>.<ref>{{harvtxt|Rao|1973|loc=page 319}}</ref> If we are willing to allow biased estimators, and consider the class of estimators that are proportional to the sum of squared residuals (SSR) of the model, then the best (in the sense of the [[mean squared error]]) estimator in this class will be {{nowrap|1=<sup style="position:relative;left:.7em;top:-.1em">~</sup>''σ''<sup>2</sup> = SSR&thinsp;''/''&thinsp;(''n&thinsp;−&thinsp;p''&thinsp;+&thinsp;2)}}, which even beats the Cramér–Rao bound in case when there is only one regressor ({{nowrap|1=''p'' = 1}}).<ref>{{harvtxt|Amemiya|1985|loc=page 20}}</ref>
 
Moreover, the estimators <math style="vertical-align:-.3em">\scriptstyle\hat\beta</math> and ''s''<sup>2</sup> are [[independent random variables|independent]],<ref>{{harvtxt|Amemiya|1985|loc=page 27}}</ref> a fact which is useful when constructing the t- and F-tests for the regression.
 
=== Influential observations ===
As was mentioned before, the estimator <math style="vertical-align:-.3em">\scriptstyle\hat\beta</math> is linear in ''y'', meaning that it represents a linear combination of the dependent variables ''y<sub>i</sub>'''s. The weights in this linear combination are functions of the regressors ''X'', and generally are unequal. The observations with high weights are called '''influential''' because they have a more pronounced effect on the value of the estimator.
 
To analyze which observations are influential we remove a specific ''j''-th observation and consider how much the estimated quantities are going to change (similarly to the [[jackknife method]]). It can be shown that the change in the OLS estimator for ''β'' will be equal to <ref name="DvdMck33">{{harvtxt|Davidson|Mackinnon|1993|loc=page 33}}</ref>
: <math>
    \hat\beta^{(j)} - \hat\beta = - \frac{1}{1-h_j} (X'X)^{-1}x_j\hat\varepsilon_j\,,
  </math>
where {{nowrap|1=''h<sub>j</sub>'' = ''x<sub>j</sub>''′&thinsp;(''X′X'')<sup>−1</sup>''x<sub>j</sub>''}} is the ''j''-th diagonal element of the hat matrix ''P'', and ''x<sub>j</sub>'' is the vector of regressors corresponding to the ''j''-th observation. Similarly, the change in the predicted value for ''j''-th observation resulting from omitting that observation from the dataset will be equal to <ref name="DvdMck33"/>
: <math>
    \hat{y}_j^{(j)} - \hat{y}_j = x'_j\hat\beta^{(j)} - x'_j\hat\beta = - \frac{h_j}{1-h_j}\,\hat\varepsilon_j
  </math>
 
From the properties of the hat matrix, {{nowrap|0 ≤ ''h<sub>j</sub>'' ≤ 1}}, and they sum up to ''p'', so that on average {{nowrap|''h<sub>j</sub>'' ≈ ''p/n''}}. These quantities ''h<sub>j</sub>'' are called the '''leverages''', and observations with high ''h<sub>j</sub>'''s — '''leverage points'''.<ref>{{harvtxt|Davidson|Mackinnon|1993|loc=page 36}}</ref> Usually the observations with high leverage ought to be scrutinized more carefully, in case they are erroneous, or outliers, or in some other way atypical of the rest of the dataset.
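
The leave-one-out expression for the change in the estimator can be compared with an actual refit. The following is a minimal sketch in Python with NumPy; the data are made up, with the last observation placed far from the others so that it has high leverage.

```python
import numpy as np

# Hypothetical toy data; the last x value lies far from the rest.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 15.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Leverages h_j = x_j' (X'X)^{-1} x_j, the diagonal of the hat matrix.
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

j = n - 1  # index of the observation to drop
# Change in the estimate according to the closed-form expression.
delta_formula = -(XtX_inv @ X[j]) * resid[j] / (1.0 - h[j])

# Change obtained by refitting without observation j.
keep = np.arange(n) != j
beta_drop, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
delta_refit = beta_drop - beta_hat

print(h)                           # leverages; they sum to p
print(delta_formula, delta_refit)  # the two changes agree
```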
 
=== Partitioned regression ===
Sometimes the variables and corresponding parameters in the regression can be logically split into two groups, so that the regression takes form
: <math>
    y = X_1\beta_1 + X_2\beta_2 + \varepsilon,
  </math>
where ''X''<sub>1</sub> and ''X''<sub>2</sub> have dimensions ''n×p''<sub>1</sub>, ''n×p''<sub>2</sub>, and ''β''<sub>1</sub>, ''β''<sub>2</sub> are ''p''<sub>1</sub>×1 and ''p''<sub>2</sub>×1 vectors, with {{nowrap|1=''p''<sub>1</sub> + ''p''<sub>2</sub> = ''p''}}.
 
The '''[[Frisch–Waugh–Lovell theorem]]''' states that in this regression the residuals <math style="vertical-align:0">\hat\varepsilon</math> and the OLS estimate <math style="vertical-align:-.3em">\scriptstyle\hat\beta_2</math> will be numerically identical to the residuals and the OLS estimate for ''β''<sub>2</sub> in the following regression:<ref>{{harvtxt|Davidson|Mackinnon|1993|loc=page 20}}</ref>
: <math>
    M_1y = M_1X_2\beta_2 + \eta\,,
  </math>
where ''M''<sub>1</sub> is the annihilator matrix for regressors ''X''<sub>1</sub>.
 
The theorem can be used to establish a number of theoretical results. For example, having a regression with a constant and another regressor is equivalent to subtracting the means from the dependent variable and the regressor and then running the regression for the demeaned variables but without the constant term.
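
The demeaning example can be checked numerically. In this sketch (Python with NumPy, simulated data with an arbitrary seed and made-up coefficients), ''X''<sub>1</sub> is the constant, so ''M''<sub>1</sub> simply subtracts sample means; the coefficients on the remaining regressors and the residuals match those of the full regression.

```python
import numpy as np

# Simulated data; the seed, sample size, and coefficients are arbitrary.
rng = np.random.default_rng(0)
n = 50
X2 = rng.normal(size=(n, 2))                 # two non-constant regressors
X1 = np.ones((n, 1))                         # the constant
y = 1.0 + X2 @ np.array([2.0, -1.0]) + rng.normal(size=n)

# Full regression of y on [X1, X2].
X = np.hstack([X1, X2])
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
resid_full = y - X @ beta_full

# Annihilator matrix for X1; with a constant only, it subtracts the mean.
M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)

# Regression of M1 y on M1 X2, without a constant term.
beta2_fwl, *_ = np.linalg.lstsq(M1 @ X2, M1 @ y, rcond=None)
resid_fwl = M1 @ y - M1 @ X2 @ beta2_fwl

print(beta_full[1:], beta2_fwl)              # identical up to rounding
```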
 
=== Constrained estimation ===
Suppose it is known that the coefficients in the regression satisfy a system of linear equations
: <math>
    H_0\!:\ \ Q'\beta = c, \,
  </math>
where ''Q'' is a ''p×q'' matrix of full rank, and ''c'' is a ''q×''1 vector of known constants, where {{nowrap|''q&thinsp;<&thinsp;p''}}. In this case least squares estimation is equivalent to minimizing the sum of squared residuals of the model subject to the constraint ''H''<sub>0</sub>. The '''constrained least squares (CLS)''' estimator can be given by an explicit formula:<ref>{{harvtxt|Amemiya|1985|loc=page 21}}</ref>
: <math>
    \hat\beta^c = \hat\beta - (X'X)^{-1}Q\Big(Q'(X'X)^{-1}Q\Big)^{-1}(Q'\hat\beta - c)
  </math>
 
This expression for the constrained estimator is valid as long as the matrix ''X′X'' is invertible. It was assumed from the beginning of this article that this matrix is of full rank, and it was noted that when the rank condition fails, ''β'' will not be identifiable. However it may happen that adding the restriction ''H''<sub>0</sub> makes ''β'' identifiable, in which case one would like to find the formula for the estimator. The estimator is equal to <ref name="Amemiya22">{{harvtxt|Amemiya|1985|loc=page 22}}</ref>
: <math>
    \hat\beta^c = R(R'X'XR)^{-1}R'X'y + \Big(I_p - R(R'X'XR)^{-1}R'X'X\Big)Q(Q'Q)^{-1}c,
  </math>
where ''R'' is a ''p×''(''p−q'') matrix such that the matrix {{nowrap|[''Q R'']}} is non-singular, and {{nowrap|1=''R′Q'' = 0}}. Such a matrix can always be found, although generally it is not unique. The second formula coincides with the first in the case when ''X′X'' is invertible.<ref name="Amemiya22"/>
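
As an illustration of the first formula, the sketch below works through a case with ''p'' = 2 regressors and a single constraint (''q'' = 1), ''β''<sub>1</sub> + ''β''<sub>2</sub> = 1. The data, the constraint, and the helper names <code>inv2</code> and <code>dot</code> are assumptions chosen only to keep the arithmetic small. The result is cross-checked by substituting the constraint into the model and running an unconstrained regression.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Made-up illustrative data: two regressors, no constant term.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [1.4, 1.7, 3.3, 3.6, 5.5]

# Unconstrained OLS: beta_hat = (X'X)^{-1} X'y
xtx = [[dot(x1, x1), dot(x1, x2)], [dot(x2, x1), dot(x2, x2)]]
xty = [dot(x1, y), dot(x2, y)]
A = inv2(xtx)
beta_hat = [dot(A[0], xty), dot(A[1], xty)]

# Hypothetical constraint Q'beta = c with Q = (1, 1)' and c = 1.
Q = [1.0, 1.0]
c = 1.0

AQ = [dot(A[0], Q), dot(A[1], Q)]   # (X'X)^{-1} Q
QAQ = dot(Q, AQ)                    # Q'(X'X)^{-1} Q, a scalar because q = 1
gap = dot(Q, beta_hat) - c          # Q'beta_hat - c
beta_c = [bh - aq * gap / QAQ for bh, aq in zip(beta_hat, AQ)]

# Cross-check: with beta2 = 1 - beta1 the model is y - x2 = beta1 (x1 - x2) + e.
d = [u - v for u, v in zip(x1, x2)]
z = [yi - v for yi, v in zip(y, x2)]
b1_sub = dot(d, z) / dot(d, d)

print("unconstrained:", beta_hat)
print("constrained:  ", beta_c)
print("substitution: ", [b1_sub, 1.0 - b1_sub])
```

The constrained estimate satisfies the restriction exactly and matches the substitution approach, since both minimize the same sum of squared residuals over the same set of coefficients.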
 
== Large sample properties ==
The least squares estimators are [[point estimate]]s of the linear regression model parameters ''β''. However, we generally also want to know how close those estimates might be to the true values of the parameters. In other words, we want to construct [[interval estimate]]s.
 
Since we haven't made any assumption about the distribution of error term ''ε<sub>i</sub>'', it is impossible to infer the distribution of the estimators <math>\hat\beta</math> and <math>\hat\sigma^2</math>. Nevertheless, we can apply the [[law of large numbers]] and [[central limit theorem]] to derive their ''asymptotic'' properties as sample size ''n'' goes to infinity. While the sample size is necessarily finite, it is customary to assume that ''n'' is "large enough" so that the true distribution of the OLS estimator is close to its asymptotic limit, and the former may be approximately replaced by the latter.
 
We can show that under the model assumptions, the least squares estimator for ''β'' is [[consistent estimator|consistent]] (that is, <math>\hat\beta</math> [[Convergence_of_random_variables#Convergence_in_probability|converges in probability]] to ''β'') and asymptotically normal:<sup>[[Proofs involving ordinary least squares#Consistency_and_asymptotic_normality_of_.CE.B2.CC.82|[proof]]]</sup>
: <math>\sqrt{n}(\hat\beta - \beta)\ \xrightarrow{d}\ \mathcal{N}\big(0,\;\sigma^2Q_{xx}^{-1}\big),</math>
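where <math>Q_{xx}</math> is the probability limit of <math>\tfrac{1}{n}X'X</math>; in practice it is estimated by <math>\tfrac{1}{n}X'X</math> itself.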
 
Using this asymptotic distribution, approximate two-sided confidence intervals for the ''j''-th component of the vector <math>\beta</math> can be constructed as
: <math>\beta_j \in \bigg[\
    \hat\beta_j \pm q^{\mathcal{N}(0,1)}_{1-\alpha/2}\!\sqrt{\tfrac{1}{n}\hat\sigma^2\big[Q_{xx}^{-1}\big]_{jj}}
    \ \bigg]</math> &nbsp; at the 1&nbsp;&minus;&nbsp;''α'' confidence level,
where ''q'' denotes the [[quantile function]] of the standard normal distribution, and [·]<sub>''jj''</sub> is the ''j''-th diagonal element of a matrix.
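
The interval above can be sketched in code. This example makes several assumptions that are not part of the text: it uses the height and weight data from the example section of this article, fits a straight line (constant plus height) rather than the quadratic model, estimates <math>Q_{xx}</math> by <math>\tfrac{1}{n}X'X</math>, and estimates ''σ''<sup>2</sup> by the residual sum of squares divided by ''n'' − 2.

```python
from math import sqrt
from statistics import NormalDist

height = [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
          1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83]
weight = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
          63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]
n = len(height)

# Sample estimate of Q_xx = (1/n) X'X for the regressors (1, h), and its inverse.
m1 = sum(height) / n
m2 = sum(h * h for h in height) / n
det = m2 - m1 * m1
q_inv = [[m2 / det, -m1 / det], [-m1 / det, 1.0 / det]]

# OLS estimate: beta = Q_xx^{-1} (1/n) X'y
r0 = sum(weight) / n
r1 = sum(h * w for h, w in zip(height, weight)) / n
beta = [q_inv[0][0] * r0 + q_inv[0][1] * r1,
        q_inv[1][0] * r0 + q_inv[1][1] * r1]

# Estimate of sigma^2 from the residuals (n - 2 degrees of freedom, an assumption).
resid = [w - beta[0] - beta[1] * h for h, w in zip(height, weight)]
s2 = sum(e * e for e in resid) / (n - 2)

# Two-sided interval at the 1 - alpha level using the standard normal quantile.
alpha = 0.05
q = NormalDist().inv_cdf(1 - alpha / 2)
std_err = [sqrt(s2 * q_inv[j][j] / n) for j in range(2)]
intervals = [(beta[j] - q * std_err[j], beta[j] + q * std_err[j]) for j in range(2)]

for name, b, (lo, hi) in zip(["const", "height"], beta, intervals):
    print(f"{name}: estimate {b:.4f}, 95% interval [{lo:.4f}, {hi:.4f}]")
```

With only 15 observations the normal approximation is rough; the sketch is meant to show how the pieces of the formula fit together, not to recommend the interval for samples this small.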
 
Similarly, the least squares estimator for ''σ''<sup>2</sup> is also consistent and asymptotically normal (provided that the fourth moment of ''ε<sub>i</sub>'' exists) with limiting distribution
: <math>\sqrt{n}(\hat\sigma^2-\sigma^2)\ \xrightarrow{d}\ \mathcal{N}\big(0,\;\operatorname{E}[\varepsilon_i^4]-\sigma^4\big). </math>
 
These asymptotic distributions can be used for prediction, testing hypotheses, constructing other estimators, etc. As an example, consider the problem of prediction. Suppose <math>x_0</math> is some point within the domain of distribution of the regressors, and one wants to know what the response variable would have been at that point. The [[mean response]] is the quantity <math>y_0=x'_0\beta</math>, whereas the [[predicted response]] is <math>\hat{y}_0=x'_0\hat\beta</math>. Clearly the predicted response is a random variable, and its distribution can be derived from that of <math>\hat\beta</math>:
: <math>\sqrt{n}(\hat{y}_0 - y_0)\ \xrightarrow{d}\ \mathcal{N}\big(0,\;\sigma^2x'_0Q_{xx}^{-1}x_0\big),</math>
which allows confidence intervals for the mean response <math>y_0</math> to be constructed:
: <math>y_0\in\bigg[\ x_0'\hat\beta \pm q^{\mathcal{N}(0,1)}_{1-\alpha/2}\!\sqrt{\tfrac{1}{n}\hat\sigma^2x'_{0}Q_{xx}^{-1}x_{0}}\ \bigg]</math> &nbsp; at the 1&nbsp;&minus;&nbsp;''α'' confidence level.
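
A sketch of the mean-response interval follows, under the same kind of assumptions as before: the height and weight data from the example section, a straight-line model with a constant, <math>Q_{xx}</math> estimated by <math>\tfrac{1}{n}X'X</math>, and ''σ''<sup>2</sup> estimated with ''n'' − 2 degrees of freedom. The evaluation point <code>h0 = 1.70</code> and the function name <code>mean_response_interval</code> are arbitrary illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

height = [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
          1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83]
weight = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
          63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]


def mean_response_interval(x, y, x0, alpha=0.05):
    """Asymptotic interval for the mean response a + b*x0 in y = a + b*x + e."""
    n = len(x)
    m1 = sum(x) / n
    m2 = sum(xi * xi for xi in x) / n
    det = m2 - m1 * m1
    # Inverse of the sample estimate of Q_xx = (1/n) X'X.
    q_inv = [[m2 / det, -m1 / det], [-m1 / det, 1.0 / det]]
    r0 = sum(y) / n
    r1 = sum(xi * yi for xi, yi in zip(x, y)) / n
    a = q_inv[0][0] * r0 + q_inv[0][1] * r1
    b = q_inv[1][0] * r0 + q_inv[1][1] * r1
    s2 = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    # Quadratic form x0' Q_xx^{-1} x0 with x0 = (1, x0)'.
    quad = q_inv[0][0] + 2.0 * x0 * q_inv[0][1] + x0 * x0 * q_inv[1][1]
    variance = s2 * quad / n
    q = NormalDist().inv_cdf(1 - alpha / 2)
    centre = a + b * x0
    half_width = q * sqrt(variance)
    return centre, variance, (centre - half_width, centre + half_width)


h0 = 1.70  # illustrative evaluation point inside the data range
y0_hat, var_y0, (lower, upper) = mean_response_interval(height, weight, h0)
print(f"predicted mean weight at h = {h0}: {y0_hat:.3f} kg")
print(f"95% interval: [{lower:.3f}, {upper:.3f}]")
```

This is an interval for the mean response, not for an individual observation; an interval for a new observation would also have to account for the variance of the error term itself.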
 
== Hypothesis testing ==
{{main|Hypothesis testing}}
{{Empty section|date=July 2010}}
 
== Example with real data ==
[[File:OLS example weight vs height scatterplot.svg|thumb|[[Scatterplot]] of the data; the relationship is slightly curved but close to linear.]]
 
NB. This example exhibits the common mistake of ignoring the condition of having zero error in the independent variable.
 
The following data set gives average heights and weights for American women aged 30–39 (source: ''The World Almanac and Book of Facts, 1975'').
:{|class="wikitable"
|- style="text-align:right;"
! style="text-align:left;" | &nbsp;Height (m):&nbsp;
| 1.47 || 1.50 || 1.52 || 1.55 || 1.57 || 1.60 || 1.63 || 1.65 || 1.68 || 1.70 || 1.73 || 1.75 || 1.78 || 1.80 || 1.83
|- style="text-align:right;"
! style="text-align:left;" | &nbsp;Weight (kg):&nbsp;
|52.21 ||53.12 ||54.48 ||55.84 ||57.20 ||58.57 ||59.93 ||61.29 ||63.11 ||64.47 ||66.28 ||68.10 ||69.92 ||72.19 ||74.46
|}
 
When only one dependent variable is being modeled, a [[scatterplot]] will suggest the form and strength of the relationship between the dependent variable and regressors. It might also reveal outliers, heteroscedasticity, and other aspects of the data that may complicate the interpretation of a fitted regression model.  The scatterplot suggests that the relationship is strong and can be approximated as a quadratic function. OLS can handle non-linear relationships by introducing the regressor <tt>HEIGHT</tt><sup>2</sup>.  The regression model then becomes a multiple linear model:
 
:<math>w_i = \beta_1 + \beta_2 h_i + \beta_3 h_i^2 + \varepsilon_i.</math>
 
The output from most popular [[List of statistical packages|statistical packages]] will look similar to this:
[[File:OLS example weight vs height fitted line.svg|thumb|right|300px|Fitted regression]]
:{|style="border:3px ridge; padding:2pt 10pt"
|colspan="6"| Method: Least Squares<br>Dependent variable: WEIGHT<br>Included observations: 15
|-
|colspan="6"| <hr>
|- align="right"
!align="left" | Variable
!align="right"| Coefficient
!<!-- empty column -->
!align="right"| [[standard error|Std.Error]]
!style="padding-left:6pt;"| [[t-statistic]]
!style="padding-left:6pt;"| [[p-value]]
|-
|colspan="6"| <hr>
|-
|- align="right"
|align="left"| Const              || 128.8128    || || 16.3083  || 7.8986      || 0.0000
|- align="right"
|align="left"| <math>h</math>              || –143.1620  || || 19.8332  || –7.2183    || 0.0000
|- align="right"
|align="left"| <math>h^2</math>  || 61.9603    || || 6.0084    || 10.3122    || 0.0000
|-
|colspan="6"| <hr>
|-
| [[coefficient of determination|R<sup>2</sup>]]
|align="right"| 0.9989
|rowspan="6"| &nbsp; &nbsp;
|colspan="2"| S.E. of regression
|align="right"| 0.2516
|-
| Adjusted R<sup>2</sup>
|align="right"| 0.9987
|colspan="2"| Model sum-of-sq
|align="right"| 692.61
|-
| Log-likelihood
|align="right"| 1.0890
|colspan="2"| Residual sum-of-sq
|align="right"| 0.7595
|-
| [[Durbin–Watson statistic|Durbin–Watson stats.]]
|align="right"| 2.1013
|colspan="2"| Total sum-of-sq
|align="right"| 693.37
|-
| [[Akaike information criterion|Akaike criterion]]
|align="right"| 0.2548
|colspan="2"| F-statistic
|align="right"| 5471.2
|-
| [[Schwarz criterion]]
|align="right"| 0.3964
|colspan="2"| p-value (F-stat)
|align="right"| 0.0000
|}
 
In this table:
* The ''Coefficient'' column gives the least squares estimates of parameters ''β<sub>j</sub>''
* The ''Std.Error'' column shows the [[standard error (statistics)|standard error]] of each coefficient estimate: <math>\hat\sigma_j=\big(\tfrac{1}{n}\hat\sigma^2[Q_{xx}^{-1}]_{jj}\big)^{1/2}</math>
* The ''[[t-statistic]]'' and ''p-value'' columns test whether each of the coefficients might be equal to zero. The ''t''-statistic is calculated simply as <math>t=\hat\beta_j/\hat\sigma_j</math>. If the errors ε follow a normal distribution, ''t'' follows a Student-t distribution.  Under weaker conditions, ''t'' is asymptotically normal. Large absolute values of ''t'' indicate that the null hypothesis can be rejected and that the corresponding coefficient is not zero. The last column, [[p-value|''p''-value]], expresses the results of the hypothesis test as a [[statistical significance|significance level]].  Conventionally, ''p''-values smaller than 0.05 are taken as evidence that the population coefficient is nonzero.
* ''R-squared'' is the [[coefficient of determination]] indicating goodness-of-fit of the regression. This statistic will be equal to one if the fit is perfect, and to zero when the regressors ''X'' have no explanatory power whatsoever. This is a biased estimate of the population ''R-squared'', and will never decrease if additional regressors are added, even if they are irrelevant.
* ''Adjusted R-squared'' is a slightly modified version of <math>R^2</math>, designed to penalize for the excess number of regressors which do not add to the explanatory power of the regression. This statistic is always smaller than <math>R^2</math>, can decrease as new regressors are added, and even be negative for poorly fitting models:
:: <math>\overline{R}^2 = 1 - \tfrac{n-1}{n-p}(1-R^2)</math>
* ''Log-likelihood'' is calculated under the assumption that the errors follow a normal distribution. Even though the assumption is not very reasonable, this statistic may still find its use in conducting LR tests.
* ''[[Durbin–Watson statistic]]'' tests whether there is any evidence of serial correlation between the residuals. As a rule of thumb, a value smaller than 2 is evidence of positive correlation.
* ''[[Akaike information criterion]]'' and ''[[Schwarz criterion]]'' are both used for model selection. Generally when comparing two alternative models, smaller values of one of these criteria will indicate a better model.<ref>{{Cite book
| edition = 2nd
| publisher = Springer
| isbn = 0-387-95364-7
| last = Burnham
| first = Kenneth P.
| coauthors = David Anderson
| title = Model Selection and Multi-Model Inference
| year = 2002
}}</ref>
* ''Standard error of regression'' is an estimate of ''σ'', the standard deviation of the error term.
* ''Total sum of squares'', ''model sum of squares'', and ''residual sum of squares'' tell us how much of the initial variation in the sample was explained by the regression.
* ''F-statistic'' tests the hypothesis that all coefficients (except the intercept) are equal to zero. This statistic has an ''F''(''p–1'',''n–p'') distribution under the null hypothesis and normality assumption, and its ''p-value'' is the probability, if the null hypothesis were true, of obtaining an ''F''-statistic at least as large as the one observed. Note that when the errors are not normal this statistic becomes invalid, and other tests such as the [[Wald test]] or [[likelihood ratio test|LR test]] should be used.
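
Several of the quantities in the table can be recomputed from the data with a short script. The sketch below is one possible way to do so and is not the procedure used by any particular statistical package: it forms the normal equations for the quadratic model and solves them with a small Gaussian-elimination helper (<code>solve</code>, a name introduced here for illustration). Results should be comparable to the table, up to rounding and numerical error.

```python
from math import sqrt

height = [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
          1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83]
weight = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
          63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, k):
            f = m[r][col] / m[col][col]
            for j in range(col, k + 1):
                m[r][j] -= f * m[col][j]
    x = [0.0] * k
    for i in reversed(range(k)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, k))
        x[i] = (m[i][k] - s) / m[i][i]
    return x


# Design matrix with columns 1, h, h^2.
X = [[1.0, h, h * h] for h in height]
n, p = len(X), 3

# Normal equations X'X beta = X'y.
xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
xty = [sum(r[i] * w for r, w in zip(X, weight)) for i in range(p)]
beta = solve(xtx, xty)

resid = [w - sum(b * v for b, v in zip(beta, r)) for r, w in zip(X, weight)]
rss = sum(e * e for e in resid)                      # residual sum of squares
w_bar = sum(weight) / n
tss = sum((w - w_bar) ** 2 for w in weight)          # total sum of squares

r2 = 1.0 - rss / tss
adj_r2 = 1.0 - (n - 1) / (n - p) * (1.0 - r2)
ser = sqrt(rss / (n - p))                            # standard error of regression
f_stat = (r2 / (p - 1)) / ((1.0 - r2) / (n - p))

# Standard errors: square roots of the diagonal of s^2 (X'X)^{-1}.
# Column j of the inverse is obtained by solving X'X v = e_j.
std_err = []
for j in range(p):
    e_j = [1.0 if i == j else 0.0 for i in range(p)]
    std_err.append(sqrt(ser ** 2 * solve(xtx, e_j)[j]))
t_stat = [b / s for b, s in zip(beta, std_err)]

# Durbin-Watson statistic from successive residual differences.
dw = sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, n)) / rss

print("coefficients:", beta)
print("std. errors: ", std_err)
print("t-statistics:", t_stat)
print("R2:", r2, "adjusted R2:", adj_r2)
print("S.E. of regression:", ser, "F:", f_stat, "DW:", dw)
```

Solving the normal equations directly is acceptable for a problem this small, but the columns 1, ''h'' and ''h''<sup>2</sup> are strongly correlated over this range, so a QR-based method would be the more cautious numerical choice.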
 
[[File:OLS example weight vs height residuals.svg|thumb|right|300px|Residuals plot]]
Ordinary least squares analysis often includes the use of diagnostic plots designed to detect departures of the data from the assumed form of the model.  These are some of the common diagnostic plots:
* Residuals against the explanatory variables in the model. A non-linear relation between these variables suggests that the linearity of the conditional mean function may not hold.  Different levels of variability in the residuals for different levels of the explanatory variables suggest possible heteroscedasticity.
* Residuals against explanatory variables not in the model. Any relation of the residuals to these variables would suggest considering these variables for inclusion in the model.
* Residuals against the fitted values, <math>\hat{y}</math>.
* Residuals against the preceding residual.  This plot may identify serial correlations in the residuals.
 
An important consideration when carrying out statistical inference using regression models is how the data were sampled.  In this example, the data are averages rather than measurements on individual women.  The fit of the model is very good, but this does not imply that the weight of an individual woman can be predicted with high accuracy based only on her height.
 
===Sensitivity to rounding===
This example also demonstrates that coefficients determined by these calculations are sensitive to how the data is prepared. The heights were originally given rounded to the nearest inch and have been converted and rounded to the nearest centimetre. Since the conversion factor is one inch to 2.54&nbsp;cm this is ''not'' an exact conversion. The original inches can be recovered by Round(x/0.0254) and then re-converted to metric without rounding. If this is done the results become:
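:{|class="wikitable"
|- style="text-align:right;"
! style="text-align:left;" | Conversion
! Const !! Height !! Height<sup>2</sup>
|- style="text-align:right;"
! style="text-align:left;" | Converted to metric with rounding
| 128.8128 || -143.162 || 61.96033
|- style="text-align:right;"
! style="text-align:left;" | Converted to metric without rounding
| 119.0205 || -131.5076 || 58.5046
|}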
Using either of these equations to predict the weight of a 5' 6" (1.6764m) woman gives similar values: 62.94&nbsp;kg with rounding vs. 62.98&nbsp;kg without rounding.
Thus a seemingly small variation in the data has a real effect on the coefficients but a small effect on the results of the equation.
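
The comparison can be reproduced along the following lines. This is a sketch under stated assumptions: the quadratic model is fitted through the normal equations with an illustrative Gaussian-elimination helper (<code>solve</code>), and the function names <code>fit_quadratic</code> and <code>predict</code> are introduced here only for the example.

```python
height = [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
          1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83]
weight = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
          63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, k):
            f = m[r][col] / m[col][col]
            for j in range(col, k + 1):
                m[r][j] -= f * m[col][j]
    x = [0.0] * k
    for i in reversed(range(k)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, k))
        x[i] = (m[i][k] - s) / m[i][i]
    return x


def fit_quadratic(h, w):
    """OLS coefficients (const, h, h^2) from the normal equations."""
    X = [[1.0, hi, hi * hi] for hi in h]
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * wi for r, wi in zip(X, w)) for i in range(3)]
    return solve(xtx, xty)


def predict(beta, h):
    """Fitted weight at height h for coefficients (const, h, h^2)."""
    return beta[0] + beta[1] * h + beta[2] * h * h


# Recover the original whole inches, then convert to metres without rounding.
inches = [round(h / 0.0254) for h in height]
height_exact = [i * 0.0254 for i in inches]

beta_rounded = fit_quadratic(height, weight)
beta_exact = fit_quadratic(height_exact, weight)

h0 = 66 * 0.0254  # 5 ft 6 in expressed in metres
print("with rounding:   ", beta_rounded, predict(beta_rounded, h0))
print("without rounding:", beta_exact, predict(beta_exact, h0))
```

Printing both coefficient vectors next to the two predictions makes the contrast visible: the coefficients move noticeably while the fitted value inside the data range changes little.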
 
While this may look innocuous in the middle of the data range, it could become significant at the extremes or in the case where the fitted model is used to project outside the data range ([[extrapolation]]).
 
This highlights a common error: this example is an abuse of OLS, which inherently requires that the errors in the independent variable (in this case height) are zero or at least negligible. The initial rounding to the nearest inch plus any actual measurement errors constitute a finite and non-negligible error. As a result, the fitted parameters are not the best estimates they are presumed to be. Though not totally spurious, the error in the estimation will depend upon the relative size of the ''x'' and ''y'' errors.
 
== See also ==
* [[Minimum mean square error|Bayesian least squares]]
* [[Fama–MacBeth regression]]
* [[Non-linear least squares]]
* [[Numerical methods for linear least squares]]
 
== References ==
{{reflist|30em}}
 
== Further reading ==
{{refbegin|60em}}
* {{cite book
  | last = Amemiya | first = Takeshi
  | year = 1985
  | title = Advanced econometrics
  | publisher = Harvard University Press
  | isbn = 0-674-00560-0
  | ref = harv
  }}
* {{cite book
  | last1 = Davidson  | first1 = Russell
  | last2 = Mackinnon | first2 = James G.
  | title = Estimation and inference in econometrics
  | year = 1993
  | publisher = Oxford University Press
  | isbn = 978-0-19-506011-9
  | ref = harv
  }}
* {{cite book
  | last = Greene | first = William H.
  | title = Econometric analysis
  | year = 2002
  | edition = 5th <!-- there is a newer edition available, but 5th is in the open access -->
  | publisher = Prentice Hall
  | location = New Jersey
  | isbn = 0-13-066189-9
  | url = http://bib.tiera.ru/DVD-010/Greene_W.H._Econometric_analysis_(2002)(5th_ed.)(en)(983s).pdf
  | accessdate = 2010-04-26
  | ref = harv
  }}
* {{cite book
  | last = Hayashi | first = Fumio
  | title = Econometrics
  | year = 2000
  | publisher = Princeton University Press
  | isbn = 0-691-01018-8
  | ref = harv
  }}
* {{cite book
  | last = Rao | first = C.R.
  | year = 1973
  | title = Linear statistical inference and its applications
  | edition = 2nd
  | location = New York
  | publisher = John Wiley & Sons
  | ref = harv
  }}
* {{cite book |last=Wooldridge |first=Jeffrey M. |year=2013 |title=Introductory Econometrics: A Modern Approach |location=Australia |publisher=South Western, Cengage Learning |edition=5th international |isbn=9781111534394 }}
{{refend}}
 
{{Least Squares and Regression Analysis}}
 
{{DEFAULTSORT:Ordinary Least Squares}}
[[Category:Regression analysis]]
[[Category:Estimation theory]]
[[Category:Parametric statistics]]
[[Category:Least squares]]
 
[[de:Methode der kleinsten Quadrate]]

Latest revision as of 20:58, 27 August 2014
