# Theil–Sen estimator

In non-parametric statistics, the **Theil–Sen estimator** is a method for robust linear regression that chooses the median slope among all lines through pairs of two-dimensional sample points. It has also been called **Sen's slope estimator**,^{[1]}^{[2]} **slope selection**,^{[3]}^{[4]} the **single median method**,^{[5]} the **Kendall robust line-fit method**,^{[6]} and the **Kendall–Theil robust line**.^{[7]} It is named after Henri Theil and Pranab K. Sen, who published papers on this method in 1950 and 1968 respectively. It can be computed efficiently and is insensitive to outliers. It can be significantly more accurate than simple linear regression for skewed and heteroskedastic data, and it competes well against non-robust least squares even for normally distributed data in terms of statistical power.^{[8]} It has been called "the most popular nonparametric technique for estimating a linear trend".^{[2]}

## Definition

As defined by Theil (1950), the Theil–Sen estimator of a set of two-dimensional points $(x_i, y_i)$ is the median $m$ of the slopes $(y_j - y_i)/(x_j - x_i)$ determined by all pairs of sample points. Sen (1968) extended this definition to handle the case in which two data points have the same $x$-coordinate. In Sen's definition, one takes the median of the slopes defined only from pairs of points having distinct $x$-coordinates.

Once the slope $m$ has been determined, one may determine a line through the sample points by setting the $y$-intercept $b$ to be the median of the values $y_i - m x_i$.^{[9]} As Sen observed, this estimator is the value that makes the Kendall tau rank correlation coefficient comparing the values of $x_i$ with the residual for the $i$-th observation become approximately zero.^{[10]}
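The definition translates almost directly into a brute-force computation. The following is a minimal Python sketch, not a reference implementation: the function name `theil_sen` and the toy data set are illustrative assumptions, and the quadratic enumeration of pairs is only practical for small samples.

```python
from itertools import combinations
from statistics import median


def theil_sen(points):
    """Return (slope, intercept) of the Theil-Sen line for (x, y) pairs."""
    # Slopes of all pairs with distinct x-coordinates (Sen's definition).
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x1 != x2
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct x-coordinates")
    m = median(slopes)
    # Intercept: median of the values y_i - m * x_i.
    b = median(y - m * x for x, y in points)
    return m, b


# Illustrative data: points on y = 2x + 1, with one gross outlier at x = 4.
data = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 100), (5, 11)]
slope, intercept = theil_sen(data)
print(slope, intercept)
```

In this toy example ten of the fifteen pairwise slopes come from pairs of points on the underlying line, so the median slope and the median intercept are not moved by the single outlier.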

A confidence interval for the slope estimate may be determined as the interval containing the middle 95% of the slopes of lines determined by pairs of points,^{[11]} and may be estimated quickly by sampling pairs of points and determining the 95% interval of the sampled slopes. According to simulations, approximately 600 sample pairs are sufficient to determine an accurate confidence interval.^{[8]}
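The sampling approach can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `sampled_slope_interval`, the fixed random seed, and the index-based percentile rule are all choices made here, and the sketch samples pairs of distinct points and skips pairs with equal $x$-coordinates rather than following the with-replacement procedure described in note 11.

```python
import random


def sampled_slope_interval(points, n_pairs=600, level=0.95, seed=0):
    """Approximate the middle `level` fraction of pairwise slopes by sampling."""
    if len({x for x, _ in points}) < 2:
        raise ValueError("need at least two distinct x-coordinates")
    rng = random.Random(seed)
    slopes = []
    while len(slopes) < n_pairs:
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 != x2:  # skip pairs that do not define a slope
            slopes.append((y2 - y1) / (x2 - x1))
    slopes.sort()
    # Drop the same number of sampled slopes from each tail.
    cut = int((1.0 - level) / 2.0 * n_pairs)
    return slopes[cut], slopes[n_pairs - 1 - cut]


# Illustrative data: exact points on y = 3x - 2, so every slope equals 3.
exact = [(x, 3 * x - 2) for x in range(20)]
low, high = sampled_slope_interval(exact)
```

The default of 600 sampled pairs follows the figure quoted above; `points` is assumed to be a list so that `random.Random.sample` can draw from it.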

## Variations

A variation of the Theil–Sen estimator due to Siegel, the repeated median estimator, determines, for each sample point $(x_i, y_i)$, the median $m_i$ of the slopes $(y_j - y_i)/(x_j - x_i)$ of lines through that point, and then determines the overall estimator as the median of these medians.

A different variant pairs up sample points by the rank of their $x$-coordinates (the point with the smallest coordinate being paired with the first point above the median coordinate, etc.) and computes the median of the slopes of the lines determined by these pairs of points.^{[12]}

Variations of the Theil–Sen estimator based on weighted medians have also been studied, on the principle that pairs of samples whose $x$-coordinates differ more greatly are more likely to have an accurate slope and therefore should receive a higher weight.^{[13]}
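As a rough illustration of the weighted idea, the sketch below weights each pairwise slope by the horizontal separation $|x_j - x_i|$ of its two points. That particular weight, the lower-weighted-median rule, and the function names are assumptions chosen for simplicity; the studies cited above may use different weighting schemes.

```python
from itertools import combinations


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    ordered = sorted(zip(values, weights))
    half = sum(weights) / 2
    running = 0.0
    for value, weight in ordered:
        running += weight
        if running >= half:
            return value
    raise ValueError("weighted_median() needs at least one value")


def weighted_theil_sen_slope(points):
    """Weighted median of pairwise slopes, weighted by |x_j - x_i|."""
    slopes, weights = [], []
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 != x2:
            slopes.append((y2 - y1) / (x2 - x1))
            weights.append(abs(x2 - x1))  # assumed weight: horizontal separation
    return weighted_median(slopes, weights)


# Illustrative data: points on y = 2x + 1, with one gross outlier at x = 4.
data = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 100), (5, 11)]
print(weighted_theil_sen_slope(data))
```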

For seasonal data, it may be appropriate to smooth out seasonal variations in the data by considering only pairs of sample points that both belong to the same month or the same season of the year, and finding the median of the slopes of the lines determined by this more restrictive set of pairs.^{[14]}
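A minimal sketch of the seasonal restriction follows, assuming each observation carries an explicit season label. The record layout `(season, x, y)`, the function name, and the toy series are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations
from statistics import median


def seasonal_slope(observations):
    """Median slope over pairs of (season, x, y) records sharing a season."""
    by_season = defaultdict(list)
    for season, x, y in observations:
        by_season[season].append((x, y))
    slopes = [
        (y2 - y1) / (x2 - x1)
        for pts in by_season.values()
        for (x1, y1), (x2, y2) in combinations(pts, 2)
        if x1 != x2
    ]
    if not slopes:
        raise ValueError("no same-season pair with distinct x-coordinates")
    return median(slopes)


# Toy series: a trend of 1 unit per time step, with "summer" readings
# shifted upward by 20 units.
records = [
    ("summer" if t % 2 else "winter", t, t + (20 if t % 2 else 0))
    for t in range(8)
]
trend = seasonal_slope(records)
print(trend)
```

Because cross-season pairs are never formed, the 20-unit seasonal offset in the toy series does not enter any slope, and only the underlying trend is measured.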

## Statistical properties

The Theil–Sen estimator is an unbiased estimator of the true slope in simple linear regression.^{[15]} For many distributions of the response error, this estimator has high asymptotic efficiency relative to least-squares estimation.^{[16]} Estimators with low efficiency require more independent observations to attain the same sample variance as efficient unbiased estimators.

The Theil–Sen estimator is more robust than the least-squares estimator because it is much less sensitive to outliers. It has a breakdown point of $1 - \tfrac{1}{\sqrt{2}} \approx 29.3\%$, meaning that it can tolerate arbitrary corruption of up to 29.3% of the input data points without degradation of its accuracy.^{[9]} However, the breakdown point decreases for higher-dimensional generalizations of the method.^{[17]} A higher breakdown point, 50%, holds for the repeated median estimator of Siegel.^{[9]}
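One way to see where the 29.3% figure comes from is the following heuristic, a sketch that assumes a fraction $\varepsilon$ of the sample points is corrupted and that a pairwise slope can be trusted only when both of its endpoints are uncorrupted. For a large sample, the fraction of such clean pairs is approximately $(1-\varepsilon)^2$, and the median slope is bracketed by clean slopes as long as clean pairs form a majority:

$$
(1-\varepsilon)^2 > \tfrac{1}{2}
\quad\Longleftrightarrow\quad
\varepsilon < 1 - \tfrac{1}{\sqrt{2}} \approx 0.293 .
$$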

The Theil–Sen estimator is equivariant under every linear transformation of its response variable,^{[18]} but is not equivariant under affine transformations of both the predictor and response variables.^{[17]}

## Algorithms

The median slope of a set of $n$ sample points may be computed exactly by computing all $O(n^2)$ lines through pairs of points and then applying a linear-time median-finding algorithm, or it may be estimated by sampling pairs of points. It is equivalent, under projective duality, to the problem of finding the crossing point in an arrangement of lines that has the median $x$-coordinate among all such crossing points.
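The equivalence can be checked directly under one common form of point-line duality, assumed here for illustration since the text does not fix a convention: map each sample point $(x_i, y_i)$ to the line $v = x_i u - y_i$ in a dual $(u, v)$ plane. The dual lines of two points with distinct $x$-coordinates cross where

$$
x_i u - y_i = x_j u - y_j
\quad\Longrightarrow\quad
u = \frac{y_j - y_i}{x_j - x_i},
$$

so the horizontal coordinate of each crossing point equals the slope of the line through the corresponding pair of sample points, and the crossing with the median horizontal coordinate gives the Theil–Sen slope.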

The problem of performing slope selection exactly but more efficiently than the brute-force quadratic-time algorithm has been extensively studied in computational geometry. Several different methods are known for computing the Theil–Sen estimator exactly in $O(n \log n)$ time, either deterministically^{[3]} or using randomized algorithms.^{[4]} Siegel's repeated median estimator can also be constructed efficiently in the same time bound.^{[19]} In models of computation in which the input coordinates are integers and bitwise operations on integers take constant time, the problem can be solved even more quickly, in randomized expected time $O(n \sqrt{\log n})$.^{[20]}

An estimator for the slope with approximately median rank, having the same breakdown point as the Theil–Sen estimator, may be maintained in the data stream model (in which the sample points are processed one by one by an algorithm that does not have enough persistent storage to represent the entire data set) using an algorithm based on ε-nets.^{[21]}

## Applications

Theil–Sen estimation has been applied to astronomy because of its ability to handle censored regression models.^{[22]} In biophysics, its use has been suggested for remote sensing applications such as the estimation of leaf area from reflectance data, because of its "simplicity in computation, analytical estimates of confidence intervals, robustness to outliers, testable assumptions regarding residuals and ... limited a priori information regarding measurement errors". For measuring seasonal environmental data such as water quality, a seasonally adjusted variant of the Theil–Sen estimator has been proposed as preferable to least-squares estimation because of its high precision in the presence of skewed data.^{[14]} In computer science, the Theil–Sen method has been used to estimate trends in software aging.^{[23]}

## See also

- Regression dilution, for another problem affecting estimated trend slopes

## Notes

1. Template:Harvtxt.
2. Template:Harvtxt.
3. Template:Harvtxt; Template:Harvtxt; Template:Harvtxt.
4. Template:Harvtxt; Template:Harvtxt; Template:Harvtxt.
5. Template:Harvtxt.
6. Template:Harvtxt; Template:Harvtxt.
7. Template:Harvtxt.
8. Template:Harvtxt.
9. Template:Harvtxt, pp. 67, 164.
10. Template:Harvtxt.
11. For determining confidence intervals, pairs of points must be sampled with replacement; this means that the set of pairs used in this calculation includes pairs in which both points are the same as each other. These pairs are always outside the confidence interval, because they do not determine a well-defined slope value, but using them as part of the calculation causes the confidence interval to be wider than it would be without them.
12. Template:Harvtxt.
13. Template:Harvtxt; Template:Harvtxt; Template:Harvtxt; Template:Harvtxt.
14. Template:Harvtxt.
15. Template:Harvtxt, Theorem 5.1, p. 1384; Template:Harvtxt.
16. Template:Harvtxt, Section 6; Template:Harvtxt.
17. Template:Harvtxt.
18. Template:Harvtxt, p. 1383.
19. Template:Harvtxt.
20. Template:Harvtxt.
21. Template:Harvtxt.
22. Template:Harvtxt.
23. Template:Harvtxt.

## External links

- Kendall-Theil Robust Line (KTRLine—version 1.0) public-domain Visual Basic software for Theil–Sen estimation published by the United States Geological Survey