James–Stein estimator: Difference between revisions

From Wikipedia, the free encyclopedia
The James–Stein estimator: this formula was inconsistent with the previous one: (m-3) and m \ge 4 had to be wrong.
Added more explicit description of the needed variables, history and its significance, cleaned up wording and description in the Setting. No fundamental changes are made.
Line 1: Line 1:
{{Short description|Biased estimator for Gaussian random vectors, better than ordinary least-squared-error minimization}}{{technical|date=November 2017}}
{{Short description|Biased estimator for Gaussian random vectors, better than ordinary least-squared-error minimization}}{{technical|date=November 2017}}
The '''James–Stein estimator''' is a [[Bias of an estimator|biased]] [[estimator]] of the [[mean]] of Gaussian [[random vector]]s. It can be shown that the James–Stein estimator [[dominating decision rule|dominates]] the "ordinary" [[least squares]] approach, i.e., it has lower [[mean squared error]]. It is the best-known example of [[Stein's phenomenon]].
The '''James–Stein estimator''' is a [[Bias of an estimator|biased]] [[estimator]] of the [[mean]], <math>\boldsymbol\theta</math>, of (possibly) [[Correlation and dependence|correlated]] [[Normal distribution|Gaussian distributed]] [[random vector]]s <math>Y = \{Y_1, Y_2, ..., Y_m\}</math> with unknown means <math>\{\boldsymbol\theta_1, \boldsymbol\theta_2, ..., \boldsymbol\theta_m\}</math>.


It arose in two published papers. [[Charles Stein (statistician)|Charles Stein]] developed an earlier version of the estimator in 1956<ref name="stein-56">{{Citation|last=Stein|first=C.|title=Proc. Third Berkeley Symp. Math. Statist. Prob.|url=http://projecteuclid.org/euclid.bsmsp/1200501656|volume=1|pages=197–206|year=1956|contribution=Inadmissibility of the usual estimator for the mean of a multivariate distribution|mr=0084922|zbl=0073.35602|author-link=Charles Stein (statistician)}}</ref> and reached the surprising conclusion that the then-usual estimate of the mean, the sample mean, written by Stein and James as <math>\widehat{\boldsymbol\theta}(Y) = Y</math>, is [[Admissible decision rule|admissible]] when <math>m \leq 2</math> but [[Admissible decision rule|inadmissible]] when <math>m \geq 3</math>. Stein proposed a possible improvement that [[Shrinkage (statistics)|shrinks]] the sample means <math>Y_i</math> towards a more central mean vector <math>\boldsymbol\nu</math> (which can be chosen [[A priori and a posteriori|a priori]] or, commonly, taken as the "average of averages" of the sample means when all samples share the same size). This result is commonly referred to as '''[[Stein's example|Stein's example or paradox]]'''. Willard James and Charles Stein improved on it in 1961 by simplifying the original process.<ref name="james-stein-61">{{Citation|last=James|first=W.|title=Proc. Fourth Berkeley Symp. Math. Statist. Prob.|url=http://projecteuclid.org/euclid.bsmsp/1200512173|volume=1|pages=361–379|year=1961|contribution=Estimation with quadratic loss|mr=0133191|last2=Stein|first2=C.|author2-link=Charles Stein (statistician)}}</ref>
An earlier version of the estimator was developed by [[Charles Stein (statistician)|Charles Stein]] in 1956,<ref name="stein-56">{{Citation

| last = Stein | first = C.
It can be shown that the James–Stein estimator [[dominating decision rule|dominates]] the "ordinary" [[least squares]] approach, meaning that the James–Stein estimator has a [[mean squared error]] lower than or equal to that of the "ordinary" least squares estimator.
| author-link = Charles Stein (statistician)
| contribution = Inadmissibility of the usual estimator for the mean of a multivariate distribution
| title = Proc. Third Berkeley Symp. Math. Statist. Prob.
| year = 1956
| volume = 1
| pages = 197–206
| url = http://projecteuclid.org/euclid.bsmsp/1200501656
| mr = 0084922 | zbl = 0073.35602
}}</ref> and is sometimes referred to as '''Stein's estimator'''.{{Citation needed|date=October 2010}} The result was improved by Willard James and Charles Stein in 1961.<ref name="james-stein-61">{{Citation
| last = James | first = W.
| last2 = Stein | first2 = C. | author2-link = Charles Stein (statistician)
| contribution = Estimation with quadratic loss
| title = Proc. Fourth Berkeley Symp. Math. Statist. Prob.
| year = 1961
| volume = 1
| pages = 361–379
| url = http://projecteuclid.org/euclid.bsmsp/1200512173
|mr = 0133191
}}</ref>


== Setting ==
== Setting ==
Let <math>
Suppose the vector <math>\boldsymbol\theta</math> is the unknown [[Expected value|mean]] of a [[Multivariate normal distribution|<math>m</math>-variate normally distributed]] (with known [[covariance matrix]] <math>\sigma^2 I </math>) [[random variable]] <math>{\mathbf Y}</math>:
{\mathbf Y} \sim N_m({\boldsymbol \theta}, \sigma^2 I),\,
:<math>
</math>where the vector <math>\boldsymbol\theta</math> is the unknown [[Expected value|mean]] of <math>{\mathbf Y}</math>, which is [[Multivariate normal distribution|<math>m</math>-variate normally distributed]] and with known [[covariance matrix]] <math>\sigma^2 I </math>.
{\mathbf Y} \sim N({\boldsymbol \theta}, \sigma^2 I).\,
</math>


We are interested in obtaining an estimate <math>\widehat{\boldsymbol \theta} </math> of <math>\boldsymbol\theta</math>, based on a single observation, <math>{\mathbf y} </math>, of <math>{\mathbf Y} </math>.
We are interested in obtaining an estimate, <math>\widehat{\boldsymbol \theta} </math>, of <math>\boldsymbol\theta</math>, based on a single observation, <math>{\mathbf y} </math>, of <math>{\mathbf Y} </math>.


This is an everyday situation in which a set of parameters is measured, and the measurements are corrupted by independent Gaussian noise. Since the noise has zero mean, it is very reasonable to use the measurements themselves as an estimate of the parameters. This is the approach of the [[least squares]] estimator, which is <math>\widehat{\boldsymbol \theta}_{LS} = {\mathbf y}</math>.
In real-world applications, this is a common situation: a set of parameters is sampled, and the samples are corrupted by independent [[Gaussian noise]]. Since this noise has zero mean, it may be reasonable to use the samples themselves as an estimate of the parameters. This approach is the [[least squares]] estimator, <math>\widehat{\boldsymbol \theta}_{LS} = {\mathbf y}</math>.


As a result, there was considerable shock and disbelief when Stein demonstrated that, in terms of [[mean squared error]] <math>\operatorname{E} \left[ \left\| {\boldsymbol \theta}-\widehat {\boldsymbol \theta} \right\|^2 \right]</math>, this approach is suboptimal.<ref name="stein-56"/> The result became known as [[Stein's phenomenon]].
Stein demonstrated that, in terms of [[mean squared error]] <math>\operatorname{E} \left[ \left\| {\boldsymbol \theta}-\widehat {\boldsymbol \theta} \right\|^2 \right]</math>, the least squares estimator, <math>\widehat{\boldsymbol \theta}_{LS}</math>, is suboptimal compared with shrinkage-based estimators such as the '''James–Stein estimator''', <math>\widehat{\boldsymbol \theta}_{JS}</math>.<ref name="stein-56"/> The paradoxical result, that there is an estimate of <math>\boldsymbol\theta</math> that is (possibly) better and never worse in mean squared error than the sample mean, became known as [[Stein's phenomenon]].


== The James–Stein estimator ==
== The James–Stein estimator ==
Line 62: Line 45:
</math>
</math>


The James–Stein estimator dominates the usual estimator for any '''''ν'''''. A natural question to ask is whether the improvement over the usual estimator is independent of the choice of '''''ν'''''. The answer is no. The improvement is small if <math>\|{\boldsymbol\theta - \boldsymbol\nu}\|</math> is large. Thus to get a very great improvement some knowledge of the location of '''''θ''''' is necessary. Of course this is the quantity we are trying to estimate so we don't have this knowledge a priori. But we may have some guess as to what the mean vector is. This can be considered a disadvantage of the estimator: the choice is not objective as it may depend on the beliefs of the researcher.
The James–Stein estimator dominates the usual estimator for any '''''ν'''''. A natural question to ask is whether the improvement over the usual estimator is independent of the choice of '''''ν'''''. The answer is no. The improvement is small if <math>\|{\boldsymbol\theta - \boldsymbol\nu}\|</math> is large. Thus to get a very great improvement some knowledge of the location of '''''θ''''' is necessary. Of course this is the quantity we are trying to estimate so we don't have this knowledge [[A priori and a posteriori|a priori]]. But we may have some guess as to what the mean vector is. This can be considered a disadvantage of the estimator: the choice is not objective as it may depend on the beliefs of the researcher.


== Interpretation ==
== Interpretation ==

Revision as of 20:55, 27 October 2020

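The James–Stein estimator is a biased estimator of the mean, <math>\boldsymbol\theta</math>, of (possibly) correlated Gaussian distributed random vectors <math>Y = \{Y_1, Y_2, \ldots, Y_m\}</math> with unknown means <math>\{\boldsymbol\theta_1, \boldsymbol\theta_2, \ldots, \boldsymbol\theta_m\}</math>.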

It arose in two published papers. Charles Stein developed an earlier version of the estimator in 1956[1] and reached the surprising conclusion that the then-usual estimate of the mean, the sample mean, written by Stein and James as <math>\widehat{\boldsymbol\theta}(Y) = Y</math>, is admissible when <math>m \leq 2</math> but inadmissible when <math>m \geq 3</math>. Stein proposed a possible improvement that shrinks the sample means <math>Y_i</math> towards a more central mean vector <math>\boldsymbol\nu</math> (which can be chosen a priori or, commonly, taken as the "average of averages" of the sample means when all samples share the same size). This result is commonly referred to as Stein's example or paradox. Willard James and Charles Stein improved on it in 1961 by simplifying the original process.[2]

It can be shown that the James–Stein estimator dominates the "ordinary" least squares approach, meaning that the James–Stein estimator has a mean squared error lower than or equal to that of the "ordinary" least squares estimator.

Setting

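Let <math>{\mathbf Y} \sim N_m({\boldsymbol \theta}, \sigma^2 I)</math>, where the vector <math>\boldsymbol\theta</math> is the unknown mean of <math>{\mathbf Y}</math>, which is <math>m</math>-variate normally distributed with known covariance matrix <math>\sigma^2 I</math>.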

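We are interested in obtaining an estimate, <math>\widehat{\boldsymbol \theta}</math>, of <math>\boldsymbol\theta</math>, based on a single observation, <math>{\mathbf y}</math>, of <math>{\mathbf Y}</math>.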

In real-world applications, this is a common situation: a set of parameters is sampled, and the samples are corrupted by independent Gaussian noise. Since this noise has zero mean, it may be reasonable to use the samples themselves as an estimate of the parameters. This approach is the least squares estimator, <math>\widehat{\boldsymbol \theta}_{LS} = {\mathbf y}</math>.

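Stein demonstrated that, in terms of mean squared error <math>\operatorname{E} \left[ \left\| {\boldsymbol \theta}-\widehat {\boldsymbol \theta} \right\|^2 \right]</math>, the least squares estimator, <math>\widehat{\boldsymbol \theta}_{LS}</math>, is suboptimal compared with shrinkage-based estimators such as the James–Stein estimator, <math>\widehat{\boldsymbol \theta}_{JS}</math>.[1] The paradoxical result, that there is an estimate of <math>\boldsymbol\theta</math> that is (possibly) better and never worse in mean squared error than the sample mean, became known as Stein's phenomenon.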

The James–Stein estimator

MSE (R) of least squares estimator (ML) vs. James–Stein estimator (JS). The James–Stein estimator gives its best estimate when the norm of the actual parameter vector θ is near zero.

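If <math>\sigma^2</math> is known, the James–Stein estimator is given by

<math>\widehat{\boldsymbol \theta}_{JS} = \left( 1 - \frac{(m-2) \sigma^2}{\|{\mathbf y}\|^2} \right) {\mathbf y}.</math>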

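James and Stein showed that the above estimator dominates <math>\widehat{\boldsymbol \theta}_{LS}</math> for any <math>m \ge 3</math>, meaning that the James–Stein estimator always achieves lower mean squared error (MSE) than the maximum likelihood estimator, which in this setting coincides with the least squares estimator.[2][3] By definition, this makes the least squares estimator inadmissible when <math>m \ge 3</math>.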
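
The dominance claim can be made concrete with the standard risk identity for the origin-shrinking estimator <math>\left(1 - (m-2)\sigma^2/\|{\mathbf y}\|^2\right){\mathbf y}</math>. The identity follows from Stein's lemma and is given here as a sketch in the notation of this section, not as a quotation from the cited papers:

```latex
\operatorname{E}\left[\left\|\widehat{\boldsymbol\theta}_{JS} - \boldsymbol\theta\right\|^2\right]
  = m\sigma^2 - (m-2)^2\,\sigma^4\,
    \operatorname{E}\left[\frac{1}{\|\mathbf y\|^2}\right]
  < m\sigma^2
  = \operatorname{E}\left[\left\|\widehat{\boldsymbol\theta}_{LS} - \boldsymbol\theta\right\|^2\right],
\qquad m \ge 3.
```

At <math>\boldsymbol\theta = 0</math>, the quantity <math>\|{\mathbf y}\|^2/\sigma^2</math> is chi-squared with <math>m</math> degrees of freedom, so <math>\operatorname{E}[1/\|{\mathbf y}\|^2] = 1/((m-2)\sigma^2)</math> and the risk falls to <math>2\sigma^2</math> regardless of <math>m</math>. The gain shrinks as <math>\|\boldsymbol\theta\|</math> grows.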

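Notice that if <math>(m-2) \sigma^2 < \|{\mathbf y}\|^2</math> then this estimator simply takes the natural estimator <math>\mathbf y</math> and shrinks it towards the origin 0. In fact this is not the only direction of shrinkage that works. Let ν be an arbitrary fixed vector of length <math>m</math>. Then there exists an estimator of the James–Stein type that shrinks toward ν, namely

<math>\widehat{\boldsymbol \theta}_{JS} = \left( 1 - \frac{(m-2) \sigma^2}{\|{\mathbf y} - {\boldsymbol\nu}\|^2} \right) ({\mathbf y}-{\boldsymbol\nu}) + {\boldsymbol\nu}, \qquad m \ge 3.</math>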

The James–Stein estimator dominates the usual estimator for any ν. A natural question is whether the improvement over the usual estimator is independent of the choice of ν. The answer is no. The improvement is small if <math>\|{\boldsymbol\theta - \boldsymbol\nu}\|</math> is large, so a substantial improvement requires some knowledge of the location of θ. Of course, θ is the quantity we are trying to estimate, so we do not have this knowledge a priori, but we may have some guess as to what the mean vector is. This can be considered a disadvantage of the estimator: the choice of ν is not objective, as it may depend on the beliefs of the researcher.
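
The estimator of this section, in the form that shrinks a single observation toward a fixed vector ν with shrinkage factor 1 − (m − 2)σ²/‖y − ν‖², can be sketched in a few lines of Python. The function name `james_stein` and its argument names are illustrative choices, not taken from any library, and the handling of the degenerate case y = ν is an assumption of this sketch.

```python
from typing import List, Optional, Sequence


def james_stein(y: Sequence[float], sigma2: float,
                nu: Optional[Sequence[float]] = None) -> List[float]:
    """Basic James-Stein estimate from one observation y ~ N(theta, sigma2 * I).

    nu is the fixed shrinkage target; it defaults to the origin.
    """
    m = len(y)
    if m < 3:
        raise ValueError("the dominance result requires m >= 3")
    if nu is None:
        nu = [0.0] * m
    diff = [yi - ni for yi, ni in zip(y, nu)]
    sq_norm = sum(d * d for d in diff)
    if sq_norm == 0.0:
        # Assumption of this sketch: refuse the undefined 0/0 case.
        raise ValueError("y equals nu; the shrinkage factor is undefined")
    factor = 1.0 - (m - 2) * sigma2 / sq_norm
    return [ni + factor * d for ni, d in zip(nu, diff)]


y = [3.0, 4.0, 0.0, 0.0]                      # m = 4, ||y||^2 = 25
print(james_stein(y, sigma2=1.0))             # shrinkage factor 1 - 2/25
print(james_stein(y, sigma2=1.0, nu=[1.0] * 4))  # shrinks toward (1, 1, 1, 1)
```

With ν at the origin the observation is simply scaled by the shrinkage factor; with a nonzero ν the residual y − ν is scaled and ν is added back.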

Interpretation

Seeing the James–Stein estimator as an empirical Bayes method gives some intuition to this result: One assumes that θ itself is a random variable with prior distribution <math>\sim N(0, A)</math>, where A is estimated from the data itself. Estimating A only gives an advantage compared to the maximum-likelihood estimator when the dimension <math>m</math> is large enough; hence it does not work for <math>m \leq 2</math>. The James–Stein estimator is a member of a class of Bayesian estimators that dominate the maximum-likelihood estimator.[4]
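
The empirical Bayes reading can be written out as a short derivation. This is a sketch under the assumption that the prior is θ ∼ N(0, A·I) with the same A for every component, and that σ² is known:

```latex
\begin{aligned}
\boldsymbol\theta &\sim N(0, A\,I), \qquad
\mathbf y \mid \boldsymbol\theta \sim N(\boldsymbol\theta, \sigma^2 I)
&&\text{(model)}\\
\operatorname{E}[\boldsymbol\theta \mid \mathbf y]
  &= \left(1 - \frac{\sigma^2}{A + \sigma^2}\right)\mathbf y
&&\text{(Bayes estimator for known } A\text{)}\\
\mathbf y &\sim N\bigl(0, (A + \sigma^2) I\bigr)
  \;\Longrightarrow\;
  \frac{\|\mathbf y\|^2}{A + \sigma^2} \sim \chi^2_m
&&\text{(marginal distribution)}\\
\operatorname{E}\left[\frac{(m-2)\,\sigma^2}{\|\mathbf y\|^2}\right]
  &= \frac{\sigma^2}{A + \sigma^2}, \qquad m \ge 3
&&\text{(since } \operatorname{E}[1/\chi^2_m] = 1/(m-2)\text{)}
\end{aligned}
```

Replacing the unknown ratio σ²/(A + σ²) by its unbiased estimate (m − 2)σ²/‖y‖² turns the Bayes estimator into the James–Stein estimator. The expectation of 1/χ²ₘ is infinite for m ≤ 2, which is one way to see why the construction needs at least three dimensions.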

A consequence of the above discussion is the following counterintuitive result: When three or more unrelated parameters are measured, their total MSE can be reduced by using a combined estimator such as the James–Stein estimator; whereas when each parameter is estimated separately, the least squares (LS) estimator is admissible. A quirky example would be estimating the speed of light, tea consumption in Taiwan, and hog weight in Montana, all together. The James–Stein estimator always improves upon the total MSE, i.e., the sum of the expected squared errors of the components. Therefore, the total MSE in measuring light speed, tea consumption, and hog weight would improve by using the James–Stein estimator. However, any particular component (such as the speed of light) would improve for some parameter values and deteriorate for others. Thus, although the James–Stein estimator dominates the LS estimator when three or more parameters are estimated, no single component of it dominates the respective component of the LS estimator.
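
The difference between total and per-component MSE can be illustrated with a small Monte Carlo experiment. The sketch below is only an illustration: the parameter vector (one component equal to 4, nine equal to 0), the unit variance, the trial count, and the seed are all arbitrary assumptions, and the helper names are hypothetical.

```python
import random


def js_shrink(y, sigma2):
    """Origin-shrinking James-Stein estimate from a single observation y."""
    m = len(y)
    sq_norm = sum(v * v for v in y)
    factor = 1.0 - (m - 2) * sigma2 / sq_norm
    return [factor * v for v in y]


def componentwise_mse(theta, sigma2=1.0, trials=5000, seed=0):
    """Monte Carlo estimate of per-component MSE for LS (y itself) and JS."""
    rng = random.Random(seed)
    m = len(theta)
    sigma = sigma2 ** 0.5
    mse_ls = [0.0] * m
    mse_js = [0.0] * m
    for _ in range(trials):
        # One noisy observation of every parameter.
        y = [rng.gauss(t, sigma) for t in theta]
        est = js_shrink(y, sigma2)
        for i in range(m):
            mse_ls[i] += (y[i] - theta[i]) ** 2 / trials
            mse_js[i] += (est[i] - theta[i]) ** 2 / trials
    return mse_ls, mse_js


theta = [4.0] + [0.0] * 9   # assumed values: one outlying parameter
mse_ls, mse_js = componentwise_mse(theta)
print("total MSE    LS: %.2f  JS: %.2f" % (sum(mse_ls), sum(mse_js)))
print("component 0  LS: %.2f  JS: %.2f" % (mse_ls[0], mse_js[0]))
```

In this configuration the total MSE of the James–Stein estimate comes out lower than that of the LS estimate, while the outlying component 0 is estimated worse, because it is pulled toward the origin along with the others.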

The conclusion from this hypothetical example is that measurements should be combined if one is interested in minimizing their total MSE. For example, in a telecommunication setting, it is reasonable to combine channel tap measurements in a channel estimation scenario, as the goal is to minimize the total channel estimation error. Conversely, there could be objections to combining channel estimates of different users, since no user would want their channel estimate to deteriorate in order to improve the average network performance.[citation needed]

The James–Stein estimator has also found use in fundamental quantum theory, where the estimator has been used to improve the theoretical bounds of the entropic uncertainty principle (a recent development of the Heisenberg uncertainty principle) for more than two measurements.[5]

Improvements

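The basic James–Stein estimator has the peculiar property that for small values of <math>\|{\mathbf y} - {\boldsymbol\nu}\|</math> the multiplier on <math>{\mathbf y} - {\boldsymbol\nu}</math> is actually negative. This can be easily remedied by replacing this multiplier by zero when it is negative. The resulting estimator is called the positive-part James–Stein estimator and is given by

<math>\widehat{\boldsymbol \theta}_{JS+} = \left( 1 - \frac{(m-2) \sigma^2}{\|{\mathbf y} - {\boldsymbol\nu}\|^2} \right)^+ ({\mathbf y}-{\boldsymbol\nu}) + {\boldsymbol\nu}, \qquad m \ge 3,</math>

where <math>(x)^+ = \max(0, x)</math>.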

This estimator has a smaller risk than the basic James–Stein estimator. It follows that the basic James–Stein estimator is itself inadmissible.[6]

It turns out, however, that the positive-part estimator is also inadmissible.[3] This follows from a general result which requires admissible estimators to be smooth.
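
The positive-part rule described above, which replaces a negative multiplier by zero, is a small change in code: the shrinkage factor 1 − (m − 2)σ²/‖y − ν‖² is clipped at zero. The following is a minimal sketch with hypothetical function names; returning ν when y equals ν exactly is a convention assumed here, not something stated in the text.

```python
def james_stein_positive_part(y, sigma2, nu=None):
    """Positive-part James-Stein estimate from one observation y."""
    m = len(y)
    if nu is None:
        nu = [0.0] * m
    diff = [yi - ni for yi, ni in zip(y, nu)]
    sq_norm = sum(d * d for d in diff)
    if sq_norm == 0.0:
        # Assumed convention: the clipped factor is zero, so return nu.
        return list(nu)
    # Clip the multiplier at zero so the estimate never overshoots nu.
    factor = max(0.0, 1.0 - (m - 2) * sigma2 / sq_norm)
    return [ni + factor * d for ni, d in zip(nu, diff)]


y = [0.5, 0.5, 0.5, 0.5]   # ||y||^2 = 1, so the basic multiplier is 1 - 2 = -1
print(james_stein_positive_part(y, sigma2=1.0))
```

For this observation the basic estimator would flip the sign of every component, whereas the positive-part version returns the shrinkage target itself.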

Extensions

The James–Stein estimator may seem at first sight to be a result of some peculiarity of the problem setting. In fact, the estimator exemplifies a very wide-ranging effect; namely, the fact that the "ordinary" or least squares estimator is often inadmissible for simultaneous estimation of several parameters.[citation needed] This effect has been called Stein's phenomenon, and has been demonstrated for several different problem settings, some of which are briefly outlined below.

  • James and Stein demonstrated that the estimator presented above can still be used when the variance <math>\sigma^2</math> is unknown, by replacing it with the standard estimator of the variance, <math>\widehat{\sigma}^2</math>. The dominance result still holds under the same condition, namely, <math>m > 2</math>.[2]
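  • The results in this article are for the case when only a single observation vector y is available. For the more general case when <math>n</math> vectors are available, the results are similar:[citation needed]
<math>\widehat{\boldsymbol \theta}_{JS} = \left( 1 - \frac{(m-2) \frac{\sigma^2}{n}}{\|{\overline{\mathbf y}}\|^2} \right) {\overline{\mathbf y}},</math>
where <math>{\overline{\mathbf y}}</math> is the <math>m</math>-length average of the <math>n</math> observations.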
  • The work of James and Stein has been extended to the case of a general measurement covariance matrix, i.e., where measurements may be statistically dependent and may have differing variances.[7] A similar dominating estimator can be constructed, with a suitably generalized dominance condition. This can be used to construct a linear regression technique which outperforms the standard application of the LS estimator.[7]
  • Stein's result has been extended to a wide class of distributions and loss functions. However, this theory provides only an existence result, in that explicit dominating estimators were not actually exhibited.[8] It is quite difficult to obtain explicit estimators improving upon the usual estimator without specific restrictions on the underlying distributions.[3]
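
The multiple-observation case in the list above can be sketched as follows, assuming n independent observation vectors with common known variance σ². Their average then has covariance (σ²/n)·I, so the single-observation rule applies with σ² replaced by σ²/n. The function name `james_stein_from_samples` is hypothetical.

```python
def james_stein_from_samples(samples, sigma2):
    """James-Stein estimate from n observation vectors of length m.

    Shrinks the sample average toward the origin, using sigma2 / n
    as the variance of each averaged component.
    """
    n = len(samples)
    m = len(samples[0])
    if m < 3:
        raise ValueError("the dominance result requires m >= 3")
    # Component-wise average of the n observations.
    y_bar = [sum(s[i] for s in samples) / n for i in range(m)]
    sq_norm = sum(v * v for v in y_bar)
    factor = 1.0 - (m - 2) * (sigma2 / n) / sq_norm
    return [factor * v for v in y_bar]


samples = [[2.0, 4.0, 0.0],
           [4.0, 4.0, 0.0]]            # n = 2, m = 3, average (3, 4, 0)
print(james_stein_from_samples(samples, sigma2=1.0))  # factor 1 - 0.5/25
```

More observations reduce the shrinkage, since the average is already a more precise estimate of θ.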

References

  1. Stein, C. (1956), "Inadmissibility of the usual estimator for the mean of a multivariate distribution", Proc. Third Berkeley Symp. Math. Statist. Prob., vol. 1, pp. 197–206, MR 0084922, Zbl 0073.35602
  2. James, W.; Stein, C. (1961), "Estimation with quadratic loss", Proc. Fourth Berkeley Symp. Math. Statist. Prob., vol. 1, pp. 361–379, MR 0133191
  3. Lehmann, E. L.; Casella, G. (1998), Theory of Point Estimation (2nd ed.), New York: Springer
  4. Efron, B.; Morris, C. (1973), "Stein's Estimation Rule and Its Competitors—An Empirical Bayes Approach", Journal of the American Statistical Association, 68 (341): 117–130, doi:10.2307/2284155, JSTOR 2284155
  5. Stander, M. (2017), Using Stein's estimator to correct the bound on the entropic uncertainty principle for more than two measurements, arXiv:1702.02440, Bibcode:2017arXiv170202440S
  6. Anderson, T. W. (1984), An Introduction to Multivariate Statistical Analysis (2nd ed.), New York: John Wiley & Sons
  7. Bock, M. E. (1975), "Minimax estimators of the mean of a multivariate normal distribution", Annals of Statistics, 3 (1): 209–218, doi:10.1214/aos/1176343009, MR 0381064, Zbl 0314.62005
  8. Brown, L. D. (1966), "On the admissibility of invariant estimators of one or more location parameters", Annals of Mathematical Statistics, 37 (5): 1087–1136, doi:10.1214/aoms/1177699259, MR 0216647, Zbl 0156.39401

Further reading

  • Judge, George G.; Bock, M. E. (1978). The Statistical Implications of Pre-Test and Stein-Rule Estimators in Econometrics. New York: North Holland. pp. 229–257. ISBN 0-7204-0729-X.