Talk:James–Stein estimator: Difference between revisions

From Wikipedia, the free encyclopedia
(43 intermediate revisions by 25 users not shown)
{{WikiProject banner shell|class=B|
{{WikiProject Statistics|importance = mid}}
{{WikiProject Mathematics|importance = Mid}}
}}

Is the assumption of equal variances fundamental to this? Should say one way or another. <small>—The preceding [[Wikipedia:Sign your posts on talk pages|unsigned]] comment was added by [[Special:Contributions/65.217.188.20|65.217.188.20]] ([[User talk:65.217.188.20|talk]]) {{{2|}}}.</small>
:Thanks for the comment. The assumption of equal variances is not required. I will add some information about this shortly. --[[User:Zvika|Zvika]] 19:29, 27 September 2006 (UTC)
::Looking forward to this addition. Also, what can be done if the variances are not known? After all, if <math>\theta</math> is not known then probably <math>\sigma^2</math> is not either. (Can you use some version of the sample variances, for instance?) Thanks! [[User:Eclecticos|Eclecticos]] ([[User talk:Eclecticos|talk]]) 05:26, 5 October 2008 (UTC)

Thanks for the great article on the James-Stein estimator. I think you may also want to mention the connection to Empirical Bayes methods (e.g., as discussed by Efron and Morris in their paper "Stein's Estimation Rule and Its Competitors--An Empirical Bayes Approach"). Personally, I found the Empirical Bayes explanation provided some very useful intuition to the "magic" of this estimator. <small class="autosigned">—&nbsp;Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[Special:Contributions/131.239.52.20|131.239.52.20]] ([[User talk:131.239.52.20|talk]]) 17:54, 18 April 2007 (UTC)</small><!-- Template:Unsigned IP -->
:Thanks for the compliment! Your suggestion sounds like a good idea. [[User:Billjefferys]] recently [[Talk:Stein's example|suggested]] a similar addition to the article [[Stein's example]], but neither of us has gotten around to working on it yet. --[[User:Zvika|Zvika]] 07:55, 19 April 2007 (UTC)

== dimensionality of y ==

A confusing point about this article: y is described as "observations" of an m-dimensional vector <math>\theta</math>, suggesting that it should be an m by n matrix, where n is the number of observations. However, this doesn't conform to the use of y in the formula for the James-Stein estimator, where y appears to be a single m-dimensional vector. (Is there some mean involved? Is <math>||y||^2</math> computed over all mn scalars?) Furthermore, can we still apply some version of the James-Stein technique in the case where we have more observations of <math>\theta_1</math> than of <math>\theta_2</math>, i.e., there is not a single n? Thanks for any clarification in the article. [[User:Eclecticos|Eclecticos]] ([[User talk:Eclecticos|talk]]) 05:19, 5 October 2008 (UTC)

:The setting in the article describes a case where there is one observation per parameter. I have added a clarifying comment to this effect. In the situation you describe, in which several independent observations are given per parameter, the mean of these observations is a [[sufficient statistic]] for estimating θ, so that this setting can be reduced to the one in the article. --[[User:Zvika|Zvika]] ([[User talk:Zvika|talk]]) 05:48, 5 October 2008 (UTC)

::The wording is still unclear, especially the sentence: "Suppose θ is an unknown parameter vector of length m, and let y be a vector of observations of θ (also of length m)". How can a vector of m-dimensional observations have length m? --[[User:StefanVanDerWalt|StefanVanDerWalt]] ([[User talk:StefanVanDerWalt|talk]]) 11:07, 1 February 2010 (UTC)

:::Indeed, it does not make sense. I'll give it a shot. [[Special:Contributions/84.238.115.164|84.238.115.164]] ([[User talk:84.238.115.164|talk]]) 19:49, 17 February 2010 (UTC)
:::: Me too. What do you think of my edit? [[User:Yak90|Yak90]] ([[User talk:Yak90|talk]]) 08:05, 24 September 2017 (UTC)

:Is the formula using σ<sup>2</sup>/n<sub>i</sub> applicable for different sample sizes in groups? In ''Morris, 1983, Parametric Empirical Bayes Inference: Theory and Applications'', it is claimed that a more general version of Stein's estimator (which is also derived there) is needed if the variances V<sub>i</sub> are unequal, where V<sub>i</sub> denotes σ<sup>2</sup><sub>i</sub>/n<sub>i</sub>, so as I understand it, Stein's formula is only applicable for equal n<sub>i</sub> as well.
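
:The reduction described above (several independent observations per parameter, replaced by their mean) can be sketched as follows. This is only a minimal illustration, assuming a known common variance σ<sup>2</sup>, the same number n of observations for every parameter, and shrinkage toward zero; the function names <code>james_stein</code> and <code>james_stein_from_replicates</code> are made up for this sketch and do not come from the article.

```python
import numpy as np

def james_stein(y, sigma2):
    """Shrink a single m-vector y toward zero; sigma2 is the known per-component variance."""
    y = np.asarray(y, dtype=float)
    m = y.size
    if m < 3:
        raise ValueError("James-Stein shrinkage needs m >= 3")
    return (1.0 - (m - 2) * sigma2 / np.dot(y, y)) * y

def james_stein_from_replicates(obs, sigma2):
    """obs has shape (m, n): n i.i.d. observations for each of the m parameters."""
    obs = np.asarray(obs, dtype=float)
    n = obs.shape[1]
    ybar = obs.mean(axis=1)               # one sufficient statistic per parameter
    return james_stein(ybar, sigma2 / n)  # each mean has variance sigma2 / n

# Hypothetical data: m = 4 parameters, n = 25 observations each, sigma = 1.
rng = np.random.default_rng(0)
theta = np.array([1.0, -0.5, 2.0, 0.0])
obs = theta[:, None] + rng.normal(size=(4, 25))
print(james_stein_from_replicates(obs, sigma2=1.0))
```

:With unequal n<sub>i</sub> the means no longer share one variance, so this sketch does not cover that case.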

== Bias ==

The estimator is always biased, right? I think this is worth mentioning directly in the article. [[User:Lavaka|Lavaka]] ([[User talk:Lavaka|talk]]) 02:09, 22 March 2011 (UTC)
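
:A short derivation may help with the bias question. It assumes the basic form of the estimator that shrinks toward zero, with known σ<sup>2</sup> and a single observation '''y''' distributed as N('''θ''', σ<sup>2</sup>I); taking expectations of the estimator gives the bias directly.

```latex
\widehat{\boldsymbol\theta}_{JS}
  = \left(1 - \frac{(m-2)\,\sigma^2}{\|\mathbf y\|^2}\right)\mathbf y
\quad\Longrightarrow\quad
\operatorname{E}\bigl[\widehat{\boldsymbol\theta}_{JS}\bigr] - \boldsymbol\theta
  = -(m-2)\,\sigma^2\,
    \operatorname{E}\!\left[\frac{\mathbf y}{\|\mathbf y\|^2}\right]
```

:At '''θ''' = 0 the expectation on the right vanishes by symmetry, so the bias is zero at that single point; for '''θ''' ≠ 0 the expectation has a positive component along '''θ''', so the estimator is biased toward the origin.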

== Risk functions ==

The graph of the MSE functions needs to be more precise: we are in the case where ν=0, probably m=10 and σ=1, aren't we? (I thought that, in this case, for θ = 0, the MSE should be equal to 2; maybe the red curve represents the positive-part JS?) <span style="font-size: smaller;" class="autosigned">—Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[Special:Contributions/82.244.59.11|82.244.59.11]] ([[User talk:82.244.59.11|talk]]) 15:40, 10 May 2011 (UTC)</span><!-- Template:UnsignedIP --> <!--Autosigned by SineBot-->
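
:The value 2 at θ = 0 can be checked numerically. For the plain (not positive-part) estimator shrinking toward zero, ‖y‖<sup>2</sup>/σ<sup>2</sup> is chi-squared with m degrees of freedom at θ = 0, and E[1/χ<sup>2</sup><sub>m</sub>] = 1/(m−2), so the total risk is mσ<sup>2</sup> − (m−2)σ<sup>2</sup> = 2σ<sup>2</sup>. The following Monte Carlo sketch assumes m=10, σ=1 and measures squared error summed over all components; the helper name <code>js_total_risk</code> is made up for this illustration.

```python
import numpy as np

def js_total_risk(theta, sigma=1.0, trials=200_000, seed=0):
    """Monte Carlo estimate of E||theta_hat_JS - theta||^2 for shrinkage toward zero."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    m = theta.size
    # One m-dimensional observation per trial.
    y = theta + sigma * rng.normal(size=(trials, m))
    shrink = 1.0 - (m - 2) * sigma**2 / np.sum(y**2, axis=1, keepdims=True)
    err = shrink * y - theta
    return np.mean(np.sum(err**2, axis=1))

risk_at_zero = js_total_risk(np.zeros(10))
print(risk_at_zero)  # theory for the plain estimator at theta = 0: 2 * sigma^2
```

:The positive-part variant would give a smaller value at θ = 0, so a curve that starts below 2 would be consistent with that variant.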

== Extensions ==
In the case of unknown variance, multiple observations are necessary, right? Thus it would make sense to swap bullet points 1 and 2 and reference the then first from the second. Also, the "usual estimator of the variance" is a bit dubious to me. Shouldn't it be something like: <math>\widehat{\sigma}^2 = \frac{1}{ m(n-1)+2}\sum_i \left\| y_i-\overline{y} \right\| _2</math>?
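
:For reference, here is a sketch of one way the unknown-variance case is usually handled, as far as I recall the classical result: if S is distributed as σ<sup>2</sup>χ<sup>2</sup><sub>k</sub> independently of the observation, then σ<sup>2</sup> is replaced by S/(k+2). With n observations for each of m parameters, k = m(n−1), which matches the m(n−1)+2 denominator proposed above. The function name is made up, and equal n and a common variance are assumed.

```python
import numpy as np

def james_stein_unknown_variance(obs):
    """obs has shape (m, n); shrink the row means toward zero with estimated variance."""
    obs = np.asarray(obs, dtype=float)
    m, n = obs.shape
    if m < 3 or n < 2:
        raise ValueError("need m >= 3 parameters and n >= 2 observations each")
    ybar = obs.mean(axis=1)
    # Pooled within-group sum of squares: sigma^2 times chi-squared with k dof.
    s = np.sum((obs - ybar[:, None]) ** 2)
    k = m * (n - 1)
    # Plug-in for the variance of each mean (sigma^2 / n), using S / (k + 2).
    var_of_mean = s / (n * (k + 2))
    return (1.0 - (m - 2) * var_of_mean / np.dot(ybar, ybar)) * ybar

# Hypothetical data: m = 3 parameters, n = 2 observations each.
print(james_stein_unknown_variance([[0.0, 2.0], [1.0, 3.0], [2.0, 4.0]]))
```

:Note that the sum of squares here uses the squared norm of the deviations from each parameter's own mean.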

== Always or on average? ==

Currently the lead says

:''the James–Stein estimator [[dominating decision rule|dominates]] the "ordinary" [[least squares]] approach, i.e., it has lower mean squared error '''on average'''.''

But two sections later the article says

:''the James–Stein estimator '''always''' achieves lower [[mean squared error]] (MSE) than the [[maximum likelihood]] estimator. By definition, this makes the least squares estimator [[admissible decision rule|inadmissible]] when <math>m \ge 3</math>.''

(Bolding is mine.) This appears contradictory, and I suspect the passage in the lead should be changed from "on average" to "always". Since I don't know for sure, I won't change it myself. [[User:Loraof|Loraof]] ([[User talk:Loraof|talk]]) 23:55, 14 October 2017 (UTC)

:It sounds like the first one doubles the "mean". On average the squared error is lower. The mean squared error is lower. There is nothing more to average over. If you have a specific sample the squared error of the James-Stein estimator can be worse. --[[User:Mfb|mfb]] ([[User talk:Mfb|talk]]) 03:09, 15 October 2017 (UTC)
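
:The distinction can be seen in a small simulation: the mean of the squared error is lower for James–Stein, yet in a sizeable share of individual samples its squared error is higher. This is only a sketch under assumed values (m=5, σ=1, every component of θ equal to 1.5, shrinkage toward zero), none of which come from the article.

```python
import numpy as np

rng = np.random.default_rng(1)
m, sigma, trials = 5, 1.0, 100_000
theta = np.full(m, 1.5)  # assumed true parameter vector

# One m-dimensional observation per trial; the MLE is y itself.
y = theta + sigma * rng.normal(size=(trials, m))
shrink = 1.0 - (m - 2) * sigma**2 / np.sum(y**2, axis=1, keepdims=True)

se_mle = np.sum((y - theta) ** 2, axis=1)          # squared error per sample
se_js = np.sum((shrink * y - theta) ** 2, axis=1)

mse_mle = se_mle.mean()
mse_js = se_js.mean()
frac_js_worse = np.mean(se_js > se_mle)  # share of samples where JS does worse

print(mse_mle, mse_js, frac_js_worse)
```

:So "always" refers to every value of θ, not to every sample.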

== Concrete examples? ==

This article would be greatly improved if some concrete examples were given in the lead and text so that laymen might have some idea of what the subject deals with in the real world. [[User:Medeis|μηδείς]] ([[User talk:Medeis|talk]]) 23:17, 8 November 2017 (UTC)

:There is an example starting at "A quirky example". I'm not sure if there are real world implications. --[[User:Mfb|mfb]] ([[User talk:Mfb|talk]]) 07:31, 9 November 2017 (UTC)

::I agree with {{u|Medeis}}. And by "concrete", I don't mean more hand-waving, I mean an actual variance-covariance matrix and set of observations, so that I can check the claim for myself. [[User:Maproom|Maproom]] ([[User talk:Maproom|talk]]) 09:42, 22 June 2018 (UTC)

== Using a single observation?? ==

Is this correct?

:"We are interested in obtaining an estimate <math>\widehat{\boldsymbol \theta} </math> of <math>\boldsymbol\theta</math>, based on a single observation, <math>{\mathbf y} </math>, of <math>{\mathbf Y} </math>."

How can you get an estimate from a single observation? Presumably the means along each dimension of <math>\boldsymbol\theta</math> are uncorrelated... [[User:Danski14|Danski14]]<sup>[[User talk:Danski14|(talk)]]</sup> 19:53, 2 March 2018 (UTC)
:: apparently it is right. They use the prior to get the estimate. Nevermind [[User:Danski14|Danski14]]<sup>[[User talk:Danski14|(talk)]]</sup> 20:13, 4 April 2018 (UTC)

:You make a single observation (in all dimensions), and then you either use this observation as your estimate or you do something else with it. --[[User:Mfb|mfb]] ([[User talk:Mfb|talk]]) 05:02, 5 April 2018 (UTC)

== Using James-Stein with a linear regression ==

I am wondering how to use James-Stein in an ordinary least squares regression.

First, if <math>\beta</math> are the coefficient estimates for an OLS (I skipped the hat), is the following the formula for shrinking it towards zero:

<math>\hat\beta^{JS}=\beta \left(1- \frac{(p-2)\sigma^2_n}{\beta'\beta}\right)</math>

where <math>\sigma_n^2</math> is the true variance (I might substitute the sample variance here), and p is the number of parameters in <math>\beta</math>. (I'm a bit fuzzy on whether <math>\alpha</math>, the constant in the regression, is in the <math>\beta</math>.)

I guessed this formula from [https://www.ssc.wisc.edu/~bhansen/papers/er_16.pdf "The risk of James-Stein and Lasso Shrinkage"], but I don't know if it's right.

Second, what would the formula be for the confidence intervals of the shrunken <math>\beta</math> estimates?

[[User:Dfrankow|dfrankow]] ([[User talk:Dfrankow|talk]]) 20:46, 12 May 2020 (UTC)

:I think you should look at https://journals.sagepub.com/doi/abs/10.1177/0008068320090105 and, more directly, at https://www.sciencedirect.com/science/article/abs/pii/S0378375806002813; both show that you need to minimize [[KL-divergence]]. This is completely missing from the article now, and I just added a see-also note... Maybe someone can expand this foundational aspect. [[User:Biggerj1|Biggerj1]] ([[User talk:Biggerj1|talk]]) 20:08, 23 October 2024 (UTC)
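
:On the regression question itself, a hedged sketch: the OLS coefficients have covariance σ<sup>2</sup>(X'X)<sup>−1</sup> rather than σ<sup>2</sup>I, so the shrinkage factor is normally written with the weighted norm β'X'Xβ in place of β'β; the two coincide only when X'X = I. The code below assumes a known σ<sup>2</sup>, shrinks every coefficient (including any constant column) toward zero, and targets the prediction-type loss (β̂−β)'X'X(β̂−β). The function name <code>james_stein_ols</code> is made up, and this is not a statement of what the linked paper does.

```python
import numpy as np

def james_stein_ols(X, y, sigma2):
    """Shrink OLS coefficients toward zero using the X'X-weighted norm (sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = X.shape[1]
    if p < 3:
        raise ValueError("shrinkage needs p >= 3 coefficients")
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares
    fitted = X @ beta_hat
    weighted_norm = fitted @ fitted                   # equals beta_hat' X'X beta_hat
    return (1.0 - (p - 2) * sigma2 / weighted_norm) * beta_hat

# With an orthonormal design this reduces to the formula asked about above.
print(james_stein_ols(np.eye(4), [1.0, 2.0, 3.0, 4.0], sigma2=1.0))
```

:Confidence intervals for the shrunken coefficients are not covered by this sketch.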

== KL divergence, reference important but not discussed ==

see above two papers for the foundational role of KL divergence here. [[User:Biggerj1|Biggerj1]] ([[User talk:Biggerj1|talk]]) 20:09, 23 October 2024 (UTC)

Latest revision as of 20:09, 23 October 2024

Is the assumption of equal variances fundamental to this? Should say one way or another. —The preceding unsigned comment was added by 65.217.188.20 (talk).

Thanks for the comment. The assumption of equal variances is not required. I will add some information about this shortly. --Zvika 19:29, 27 September 2006 (UTC)
Looking forward to this addition. Also, what can be done if the variances are not known? After all, if <math>\theta</math> is not known then probably <math>\sigma^2</math> is not either. (Can you use some version of the sample variances, for instance?) Thanks! Eclecticos (talk) 05:26, 5 October 2008 (UTC)

Thanks for the great article on the James-Stein estimator. I think you may also want to mention the connection to Empirical Bayes methods (e.g., as discussed by Efron and Morris in their paper "Stein's Estimation Rule and Its Competitors--An Empirical Bayes Approach"). Personally, I found the Empirical Bayes explanation provided some very useful intuition to the "magic" of this estimator. — Preceding unsigned comment added by 131.239.52.20 (talk) 17:54, 18 April 2007 (UTC)

Thanks for the compliment! Your suggestion sounds like a good idea. User:Billjefferys recently suggested a similar addition to the article Stein's example, but neither of us has gotten around to working on it yet. --Zvika 07:55, 19 April 2007 (UTC)

dimensionality of y

A confusing point about this article: y is described as "observations" of an m-dimensional vector <math>\theta</math>, suggesting that it should be an m by n matrix, where n is the number of observations. However, this doesn't conform to the use of y in the formula for the James-Stein estimator, where y appears to be a single m-dimensional vector. (Is there some mean involved? Is <math>||y||^2</math> computed over all mn scalars?) Furthermore, can we still apply some version of the James-Stein technique in the case where we have more observations of <math>\theta_1</math> than of <math>\theta_2</math>, i.e., there is not a single n? Thanks for any clarification in the article. Eclecticos (talk) 05:19, 5 October 2008 (UTC)

The setting in the article describes a case where there is one observation per parameter. I have added a clarifying comment to this effect. In the situation you describe, in which several independent observations are given per parameter, the mean of these observations is a sufficient statistic for estimating θ, so that this setting can be reduced to the one in the article. --Zvika (talk) 05:48, 5 October 2008 (UTC)
The wording is still unclear, especially the sentence: "Suppose θ is an unknown parameter vector of length m, and let y be a vector of observations of θ (also of length m)". How can a vector of m-dimensional observations have length m? --StefanVanDerWalt (talk) 11:07, 1 February 2010 (UTC)
Indeed, it does not make sense. I'll give it a shot. 84.238.115.164 (talk) 19:49, 17 February 2010 (UTC)
Me too. What do you think of my edit? Yak90 (talk) 08:05, 24 September 2017 (UTC)
Is the formula using σ²/n_i applicable for different sample sizes in groups? In Morris, 1983, Parametric Empirical Bayes Inference: Theory and Applications, it is claimed that a more general version of Stein's estimator (which is also derived there) is needed if the variances V_i are unequal, where V_i denotes σ_i²/n_i, so as I understand it, Stein's formula is only applicable for equal n_i as well.

Bias

The estimator is always biased, right? I think this is worth mentioning directly in the article. Lavaka (talk) 02:09, 22 March 2011 (UTC)

Risk functions

The graph of the MSE functions needs to be more precise: we are in the case where ν=0, probably m=10 and σ=1, aren't we? (I thought that, in this case, for θ = 0, the MSE should be equal to 2; maybe the red curve represents the positive-part JS?) —Preceding unsigned comment added by 82.244.59.11 (talk) 15:40, 10 May 2011 (UTC)

Extensions

In the case of unknown variance, multiple observations are necessary, right? Thus it would make sense to swap bullet points 1 and 2 and reference the then first from the second. Also, the "usual estimator of the variance" is a bit dubious to me. Shouldn't it be something like: <math>\widehat{\sigma}^2 = \frac{1}{ m(n-1)+2}\sum_i \left\| y_i-\overline{y} \right\| _2</math>?

Always or on average?

Currently the lead says

the James–Stein estimator dominates the "ordinary" least squares approach, i.e., it has lower mean squared error on average.

But two sections later the article says

the James–Stein estimator always achieves lower mean squared error (MSE) than the maximum likelihood estimator. By definition, this makes the least squares estimator inadmissible when <math>m \ge 3</math>.

(Bolding is mine.) This appears contradictory, and I suspect the passage in the lead should be changed from "on average" to "always". Since I don't know for sure, I won't change it myself. Loraof (talk) 23:55, 14 October 2017 (UTC)

It sounds like the first one doubles the "mean". On average the squared error is lower. The mean squared error is lower. There is nothing more to average over. If you have a specific sample the squared error of the James-Stein estimator can be worse. --mfb (talk) 03:09, 15 October 2017 (UTC)

Concrete examples?

This article would be greatly improved if some concrete examples were given in the lead and text so that laymen might have some idea of what the subject deals with in the real world. μηδείς (talk) 23:17, 8 November 2017 (UTC)

There is an example starting at "A quirky example". I'm not sure if there are real world implications. --mfb (talk) 07:31, 9 November 2017 (UTC)
I agree with Medeis. And by "concrete", I don't mean more hand-waving, I mean an actual variance-covariance matrix and set of observations, so that I can check the claim for myself. Maproom (talk) 09:42, 22 June 2018 (UTC)

Using a single observation??

Is this correct?

"We are interested in obtaining an estimate <math>\widehat{\boldsymbol \theta} </math> of <math>\boldsymbol\theta</math>, based on a single observation, <math>{\mathbf y} </math>, of <math>{\mathbf Y} </math>."

How can you get an estimate from a single observation? Presumably the means along each dimension of <math>\boldsymbol\theta</math> are uncorrelated... Danski14 (talk) 19:53, 2 March 2018 (UTC)

apparently it is right. They use the prior to get the estimate. Nevermind Danski14 (talk) 20:13, 4 April 2018 (UTC)
You make a single observation (in all dimensions), and then you either use this observation as your estimate or you do something else with it. --mfb (talk) 05:02, 5 April 2018 (UTC)

Using James-Stein with a linear regression

I am wondering how to use James-Stein in an ordinary least squares regression.

First, if <math>\beta</math> are the coefficient estimates for an OLS (I skipped the hat), is the following the formula for shrinking it towards zero:

<math>\hat\beta^{JS}=\beta \left(1- \frac{(p-2)\sigma^2_n}{\beta'\beta}\right)</math>

where <math>\sigma_n^2</math> is the true variance (I might substitute the sample variance here), and p is the number of parameters in <math>\beta</math>. (I'm a bit fuzzy on whether <math>\alpha</math>, the constant in the regression, is in the <math>\beta</math>.)

I guessed this formula from "The risk of James-Stein and Lasso Shrinkage", but I don't know if it's right.

Second, what would the formula be for the confidence intervals of the shrunken <math>\beta</math> estimates?

dfrankow (talk) 20:46, 12 May 2020 (UTC)

I think you should look at https://journals.sagepub.com/doi/abs/10.1177/0008068320090105 and, more directly, at https://www.sciencedirect.com/science/article/abs/pii/S0378375806002813; both show that you need to minimize KL-divergence. This is completely missing from the article now, and I just added a see-also note... Maybe someone can expand this foundational aspect. Biggerj1 (talk) 20:08, 23 October 2024 (UTC)

KL divergence, reference important but not discussed

see above two papers for the foundational role of KL divergence here. Biggerj1 (talk) 20:09, 23 October 2024 (UTC)