Generalized least squares: Difference between revisions
|||
(31 intermediate revisions by 23 users not shown) | |||
Line 1: | Line 1: | ||
{{Short description|Statistical estimation technique}} |
{{Short description|Statistical estimation technique}} |
||
{{Distinguish|generalized linear model}} |
{{Distinguish|generalized linear model}} |
||
{{Tone|date=April 2023}} |
|||
{{Regression bar}} |
{{Regression bar}} |
||
In [[statistics]], '''generalized least squares |
In [[statistics]], '''generalized least squares (GLS)''' is a method used to estimate the unknown parameters in a [[Linear regression|linear regression model]]. It is used when there is a non-zero amount of [[correlation]] between the [[Residual (statistics)|residuals]] in the regression model. GLS is employed to improve [[efficiency_(statistics)|statistical efficiency]] and reduce the risk of drawing erroneous inferences, as compared to conventional [[least squares]] and [[weighted least squares]] methods. It was first described by [[Alexander Aitken]] in 1935.<ref>{{cite journal |last1=Aitken |first1=A. C. |year=1935 |title=On Least Squares and Linear Combinations of Observations |journal=Proceedings of the Royal Society of Edinburgh |volume=55 |pages=42–48 |doi=10.1017/s0370164600014346}}</ref> |
||
It requires knowledge of the [[covariance matrix]] for the residuals. If this is unknown, estimating the covariance matrix gives the method of feasible generalized least squares (FGLS). However, FGLS provides fewer guarantees of improvement. |
|||
== Method == |
== Method == |
||
In standard [[linear regression]] models, one observes data <math>\{y_i,x_{ij}\}_{i=1, \dots, n,j=2, \dots, k}</math> on ''n'' [[statistical unit]]s with '' |
In standard [[linear regression]] models, one observes data <math>\{y_i,x_{ij}\}_{i=1, \dots, n,j=2, \dots, k}</math> on ''n'' [[statistical unit]]s with ''k'' − 1 predictor values and one response value each. |
||
The response values are placed in a vector,<math display="block">\mathbf{y} \equiv |
The response values are placed in a vector,<math display="block">\mathbf{y} \equiv |
||
Line 17: | Line 18: | ||
y_n |
y_n |
||
\end{pmatrix}, |
\end{pmatrix}, |
||
</math> |
|||
and the predictor values are placed in the [[design matrix]],<math display="block">\mathbf{X} \equiv |
|||
\begin{pmatrix} |
\begin{pmatrix} |
||
1 & x_{12} & x_{13} & \cdots & x_{1k} |
1 & x_{12} & x_{13} & \cdots & x_{1k} |
||
Line 27: | Line 29: | ||
1 & x_{n2} & x_{n3} & \cdots & x_{nk} |
1 & x_{n2} & x_{n3} & \cdots & x_{nk} |
||
\end{pmatrix} |
\end{pmatrix} |
||
,</math> |
|||
where each row is a vector of the <math>k</math> predictor variables (including a constant) for the <math>i</math>th data point. |
|||
The model assumes that the [[conditional mean]] of <math>\mathbf{y}</math> given <math>\mathbf{X}</math> to be a linear function of <math>\mathbf{X}</math> and that the conditional [[variance]] of the error term given <math>\mathbf{X}</math> is a known [[Invertible matrix|non-singular]] [[covariance matrix]], <math>\mathbf{\Omega}</math>. That is,<math display="block"> |
The model assumes that the [[conditional mean]] of <math>\mathbf{y}</math> given <math>\mathbf{X}</math> to be a linear function of <math>\mathbf{X}</math> and that the conditional [[variance]] of the error term given <math>\mathbf{X}</math> is a known [[Invertible matrix|non-singular]] [[covariance matrix]], <math>\mathbf{\Omega}</math>. That is,<math display="block"> |
||
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \operatorname{E}[\boldsymbol\varepsilon\mid\mathbf{X}]=0, \quad \operatorname{Cov}[\boldsymbol\varepsilon\mid\mathbf{X}]= \boldsymbol{\Omega}, </math>where <math>\boldsymbol\beta \in \mathbb{R}^k</math> is a vector of unknown constants, called |
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \operatorname{E}[\boldsymbol\varepsilon\mid\mathbf{X}]=0, \quad \operatorname{Cov}[\boldsymbol\varepsilon\mid\mathbf{X}]= \boldsymbol{\Omega}, </math> |
||
where <math>\boldsymbol\beta \in \mathbb{R}^k</math> is a vector of unknown constants, called "regression coefficients", which are estimated from the data. |
|||
If <math>\mathbf{b}</math> is a candidate estimate for <math>\boldsymbol{\beta}</math>, then the [[errors and residuals in statistics|residual]] vector for <math>\mathbf{b}</math> is <math>\mathbf{y}- \mathbf{X} \mathbf{b}</math>. The generalized least squares method estimates <math>\boldsymbol{\beta}</math> by minimizing the squared [[Mahalanobis distance|Mahalanobis length]] of this residual vector:<math display="block"> |
If <math>\mathbf{b}</math> is a candidate estimate for <math>\boldsymbol{\beta}</math>, then the [[errors and residuals in statistics|residual]] vector for <math>\mathbf{b}</math> is <math>\mathbf{y}- \mathbf{X} \mathbf{b}</math>. The generalized least squares method estimates <math>\boldsymbol{\beta}</math> by minimizing the squared [[Mahalanobis distance|Mahalanobis length]] of this residual vector:<math display="block"> |
||
Line 38: | Line 42: | ||
\end{align} |
\end{align} |
||
</math>which is equivalent to |
</math>which is equivalent to<math display="block"> |
||
{\hat{\boldsymbol\beta}} = \underset{ \mathbf{b}}\operatorname{arg min}\,\mathbf{y}^{\mathrm{T}}\,\mathbf{\Omega}^{-1}\mathbf{y} + \mathbf{b}^{\mathrm{T}} \mathbf{X}^{\mathrm{T}} \mathbf{\Omega}^{-1} \mathbf{X} \mathbf{b} -2 \mathbf{b}^{\mathrm{T}} \mathbf{X} ^{\mathrm{T}}\mathbf{\Omega}^{-1}\mathbf{y}, </math>which is a [[quadratic programming]] problem. The stationary point of the objective function occurs when |
{\hat{\boldsymbol\beta}} = \underset{ \mathbf{b}}\operatorname{arg min}\,\mathbf{y}^{\mathrm{T}}\,\mathbf{\Omega}^{-1}\mathbf{y} + \mathbf{b}^{\mathrm{T}} \mathbf{X}^{\mathrm{T}} \mathbf{\Omega}^{-1} \mathbf{X} \mathbf{b} -2 \mathbf{b}^{\mathrm{T}} \mathbf{X} ^{\mathrm{T}}\mathbf{\Omega}^{-1}\mathbf{y}, </math>which is a [[quadratic programming]] problem. The stationary point of the objective function occurs when<math display="block"> |
||
2 \mathbf{X}^{\mathrm{T}} \mathbf{\Omega}^{-1} \mathbf{X} { \mathbf{b}} -2 \mathbf{X} ^{\mathrm{T}}\mathbf{\Omega}^{-1}\mathbf{y} = 0 |
2 \mathbf{X}^{\mathrm{T}} \mathbf{\Omega}^{-1} \mathbf{X} { \mathbf{b}} -2 \mathbf{X} ^{\mathrm{T}}\mathbf{\Omega}^{-1}\mathbf{y} = 0 |
||
, |
, |
||
Line 47: | Line 51: | ||
=== Properties === |
=== Properties === |
||
The GLS estimator is [[Bias of an estimator|unbiased]], [[consistent estimator|consistent]], [[efficiency (statistics)|efficient]], and [[asymptotic distribution|asymptotically normal]] with |
The GLS estimator is [[Bias of an estimator|unbiased]], [[consistent estimator|consistent]], [[efficiency (statistics)|efficient]], and [[asymptotic distribution|asymptotically normal]] with<math display="block">\operatorname{E}[\hat\boldsymbol\beta\mid\mathbf{X}] = \boldsymbol\beta, |
||
\quad\text{and}\quad |
\quad\text{and}\quad |
||
\operatorname{Cov}[\hat{\boldsymbol\beta}\mid\mathbf{X}] = (\mathbf{X}^{\mathrm{T}}\boldsymbol\Omega^{-1}\mathbf{X})^{-1}.</math>GLS is equivalent to applying [[ordinary least squares]] (OLS) to a linearly |
\operatorname{Cov}[\hat{\boldsymbol\beta}\mid\mathbf{X}] = (\mathbf{X}^{\mathrm{T}}\boldsymbol\Omega^{-1}\mathbf{X})^{-1}.</math>GLS is equivalent to applying [[ordinary least squares]] (OLS) to a linearly transformed version of the data. This can be seen by factoring <math>\mathbf{\Omega} = \mathbf{C} \mathbf{C}^{ \mathrm{T}}</math> using a method such as [[Cholesky decomposition]]. Left-multiplying both sides of <math>\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}</math> by <math>\mathbf{C}^{-1}</math> yields an equivalent linear model: <math display="block">\mathbf{y}^{*} = \mathbf{X}^{*} \boldsymbol{\beta} + \boldsymbol{\varepsilon}^{*}, |
||
\quad |
\quad |
||
\text{where} |
\text{where} |
||
Line 58: | Line 62: | ||
\quad |
\quad |
||
\boldsymbol{\varepsilon}^{*} = \mathbf{C}^{-1} \boldsymbol{\varepsilon}.</math>In this model, <math>\operatorname{Var}[{\boldsymbol{\varepsilon}}^{*}\mid\mathbf{X}]= \mathbf{C}^{-1} \mathbf{\Omega} \left(\mathbf{C}^{-1} \right)^{\mathrm{T}} = \mathbf{I}</math>, where <math>\mathbf{I}</math> is the [[identity matrix]]. Then, <math>\boldsymbol{\beta}</math> can be efficiently estimated by applying OLS to the transformed data, which requires minimizing the objective,<math display="block"> |
\boldsymbol{\varepsilon}^{*} = \mathbf{C}^{-1} \boldsymbol{\varepsilon}.</math>In this model, <math>\operatorname{Var}[{\boldsymbol{\varepsilon}}^{*}\mid\mathbf{X}]= \mathbf{C}^{-1} \mathbf{\Omega} \left(\mathbf{C}^{-1} \right)^{\mathrm{T}} = \mathbf{I}</math>, where <math>\mathbf{I}</math> is the [[identity matrix]]. Then, <math>\boldsymbol{\beta}</math> can be efficiently estimated by applying OLS to the transformed data, which requires minimizing the objective,<math display="block"> |
||
\left(\mathbf{y}^{*} - \mathbf{X}^{*} \boldsymbol{\beta} \right)^{\mathrm{T}} (\mathbf{y}^{*} - \mathbf{X}^{*} \boldsymbol{\beta}) = (\mathbf{y}- \mathbf{X} \mathbf{b})^{\mathrm{T}}\,\mathbf{\Omega}^{-1}(\mathbf{y}- \mathbf{X} \mathbf{b}). |
\left(\mathbf{y}^{*} - \mathbf{X}^{*} \mathbf{b} \right)^{\mathrm{T}} (\mathbf{y}^{*} - \mathbf{X}^{*} \mathbf{b}) = (\mathbf{y}- \mathbf{X} \mathbf{b})^{\mathrm{T}}\,\mathbf{\Omega}^{-1}(\mathbf{y}- \mathbf{X} \mathbf{b}).</math> |
||
This transformation effectively standardizes the scale of and de-correlates the errors. When OLS is used on data with [[Homoscedasticity and heteroscedasticity|homoscedastic]] and uncorrelated errors, the [[Gauss–Markov theorem]] applies, so the GLS estimate is the [[Blue (statistics)|best linear unbiased estimator]] for ''<math>\boldsymbol{\beta}</math>''. |
|||
== Weighted least squares == |
== Weighted least squares == |
||
{{Main|Weighted least squares}} |
{{Main|Weighted least squares}} |
||
A special case of GLS, called weighted least squares (WLS), occurs when all the off-diagonal entries of Ω are 0. This situation arises when the variances of the observed values are unequal or when [[heteroscedasticity]] is present, but no correlations exist among the observed variances. The weight for unit i is proportional to the reciprocal of the variance of the response for unit ''i''.<ref>{{cite book|author=Strutz, T.| title=Data Fitting and Uncertainty (A practical introduction to weighted least squares and beyond) |publisher=Springer Vieweg | year=2016 | isbn= 978-3-658-11455-8}}, chapter 3</ref> |
A special case of GLS, called weighted least squares (WLS), occurs when all the off-diagonal entries of Ω are 0. This situation arises when the variances of the observed values are unequal (that is, [[heteroscedasticity]] is present) but the errors of distinct observations are uncorrelated. The weight for unit ''i'' is proportional to the reciprocal of the variance of the response for unit ''i''.<ref>{{cite book|author=Strutz, T.| title=Data Fitting and Uncertainty (A practical introduction to weighted least squares and beyond) |publisher=Springer Vieweg | year=2016 | isbn= 978-3-658-11455-8}}, chapter 3</ref> |
||
== Derivation by maximum likelihood estimation == |
== Derivation by maximum likelihood estimation == |
||
[[Ordinary least squares]] can be interpreted as [[maximum likelihood estimation]] with the [[Bayesian prior|prior]] that the errors are independent and normally |
[[Ordinary least squares]] can be interpreted as [[maximum likelihood estimation]] under the assumption that the errors are independent and normally distributed with zero mean and common variance. In GLS, this assumption is generalized to the case where errors may not be independent and may have [[Homoscedasticity and heteroscedasticity|differing variances]]. For given fit parameters <math>\mathbf b</math>, the [[conditional probability density function]] of the errors is assumed to be:<math display="block">p(\boldsymbol\varepsilon| \mathbf b) |
||
= |
= |
||
\frac{1}{\sqrt{(2\pi)^n \det \boldsymbol \Omega }} |
\frac{1}{\sqrt{(2\pi)^n \det \boldsymbol \Omega }} |
||
Line 72: | Line 76: | ||
\frac{1}{2} |
\frac{1}{2} |
||
\boldsymbol \varepsilon^{\mathrm{T}} \boldsymbol \Omega^{-1}\boldsymbol \varepsilon |
\boldsymbol \varepsilon^{\mathrm{T}} \boldsymbol \Omega^{-1}\boldsymbol \varepsilon |
||
\right).</math>By [[Bayes' theorem]],<math display="block">p(\mathbf b | \boldsymbol \varepsilon) |
\right).</math> |
||
By [[Bayes' theorem]],<math display="block">p(\mathbf b | \boldsymbol \varepsilon) |
|||
= |
= |
||
\frac{p(\boldsymbol \varepsilon | \mathbf b) p(\mathbf b )}{p(\boldsymbol \varepsilon)}.</math>In GLS, a [[Uniform prior|uniform (improper) prior]] is taken for <math>p(\mathbf b)</math>, and as <math>p(\boldsymbol\varepsilon)</math> is a marginal distribution, it does not depend on <math>\mathbf b</math>. Therefore the log-probability is |
\frac{p(\boldsymbol \varepsilon | \mathbf b) p(\mathbf b )}{p(\boldsymbol \varepsilon)}.</math>In GLS, a [[Uniform prior|uniform (improper) prior]] is taken for <math>p(\mathbf b)</math>, and as <math>p(\boldsymbol\varepsilon)</math> is a marginal distribution, it does not depend on <math>\mathbf b</math>. Therefore the log-probability is<math display="block">\log p(\mathbf b|\boldsymbol \varepsilon) |
||
= |
= |
||
\log p(\boldsymbol \varepsilon | \mathbf b) |
\log p(\boldsymbol \varepsilon | \mathbf b) |
||
Line 80: | Line 85: | ||
\cdots |
\cdots |
||
= |
= |
||
-\frac{1}{2}\boldsymbol \varepsilon^{\mathrm{T}} \boldsymbol \Omega^{-1} \boldsymbol\varepsilon +\cdots,</math>where the hidden terms are those that do not depend on <math>\mathbf b</math>, and <math>\log p(\boldsymbol \varepsilon | \mathbf b)</math> is the [[log-likelihood]]. The [[Maximum a posteriori estimation|maximum a posteriori]] (MAP) estimate is then the [[Maximum likelihood estimation|maximum likelihood estimate]] (MLE) which is equivalent to the optimization problem from above,<math>{\hat{\boldsymbol{\beta}}} = \underset{\mathbf{b}}\operatorname{argmax} \; p(\mathbf b| \boldsymbol \varepsilon) |
-\frac{1}{2}\boldsymbol \varepsilon^{\mathrm{T}} \boldsymbol \Omega^{-1} \boldsymbol\varepsilon +\cdots,</math>where the hidden terms are those that do not depend on <math>\mathbf b</math>, and <math>\log p(\boldsymbol \varepsilon | \mathbf b)</math> is the [[log-likelihood]]. The [[Maximum a posteriori estimation|maximum a posteriori]] (MAP) estimate is then the [[Maximum likelihood estimation|maximum likelihood estimate]] (MLE), which is equivalent to the optimization problem from above,<math>{\hat{\boldsymbol{\beta}}} = \underset{\mathbf{b}}\operatorname{argmax} \; p(\mathbf b| \boldsymbol \varepsilon) |
||
=\underset{\mathbf{b}}\operatorname{argmax} \; \log p(\mathbf b | \boldsymbol \varepsilon) |
=\underset{\mathbf{b}}\operatorname{argmax} \; \log p(\mathbf b | \boldsymbol \varepsilon) |
||
=\underset{\mathbf{b}}\operatorname{argmax} \; \log p(\boldsymbol \varepsilon | \mathbf b ) |
=\underset{\mathbf{b}}\operatorname{argmax} \; \log p(\boldsymbol \varepsilon | \mathbf b ), |
||
⚫ | |||
⚫ | |||
</math> |
</math> |
||
⚫ | where the optimization problem has been re-written using the fact that the [[logarithm]] is a [[strictly increasing function]] and the property that the argument solving an [[Mathematical optimization|optimization problem]] is independent of terms in the objective function which do not involve the optimization variable. |
||
Substituting <math>\mathbf y - \mathbf X \mathbf b |
|||
</math> |
</math> for <math>\boldsymbol \varepsilon |
||
⚫ | |||
</math>, |
|||
⚫ | |||
⚫ | |||
</math> |
|||
== Feasible generalized least squares == |
== Feasible generalized least squares == |
||
If the covariance of the errors <math>\Omega </math> is unknown one can get a consistent estimate of <math>\Omega |
If the covariance of the errors <math>\Omega </math> is unknown, one can get a consistent estimate of <math>\Omega</math>, say <math>\widehat \Omega </math>,<ref name="Baltagi2008">Baltagi, B. H. (2008). Econometrics (4th ed.). New York: Springer.</ref> using an implementable version of GLS known as the '''feasible generalized least squares'''<!--"Feasible generalized least squares" redirects here; this is bolded per MOS:BOLD--> ('''FGLS''') estimator. |
||
In FGLS, modeling proceeds in two stages: |
In FGLS, modeling proceeds in two stages: |
||
Line 100: | Line 107: | ||
# Then, using the consistent estimator of the covariance matrix of the errors, one can implement GLS ideas. |
# Then, using the consistent estimator of the covariance matrix of the errors, one can implement GLS ideas. |
||
Whereas GLS is more efficient than OLS under [[heteroscedasticity]] (also spelled heteroskedasticity) or [[autocorrelation]], this is not true for FGLS. The feasible estimator is ''asymptotically'' more efficient |
Whereas GLS is more efficient than OLS under [[heteroscedasticity]] (also spelled heteroskedasticity) or [[autocorrelation]], this is not true for FGLS. The feasible estimator is ''asymptotically'' more efficient (provided the errors covariance matrix is consistently estimated), but for a small to medium-sized sample, it can be actually less efficient than OLS. This is why some authors prefer to use OLS and reformulate their inferences by simply considering an alternative estimator for the variance of the estimator robust to heteroscedasticity or serial autocorrelation. However, for large samples, FGLS is preferred over OLS under heteroskedasticity or serial correlation.<ref name="Baltagi2008" /><ref name="Greene2003">Greene, W. H. (2003). Econometric Analysis (5th ed.). Upper Saddle River, NJ: Prentice Hall.</ref> A cautionary note is that the FGLS estimator is not always consistent. One case in which FGLS might be inconsistent is if there are individual-specific fixed effects.<ref>{{Cite journal |last=Hansen |first=Christian B. |title=Generalized Least Squares Inference in Panel and Multilevel Models with Serial Correlation and Fixed Effects |journal=[[Journal of Econometrics]] |year=2007 |volume=140 |issue=2 |pages=670–694 |doi=10.1016/j.jeconom.2006.07.011 }}</ref> |
||
However, for large samples FGLS is preferred over OLS under heteroskedasticity or serial correlation.<ref name="Baltagi2008" /><ref name="Greene2003">Greene, W. H. (2003). Econometric Analysis (5th ed.). Upper Saddle River, NJ: Prentice Hall.</ref> A cautionary note is that the FGLS estimator is not always consistent. One case in which FGLS might be inconsistent is if there are individual specific fixed effects.<ref>{{Cite journal |last=Hansen |first=Christian B. |title=Generalized Least Squares Inference in Panel and Multilevel Models with Serial Correlation and Fixed Effects |journal=[[Journal of Econometrics]] |year=2007 |volume=140 |issue=2 |pages=670–694 |doi=10.1016/j.jeconom.2006.07.011 }}</ref> |
|||
⚫ | In general, this estimator has different properties than GLS. For large samples (i.e., asymptotically), all properties are (under appropriate conditions) common with respect to GLS, but for finite samples, the properties of FGLS estimators are unknown: they vary dramatically with each particular model, and as a general rule, their exact distributions cannot be derived analytically. For finite samples, FGLS may be less efficient than OLS in some cases. Thus, while GLS can be made feasible, it is not always wise to apply this method when the sample is small. A method used to improve the accuracy of the estimators in finite samples is to iterate; that is, to take the residuals from FGLS to update the errors' covariance estimator and then update the FGLS estimation, applying the same idea iteratively until the estimators vary less than some tolerance. However, this method does not necessarily improve the efficiency of the estimator very much if the original sample was small. |
||
⚫ | |||
⚫ | In general, this estimator has different properties than GLS. For large samples (i.e. asymptotically) all properties are (under appropriate conditions) common with respect to GLS, but for finite samples the properties of FGLS estimators are unknown: they vary dramatically with each particular model, and as a general rule their exact distributions cannot be derived analytically. For finite samples, FGLS may be less efficient than OLS in some cases. Thus, while GLS can be made feasible, it is not always wise to apply this method when the sample is small. |
||
⚫ | |||
A method used to improve accuracy of the estimators in finite samples is to iterate, i.e., to take the residuals from FGLS to update the errors' covariance estimator and then update the FGLS estimation, applying the same idea iteratively until the estimators vary less than some tolerance. But this method does not necessarily improve the efficiency of the estimator very much if the original sample was small. |
|||
⚫ | (which is inconsistent in this framework) and instead use a HAC (Heteroskedasticity and Autocorrelation Consistent) estimator. In the context of autocorrelation, the [[Newey–West estimator]] can be used, and in heteroscedastic contexts, the [[Heteroscedasticity-consistent standard errors|Eicker–White estimator]] can be used instead. This approach is much safer, and it is the appropriate path to take unless the sample is large, where "large" is sometimes a slippery issue (e.g., if the error distribution is asymmetric the required sample will be much larger). |
||
⚫ | |||
⚫ | |||
⚫ | (which is inconsistent in this framework) and instead use a HAC (Heteroskedasticity and Autocorrelation Consistent) estimator. |
||
The [[ordinary least squares]] (OLS) estimator is calculated by: |
The [[ordinary least squares]] (OLS) estimator is calculated by: |
||
:<math> |
:<math> |
||
\widehat \beta_\text{OLS} = (X |
\widehat \beta_\text{OLS} = (X^\operatorname{T} X)^{-1} X^\operatorname{T} y |
||
</math> |
</math> |
||
Line 123: | Line 129: | ||
</math> |
</math> |
||
It is important to notice that the squared residuals cannot be used in the previous expression; |
It is important to notice that the squared residuals cannot be used in the previous expression; an estimator of the errors' variances is needed. To do so, a parametric [[Homoscedasticity and heteroscedasticity|heteroskedasticity]] model or nonparametric estimator can be used. |
||
Estimate <math> \beta_{FGLS1}</math> using <math> \widehat{\Omega}_\text{OLS}</math> using<ref name="Greene2003" /> [[weighted least squares]]: |
Estimate <math> \beta_{FGLS1}</math> using <math> \widehat{\Omega}_\text{OLS}</math> using<ref name="Greene2003" /> [[weighted least squares]]: |
||
:<math> |
:<math> |
||
\widehat \beta_{FGLS1} = (X |
\widehat \beta_{FGLS1} = (X^\operatorname{T} \widehat{\Omega}^{-1}_\text{OLS} X)^{-1} X^\operatorname{T} \widehat{\Omega}^{-1}_\text{OLS} y |
||
</math> |
</math> |
||
Line 141: | Line 147: | ||
:<math> |
:<math> |
||
\widehat \beta_{FGLS2} = (X |
\widehat \beta_{FGLS2} = (X^\operatorname{T} \widehat{\Omega}^{-1}_{FGLS1} X)^{-1} X^\operatorname{T} \widehat{\Omega}^{-1}_{FGLS1} y |
||
</math> |
</math> |
||
This estimation of <math>\widehat{\Omega}</math> can be iterated to convergence. |
This estimation of <math>\widehat{\Omega}</math> can be iterated to convergence. |
||
Under regularity conditions the FGLS estimator (or the estimator of its iterations, if |
Under regularity conditions, the FGLS estimator (or the estimator of its iterations, if a finite number of iterations are conducted) is asymptotically distributed as: |
||
: <math> |
: <math> |
||
\sqrt{n}(\hat\beta_{FGLS} - \beta)\ \xrightarrow{d}\ \mathcal{N}\!\left(0,\,V\right) |
\sqrt{n}(\hat\beta_{FGLS} - \beta)\ \xrightarrow{d}\ \mathcal{N}\!\left(0,\,V\right) |
||
</math> |
</math> |
||
where n is the sample size and |
where <math>n</math> is the sample size, and |
||
:<math> |
:<math> |
||
V = \operatorname{p-lim}(X |
V = \operatorname{p-lim}\left(X^\operatorname{T} \Omega^{-1}X/n\right)^{-1} |
||
</math> |
</math> |
||
where <math>\text{p-lim}</math> means [[Convergence of random variables|limit in probability]]. |
|||
== See also == |
== See also == |
||
Line 162: | Line 168: | ||
* [[Degrees of freedom (statistics)#Effective degrees of freedom|Effective degrees of freedom]] |
* [[Degrees of freedom (statistics)#Effective degrees of freedom|Effective degrees of freedom]] |
||
* [[Prais–Winsten estimation]] |
* [[Prais–Winsten estimation]] |
||
* [[Whitening transformation]] |
|||
== References == |
== References == |
Latest revision as of 19:18, 3 November 2024
In statistics, generalized least squares (GLS) is a method used to estimate the unknown parameters in a linear regression model. It is used when there is a non-zero amount of correlation between the residuals in the regression model. GLS is employed to improve statistical efficiency and reduce the risk of drawing erroneous inferences, as compared to conventional least squares and weighted least squares methods. It was first described by Alexander Aitken in 1935.[1]
It requires knowledge of the covariance matrix for the residuals. If this is unknown, estimating the covariance matrix gives the method of feasible generalized least squares (FGLS). However, FGLS provides fewer guarantees of improvement.
Method
In standard linear regression models, one observes data <math>\{y_i,x_{ij}\}_{i=1, \dots, n,j=2, \dots, k}</math> on n statistical units with k − 1 predictor values and one response value each.
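The response values are placed in a vector,<math display="block">\mathbf{y} \equiv \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix},</math>and the predictor values are placed in the design matrix,<math display="block">\mathbf{X} \equiv \begin{pmatrix} 1 & x_{12} & x_{13} & \cdots & x_{1k} \\ 1 & x_{22} & x_{23} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n2} & x_{n3} & \cdots & x_{nk} \end{pmatrix},</math>where each row is a vector of the <math>k</math> predictor variables (including a constant) for the <math>i</math>th data point.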
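The model assumes that the conditional mean of <math>\mathbf{y}</math> given <math>\mathbf{X}</math> is a linear function of <math>\mathbf{X}</math> and that the conditional variance of the error term given <math>\mathbf{X}</math> is a known non-singular covariance matrix, <math>\mathbf{\Omega}</math>. That is,<math display="block">\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \operatorname{E}[\boldsymbol\varepsilon\mid\mathbf{X}]=0, \quad \operatorname{Cov}[\boldsymbol\varepsilon\mid\mathbf{X}]= \boldsymbol{\Omega},</math>where <math>\boldsymbol\beta \in \mathbb{R}^k</math> is a vector of unknown constants, called "regression coefficients", which are estimated from the data.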
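If <math>\mathbf{b}</math> is a candidate estimate for <math>\boldsymbol{\beta}</math>, then the residual vector for <math>\mathbf{b}</math> is <math>\mathbf{y}- \mathbf{X} \mathbf{b}</math>. The generalized least squares method estimates <math>\boldsymbol{\beta}</math> by minimizing the squared Mahalanobis length of this residual vector:<math display="block">\hat{\boldsymbol\beta} = \underset{\mathbf{b}}\operatorname{arg min}\,(\mathbf{y}- \mathbf{X} \mathbf{b})^{\mathrm{T}}\,\mathbf{\Omega}^{-1}(\mathbf{y}- \mathbf{X} \mathbf{b}),</math>which is equivalent to<math display="block">\hat{\boldsymbol\beta} = \underset{\mathbf{b}}\operatorname{arg min}\,\mathbf{y}^{\mathrm{T}}\,\mathbf{\Omega}^{-1}\mathbf{y} + \mathbf{b}^{\mathrm{T}} \mathbf{X}^{\mathrm{T}} \mathbf{\Omega}^{-1} \mathbf{X} \mathbf{b} -2 \mathbf{b}^{\mathrm{T}} \mathbf{X}^{\mathrm{T}}\mathbf{\Omega}^{-1}\mathbf{y},</math>which is a quadratic programming problem. The stationary point of the objective function occurs when<math display="block">2 \mathbf{X}^{\mathrm{T}} \mathbf{\Omega}^{-1} \mathbf{X} \mathbf{b} -2 \mathbf{X}^{\mathrm{T}}\mathbf{\Omega}^{-1}\mathbf{y} = 0,</math>so the estimator is<math display="block">\hat{\boldsymbol\beta} = \left(\mathbf{X}^{\mathrm{T}} \mathbf{\Omega}^{-1} \mathbf{X}\right)^{-1} \mathbf{X}^{\mathrm{T}}\mathbf{\Omega}^{-1}\mathbf{y}.</math>The quantity <math>\mathbf{\Omega}^{-1}</math> is known as the precision matrix (or dispersion matrix), a generalization of the diagonal weight matrix.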
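As an illustration of the closed-form estimator, the following is a minimal NumPy sketch rather than a reference implementation. The function name `gls`, the simulated data, and the AR(1)-style covariance chosen for Ω are assumptions made for this example. It solves linear systems instead of inverting Ω explicitly and then evaluates the stationarity condition at the computed estimate.

```python
import numpy as np

def gls(X, y, omega):
    """Return (X^T Omega^-1 X)^-1 X^T Omega^-1 y for a known covariance omega."""
    # Solve Omega * Z = X and Omega * z = y rather than forming Omega^-1.
    omega_inv_X = np.linalg.solve(omega, X)
    omega_inv_y = np.linalg.solve(omega, y)
    return np.linalg.solve(X.T @ omega_inv_X, X.T @ omega_inv_y)

# Hypothetical data: intercept plus one predictor, correlated errors.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
idx = np.arange(n)
omega = 0.6 ** np.abs(idx[:, None] - idx[None, :])  # assumed known covariance
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + np.linalg.cholesky(omega) @ rng.normal(size=n)

beta_hat = gls(X, y, omega)

# Gradient of the objective at beta_hat: 2 X^T Omega^-1 X b - 2 X^T Omega^-1 y.
grad = 2 * X.T @ np.linalg.solve(omega, X @ beta_hat - y)
print(beta_hat, grad)
```

The gradient should vanish up to floating-point error, which is the stationarity condition stated above.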
Properties
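The GLS estimator is unbiased, consistent, efficient, and asymptotically normal with<math display="block">\operatorname{E}[\hat{\boldsymbol\beta}\mid\mathbf{X}] = \boldsymbol\beta, \quad\text{and}\quad \operatorname{Cov}[\hat{\boldsymbol\beta}\mid\mathbf{X}] = (\mathbf{X}^{\mathrm{T}}\boldsymbol\Omega^{-1}\mathbf{X})^{-1}.</math>GLS is equivalent to applying ordinary least squares (OLS) to a linearly transformed version of the data. This can be seen by factoring <math>\mathbf{\Omega} = \mathbf{C} \mathbf{C}^{\mathrm{T}}</math> using a method such as Cholesky decomposition. Left-multiplying both sides of <math>\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}</math> by <math>\mathbf{C}^{-1}</math> yields an equivalent linear model:<math display="block">\mathbf{y}^{*} = \mathbf{X}^{*} \boldsymbol{\beta} + \boldsymbol{\varepsilon}^{*}, \quad \text{where} \quad \mathbf{y}^{*} = \mathbf{C}^{-1} \mathbf{y}, \quad \mathbf{X}^{*} = \mathbf{C}^{-1} \mathbf{X}, \quad \boldsymbol{\varepsilon}^{*} = \mathbf{C}^{-1} \boldsymbol{\varepsilon}.</math>In this model, <math>\operatorname{Var}[\boldsymbol{\varepsilon}^{*}\mid\mathbf{X}]= \mathbf{C}^{-1} \mathbf{\Omega} \left(\mathbf{C}^{-1} \right)^{\mathrm{T}} = \mathbf{I}</math>, where <math>\mathbf{I}</math> is the identity matrix. Then, <math>\boldsymbol{\beta}</math> can be efficiently estimated by applying OLS to the transformed data, which requires minimizing the objective,<math display="block">\left(\mathbf{y}^{*} - \mathbf{X}^{*} \mathbf{b} \right)^{\mathrm{T}} (\mathbf{y}^{*} - \mathbf{X}^{*} \mathbf{b}) = (\mathbf{y}- \mathbf{X} \mathbf{b})^{\mathrm{T}}\,\mathbf{\Omega}^{-1}(\mathbf{y}- \mathbf{X} \mathbf{b}).</math>This transformation effectively standardizes the scale of and de-correlates the errors. When OLS is used on data with homoscedastic and uncorrelated errors, the Gauss–Markov theorem applies, so the GLS estimate is the best linear unbiased estimator for <math>\boldsymbol{\beta}</math>.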
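The equivalence between GLS and OLS on transformed data can be checked numerically. The sketch below is illustrative only: the random positive-definite Ω and the variable names are assumptions, and the response is arbitrary because only the algebra is being verified.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
A = rng.normal(size=(n, n))
omega = A @ A.T + n * np.eye(n)      # symmetric positive definite by construction
y = rng.normal(size=n)               # arbitrary response for the algebraic check

# Factor Omega = C C^T and whiten the data with C^-1.
C = np.linalg.cholesky(omega)        # lower-triangular factor
X_star = np.linalg.solve(C, X)
y_star = np.linalg.solve(C, y)

# OLS on the transformed data.
beta_ols_star, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)

# Direct GLS formula on the original data.
beta_gls = np.linalg.solve(X.T @ np.linalg.solve(omega, X),
                           X.T @ np.linalg.solve(omega, y))

# Covariance of the transformed errors: C^-1 Omega C^-T, expected to be I.
whitened_cov = np.linalg.solve(C, np.linalg.solve(C, omega).T)
print(np.max(np.abs(beta_ols_star - beta_gls)))
```

Working with the Cholesky factor is a common design choice because it avoids forming <math>\mathbf{\Omega}^{-1}</math> and lets any OLS routine be reused unchanged.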
Weighted least squares
A special case of GLS, called weighted least squares (WLS), occurs when all the off-diagonal entries of Ω are 0. This situation arises when the variances of the observed values are unequal (that is, heteroscedasticity is present) but the errors of distinct observations are uncorrelated. The weight for unit i is proportional to the reciprocal of the variance of the response for unit i.[2]
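A short sketch of this special case follows. The variance function `0.5 * x**2` and the simulated data are hypothetical; the point is only that reciprocal-variance weights reproduce GLS with a diagonal Ω, and that rescaling all weights by a constant leaves the estimate unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
variances = 0.5 * x**2                     # assumed known error variances
y = X @ np.array([0.5, 1.5]) + rng.normal(size=n) * np.sqrt(variances)

# WLS: weights are reciprocals of the variances.
w = 1.0 / variances
beta_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# GLS with the corresponding diagonal covariance matrix.
omega = np.diag(variances)
beta_gls = np.linalg.solve(X.T @ np.linalg.solve(omega, X),
                           X.T @ np.linalg.solve(omega, y))

# Weights only need to be proportional to 1/variance.
w10 = 10.0 * w
beta_scaled = np.linalg.solve(X.T @ (w10[:, None] * X), X.T @ (w10 * y))
print(beta_wls, beta_gls, beta_scaled)
```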
Derivation by maximum likelihood estimation
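Ordinary least squares can be interpreted as maximum likelihood estimation under the assumption that the errors are independent and normally distributed with zero mean and common variance. In GLS, this assumption is generalized to the case where errors may not be independent and may have differing variances. For given fit parameters <math>\mathbf b</math>, the conditional probability density function of the errors is assumed to be:<math display="block">p(\boldsymbol\varepsilon| \mathbf b) = \frac{1}{\sqrt{(2\pi)^n \det \boldsymbol \Omega }} \exp\left(-\frac{1}{2} \boldsymbol \varepsilon^{\mathrm{T}} \boldsymbol \Omega^{-1}\boldsymbol \varepsilon \right).</math>By Bayes' theorem,<math display="block">p(\mathbf b | \boldsymbol \varepsilon) = \frac{p(\boldsymbol \varepsilon | \mathbf b) p(\mathbf b )}{p(\boldsymbol \varepsilon)}.</math>In GLS, a uniform (improper) prior is taken for <math>p(\mathbf b)</math>, and as <math>p(\boldsymbol\varepsilon)</math> is a marginal distribution, it does not depend on <math>\mathbf b</math>. Therefore the log-probability is<math display="block">\log p(\mathbf b|\boldsymbol \varepsilon) = \log p(\boldsymbol \varepsilon | \mathbf b) + \cdots = -\frac{1}{2}\boldsymbol \varepsilon^{\mathrm{T}} \boldsymbol \Omega^{-1} \boldsymbol\varepsilon +\cdots,</math>where the hidden terms are those that do not depend on <math>\mathbf b</math>, and <math>\log p(\boldsymbol \varepsilon | \mathbf b)</math> is the log-likelihood. The maximum a posteriori (MAP) estimate is then the maximum likelihood estimate (MLE), which is equivalent to the optimization problem from above,<math display="block">\hat{\boldsymbol{\beta}} = \underset{\mathbf{b}}\operatorname{argmax} \; p(\mathbf b| \boldsymbol \varepsilon) =\underset{\mathbf{b}}\operatorname{argmax} \; \log p(\mathbf b | \boldsymbol \varepsilon) =\underset{\mathbf{b}}\operatorname{argmax} \; \log p(\boldsymbol \varepsilon | \mathbf b ),</math>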
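where the optimization problem has been re-written using the fact that the logarithm is a strictly increasing function and the property that the argument solving an optimization problem is independent of terms in the objective function which do not involve the optimization variable. Substituting <math>\mathbf y - \mathbf X \mathbf b</math> for <math>\boldsymbol \varepsilon</math>,<math display="block">\hat{\boldsymbol{\beta}} = \underset{\mathbf{b}}\operatorname{argmin} \; \frac{1}{2}(\mathbf y - \mathbf X \mathbf b)^{\mathrm{T}} \boldsymbol \Omega^{-1} (\mathbf y - \mathbf X \mathbf b).</math>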
Feasible generalized least squares
If the covariance of the errors <math>\Omega</math> is unknown, one can get a consistent estimate of <math>\Omega</math>, say <math>\widehat \Omega</math>,[3] using an implementable version of GLS known as the feasible generalized least squares (FGLS) estimator.
In FGLS, modeling proceeds in two stages:
- The model is estimated by OLS or another consistent (but inefficient) estimator, and the residuals are used to build a consistent estimator of the errors covariance matrix (to do so, one often needs to examine the model adding additional constraints; for example, if the errors follow a time series process, a statistician generally needs some theoretical assumptions on this process to ensure that a consistent estimator is available).
- Then, using the consistent estimator of the covariance matrix of the errors, one can implement GLS ideas.
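The two stages above can be sketched as follows. This is a minimal example under a strong assumption that is not part of the general method: the errors are taken to follow a stationary AR(1) process, so that Ω is proportional to the matrix with entries <math>\rho^{|i-j|}</math>. The function name `fgls_ar1` and the simulated data are hypothetical.

```python
import numpy as np

def fgls_ar1(X, y):
    """Two-stage FGLS sketch under an assumed AR(1) error structure."""
    n = len(y)
    # Stage 1: OLS fit, then estimate rho from the lag-1 residual regression.
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ols
    rho_hat = (resid[1:] @ resid[:-1]) / (resid[:-1] @ resid[:-1])
    # Stage 2: build Omega-hat (up to scale) and apply the GLS formula.
    idx = np.arange(n)
    omega_hat = rho_hat ** np.abs(idx[:, None] - idx[None, :])
    omega_inv_X = np.linalg.solve(omega_hat, X)
    beta_fgls = np.linalg.solve(X.T @ omega_inv_X, omega_inv_X.T @ y)
    return beta_fgls, rho_hat

# Hypothetical data with AR(1) errors, rho = 0.7.
rng = np.random.default_rng(3)
n, rho = 500, 0.7
X = np.column_stack([np.ones(n), rng.normal(size=n)])
innov = rng.normal(size=n)
eps = np.empty(n)
eps[0] = innov[0] / np.sqrt(1 - rho**2)   # draw from the stationary distribution
for t in range(1, n):
    eps[t] = rho * eps[t - 1] + innov[t]
y = X @ np.array([1.0, 2.0]) + eps

beta_fgls, rho_hat = fgls_ar1(X, y)
print(beta_fgls, rho_hat)
```

The overall scale of the estimated covariance is omitted because a constant factor in Ω cancels in the GLS formula.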
Whereas GLS is more efficient than OLS under heteroscedasticity (also spelled heteroskedasticity) or autocorrelation, this is not true for FGLS. The feasible estimator is asymptotically more efficient (provided the errors covariance matrix is consistently estimated), but for a small to medium-sized sample, it can be actually less efficient than OLS. This is why some authors prefer to use OLS and reformulate their inferences by simply considering an alternative estimator for the variance of the estimator robust to heteroscedasticity or serial autocorrelation. However, for large samples, FGLS is preferred over OLS under heteroskedasticity or serial correlation.[3][4] A cautionary note is that the FGLS estimator is not always consistent. One case in which FGLS might be inconsistent is if there are individual-specific fixed effects.[5]
In general, this estimator has different properties than GLS. For large samples (i.e., asymptotically), all properties are (under appropriate conditions) common with respect to GLS, but for finite samples, the properties of FGLS estimators are unknown: they vary dramatically with each particular model, and as a general rule, their exact distributions cannot be derived analytically. For finite samples, FGLS may be less efficient than OLS in some cases. Thus, while GLS can be made feasible, it is not always wise to apply this method when the sample is small. A method used to improve the accuracy of the estimators in finite samples is to iterate; that is, to take the residuals from FGLS to update the errors' covariance estimator and then update the FGLS estimation, applying the same idea iteratively until the estimators vary less than some tolerance. However, this method does not necessarily improve the efficiency of the estimator very much if the original sample was small.
A reasonable option when samples are not too large is to apply OLS but discard the classical variance estimator<math display="block">\sigma^2 (X^\operatorname{T} X)^{-1}</math>(which is inconsistent in this framework) and instead use a HAC (Heteroskedasticity and Autocorrelation Consistent) estimator. In the context of autocorrelation, the Newey–West estimator can be used, and in heteroscedastic contexts, the Eicker–White estimator can be used instead. This approach is much safer, and it is the appropriate path to take unless the sample is large, where "large" is sometimes a slippery issue (e.g., if the error distribution is asymmetric the required sample will be much larger).
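For the heteroscedastic case, the following sketch computes OLS with the basic Eicker–White (often called HC0) sandwich covariance, <math>(X^\operatorname{T} X)^{-1} X^\operatorname{T} \operatorname{diag}(\widehat{u}_j^2) X (X^\operatorname{T} X)^{-1}</math>, next to the classical estimator. The data-generating process and variable names are assumptions for illustration; finite-sample corrections and the Newey–West autocorrelation terms are not included.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
# Hypothetical heteroscedastic errors: standard deviation grows with x.
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * x

# OLS point estimate and residuals.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_ols
k = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)

# Classical estimator: sigma^2 (X^T X)^-1.
sigma2 = resid @ resid / (n - k)
cov_classical = sigma2 * XtX_inv

# Eicker-White (HC0) sandwich estimator.
meat = X.T @ (resid[:, None] ** 2 * X)
cov_hc0 = XtX_inv @ meat @ XtX_inv

robust_se = np.sqrt(np.diag(cov_hc0))
classical_se = np.sqrt(np.diag(cov_classical))
print(robust_se, classical_se)
```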
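The ordinary least squares (OLS) estimator is calculated by:<math display="block">\widehat \beta_\text{OLS} = (X^\operatorname{T} X)^{-1} X^\operatorname{T} y</math>
and estimates of the residuals <math>\widehat{u}_j = (y - X \widehat\beta_\text{OLS})_j</math> are constructed.
For simplicity, consider the model for heteroscedastic and non-autocorrelated errors. Assume that the variance-covariance matrix <math>\Omega</math> of the error vector is diagonal, or equivalently that errors from distinct observations are uncorrelated. Then each diagonal entry may be estimated by the fitted residuals <math>\widehat{u}_j</math> so <math>\widehat{\Omega}_\text{OLS}</math> may be constructed by:<math display="block">\widehat{\Omega}_\text{OLS} = \operatorname{diag}(\widehat{\sigma}^2_1, \widehat{\sigma}^2_2, \dots , \widehat{\sigma}^2_n).</math>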
It is important to notice that the squared residuals cannot be used in the previous expression; an estimator of the errors' variances is needed. To do so, a parametric heteroskedasticity model or nonparametric estimator can be used.
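Estimate <math>\beta_{FGLS1}</math> with <math>\widehat{\Omega}_\text{OLS}</math> using[4] weighted least squares:<math display="block">\widehat \beta_{FGLS1} = (X^\operatorname{T} \widehat{\Omega}^{-1}_\text{OLS} X)^{-1} X^\operatorname{T} \widehat{\Omega}^{-1}_\text{OLS} y</math>
The procedure can be iterated. The first iteration is given by:<math display="block">\widehat{u}_{FGLS1} = y - X \widehat \beta_{FGLS1}</math><math display="block">\widehat{\Omega}_{FGLS1} = \operatorname{diag}(\widehat{\sigma}^2_{FGLS1,1}, \widehat{\sigma}^2_{FGLS1,2}, \dots , \widehat{\sigma}^2_{FGLS1,n})</math><math display="block">\widehat \beta_{FGLS2} = (X^\operatorname{T} \widehat{\Omega}^{-1}_{FGLS1} X)^{-1} X^\operatorname{T} \widehat{\Omega}^{-1}_{FGLS1} y</math>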
This estimation of can be iterated to convergence.
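The iteration can be sketched as below. Since the squared residuals cannot be used directly as variance estimates, this example assumes one particular parametric heteroskedasticity model, a log-linear model in which <math>\log \sigma_j^2</math> is linear in the regressors; that choice, the function name `iterated_fgls`, and the tolerance are illustrative assumptions rather than part of the procedure described above.

```python
import numpy as np

def iterated_fgls(X, y, tol=1e-8, max_iter=50):
    """Iterated FGLS sketch with an assumed log-linear variance model."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # start from OLS
    for _ in range(max_iter):
        resid = y - X @ beta
        # Fit log(u_j^2) on the regressors; the small constant guards log(0).
        gamma, *_ = np.linalg.lstsq(X, np.log(resid**2 + 1e-12), rcond=None)
        var_hat = np.exp(X @ gamma)                   # fitted variances, positive
        w = 1.0 / var_hat
        # Weighted least squares with the current diagonal Omega-hat.
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new, var_hat
        beta = beta_new
    return beta, var_hat

# Hypothetical data whose error variance follows exp(-1 + 0.5 x).
rng = np.random.default_rng(5)
n = 300
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
sd = np.exp(0.5 * (-1.0 + 0.5 * x))
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * sd

beta_it, var_hat = iterated_fgls(X, y)
print(beta_it)
```

The loop stops either when successive coefficient vectors differ by less than `tol` or after `max_iter` passes; convergence is not guaranteed in general.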
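Under regularity conditions, the FGLS estimator (or the estimator of its iterations, if a finite number of iterations are conducted) is asymptotically distributed as:<math display="block">\sqrt{n}(\hat\beta_{FGLS} - \beta)\ \xrightarrow{d}\ \mathcal{N}\!\left(0,\,V\right)</math>
where <math>n</math> is the sample size, and<math display="block">V = \operatorname{p-lim}\left(X^\operatorname{T} \Omega^{-1}X/n\right)^{-1},</math>
where <math>\text{p-lim}</math> means limit in probability.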
See also
References
- ^ Aitken, A. C. (1935). "On Least Squares and Linear Combinations of Observations". Proceedings of the Royal Society of Edinburgh. 55: 42–48. doi:10.1017/s0370164600014346.
- ^ Strutz, T. (2016). Data Fitting and Uncertainty (A practical introduction to weighted least squares and beyond). Springer Vieweg. ISBN 978-3-658-11455-8., chapter 3
- ^ a b Baltagi, B. H. (2008). Econometrics (4th ed.). New York: Springer.
- ^ a b Greene, W. H. (2003). Econometric Analysis (5th ed.). Upper Saddle River, NJ: Prentice Hall.
- ^ Hansen, Christian B. (2007). "Generalized Least Squares Inference in Panel and Multilevel Models with Serial Correlation and Fixed Effects". Journal of Econometrics. 140 (2): 670–694. doi:10.1016/j.jeconom.2006.07.011.
Further reading
- Amemiya, Takeshi (1985). "Generalized Least Squares Theory". Advanced Econometrics. Harvard University Press. ISBN 0-674-00560-0.
- Johnston, John (1972). "Generalized Least-squares". Econometric Methods (Second ed.). New York: McGraw-Hill. pp. 208–242.
- Kmenta, Jan (1986). "Generalized Linear Regression Model and Its Applications". Elements of Econometrics (Second ed.). New York: Macmillan. pp. 607–650. ISBN 0-472-10886-7.
- Beck, Nathaniel; Katz, Jonathan N. (September 1995). "What To Do (and Not to Do) with Time-Series Cross-Section Data". American Political Science Review. 89 (3): 634–647. doi:10.2307/2082979. ISSN 1537-5943. JSTOR 2082979. S2CID 63222945.