Leverage (statistics)
{{Short description|Statistical term}}

In [[statistics]] and in particular in [[regression analysis]], '''leverage''' is a measure of how far away the [[independent variable]] values of an [[observation (statistics)|observation]] are from those of the other observations. ''High-leverage points'', if any, are [[Outlier|outliers]] with respect to the [[independent variables]]. That is, high-leverage points have no neighboring points in <math>\mathbb{R}^{p}</math> space, where ''<math>{p}</math>'' is the number of independent variables in a regression model. This makes the fitted model likely to pass close to a high-leverage observation.<ref>{{cite book |last=Everitt |first=B. S. |year=2002 |title=Cambridge Dictionary of Statistics |publisher=Cambridge University Press |isbn=0-521-81099-X }}</ref> Hence high-leverage points have the potential to cause large changes in the parameter estimates when they are deleted, i.e., to be [[Influential point|influential points]]. Although an influential point will typically have high leverage, a high-leverage point is not necessarily an influential point. The leverage is typically defined as the diagonal elements of the [[hat matrix]].<ref>{{cite book |last1=James |first1=Gareth |last2=Witten |first2=Daniela |last3=Hastie |first3=Trevor |last4=Tibshirani |first4=Robert |title=An introduction to statistical learning: with applications in R |date=2021 |publisher=Springer |location=New York, NY |isbn=978-1-0716-1418-1 |page=112 |edition=Second |url=https://link.springer.com/book/10.1007/978-1-0716-1418-1 |access-date=29 October 2024 |language=en}}</ref>
== Definition and interpretations ==
Consider the [[linear regression]] model <math>{y}_i = \boldsymbol{x}_i^{\top}\boldsymbol{\beta}+{\varepsilon}_i</math>, <math>i=1,\, 2,\ldots,\, n</math>. That is, <math>\boldsymbol{y} = \mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}</math>, where <math>\mathbf{X}</math> is the <math>n\times p</math> [[design matrix]] whose rows correspond to the observations and whose columns correspond to the independent or explanatory variables. The ''leverage score'' for the <math>{i}^{th}</math> independent observation <math>\boldsymbol{x}_i</math> is given as:

:<math> h_{ii}= \left[ \mathbf{H} \right]_{ii} = \boldsymbol{x}_i ^{\top} \left( \mathbf{X}^{\top} \mathbf{X} \right)^{-1}\boldsymbol{x}_i </math>, the <math>{i}^{th}</math> diagonal element of the [[projection matrix|ortho-projection matrix]] (''a.k.a.'' hat matrix) <math>\mathbf{H} = \mathbf{X} \left( \mathbf{X}^{\top} \mathbf{X} \right)^{-1} \mathbf{X}^{\top}</math>.

Thus the <math>{i}^{th}</math> leverage score can be viewed as the 'weighted' distance between <math>\boldsymbol{x}_i</math> and the mean of the <math>\boldsymbol{x}_i</math>'s (see its [[#Relation to Mahalanobis distance|relation with Mahalanobis distance]]). It can also be interpreted as the degree by which the <math>{i}^{th}</math> measured (dependent) value (i.e., <math>y_i</math>) influences the <math>{i}^{th}</math> fitted (predicted) value (i.e., <math>\widehat{y\,}_i</math>): mathematically,

:<math> h_{ii} = \frac{\partial\widehat{y\,}_i}{\partial y_i} </math>.
Hence, the leverage score is also known as the observation self-sensitivity or self-influence.<ref>{{cite web |title=Data Assimilation: Observation influence diagnostic of a data assimilation system |first=C. |last=Cardinali |date=June 2013 |url=https://www.ecmwf.int/sites/default/files/elibrary/2013/16938-observation-influence-diagnostic-data-assimilation-system.pdf }}</ref> Using the fact that <math>{\boldsymbol \widehat{y}}={\mathbf H}{\boldsymbol y}</math> (i.e., the prediction <math>{\boldsymbol \widehat{y}}</math> is the ortho-projection of <math>{\boldsymbol y}</math> onto the range space of <math>\mathbf{X}</math>) in the above expression, we get <math> h_{ii}= \left[ \mathbf{H} \right]_{ii} </math>. Note that this leverage depends on the values of the explanatory variables <math>(\mathbf{X})</math> of all observations but not on any of the values of the dependent variables <math>(y_i)</math>.
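For illustration, the leverage scores can be computed directly from the design matrix. The following sketch (assuming NumPy and a small simulated design matrix; all variable names are illustrative) builds the hat matrix and reads off its diagonal:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3  # n observations, p parameters (including the intercept)
# Design matrix: a column of 1's (intercept) plus two explanatory variables
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

# Hat matrix H = X (X'X)^{-1} X'; its diagonal holds the leverage scores
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
print(h)
```

Note that the response <math>\boldsymbol{y}</math> never enters the computation, consistent with the observation above that leverage depends only on the explanatory variables.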

==Properties==

# The leverage <math>h_{ii}</math> is a number between 0 and 1, <math> 0 \leq h_{ii} \leq 1 .</math><br>'''Proof:''' Note that <math>\mathbf{H}</math> is an [[idempotent matrix]] (<math> \mathbf{H}^2=\mathbf{H}</math>) and symmetric (<math> h_{ij}=h_{ji} </math>). Thus, by using the fact that <math> \left[ \mathbf{H}^2 \right]_{ii}= \left[ \mathbf{H} \right]_{ii} </math>, we have <math> h_{ii}=h_{ii}^2+\sum_{j\neq i}h_{ij}^2 </math>. Since <math> \sum_{j\neq i}h_{ij}^2 \geq 0 </math>, we have <math> h_{ii} \geq h_{ii}^2 \implies 0 \leq h_{ii} \leq 1</math>.
# The sum of the leverages is equal to the number of parameters <math>(p)</math> in <math>\boldsymbol{\beta}</math> (including the intercept).<br>'''Proof:''' <math> \sum_{i=1}^n h_{ii} =\operatorname{Tr}(\mathbf{H}) =\operatorname{Tr}\left(\mathbf{X} \left( \mathbf{X}^{\top} \mathbf{X} \right)^{-1} \mathbf{X}^{\top}\right) =\operatorname{Tr}\left(\mathbf{X}^{\top} \mathbf{X} \left(\mathbf{X}^{\top} \mathbf{X} \right)^{-1} \right) =\operatorname{Tr}\left(\mathbf{I}_p\right) = p</math>.
== Determination of outliers in X using leverages ==

Large leverage <math>{ h_{ii}}</math> corresponds to an <math>{ {\boldsymbol {x}}_{i}}</math> that is extreme. A common rule is to identify those observations whose leverage value <math>{h}_{ii}</math> is more than twice the mean leverage <math>\bar{h}=\dfrac{1}{n}\sum_{i=1}^{n}h_{ii}=\dfrac{p}{n}</math> (see property 2 above). That is, if <math>h_{ii}>2\dfrac{p}{n}</math>, then <math>{ {\boldsymbol {x}}_{i}}</math> should be considered an outlier. Some statisticians prefer the threshold of <math>3p/{n}</math> instead of <math>2p/{n}</math>.
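The <math>2p/n</math> rule of thumb can be sketched as follows (a NumPy illustration with one artificially extreme explanatory value; the data and threshold choice are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
x = rng.normal(size=n)
x[0] = 8.0  # one extreme value in the explanatory variable
X = np.column_stack([np.ones(n), x])  # design matrix with intercept
n, p = X.shape

h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverage scores
h_bar = p / n                 # mean leverage (property 2 above)
flagged = np.where(h > 2 * h_bar)[0]  # the common 2p/n rule
print(flagged)
```

With these data the first observation is flagged, since its explanatory value lies far from the rest.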

== Relation to Mahalanobis distance ==

Leverage is closely related to the [[Mahalanobis distance]] (proof<ref>[https://stats.stackexchange.com/q/200566 Prove the relation between Mahalanobis distance and Leverage?]</ref>). Specifically, for some <math>n\times p</math> matrix <math>\mathbf{X}</math>, the squared Mahalanobis distance of <math>{ {\boldsymbol {x}}_{i}}</math> (where <math>{\boldsymbol {x}}_{i}^{\top}</math> is the <math>{i}^{th}</math> row of <math>\mathbf{X}</math>) from the vector of means <math>\widehat{\boldsymbol{\mu}}=\frac{1}{n}\sum_{i=1}^n \boldsymbol{x}_i</math> of length <math>p</math> is <math>D^2(\boldsymbol{x}_{i}) = (\boldsymbol{x}_{i} - \widehat{\boldsymbol{\mu}})^{\top} \mathbf{S}^{-1} (\boldsymbol{x}_{i}-\widehat{\boldsymbol{\mu}}) </math>, where <math>\mathbf{S}</math> is the estimated [[Covariance matrix#Estimation|covariance matrix]] of the <math>{ {\boldsymbol {x}}_{i}}</math>'s. This is related to the leverage <math>h_{ii}</math> of the hat matrix of <math>\mathbf{X}</math> after appending a column vector of 1's to it. The relationship between the two is:

:<math>D^2(\boldsymbol{x}_{i}) = (n - 1)\left(h_{ii} - \tfrac{1}{n}\right)</math>
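Assuming the sample mean and the sample covariance with divisor <math>n-1</math>, the relation <math>D^2(\boldsymbol{x}_i) = (n-1)(h_{ii} - 1/n)</math> can be checked numerically; the sketch below (NumPy, illustrative data) compares the two sides:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 25, 2                          # n observations, k non-constant regressors
Z = rng.normal(size=(n, k))
X = np.column_stack([np.ones(n), Z])  # append a column of 1's

h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverages with intercept

mu = Z.mean(axis=0)
S = np.cov(Z, rowvar=False)           # sample covariance (divisor n - 1)
d2 = np.array([(z - mu) @ np.linalg.inv(S) @ (z - mu) for z in Z])

# Relation: D^2(x_i) = (n - 1) * (h_ii - 1/n)
print(np.allclose(d2, (n - 1) * (h - 1 / n)))
```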

This relationship enables us to decompose leverage into meaningful components so that some sources of high leverage can be investigated analytically.<ref>{{cite arXiv|eprint=2006.04024|class=math.ST|first=M. G.|last=Kim|title=Sources of high leverage in linear regression model (Journal of Applied Mathematics and Computing, Vol 16, 509–513)|date=2004}}</ref>

== Relation to influence functions ==

In a regression context, we combine leverage and [[Influence function (statistics)|influence functions]] to compute the degree to which estimated coefficients would change if we removed a single data point. Denoting the regression residuals as <math>\widehat{e}_i = y_i- \boldsymbol{x}_i^{\top}\widehat{\boldsymbol{\beta}} </math>, one can compare the estimated coefficient <math>\widehat{\boldsymbol{\beta}} </math> to the leave-one-out estimated coefficient <math>\widehat{\boldsymbol{\beta}}^{(-i)} </math> using the formula<ref>{{Cite journal|last=Miller|first=Rupert G.|date=September 1974|title=An Unbalanced Jackknife|url=https://projecteuclid.org/euclid.aos/1176342811|journal=Annals of Statistics|language=EN|volume=2|issue=5|pages=880–891|doi=10.1214/aos/1176342811|issn=0090-5364|doi-access=free}}</ref><ref>{{Cite book|last=Hayashi|first=Fumio|title=Econometrics|publisher=Princeton University Press|year=2000|pages=21}}</ref>

: <math>\widehat{\boldsymbol{\beta}} - \widehat{\boldsymbol{\beta}}^{(-i)} = \frac{(\mathbf{X}^{\top}\mathbf{X})^{-1}\boldsymbol{x}_i\widehat{e}_i}{1-h_{ii}} </math>

Young (2019) uses a version of this formula after residualizing controls.<ref>{{Cite journal|last=Young|first=Alwyn|date=2019|title=Channeling Fisher: Randomization Tests and the Statistical Insignificance of Seemingly Significant Experimental Results|url=https://academic.oup.com/qje/article/134/2/557/5195544|journal=The Quarterly Journal of Economics|volume=134|issue=2 |pages=567|doi=10.1093/qje/qjy029 |doi-access=free}}</ref> To gain intuition for this formula, note that <math>\frac{\partial \widehat{\boldsymbol{\beta}}}{\partial y_i} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\boldsymbol{x}_i </math> captures the potential for an observation to affect the regression parameters, and therefore <math>(\mathbf{X}^{\top}\mathbf{X})^{-1}\boldsymbol{x}_i\widehat{e}_i </math> captures the actual influence of that observation's deviation from its fitted value on the regression parameters. The formula then divides by <math>(1-h_{ii}) </math> to account for the fact that we remove the observation rather than adjusting its value, reflecting the fact that removal changes the distribution of covariates more when applied to high-leverage observations (i.e., those with outlier covariate values). Similar formulas arise when applying general formulas for statistical [[Influence function (statistics)|influence functions]] in the regression context.<ref>{{Cite journal|last1=Chatterjee|first1=Samprit|last2=Hadi|first2=Ali S.|date=August 1986|title=Influential Observations, High Leverage Points, and Outliers in Linear Regression|url=https://projecteuclid.org/euclid.ss/1177013622|journal=Statistical Science|language=EN|volume=1|issue=3|pages=379–393|doi=10.1214/ss/1177013622|issn=0883-4237|doi-access=free}}</ref><ref>{{Cite web|title=regression - Influence functions and OLS|url=https://stats.stackexchange.com/q/8344 |access-date=2020-12-06|website=Cross Validated}}</ref>
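The leave-one-out formula can be verified against a brute-force refit that actually drops observation <math>i</math> (a NumPy sketch with simulated data; the coefficients and the observation index are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 15, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                       # full-sample OLS estimate
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)    # leverage scores x_i'(X'X)^{-1}x_i
e = y - X @ beta                               # regression residuals

i = 4
# Closed form: beta - beta^{(-i)} = (X'X)^{-1} x_i e_i / (1 - h_ii)
delta = XtX_inv @ X[i] * e[i] / (1 - h[i])

# Brute force: refit without observation i
mask = np.arange(n) != i
beta_loo = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
print(np.allclose(beta - beta_loo, delta))
```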

==Effect on residual variance==

If we are in an [[ordinary least squares]] setting with fixed <math>\mathbf{X}</math> and [[homoscedastic]] [[errors and residuals|regression errors]] <math>\varepsilon_i</math>, <math> \boldsymbol{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}; \ \ \operatorname{Var}(\boldsymbol{\varepsilon})=\sigma^2\mathbf{I} </math>, then the <math>{i}^{th}</math> [[errors and residuals|regression residual]] <math> e_i=y_i-\widehat{y}_i </math> has variance

:<math> \operatorname{Var}(e_i)=(1-h_{ii})\sigma^2 </math>.

In other words, an observation's leverage score determines the degree of noise in the model's misprediction of that observation, with higher leverage leading to less noise. This follows from the fact that <math> \mathbf{I}-\mathbf{H} </math> is idempotent and symmetric and <math>\widehat{\boldsymbol{y}}=\mathbf{H}\boldsymbol{y}</math>; hence, <math> \operatorname{Var}(\boldsymbol{e})=\operatorname{Var}((\mathbf{I}-\mathbf{H})\boldsymbol{y}) =(\mathbf{I}-\mathbf{H})\operatorname{Var}(\boldsymbol{y})(\mathbf{I}-\mathbf{H})^\top = \sigma^2 (\mathbf{I}-\mathbf{H})^2=\sigma^2(\mathbf{I}-\mathbf{H}) </math>.
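A small NumPy sketch (with an assumed error variance <math>\sigma^2 = 2.5</math>; data illustrative) makes the argument concrete: since <math>\mathbf{I}-\mathbf{H}</math> is symmetric and idempotent, the residual covariance matrix is <math>\sigma^2(\mathbf{I}-\mathbf{H})</math>, whose diagonal gives the variances above:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 12
X = np.column_stack([np.ones(n), rng.normal(size=n)])
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H  # the annihilator matrix I - H

sigma2 = 2.5             # assumed error variance for illustration
cov_e = sigma2 * M       # Var(e) = sigma^2 (I - H)
var_e = np.diag(cov_e)   # Var(e_i) = (1 - h_ii) sigma^2
print(np.allclose(var_e, (1 - np.diag(H)) * sigma2))
```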

===Studentized residuals===

The corresponding [[studentized residual]]—the residual adjusted for its observation-specific estimated residual variance—is then

:<math>t_i = {e_i\over \widehat{\sigma} \sqrt{1-h_{ii}\ }}</math>

where <math>\widehat{\sigma}</math> is an appropriate estimate of <math>\sigma</math>.
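Internally studentized residuals can be sketched as follows (NumPy, simulated data; the estimate <math>\widehat{\sigma}^2 = \mathrm{RSS}/(n-p)</math> used here is one common convention):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y                               # residuals e = (I - H) y

sigma_hat = np.sqrt(e @ e / (n - p))        # usual estimate of sigma
t = e / (sigma_hat * np.sqrt(1 - h))        # studentized residuals
print(t.round(2))
```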

== Partial leverage ==

{{main|Partial leverage}}

Partial leverage (PL) is a measure of the contribution of the individual independent variables to the total leverage of each observation. That is, PL is a measure of how ''<math>h_{ii}</math>'' changes as a variable is added to the regression model. It is computed as:

: <math>
\left(\mathrm{PL}_j\right)_i = \frac{\left(\mathbf{X}_{j\bullet[j]}\right)_i^2}{\sum_{k=1}^n\left(\mathbf{X}_{j\bullet[j]}\right)_k^2}
</math>
where <math>j</math> is the index of the independent variable, <math>i</math> is the index of the observation, and <math>\mathbf{X}_{j\bullet[j]}</math> are the [[Errors and residuals in statistics|residuals]] from regressing ''<math>\mathbf{X}_{j}</math>'' against the remaining independent variables. Note that the partial leverage is the leverage of the <math>{i}^{th}</math> point in the [[partial regression plot]] for the <math>{j}^{th}</math> variable. Data points with large partial leverage for an independent variable can exert undue influence on the selection of that variable in automatic regression model building procedures.
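Partial leverage can be sketched in NumPy by residualizing one column of the design matrix against the others (illustrative data; by construction, the partial leverages for a given variable sum to 1):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 18
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 regressors
j = 2  # partial leverage of the last variable

others = np.delete(X, j, axis=1)
# Residuals from regressing X_j against the remaining independent variables
proj = others @ np.linalg.lstsq(others, X[:, j], rcond=None)[0]
r = X[:, j] - proj

pl_j = r**2 / np.sum(r**2)  # (PL_j)_i for each observation i
print(pl_j.round(3))
```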

==Software implementations==

Many programs and statistics packages, such as [[R (programming language)|R]] and [[Python (programming language)|Python]], include implementations of leverage.

{| class="wikitable"
! Language/Program !! Function !! Notes
|-
| [[R (programming language)|R]] || <code>hat(x, intercept = TRUE)</code> or <code>hatvalues(model, ...)</code> || See [https://stat.ethz.ch/R-manual/R-devel/library/stats/html/influence.measures.html]
|-
| [[Python (programming language)|Python]] || <code>(x * np.linalg.pinv(x).T).sum(-1)</code> || See [https://gist.github.com/gabrieldernbach/ff81b0d826782719c8057eb64a3fcb18]
|}

== See also ==

* [[Projection matrix]] – whose main diagonal entries are the leverages of the observations
* [[Mahalanobis distance]] – a ([[Mahalanobis distance#Relationship to leverage|scaled]]) measure of leverage of a datum
* [[Partial leverage]]
* [[Cook's distance]] – a measure of changes in regression coefficients when an observation is deleted
* [[DFFITS]]
* [[Outlier]] – observations with extreme ''Y'' values
* [[Degrees of freedom (statistics)]] – the sum of the leverage scores

== References ==
{{reflist}}