Covariance matrix

In statistics and probability theory, the covariance matrix of a random vector is the matrix of covariances between the elements of that vector. It is the natural generalization to higher dimensions of the concept of the variance of a scalar-valued random variable.

Definition

If the entries of the column vector

X={\begin{bmatrix}X_{1}\\\vdots \\X_{n}\end{bmatrix}}

are random variables, each with finite variance, then the covariance matrix Σ is the matrix whose (i, j) entry is the covariance

\Sigma _{ij}=\mathrm {E} \left[(X_{i}-\mu _{i})(X_{j}-\mu _{j})\right]

where

\mu _{i}=\mathrm {E} (X_{i})\,

is the expected value of the ith entry in the vector X. In other words, we have

\Sigma ={\begin{bmatrix}\mathrm {E} [(X_{1}-\mu _{1})(X_{1}-\mu _{1})]&\mathrm {E} [(X_{1}-\mu _{1})(X_{2}-\mu _{2})]&\cdots &\mathrm {E} [(X_{1}-\mu _{1})(X_{n}-\mu _{n})]\\\\\mathrm {E} [(X_{2}-\mu _{2})(X_{1}-\mu _{1})]&\mathrm {E} [(X_{2}-\mu _{2})(X_{2}-\mu _{2})]&\cdots &\mathrm {E} [(X_{2}-\mu _{2})(X_{n}-\mu _{n})]\\\\\vdots &\vdots &\ddots &\vdots \\\\\mathrm {E} [(X_{n}-\mu _{n})(X_{1}-\mu _{1})]&\mathrm {E} [(X_{n}-\mu _{n})(X_{2}-\mu _{2})]&\cdots &\mathrm {E} [(X_{n}-\mu _{n})(X_{n}-\mu _{n})]\end{bmatrix}}.

As a generalization of the variance

The definition above is equivalent to the matrix equality

\Sigma =\mathrm {E} \left[\left({\textbf {X}}-\mathrm {E} [{\textbf {X}}]\right)\left({\textbf {X}}-\mathrm {E} [{\textbf {X}}]\right)^{\top }\right]

This form shows that the covariance matrix generalizes to higher dimensions the variance of a scalar-valued random variable X, defined as

\sigma ^{2}=\mathrm {var} (X)=\mathrm {E} [(X-\mu )^{2}],\,

where

\mu =\mathrm {E} (X).\,
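As an illustrative sketch, the agreement between the entry-wise definition and the matrix form can be checked numerically. The example assumes a hypothetical discrete random vector with four equally likely outcomes, so that every expected value is a finite weighted sum; the names `outcomes` and `probs` are invented for the illustration.

```python
import numpy as np

# Hypothetical discrete random vector X in R^2: each row is one outcome.
outcomes = np.array([[0.0, 0.0],
                     [1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 3.0]])
probs = np.array([0.25, 0.25, 0.25, 0.25])  # assumed probabilities

# mu_i = E(X_i)
mu = probs @ outcomes

# Entry-wise definition: Sigma_ij = E[(X_i - mu_i)(X_j - mu_j)]
n = outcomes.shape[1]
sigma_entrywise = np.empty((n, n))
for i in range(n):
    for j in range(n):
        sigma_entrywise[i, j] = np.sum(
            probs * (outcomes[:, i] - mu[i]) * (outcomes[:, j] - mu[j])
        )

# Matrix form: Sigma = E[(X - E[X])(X - E[X])^T], a weighted sum of outer products
centered = outcomes - mu
sigma_outer = sum(p * np.outer(c, c) for p, c in zip(probs, centered))

# Scalar variances of the individual components, E[(X_i - mu_i)^2]
scalar_variances = probs @ centered**2

print(np.allclose(sigma_entrywise, sigma_outer))
print(np.allclose(np.diag(sigma_outer), scalar_variances))
```

The diagonal of the resulting matrix holds the ordinary variances of the components, which is the sense in which the matrix extends the scalar variance.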

Conflicting nomenclatures and notations

Nomenclatures differ. Some statisticians, following the probabilist William Feller, call this matrix the variance of the random vector $X$, because it is the natural generalization to higher dimensions of the one-dimensional variance. Others call it the covariance matrix, because it is the matrix of covariances between the scalar components of the vector $X$. Thus

\operatorname {var} ({\textbf {X}})=\operatorname {cov} ({\textbf {X}})=\mathrm {E} \left[({\textbf {X}}-\mathrm {E} [{\textbf {X}}])({\textbf {X}}-\mathrm {E} [{\textbf {X}}])^{\top }\right]

However, the notation for the "cross-covariance" between two vectors is standard:

\operatorname {cov} ({\textbf {X}},{\textbf {Y}})=\mathrm {E} \left[({\textbf {X}}-\mathrm {E} [{\textbf {X}}])({\textbf {Y}}-\mathrm {E} [{\textbf {Y}}])^{\top }\right]

The $\operatorname {var}$ notation is found in William Feller's two-volume book An Introduction to Probability Theory and Its Applications, but both forms are quite standard and there is no ambiguity between them.

Properties

For $\Sigma =\mathrm {E} \left[\left({\textbf {X}}-\mathrm {E} [{\textbf {X}}]\right)\left({\textbf {X}}-\mathrm {E} [{\textbf {X}}]\right)^{\top }\right]$ and $\mu =\mathrm {E} ({\textbf {X}})$ the following basic properties apply:

$\Sigma =\mathrm {E} (\mathbf {XX^{\top }} )-\mathbf {\mu } \mathbf {\mu ^{\top }}$
$\operatorname {cov} (\mathbf {a^{\top }} \mathbf {X} )=\mathbf {a^{\top }} \operatorname {cov} (\mathbf {X} )\mathbf {a}$
$\mathbf {\Sigma }$ is positive semi-definite
$\operatorname {var} (\mathbf {AX} +\mathbf {a} )=\mathbf {A} \,\operatorname {var} (\mathbf {X} )\,\mathbf {A^{\top }}$
$\operatorname {cov} (\mathbf {X} ,\mathbf {Y} )=\operatorname {cov} (\mathbf {Y} ,\mathbf {X} )^{\top }$
$\operatorname {cov} (\mathbf {X_{1}} +\mathbf {X_{2}} ,\mathbf {Y} )=\operatorname {cov} (\mathbf {X_{1}} ,\mathbf {Y} )+\operatorname {cov} (\mathbf {X_{2}} ,\mathbf {Y} )$
If p = q, then $\operatorname {var} (\mathbf {X} +\mathbf {Y} )=\operatorname {var} (\mathbf {X} )+\operatorname {cov} (\mathbf {X} ,\mathbf {Y} )+\operatorname {cov} (\mathbf {Y} ,\mathbf {X} )+\operatorname {var} (\mathbf {Y} )$
$\operatorname {cov} (\mathbf {AX} ,\mathbf {BY} )=\mathbf {A} \,\operatorname {cov} (\mathbf {X} ,\mathbf {Y} )\,\mathbf {B} ^{\top }$
If $\mathbf {X}$ and $\mathbf {Y}$ are independent, then $\operatorname {cov} (\mathbf {X} ,\mathbf {Y} )=0$

where $\mathbf {X}$, $\mathbf {X_{1}}$ and $\mathbf {X_{2}}$ are random $(p\times 1)$ vectors, $\mathbf {Y}$ is a random $(q\times 1)$ vector, $\mathbf {A}$ and $\mathbf {B}$ are constant matrices with $p$ and $q$ columns respectively, and $\mathbf {a}$ is a constant vector of conformable size: $(p\times 1)$ in $\mathbf {a^{\top }} \mathbf {X}$, and of the same length as $\mathbf {AX}$ in $\mathbf {AX} +\mathbf {a}$.
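Several of these identities can be verified numerically. The following is a minimal sketch, assuming a hypothetical joint discrete distribution with six outcomes so that all expectations are exact weighted sums. The helper functions `cov` and `var`, the seed, and the matrix sizes are illustrative choices, and the vector `v` plays the role of $\mathbf {a}$ in $\mathbf {a^{\top }} \mathbf {X}$.

```python
import numpy as np

def cov(xs, ys, probs):
    """cov(X, Y) = E[(X - E[X])(Y - E[Y])^T] for a finite discrete distribution.

    Row r of xs (shape (k, p)) and ys (shape (k, q)) holds the values of X
    and Y in outcome r, which occurs with probability probs[r].
    """
    xc = xs - probs @ xs
    yc = ys - probs @ ys
    return (xc * probs[:, None]).T @ yc

def var(xs, probs):
    """var(X) = cov(X, X)."""
    return cov(xs, xs, probs)

rng = np.random.default_rng(0)   # arbitrary seed; the values are hypothetical
k, p, q = 6, 2, 3
xs = rng.normal(size=(k, p))     # outcomes of X (also used as X_1)
xs2 = rng.normal(size=(k, p))    # outcomes of X_2
ys = rng.normal(size=(k, q))     # outcomes of Y
probs = rng.random(k)
probs /= probs.sum()             # normalize to a probability vector

A = rng.normal(size=(4, p))      # constant matrix with p columns
B = rng.normal(size=(5, q))      # constant matrix with q columns
a = rng.normal(size=4)           # constant shift, same length as AX
v = rng.normal(size=p)           # constant vector used in v^T X

sigma = var(xs, probs)
mu = probs @ xs
second_moment = (xs * probs[:, None]).T @ xs   # E[X X^T]

checks = {
    "Sigma = E[XX^T] - mu mu^T":
        np.allclose(sigma, second_moment - np.outer(mu, mu)),
    "var(v^T X) = v^T var(X) v":
        np.allclose(var((xs @ v)[:, None], probs), v @ sigma @ v),
    "Sigma is positive semi-definite":
        bool(np.all(np.linalg.eigvalsh(sigma) >= -1e-12)),
    "var(AX + a) = A var(X) A^T":
        np.allclose(var(xs @ A.T + a, probs), A @ sigma @ A.T),
    "cov(X, Y) = cov(Y, X)^T":
        np.allclose(cov(xs, ys, probs), cov(ys, xs, probs).T),
    "cov(X1 + X2, Y) = cov(X1, Y) + cov(X2, Y)":
        np.allclose(cov(xs + xs2, ys, probs),
                    cov(xs, ys, probs) + cov(xs2, ys, probs)),
    "cov(AX, BY) = A cov(X, Y) B^T":
        np.allclose(cov(xs @ A.T, ys @ B.T, probs),
                    A @ cov(xs, ys, probs) @ B.T),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

Because the identities hold for any distribution with finite variances, the check does not depend on the particular random values drawn.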

Despite its simplicity, the covariance matrix is a useful tool in many different areas. From it a transformation matrix can be derived that completely decorrelates the data or, from a different point of view, finds an optimal basis for representing the data compactly (see Rayleigh quotient for a formal proof and additional properties of covariance matrices). This is called principal components analysis (PCA) in statistics and the Karhunen-Loève transform (KL-transform) in image processing.
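One common way to obtain such a decorrelating transformation is from the eigenvectors of the covariance matrix. The sketch below assumes a hypothetical 2×2 covariance matrix `sigma` and uses the property var(AX) = A var(X) A^T listed above; it illustrates the idea and is not a full PCA implementation.

```python
import numpy as np

# Hypothetical covariance matrix of a 2-dimensional random vector X.
sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])

# Symmetric eigendecomposition: sigma = V diag(eigvals) V^T
eigvals, eigvecs = np.linalg.eigh(sigma)

# Order the components by decreasing variance, as is customary in PCA.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# The transformed vector Y = V^T X has covariance V^T sigma V.
sigma_y = eigvecs.T @ sigma @ eigvecs

# Off-diagonal entries vanish: the components of Y are uncorrelated.
off_diagonal = sigma_y - np.diag(np.diag(sigma_y))
print(np.allclose(off_diagonal, 0.0))
print(np.allclose(np.diag(sigma_y), eigvals))
```

The variances of the decorrelated components are the eigenvalues of the covariance matrix, so keeping only the components with the largest eigenvalues gives a compact representation.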

Which matrices are covariance matrices

From the identity

\operatorname {var} (\mathbf {a^{\top }} \mathbf {X} )=\mathbf {a^{\top }} \operatorname {var} (\mathbf {X} )\mathbf {a} \,

and the fact that the variance of any real-valued random variable is nonnegative, it follows immediately that only a nonnegative-definite matrix can be a covariance matrix. The converse question is whether every nonnegative-definite symmetric matrix is a covariance matrix. The answer is "yes". To see this, suppose $M$ is a p×p nonnegative-definite symmetric matrix. From the finite-dimensional case of the spectral theorem, it follows that $M$ has a nonnegative-definite symmetric square root, which we denote $M^{1/2}$. Let $\mathbf {X}$ be any p×1 column vector-valued random variable whose covariance matrix is the p×p identity matrix. Then

\operatorname {var} (M^{1/2}\mathbf {X} )=M^{1/2}(\operatorname {var} (\mathbf {X} ))M^{1/2}=M.\,
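The construction can be sketched numerically. The example below assumes a hypothetical 3×3 nonnegative-definite matrix `M`, builds its symmetric square root from the eigendecomposition given by the spectral theorem, and takes a standard normal vector as one possible choice of random vector with identity covariance (any other such vector would do).

```python
import numpy as np

# Hypothetical symmetric nonnegative-definite matrix.
M = np.array([[4.0, 2.0, 0.0],
              [2.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# Spectral theorem: M = V diag(w) V^T with orthogonal V and w >= 0.
w, V = np.linalg.eigh(M)
w = np.clip(w, 0.0, None)            # guard against tiny negative round-off
M_half = V @ np.diag(np.sqrt(w)) @ V.T

# Exact identity: M^{1/2} I M^{1/2} = M
identity = np.eye(3)
exact = M_half @ identity @ M_half
print(np.allclose(exact, M))

# Empirical check with an assumed choice of X: standard normal entries,
# whose covariance matrix is the identity.
rng = np.random.default_rng(0)       # arbitrary seed
Z = rng.standard_normal(size=(200_000, 3))
X = Z @ M_half.T                     # each row is M^{1/2} z for one sample z
sample_cov = np.cov(X, rowvar=False)
print(np.round(sample_cov, 2))
```

The sample covariance only approximates `M`, with an error that shrinks as the number of samples grows, whereas the matrix identity holds up to floating-point rounding.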

Complex random vectors

The variance of a complex scalar-valued random variable with expected value μ is conventionally defined using complex conjugation:

\operatorname {var} (z)=\operatorname {E} \left[(z-\mu )(z-\mu )^{*}\right]

where $z^{*}$ denotes the complex conjugate of a complex number $z$.

If $Z$ is a column vector of complex-valued random variables, then the conjugate transpose is formed by both transposing and conjugating, which yields a square matrix:

\operatorname {E} \left[(Z-\mu )(Z-\mu )^{*}\right]

where $Z^{*}$ denotes the conjugate transpose. The same notation applies to the scalar case, since the transpose of a scalar is the scalar itself.
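A small numerical sketch shows the effect of the conjugation. It assumes a hypothetical complex random vector with three outcomes and invented probabilities; with the conjugate transpose, the resulting matrix is Hermitian and its diagonal entries are real and nonnegative.

```python
import numpy as np

# Hypothetical complex random vector Z in C^2: each row is one outcome.
zs = np.array([[ 1 + 1j, 2 - 1j],
               [ 0 + 2j, 1 + 0j],
               [-1 - 1j, 0 + 1j]])
probs = np.array([0.5, 0.25, 0.25])   # assumed probabilities

mu = probs @ zs                       # E[Z]
zc = zs - mu                          # centered outcomes

# E[(Z - mu)(Z - mu)^*]: entry (i, j) is E[(Z_i - mu_i) conj(Z_j - mu_j)]
sigma = (zc * probs[:, None]).T @ zc.conj()

# Scalar definition applied to the first component: E[(z - mu)(z - mu)^*]
var_first = probs @ (zc[:, 0] * zc[:, 0].conj())

print(np.allclose(sigma, sigma.conj().T))        # Hermitian
print(np.allclose(np.diag(sigma).imag, 0.0))     # real diagonal
print(np.isclose(sigma[0, 0], var_first))        # agrees with the scalar case
```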

Estimation

The derivation of the maximum-likelihood estimator of the covariance matrix of a multivariate normal distribution is perhaps surprisingly subtle. It relies on the spectral theorem and on viewing a scalar as the trace of a 1 × 1 matrix rather than as a mere scalar. See estimation of covariance matrices.
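The derivation itself is not reproduced here, but the estimator it arrives at has a well-known closed form: the average of the outer products of the centered observations, with divisor n. The sketch below computes it on simulated data and compares it with the unbiased variant that divides by n − 1. The true covariance matrix, the mean, the sample size, and the seed are all assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)                    # arbitrary seed
true_sigma = np.array([[2.0, 0.6],
                       [0.6, 1.0]])               # hypothetical true covariance
samples = rng.multivariate_normal(mean=[1.0, -1.0], cov=true_sigma, size=500)

n = samples.shape[0]
centered = samples - samples.mean(axis=0)

# Maximum-likelihood estimate for the multivariate normal: divide by n.
sigma_mle = centered.T @ centered / n

# Unbiased sample covariance: divide by n - 1.
sigma_unbiased = centered.T @ centered / (n - 1)

# np.cov reproduces both conventions through its `bias` argument.
print(np.allclose(sigma_mle, np.cov(samples, rowvar=False, bias=True)))
print(np.allclose(sigma_unbiased, np.cov(samples, rowvar=False)))
```

The two estimates differ only by the factor n / (n − 1), which becomes negligible for large samples.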
