Statistical distance
{{Short description|Distance between two statistical objects}}
{{Multiple issues|
{{more footnotes|date=February 2012}}
{{refimprove|date=December 2020}}
}}

In [[statistics]], [[probability theory]], and [[information theory]], a '''statistical distance''' quantifies the [[distance]] between two statistical objects, which can be two [[random variable]]s, two [[probability distribution]]s, or two [[Sample (statistics)|samples]]; the distance can also be between an individual sample point and a population or a wider sample of points.


A distance between populations can be interpreted as measuring the distance between two [[probability distribution]]s, and hence such distances are essentially measures of distances between [[probability measure]]s. Where statistical distance measures relate to the differences between [[random variable]]s, these variables may be [[statistical independence|statistically dependent]],<ref>Dodge, Y. (2003)&mdash;entry for distance</ref> and hence these distances are not directly related to measures of distances between probability measures. Again, a measure of distance between random variables may relate to the extent of dependence between them, rather than to their individual values.


Many statistical distance measures are not [[metric (mathematics)|metric]]s, and some are not symmetric. Some types of distance measures, which generalize ''squared'' distance, are referred to as (statistical) ''[[divergence (statistics)|divergence]]s''.

==Terminology==
Many terms are used to refer to various notions of distance; these are often confusingly similar, and may be used inconsistently between authors and over time, either loosely or with precise technical meaning. In addition to "distance", similar terms include [[deviance (statistics)|deviance]], [[deviation (statistics)|deviation]], [[discrepancy (disambiguation)#Statistics|discrepancy]], discrimination, and [[divergence (statistics)|divergence]], as well as others such as [[contrast function]] and [[metric (mathematics)|metric]]. Terms from [[information theory]] include [[cross entropy]], [[relative entropy]], [[discrimination information]], and [[information gain]].


==Distances as metrics==


===Metrics===
A '''metric''' on a set ''X'' is a [[function (mathematics)|function]] (called the ''distance function'' or simply '''distance''') ''d'' : ''X'' × ''X'' → '''R'''<sup>+</sup> (where '''R'''<sup>+</sup> is the set of non-negative [[real number]]s). For all ''x'', ''y'', ''z'' in ''X'', this function is required to satisfy the following conditions:


# ''d''(''x'', ''y'') ≥ 0 (''non-negativity'')
# ''d''(''x'', ''y'') = 0 if and only if ''x'' = ''y'' (''identity of indiscernibles''; conditions 1 and 2 together produce ''positive definiteness'')
# ''d''(''x'', ''y'') = ''d''(''y'', ''x'') (''symmetry'')
# ''d''(''x'', ''z'') ≤ ''d''(''x'', ''y'') + ''d''(''y'', ''z'') (''subadditivity'' / ''triangle inequality'').


===Generalized metrics===
Many statistical distances are not [[metric (mathematics)|metric]]s, because they lack one or more properties of proper metrics. For example, [[pseudometric space|pseudometric]]s violate property (2), identity of indiscernibles; [[quasimetric]]s violate property (3), symmetry; and [[semimetric]]s violate property (4), the triangle inequality. Statistical distances that satisfy (1) and (2) are referred to as [[divergence (statistics)|divergence]]s.
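
As a rough illustration of this distinction (a minimal sketch in Python, assuming the NumPy library; the example distributions are arbitrary), the [[Kullback–Leibler divergence]] satisfies properties (1) and (2) above but not the symmetry property (3), so it is a divergence rather than a metric:

<syntaxhighlight lang="python">
import numpy as np

def kl_divergence(p, q):
    """Kullback–Leibler divergence D(p || q) between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Two arbitrary distributions on a three-element domain.
p = [0.9, 0.05, 0.05]
q = [0.5, 0.25, 0.25]

print(kl_divergence(p, p))  # 0.0: zero exactly when the two distributions coincide
print(kl_divergence(p, q))  # about 0.37
print(kl_divergence(q, p))  # about 0.51, so D(p || q) != D(q || p): symmetry fails
</syntaxhighlight>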

==Statistically close==
The [[Total variation distance of probability measures|total variation distance]] of two distributions <math>X</math> and <math>Y</math> over a finite domain <math>D</math> (often referred to as ''statistical difference''<ref>
{{cite book
| last = Goldreich
| first = Oded
| authorlink = Oded Goldreich
| title = Foundations of Cryptography: Basic Tools
| publisher = [[Cambridge University Press]]
| edition = 1st
| location = Berlin
| date = 2001
| page = 106
| isbn = 0-521-79172-3
}}
</ref>
or ''statistical distance''<ref>
Reyzin, Leo. (Lecture Notes) [http://www.cs.bu.edu/~reyzin/teaching/s11cs937/notes-leo-1.pdf Extractors and the Leftover Hash Lemma]
</ref> in cryptography) is defined as

<math> \Delta(X,Y)=\frac{1}{2} \sum _{\alpha \in D} | \Pr[X=\alpha] - \Pr[Y=\alpha] |</math>.

We say that two [[probability ensembles]] <math>\{X_k\}_{k\in\N}</math> and <math>\{Y_k\}_{k\in\N}</math> are statistically close if <math>\Delta(X_k,Y_k)</math> is a [[negligible function]] in <math>k</math>.
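
A direct transcription of the definition above into code can make it concrete. The following is a minimal sketch in plain Python (the function name and the toy distributions are illustrative only):

<syntaxhighlight lang="python">
def statistical_distance(px, py):
    """Total variation ("statistical") distance between two distributions on a finite domain.

    px and py map each element of the domain to its probability.
    """
    domain = set(px) | set(py)
    return 0.5 * sum(abs(px.get(a, 0.0) - py.get(a, 0.0)) for a in domain)

# Two toy distributions over the domain {"a", "b", "c"}.
X = {"a": 0.5, "b": 0.3, "c": 0.2}
Y = {"a": 0.4, "b": 0.4, "c": 0.2}

print(statistical_distance(X, Y))  # 0.1 (up to floating-point rounding)
print(statistical_distance(X, X))  # 0.0
</syntaxhighlight>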


==Examples==
===Metrics===
* [[Total variation distance]] (sometimes just called "the" statistical distance)
* [[Hellinger distance]]
* [[Lévy–Prokhorov metric]]
* [[Wasserstein metric]]: also known as the [[Kantorovich metric]], or [[earth mover's distance]]
* [[Energy distance]]
* The [[Kolmogorov–Smirnov test|Kolmogorov–Smirnov statistic]] represents a distance between two probability distributions defined on a single real variable
* The '''maximum mean discrepancy''', which is defined in terms of the [[kernel embedding of distributions]] (two of the metrics above are evaluated numerically in the sketch following this list)
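
As a brief numerical sketch (in Python, assuming the NumPy and SciPy libraries; the sample sizes and distribution parameters are arbitrary), the Wasserstein and energy distances listed above can be evaluated directly between two one-dimensional samples:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import wasserstein_distance, energy_distance

rng = np.random.default_rng(0)
sample_1 = rng.normal(loc=0.0, scale=1.0, size=1000)  # draws from N(0, 1)
sample_2 = rng.normal(loc=0.5, scale=1.0, size=1000)  # draws from N(0.5, 1)

# SciPy treats each sample as an empirical distribution.
print(wasserstein_distance(sample_1, sample_2))  # roughly 0.5 for these samples
print(wasserstein_distance(sample_2, sample_1))  # the same value: the metric is symmetric
print(energy_distance(sample_1, sample_2))
print(energy_distance(sample_1, sample_1))       # 0.0: identical samples are at distance zero
</syntaxhighlight>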

Other approaches
* [[Signal-to-noise ratio]] distance
* [[Mahalanobis distance]]
* [[Amari distance]]
* [[Distance correlation]] is a measure of dependence between two [[random variables]]; it is zero if and only if the random variables are independent.
* [[Integral probability metric]]s generalize several metrics or pseudometrics on distributions
* The ''continuous ranked probability score'' measures how well forecasts that are expressed as probability distributions match observed outcomes. Both the location and spread of the forecast distribution are taken into account in judging how close the distribution is to the observed value: see [[probabilistic forecasting]].
* [[Łukaszyk–Karmowski metric]] is a function defining a distance between two [[random variable]]s or two [[random vector]]s. It does not satisfy the [[identity of indiscernibles]] condition of the metric: it is zero if and only if both its arguments are certain events described by [[Dirac delta]] density [[probability distribution function]]s.
===Divergences===
* [[Kullback–Leibler divergence]]
* [[Rényi divergence]] (a numerical sketch appears after this list)
* [[Jensen–Shannon divergence]]
* [[Bhattacharyya distance]] (despite its name, it is not a distance, as it violates the triangle inequality)
* [[f-divergence]]: generalizes several distances and divergences
* [[Discriminability index]], specifically the [[Discriminability index#Bayes discriminability index|Bayes discriminability index]], is a positive-definite symmetric measure of the overlap of two distributions.
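
As a small worked sketch (in Python, assuming the NumPy library; the toy distributions are arbitrary), the Rényi divergence of order <math>\alpha</math> can be computed as <math>\tfrac{1}{\alpha-1}\log\sum_i p_i^\alpha q_i^{1-\alpha}</math>, and it approaches the Kullback–Leibler divergence as <math>\alpha \to 1</math>:

<syntaxhighlight lang="python">
import numpy as np

def renyi_divergence(p, q, alpha):
    """Rényi divergence of order alpha (alpha > 0, alpha != 1) between discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0))

def kl_divergence(p, q):
    """Kullback–Leibler divergence D(p || q), the alpha -> 1 limit of the Rényi divergence."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Two arbitrary distributions on a three-element domain.
p = [0.9, 0.05, 0.05]
q = [0.5, 0.25, 0.25]

for alpha in (0.5, 0.9, 0.99, 1.01, 2.0):
    print(alpha, renyi_divergence(p, q, alpha))
print("KL:", kl_divergence(p, q))  # the order-alpha values approach this as alpha -> 1
</syntaxhighlight>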


== See also ==
*[[Probabilistic metric space]]
*[[Randomness extractor]]
*[[Similarity measure]]
*[[Zero-knowledge proof]]


==Notes==
{{Reflist}}


==External links==
*[http://reference.wolfram.com/mathematica/guide/DistanceAndSimilarityMeasures.html Distance and Similarity Measures (Wolfram Alpha)]

{{Statistics|inference|collapsed}}


==References==
*Dodge, Y. (2003) ''Oxford Dictionary of Statistical Terms'', OUP. {{ISBN|0-19-920613-9}}


[[Category:Statistical distance| ]]
