Statistical distance
In statistics, probability theory, and information theory, a statistical distance quantifies the distance between two statistical objects, which can be two random variables, two probability distributions or samples, or an individual sample point and a population or a wider sample of points.
A distance between populations can be interpreted as measuring the distance between two probability distributions, so such distances are essentially measures of distance between probability measures. Where statistical distance measures relate to differences between random variables, these variables may be statistically dependent,[1] and hence these distances are not directly related to measures of distance between probability measures. Again, a measure of distance between random variables may relate to the extent of dependence between them rather than to their individual values.
Statistical distance measures are mostly not metrics and they need not be symmetric. Some types of distance measures are referred to as (statistical) divergences.
Distances as metrics
Metrics
A metric on a set X is a function (called the distance function or simply distance)
d : X × X → R+ (where R+ is the set of non-negative real numbers). For all x, y, z in X, this function is required to satisfy the following conditions (checked numerically in the sketch after the list):
- d(x, y) ≥ 0 (non-negativity)
- d(x, y) = 0 if and only if x = y (identity of indiscernibles; note that conditions 1 and 2 together produce positive definiteness)
- d(x, y) = d(y, x) (symmetry)
- d(x, z) ≤ d(x, y) + d(y, z) (subadditivity / triangle inequality).
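As an illustrative sketch (not part of the standard presentation), the four conditions can be checked numerically for the total variation distance, one of the examples listed below; the helper name total_variation and the particular distributions are assumptions chosen for the example, not a library API.

```python
import itertools
import numpy as np

def total_variation(p, q):
    """Total variation distance between two discrete probability
    distributions given over the same finite support."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

# Three illustrative distributions on a common three-point support.
dists = {
    "p": np.array([0.5, 0.3, 0.2]),
    "q": np.array([0.1, 0.6, 0.3]),
    "r": np.array([0.3, 0.3, 0.4]),
}

for a, b in itertools.product(dists, repeat=2):
    d = total_variation(dists[a], dists[b])
    assert d >= 0                                              # condition 1
    assert (d == 0) == np.array_equal(dists[a], dists[b])      # condition 2
    assert np.isclose(d, total_variation(dists[b], dists[a]))  # condition 3

for a, b, c in itertools.product(dists, repeat=3):
    assert (total_variation(dists[a], dists[c]) <=
            total_variation(dists[a], dists[b]) +
            total_variation(dists[b], dists[c]) + 1e-12)       # condition 4
```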
Generalized metrics
Many statistical distances are not metrics, because they lack one or more properties of proper metrics. For example, pseudometrics violate the "positive definiteness" (alternatively, "identity of indiscernibles") property (conditions 1 and 2 above); quasimetrics violate the symmetry property (3); and semimetrics violate the triangle inequality (4). Some statistical distances are referred to as divergences.
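For instance, the Kullback–Leibler divergence (listed under Examples below) is non-negative and zero only when the two distributions coincide, yet it fails both symmetry and the triangle inequality. A minimal Python sketch of the asymmetry, assuming discrete distributions with strictly positive probabilities; the distributions are arbitrary illustrative choices:

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback–Leibler divergence D(p || q) for discrete distributions
    with strictly positive probabilities on a common support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.8, 0.1, 0.1])
q = np.array([0.1, 0.2, 0.7])

# Non-negative, and zero only for identical distributions, but asymmetric:
print(kl_divergence(p, q))  # ~1.40
print(kl_divergence(q, p))  # ~1.29 -- differs, so the symmetry property fails
```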
Examples
Some important statistical distances include the following (a short computational sketch follows the list):
- f-divergence: includes
- Kullback–Leibler divergence
- Hellinger distance
- Total variation distance (sometimes just called "the" statistical distance)
- Rényi's divergence
- Jensen–Shannon divergence
- Lévy–Prokhorov metric
- Bhattacharyya distance
- Wasserstein metric: also known as the Kantorovich metric, or earth mover's distance
- The Kolmogorov–Smirnov statistic represents a distance between two probability distributions defined on a single real variable
- The maximum mean discrepancy, which is defined in terms of the kernel embedding of distributions
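The sketch below (an illustration, not part of the article) computes several of the listed distances: total variation, Hellinger, and Kullback–Leibler for two discrete distributions, plus the Kolmogorov–Smirnov statistic and the 1-D Wasserstein (earth mover's) distance for two samples using SciPy's ks_2samp and wasserstein_distance; the particular distributions and samples are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

# Two discrete distributions on a common support.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.3])

tv = 0.5 * np.abs(p - q).sum()                                     # total variation
hellinger = np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum())  # Hellinger
kl = (p * np.log(p / q)).sum()                                     # Kullback–Leibler

# Two samples of a single real variable.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=1000)
y = rng.normal(loc=0.5, scale=1.2, size=1000)

ks = stats.ks_2samp(x, y).statistic      # Kolmogorov–Smirnov statistic
emd = stats.wasserstein_distance(x, y)   # 1-D Wasserstein / earth mover's distance

print(tv, hellinger, kl, ks, emd)
```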
Other approaches
- Signal-to-noise ratio distance
- Mahalanobis distance
- Energy distance
- Distance correlation is a measure of dependence between two random variables; it is zero if and only if the random variables are independent (a computational sketch follows this list).
- The continuous ranked probability score measures how well forecasts that are expressed as probability distributions match observed outcomes. Both the location and the spread of the forecast distribution are taken into account in judging how close the distribution is to the observed value: see probabilistic forecasting.
- The Łukaszyk–Karmowski metric is a function defining a distance between two random variables or two random vectors. It does not satisfy the identity of indiscernibles condition of the metric, and is zero if and only if both its arguments are certain events described by Dirac delta density probability distribution functions.
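As a hedged sketch of one entry in this list, the sample distance correlation can be computed from doubly centered pairwise distance matrices (following Székely's definition); the helper name distance_correlation is a hypothetical choice for this example.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation of two 1-D samples: zero (in the
    population) if and only if the variables are independent."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    a = np.abs(x[:, None] - x[None, :])   # pairwise distance matrices
    b = np.abs(y[:, None] - y[None, :])
    # Double-center each matrix: subtract row and column means, add grand mean.
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                          # squared distance covariance
    dvar2_x, dvar2_y = (A * A).mean(), (B * B).mean()
    return float(np.sqrt(dcov2 / np.sqrt(dvar2_x * dvar2_y)))

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
print(distance_correlation(x, x ** 2))                 # dependent: clearly > 0
print(distance_correlation(x, rng.normal(size=2000)))  # independent: near 0
```

Note that the first pair is uncorrelated in the Pearson sense yet strongly dependent, which the distance correlation detects.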
Notes
1. Dodge, Y. (2003), entry for distance
References
- Dodge, Y. (2003). Oxford Dictionary of Statistical Terms. OUP. ISBN 0-19-920613-9