Integral probability metric

In probability theory, integral probability metrics are types of distance functions between probability distributions, defined by how well a class of functions can distinguish the two distributions. Many important statistical distances are integral probability metrics, including the Wasserstein-1 distance and the total variation distance. In addition to theoretical importance, integral probability metrics are widely used in areas of statistics and machine learning.

The name "integral probability metric" was given by German statistician Alfred Müller;^[1] the distances had also previously been called "metrics with a $ζ$ -structure."^[2]

Definition

Integral probability metrics are distances on the space of distributions over a set ${\mathcal {X}}$ , defined by a class ${\mathcal {F}}$ of real-valued functions on ${\mathcal {X}}$ as $D_{\mathcal {F}}(P,Q)=\sup _{f\in {\mathcal {F}}}{\big |}\mathbb {E} _{X\sim P}f(X)-\mathbb {E} _{Y\sim Q}f(Y){\big |}=\sup _{f\in {\mathcal {F}}}{\big |}Pf-Qf{\big |};$ here the notation $P f$ refers to the expectation of $f$ under the distribution $P$ . The absolute value in the definition is unnecessary, and often omitted, for the usual case where for every $f\in {\mathcal {F}}$ its negation $-f$ is also in ${\mathcal {F}}$ .

The function $f$ being optimized over is known as the "witness function" or the "critic"; the term "witness" is used especially when a particular $f^{*}\in {\mathcal {F}}$ achieves the supremum, as it "witnesses" the difference in the distributions. These functions try to have large values for samples from $P$ and small (likely negative) values for samples from $Q$ .
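As a minimal sketch of the definition, the supremum can be evaluated exactly when both distributions have finite support and the function class is finite. The helper names `expectation` and `ipm`, the three-function class, and the two distributions below are hypothetical choices made for illustration; they are not taken from the cited sources.

```python
from typing import Callable, Dict, Sequence, Tuple

# A finitely supported distribution: maps each outcome to its probability.
Distribution = Dict[float, float]
Function = Callable[[float], float]


def expectation(dist: Distribution, f: Function) -> float:
    """Expectation of f under a finitely supported distribution (the quantity Pf)."""
    return sum(prob * f(x) for x, prob in dist.items())


def ipm(P: Distribution, Q: Distribution,
        function_class: Sequence[Function]) -> Tuple[float, Function]:
    """Return (D_F(P, Q), witness) for a finite function class F.

    The supremum over F becomes a maximum, and the witness is a function
    that attains it.
    """
    gaps = [(abs(expectation(P, f) - expectation(Q, f)), i)
            for i, f in enumerate(function_class)]
    best_gap, best_index = max(gaps)
    return best_gap, function_class[best_index]


# Illustrative (assumed) data: P and Q have the same mean but different spread.
P = {0.0: 0.5, 1.0: 0.0, 2.0: 0.5}
Q = {0.0: 0.0, 1.0: 1.0, 2.0: 0.0}

function_class = [
    lambda x: x,                        # compares means
    lambda x: x * x,                    # compares second moments
    lambda x: 1.0 if x <= 0 else 0.0,   # indicator of (-inf, 0]
]

distance, witness = ipm(P, Q, function_class)
print(distance)
```

In this example the identity function cannot tell the two distributions apart, because their means agree, so the maximum is attained by another member of the class. Which function ends up as the witness depends entirely on the class supplied.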

The choice of ${\mathcal {F}}$ determines the particular distance; more than one ${\mathcal {F}}$ can generate the same distance.^[1]

For any choice of ${\mathcal {F}}$ , $D_{\mathcal {F}}$ satisfies all the axioms of a metric except that we may have $D_{\mathcal {F}}(P,Q)=0$ for some $P \neq Q$ ; such a function is variously termed a "pseudometric" or a "semimetric" depending on the community. For instance, using the class ${\mathcal {F}}=\{x\mapsto 0\}$ which only contains the zero function, $D_{\mathcal {F}}(P,Q)$ is identically zero. $D_{\mathcal {F}}$ is a metric if and only if ${\mathcal {F}}$ separates points on the space of probability distributions, i.e. for any $P \neq Q$ there is some $f\in {\mathcal {F}}$ such that $Pf\neq Qf$ .^[1]
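A less trivial illustration of the same failure, using distributions and function classes chosen here for illustration (they are not from the cited sources): a class containing only the identity function compares means and nothing else, so it cannot separate two distinct distributions with the same mean.

```latex
% Class with only the identity function; P is a point mass at 0,
% Q puts mass 1/2 on each of -1 and 1.
\mathcal{F}_1 = \{x \mapsto x\}, \qquad
P = \delta_0, \qquad
Q = \tfrac{1}{2}\delta_{-1} + \tfrac{1}{2}\delta_{1}

% Both means vanish, so the distance is zero although P \neq Q.
Pf = 0, \qquad Qf = \tfrac{1}{2}(-1) + \tfrac{1}{2}(1) = 0
\quad\Longrightarrow\quad D_{\mathcal{F}_1}(P,Q) = 0

% Adding g(x) = x^2 separates this particular pair.
\mathcal{F}_2 = \{x \mapsto x,\; x \mapsto x^2\}, \qquad
Pg = 0, \qquad Qg = 1
\quad\Longrightarrow\quad D_{\mathcal{F}_2}(P,Q) = 1
```

The enlarged class separates this pair, but it still does not separate every pair of distributions (any two distributions sharing their first two moments are at distance zero), so $D_{\mathcal{F}_2}$ remains a pseudometric.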

Examples

The Wasserstein-1 distance, via its dual representation, has ${\mathcal {F}}$ the set of 1-Lipschitz functions.
The related Dudley metric is generated by the set of bounded 1-Lipschitz functions.
The total variation distance can be generated by ${\mathcal {F}}=\{f:{\mathcal {X}}\to \{0,1\}\}$ , so that ${\mathcal {F}}$ is a set of indicator functions for any event, or by the larger class ${\mathcal {F}}=\{f:{\mathcal {X}}\to [0,1]\}$ .
The closely related Radon metric is generated by continuous functions bounded in $[-1, 1]$ .
The Kolmogorov metric used in the Kolmogorov-Smirnov test has a function class of indicator functions, ${\mathcal {F}}=\{1_{(-\infty ,t]}:t\in \mathbb {R} \}$ .
The kernel maximum mean discrepancy (MMD) has ${\mathcal {F}}$ the unit ball in a reproducing kernel Hilbert space. This distance is particularly easy to estimate from samples, requiring no optimization.
Variants of generative adversarial networks and classifier-based two-sample tests^[3]^[4] frequently use a "neural net distance"^[5]^[6] where ${\mathcal {F}}$ is a class of neural networks.
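Two of the examples above can be computed directly between the empirical distributions of two one-dimensional samples, with no optimization over ${\mathcal {F}}$. The following is a sketch under stated assumptions: the function names are hypothetical, the MMD uses a Gaussian kernel with a hand-picked bandwidth (the choice of kernel is not specified in the text), and the MMD estimate is the simple plug-in (biased) one rather than an unbiased estimator.

```python
import math
from typing import Sequence


def kolmogorov_distance(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Kolmogorov metric between two empirical distributions.

    The function class is {1_(-inf, t]}, so the expectation of each member is
    an empirical CDF value. Both CDFs are step functions that jump only at
    sample points, so checking thresholds t at the pooled sample suffices.
    """
    thresholds = sorted(set(xs) | set(ys))

    def ecdf(sample: Sequence[float], t: float) -> float:
        return sum(1 for v in sample if v <= t) / len(sample)

    return max(abs(ecdf(xs, t) - ecdf(ys, t)) for t in thresholds)


def gaussian_kernel(x: float, y: float, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel; an assumed choice of reproducing kernel."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd_plugin(xs: Sequence[float], ys: Sequence[float],
               bandwidth: float = 1.0) -> float:
    """Plug-in MMD estimate: the MMD between the two empirical distributions.

    Uses the closed form MMD^2 = E k(X, X') + E k(Y, Y') - 2 E k(X, Y),
    with each expectation replaced by an average over all sample pairs.
    """
    def mean_kernel(a: Sequence[float], b: Sequence[float]) -> float:
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    squared = mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
    return math.sqrt(max(squared, 0.0))  # guard against tiny negative rounding error


# Illustrative (made-up) samples with non-overlapping ranges.
xs = [0.1, 0.4, 0.5, 0.9]
ys = [1.2, 1.5, 1.7, 2.0]

print(kolmogorov_distance(xs, ys))
print(mmd_plugin(xs, ys, bandwidth=0.5))
```

The closed form in `mmd_plugin` is what makes the MMD convenient: the supremum over the unit ball of the reproducing kernel Hilbert space reduces to averages of kernel evaluations. Other examples in the list, such as the Wasserstein-1 or neural net distances, generally require solving an optimization problem instead.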

Relationship to $f$ -divergences

compare case of differing supports; does the KALE paper talk about this nicely, maybe?

The total variation distance is the only nontrivial distance that is both an integral probability metric and an $f$ -divergence.^[7]
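The two views of the total variation distance can be checked numerically on a small discrete example. The sketch below is illustrative only: the function names and the two distributions are hypothetical, the integral probability metric form brute-forces the supremum over indicator functions of all events, and the $f$ -divergence form uses the generator $f(t)=\tfrac{1}{2}|t-1|$ and assumes $Q$ assigns positive mass to every outcome.

```python
from itertools import combinations
from typing import Dict

Distribution = Dict[str, float]  # maps each outcome to its probability


def tv_as_ipm(P: Distribution, Q: Distribution) -> float:
    """Total variation as an IPM: sup over indicator functions of events.

    Enumerates every subset of the joint support, so this is only
    practical for very small outcome spaces.
    """
    support = sorted(set(P) | set(Q))
    best = 0.0
    for size in range(len(support) + 1):
        for event in combinations(support, size):
            p_event = sum(P.get(x, 0.0) for x in event)
            q_event = sum(Q.get(x, 0.0) for x in event)
            best = max(best, abs(p_event - q_event))
    return best


def tv_as_f_divergence(P: Distribution, Q: Distribution) -> float:
    """Total variation as an f-divergence with f(t) = |t - 1| / 2.

    Computes sum_x Q(x) * f(P(x) / Q(x)); assumes Q(x) > 0 for every
    outcome listed in Q and that P puts no mass outside Q's support.
    """
    return sum(q * 0.5 * abs(P.get(x, 0.0) / q - 1.0) for x, q in Q.items())


# Illustrative (made-up) distributions on three outcomes.
P = {"a": 0.5, "b": 0.3, "c": 0.2}
Q = {"a": 0.2, "b": 0.3, "c": 0.5}

print(tv_as_ipm(P, Q), tv_as_f_divergence(P, Q))
```

Both functions return the same value up to floating-point rounding, which is half the sum of absolute differences between the two probability vectors.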

maybe also say that it's the only overlap with Lp distances? (is this proven somewhere?)

Estimation

See Sriperumbudur et al.^[7]

the data-splitting estimator from Demystifying?

References

[1] Müller, Alfred (June 1997). "Integral Probability Metrics and Their Generating Classes of Functions". Advances in Applied Probability. 29 (2): 429–443. doi:10.2307/1428011. JSTOR 1428011. S2CID 124648603.
[2] Zolotarev, V. M. (January 1984). "Probability Metrics". Theory of Probability & Its Applications. 28 (2): 278–302. doi:10.1137/1128025.
[3] Kim, Ilmun; Ramdas, Aaditya; Singh, Aarti; Wasserman, Larry (February 2021). "Classification accuracy as a proxy for two-sample testing". The Annals of Statistics. 49 (1). arXiv:1703.00573. doi:10.1214/20-AOS1962. S2CID 17668083.
[4] Lopez-Paz, David; Oquab, Maxime (2017). "Revisiting Classifier Two-Sample Tests". International Conference on Learning Representations. arXiv:1610.06545.
[5] Arora, Sanjeev; Ge, Rong; Liang, Yingyu; Ma, Tengyu; Zhang, Yi (2017). "Generalization and Equilibrium in Generative Adversarial Nets (GANs)". International Conference on Machine Learning. arXiv:1703.00573.
[6] Ji, Kaiyi; Liang, Yingbin (2018). "Minimax Estimation of Neural Net Distance". Advances in Neural Information Processing Systems. arXiv:1811.01054.
[7] Sriperumbudur, Bharath K.; Fukumizu, Kenji; Gretton, Arthur; Schölkopf, Bernhard; Lanckriet, Gert R. G. (2009). "On integral probability metrics, φ-divergences and binary classification". arXiv:0901.2698 [cs.IT].

