Jump to content

Information bottleneck method: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
Adding short description: "Technique in information theory"
 
(199 intermediate revisions by 70 users not shown)
Line 1: Line 1:
{{Short description|Technique in information theory}}
The '''information bottleneck method''' is a technique for finding the best trade-off between [[accuracy]] and [[Data compression|compression]] when summarizing (e.g. [[data clustering|clustering]]) a [[random variable]] '''X''' when given a joint [[probability distribution]] between '''X''' and an observed variable '''Y'''.
The '''information bottleneck method''' is a technique in [[information theory]] introduced by [[Naftali Tishby]], Fernando C. Pereira, and [[William Bialek]].<ref name=":0">{{cite conference |url=http://www.cs.huji.ac.il/labs/learning/Papers/allerton.pdf|title=The Information Bottleneck Method|conference=The 37th annual Allerton Conference on Communication, Control, and Computing|last1=Tishby|first1=Naftali|author-link1=Naftali Tishby|last2=Pereira|first2=Fernando C.|last3=Bialek|first3=William|author-link3=William Bialek|date=September 1999|pages=368–377}}</ref> It is designed for finding the best tradeoff between [[accuracy]] and complexity ([[Data compression|compression]]) when [[random variable|summarizing]] (e.g. [[data clustering|clustering]]) a [[random variable]] '''X''', given a [[joint probability distribution]] '''p(X,Y)''' between '''X''' and an observed relevant variable '''Y''' - and self-described as providing ''"a surprisingly rich framework for discussing a variety of problems in signal processing and learning"''.<ref name=":0"/>


Applications include distributional clustering and [[dimension reduction]], and more recently it has been suggested as a theoretical foundation for [[deep learning]]. It generalized the classical notion of minimal [[sufficient statistics]] from [[parametric statistics]] to arbitrary distributions, not necessarily of exponential form. It does so by relaxing the sufficiency condition to capture some fraction of the [[mutual information]] with the relevant variable '''Y'''.


The information bottleneck can also be viewed as a [[Rate–distortion theory|rate distortion]] problem, with a distortion function that measures how well '''Y''' is predicted from a compressed representation '''T''' compared to its direct prediction from '''X'''. This interpretation provides a general iterative algorithm for solving the information bottleneck trade-off and calculating the information curve from the distribution '''p(X,Y)'''.
The compressed variable is <math>T\,</math> and the algorithm minimises the following quantity<br /><br />
<math>\underset {p(t|x)}{\min} \,\, I(X;T) - \beta I(T;Y)</math><br /><br />
where <math>I(X;T)\,\,I(T;Y)</math> are the mutual informations between <math>X;T \,</math> and <math>T;Y \,</math> respectively.


Let the compressed representation be given by random variable <math>T</math>. The algorithm minimizes the following functional with respect to conditional distribution <math>p(t|x)</math>:
= Gaussian Information Bottleneck [1] =


: <math> \inf_{p(t|x)} \,\, \Big( I(X;T) - \beta I(T;Y) \Big),</math>
A relatively simple application of the information bottleneck is to Gaussian variates and this has some semblance to a least squares reduced rank or canonical approximation. Assume <math>X, Y \,</math> are jointly multivariate zero mean normal vectors and <math>T\,</math> is a compressed version of <math>X\,</math> which must maintain a given value of mutual information with <math>Y\,</math>. It can be shown that the optimum <math>T\,</math> is a normal vector consisting of orthogonal linear combinations of the elements of <math>X , \,\, T=AX \,</math>.

where <math>I(X;T)</math> and <math>I(T;Y)</math> are the mutual information of <math>X</math> and <math>T</math>, and of <math>T</math> and <math>Y</math>, respectively, and <math>\beta</math> is a [[Lagrange multiplier]].

== Learning theory for deep learning ==
It has been mathematically proven that controlling information bottleneck is one way to control [[generalization error]] in deep learning.<ref>Kenji Kawaguchi, Zhun Deng, Xu Ji, Jiaoyang Huang.[https://proceedings.mlr.press/v202/kawaguchi23a.html "How Does Information Bottleneck Help Deep Learning?"] Proceedings of the 40th International Conference on Machine Learning, PMLR 202:16049-16096, 2023.</ref> Namely, the generalization error is proven to scale as <math>\tilde O\left(\sqrt{\frac{I(X,T)+1}{n}}\right)</math> where <math>n</math> is the number of training samples, <math>X</math> is the input to a deep neural network, and <math>T</math> is the output of a hidden layer. This generalization bound scale with the degree of information bottleneck, unlike the other generalization bounds that scale with the number of parameters, [[VC dimension]], [[Rademacher complexity]], stability or robustness.

== Phase transitions ==

{{Empty section|date=December 2018}}

== Information theory of deep learning ==

Theory of Information Bottleneck is recently used to study Deep Neural Networks (DNN).<ref name=":4">{{cite arXiv |last1=Shwartz-Ziv |first1=Ravid |last2=Tishby |first2=Naftali |title=Opening the black box of deep neural networks via information |eprint=1703.00810|class=cs.LG |year=2017 }}</ref>
Consider <math>X </math> and <math>Y </math> respectively as the input and output layers of a DNN, and let <math>T </math> be any hidden layer of the network.
Shwartz-Ziv and Tishby proposed the information bottleneck that expresses the tradeoff between the mutual information measures <math>I(X,T)</math> and <math>I(T,Y)</math>. In this case, <math>I(X,T)</math> and <math>I(T,Y) </math> respectively quantify the amount of information that the hidden layer contains about the input and the output.
They conjectured that the training process of a DNN consists of two separate phases; 1) an initial fitting phase in which <math>I(T,Y)</math> increases, and 2) a subsequent compression phase in which <math>I(X,T)</math> decreases. Saxe et al. in <ref>{{cite journal|last1=Andrew M|first1=Saxe|display-authors=etal|date=2018|title=On the information bottleneck theory of deep learning.|url=https://openreview.net/pdf?id=ry_WPG-A-|journal=ICLR 2018 Conference Blind Submission|volume=2019|issue=12|page=124020|doi=10.1088/1742-5468/ab3985|bibcode=2019JSMTE..12.4020S|s2cid=49584497}}</ref> countered the claim of Shwartz-Ziv and Tishby,<ref name=":4" /> stating that this compression phenomenon in DNNs is not comprehensive, and it depends on the particular activation function. In particular, they claimed that the compression does not happen with ReLu activation functions. Shwartz-Ziv and Tishby disputed these claims, arguing that Saxe et al. had not observed compression due to weak estimates of the mutual information. Recently, Noshad et al. used a rate-optimal estimator of mutual information to explore this controversy, observing that the optimal hash-based estimator reveals the compression phenomenon in a wider range of networks with ReLu and maxpooling activations.<ref>{{cite arXiv |last1=Noshad |first1=Morteza |display-authors=etal |title=Scalable Mutual Information Estimation using Dependence Graphs |eprint=1801.09125 |date=2018|class=cs.IT }}</ref> On the other hand, recently Goldfeld et al. have argued that the observed compression is a result of geometric, and not of information-theoretic phenomena,<ref>{{cite journal|last1=Goldfeld|first1=Ziv|display-authors=etal|date=2019|title=Estimating Information Flow in Deep Neural Networks|url=http://proceedings.mlr.press/v97/goldfeld19a.html|journal=Icml 2019|pages=2299–2308|arxiv=1810.05728}}</ref> a view that has been shared also in.<ref>{{cite journal|first1=Bernhard C.|last1=Geiger|title=On Information Plane Analyses of Neural Network Classifiers—A Review|journal=IEEE Transactions on Neural Networks and Learning Systems |date=2022|volume=33 |issue=12 |pages=7039–7051 |doi=10.1109/TNNLS.2021.3089037 |pmid=34191733 |arxiv=2003.09671|s2cid=214611728 }}</ref>

== Variational bottleneck ==

{{Empty section|date=December 2018}}

== Gaussian bottleneck ==

The Gaussian bottleneck,<ref>{{Cite journal|url = http://www.jmlr.org/papers/volume6/chechik05a/chechik05a.pdf|title = Information Bottleneck for Gaussian Variables|last1 = Chechik|first1 = Gal|date = 1 January 2005|journal = Journal of Machine Learning Research|issue = 6|publication-date = 1 May 2005|pages = 165–188|last2 = Globerson|first2 = Amir|last3 = Tishby|first3 = Naftali|last4 = Weiss|first4 = Yair|editor-last = Dayan|editor-first = Peter}}</ref> namely, applying the information bottleneck approach to Gaussian variables, leads to solutions related to [[canonical correlation]] analysis. Assume <math>X, Y \,</math> are jointly multivariate zero mean normal vectors with covariances <math>\Sigma_{XX}, \,\, \Sigma_{YY}</math> and <math>T\,</math> is a compressed version of <math>X\,</math> that must maintain a given value of mutual information with <math>Y\,</math>. It can be shown that the optimum <math>T\,</math> is a normal vector consisting of linear combinations of the elements of <math>X , \,\, T=AX \,</math> where matrix <math>A \,</math> has orthogonal rows.
The projection matrix <math>A\,</math> contains <math>M\,</math> rows selected from the weighted left eigenvectors of the singular value decomposition of the following matrix (generally asymmetric) <br /><br />
The projection matrix <math>A\,</math> in fact contains <math>M\,</math> rows selected from the weighted left [[eigenvectors]] of the [[singular value decomposition]] of the matrix (generally asymmetric)


<math>\Omega = \Sigma_{X|Y} \Sigma_{XX}^{-1} = I - \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{XY}^T \Sigma_{XX}^{-1}</math><br /><br />
: <math>\Omega = \Sigma_{X|Y} \Sigma_{XX}^{-1} = I - \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{XY}^T \Sigma_{XX}^{-1}.\,</math>
Define the singular value decomposition<br />
<math>\Omega = U\Lambda V^T \,</math> with <math>\lambda_1 \le \lambda_2 \cdots \lambda_N \,</math><br />
and the critical values<br /> <br />
<math>\beta_i^C \underset {\lambda_i < 1}{=} (1 - \lambda_i )^{-1}</math>.<br />


Define the singular value decomposition
then the number <math>M \,</math> of active eigenvectors in the projection, or order of approximation, is given by<br />
<math>\beta_{M-1}^C < \beta \le \beta_M^C</math> <br />
And we finally get<br /><br />
<math>A=[ w_1 U_1 , \dots , w_M U_M ]^T</math><br /><br />


: <math>\Omega = U\Lambda V^T\text{ with }\Lambda = \operatorname{Diag} \big ( \lambda_1 \le \lambda_2 \cdots \lambda_N \big ) \,</math>
In which the weights are given by<br /><br />
<math>w_i = \sqrt{(\beta (1- \lambda_i )/\lambda_i r_i}</math><br /><br />
where <math>r_i = U_i^T \Sigma_{xx} U_i</math> .


and the critical values
= Data Clustering using the Information Bottleneck =


: <math>\beta_i^C \underset {\lambda_i < 1}{=} (1 - \lambda_i )^{-1}. \, </math>
This application of the bottleneck method to non-Gaussian sampled data is described in [2] by Tishby et. el. The concept, as treated there, is not without complication as there are two independent phases in the exercise: firstly estimation of the unknown parent probability densities from which the data samples are drawn and secondly the use of these densities within the information theoretic framework of the bottleneck.


then the number <math>M \,</math> of active eigenvectors in the projection, or order of approximation, is given by
== Density Estimation ==


: <math>\beta_{M-1}^C < \beta \le \beta_M^C</math>
Since the bottleneck method is framed in probabilistic rather than statistical terms, we first need to estimate the underlying probability density at the sample points <math>X</math>. This is a well known problem with a number of solutions described by Silverman in [3]. In the present method, probability densities at the sample points are found by use of a Markov transition matrix method and this has some mathematical synergy with the bottleneck method itself.


And we finally get
Define an arbitrarily increasing distance metric <math>f \,</math> between all sample pairs and define distance matrix <math>d_{i,j}=f \Big ( \Big| x_i - x_j \Big | \Big )</math> . Then compute transition probabilities between sample pairs <math>P_{i,j}=exp (- \lambda d_{i,j} ) \,</math> for some <math>\lambda > 0 \,</math>. Treating samples as states, and <math>P \,</math> as a Markov state transition probability matrix, the vector of probabilities of the ‘states’ after <math>t</math> steps, conditioned on the initial state <math>p(0) \,</math>, is <math>p(t)=P^t p(0) \,</math>. We are here interested only in the equilibrium probability vector <math>p(\infty ) \,</math> given, in the usual way, by the dominant left eigenvector of matrix <math>P \,</math> and is independent of the initialising vector <math>p(0) \,</math>. This Markov transition method establishes a probability at the sample points which is claimed to be proportional to the probabilities densities there.


: <math>A=[ w_1 U_1 , \dots , w_M U_M ]^T</math>
== Clusters ==
In the following, the reference vector <math>Y \,</math> contains sample categories and the joint probability <math>p(X,Y) \,</math> is assumed known. A cluster <math>\tilde x_k \,</math> is defined by its probability distribution over the data samples <math>x: \,\,\, p( \tilde x_k |x)</math>. In [2] Tishby et al present the following iterative set of equations to determine the clusters


In which the weights are given by
<math>\begin{cases}

p(\tilde x|x)=Kp(\tilde x) exp \Big( -\beta\,D_{KL} \Big[ p(y|x) \,|| \, p(y| \tilde x)\Big ] \Big)\\
: <math>w_i = \sqrt{\left(\beta (1- \lambda_i )-1\right)/\lambda_i r_i}</math>
p(y| \tilde x)=\textstyle \sum_x p(y|x)p( \tilde x | x) p(x) \big / p(\tilde x) \\

p(\tilde x) = \textstyle \sum_x p(\tilde x | x) p(x) \\
where <math>r_i = U_i^T \Sigma_{XX} U_i.\,</math>

Applying the Gaussian information bottleneck to [[time series]] (processes), yields solutions related to optimal predictive coding. This procedure is formally equivalent to '''linear''' Slow Feature Analysis.<ref>{{Cite journal|title = Predictive Coding and the Slowness Principle: An Information-Theoretic Approach|journal = Neural Computation|date = 2007-12-17|issn = 0899-7667|pages = 1026–1041|volume = 20|issue = 4|doi = 10.1162/neco.2008.01-07-455|pmid = 18085988|first1 = Felix|last1 = Creutzig|first2 = Henning|last2 = Sprekeler|author1-link=Felix Creutzig|citeseerx = 10.1.1.169.6917|s2cid = 2138951}}</ref>

Optimal temporal structures in linear dynamic systems can be revealed in the so-called past-future information bottleneck, an application of the bottleneck method to non-Gaussian sampled data.<ref>{{Cite journal|title = Past-future information bottleneck in dynamical systems|journal = Physical Review E|date = 2009-04-27|pages = 041925|volume = 79|issue = 4|doi = 10.1103/PhysRevE.79.041925|pmid = 19518274|first1 = Felix|last1 = Creutzig|first2 = Amir|last2 = Globerson|first3 = Naftali|last3 = Tishby|bibcode = 2009PhRvE..79d1925C}}</ref> The concept, as treated by Creutzig, Tishby et al., is not without complication as two independent phases make up in the exercise: firstly estimation of the unknown parent probability densities from which the data samples are drawn and secondly the use of these densities within the information theoretic framework of the bottleneck.

=== Density estimation ===

{{Main|Density estimation}}

Since the bottleneck method is framed in probabilistic rather than statistical terms, the underlying probability density at the sample points <math>X = {x_i} \,</math>must be estimated. This is a well known problem with multiple solutions described by [[Bernard Silverman|Silverman]].<ref name=":2" /> In the present method, joint sample probabilities are found by use of a [[Stochastic matrix|Markov transition matrix]] method and this has some mathematical synergy with the bottleneck method itself.

The arbitrarily increasing distance metric <math>f \,</math> between all sample pairs and [[distance matrix]] is <math>d_{i,j}=f \Big ( \Big| x_i - x_j \Big | \Big )</math> . Then transition probabilities between sample pairs <math>P_{i,j}=\exp (- \lambda d_{i,j} ) \,</math> for some <math>\lambda > 0 \,</math>must be computed. Treating samples as states, and a normalised version of <math>P \,</math> as a Markov state transition probability matrix, the vector of probabilities of the 'states' after <math>t \,</math> steps, conditioned on the initial state <math>p(0) \,</math>, is <math>p(t)=P^t p(0) \,</math>. The equilibrium probability vector <math>p(\infty ) \,</math> given, in the usual way, by the dominant eigenvector of matrix <math>P \,</math> which is independent of the initialising vector <math>p(0) \,</math>. This Markov transition method establishes a probability at the sample points which is claimed to be proportional to the probabilities' densities there.

Other interpretations of the use of the eigenvalues of distance matrix <math>d \,</math> are discussed in Silverman's ''Density Estimation for Statistics and Data Analysis''.<ref name=":2">{{cite book|last = Silverman|first = Bernie|title = Density Estimation for Statistics and Data Analysis|journal = Monographs on Statistics and Applied Probability|publisher = Chapman & Hall|year = 1986|isbn = 978-0412246203|author-link = Bernie Silverman|bibcode = 1986desd.book.....S|url-access = registration|url = https://archive.org/details/densityestimatio00silv_0}}</ref>

=== Clusters ===
In the following soft clustering example, the reference vector <math>Y \,</math> contains sample categories and the joint probability <math>p(X,Y) \,</math> is assumed known. A soft cluster <math>c_k \,</math> is defined by its probability distribution over the data samples <math>x_i: \,\,\, p( c_k |x_i)</math>. Tishby et al. presented<ref name=":0" /> the following iterative set of equations to determine the clusters which are ultimately a generalization of the [[Blahut-Arimoto algorithm]], developed in [[rate distortion theory]]. The application of this type of algorithm in neural networks appears to originate in entropy arguments arising in the application of [[Boltzmann distribution|Gibbs Distributions]] in deterministic annealing.<ref name=":3">{{Cite book|publisher = ACM|date = 2000-01-01|location = New York, NY, USA|isbn = 978-1-58113-226-7|pages = 208–215|series = SIGIR '00|doi = 10.1145/345508.345578|first1 = Noam|last1 = Slonim|first2 = Naftali|last2 = Tishby| title=Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval | chapter=Document clustering using word clusters via the information bottleneck method |citeseerx = 10.1.1.21.3062|s2cid = 1373541}}</ref><ref>D. J. Miller, A. V. Rao, K. Rose, A. Gersho: "An Information-theoretic Learning Algorithm for Neural Network Classification". NIPS 1995: pp.&nbsp;591–597</ref>

: <math>\begin{cases}
p(c|x)=Kp(c) \exp \Big( -\beta\,D^{KL} \Big[ p(y|x) \,|| \, p(y| c)\Big ] \Big)\\
p(y| c)=\textstyle \sum_x p(y|x)p( c | x) p(x) \big / p(c) \\
p(c) = \textstyle \sum_x p(c | x) p(x) \\
\end{cases}
\end{cases}
</math>
</math>


The function of each line of the iteration is expanded as follows.
The function of each line of the iteration expands as

'''Line 1:''' This is a matrix valued set of conditional probabilities

: <math>A_{i,j} = p(c_i | x_j )=Kp(c_i) \exp \Big( -\beta\,D^{KL} \Big[ p(y|x_j) \,|| \, p(y| c_i)\Big ] \Big)</math>

The [[Kullback–Leibler divergence]] <math>D^{KL} \,</math> between the <math>Y \,</math> vectors generated by the sample data <math>x \,</math> and those generated by its reduced information proxy <math>c \,</math> is applied to assess the fidelity of the compressed vector with respect to the reference (or categorical) data <math>Y \,</math> in accordance with the fundamental bottleneck equation. <math>D^{KL}(a||b)\,</math> is the Kullback–Leibler divergence between distributions <math>a, b \,</math>

: <math>D^{KL} (a||b)= \sum_i p(a_i) \log \Big ( \frac{p(a_i)}{p(b_i)} \Big ) </math>

and <math>K \,</math> is a scalar normalization. The weighting by the negative exponent of the distance means that prior cluster probabilities are downweighted in line 1 when the Kullback–Leibler divergence is large, thus successful clusters grow in probability while unsuccessful ones decay.

'''Line 2: '''Second matrix-valued set of conditional probabilities. By definition

: <math>\begin{align}
p(y_i|c_k) & = \sum_j p(y_i|x_j)p(x_j|c_k) \\
& =\sum_j p(y_i|x_j)p(x_j, c_k ) \big / p(c_k) \\
& =\sum_j p(y_i|x_j)p(c_k | x_j) p(x_j) \big / p(c_k) \\
\end{align}</math>
where the Bayes identities <math>p(a,b)=p(a|b)p(b)=p(b|a)p(a) \,</math> are used.

'''Line 3:''' this line finds the marginal distribution of the clusters <math>c \,</math>


: <math>\begin{align}
'''Line 1:''' This is a matrix valued set of conditional probabilities
p(c_i)& =\sum_j p(c_i , x_j)
& = \sum_j p(c_i | x_j) p(x_j)
\end{align}</math>


This is a standard result.
<math>A_{i,j} = p(\tilde x_i | x_j )=Kp(\tilde x_i) exp \Big( -\beta\,D^{KL} \Big[ p(y|x_j) \,|| \, p(y| \tilde x_i)\Big ] \Big)</math><br />


Further inputs to the algorithm are the marginal sample distribution <math>p(x) \,</math> which has already been determined by the dominant eigenvector of <math>P \,</math> and the matrix valued Kullback–Leibler divergence function
The Kullback Leibler distance <math>D^{KL} \,</math> between the <math>Y \,</math> vectors generated by the sample data <math>x \,</math> and those generated by its reduced information proxy <math>\tilde x \,</math> is applied to assess the fidelity of the compressed vector with respect to the categorical data Y in accordance with the fundamental bottleneck equation. <math>D^{KL}(a||b)\,</math> is the Kullback Leibler distance between distributions <math>a, b \,</math><br /><br />
<math>D^{KL}= \int p(a) log \Big ( \frac{p(a)}{p(b)} \Big ) da </math><br /><br />


: <math>D_{i,j}^{KL}=D^{KL} \Big[ p(y|x_j) \,|| \, p(y| c_i)\Big ] \Big)</math>
and <math>K \,</math> is a scalar normalization. The weighting by the negative exponent of the distance means that prior cluster probabilities are downweighted in line 1 when the Kullback Liebler distance is large, thus successful clusters grow in probability while unsuccessful ones decay.<br />


'''Line 2: '''This is a second matrix valued set of conditional probabilities<br /><br />
derived from the sample spacings and transition probabilities.
<math>B_{k,l}=p(y_i | x_k ) = \sum_k p(y_i | x_k ) p(\tilde x_j | x_k )p(x)\big / p(\tilde x_j )</math><br />
The steps in deriving this are as follows. We have, by definition


The matrix <math>p(y_i | c_j) \,</math> can be initialized randomly or with a reasonable guess, while matrix <math>p(c_i | x_j) \,</math> needs no prior values. Although the algorithm converges, multiple minima may exist that would need to be resolved.<ref name=":1">{{cite conference |url = https://papers.nips.cc/paper/1896-data-clustering-by-markovian-relaxation-and-the-information-bottleneck-method.pdf|title = Data clustering by Markovian Relaxation and the Information Bottleneck Method|conference = Neural Information Processing Systems (NIPS) 2000|last1 = Tishby|first1 = Naftali|author-link1 = Naftali Tishby|last2 = Slonim|first2 = N|pages = 640–646}}</ref>
<math>\begin{align}
p(y|\tilde x) & = \int_x p(y|x)p(x|\tilde x) \\
& =\int_x p(y|x)p(x, \tilde x ) \big / p(\tilde x) \\
& =\int_x p(y|x)p(\tilde x | x) p(x) \big / p(\tilde x) \\
\end{align}</math><br /><br />
where the Bayes identities <math>p(a,b)=p(a|b)p(b)=p(b|a)p(a) \,</math> are used. Finally the integral is rewritten as the summation over the sample points <math>k</math> as in the first equation above.<br /><br />


==Defining decision contours ==
'''Line 3:''' this line finds the marginal distribution of <math>\tilde x \,</math><br /><br />
<math>\begin{align}
p(\tilde x_i)& =\sum_j p(\tilde x_i | x_j) p(x_j)
& = \sum_j p(\tilde x_i , x_j)
\end{align}</math><br />


To categorize a new sample <math> x' \,</math> external to the training set <math>X \,</math>, the previous distance metric finds the transition probabilities between <math> x' \,</math> and all samples in <math>X: \,\,</math>, <math> \tilde p(x_i )= p(x_i | x')= \Kappa \exp \Big (-\lambda f \big ( \Big| x_i - x' \Big | \big ) \Big )</math> with <math>\Kappa \,</math> a normalization. Secondly apply the last two lines of the 3-line algorithm to get cluster and conditional category probabilities.
This is also derived from standard results.


: <math>\begin{align}
Further inputs to the algorithm are the marginal sample distribution <math>p(x) \,</math> which has already been determined by the dominant eigenvector of <math>P \,</math> and the matrix valued Kullback Leibler distance function
& \tilde p(c_i ) = p(c_i | x' ) = \sum_j p(c_i | x_j)p(x_j | x') =\sum_j p(c_i | x_j) \tilde p(x_j)\\
& p(y_i | c_j) = \sum_k p(y_i | x_k) p(c_j | x_k)p(x_k | x') / p(c_j | x' )
= \sum_k p(y_i | x_k) p(c_j | x_k) \tilde p(x_k) / \tilde p(c_j) \\
\end{align}</math>


Finally
<math>D_{i,j}^{KL}=D^{KL} \Big[ p(y|x_j) \,|| \, p(y| \tilde x_i)\Big ] \Big)</math> derived from the sample spacings and transition probabilities.


The matrices <math>p(y_i | \tilde x_j), p(\tilde x_i | x_j)</math> can be initialised randomly.<br /><br />
: <math>p(y_i | x')= \sum_j p(y_i | c_j) p(c_j | x') )= \sum_j p(y_i | c_j) \tilde p(c_j) \,</math>


Parameter <math>\beta \,</math> must be kept under close supervision since, as it is increased from zero, increasing numbers of features, in the category probability space, snap into focus at certain critical thresholds.
==Defining Decision Contours ==


=== An example ===
To categorize a new sample <math> x' \,</math> external to the training set <math>X</math>, first calculate the probabilities that it belongs to each of the various clusters which is the conditional probability <math>p(\tilde x | x' ) \,</math>. In order to find this, apply the previous distance metric to find the transition probabilities between <math> x' \,</math> and all samples in <math>X: \,\,</math>, <math>p(x_i)= exp \Big (-\lambda f \big ( \Big| x_i - x' \Big | \big ) \Big )</math>. Secondly apply the last two lines of the 3-line algorithm to get cluster, and conditional category probabilities.<br /><br />
The following case examines clustering in a four quadrant multiplier with random inputs <math>u, v \,</math> and two categories of output, <math>\pm 1 \,</math>, generated by <math>y=\operatorname{sign}(uv) \,</math>. This function has two spatially separated clusters for each category and so demonstrates that the method can handle such distributions.


20 samples are taken, uniformly distributed on the square <math>[-1,1]^2 \,</math> . The number of clusters used beyond the number of categories, two in this case, has little effect on performance and the results are shown for two clusters using parameters <math>\lambda = 3,\, \beta = 2.5</math>.
<math>\begin{align}
p(\tilde x_i ) & = \sum_j p(\tilde x_i | x_j)p(x_j) \\
p(y_i | \tilde x_j) & = \sum_k p(y_i | x_k) p(\tilde x_j | x_k)p(x_k) / p(\tilde x_j ) \\
\end{align}</math><br /><br />
Finally we have


The distance function is <math>d_{i,j} = \Big| x_i - x_j \Big |^2</math> where <math>x_i = (u_i,v_i)^T \, </math> while the conditional distribution <math>p(y|x)\, </math> is a 2&nbsp;×&nbsp;20 matrix


: <math>\begin{align} & Pr(y_i=1) = 1\text{ if }\operatorname{sign}(u_iv_i)=1\, \\
<math>p(y_i)= \sum_j p(y_i | \tilde x_j) p(\tilde x_j)</math><br />
& Pr(y_i= -1) = 1\text{ if }\operatorname{sign}(u_iv_i)= -1\,
\end{align}</math>


and zero elsewhere.
Generally the algorithm converges rapidly, often in tens of iterations. However parameter <math>\beta \,</math> must be kept under close supervision since, as it is increased from zero, increasing numbers of features, in the category probability space, click into focus at certain critical values.


The summation in line 2 incorporates only two values representing the training values of +1 or &minus;1, but nevertheless works well. The figure shows the locations of the twenty samples with '0' representing ''Y'' = 1 and 'x' representing ''Y'' = &minus;1. The contour at the unity likelihood ratio level is shown,
There is some analogy between this algorithm and a neural network with a single hidden layer. The nodes are represented by the clusters <math>\tilde x_j \,</math>. The first and second layers of network weights are the conditional probabilities <math>p(\tilde x_i | y_j)</math> and <math>p(y_i | \tilde x_j) </math> respectively. However, unlike a standard neural network, the present algorithm always uses probabilities of samples as inputs rather than the sample values themselves and non linear function are encapsulated in the Kullback Leibler distances and the transition probabilities rather than sigmoid functions. Compared to a neural network this algorithm seems to converge much more quickly and by varying <math>\beta \,</math> and <math>\lambda \,</math> various levels of focus on features can be achieved. There are also similarities to some varieties of Fuzzy Logic algorithms.


: <math>L= \frac{\Pr(1)}{\Pr(-1)} = 1</math>
For blind classification and clustering, the transient behaviour of <math>p(t) \,</math> is analysed and this is discussed in more detail in [2] but this extra complication is not necessary for the supervised training described here.


as a new sample <math>x' \,</math>is scanned over the square. Theoretically the contour should align with the <math>u=0 \,</math> and <math>v=0 \,</math> coordinates but for such small sample numbers they have instead followed the spurious clusterings of the sample points.
== An Example ==
[[Image:BottleCateg 1.jpg|thumb|Decision contours]]
In the following simple case we investigate clustering in a four quadrant multiplier with random inputs <math>u, v \,</math> and two categories of output, <math>\pm 1 \,</math>, generated by
<math>y=sign(uv) \,</math>. This function has the property that there are two spatially separated clusters for each category and so it demonstrates that the method can handle such distributions.


===Neural network/fuzzy logic analogies===
20 samples are taken, uniformly distributed on the square <math>[-1,1]^2 \,</math> . The number of clusters used beyond the number of categories, two in this case, has little effect on performance and the results are shown for two clusters using parameters <math>\lambda = 3,\, \beta = 2.5</math> and the distance function
This algorithm is somewhat analogous to a neural network with a single hidden layer. The internal nodes are represented by the clusters <math>c_j \,</math> and the first and second layers of network weights are the conditional probabilities <math>p(c_j | x_i) \,</math> and <math>p(y_k | c_j) \,</math> respectively. However, unlike a standard neural network, the algorithm relies entirely on probabilities as inputs rather than the sample values themselves, while internal and output values are all conditional probability density distributions. Nonlinear functions are encapsulated in distance metric <math>f(.) \,</math> (or ''influence functions/radial basis functions'') and transition probabilities instead of [[sigmoid function]]s.
<math>d_{i,j} = f \Big ( \Big| X_i - X_j \Big | \Big ) = \Big| X_i - X_j \Big |^2</math> where <math>X_i = (u_i,v_i)^T \,</math>. The figure shows the locations of the twenty samples with '0' representing ''Y'' = 1 and 'x' representing ''Y'' = -1. The contour at the unity likelihood ratio level is shown,
<math>L= Pr(1) \big/ Pr(-1) = 1</math>
as a new sample <math>x' \,</math>is scanned over the square. Theoretically the contour should align with the <math>X=0 \,</math> and <math>Y=0 \,</math> coordinates but for such small sample numbers they have instead followed the spurious clusterings of the sample points.
[[Image:Therustyone Bottle.jpg|thumb|Decision Contours]]


The Blahut-Arimoto three-line algorithm converges rapidly, often in tens of iterations, and by varying <math>\beta \,</math>, <math>\lambda \,</math> and <math>f \,</math> and the cardinality of the clusters, various levels of focus on features can be achieved.
==bibliography==
[1] G. Chechik, A Globerson, N. Tishby and Y. Weiss: “ Information Bottleneck for Gaussian Variables”. Journal of Machine Learning Research 6, Jan 2005, pp. 165-188


The statistical soft clustering definition <math>p(c_i | x_j) \,</math> has some overlap with the verbal fuzzy membership concept of [[fuzzy logic]].
[2] N Tishby, N Slonim: “Data clustering by Markovian Relaxation and the Information Bottleneck Method”, Neural Information Processing Systems (NIPS) 2000, pp. 640-646


==Extensions==
[3] B.W. Silverman: “Density Estimation for Statistical Data Analysis”, Chapman and Hall, 1986.
An interesting extension is the case of information bottleneck with side information.<ref>{{Cite journal|title = Extracting Relevant Structures with Side Information |url = http://papers.nips.cc/paper/2223-extracting-relevant-structures-with-side-information.pdf|journal = Advances in Neural Information Processing Systems|date = 2002|pages = 857–864|first1 = Gal|last1 = Chechik|first2 = Naftali|last2 = Tishby}}</ref> Here information is maximized about one target variable and minimized about another, learning a representation that is informative about selected aspects of data. Formally


: <math> \min_{p(t|x)} \,\, I(X;T) - \beta^+ I(T;Y^+) + \beta^- I(T;Y^-)</math>
==See also==
* [[Information theory]]


==External links==
==Bibliography==
* {{Citation|url=https://www.cs.huji.ac.il/~yweiss/iccv99.pdf|title=Proceedings IEEE International Conference on Computer Vision|contribution=Segmentation using eigenvectors: a unifying view|last=Weiss|first=Y.|date=1999|pages=975–982}}
* [http://citeseer.ist.psu.edu/tishby99information.html Paper by N. Tishby, et. al]
* P. Harremoës and N. Tishby [http://www.cs.huji.ac.il/labs/learning/Papers/flaske2.pdf "The Information Bottleneck Revisited or How to Choose a Good Distortion Measure". In proceedings of the International Symposium on Information Theory (ISIT) 2007]


== References ==
{{math-stub}}
{{Reflist|30em}}


[[Category:Machine learning]]
[[Category:Cluster analysis algorithms]]
[[Category:Multivariate statistics]]

Latest revision as of 17:20, 28 November 2024

The information bottleneck method is a technique in information theory introduced by Naftali Tishby, Fernando C. Pereira, and William Bialek.[1] It is designed for finding the best tradeoff between accuracy and complexity (compression) when summarizing (e.g. clustering) a random variable X, given a joint probability distribution p(X,Y) between X and an observed relevant variable Y - and self-described as providing "a surprisingly rich framework for discussing a variety of problems in signal processing and learning".[1]

Applications include distributional clustering and dimension reduction, and more recently it has been suggested as a theoretical foundation for deep learning. It generalized the classical notion of minimal sufficient statistics from parametric statistics to arbitrary distributions, not necessarily of exponential form. It does so by relaxing the sufficiency condition to capture some fraction of the mutual information with the relevant variable Y.

The information bottleneck can also be viewed as a rate distortion problem, with a distortion function that measures how well Y is predicted from a compressed representation T compared to its direct prediction from X. This interpretation provides a general iterative algorithm for solving the information bottleneck trade-off and calculating the information curve from the distribution p(X,Y).

Let the compressed representation be given by random variable . The algorithm minimizes the following functional with respect to conditional distribution :

where and are the mutual information of and , and of and , respectively, and is a Lagrange multiplier.

Learning theory for deep learning

[edit]

It has been mathematically proven that controlling information bottleneck is one way to control generalization error in deep learning.[2] Namely, the generalization error is proven to scale as where is the number of training samples, is the input to a deep neural network, and is the output of a hidden layer. This generalization bound scale with the degree of information bottleneck, unlike the other generalization bounds that scale with the number of parameters, VC dimension, Rademacher complexity, stability or robustness.

Phase transitions

[edit]

Information theory of deep learning

[edit]

Theory of Information Bottleneck is recently used to study Deep Neural Networks (DNN).[3] Consider and respectively as the input and output layers of a DNN, and let be any hidden layer of the network. Shwartz-Ziv and Tishby proposed the information bottleneck that expresses the tradeoff between the mutual information measures and . In this case, and respectively quantify the amount of information that the hidden layer contains about the input and the output. They conjectured that the training process of a DNN consists of two separate phases; 1) an initial fitting phase in which increases, and 2) a subsequent compression phase in which decreases. Saxe et al. in [4] countered the claim of Shwartz-Ziv and Tishby,[3] stating that this compression phenomenon in DNNs is not comprehensive, and it depends on the particular activation function. In particular, they claimed that the compression does not happen with ReLu activation functions. Shwartz-Ziv and Tishby disputed these claims, arguing that Saxe et al. had not observed compression due to weak estimates of the mutual information. Recently, Noshad et al. used a rate-optimal estimator of mutual information to explore this controversy, observing that the optimal hash-based estimator reveals the compression phenomenon in a wider range of networks with ReLu and maxpooling activations.[5] On the other hand, recently Goldfeld et al. have argued that the observed compression is a result of geometric, and not of information-theoretic phenomena,[6] a view that has been shared also in.[7]

Variational bottleneck

[edit]

Gaussian bottleneck

[edit]

The Gaussian bottleneck,[8] namely, applying the information bottleneck approach to Gaussian variables, leads to solutions related to canonical correlation analysis. Assume are jointly multivariate zero mean normal vectors with covariances and is a compressed version of that must maintain a given value of mutual information with . It can be shown that the optimum is a normal vector consisting of linear combinations of the elements of where matrix has orthogonal rows.

The projection matrix in fact contains rows selected from the weighted left eigenvectors of the singular value decomposition of the matrix (generally asymmetric)

Define the singular value decomposition

and the critical values

then the number of active eigenvectors in the projection, or order of approximation, is given by

And we finally get

In which the weights are given by

where

Applying the Gaussian information bottleneck to time series (processes), yields solutions related to optimal predictive coding. This procedure is formally equivalent to linear Slow Feature Analysis.[9]

Optimal temporal structures in linear dynamic systems can be revealed in the so-called past-future information bottleneck, an application of the bottleneck method to non-Gaussian sampled data.[10] The concept, as treated by Creutzig, Tishby et al., is not without complication as two independent phases make up in the exercise: firstly estimation of the unknown parent probability densities from which the data samples are drawn and secondly the use of these densities within the information theoretic framework of the bottleneck.

Density estimation

[edit]

Since the bottleneck method is framed in probabilistic rather than statistical terms, the underlying probability density at the sample points must be estimated. This is a well known problem with multiple solutions described by Silverman.[11] In the present method, joint sample probabilities are found by use of a Markov transition matrix method and this has some mathematical synergy with the bottleneck method itself.

The arbitrarily increasing distance metric between all sample pairs and distance matrix is . Then transition probabilities between sample pairs for some must be computed. Treating samples as states, and a normalised version of as a Markov state transition probability matrix, the vector of probabilities of the 'states' after steps, conditioned on the initial state , is . The equilibrium probability vector given, in the usual way, by the dominant eigenvector of matrix which is independent of the initialising vector . This Markov transition method establishes a probability at the sample points which is claimed to be proportional to the probabilities' densities there.

Other interpretations of the use of the eigenvalues of distance matrix are discussed in Silverman's Density Estimation for Statistics and Data Analysis.[11]

Clusters

[edit]

In the following soft clustering example, the reference vector contains sample categories and the joint probability is assumed known. A soft cluster is defined by its probability distribution over the data samples . Tishby et al. presented[1] the following iterative set of equations to determine the clusters which are ultimately a generalization of the Blahut-Arimoto algorithm, developed in rate distortion theory. The application of this type of algorithm in neural networks appears to originate in entropy arguments arising in the application of Gibbs Distributions in deterministic annealing.[12][13]

The function of each line of the iteration expands as

Line 1: This is a matrix valued set of conditional probabilities

The Kullback–Leibler divergence between the vectors generated by the sample data and those generated by its reduced information proxy is applied to assess the fidelity of the compressed vector with respect to the reference (or categorical) data in accordance with the fundamental bottleneck equation. is the Kullback–Leibler divergence between distributions

and is a scalar normalization. The weighting by the negative exponent of the distance means that prior cluster probabilities are downweighted in line 1 when the Kullback–Leibler divergence is large, thus successful clusters grow in probability while unsuccessful ones decay.

Line 2: Second matrix-valued set of conditional probabilities. By definition

where the Bayes identities are used.

Line 3: this line finds the marginal distribution of the clusters

This is a standard result.

Further inputs to the algorithm are the marginal sample distribution which has already been determined by the dominant eigenvector of and the matrix valued Kullback–Leibler divergence function

derived from the sample spacings and transition probabilities.

The matrix can be initialized randomly or with a reasonable guess, while matrix needs no prior values. Although the algorithm converges, multiple minima may exist that would need to be resolved.[14]

Defining decision contours

[edit]

To categorize a new sample external to the training set , the previous distance metric finds the transition probabilities between and all samples in , with a normalization. Secondly apply the last two lines of the 3-line algorithm to get cluster and conditional category probabilities.

Finally

Parameter must be kept under close supervision since, as it is increased from zero, increasing numbers of features, in the category probability space, snap into focus at certain critical thresholds.

An example

[edit]

The following case examines clustering in a four quadrant multiplier with random inputs and two categories of output, , generated by . This function has two spatially separated clusters for each category and so demonstrates that the method can handle such distributions.

20 samples are taken, uniformly distributed on the square . The number of clusters used beyond the number of categories, two in this case, has little effect on performance and the results are shown for two clusters using parameters .

The distance function is where while the conditional distribution is a 2 × 20 matrix

and zero elsewhere.

The summation in line 2 incorporates only two values representing the training values of +1 or −1, but nevertheless works well. The figure shows the locations of the twenty samples with '0' representing Y = 1 and 'x' representing Y = −1. The contour at the unity likelihood ratio level is shown,

as a new sample is scanned over the square. Theoretically the contour should align with the and coordinates but for such small sample numbers they have instead followed the spurious clusterings of the sample points.

Decision contours

Neural network/fuzzy logic analogies

[edit]

This algorithm is somewhat analogous to a neural network with a single hidden layer. The internal nodes are represented by the clusters and the first and second layers of network weights are the conditional probabilities and respectively. However, unlike a standard neural network, the algorithm relies entirely on probabilities as inputs rather than the sample values themselves, while internal and output values are all conditional probability density distributions. Nonlinear functions are encapsulated in distance metric (or influence functions/radial basis functions) and transition probabilities instead of sigmoid functions.

The Blahut-Arimoto three-line algorithm converges rapidly, often in tens of iterations, and by varying , and and the cardinality of the clusters, various levels of focus on features can be achieved.

The statistical soft clustering definition has some overlap with the verbal fuzzy membership concept of fuzzy logic.

Extensions

[edit]

An interesting extension is the case of information bottleneck with side information.[15] Here information is maximized about one target variable and minimized about another, learning a representation that is informative about selected aspects of data. Formally

Bibliography

[edit]
  • Weiss, Y. (1999), "Segmentation using eigenvectors: a unifying view", Proceedings IEEE International Conference on Computer Vision (PDF), pp. 975–982
  • P. Harremoës and N. Tishby "The Information Bottleneck Revisited or How to Choose a Good Distortion Measure". In proceedings of the International Symposium on Information Theory (ISIT) 2007

References

[edit]
  1. ^ a b c Tishby, Naftali; Pereira, Fernando C.; Bialek, William (September 1999). The Information Bottleneck Method (PDF). The 37th annual Allerton Conference on Communication, Control, and Computing. pp. 368–377.
  2. ^ Kenji Kawaguchi, Zhun Deng, Xu Ji, Jiaoyang Huang."How Does Information Bottleneck Help Deep Learning?" Proceedings of the 40th International Conference on Machine Learning, PMLR 202:16049-16096, 2023.
  3. ^ a b Shwartz-Ziv, Ravid; Tishby, Naftali (2017). "Opening the black box of deep neural networks via information". arXiv:1703.00810 [cs.LG].
  4. ^ Andrew M, Saxe; et al. (2018). "On the information bottleneck theory of deep learning". ICLR 2018 Conference Blind Submission. 2019 (12): 124020. Bibcode:2019JSMTE..12.4020S. doi:10.1088/1742-5468/ab3985. S2CID 49584497.
  5. ^ Noshad, Morteza; et al. (2018). "Scalable Mutual Information Estimation using Dependence Graphs". arXiv:1801.09125 [cs.IT].
  6. ^ Goldfeld, Ziv; et al. (2019). "Estimating Information Flow in Deep Neural Networks". Icml 2019: 2299–2308. arXiv:1810.05728.
  7. ^ Geiger, Bernhard C. (2022). "On Information Plane Analyses of Neural Network Classifiers—A Review". IEEE Transactions on Neural Networks and Learning Systems. 33 (12): 7039–7051. arXiv:2003.09671. doi:10.1109/TNNLS.2021.3089037. PMID 34191733. S2CID 214611728.
  8. ^ Chechik, Gal; Globerson, Amir; Tishby, Naftali; Weiss, Yair (1 January 2005). Dayan, Peter (ed.). "Information Bottleneck for Gaussian Variables" (PDF). Journal of Machine Learning Research (6) (published 1 May 2005): 165–188.
  9. ^ Creutzig, Felix; Sprekeler, Henning (2007-12-17). "Predictive Coding and the Slowness Principle: An Information-Theoretic Approach". Neural Computation. 20 (4): 1026–1041. CiteSeerX 10.1.1.169.6917. doi:10.1162/neco.2008.01-07-455. ISSN 0899-7667. PMID 18085988. S2CID 2138951.
  10. ^ Creutzig, Felix; Globerson, Amir; Tishby, Naftali (2009-04-27). "Past-future information bottleneck in dynamical systems". Physical Review E. 79 (4): 041925. Bibcode:2009PhRvE..79d1925C. doi:10.1103/PhysRevE.79.041925. PMID 19518274.
  11. ^ a b Silverman, Bernie (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall. Bibcode:1986desd.book.....S. ISBN 978-0412246203. {{cite book}}: |journal= ignored (help)
  12. ^ Slonim, Noam; Tishby, Naftali (2000-01-01). "Document clustering using word clusters via the information bottleneck method". Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval. SIGIR '00. New York, NY, USA: ACM. pp. 208–215. CiteSeerX 10.1.1.21.3062. doi:10.1145/345508.345578. ISBN 978-1-58113-226-7. S2CID 1373541.
  13. ^ D. J. Miller, A. V. Rao, K. Rose, A. Gersho: "An Information-theoretic Learning Algorithm for Neural Network Classification". NIPS 1995: pp. 591–597
  14. ^ Tishby, Naftali; Slonim, N. Data clustering by Markovian Relaxation and the Information Bottleneck Method (PDF). Neural Information Processing Systems (NIPS) 2000. pp. 640–646.
  15. ^ Chechik, Gal; Tishby, Naftali (2002). "Extracting Relevant Structures with Side Information" (PDF). Advances in Neural Information Processing Systems: 857–864.