Information bottleneck method
The information bottleneck method is a technique for finding the best trade-off between accuracy and compression when summarizing (e.g. clustering) a random variable X when given a joint probability distribution between X and an observed variable Y.
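The compressed variable is <math>T</math> and the algorithm minimises the following quantity
<math>\min_{p(t|x)} \; I(X;T) - \beta I(T;Y)</math>
where <math>I(X;T)</math> and <math>I(T;Y)</math> are the mutual informations between <math>X</math> and <math>T</math>, and between <math>T</math> and <math>Y</math>, respectively, and <math>\beta</math> is a Lagrange multiplier that sets the trade-off between compression and accuracy.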
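For discrete variables the bottleneck quantity can be evaluated directly. The sketch below assumes the standard functional <math>I(X;T) - \beta I(T;Y)</math>, a joint table p(x, y), and a candidate encoder p(t|x) stored with one column per value of x. The function names and array layout are illustrative choices, not part of the method, and logarithms are natural (nats).

```python
import numpy as np

def mutual_information(p_joint):
    """Mutual information (in nats) of a 2-D joint probability table."""
    p_a = p_joint.sum(axis=1, keepdims=True)   # marginal over rows, shape (n, 1)
    p_b = p_joint.sum(axis=0, keepdims=True)   # marginal over columns, shape (1, m)
    mask = p_joint > 0                         # skip zero cells (0 * log 0 = 0)
    return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / (p_a @ p_b)[mask])))

def ib_objective(p_xy, p_t_given_x, beta):
    """I(X;T) - beta * I(T;Y) for an encoder p(t|x) of shape (n_t, n_x)."""
    p_x = p_xy.sum(axis=1)
    p_xt = p_t_given_x.T * p_x[:, None]        # p(x, t) = p(t|x) p(x)
    p_ty = p_t_given_x @ p_xy                  # p(t, y) = sum_x p(t|x) p(x, y)
    return mutual_information(p_xt) - beta * mutual_information(p_ty)
```

An encoder that copies X keeps all the information about Y but achieves no compression, while a single-cluster encoder compresses fully and keeps nothing. The bottleneck solution lies between these extremes, at a point set by <math>\beta</math>.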
Gaussian Information Bottleneck [1]
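A relatively simple application of the information bottleneck is to Gaussian variates, where it bears some resemblance to a least-squares reduced-rank or canonical correlation approximation. Assume <math>X, Y</math> are jointly multivariate zero-mean normal vectors with covariances <math>\Sigma_{XX}, \Sigma_{YY}</math> and cross-covariance <math>\Sigma_{XY}</math>, and <math>T</math> is a compressed version of <math>X</math> which must maintain a given value of mutual information with <math>Y</math>. It can be shown that the optimum <math>T</math> is a normal vector consisting of orthogonal linear combinations of the elements of <math>X</math>, <math>T = AX</math>.
The projection matrix <math>A</math> contains <math>M</math> rows selected from the weighted left eigenvectors of the singular value decomposition of the following matrix (generally asymmetric)
<math>\Omega = \Sigma_{X|Y} \Sigma_{XX}^{-1} = I - \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{XY}^T \Sigma_{XX}^{-1}.</math>
Define the singular value decomposition
<math>\Omega = U \Lambda V^T</math>
with
<math>\Lambda = \operatorname{Diag}(\lambda_1 \le \lambda_2 \le \cdots \le \lambda_N)</math>
and the critical values
<math>\beta_i^C = (1 - \lambda_i)^{-1}</math> for <math>\lambda_i < 1</math>.
Then the number <math>M</math> of active eigenvectors in the projection, or order of approximation, is given by
<math>\beta_M^C < \beta \le \beta_{M+1}^C.</math>
And we finally get
<math>A = [w_1 U_1, \dots, w_M U_M]^T</math>
in which the weights are given by
<math>w_i = \sqrt{\frac{\beta (1 - \lambda_i) - 1}{\lambda_i r_i}}</math>
where <math>r_i = U_i^T \Sigma_{XX} U_i</math>.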
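The construction of the projection matrix can be sketched numerically. The Python example below assumes the formulation of Chechik et al. [1]: the matrix <math>\Sigma_{X|Y} \Sigma_{XX}^{-1}</math>, its left eigenvectors sorted by ascending eigenvalue, critical values <math>(1 - \lambda_i)^{-1}</math>, and weights <math>\sqrt{(\beta(1 - \lambda_i) - 1) / (\lambda_i r_i)}</math>. The function name and arguments are hypothetical. The left eigenvectors are obtained from an eigendecomposition of the transposed matrix, which is a choice of this sketch, and the additive noise term of the full solution is omitted.

```python
import numpy as np

def gaussian_ib_projection(sigma_xx, sigma_xy, sigma_yy, beta):
    """Rows of A for T = A X (illustrative sketch, noise term omitted)."""
    n = sigma_xx.shape[0]
    # Conditional covariance of X given Y.
    sigma_x_given_y = sigma_xx - sigma_xy @ np.linalg.solve(sigma_yy, sigma_xy.T)
    omega = sigma_x_given_y @ np.linalg.inv(sigma_xx)
    # Left eigenvectors of omega are right eigenvectors of omega.T.
    eigvals, eigvecs = np.linalg.eig(omega.T)
    eigvals, eigvecs = eigvals.real, eigvecs.real
    rows = []
    for i in np.argsort(eigvals):              # ascending eigenvalues
        lam, u = eigvals[i], eigvecs[:, i]
        gain = beta * (1.0 - lam) - 1.0        # positive only above the critical beta
        if lam <= 0.0 or lam >= 1.0 or gain <= 0.0:
            continue                           # inactive (or degenerate) direction
        r = u @ sigma_xx @ u
        rows.append(np.sqrt(gain / (lam * r)) * u)
    return np.array(rows).reshape(len(rows), n)
```

Below the smallest critical value the function returns a matrix with no rows, which corresponds to compressing X away entirely. Rows appear one at a time as <math>\beta</math> passes each critical value.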
[1] G. Chechik, A. Globerson, N. Tishby and Y. Weiss: “Information Bottleneck for Gaussian Variables”, Journal of Machine Learning Research 6, Jan 2005, pp. 165-188
Data Clustering using the Information Bottleneck
This application of the bottleneck method to non-Gaussian sampled data is described in [2]. The concept, as treated there, is not without complication, as there are two independent phases in the exercise: first, estimation of the unknown parent probability densities from which the data samples are drawn, and second, the use of these densities within the information-theoretic framework of the bottleneck.
Density Estimation
Since the bottleneck method is framed in probabilistic rather than statistical terms, we first need to estimate the underlying probability density at the sample points <math>X = \{x_i\}</math>. This is a well-known problem with a number of solutions [3]. In the present method, probability densities at the sample points are found by use of a Markov transition matrix method, which has some mathematical synergy with the bottleneck method itself.
Define an arbitrary increasing distance function <math>f</math> between all sample pairs and define the distance matrix <math>d_{i,j} = f(|x_i - x_j|)</math>. Then compute transition probabilities between sample pairs, <math>P_{i,j} = \exp(-\lambda d_{i,j})</math>, for some <math>\lambda > 0</math>. Treating samples as states, and a row-normalised version of <math>P</math> as a Markov state transition probability matrix, the vector of probabilities of the ‘states’ after <math>t</math> steps, conditioned on the initial state <math>p(0)</math>, is <math>p(t)^T = p(0)^T P^t</math>. We are here interested only in the equilibrium probability vector <math>p(\infty)</math>, which is given, in the usual way, by the dominant left eigenvector of the matrix <math>P</math> and is independent of the initialising vector <math>p(0)</math>. This Markov transition method establishes a probability at the sample points which is claimed to be proportional to the probability densities there.
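The density estimation step can be sketched as follows. The example assumes squared Euclidean distance as the distance function and row normalisation of the transition matrix. The function name is hypothetical, and the equilibrium vector is taken from the eigenvector of the transposed matrix whose eigenvalue is closest to one.

```python
import numpy as np

def equilibrium_distribution(samples, lam):
    """Row-stochastic transition matrix P and its equilibrium vector.

    samples: array of shape (n, dim); lam: positive decay rate (lambda).
    """
    diffs = samples[:, None, :] - samples[None, :, :]
    d = np.sum(diffs ** 2, axis=-1)            # squared distances (an assumed choice of f)
    kernel = np.exp(-lam * d)
    P = kernel / kernel.sum(axis=1, keepdims=True)   # each row sums to one
    # Left eigenvectors of P are right eigenvectors of P.T.
    eigvals, eigvecs = np.linalg.eig(P.T)
    k = np.argmax(eigvals.real)                # dominant eigenvalue (equal to 1)
    p_inf = np.abs(eigvecs[:, k].real)
    return P, p_inf / p_inf.sum()
```

Because the unnormalised kernel is symmetric, the equilibrium probability of each sample in this sketch is proportional to its row sum of the kernel, so samples in crowded regions receive more weight.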
Clusters
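In the following, the reference vector <math>Y</math> contains sample categories and the joint probability <math>p(X,Y)</math> is assumed known. A cluster <math>c_k</math> is defined by its probability distribution over the data samples <math>x</math>: <math>p(c_k|x)</math>. In [1] Tishby et al. present the following iterative set of equations to determine the clusters
<math>\begin{cases} p(c|x) = K p(c) \exp\left(-\beta D^{KL}\left[p(y|x) \,\|\, p(y|c)\right]\right) \\ p(y|c) = \sum_x p(y|x) p(c|x) p(x) / p(c) \\ p(c) = \sum_x p(c|x) p(x) \end{cases}</math>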
The function of each line of the iteration is expanded as follows.
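Line 1: This is a matrix-valued set of conditional probabilities
<math>A_{i,j} = p(c_i|x_j) = K p(c_i) \exp\left(-\beta D^{KL}\left[p(y|x_j) \,\|\, p(y|c_i)\right]\right)</math>
The Kullback-Leibler distance <math>D^{KL}</math> between the <math>Y</math> vectors generated by the sample data <math>x</math> and those generated by its reduced information proxy <math>c</math> is applied to assess the fidelity of the compressed vector with respect to the categorical data <math>Y</math>, in accordance with the fundamental bottleneck equation. <math>D^{KL}(a \,\|\, b)</math> is the Kullback-Leibler distance between distributions <math>a, b</math>
<math>D^{KL}(a \,\|\, b) = \sum_i p(a_i) \log\left(\frac{p(a_i)}{p(b_i)}\right)</math>
and <math>K</math> is a scalar normalization, chosen for each sample so that the cluster probabilities sum to one. The weighting by the negative exponent of the distance means that prior cluster probabilities are downweighted in line 1 when the Kullback-Leibler distance is large; thus successful clusters grow in probability while unsuccessful ones decay.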
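Line 2: This is a second matrix-valued set of conditional probabilities
<math>B_{i,j} = p(y_i|c_j) = \sum_k p(y_i|x_k) p(c_j|x_k) p(x_k) / p(c_j)</math>
The steps in deriving this are as follows. Since <math>Y</math> depends on the cluster only through the sample <math>x</math>, we have
<math>\begin{align} p(y_i|c_k) &= \sum_j p(y_i|x_j) p(x_j|c_k) \\ &= \sum_j p(y_i|x_j) p(x_j, c_k) / p(c_k) \\ &= \sum_j p(y_i|x_j) p(c_k|x_j) p(x_j) / p(c_k) \end{align}</math>
where the Bayes identities <math>p(a,b) = p(a|b) p(b) = p(b|a) p(a)</math> are used. The summation over the sample points stands in for the integral over <math>x</math>, as in the first equation above.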
Line 3: This line finds the marginal distribution of the clusters <math>c</math>
<math>p(c_i) = \sum_j p(c_i, x_j) = \sum_j p(c_i|x_j) p(x_j)</math>
This is also derived from standard results.
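Further inputs to the algorithm are the marginal sample distribution <math>p(x)</math>, which has already been determined by the dominant eigenvector of the transition matrix <math>P</math>, and the matrix-valued Kullback-Leibler distance function
<math>D_{i,j}^{KL} = D^{KL}\left[p(y|x_j) \,\|\, p(y|c_i)\right]</math>
derived from the sample spacings and transition probabilities.
The matrices <math>p(y_i|c_j)</math> and <math>p(c_i|x_j)</math> can be initialised randomly.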
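The three-line iteration can be sketched in Python as below. This is a minimal illustration under stated assumptions rather than the authors' implementation: it assumes the standard bottleneck updates (cluster marginal, category distribution per cluster, then the exponential Kullback-Leibler reweighting), initialises p(c|x) randomly, and runs a fixed number of iterations. The function names, the `eps` guard against log(0) and division by zero, and the subtraction of the per-sample minimum distance before exponentiation (which only rescales the normalisation K) are choices of this sketch.

```python
import numpy as np

def cluster_marginals(p_c_given_x, p_x, p_y_given_x, eps):
    """Lines 3 and 2: p(c) and p(y|c) from the current p(c|x)."""
    p_c = p_c_given_x @ p_x                                    # shape (n_c,)
    p_y_given_c = (p_y_given_x * p_x) @ p_c_given_x.T / (p_c + eps)  # (n_y, n_c)
    return p_c, p_y_given_c

def ib_clusters(p_x, p_y_given_x, n_clusters, beta, n_iter=100, seed=0, eps=1e-12):
    """Iterate the bottleneck equations; p_y_given_x has shape (n_y, n_x)."""
    rng = np.random.default_rng(seed)
    p_c_given_x = rng.random((n_clusters, p_x.shape[0]))       # random start
    p_c_given_x /= p_c_given_x.sum(axis=0, keepdims=True)      # columns sum to one
    log_p_y_given_x = np.log(p_y_given_x + eps)
    for _ in range(n_iter):
        p_c, p_y_given_c = cluster_marginals(p_c_given_x, p_x, p_y_given_x, eps)
        # kl[i, j] = D_KL[ p(y|x_j) || p(y|c_i) ]
        log_ratio = log_p_y_given_x[:, None, :] - np.log(p_y_given_c + eps)[:, :, None]
        kl = np.sum(p_y_given_x[:, None, :] * log_ratio, axis=0)
        kl -= kl.min(axis=0, keepdims=True)                    # numerical stability only
        # Line 1: reweight the cluster priors and renormalise per sample.
        unnorm = p_c[:, None] * np.exp(-beta * kl)
        p_c_given_x = unnorm / unnorm.sum(axis=0, keepdims=True)
    p_c, p_y_given_c = cluster_marginals(p_c_given_x, p_x, p_y_given_x, eps)
    return p_c_given_x, p_y_given_c, p_c
```

A convergence test on the change in p(c|x) between iterations could replace the fixed iteration count. Results depend on the random start, so several seeds may be worth comparing.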
Defining Decision Contours
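To categorize a new sample <math>x'</math> external to the training set <math>X</math>, first calculate the probabilities that it belongs to each of the various clusters, which is the conditional probability <math>p(c_i|x')</math>. In order to find this, apply the previous distance metric to find the transition probabilities between <math>x'</math> and all samples in <math>X</math>, <math>\tilde p(x_i) = p(x_i|x') = K \exp\left(-\lambda f(|x_i - x'|)\right)</math>, with <math>K</math> a normalisation. Secondly, apply the last two lines of the 3-line algorithm to get cluster and conditional category probabilities.
<math>\tilde p(c_i) = p(c_i|x') = \sum_j p(c_i|x_j) \tilde p(x_j)</math>
<math>p(y_i|c_j) = \sum_k p(y_i|x_k) p(c_j|x_k) \tilde p(x_k) / \tilde p(c_j)</math>
Finally we have
<math>p(y_i|x') = \sum_j p(y_i|c_j) \tilde p(c_j)</math>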
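Classification of a new sample can be sketched as below. The example assumes squared Euclidean distance, an already trained matrix p(c|x) whose columns sum to one, and the reading of the text in which the last two lines of the algorithm are re-evaluated with the transition probabilities from the new sample in place of p(x). The function name and the guard against empty clusters are hypothetical.

```python
import numpy as np

def classify_new_sample(x_new, samples, p_c_given_x, p_y_given_x, lam):
    """Category probabilities p(y|x') for a sample outside the training set."""
    d_new = np.sum((samples - x_new) ** 2, axis=1)     # assumed distance function
    p_tilde_x = np.exp(-lam * d_new)
    p_tilde_x /= p_tilde_x.sum()                       # p(x_i | x')
    p_tilde_c = p_c_given_x @ p_tilde_x                # p(c_i | x')
    safe_c = np.maximum(p_tilde_c, 1e-300)             # avoid division by zero
    p_y_given_c = (p_y_given_x * p_tilde_x) @ p_c_given_x.T / safe_c
    return p_y_given_c @ p_tilde_c                     # p(y | x')
```

Scanning `x_new` over a grid and marking where the two category probabilities are equal gives the decision contour. In this sketch the sum over clusters reduces to the kernel-weighted average of p(y|x) over the training samples, which gives a convenient check on the output.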
Generally the algorithm converges rapidly, often in tens of iterations. However, the parameter <math>\beta</math> must be kept under close supervision since, as it is increased from zero, increasing numbers of features in the category probability space snap into focus at certain critical values.
There is some analogy between this algorithm and a neural network with a single hidden layer. The nodes are represented by the clusters <math>c_j</math>. The first and second layers of network weights are the conditional probabilities <math>p(c_j|x_i)</math> and <math>p(y_k|c_j)</math> respectively. However, unlike a standard neural network, the present algorithm always uses probabilities of samples as inputs rather than the sample values themselves, and the nonlinear functions are encapsulated in the Kullback-Leibler distances and the transition probabilities rather than in sigmoid functions. Compared to a neural network, this algorithm seems to converge much more quickly, and by varying <math>\lambda</math> and <math>\beta</math> various levels of focus on features can be achieved. There are also similarities to some varieties of fuzzy logic algorithms.
For blind classification and clustering, the transient behaviour of the Markov chain is analysed; this is discussed in more detail in [2], but the extra complication is not necessary for the supervised training described here.
An Example
In the following simple case we investigate clustering in a four-quadrant multiplier with random inputs <math>u, v</math> and two categories of output, <math>\pm 1</math>, generated by <math>y = \operatorname{sign}(uv)</math>. This function has the property that there are two spatially separated clusters for each category, and so it demonstrates that the method can handle such distributions.
20 samples are taken, uniformly distributed on the square <math>[-1,1]^2</math>. The number of clusters used beyond the number of categories, two in this case, has little effect on performance, and the results are shown for two clusters using parameters <math>\lambda = 3,\, \beta = 2.5</math> and the distance function <math>d_{i,j} = |x_i - x_j|^2</math> where <math>x_i = (u_i, v_i)^T</math>. The figure shows the locations of the twenty samples with '0' representing Y = 1 and 'x' representing Y = -1. The contour at the unity likelihood ratio level is shown, as a new sample <math>x'</math> is scanned over the square. Theoretically the contour should align with the <math>u = 0</math> and <math>v = 0</math> coordinates, but for such small sample numbers it has instead followed the spurious clusterings of the sample points.
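The data for this example can be set up along the following lines. The sketch assumes the square <math>[-1,1]^2</math>, labels given by the sign of the product of the two inputs, and squared Euclidean distances. The random seed and variable names are arbitrary, so the sample locations will not match those in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed, not from the article
n_samples = 20
samples = rng.uniform(-1.0, 1.0, size=(n_samples, 2))   # columns are u and v
labels = np.sign(samples[:, 0] * samples[:, 1])         # +1 or -1 per sample

# p(y|x) as a 2 x 20 matrix: row 0 is Y = +1, row 1 is Y = -1.
# Each column is one-hot because every training label is known exactly.
p_y_given_x = np.vstack([labels > 0, labels < 0]).astype(float)

# Pairwise squared distances between the samples.
diffs = samples[:, None, :] - samples[None, :, :]
d = np.sum(diffs ** 2, axis=-1)

lam, beta = 3.0, 2.5                           # parameter values quoted in the text
```

From here the distance matrix feeds the transition probabilities, and the matrix p(y|x) feeds the cluster iteration.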
Bibliography
[2] N. Tishby, N. Slonim: “Data clustering by Markovian Relaxation and the Information Bottleneck Method”, Neural Information Processing Systems (NIPS) 2000, pp. 640-646
[3] B.W. Silverman: “Density Estimation for Statistical Data Analysis”, Chapman and Hall, 1986.