Consensus clustering
Revision as of 17:27, 15 April 2009
Clustering is the assignment of objects into groups (called clusters) so that objects from the same cluster are more similar to each other than objects from different clusters. Often similarity is assessed according to a distance measure. Clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics.
Consensus clustering has emerged as an important elaboration of the classical clustering problem. Consensus clustering, also called aggregation of clusterings (or partitions), refers to the situation in which a number of different (input) clusterings have been obtained for a particular dataset and it is desired to find a single (consensus) clustering which is a better fit in some sense than the existing clusterings. Consensus clustering is thus the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm. When cast as an optimization problem, consensus clustering is known as median partition, and has been shown to be NP-complete.
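The median-partition formulation can be made concrete with a small, pure-Python sketch (function names are illustrative): disagreement between two partitions is measured as the number of object pairs they treat differently (together in one, apart in the other), and the median partition is the labeling minimizing total disagreement with all inputs. The brute-force search below is feasible only for tiny instances, since the general problem is NP-complete.

```python
from itertools import product

def disagreement(p, q):
    """Number of object pairs on which two partitions disagree:
    together in one partition but apart in the other."""
    n = len(p)
    d = 0
    for i in range(n):
        for j in range(i + 1, n):
            if (p[i] == p[j]) != (q[i] == q[j]):
                d += 1
    return d

def median_partition(inputs, max_labels=3):
    """Brute-force median partition: the candidate labeling minimizing
    the total disagreement with all input clusterings.  Exponential in
    the number of objects, so usable only on toy examples."""
    n = len(inputs[0])
    best, best_cost = None, float("inf")
    for cand in product(range(max_labels), repeat=n):
        cost = sum(disagreement(cand, q) for q in inputs)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost
```

For example, given the three input clusterings `[0,0,1,1]`, `[0,0,0,1]` and `[1,1,0,0]` over four objects, the first and third agree on the grouping {0,1},{2,3}, and that grouping is the median partition.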
Issues with existing clustering techniques
- No current clustering technique addresses all of the relevant requirements adequately.
- Dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity.
- For distance-based clustering, the effectiveness of the method depends on the definition of "distance".
- If an obvious distance measure does not exist, one must be defined, which is not always easy, especially in multidimensional spaces.
- The result of a clustering algorithm (which in many cases is itself somewhat arbitrary) can be interpreted in different ways.
Why Consensus Clustering?
- Each of the known clustering techniques has its own potential shortcomings.
- Interpretation of the results can be difficult.
- When the number of clusters is not known in advance, choosing it is difficult.
- Many algorithms are extremely sensitive to the initial settings.
- Some algorithms can never undo decisions made in earlier steps.
- Iterative descent clustering methods, such as SOM and K-means, circumvent some of the shortcomings of hierarchical clustering by providing unambiguously defined clusters and cluster boundaries. However, they lack the intuitive and visual appeal of hierarchical clustering, and the number of clusters must be chosen a priori.
- An extremely important issue in cluster analysis is the validation of the clustering results: how to gain confidence that the clusters produced by a technique (their number and the assignments) are significant. Lacking an external objective criterion (the equivalent of a known class label in supervised learning), this validation becomes somewhat elusive.
Advantages of Consensus Clustering
- Provides for a method to represent the consensus across multiple runs of a clustering algorithm, to determine the number of clusters in the data, and to assess the stability of the discovered clusters.
- The method can also be used to represent the consensus over multiple runs of a clustering algorithm with random restart (such as K-means, model-based Bayesian clustering, SOM, etc.), so as to account for its sensitivity to the initial conditions.
- It also provides for a visualization tool to inspect cluster number, membership, and boundaries.
- A large number of features/attributes can be extracted from multiple runs of different clustering algorithms on the data; these features provide valuable information for computing a final consensus clustering.
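The consensus-across-runs idea above can be sketched in pure Python (all names, and the toy 1-D k-means, are illustrative assumptions rather than a reference implementation): run a clusterer many times from random starts and record, for each pair of points, the fraction of runs in which they land in the same cluster. Entries near 0 or 1 indicate stable assignments; intermediate values flag unstable cluster boundaries.

```python
import random

def kmeans_labels(points, k, rng):
    """One run of Lloyd's k-means on 1-D points from a random start;
    returns a cluster label for each point."""
    centers = rng.sample(points, k)
    for _ in range(20):
        labels = [min(range(k), key=lambda c: abs(p - centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = sum(members) / len(members)
    return labels

def consensus_matrix(points, k, runs=50, seed=0):
    """Fraction of runs in which each pair of points is co-clustered."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    for _ in range(runs):
        labels = kmeans_labels(points, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / runs for c in row] for row in counts]
```

On well-separated data such as `[1.0, 1.1, 1.2, 9.0, 9.1, 9.2]` with `k=2`, every entry of the matrix is 0 or 1, reflecting perfectly stable clusters; noisier data produces fractional entries.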
Related Work
1. Clustering Ensemble (Strehl and Ghosh): They consider various formulations for the problem, most of which reduce the problem to a hyper-graph partitioning problem. In one of their formulations they consider the same graph as in the correlation clustering problem. The solution they propose is to compute the best k-partition of the graph, which does not take into account the penalty for merging two nodes that are far apart.
2. Clustering Aggregation (Fern and Brodley): They apply the clustering aggregation idea to a collection of soft clusterings they obtain by random projections. They use an agglomerative algorithm and do not penalize for merging dissimilar nodes.
3. Fred and Jain: propose to use a single linkage algorithm to combine multiple runs of the k-means algorithm.
4. Dana Cristofor and Dan Simovici: observe the connection between clustering aggregation and clustering of categorical data. They propose information theoretic distance measures, and they propose genetic algorithms for finding the best aggregation solution.
5. Topchy et al.: They define clustering aggregation as a maximum likelihood estimation problem and propose an EM algorithm for finding the consensus clustering.
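The evidence-accumulation scheme of Fred and Jain (item 3 above) admits a short sketch (the function name and threshold are illustrative): cutting a single-linkage dendrogram built on the co-association matrix at a fixed height is equivalent to taking connected components of the graph that links every pair of objects whose co-association exceeds the threshold.

```python
def evidence_accumulation(coassoc, threshold=0.5):
    """Combine clusterings via their co-association matrix: label the
    connected components of the graph linking pairs whose co-association
    is at least `threshold` (a single-linkage cut at that height)."""
    n = len(coassoc)
    labels = [-1] * n
    cluster = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # depth-first search over the thresholded co-association graph
        stack = [start]
        labels[start] = cluster
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and coassoc[i][j] >= threshold:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    return labels
```

For instance, a co-association matrix in which objects 0, 1 and 2 co-occur in most runs while objects 3 and 4 form their own stable pair yields the two-cluster consensus `[0, 0, 0, 1, 1]`.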
Further reading
- Andrey Goder and Vladimir Filkov. "Consensus Clustering Algorithms: Comparison and Refinement" (PDF). Proceedings of the Ninth Workshop on Algorithm Engineering and Experiments (ALENEX), San Francisco, January 19, 2008. Society for Industrial and Applied Mathematics.
- Tao Li and Chris Ding. "Weighted Consensus Clustering" (PDF). Proceedings of the 2008 SIAM International Conference on Data Mining, Atlanta, April 24–26, 2008. Society for Industrial and Applied Mathematics.
- Stefano Monti, Pablo Tamayo, Jill Mesirov and Todd Golub. "Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data".