{{Short description|Method of result aggregation from multiple clustering algorithms}}
{{Cleanup|date=February 2009}}
'''Consensus clustering''' is a method of aggregating (potentially conflicting) results from multiple [[clustering algorithm]]s. Also called '''cluster ensembles'''<ref name=StrehlEnsembles>{{cite journal|last1=Strehl|first1=Alexander|authorlink1=Alexander Strehl|author2=Ghosh, Joydeep|title=Cluster ensembles – a knowledge reuse framework for combining multiple partitions|journal=Journal of Machine Learning Research (JMLR)|date=2002|volume=3|pages=583–617|url=http://www.jmlr.org/papers/volume3/strehl02a/strehl02a.pdf|doi=10.1162/153244303321897735|s2cid=3068944 |quote=This paper introduces the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings. We first identify several application scenarios for the resultant 'knowledge reuse' framework that we call cluster ensembles. The cluster ensemble problem is then formalized as a combinatorial optimization problem in terms of shared [[mutual information]]}}</ref> or aggregation of clustering (or partitions), it refers to the situation in which a number of different (input) clusterings have been obtained for a particular dataset and it is desired to find a single (consensus) clustering which is a better fit in some sense than the existing clusterings.<ref name=RuizSurvey2011>{{cite journal|last=Vega-Pons|first=Sandro|author2=Ruiz-Shulcloper, José|s2cid=4643842|journal=International Journal of Pattern Recognition and Artificial Intelligence|date=1 May 2011|volume=25|issue=3|pages=337–372|doi=10.1142/S0218001411008683|title=A Survey of Clustering Ensemble Algorithms}}</ref> Consensus clustering is thus the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm. 
When cast as an optimization problem, consensus clustering is known as median partition, and has been shown to be [[NP-complete]],<ref name=Filkov2003>{{cite book|last=Filkov|first=Vladimir|title=Proceedings. 15th IEEE International Conference on Tools with Artificial Intelligence|chapter=Integrating microarray data by consensus clustering|year=2003|pages=418–426|doi=10.1109/TAI.2003.1250220|isbn=978-0-7695-2038-4|citeseerx=10.1.1.116.8271|s2cid=1515525}}</ref> even when the number of input clusterings is three.<ref name=Bonizzoni2008>{{cite journal|last=Bonizzoni|first=Paola|author2=Della Vedova, Gianluca| author3= Dondi, Riccardo| author4= Jiang, Tao| title=On the Approximation of Correlation Clustering and Consensus Clustering|journal=Journal of Computer and System Sciences|volume=74|number=5|year=2008|pages=671–696|doi=10.1016/j.jcss.2007.06.024|doi-access=free}}</ref> Consensus clustering for unsupervised learning is analogous to [[ensemble learning]] in supervised learning.
'''Clustering''' is the assignment of objects into groups (called ''clusters'') so that objects from the same cluster are more similar to each other than objects from different clusters.<ref name=Gibbons2002>{{cite journal|last=Gibbons|first=F. D.|author2=Roth, F. P.|title=Judging the Quality of Gene Expression-Based Clustering Methods Using Gene Annotation|journal=Genome Research|date=30 September 2002|volume=12|issue=10|pages=1574–1581|doi=10.1101/gr.397002}}</ref> Often similarity is assessed according to a [[metric (mathematics)|distance measure]]. Clustering is a common technique for [[statistics|statistical]] [[data analysis]], which is used in many fields, including [[machine learning]], [[data mining]], [[pattern recognition]], [[image analysis]] and [[bioinformatics]].

==Issues with existing clustering techniques==
* Current clustering techniques do not address all the requirements adequately.
* Dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity.
* The effectiveness of the method depends on the definition of "[[distance]]" (for distance-based clustering).
* If an obvious distance measure doesn't exist, we must "define" it, which is not always easy, especially in multidimensional spaces.
* The result of the clustering algorithm (which, in many cases, can be arbitrary itself) can be interpreted in different ways.

==Justification for using consensus clustering==
There are potential shortcomings for all existing clustering techniques. This may cause interpretation of results to become difficult, especially when there is no knowledge about the number of clusters. Clustering methods are also very sensitive to the initial clustering settings, which can cause non-significant data to be amplified in non-reiterative methods. An extremely important issue in cluster analysis is the validation of the clustering results, that is, how to gain confidence about the significance of the clusters provided by the clustering technique (cluster numbers and cluster assignments). Lacking an external objective criterion (the equivalent of a known class label in supervised analysis), this validation becomes somewhat elusive.
Iterative descent clustering methods, such as the [[Self-organizing map|SOM]] and [[k-means clustering]], circumvent some of the shortcomings of [[hierarchical clustering]] by providing univocally defined clusters and cluster boundaries. However, they lack the intuitive and visual appeal of hierarchical clustering dendrograms, and the number of clusters must be chosen a priori. Consensus clustering provides a method that represents the consensus across multiple runs of a clustering algorithm, to determine the number of clusters in the data and to assess the stability of the discovered clusters. The method can also be used to represent the consensus over multiple runs of a clustering algorithm with random restart (such as K-means, model-based Bayesian clustering, or SOM), so as to account for its sensitivity to the initial conditions. It can also provide data for a visualization tool to inspect cluster number, membership, and boundaries.

==The Monti consensus clustering algorithm==
The Monti consensus clustering algorithm<ref>{{Cite journal|last1=Monti|first1=Stefano|last2=Tamayo|first2=Pablo|last3=Mesirov|first3=Jill|last4=Golub|first4=Todd|date=2003-07-01|title=Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data|journal=Machine Learning|language=en|volume=52|issue=1|pages=91–118|doi=10.1023/A:1023949509487|issn=1573-0565|doi-access=free}}</ref> is one of the most popular consensus clustering algorithms and is used to determine the number of clusters, <math>K</math>. Given a dataset with <math>N</math> points to cluster, this algorithm works by resampling and clustering the data; for each <math>K</math>, an <math>N \times N</math> consensus matrix is calculated, where each element represents the fraction of times two samples clustered together. A perfectly stable matrix would consist entirely of zeros and ones, representing all sample pairs always clustering together or never clustering together over all resampling iterations. The relative stability of the consensus matrices can be used to infer the optimal <math>K</math>.


More specifically, given a set of points to cluster, <math>D=\{e_1,e_2,...e_N\}</math>, let <math>D^1,D^2,...,D^H</math> be the list of <math>H</math> perturbed (resampled) datasets of the original dataset <math>D</math>, and let <math>M^h</math> denote the <math>N \times N</math> connectivity matrix resulting from applying a clustering algorithm to the dataset <math>D^h</math>. The entries of <math>M^h</math> are defined as follows:


<math>M^h(i,j)= \begin{cases} 1, & \text{if points } i \text{ and } j \text{ belong to the same cluster} \\ 0, & \text{otherwise} \end{cases}</math>


Let <math>I^h</math> be the <math>N \times N</math> indicator matrix whose <math>(i,j)</math>-th entry is equal to 1 if points <math>i</math> and <math>j</math> are in the same perturbed dataset <math>D^h</math>, and 0 otherwise. The indicator matrix is used to keep track of which samples were selected during each resampling iteration, for the normalisation step. The consensus matrix <math>C</math> is defined as the normalised sum of the connectivity matrices of all the perturbed datasets, and a different one is calculated for every <math>K</math>.


<math>C(i,j)= \frac{\sum_{h=1}^H M^h(i,j)}{\sum_{h=1}^H I^h(i,j)}</math>


That is, the entry <math>(i,j)</math> in the consensus matrix is the number of times points <math>i</math> and <math>j</math> were clustered together divided by the total number of times they were selected together. The matrix is symmetric, and each element is defined within the range <math>[0,1]</math>. A consensus matrix is calculated for each <math>K</math> to be tested, and the stability of each matrix, that is, how close the matrix is to a matrix of perfect stability (containing only zeros and ones), is used to determine the optimal <math>K</math>. One way of quantifying the stability of the <math>K</math>th consensus matrix is to examine its CDF curve (see below).
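The resampling scheme and the consensus matrix <math>C</math> above can be illustrated with a short Python sketch. This is a minimal illustration only: the inline ''k''-means base clusterer, the 80% subsampling rate, the number of resamples, and all function names are arbitrary choices made here, not part of the algorithm's specification.

```python
import numpy as np

def kmeans_labels(X, k, n_iter=30, rng=None):
    """Minimal Lloyd's k-means used as the base clusterer (illustrative only)."""
    rng = np.random.default_rng(rng)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centre, then update the centres
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def monti_consensus_matrix(X, k, H=50, frac=0.8, seed=0):
    """C(i, j): fraction of resampling runs in which points i and j were
    clustered together, among the runs in which both were sampled."""
    rng = np.random.default_rng(seed)
    N = len(X)
    M_sum = np.zeros((N, N))  # sum of connectivity matrices M^h
    I_sum = np.zeros((N, N))  # sum of indicator matrices I^h
    for _ in range(H):
        idx = rng.choice(N, size=int(frac * N), replace=False)
        labels = kmeans_labels(X[idx], k, rng=rng)
        together = (labels[:, None] == labels[None, :]).astype(float)
        M_sum[np.ix_(idx, idx)] += together
        I_sum[np.ix_(idx, idx)] += 1.0
    # normalise, leaving C(i, j) = 0 where a pair was never sampled together
    return np.divide(M_sum, I_sum, out=np.zeros_like(M_sum), where=I_sum > 0)
```

Running this for a range of <math>K</math> values yields one consensus matrix per <math>K</math>, whose relative stability can then be compared.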


==Over-interpretation potential of the Monti consensus clustering algorithm==
[[File:PACexplained.png|400px|thumb|PAC measure (proportion of ambiguous clustering) explained. Optimal K is the K with lowest PAC value.]]
Monti consensus clustering can be a powerful tool for identifying clusters, but it needs to be applied with caution, as shown by Şenbabaoğlu ''et al.''<ref name="SenbabaogluSREP" /> It has been shown that the Monti consensus clustering algorithm is able to claim apparent stability of chance partitioning of null datasets drawn from a unimodal distribution, and thus has the potential to lead to over-interpretation of cluster stability in a real study.<ref name=SenbabaogluSREP>{{cite journal|last=Şenbabaoğlu|first=Y.|author2=Michailidis, G. |author3=Li, J. Z. |title=Critical limitations of consensus clustering in class discovery|journal=Scientific Reports|date=2014|doi=10.1038/srep06207|volume=4|pages=6207|pmid=25158761|pmc=4145288|bibcode=2014NatSR...4.6207.}}</ref><ref name=SenbabaogluRXV>{{cite bioRxiv|last=Şenbabaoğlu|first=Y.|author2=Michailidis, G. |author3=Li, J. Z. |title=A reassessment of consensus clustering for class discovery|date=Feb 2014|biorxiv=10.1101/002642}}</ref> If clusters are not well separated, consensus clustering could lead one to conclude apparent structure when there is none, or declare cluster stability when it is subtle. Identifying false positive clusters is a common problem throughout cluster research,<ref name=":0">{{Cite journal|last1=Liu|first1=Yufeng|last2=Hayes|first2=David Neil|last3=Nobel|first3=Andrew|last4=Marron|first4=J. S.|date=2008-09-01|title=Statistical Significance of Clustering for High-Dimension, Low–Sample Size Data|journal=Journal of the American Statistical Association|volume=103|issue=483|pages=1281–1293|doi=10.1198/016214508000000454|s2cid=120819441|issn=0162-1459}}</ref> and has been addressed by methods such as SigClust<ref name=":0" /> and the GAP-statistic.<ref>{{Cite journal|last1=Tibshirani|first1=Robert|last2=Walther|first2=Guenther|last3=Hastie|first3=Trevor|date=2001|title=Estimating the number of clusters in a data set via the gap statistic|journal=Journal of the Royal Statistical Society, Series B (Statistical Methodology)|language=en|volume=63|issue=2|pages=411–423|doi=10.1111/1467-9868.00293|s2cid=59738652 |issn=1467-9868|doi-access=free}}</ref> However, these methods rely on certain assumptions for the null model that may not always be appropriate.

Şenbabaoğlu ''et al.''<ref name="SenbabaogluSREP" /> demonstrated that the original delta K metric for deciding <math>K</math> in the Monti algorithm performed poorly, and proposed a new, superior metric for measuring the stability of consensus matrices using their CDF curves. In the CDF curve of a consensus matrix, the lower left portion represents sample pairs rarely clustered together, the upper right portion represents those almost always clustered together, whereas the middle segment represents those with ambiguous assignments in different clustering runs. The proportion of ambiguous clustering (PAC) score quantifies this middle segment; it is defined as the fraction of sample pairs with consensus indices falling in the interval (u<sub>1</sub>, u<sub>2</sub>) ⊂ [0, 1], where u<sub>1</sub> is a value close to 0 and u<sub>2</sub> is a value close to 1 (for instance u<sub>1</sub>=0.1 and u<sub>2</sub>=0.9). A low PAC value indicates a flat middle segment and a low rate of discordant assignments across permuted clustering runs. One can therefore infer the optimal number of clusters as the <math>K</math> value having the lowest PAC.<ref name="SenbabaogluSREP" /><ref name="SenbabaogluRXV" />
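The PAC measure reduces to a few lines of code. The following sketch assumes a consensus matrix in the form described above and treats the thresholds u<sub>1</sub> and u<sub>2</sub> as parameters; the function name is illustrative.

```python
import numpy as np

def pac_score(C, u1=0.1, u2=0.9):
    """Proportion of ambiguous clustering: the fraction of sample pairs whose
    consensus index falls strictly inside (u1, u2).  Lower is better."""
    iu = np.triu_indices_from(C, k=1)  # count each unordered pair once
    vals = C[iu]
    return float(np.mean((vals > u1) & (vals < u2)))
```

To choose the number of clusters, one would compute <code>pac_score</code> on the consensus matrix obtained for each candidate <math>K</math> and pick the <math>K</math> with the lowest value.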

==Related work==
#'''Clustering ensemble (Strehl and Ghosh)''': They considered various formulations for the problem, most of which reduce the problem to a [[hyper-graph]] partitioning problem. In one of their formulations they considered the same graph as in the correlation clustering problem. The solution they proposed is to compute the best ''k''-partition of the graph, which does not take into account the penalty for merging two nodes that are far apart.<ref name=StrehlEnsembles/>
#'''Clustering aggregation (Fern and Brodley)''': They applied the clustering aggregation idea to a collection of [[soft clustering]]s they obtained by random projections. They used an agglomerative algorithm and did not penalize for merging dissimilar nodes.<ref>{{cite journal|author1=Fern, Xiaoli |author2= Brodley, Carla|year=2004|title=Cluster ensembles for high dimensional clustering: an empirical study|journal=J Mach Learn Res|volume=22|url=https://www.researchgate.net/publication/228476517}} </ref>
#'''Fred and Jain''': They proposed to use a single linkage algorithm to combine multiple runs of the ''k''-means algorithm.<ref name="Fred Jain 2005 pp. 835–850">{{cite journal | last1=Fred | first1=Ana L.N. | last2=Jain | first2=Anil K. | title=Combining multiple clusterings using evidence accumulation | journal=IEEE Transactions on Pattern Analysis and Machine Intelligence | publisher=Institute of Electrical and Electronics Engineers (IEEE) | volume=27 | issue=6 | year=2005 | issn=0162-8828 | doi=10.1109/tpami.2005.113 | pages=835–850|pmid= 15943417| s2cid=10316033 |url=http://dataclustering.cse.msu.edu/papers/TPAMI-0239-0504.R1.pdf}}</ref>
#'''Dana Cristofor and Dan Simovici''': They observed the connection between clustering aggregation and clustering of [[categorical variable|categorical data]]. They proposed information theoretic distance measures, and they propose [[genetic algorithm]]s for finding the best aggregation solution.<ref>{{cite journal|author=Dana Cristofor, Dan Simovici|title=Finding Median Partitions Using Information-Theoretical-Based Genetic Algorithms|journal=Journal of Universal Computer Science|volume=8|issue=2|pages=153–172|url=https://www.jucs.org/jucs_8_2/finding_median_partitions_using/Cristofor_D.pdf|date=February 2002|doi=10.3217/jucs-008-02-0153}}</ref>
#'''Topchy et al.''': They defined clustering aggregation as a maximum likelihood estimation problem, and they proposed an [[EM algorithm]] for finding the consensus clustering.<ref>Alexander Topchy, Anil K. Jain, William Punch. [http://dataclustering.cse.msu.edu/papers/TPAMI-ClusteringEnsembles.pdf Clustering Ensembles: Models of Consensus and Weak Partitions]. IEEE International Conference on Data Mining, ICDM 03 & SIAM International Conference on Data Mining, SDM 04</ref>


== Hard ensemble clustering ==
=== Efficient consensus functions ===


#'''Cluster-based similarity partitioning algorithm (CSPA)''': In CSPA the similarity between two data-points is defined to be directly proportional to the number of constituent clusterings of the ensemble in which they are clustered together. The intuition is that the more similar two data-points are, the higher is the chance that constituent clusterings will place them in the same cluster. CSPA is the simplest heuristic, but its computational and storage complexity are both quadratic in ''n''. [http://bioconductor.org/packages/release/bioc/html/SC3.html SC3] is an example of a CSPA type algorithm.<ref>{{Cite journal|last1=Kiselev|first1=Vladimir Yu|last2=Kirschner|first2=Kristina|last3=Schaub|first3=Michael T|last4=Andrews|first4=Tallulah|last5=Yiu|first5=Andrew|last6=Chandra|first6=Tamir|last7=Natarajan|first7=Kedar N|last8=Reik|first8=Wolf|last9=Barahona|first9=Mauricio|last10=Green|first10=Anthony R|last11=Hemberg|first11=Martin|date=May 2017|title=SC3: consensus clustering of single-cell RNA-seq data|journal=Nature Methods|language=en|volume=14|issue=5|pages=483–486|doi=10.1038/nmeth.4236|issn=1548-7091|pmc=5410170|pmid=28346451}}</ref> The following two methods are computationally less expensive:
#'''Hyper-graph partitioning algorithm (HGPA)''': The HGPA algorithm takes a very different approach to finding the consensus clustering than the previous method. The cluster ensemble problem is formulated as partitioning the hypergraph by cutting a minimal number of hyperedges. It makes use of [http://glaros.dtc.umn.edu/gkhome/metis/hmetis/overview hMETIS], a hypergraph partitioning package.
#'''Meta-clustering algorithm (MCLA)''': The meta-clustering algorithm (MCLA) is based on clustering clusters. First, it tries to solve the cluster correspondence problem, and then it uses voting to place data-points into the final consensus clusters. The cluster correspondence problem is solved by grouping the clusters identified in the individual clusterings of the ensemble. The clustering is performed using [http://glaros.dtc.umn.edu/gkhome/views/metis METIS] and [[spectral clustering]].
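The pairwise similarity used by CSPA (often called a co-association matrix) can be sketched as follows. The function name and the input format (one label vector per constituent clustering) are illustrative choices, not prescribed by the method.

```python
import numpy as np

def cspa_similarity(labelings):
    """Co-association matrix: S(i, j) is the fraction of the ensemble's
    clusterings in which points i and j share a cluster.  Both the time
    and the storage cost are quadratic in the number of points n."""
    L = np.asarray(labelings)  # shape (r, n): r clusterings of n points
    n = L.shape[1]
    S = np.zeros((n, n))
    for labels in L:
        # add 1 for every pair placed in the same cluster by this clustering
        S += (labels[:, None] == labels[None, :])
    return S / len(L)
```

The resulting matrix can then be fed to any similarity-based clusterer to produce the consensus partition; Strehl and Ghosh re-partition the induced similarity graph with METIS.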


== Soft clustering ensembles ==


''Punera'' and ''Ghosh'' extended the idea of hard clustering ensembles to the soft clustering scenario. Each instance in a soft ensemble is represented by a concatenation of ''r'' posterior membership probability distributions obtained from the constituent clustering algorithms. We can define a distance measure between two instances using the [[Kullback–Leibler divergence|Kullback–Leibler (KL) divergence]], which calculates the "distance" between two probability distributions.<ref>Kunal Punera, Joydeep Ghosh. [https://web.archive.org/web/20081201150950/http://www.ideal.ece.utexas.edu/papers/2007/punera07softconsensus.pdf Consensus Based Ensembles of Soft Clusterings]</ref>
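A KL-based distance between two such concatenated membership vectors might look like the sketch below. The symmetrisation, the smoothing constant, and the names are choices made here for illustration, not details prescribed by Punera and Ghosh.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) between two discrete
    distributions, with a small epsilon to avoid log(0)."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def soft_instance_distance(x, y, blocks):
    """Distance between two soft-ensemble instances, each a concatenation of
    r posterior membership distributions; `blocks` lists the length of each
    constituent distribution.  Sums a symmetrised KL over the r blocks."""
    d, start = 0.0, 0
    for b in blocks:
        p, q = x[start:start + b], y[start:start + b]
        d += 0.5 * (kl(p, q) + kl(q, p))
        start += b
    return d
```

Any distance-based clusterer can then be run on these pairwise distances to obtain the consensus partition of the soft ensemble.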


#'''{{Proper name|sCSPA}}''': extends CSPA by calculating a similarity matrix. Each object is visualized as a point in a multidimensional space, with each dimension corresponding to the probability of its belonging to a cluster. This technique first transforms the objects into a label-space and then interprets the [[dot product]] between the vectors representing the objects as their similarity.
#'''{{Proper name|sMCLA}}''':extends MCLA by accepting soft clusterings as input. sMCLA's working can be divided into the following steps:
#* Construct Soft Meta-Graph of Clusters
#* Group the Clusters into Meta-Clusters
#* Collapse Meta-Clusters using Weighting
#* Compete for Objects
#'''{{Proper name|sHBGF}}''':represents the ensemble as a [[bipartite graph]] with clusters and instances as nodes, and edges between the instances and the clusters they belong to.<ref>Solving cluster ensemble problems by bipartite graph partitioning, Xiaoli Zhang Fern and [[Carla Brodley]], Proceedings of the twenty-first international conference on Machine learning</ref> This approach can be trivially adapted to consider soft ensembles since the graph partitioning algorithm METIS accepts weights on the edges of the graph to be partitioned. In sHBGF, the graph has ''n''&nbsp;+&nbsp;''t'' vertices, where t is the total number of underlying clusters.
#'''Bayesian consensus clustering (BCC)''': defines a fully [[Bayesian probability|Bayesian]] model for soft consensus clustering in which multiple source clusterings, defined by different input data or different probability models, are assumed to adhere loosely to a consensus clustering.<ref name=LockBCC>{{cite journal|last=Lock|first=E.F.|author2=Dunson, D.B. |title=Bayesian consensus clustering|journal=Bioinformatics|date=2013|doi=10.1093/bioinformatics/btt425|pmid=23990412|pmc=3789539|volume=29|number=20|pages=2610–2616|arxiv=1302.7280|bibcode=2013Bioin..29.2610L}}</ref> The full posterior for the separate clusterings, and the consensus clustering, are inferred simultaneously via [[Gibbs sampling]].
#'''Ensemble Clustering Fuzzification Means (ECF-Means)''': ECF-means is a clustering algorithm that combines the different clustering results of an ensemble, obtained from different runs of a chosen algorithm ([[k-means]]), into a single final clustering configuration.<ref name=ZazzECF>{{cite journal|last=Zazzaro|first=Gaetano|author2=Martone, Angelo |title=ECF-means - Ensemble Clustering Fuzzification Means. A novel algorithm for clustering aggregation, fuzzification, and optimization |journal=IMM 2018: The Eighth International Conference on Advances in Information Mining and Management|date=2018}} [https://www.thinkmind.org/articles/immm_2018_2_10_50010.pdf]</ref>
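The sHBGF graph construction described above can be sketched as follows. The input format (a list of per-clustering posterior membership matrices) and the function name are illustrative; the source only specifies that the graph has ''n''&nbsp;+&nbsp;''t'' vertices with instance–cluster edges.

```python
import numpy as np

def shbgf_graph(memberships):
    """Build the sHBGF bipartite adjacency matrix: n instance vertices plus
    t cluster vertices, where the (instance, cluster) edge weight is the
    instance's membership probability in that cluster.  `memberships` is a
    list of (n x k_i) posterior matrices, one per constituent clustering."""
    W = np.hstack(memberships)  # shape (n, t), t = total number of clusters
    n, t = W.shape
    A = np.zeros((n + t, n + t))
    A[:n, n:] = W    # instance -> cluster edges
    A[n:, :n] = W.T  # mirrored so the adjacency matrix is symmetric
    return A
```

The resulting weighted graph would then be partitioned (e.g. with METIS) to obtain the consensus clusters.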


== References ==
sCSPA extends CSPA by calculating a similarity matrix. Each object is visualized as a point in dimensional space, with each dimension corresponding to probability of its belonging to a cluster. This technique first transforms the objects into a label-space and then interprets the dot product between the vectors representing the objects as their similarity.

'''2. sMCLA'''

sMCLA extends MCLA by accepting soft clusterings as input. sMCLA’s working can be divided into the following steps:

* Construct Soft Meta-Graph of Clusters
* Group the Clusters into Meta-Clusters
* Collapse Meta-Clusters using Weighting
* Compete for Objects

'''3. sHBGF'''

HBGF represents the ensemble as a bipartite graph with clusters and instances as nodes, and edges between the instances and the clusters they belong to.<ref>Solving cluster ensemble problems by bipartite graph partitioning, Xiaoli Zhang Fern and Carla E. Brodley, Proceedings of the twenty-first international conference on Machine learning</ref> This approach can be trivially adapted to consider soft ensembles, since the graph partitioning algorithm METIS accepts weights on the edges of the graph to be partitioned. In sHBGF, the graph has ''n''&nbsp;+&nbsp;''t'' vertices, where ''t'' is the total number of underlying clusters.
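The sHBGF graph construction can be sketched directly from this description. The sketch below is illustrative (the function name and the use of a dense adjacency matrix in place of a METIS input file are choices of this example, not part of the original method):

```python
import numpy as np

def shbgf_graph(soft_clusterings):
    """Build the sHBGF bipartite graph as a weighted adjacency matrix.

    soft_clusterings: list of (n x k_q) membership matrices, one per
    constituent clusterer. The graph has n + t vertices, where t is the
    total number of underlying clusters, and the edge between instance i
    and cluster j carries the membership probability of i in j.
    """
    n = soft_clusterings[0].shape[0]
    t = sum(m.shape[1] for m in soft_clusterings)  # total cluster count
    W = np.zeros((n + t, n + t))
    col = 0
    for m in soft_clusterings:
        k = m.shape[1]
        # instances occupy vertices 0..n-1, clusters occupy n..n+t-1;
        # only instance-cluster edges exist (bipartite structure)
        W[:n, n + col : n + col + k] = m
        W[n + col : n + col + k, :n] = m.T  # undirected graph
        col += k
    return W
```

The resulting weighted graph would then be handed to a partitioner such as METIS; the partition of the first ''n'' vertices gives the consensus clustering.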

==Tunable-tightness partitions==

In this different form of clustering, each data object is allowed to be exclusively assigned to one and only one cluster, to be unassigned from all clusters, or to be simultaneously assigned to multiple clusters, in a completely tunable way.<ref name=AbuJamousPLOSONE /> In some applications, such as gene clustering, this matches the biological reality: many of the genes considered for clustering in a particular gene discovery study may be irrelevant to the case at hand and should ideally not be assigned to any of the output clusters, while any single gene can participate in multiple processes and is therefore usefully included in multiple clusters simultaneously. This has been proposed in the method of the [[binarization of consensus partition matrices]] (Bi-CoPaM)<ref name=AbuJamousPLOSONE /> and has been used in the field of [[bioinformatics]].<ref name=AbuJamousInterface />
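The tunable-assignment idea can be illustrated with a simple threshold-based binarization of a fuzzy consensus partition matrix. This is only a minimal sketch of one possible binarization rule, not the full Bi-CoPaM method; the function name and the threshold rule are illustrative:

```python
import numpy as np

def tunable_binarize(fuzzy_membership, delta):
    """Threshold-style binarization of a fuzzy consensus partition matrix.

    fuzzy_membership: (k x n) matrix of membership values in [0, 1],
    one row per cluster, one column per object.
    delta: tightness threshold; raising it tightens the clusters.
    An object may end up in several clusters, exactly one, or none.
    """
    return (np.asarray(fuzzy_membership) >= delta).astype(int)
```

Lowering the threshold lets objects join several clusters at once; raising it leaves weakly associated objects unassigned, which is the behaviour described above for irrelevant genes.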

{{more footnotes|date=December 2010}}

== References ==
<references />

* Aristides Gionis, [[Heikki Mannila]], Panayiotis Tsaparas. [https://web.archive.org/web/20060828084525/http://www.cs.helsinki.fi/u/tsaparas/publications/aggregated-journal.pdf Clustering Aggregation]. 21st International Conference on Data Engineering (ICDE 2005)
* Hongjun Wang, Hanhuai Shan, Arindam Banerjee. [http://www.siam.org/proceedings/datamining/2009/SDM09_022_wangh.pdf Bayesian Cluster Ensembles]{{Dead link|date=November 2019 |bot=InternetArchiveBot |fix-attempted=yes }}, SIAM International Conference on Data Mining, SDM 09
* [[Alexander Strehl]] and J. Ghosh, [http://strehl.com/download/strehl-jmlr02.pdf Cluster ensembles – a knowledge reuse framework for combining multiple partitions], Journal on Machine Learning Research (JMLR) 2002
*{{cite conference | last1=Nguyen | first1=Nam | last2=Caruana | first2=Rich | title=Seventh IEEE International Conference on Data Mining (ICDM 2007) | chapter=Consensus Clusterings | publisher=IEEE | year=2007 | pages=607–612 | doi=10.1109/icdm.2007.73 | isbn=978-0-7695-3018-5 |quote=...we address the problem of combining multiple clusterings without access to the underlying features of the data. This process is known in the literature as clustering ensembles, clustering aggregation, or consensus clustering. Consensus clustering yields a stable and robust final clustering that is in agreement with multiple clusterings. We find that an iterative EM-like method is remarkably effective for this problem. We present an iterative algorithm and its variations for finding clustering consensus. An extensive empirical study compares our proposed algorithms with eleven other consensus clustering methods on four data sets using three different clustering performance metrics. The experimental results show that the new ensemble clustering methods produce clusterings that are as good as, and often better than, these other methods.}}
* Kunal Punera, Joydeep Ghosh. [http://www.ideal.ece.utexas.edu/papers/2007/punera07softconsensus.pdf Consensus Based Ensembles of Soft Clusterings].
* Alexander Topchy, Anil K. Jain, William Punch. [http://dataclustering.cse.msu.edu/papers/TPAMI-ClusteringEnsembles.pdf Clustering Ensembles: Models of Consensus and Weak Partitions]. IEEE International Conference on Data Mining, ICDM 03 & SIAM International Conference on Data Mining, SDM 04
* Basel Abu-Jamous, Rui Fa, David J. Roberts and Asoke K. Nandi. [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3569426/pdf/pone.0056432.pdf Paradigm of Tunable Clustering using Binarization of Consensus Partition Matrices (Bi-CoPaM) for Gene Discovery], PLOS ONE 8(2) (doi:10.1371/journal.pone.0056432) 2013

== Further reading ==
* {{cite conference|url=http://www.siam.org/proceedings/alenex/2008/alx08_011godera.pdf|format=PDF|publisher=Society for Industrial and Applied Mathematics|title=Consensus Clustering Algorithms: Comparison and Refinement|author=Andrey Goder and Vladimir Filkov|booktitle=2008 Proceedings of the Ninth Workshop on Algorithm Engineering and Experiments (ALENEX) &mdash; San Francisco, January 19, 2008}}
* {{cite conference|title=Weighted Consensus Clustering|author=Tao Li and Chris Ding|url=http://www.siam.org/proceedings/datamining/2008/dm08_72_li.pdf|format=PDF|booktitle=Proceedings of the 2008 SIAM International Conference on Data Mining &mdash; Atlanta, April 24&ndash;26, 2008|publisher=Society for Industrial and Applied Mathematics}}
* Stefano Monti, Pablo Tamayo, Jill Mesirov and Todd Golub. [http://www.broad.mit.edu/mpr/publications/projects/Bioinformatics/consensus4pdflatex.pdf "Consensus Clustering – A resampling-based method for class discovery and visualization of gene expression microarray data"]


[[Category:Cluster analysis]]
[[Category:NP-complete problems]]

Latest revision as of 15:36, 28 November 2024

Consensus clustering is a method of aggregating (potentially conflicting) results from multiple clustering algorithms. Also called cluster ensembles[1] or aggregation of clustering (or partitions), it refers to the situation in which a number of different (input) clusterings have been obtained for a particular dataset and it is desired to find a single (consensus) clustering which is a better fit in some sense than the existing clusterings.[2] Consensus clustering is thus the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm. When cast as an optimization problem, consensus clustering is known as median partition, and has been shown to be NP-complete,[3] even when the number of input clusterings is three.[4] Consensus clustering for unsupervised learning is analogous to ensemble learning in supervised learning.

Issues with existing clustering techniques

  • Current clustering techniques do not address all the requirements adequately:
  • Dealing with large numbers of dimensions and large numbers of data items can be problematic because of time complexity;
  • The effectiveness of the method depends on the definition of "distance" (for distance-based clustering);
  • If an obvious distance measure does not exist, we must "define" one, which is not always easy, especially in multidimensional spaces;
  • The result of the clustering algorithm (which in many cases can itself be arbitrary) can be interpreted in different ways.

Justification for using consensus clustering


There are potential shortcomings for all existing clustering techniques. This may cause interpretation of results to become difficult, especially when there is no knowledge about the number of clusters. Clustering methods are also very sensitive to the initial clustering settings, which can cause non-significant data to be amplified in non-reiterative methods. An extremely important issue in cluster analysis is the validation of the clustering results, that is, how to gain confidence in the significance of the clusters provided by the clustering technique (cluster numbers and cluster assignments). Lacking an external objective criterion (the equivalent of a known class label in supervised analysis), this validation becomes somewhat elusive.

Iterative descent clustering methods, such as the SOM and k-means clustering, circumvent some of the shortcomings of hierarchical clustering by providing univocally defined clusters and cluster boundaries. However, they lack the intuitive and visual appeal of hierarchical clustering dendrograms, and the number of clusters must be chosen a priori. Consensus clustering provides a method to represent the consensus across multiple runs of a clustering algorithm, to determine the number of clusters in the data, and to assess the stability of the discovered clusters. It can also be used to represent the consensus over multiple runs of a clustering algorithm with random restart (such as k-means, model-based Bayesian clustering, or SOM), so as to account for sensitivity to the initial conditions, and it can provide data for a visualization tool to inspect cluster number, membership, and boundaries.

The Monti consensus clustering algorithm


The Monti consensus clustering algorithm[5] is one of the most popular consensus clustering algorithms and is used to determine the number of clusters, K. Given a dataset with N total points to cluster, this algorithm works by resampling and clustering the data for each K, and an N × N consensus matrix is calculated, where each element represents the fraction of times two samples clustered together. A perfectly stable matrix would consist entirely of zeros and ones, representing all sample pairs always clustering together or never clustering together over all resampling iterations. The relative stability of the consensus matrices can be used to infer the optimal K.

More specifically, given a set of points to cluster, D = {e1, e2, …, eN}, let D^(1), D^(2), …, D^(H) be the list of H perturbed (resampled) datasets of the original dataset D, and let M^(h) denote the N × N connectivity matrix resulting from applying a clustering algorithm to the perturbed dataset D^(h). The entries of M^(h) are defined as follows:

M^(h)(i, j) = 1 if points i and j belong to the same cluster, and 0 otherwise.

Let I^(h) be the N × N indicator matrix whose (i, j)-th entry is equal to 1 if points i and j are both present in the perturbed dataset D^(h), and 0 otherwise. The indicator matrix is used to keep track of which samples were selected during each resampling iteration, for the normalisation step. The consensus matrix M^(K) is defined as the normalised sum of the connectivity matrices over all the perturbed datasets, and a different one is calculated for every K:

M^(K)(i, j) = ( Σ_h M^(h)(i, j) ) / ( Σ_h I^(h)(i, j) )

That is, the entry (i, j) in the consensus matrix is the number of times points i and j were clustered together, divided by the total number of times they were selected together. The matrix is symmetric and each element is defined within the range [0, 1]. A consensus matrix is calculated for each K to be tested, and the stability of each matrix, that is, how close the matrix is to a matrix of perfect stability (just zeros and ones), is used to determine the optimal K. One way of quantifying the stability of the Kth consensus matrix is examining its CDF curve (see below).
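The resampling computation above can be sketched in code. The base clustering algorithm is passed in as a function, and the function names and the default resampling fraction are illustrative choices, not part of the original specification:

```python
import numpy as np

def consensus_matrix(data, cluster_fn, n_resamples=100, frac=0.8, rng=None):
    """Monti-style consensus matrix for one value of K.

    cluster_fn(sub_data) -> integer labels; any base clustering algorithm.
    Returns M where M[i, j] = (# times i, j clustered together)
                              / (# times i, j resampled together).
    """
    rng = np.random.default_rng(rng)
    n = len(data)
    together = np.zeros((n, n))  # running sum of connectivity matrices
    sampled = np.zeros((n, n))   # running sum of indicator matrices
    for _ in range(n_resamples):
        # draw a perturbed dataset by subsampling without replacement
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = cluster_fn(data[idx])
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i, j] += 1
                if labels[a] == labels[b]:
                    together[i, j] += 1
    # normalise; pairs never sampled together default to 0
    return np.where(sampled > 0, together / np.maximum(sampled, 1), 0.0)
```

On well-separated data the resulting matrix approaches a block matrix of zeros and ones, the "perfect stability" described above.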

Over-interpretation potential of the Monti consensus clustering algorithm

(Figure: PAC measure (proportion of ambiguous clustering) explained. The optimal K is the K with the lowest PAC value.)

Monti consensus clustering can be a powerful tool for identifying clusters, but it needs to be applied with caution, as shown by Şenbabaoğlu et al.[6] It has been shown that the Monti consensus clustering algorithm is able to claim apparent stability of chance partitioning of null datasets drawn from a unimodal distribution, and thus has the potential to lead to over-interpretation of cluster stability in a real study.[6][7] If clusters are not well separated, consensus clustering could lead one to conclude apparent structure when there is none, or to declare cluster stability when it is subtle. Identifying false positive clusters is a common problem throughout cluster research,[8] and has been addressed by methods such as SigClust[8] and the GAP-statistic.[9] However, these methods rely on certain assumptions for the null model that may not always be appropriate.

Şenbabaoğlu et al.[6] demonstrated that the original delta-K metric for deciding K in the Monti algorithm performed poorly, and proposed a superior metric for measuring the stability of consensus matrices using their CDF curves. In the CDF curve of a consensus matrix, the lower left portion represents sample pairs rarely clustered together, the upper right portion represents those almost always clustered together, whereas the middle segment represents those with ambiguous assignments in different clustering runs. The proportion of ambiguous clustering (PAC) score quantifies this middle segment; it is defined as the fraction of sample pairs with consensus indices falling in the interval (u1, u2) ⊂ [0, 1], where u1 is a value close to 0 and u2 is a value close to 1 (for instance u1 = 0.1 and u2 = 0.9). A low value of PAC indicates a flat middle segment, and a low rate of discordant assignments across permuted clustering runs. One can therefore infer the optimal number of clusters by the K value having the lowest PAC.[6][7]
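The PAC score follows directly from the consensus matrix entries; a minimal sketch, using the conventional u1 = 0.1 and u2 = 0.9 cut-offs mentioned above:

```python
import numpy as np

def pac_score(M, u1=0.1, u2=0.9):
    """Proportion of ambiguous clustering: the fraction of sample pairs
    whose consensus index falls strictly inside (u1, u2). Lower is better;
    the optimal K is the one whose consensus matrix has the lowest PAC."""
    iu = np.triu_indices_from(M, k=1)  # count each unordered pair once
    vals = M[iu]
    return np.mean((vals > u1) & (vals < u2))
```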

Related work
  1. Clustering ensemble (Strehl and Ghosh): They considered various formulations for the problem, most of which reduce the problem to a hyper-graph partitioning problem. In one of their formulations they considered the same graph as in the correlation clustering problem. The solution they proposed is to compute the best k-partition of the graph, which does not take into account the penalty for merging two nodes that are far apart.[1]
  2. Clustering aggregation (Fern and Brodley): They applied the clustering aggregation idea to a collection of soft clusterings they obtained by random projections. They used an agglomerative algorithm and did not penalize for merging dissimilar nodes.[10]
  3. Fred and Jain: They proposed to use a single linkage algorithm to combine multiple runs of the k-means algorithm.[11]
  4. Dana Cristofor and Dan Simovici: They observed the connection between clustering aggregation and clustering of categorical data. They proposed information-theoretic distance measures, and genetic algorithms for finding the best aggregation solution.[12]
  5. Topchy et al.: They defined clustering aggregation as a maximum likelihood estimation problem, and they proposed an EM algorithm for finding the consensus clustering.[13]
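The evidence-accumulation idea of Fred and Jain (item 3) can be sketched as follows: accumulate a co-association matrix from the label vectors of multiple k-means runs, then apply single linkage to the complementary distances. The naive single-linkage routine below is written out only for self-containment; it is an illustrative implementation, not the authors' code:

```python
import numpy as np

def co_association(labelings):
    """Fraction of runs in which each pair of points shares a cluster."""
    L = np.asarray(labelings)  # shape: (runs, n)
    n = L.shape[1]
    C = np.zeros((n, n))
    for labels in L:
        C += (labels[:, None] == labels[None, :])
    return C / len(L)

def single_linkage(dist, k):
    """Naive single-linkage agglomeration down to k clusters."""
    n = dist.shape[0]
    clusters = [{i} for i in range(n)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between closest members
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] |= clusters.pop(b)  # merge the closest pair
    labels = np.empty(n, dtype=int)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels
```

The consensus clustering is then `single_linkage(1.0 - co_association(runs), k)`, where each entry of `runs` holds the labels from one k-means run.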

Hard ensemble clustering


This approach by Strehl and Ghosh introduces the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings. They discuss three approaches to solving this problem to obtain high-quality consensus functions. Their techniques have low computational costs, which makes it feasible to evaluate each of the techniques discussed below and arrive at the best solution by comparing the results against the objective function.

Efficient consensus functions

  1. Cluster-based similarity partitioning algorithm (CSPA): In CSPA the similarity between two data-points is defined to be directly proportional to the number of constituent clusterings of the ensemble in which they are clustered together. The intuition is that the more similar two data-points are, the higher is the chance that constituent clusterings will place them in the same cluster. CSPA is the simplest heuristic, but its computational and storage complexity are both quadratic in n. SC3 is an example of a CSPA type algorithm.[14] The following two methods are computationally less expensive:
  2. Hyper-graph partitioning algorithm (HGPA): The HGPA algorithm takes a very different approach to finding the consensus clustering than the previous method. The cluster ensemble problem is formulated as partitioning the hypergraph by cutting a minimal number of hyperedges. It makes use of hMETIS, a hypergraph partitioning package.
  3. Meta-clustering algorithm (MCLA): The meta-clustering algorithm is based on clustering clusters. First, it tries to solve the cluster correspondence problem, and then it uses voting to place data-points into the final consensus clusters. The cluster correspondence problem is solved by grouping the clusters identified in the individual clusterings of the ensemble. The clustering is performed using METIS and spectral clustering.
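The MCLA pipeline can be sketched in a simplified form. This toy version replaces the METIS-based grouping of clusters with a greedy Jaccard-similarity grouping, which is an assumption of this example rather than the original algorithm; the function name is likewise illustrative:

```python
import numpy as np

def mcla_lite(clusterings, n_meta):
    """Toy MCLA: group similar clusters from the ensemble into
    meta-clusters, then let each object vote itself into the
    meta-cluster in which it appears most often."""
    # represent each cluster as a binary indicator vector over objects
    indicators = []
    for labels in clusterings:
        for c in sorted(set(labels)):
            indicators.append(np.array([1 if l == c else 0 for l in labels]))

    def jaccard(u, v):
        inter = np.sum(u & v)
        union = np.sum(u | v)
        return inter / union if union else 0.0

    # greedy grouping: seed meta-clusters with the first n_meta clusters,
    # then attach every remaining cluster to its most similar seed
    seeds = indicators[:n_meta]
    meta = [[s] for s in seeds]
    for ind in indicators[n_meta:]:
        best = max(range(n_meta), key=lambda m: jaccard(ind, seeds[m]))
        meta[best].append(ind)

    # voting: average association of each object with each meta-cluster
    assoc = np.stack([np.mean(group, axis=0) for group in meta])
    return assoc.argmax(axis=0)
```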

Soft clustering ensembles


Punera and Ghosh extended the idea of hard clustering ensembles to the soft clustering scenario. Each instance in a soft ensemble is represented by a concatenation of r posterior membership probability distributions obtained from the constituent clustering algorithms. We can define a distance measure between two instances using the Kullback–Leibler (KL) divergence, which calculates the "distance" between two probability distributions.[15]
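The KL-based distance between two instances of a soft ensemble can be computed per constituent clusterer and summed; in this sketch, the symmetrisation (averaging both directions of the divergence) and the smoothing constant are illustrative choices:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) between two distributions.

    A small eps is added to avoid log(0) for zero-probability entries."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def soft_instance_distance(x, y):
    """Distance between two instances of a soft ensemble, each given as a
    list of r posterior membership distributions (one per constituent
    clusterer): the sum of symmetrised KL divergences."""
    return sum(0.5 * (kl(p, q) + kl(q, p)) for p, q in zip(x, y))
```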

  1. sCSPA: extends CSPA by calculating a similarity matrix. Each object is visualized as a point in a multi-dimensional space, with each dimension corresponding to its probability of belonging to one cluster. This technique first transforms the objects into a label-space and then interprets the dot product between the vectors representing the objects as their similarity.
  2. sMCLA: extends MCLA by accepting soft clusterings as input. sMCLA's working can be divided into the following steps:
    • Construct Soft Meta-Graph of Clusters
    • Group the Clusters into Meta-Clusters
    • Collapse Meta-Clusters using Weighting
    • Compete for Objects
  3. sHBGF: represents the ensemble as a bipartite graph with clusters and instances as nodes, and edges between the instances and the clusters they belong to.[16] This approach can be trivially adapted to consider soft ensembles, since the graph partitioning algorithm METIS accepts weights on the edges of the graph to be partitioned. In sHBGF, the graph has n + t vertices, where t is the total number of underlying clusters.
  4. Bayesian consensus clustering (BCC): defines a fully Bayesian model for soft consensus clustering in which multiple source clusterings, defined by different input data or different probability models, are assumed to adhere loosely to a consensus clustering.[17] The full posterior for the separate clusterings, and the consensus clustering, are inferred simultaneously via Gibbs sampling.
  5. Ensemble Clustering Fuzzification Means (ECF-Means): ECF-means is a clustering algorithm that combines the different clustering results of an ensemble, obtained from different runs of a chosen algorithm (k-means), into a single final clustering configuration.[18]

References

  1. ^ a b Strehl, Alexander; Ghosh, Joydeep (2002). "Cluster ensembles – a knowledge reuse framework for combining multiple partitions" (PDF). Journal on Machine Learning Research (JMLR). 3: 583–617. doi:10.1162/153244303321897735. S2CID 3068944. This paper introduces the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings. We first identify several application scenarios for the resultant 'knowledge reuse' framework that we call cluster ensembles. The cluster ensemble problem is then formalized as a combinatorial optimization problem in terms of shared mutual information
  2. ^ VEGA-PONS, SANDRO; RUIZ-SHULCLOPER, JOSÉ (1 May 2011). "A Survey of Clustering Ensemble Algorithms". International Journal of Pattern Recognition and Artificial Intelligence. 25 (3): 337–372. doi:10.1142/S0218001411008683. S2CID 4643842.
  3. ^ Filkov, Vladimir (2003). "Integrating microarray data by consensus clustering". Proceedings. 15th IEEE International Conference on Tools with Artificial Intelligence. pp. 418–426. CiteSeerX 10.1.1.116.8271. doi:10.1109/TAI.2003.1250220. ISBN 978-0-7695-2038-4. S2CID 1515525.
  4. ^ Bonizzoni, Paola; Della Vedova, Gianluca; Dondi, Riccardo; Jiang, Tao (2008). "On the Approximation of Correlation Clustering and Consensus Clustering". Journal of Computer and System Sciences. 74 (5): 671–696. doi:10.1016/j.jcss.2007.06.024.
  5. ^ Monti, Stefano; Tamayo, Pablo; Mesirov, Jill; Golub, Todd (2003-07-01). "Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data". Machine Learning. 52 (1): 91–118. doi:10.1023/A:1023949509487. ISSN 1573-0565.
  6. ^ a b c d Şenbabaoğlu, Y.; Michailidis, G.; Li, J. Z. (2014). "Critical limitations of consensus clustering in class discovery". Scientific Reports. 4: 6207. Bibcode:2014NatSR...4.6207.. doi:10.1038/srep06207. PMC 4145288. PMID 25158761.
  7. ^ a b Şenbabaoğlu, Y.; Michailidis, G.; Li, J. Z. (Feb 2014). "A reassessment of consensus clustering for class discovery". bioRxiv 10.1101/002642.
  8. ^ a b Liu, Yufeng; Hayes, David Neil; Nobel, Andrew; Marron, J. S. (2008-09-01). "Statistical Significance of Clustering for High-Dimension, Low–Sample Size Data". Journal of the American Statistical Association. 103 (483): 1281–1293. doi:10.1198/016214508000000454. ISSN 0162-1459. S2CID 120819441.
  9. ^ Tibshirani, Robert; Walther, Guenther; Hastie, Trevor (2001). "Estimating the number of clusters in a data set via the gap statistic". Journal of the Royal Statistical Society, Series B (Statistical Methodology). 63 (2): 411–423. doi:10.1111/1467-9868.00293. ISSN 1467-9868. S2CID 59738652.
  10. ^ Fern, Xiaoli; Brodley, Carla (2004). "Cluster ensembles for high dimensional clustering: an empirical study". J Mach Learn Res. 22.
  11. ^ Fred, Ana L.N.; Jain, Anil K. (2005). "Combining multiple clusterings using evidence accumulation" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 27 (6). Institute of Electrical and Electronics Engineers (IEEE): 835–850. doi:10.1109/tpami.2005.113. ISSN 0162-8828. PMID 15943417. S2CID 10316033.
  12. ^ Dana Cristofor, Dan Simovici (February 2002). "Finding Median Partitions Using Information-Theoretical-Based Genetic Algorithms" (PDF). Journal of Universal Computer Science. 8 (2): 153–172. doi:10.3217/jucs-008-02-0153.
  13. ^ Alexander Topchy, Anil K. Jain, William Punch. Clustering Ensembles: Models of Consensus and Weak Partitions. IEEE International Conference on Data Mining, ICDM 03 & SIAM International Conference on Data Mining, SDM 04
  14. ^ Kiselev, Vladimir Yu; Kirschner, Kristina; Schaub, Michael T; Andrews, Tallulah; Yiu, Andrew; Chandra, Tamir; Natarajan, Kedar N; Reik, Wolf; Barahona, Mauricio; Green, Anthony R; Hemberg, Martin (May 2017). "SC3: consensus clustering of single-cell RNA-seq data". Nature Methods. 14 (5): 483–486. doi:10.1038/nmeth.4236. ISSN 1548-7091. PMC 5410170. PMID 28346451.
  15. ^ Kunal Punera, Joydeep Ghosh. Consensus Based Ensembles of Soft Clusterings
  16. ^ Solving cluster ensemble problems by bipartite graph partitioning, Xiaoli Zhang Fern and Carla Brodley, Proceedings of the twenty-first international conference on Machine learning
  17. ^ Lock, E.F.; Dunson, D.B. (2013). "Bayesian consensus clustering". Bioinformatics. 29 (20): 2610–2616. arXiv:1302.7280. Bibcode:2013Bioin..29.2610L. doi:10.1093/bioinformatics/btt425. PMC 3789539. PMID 23990412.
  18. ^ Zazzaro, Gaetano; Martone, Angelo (2018). "ECF-means - Ensemble Clustering Fuzzification Means. A novel algorithm for clustering aggregation, fuzzification, and optimization". IMM 2018: The Eighth International Conference on Advances in Information Mining and Management. [1]
  • Aristides Gionis, Heikki Mannila, Panayiotis Tsaparas. Clustering Aggregation. 21st International Conference on Data Engineering (ICDE 2005)
  • Hongjun Wang, Hanhuai Shan, Arindam Banerjee. Bayesian Cluster Ensembles[permanent dead link], SIAM International Conference on Data Mining, SDM 09
  • Nguyen, Nam; Caruana, Rich (2007). "Consensus Clusterings". Seventh IEEE International Conference on Data Mining (ICDM 2007). IEEE. pp. 607–612. doi:10.1109/icdm.2007.73. ISBN 978-0-7695-3018-5. ...we address the problem of combining multiple clusterings without access to the underlying features of the data. This process is known in the literature as clustering ensembles, clustering aggregation, or consensus clustering. Consensus clustering yields a stable and robust final clustering that is in agreement with multiple clusterings. We find that an iterative EM-like method is remarkably effective for this problem. We present an iterative algorithm and its variations for finding clustering consensus. An extensive empirical study compares our proposed algorithms with eleven other consensus clustering methods on four data sets using three different clustering performance metrics. The experimental results show that the new ensemble clustering methods produce clusterings that are as good as, and often better than, these other methods.