Jump to content

Talk:Cluster analysis: Difference between revisions

Page contents not supported in other languages.
From Wikipedia, the free encyclopedia
Content deleted Content added
Bunty.Gill (talk | contribs)
No edit summary
Bunty.Gill (talk | contribs)
Line 133: Line 133:


==Affinity Propagation==
==Affinity Propagation==
I believe that this algorithm which appeared in [[Science]] Feb 07 will change the way people think about clustering. See www.psi.toronto.edu/affinitypropagation/ and www.sciencemag.org/cgi/content/full/sci;315/5814/972 . However I am not capable of writing a full introduction, so I hope someone better equiped for the job will do that. Including the AP breakthrough is a must in my view to retain the currency of this article. [[User:Bunty.Gill|Bunty.Gill]] 08:50, 24 April 2007 (UTC)
I believe that this algorithm which appeared in [[Science(journal)]] Feb 07 will change the way people think about clustering. See www.psi.toronto.edu/affinitypropagation/ and www.sciencemag.org/cgi/content/full/sci;315/5814/972 . However I am not capable of writing a full introduction, so I hope someone better equiped for the job will do that. Including the AP breakthrough is a must in my view to retain the currency of this article. [[User:Bunty.Gill|Bunty.Gill]] 08:50, 24 April 2007 (UTC)

Revision as of 08:51, 24 April 2007

Template:Directory request

Sabotage

This page appears to have been deliberately vandalised.

Elbow criterion section

Someone should edit out the blatant advertising for the excel plugin product. It would also be nice to have some additional info on picking the number of clusters?

Sorry, I just had Excel at hand when making the graph. This is not an advertisement, this is not real data, but a crappy Excel hand made graph to help visualize the EC heuristic. If you look through the archives, you will see I also made an extremely ugly representation for the hierarchical clustering, which was thankfully replaced.
As for how to tweak clustering, either you have a good idea of how many clusters you want (number criterion), or a good idea of how much total variance (or another perf metric) you want to explain, or all you are left with is heuristic. That's from a Computer science POV.
Looking in another direction, for example statistics, there are ways to compare models with a different number of parameters, like Akaike information criterion and the methods linked from there, or maybe something based on information entropy. It will again help you choose in the tradeoff between having too many cluster or having too low perf because of low cluser number. I'm sorry, I don't have any relevant article nor the time at the moment to find one. Bryan
Actually ran into an article about using entropy criterion to stop clustering: Cluster Identification Using Maximum Configuration Entropy, by C.H. Li. —The preceding unsigned comment was added by 86.53.54.179 (talk) 19:02, 1 April 2007 (UTC).[reply]

V-means clustering

A Google search for "V-means clustering" only returns this Wikipedia article. Can someone provide a citation for this?

for future ref, this is the V-means paragraph that was removed

V-means clustering

V-means clustering utilizes cluster analysis and nonparametric statistical tests to key researchers into segments of data that may contain distinct homogenous sub-sets. The methodology embraced by V-means clustering circumvents many of the problems that traditionally beleaguer standard techniques for categorizing data. First, instead of relying on analyst predictions for the number of distinct sub-sets (k-means clustering), V-means clustering generates a pareto optimal number of sub-sets. V-means clustering is calibrated to a user-defined confidence level p, whereby the algorithm divides the data and then recombines the resulting groups until the probability that any given group belongs to the same distribution as either of its neighbors is less than p.

Second, V-means clustering makes use of repeated iterations of the nonparametric Kolmogorov-Smirnov test. Standard methods of dividing data into its constituent parts are often entangled in definitions of distances (distance measure clustering) or in assumptions about the normality of the data (expectation maximization clustering), but nonparametric analysis draws inference from the distribution functions of sets.

Third, the method is conceptually simple. Some methods combine multiple techniques in sequence in order to produce more robust results. From a practical standpoint this muddles the meaning of the results and frequently leads to conclusions typical of “data dredging.”

Fuzzy c-means clarification

I believe ther is a typo at "typological analysis"; should be "topological"

The explanation of the fuzzy c-means algorithm seems quite difficult to follow, the actual order of the bullet points is correct but which bit is to be repeated and when is misleading.

"The fuzzy c-means algorithm is greatly similar to the k-means algorithm:

  • Choose a number of clusters
  • Assign randomly to each point coefficients for being in the clusters
  • Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold) :
    • Compute the centroid for each cluster, using the formula above
    • For each point, compute its coefficients of being in the clusters, using the formula above"

Also aren't c-means and k-means just different names for the same thing, in which case can they be changed to be consistent throughout?



The c-means clustering relates only to the fuzzy logic clustering algorithm. You could say that k-means is teh convergence of c-clustering with ordinary logic, rather than fuzzy logic.

On-line versus off-line clustering

Beyond the division of clustering methodologies hierarchical/partitional agglomerative/divisive, it is possible to differentiate betewen: Arrivial of data points: on-line/off-line Type of data: stationary/non-stationary


Additionally it may be helpful to discuss some of the difficulties in clustering data, in particular choosing the correct number of centroids, limits on memory or processing time, and techniques for solving them, such as measuring phase transition (Rose, Gurewitz & Fox)


Another division would be Extensional vs. Conceptual Clustering. --Beau 10:40, 28 June 2006 (UTC)[reply]

Missing reference

I find this in the article:

This is the basic structure of the algorithm (J. MacQueen, 1967):

But when I looked at the bibliograpy, it was not there. If anyone has the information, could they add it? Michael Hardy 18:36, 21 November 2005 (UTC)[reply]

Impossibility Theorem

The clustering impossibility theorem should be mentioned. it is similar to Arrow's impossibility theorem, except for clustering. I don't know who created it though. Briefly, it states that any clustering algorithm should have these 3 properties:

  • Richness - "all labelings are possible"
  • scale invarience
  • Consistancy - "if you increase inter-cluster distance and decrease intra-cluster distance, cluster assignments should not change"

No clustering algorithm can have all 3.

-- BAxelrod 03:53, 16 December 2005 (UTC)[reply]

It is a good thing to have and mostly one should reference, I guess \bibitem{JK02} Jon Kleinberg. An impossibility theorem for clustering - Advances in Neural Information Processing Systems, 2002.

However, if there is not another source, then I'd mention that there is a little problem with this theorem as it is presented in that article.

First, it deals with graphs G(V,E), having |V| >= 2 and having distances d(i,j) = 0 iff i==j, i,j in V. Thus, take richness and scale invariance (which means that a graph with some fixed weights has the same clustering if all the weights are multiplied by some positive constant), a graph with |V| = 2, and boom - here you go. For each clustering we get either scale invariance or richness. If there is richness, then scale invariance does not work and the other way round. Sweet, is not it? Or am I wrong somewhere?

Could you please explain something about Isodata Algorithm for data clustering

The last external link on this page has an example on ISODATA clustering. I will try to do a digest when I have time, but feel free to beat me to it. Bryan

Relation to NLP

It seems a large amount of the effort in text mining related to text clustering is left out of this article, but it seems to be most appropriate place. Josh Froelich 20:16, 9 January 2007 (UTC)[reply]

No strict definition for the problem itself.

There are lots of details about different methods and metrics used to solve the problem, which is defined too unstrictly.

Here they are, should we put them back ?

Software implementations

Free

  • The flexclust package for R
  • COMPACT - Comparative Package for Clustering Assessment (in Matlab)
  • YALE (Yet Another Learning Environment): freely available open-source software for data pre-processing, knowledge discovery, data mining, machine learning, visualization, etc. also including a plugin for clustering, fully integrating Weka, easily extendible, and featuring a graphical user interface as well as a XML-based scripting language for data mining;
  • mixmod : Model Based Cluster And Discriminant Analysis. Code in C++, interface with Matlab and Scilab
  • LingPipe Clustering Tutorial Tutorial for doing complete- and single-link clustering using LingPipe, a Java text data mining package distributed with source.
  • Weka : Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
  • Tanagra : a free data mining software including several clustering algorithms such as K-MEANS, SOM, Clustering Tree, HAC and more.
  • Cluster : Open source clustering software. The routines are available in the form of a C clustering library, an extension module to Python, a module to Perl.
  • python-cluster Pure python implementation

Non-free

I removed these, about a month ago. Wikipedia's external link guidelines discourage large link directories like this. None of the links appear to meet the guidelines, each is just an instance of data clustering software with nothing to indicate it is especially notable to the topic. Normally I would have replaced them with a DMOZ link, but there doesn't appear to be a category for this. -SpuriousQ (talk) 10:14, 22 March 2007 (UTC)[reply]
Actually it did take some time to build up such a list of software. Also, since this is an article about algorithms, any software implementing them is relevant to the topic (if you have an interest in the algo, it might very well be because you need a working implementation, or a reference implementation). Bryan
I understand that, but that list was just an invitation for spam and links to data clustering software someone wrote one day. WP:EL is clear that links should be kept to a minimum. I would probably have no problem with links to clearly notable implementations; for example, one that was the subject of a peer-reviewed published paper or one created by noted researchers in the field. -SpuriousQ (talk) 15:12, 28 March 2007 (UTC)[reply]
No problem, I accept the policy and see its utility, but think the links and comments are important, that's why I asked for a DMOZ directory to be created. Also, there is a problem with the fact that you removed some internal wikipedia links in the process (the YALE, Weka, Peltarion Synapse links). So, if we want to keep the links somewhere, do we try to move them to DMOZ or do we create an intermediary Wikipedia page for each package ? Bryan
Hmm. The standard way of linking to internal articles is in the See also section. But I feel it's a bit questionable to give such prominence to the three instances that happen to have Wikipedia articles. Another idea would be to have an article List of data clustering software and link to that in See also. I would prefer that option, but it's also iffy to have a list article with so few examples. Do you know of any other articles we could add to that list? -SpuriousQ (talk) 01:58, 30 March 2007 (UTC)[reply]
Data mining, software section. The software links there actually suffer the same problem as the ones here, although to a lesser extent (more wikipedia articles, less external links). Please see discussion with David Eppstein. Bryan

I think the list should either be kept out of the article altogether, or it should be explicitly limited to a small number (say half a dozen) of particularly notable clustering systems, for which some strong argument can be made for their inclusion beyond "it's a clustering system" or even "it's a free clustering system". By a strong argument, I mean statements such as "it's the most widely used free clustering system for Linux" or the like. Anything less restrictive is just an invitation for an unencyclopedic link farm. —David Eppstein 03:22, 30 March 2007 (UTC)[reply]

I agree, it's time for a cleanup. Just to give a little "historical perspective", the list was started when commercial spam appeared, it kept it under very good control because it gave space for people tempted to spam the page with their soft's link to express themselves. The list has grown, and it now has many annotations, so I'm quite happy with it and would like to keep some of its info.
Now, to continue the conversation together with SpuriousQ, imagine we move everything to a list of Data mining software. We would lose info about implementations (libraries,scripts) that are not autonomous software, thus failing to highlight "reference implementations" that people may want to examine at a code level, only linking to "working implementations" that people would like to use. So my proposition would be: if the soft is autonomous, we add it to the data mining software list. If the soft is just a library, and seems to be from academia, we keep a link to it. In our case, that would keep flexclust, COMPACT, mixmod and Cluster, the python-cluster link would be lost forever, and the rest would be transferred to the more general and richer data mining software list. What do you think ? Bryan.

Affinity Propagation

I believe that this algorithm which appeared in Science(journal) Feb 07 will change the way people think about clustering. See www.psi.toronto.edu/affinitypropagation/ and www.sciencemag.org/cgi/content/full/sci;315/5814/972 . However I am not capable of writing a full introduction, so I hope someone better equiped for the job will do that. Including the AP breakthrough is a must in my view to retain the currency of this article. Bunty.Gill 08:50, 24 April 2007 (UTC)[reply]