Clustering high-dimensional data

Clustering high-dimensional data describes the process of performing cluster analysis with data that exhibits a large number of dimensions. The "large number" can range from a few dozen to many thousands. Such high-dimensional data spaces are frequently encountered in application areas like medicine (where DNA microarray technology can produce a large number of measurements at once) or in clustering of text documents (if a word-frequency vector is used, the number of dimensions equals the size of the dictionary).

Clustering in high-dimensional data covers the various approaches that address the particular problems arising from this large number of dimensions.

Problems

According to [1], four problems need to be overcome for clustering in high-dimensional data:

  • Multiple dimensions are hard to think in, impossible to visualize, and, due to the exponential growth of the number of possible values with each dimension, impossible to enumerate. This problem is known as the curse of dimensionality.
  • For spatial data, the concept of distance becomes less precise as the number of dimensions grows, since the distances between any two points in a given dataset converge. In particular, the discrimination between the nearest and the farthest point becomes meaningless: lim_{d→∞} (dist_max − dist_min) / dist_min → 0. (A short numerical sketch of this effect follows after this list.)
  • A cluster is intended to group objects that are related, based on observations of their attributes' values. However, given a large number of attributes, some of them will usually not be meaningful for a given cluster. For example, in newborn screening a cluster of samples might identify newborns that share similar blood values, which might lead to insights about the relevance of certain blood values for a disease. But for different diseases, different blood values might form a cluster, while other values are uncorrelated. This is known as the local feature relevance problem: different clusters might be found in different subspaces, so a global filtering of attributes is not sufficient.
  • Given a large number of attributes, it is likely that some attributes are correlated. Hence, clusters might exist in arbitrarily oriented affine subspaces.
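
The loss of distance contrast mentioned in the second point can be observed with a few lines of code. The following is a minimal numerical sketch, not taken from the cited survey; the point count and the dimensions tested are arbitrary choices:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000  # points drawn uniformly from the unit hypercube

    for d in [2, 10, 100, 1000]:
        points = rng.random((n, d))
        query = rng.random(d)
        dists = np.linalg.norm(points - query, axis=1)
        d_min, d_max = dists.min(), dists.max()
        # the relative contrast (d_max - d_min) / d_min shrinks as d grows,
        # i.e. the nearest and farthest point become nearly indistinguishable
        print(f"d={d:5d}  relative contrast = {(d_max - d_min) / d_min:.3f}")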

Approaches

Subspace Clustering

Figure: example 2D space with subspace clusters

Following [1], subspace clustering is the task of detecting all clusters in all subspaces. This means that a point might be a member of multiple clusters, each existing in a different subspace. Subspaces can either be axis-parallel or affine. The term is often used synonymously with general clustering in high-dimensional data.

The image on the right shows a mere two-dimensional space where a number of clusters can be identified. In the one-dimensional subspaces, the clusters c_a (in subspace {x}) and c_b, c_c, c_d (in subspace {y}) can be found. c_c cannot be considered a cluster in the two-dimensional (sub-)space, since it is too sparsely distributed in the x axis. In two dimensions, the two clusters c_ab and c_ad can be identified.

The problem of subspace clustering is given by the fact that there are 2^d different axis-parallel subspaces of a space with d dimensions. If the subspaces are not axis-parallel, an infinite number of subspaces is possible. Hence, subspace clustering algorithms utilize some kind of heuristic to remain computationally feasible, at the risk of producing inferior results. For example, the downward-closure property (cf. association rules) can be used to build higher-dimensional subspaces only by combining lower-dimensional ones: if a subspace S contains a cluster, then every subspace of S also contains a cluster, so only combinations of cluster-bearing lower-dimensional subspaces need to be examined. This approach is taken by most of the traditional algorithms, such as CLIQUE[2] and SUBCLU[3]; a simplified sketch of this bottom-up search follows.
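
The following is a minimal sketch of this bottom-up search, assuming a grid-based density notion in the spirit of CLIQUE; the function names, the number of grid intervals, and the density threshold are illustrative choices, not the published algorithm:

    from itertools import combinations

    def dense_units(data, dims, bins=5, min_points=20):
        """Return the dense grid cells of `data` in the subspace `dims`.

        Coordinates are assumed to lie in [0, 1); a cell is a tuple of
        (dimension, interval-index) pairs and is dense if it contains at
        least `min_points` points.
        """
        counts = {}
        for point in data:
            cell = tuple((d, min(int(point[d] * bins), bins - 1)) for d in dims)
            counts[cell] = counts.get(cell, 0) + 1
        return {cell for cell, count in counts.items() if count >= min_points}

    def bottom_up_subspaces(data, n_dims, bins=5, min_points=20):
        """Find all subspaces containing dense cells, grown bottom-up."""
        # start with all one-dimensional subspaces that contain dense cells
        dense = {}
        for d in range(n_dims):
            units = dense_units(data, (d,), bins, min_points)
            if units:
                dense[(d,)] = units
        result = dict(dense)
        while dense:
            # join k-dimensional subspaces sharing k-1 dimensions to form
            # (k+1)-dimensional candidates; by downward closure, no other
            # subspace can contain a dense cell
            candidates = set()
            for a, b in combinations(dense, 2):
                merged = tuple(sorted(set(a) | set(b)))
                if len(merged) == len(a) + 1:
                    candidates.add(merged)
            dense = {}
            for dims in candidates:
                units = dense_units(data, dims, bins, min_points)
                if units:
                    dense[dims] = units
            result.update(dense)
        return result

CLIQUE itself additionally merges adjacent dense cells within each subspace into clusters and computes covering descriptions; the sketch stops at the subspace search, which is the part driven by the downward-closure property.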

Existing algorithms can be divided into bottom-up approaches, which grow cluster-bearing subspaces out of lower-dimensional ones as described above, and top-down algorithms, which try to learn for each point the subspaces in which it can possibly be a cluster member (e.g., by investigating the variance of the attributes compared to neighboring points, with a low variance resulting in an increased likelihood that this point is part of a cluster in that subspace). The latter approach is taken by DiSH[4], though DiSH is not a pure subspace clustering algorithm. A simplified sketch of this variance criterion follows.
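
The following is a hand-rolled simplification of the general top-down idea, not the DiSH algorithm itself; the neighborhood size k and the variance threshold are arbitrary assumptions:

    import numpy as np

    def subspace_preference(data, index, k=10, max_variance=0.01):
        """Return the attributes in which `data[index]` may be clustered.

        An attribute is taken as relevant if the k nearest neighbors of
        the point have low variance along it, i.e. the neighborhood is
        'flat' in that dimension.
        """
        point = data[index]
        # nearest neighbors in the full-dimensional space (a simplification)
        order = np.argsort(np.linalg.norm(data - point, axis=1))
        neighbors = data[order[:k + 1]]  # includes the point itself
        return np.flatnonzero(neighbors.var(axis=0) <= max_variance)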

Projected Clustering

In contrast to subspace clustering, projected clustering seeks to assign each point to a unique cluster, but clusters may exist in different subspaces. A simplified sketch of the idea follows.
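
To make the contrast with subspace clustering concrete, here is a simplified, PROCLUS-flavored sketch, not any published algorithm: a k-means-style loop in which every point receives exactly one cluster label, while each cluster repeatedly re-estimates its own set of relevant dimensions as those with the lowest variance among its members. All parameters are illustrative:

    import numpy as np

    def projected_kmeans(data, k=3, dims_per_cluster=2, iterations=10, seed=0):
        """Assign each point to exactly one cluster with its own subspace."""
        rng = np.random.default_rng(seed)
        n, d = data.shape
        centers = data[rng.choice(n, size=k, replace=False)].copy()
        relevant = [np.arange(d) for _ in range(k)]  # start fully dimensional
        for _ in range(iterations):
            # distance of every point to each center, measured only in that
            # cluster's own dimensions and normalized so that subspaces of
            # different size remain comparable
            dists = np.stack([
                np.linalg.norm(data[:, relevant[j]] - centers[j, relevant[j]],
                               axis=1) / np.sqrt(len(relevant[j]))
                for j in range(k)
            ])
            labels = dists.argmin(axis=0)
            for j in range(k):
                members = data[labels == j]
                if len(members) == 0:
                    continue
                centers[j] = members.mean(axis=0)
                # a cluster's relevant dimensions: those of lowest variance
                relevant[j] = np.argsort(members.var(axis=0))[:dims_per_cluster]
        return labels, centers, relevant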

References

  1. ^ a b Hans-Peter Kriegel, Peer Kröger and Arthur Zimek: Clustering High-Dimensional Data: A Survey on Subspace Clustering, Pattern-Based Clustering, and Correlation Clustering. In: ACM Transactions on Knowledge Discovery from Data. Vol. 3, No. 1, pp. 1-58, 2009.
  2. ^ Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos and Prabhakar Raghavan: Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. In: Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 94-105, 1998.
  3. ^ Karin Kailing, Hans-Peter Kriegel and Peer Kröger. Density-Connected Subspace Clustering for High-Dimensional Data. In: Proc. SIAM Int. Conf. on Data Mining (SDM'04), pp. 246-257, 2004.
  4. ^ Elke Achtert, Christian Böhm, Hans-Peter Kriegel, Peer Kröger, Ina Müller-Gorman and Arthur Zimek. Detection and Visualization of Subspace Cluster Hierarchies. In: DASFAA '07, vol. 4443 of Lecture Notes in Computer Science, pp. 152-163, 2007.