Jump to content

SUBCLU: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
m exactely fix
 
(22 intermediate revisions by 15 users not shown)
Line 1: Line 1:
{{one source|date= February 2010}}
SUBCLU is an algorithm for [[clustering high-dimensional data]] by Karin Kailing, Hans-Peter Kriegel and Peer Kröger<ref>Karin Kailing, Hans-Peter Kriegel and Peer Kröger. ''Density-Connected Subspace Clustering for High-Dimensional Data''. In: ''Proc. SIAM Int. Conf. on Data Mining (SDM'04)'', pp. 246-257, 2004. </ref>. It is a [[subspace clustering]] algorithm that builds on the density-based clustering algorithm [[DBSCAN]]. SUBCLU can find [[cluster analysis|clusters]] in [[axis-parallel]] subspaces, and uses a [[Top-down_and_bottom-up_design|bottom-up]], [[greedy algorithm|greedy]] strategy to remain efficient.
'''SUBCLU''' is an algorithm for [[clustering high-dimensional data]] by Karin Kailing, [[Hans-Peter Kriegel]] and Peer Kröger.<ref>Karin Kailing, [[Hans-Peter Kriegel]] and Peer Kröger. ''[http://fogo.dbs.ifi.lmu.de/~kroegerp/papers/SDM04-SUBCLU.pdf Density-Connected Subspace Clustering for High-Dimensional Data]''. In: ''Proc. SIAM Int. Conf. on Data Mining (SDM'04)'', pp. 246-257, 2004.</ref> It is a [[subspace clustering]] algorithm that builds on the density-based clustering algorithm [[DBSCAN]]. SUBCLU can find [[cluster analysis|clusters]] in [[axis-parallel]] subspaces, and uses a [[Top-down and bottom-up design|bottom-up]], [[greedy algorithm|greedy]] strategy to remain efficient.


==Approach==
==Approach==
SUBCLU uses a [[monotonicity]] criteria: if a cluster is found in a subspace <math>S</math>, then each subspace <math>T \subseteq S</math> also contains a cluster. However, a cluster <math>C \subseteq DB</math> in subspace <math>S</math> is not necessarily a cluster in <math>T \subseteq S</math>, since clusters are required to be maximal, and more objects might be contained in the cluster in <math>T</math> that contains <math>C</math>. However, a [[DBSCAN|density-connected set]] in a subspace <math>S</math> is also a density-connected set in <math>T \subseteq S</math>.
SUBCLU uses a [[monotonicity]] criteria: if a cluster is found in a subspace <math>S</math>, then each subspace <math>T \subseteq S</math> also contains a cluster. However, a cluster <math>C \subseteq DB</math> in subspace <math>S</math> is not necessarily a cluster in <math>T \subseteq S</math>, since clusters are required to be maximal, and more objects might be contained in the cluster in <math>T</math> that contains <math>C</math>. However, a [[DBSCAN|density-connected set]] in a subspace <math>S</math> is also a density-connected set in <math>T \subseteq S</math>.


This is utilized by SUBCLU in a way similar to the [[Apriori algorithm]]: first, all 1-dimensional subspaces are clustered. All clusters in a higher-dimensional subspace will be subsets of the clusters detected in this first clustering. SUBCLU hence recursively produces <math>k+1</math>-dimensional candidate subspaces by combining <math>k</math>-dimensional subspaces with clusters sharing <math>k-1</math> attributes. After pruning irrelevant candidates, [[DBSCAN]] is applied to the candidate subspace to find out if it still contains clusters. If it does, the candidate subspace is used for the next combination of subspaces.
This ''downward-closure property'' is utilized by SUBCLU in a way similar to the [[Apriori algorithm]]: first, all 1-dimensional subspaces are clustered. All clusters in a higher-dimensional subspace will be subsets of the clusters detected in this first clustering. SUBCLU hence recursively produces <math>k+1</math>-dimensional candidate subspaces by combining <math>k</math>-dimensional subspaces with clusters sharing <math>k-1</math> attributes. After pruning irrelevant candidates, [[DBSCAN]] is applied to the candidate subspace to find out if it still contains clusters. If it does, the candidate subspace is used for the next combination of subspaces. In order to improve the runtime of [[DBSCAN]], only the points known to belong to clusters in one <math>k</math>-dimensional subspace (which is chosen to contain as little clusters as possible) are considered. Due to the downward-closure property, other point cannot be part of a <math>k+1</math>-dimensional cluster anyway.

==Pseudocode==
SUBCLU takes two parameters, <math>\epsilon\!\,</math> and <math>MinPts</math>, which serve the same role as in [[DBSCAN]]. In a first step, DBSCAN is used to find 1D-clusters in each subspace spanned by a single attribute:

<math>\mathtt{SUBCLU}(DB, eps, MinPts)</math>
:<math>S_1 := \emptyset</math>
:<math>C_1 := \emptyset</math>
:<math>\mathtt{for\, each}\, a \in Attributes</math>
::<math>C^{\{a\}} = \mathtt{DBSCAN}(DB, \{a\}, eps, MinPts)\!\,</math>
::<math>\mathtt{if} (C^{\{a\}} \neq \emptyset)</math>
:::<math>S_1 := S_1 \cup \{a\}</math>
:::<math>C_1 := C_1 \cup C^{\{a\}}</math>
::<math>\mathtt{end\, if}</math>
:<math>\mathtt{end\, for}</math>

: // In a second step, <math>k+1</math>-dimensional clusters are built from <math>k</math>-dimensional ones:

:<math>k := 1\!\,</math>
:<math>\mathtt{while} (C_k \neq \emptyset)</math>
::<math>\mathtt{CandS}_{k+1} := \mathtt{GenerateCandidateSubspaces}(S_k)\!\,</math>
::<math>\mathtt{for\, each}\, cand \in \mathtt{CandS}_{k+1}</math>
:::<math>\mathtt{bestSubspace :=} \min_{s \in S_k \wedge s \subset cand} \sum_{C_i \in C^s} |C_i|</math>
:::<math>C^{cand} := \emptyset</math>
:::<math>\mathtt{for\, each\, cluster}\, cl \in C^\mathtt{bestSubspace}</math>
::::<math>C^{cand} := C^{cand} \cup \mathtt{DBSCAN}(cl, cand, eps, MinPts)</math>
::::<math>\mathtt{if}\, (C^{cand} \neq \emptyset)</math>
:::::<math>S_{k+1} := S_{k+1} \cup cand</math>
:::::<math>C_{k+1} := C_{k+1} \cup C^{cand}</math>
::::<math>\mathtt{end\, if}</math>
:::<math>\mathtt{end\, for}</math>
::<math>\mathtt{end\, for}</math>
::<math>k:=k+1\!\,</math>
:<math>\mathtt{end\, while}</math>
<math>\mathtt{end}\!\,</math>

The set <math>S_k</math> contains all the <math>k</math>-dimensional subspaces that are known to contain clusters. The set <math>C_k</math> contains the sets of clusters found in the subspaces. The <math>bestSubspace</math> is chosen to minimize the runs of DBSCAN (and the number of points that need to be considered in each run) for finding the clusters in the candidate subspaces.

Candidate subspaces are generated much alike the [[Apriori algorithm]] generates the frequent itemset candidates: Pairs of the <math>k</math>-dimensional subspaces are compared, and if they differ in one attribute only, they form a <math>k+1</math>-dimensional candidate. However, a number of irrelevant candidates are found as well; they contain a <math>k</math>-dimensional subspace that does not contain a cluster. Hence, these candidates are removed in a second step:

<math>\mathtt{GenerateCandidateSubspaces}(S_k)</math>
:<math>\mathtt{CandS}_{k+1} := \emptyset</math>
:<math>\mathtt{for\,each}\,s_1 \in S_k</math>
::<math>\mathtt{for\,each}\,s_2 \in S_k</math>
:::<math>\mathtt{if}\,(s_1\,\mathtt{and}\,s_2\,\,\mathtt{differ\,\,in\,\,exactly\,\,one\,\,attribute})</math>
::::<math>\mathtt{CandS}_{k+1} := \mathtt{CandS}_{k+1} \cup \{s_1 \cup s_2\}</math>
:::<math>\mathtt{end\,if}</math>
::<math>\mathtt{end\,for}</math>
:<math>\mathtt{end\,for}</math>

:''// Pruning of irrelevant candidate subspaces''
:<math>\mathtt{for\,each}\,cand \in \mathtt{CandS}_{k+1}</math>
::<math>\mathtt{for\, each}\, k\texttt{-element}\,s \subset cand</math>
:::<math>\mathtt{if} \, (s \not \in S_k)</math>
::::<math>\mathtt{CandS}_{k+1} = \mathtt{CandS}_{k+1} \setminus \{cand\}</math>
:::<math>\mathtt{end\,if}</math>
::<math>\mathtt{end\,for}</math>
:<math>\mathtt{end\,for}</math>
<math>\mathtt{end}\,\!</math>

==Availability==
An example implementation of SUBCLU is available in the [[Environment for DeveLoping KDD-Applications Supported by Index-Structures|ELKI framework]].


==References==
==References==
<references/>
<references/>

[[Category:Cluster analysis algorithms]]

Latest revision as of 22:15, 7 December 2022

SUBCLU is an algorithm for clustering high-dimensional data by Karin Kailing, Hans-Peter Kriegel and Peer Kröger.[1] It is a subspace clustering algorithm that builds on the density-based clustering algorithm DBSCAN. SUBCLU can find clusters in axis-parallel subspaces, and uses a bottom-up, greedy strategy to remain efficient.

Approach

[edit]

SUBCLU uses a monotonicity criteria: if a cluster is found in a subspace , then each subspace also contains a cluster. However, a cluster in subspace is not necessarily a cluster in , since clusters are required to be maximal, and more objects might be contained in the cluster in that contains . However, a density-connected set in a subspace is also a density-connected set in .

This downward-closure property is utilized by SUBCLU in a way similar to the Apriori algorithm: first, all 1-dimensional subspaces are clustered. All clusters in a higher-dimensional subspace will be subsets of the clusters detected in this first clustering. SUBCLU hence recursively produces -dimensional candidate subspaces by combining -dimensional subspaces with clusters sharing attributes. After pruning irrelevant candidates, DBSCAN is applied to the candidate subspace to find out if it still contains clusters. If it does, the candidate subspace is used for the next combination of subspaces. In order to improve the runtime of DBSCAN, only the points known to belong to clusters in one -dimensional subspace (which is chosen to contain as little clusters as possible) are considered. Due to the downward-closure property, other point cannot be part of a -dimensional cluster anyway.

Pseudocode

[edit]

SUBCLU takes two parameters, and , which serve the same role as in DBSCAN. In a first step, DBSCAN is used to find 1D-clusters in each subspace spanned by a single attribute:

// In a second step, -dimensional clusters are built from -dimensional ones:

The set contains all the -dimensional subspaces that are known to contain clusters. The set contains the sets of clusters found in the subspaces. The is chosen to minimize the runs of DBSCAN (and the number of points that need to be considered in each run) for finding the clusters in the candidate subspaces.

Candidate subspaces are generated much alike the Apriori algorithm generates the frequent itemset candidates: Pairs of the -dimensional subspaces are compared, and if they differ in one attribute only, they form a -dimensional candidate. However, a number of irrelevant candidates are found as well; they contain a -dimensional subspace that does not contain a cluster. Hence, these candidates are removed in a second step:

// Pruning of irrelevant candidate subspaces

Availability

[edit]

An example implementation of SUBCLU is available in the ELKI framework.

References

[edit]
  1. ^ Karin Kailing, Hans-Peter Kriegel and Peer Kröger. Density-Connected Subspace Clustering for High-Dimensional Data. In: Proc. SIAM Int. Conf. on Data Mining (SDM'04), pp. 246-257, 2004.