OPTICS algorithm: Difference between revisions

Content deleted Content added

Inline

Revision as of 08:03, 22 April 2009

OPTICS is an algorithm for finding cluster in spatial data. It was presented by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander^[1]. In its basic idea, it is similar to DBSCAN, but addresses one of DBSCAN's major weaknesses: the problem to detect meaningful cluster in data of varying density. In order to do so, the points of the database are (linearily) ordered such that points which are spatially closest become neighbors in the ordering. Additionally, a special distance is stored for each point that represents the density that needs to be accepted for a cluster in order to have both points belong to the same cluster. This is best represented in a dendrogram.

Basic idea

Like DBSCAN, OPTICS requires two parameters: $\epsilon \!\,$ , which describes the lowest density we want to consider, and $MinPts$ , describing the number of points required to form a cluster. A point $p$ is a core point if at least $MinPts$ points are found within its $\epsilon \!\,$ -neighborhood $N_{\epsilon }(p)$ . Contrary to DBSCAN, OPTICS also consideres points that are part of a more densely packed cluster, so each point is assigned a core distance that basically describes the distance to its $MinPts$ th point:

core-distance_{\epsilon ,MinPts}(p)={\begin{cases}{\text{UNDEFINED}},&{\text{if }}|N_{\epsilon }(p)|<MinPts\\{\text{distance to the }}MinPts{\text{th point}},&{\text{otherwise}}\end{cases}}

The reachability-distance of a point $p$ from another point $o$ describes the $\!\,\epsilon '<\epsilon$ that we need to assume as the density of a cluster in order to have $p$ belong to the same cluster as $o$ :

reachability-distance_{\epsilon ,MinPts}(p,o)={\begin{cases}{\text{UNDEFINED}},&{\text{if }}|N_{\epsilon }(o)|<MinPts\\\max({\mathit {core-distance}}_{\epsilon ,MinPts}(o),distance(o,p))&,{\text{otherwise}}\end{cases}}

Both the core-distance and the reachability-distance are undefined if no sufficiently dense cluster (w.r.t. $\!\,\epsilon$ ) is available. Given a sufficiently large $\epsilon \,\!$ , this will never happen, but then every $\,\!\epsilon$ -neighborhood query will return the entire database, resulting in an untractable runtime cost. Hence, the $\,\!\epsilon$ parameter is required to cut off the density of clusters that is no longer considered to be interesting.

Pseudocode

The basic approach of OPTICS is similar to DBSCAN, but instead of maintaining a set of cluster members, a priority queue is used.

 OPTICS(DB, eps, MinPts)
    for each point p of DB
       p.reachability-distance = UNDEFINED
    for each unprocessed point p of DB
       N = getNeighbors(p, eps)
       mark p as processed
       output p to the ordered list
       Seeds = empty priority queue
       if (core-distance(p, eps, Minpts) != UNDEFINED)
          update(N, p, Seeds, eps, Minpts)
          for each next q in Seeds
             N' = getNeighbors(q, eps)
             mark q as processed
             output q to the ordered list
             if (core-distance(q, eps, Minpts) != UNDEFINED)
                update(N', q, Seeds, eps, Minpts)

In update(), the priority queue Seeds is updated with the $\,\!\epsilon$ -neighborhood of $p$ and $q$ , respectively:

 update(N, p, Seeds, eps, Minpts)
    coredist = core-distance(p, eps, MinPts)
    for each o in N
       if (o is not processed)
          new-reach-dist = max(coredist, dist(p,o))
          if (o.reachability-distance == UNDEFINED) // o is not in Seeds
              o.reachability-distance = new-reach-dist
              Seeds.insert(o, new-reach-dist)
          else               // o in Seeds, check for improvement
              if (new-reach-dist < o.reachability-distance)
                 o.reachability-distance = new-reach-dist
                 Seeds.move-up(o, new-reach-dist)

OPTICS hence outputs the points in a particular ordering, annotated with their smallest reachability distance (in the original algorithm, the core distance is also exported, but this is not required for further processing).

Extracting the Clusters

Using a reachability plot to identify clusters

Using a reachability-plot (a special kind of dendrogram), the hierarchical structure of the clusters can be obtained easily. It is a 2D plot, with the ordering of the points on the x-axis and the reachability distance on the y-axis. Since points belonging to a cluster have a low reachability distance to their nearest neighbor, the clusters show up as valleys in the reachability plot. The deeper the valley, the denser the cluster.

The image on the right illustrates this concept. In its upper half, an artificial example of a database consisting of two-dimensional, spatial points is shown. The lower part shows the reachability plot as computed by OPTICS. The black lines link some clusters to their respective valleys. The horizontal red line is an example on how to obtain a clustering. Each valley it crosses is made an cluster of its own. If the line was moved down, more clusters would emerge, especially for the topmost cluster, which features varying densities.

Note that deriving clusters in such a way yields the same result of running DBSCAN on the data with $\epsilon \!\,$ set to the chosen reachability-distance treshold.

Complexity

As DBSCAN, OPTICS processes each point once, and performs one $\,\!\epsilon$ -neighborhood query during this processing. Given a spatial index that grants a neighborhood query in $O(\log n)$ runtime, an overall runtime of $O(n\cdot \log n)$ is obtained. The authors of the original OPTICS paper report an actual constant slowdown factor of 1.6 compared to DBSCAN. Note that the value of $\,\!\epsilon$ might heavily influence the cost of the algorithm, since a value too large might raise the cost of a neighborhood query to linear complexity.

References

^ Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander (1999). "OPTICS: Ordering Points To Identify the Clustering Structure". ACM SIGMOD international conference on Management of data. ACM Press. pp. 49–60. {{cite conference}}: Unknown parameter |booktitle= ignored (|book-title= suggested) (help)CS1 maint: multiple names: authors list (link)

[1] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander (1999). "OPTICS: Ordering Points To Identify the Clustering Structure". ACM SIGMOD international conference on Management of data. ACM Press. pp. 49–60. {{cite conference}}: Unknown parameter |booktitle= ignored (|book-title= suggested) (help)CS1 maint: multiple names: authors list (link)

[1]

@@ Line 66: / Line 66: @@
 Note that deriving clusters in such a way yields the same result of running [[DBSCAN]] on the data with <math>\epsilon\!\,</math> set to the chosen reachability-distance treshold.
+==Complexity==
+As [[DBSCAN]], OPTICS processes each point once, and performs one <math>\,\!\epsilon</math>-neighborhood query during this processing. Given a [[spatial index]] that grants a neighborhood query in <math>O(\log n)</math> runtime, an overall runtime of <math>O(n \cdot \log n)</math> is obtained. The authors of the original OPTICS paper report an actual constant slowdown factor of 1.6 compared to DBSCAN. Note that the value of <math>\,\!\epsilon</math> might heavily influence the cost of the algorithm, since a value too large might raise the cost of a neighborhood query to linear complexity.
 ==References==