Data Mining_ Concepts and Techniques - Jiawei Han [290]
To overcome the difficulty in using one set of global parameters in clustering analysis, a cluster analysis method called OPTICS was proposed. OPTICS does not explicitly produce a data set clustering. Instead, it outputs a cluster ordering. This is a linear list of all objects under analysis and represents the density-based clustering structure of the data. Objects in a denser cluster are listed closer to each other in the cluster ordering. This ordering is equivalent to density-based clustering obtained from a wide range of parameter settings. Thus, OPTICS does not require the user to provide a specific density threshold. The cluster ordering can be used to extract basic clustering information (e.g., cluster centers, or arbitrary-shaped clusters), derive the intrinsic clustering structure, as well as provide a visualization of the clustering.
To construct the different clusterings simultaneously, the objects are processed in a specific order. This order selects an object that is density-reachable with respect to the lowest ϵ value so that clusters with higher density (lower ϵ) will be finished first. Based on this idea, OPTICS needs two important pieces of information per object:
■ The core-distance of an object p is the smallest value ϵ′ such that the ϵ′-neighborhood of p has at least MinPts objects. That is, ϵ′ is the minimum distance threshold that makes p a core object. If p is not a core object with respect to ϵ and MinPts, the core-distance of p is undefined.
■ The reachability-distance to object p from q is the minimum radius value that makes p density-reachable from q. According to the definition of density-reachability, q has to be a core object and p must be in the neighborhood of q. Therefore, the reachability-distance from q to p is max{core-distance(q), dist(p, q)}. If q is not a core object with respect to ϵ and MinPts, the reachability-distance to p from q is undefined.
An object p may be directly reachable from multiple core objects. Therefore, p may have multiple reachability-distances with respect to different core objects. The smallest reachability-distance of p is of particular interest because it gives the shortest path for which p is connected to a dense cluster.
Core-distance and reachability-distance
Figure 10.16 illustrates the concepts of core-distance and reachability-distance. Suppose that #x03F5; = 6 mm and MinPts = 5. The core-distance of p is the distance, ϵ′, between p and the fourth closest data object from p. The reachability-distance of q1 from p is the core-distance of p (i.e., ϵ′ = 3 mm) because this is greater than the Euclidean distance from p to q1. The reachability-distance of q2 with respect to p is the Euclidean distance from p to q2 because this is greater than the core-distance of p.
Figure 10.16 OPTICS terminology. Source: Based on Ankerst, Breunig, Kriegel, and Sander [ABKS99].
OPTICS computes an ordering of all objects in a given database and, for each object in the database, stores the core-distance and a suitable reachability-distance. OPTICS maintains a list called OrderSeeds to generate the output ordering. Objects in OrderSeeds are sorted by the reachability-distance from their respective closest core objects, that is, by the smallest reachability-distance of each object.
OPTICS begins with an arbitrary object from the input database as the current object, p. It retrieves the ϵ-neighborhood of p, determines the core-distance, and sets the reachability-distance to undefined. The current object, p, is then written to output. If p is not a core object, OPTICS simply moves on to the next object in the OrderSeeds list (or the input database if OrderSeeds is empty). If p is a core object, then for each object, q, in the ϵ-neighborhood of p, OPTICS updates its reachability-distance from p and inserts q into OrderSeeds if q has not yet been processed. The iteration continues until the input is fully consumed and OrderSeeds is empty.
A data set's cluster ordering can be represented graphically,