Online Book Reader

Home Category

Data Mining_ Concepts and Techniques - Jiawei Han [289]

By Root 1550 0
objects o1, o2, and o3, if o1 and o2 are density-connected, and o2 and o3 are density-connected, then so are o1 and o3.

Density-reachability and density-connectivity

Consider Figure 10.14 for a given ϵ represented by the radius of the circles, and, say, let MinPts = 3.

Figure 10.14 Density-reachability and density-connectivity in density-based clustering. Source: Based on Ester, Kriegel, Sander, and Xu [EKSX96].

Of the labeled points, m, p, o, r are core objects because each is in an ϵ-neighborhood containing at least three points. Object q is directly density-reachable from m. Object m is directly density-reachable from p and vice versa.

Object q is (indirectly) density-reachable from p because q is directly density-reachable from m and m is directly density-reachable from p. However, p is not density-reachable from q because q is not a core object. Similarly, r and s are density-reachable from o and o is density-reachable from r. Thus, o, r, and s are all density-connected.


We can use the closure of density-connectedness to find connected dense regions as clusters. Each closed set is a density-based cluster. A subset C ⊆ D is a cluster if (1) for any two objects o1, o2 ∈ C, o1 and o2 are density-connected; and (2) there does not exist an object o ∈ C and another object o′ ∈ (D − C) such that o and o′ are density-connected.

“How does DBSCAN find clusters?” Initially, all objects in a given data set D are marked as “unvisited.” DBSCAN randomly selects an unvisited object p, marks p as “visited,” and checks whether the ϵ-neighborhood of p contains at least MinPts objects. If not, p is marked as a noise point. Otherwise, a new cluster C is created for p, and all the objects in the ϵ-neighborhood of p are added to a candidate set, N. DBSCAN iteratively adds to C those objects in N that do not belong to any cluster. In this process, for an object p′ in N that carries the label “unvisited,” DBSCAN marks it as “visited” and checks its ϵ-neighborhood. If the ϵ-neighborhood of p′ has at least MinPts objects, those objects in the ϵ-neighborhood of p′ are added to N. DBSCAN continues adding objects to C until C can no longer be expanded, that is, N is empty. At this time, cluster C is completed, and thus is output.

To find the next cluster, DBSCAN randomly selects an unvisited object from the remaining ones. The clustering process continues until all objects are visited. The pseudocode of the DBSCAN algorithm is given in Figure 10.15.

Figure 10.15 DBSCAN algorithm.

If a spatial index is used, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects. Otherwise, the complexity is O(n2). With appropriate settings of the user-defined parameters, ϵ and MinPts, the algorithm is effective in finding arbitrary-shaped clusters.

10.4.2. OPTICS: Ordering Points to Identify the Clustering Structure

Although DBSCAN can cluster objects given input parameters such as ϵ(the maximum radius of a neighborhood) and MinPts (the minimum number of points required in the neighborhood of a core object), it encumbers users with the responsibility of selecting parameter values that will lead to the discovery of acceptable clusters. This is a problem associated with many other clustering algorithms. Such parameter settings are usually empirically set and difficult to determine, especially for real-world, high-dimensional data sets. Most algorithms are sensitive to these parameter values: Slightly different settings may lead to very different clusterings of the data. Moreover, real-world, high-dimensional data sets often have very skewed distributions such that their intrinsic clustering structure may not be well characterized by a single set of global density parameters.

Note that density-based clusters are monotonic with respect to the neighborhood threshold. That is, in DBSCAN, for a fixed MinPts value and two neighborhood thresholds, ϵ1 < ϵ2, a cluster C with respect to ϵ1 and MinPts must be a subset of a cluster C′ with respect to ϵ2 and MinPts. This means that if two objects are in a density-based

Return Main Page Previous Page Next Page

®Online Book Reader