Online Book Reader

Home Category

Data Mining_ Concepts and Techniques - Jiawei Han [288]

By Root 1427 0
nor probabilistic approaches can find the distribution of such hierarchies. Recently, Bayesian tree-structured models have been developed to handle such problems. Bayesian and other sophisticated probabilistic clustering methods are considered advanced topics and are not covered in this book.

10.4. Density-Based Methods


Partitioning and hierarchical methods are designed to find spherical-shaped clusters. They have difficulty finding clusters of arbitrary shape such as the “S” shape and oval clusters in Figure 10.13. Given such data, they would likely inaccurately identify convex regions, where noise or outliers are included in the clusters.

Figure 10.13 Clusters of arbitrary shape.

To find clusters of arbitrary shape, alternatively, we can model clusters as dense regions in the data space, separated by sparse regions. This is the main strategy behind density-based clustering methods, which can discover clusters of nonspherical shape. In this section, you will learn the basic techniques of density-based clustering by studying three representative methods, namely, DBSCAN (Section 10.4.1), OPTICS (Section 10.4.2), and DENCLUE (Section 10.4.3).

10.4.1. DBSCAN: Density-Based Clustering Based on Connected Regions with High Density

“How can we find dense regions in density-based clustering?” The density of an object o can be measured by the number of objects close to o. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core objects, that is, objects that have dense neighborhoods. It connects core objects and their neighborhoods to form dense regions as clusters.

“How does DBSCAN quantify the neighborhood of an object?” A user-specified parameter ϵ > 0 is used to specify the radius of a neighborhood we consider for every object. The ϵ-neighborhood of an object o is the space within a radius ϵ centered at o.

Due to the fixed neighborhood size parameterized by ϵ, the density of a neighborhood can be measured simply by the number of objects in the neighborhood. To determine whether a neighborhood is dense or not, DBSCAN uses another user-specified parameter, MinPts, which specifies the density threshold of dense regions. An object is a core object if the ϵ-neighborhood of the object contains at least MinPts objects. Core objects are the pillars of dense regions.

Given a set, D, of objects, we can identify all core objects with respect to the given parameters, ϵ and MinPts. The clustering task is therein reduced to using core objects and their neighborhoods to form dense regions, where the dense regions are clusters. For a core object q and an object p, we say that p is directly density-reachable from q (with respect to ϵ and MinPts) if p is within the ϵ-neighborhood of q. Clearly, an object p is directly density-reachable from another object q if and only if q is a core object and p is in the ϵ-neighborhood of q. Using the directly density-reachable relation, a core object can “bring” all objects from its ϵ-neighborhood into a dense region.

“How can we assemble a large dense region using small dense regions centered by core objects?” In DBSCAN, p is density-reachable from q (with respect to ϵ and MinPts in D) if there is a chain of objects p1, …, pn, such that p1 = q, pn = p, and pi + 1 is directly density-reachable from pi with respect to ϵ and MinPts, for 1 ≤ i ≤ n, pi ∈ D. Note that density-reachability is not an equivalence relation because it is not symmetric. If both o1 and o2 are core objects and o1 is density-reachable from o2, then o2 is density-reachable from o1. However, if o2 is a core object but o1 is not, then o1 may be density-reachable from o2, but not vice versa.

To connect core objects as well as their neighbors in a dense region, DBSCAN uses the notion of density-connectedness. Two objects p1, p2 ∈ D are density-connected with respect to ϵ and MinPts if there is an object q ∈ D such that both p1 and p2 are density-reachable from q with respect to ϵ and MinPts. Unlike density-reachability, density-connectedness is an equivalence relation. It is easy to show that, for

Return Main Page Previous Page Next Page

®Online Book Reader