Data Mining_ Concepts and Techniques - Jiawei Han [291]
Figure 10.17 Cluster ordering in OPTICS. Source: Adapted from Ankerst, Breunig, Kriegel, and Sander [ABKS99].
The structure of the OPTICS algorithm is very similar to that of DBSCAN. Consequently, the two algorithms have the same time complexity. The complexity is O(n log n) if a spatial index is used, and O(n2) otherwise, where n is the number of objects.
10.4.3. DENCLUE: Clustering Based on Density Distribution Functions
Density estimation is a core issue in density-based clustering methods. DENCLUE (DENsity-based CLUstEring) is a clustering method based on a set of density distribution functions. We first give some background on density estimation, and then describe the DENCLUE algorithm.
In probability and statistics, density estimation is the estimation of an unobservable underlying probability density function based on a set of observed data. In the context of density-based clustering, the unobservable underlying probability density function is the true distribution of the population of all possible objects to be analyzed. The observed data set is regarded as a random sample from that population.
In DBSCAN and OPTICS, density is calculated by counting the number of objects in a neighborhood defined by a radius parameter, ϵ. Such density estimates can be highly sensitive to the radius value used. For example, in Figure 10.18, the density changes significantly as the radius increases by a small amount.
Figure 10.18 The subtlety in density estimation in DBSCAN and OPTICS: Increasing the neighborhood radius slightly from ϵ1 to ϵ2 results in a much higher density.
To overcome this problem, kernel density estimation can be used, which is a nonparametric density estimation approach from statistics. The general idea behind kernel density estimation is simple. We treat an observed object as an indicator of high-probability density in the surrounding region. The probability density at a point depends on the distances from this point to the observed objects.
Formally, let x1, …, xn be an independent and identically distributed sample of a random variable f. The kernel density approximation of the probability density function is
(10.21)
where K() is a kernel and h is the bandwidth serving as a smoothing parameter. A kernel can be regarded as a function modeling the influence of a sample point within its neighborhood. Technically, a kernel K() is a non-negative real-valued integrable function that should satisfy two requirements: and K(−u) = K(u) for all values of u. A frequently used kernel is a standard Gaussian function with a mean of 0 and a variance of 1:
(10.22)
DENCLUE uses a Gaussian kernel to estimate density based on the given set of objects to be clustered. A point x* is called a density attractor if it is a local maximum of the estimated density function. To avoid trivial local maximum points, DENCLUE uses a noise threshold, ξ, and only considers those density attractors x* such that . These nontrivial density attractors are the centers of clusters.
Objects under analysis are assigned to clusters through density attractors using a stepwise hill-climbing procedure. For an object, x, the hill-climbing procedure starts from x and is guided by the gradient of the estimated density function. That is, the density attractor for x is computed as
(10.23)
where δ is a parameter to control the speed of convergence, and
(10.24)
The hill-climbing procedure stops at step k > 0 if , and assigns x to the density attractor x*