Online Book Reader

Home Category

Choose a category
All
Classic-Fiction

Data Mining - Mehmed Kantardzic [209]

By Root 766 0

a distributed clustering algorithm.

12.4.1 Distributed DBSCAN Clustering

Distributed clustering assumes that samples to be clustered reside on different sites. Instead of transmitting all samples to a central site where we can apply one of the standard clustering algorithms to analyze the data locally, the data are clustered independently on the distributed local sites. Then, in a subsequent step, the central site tries to establish a global clustering model based on the downloaded local models, that is, summarized representatives of local data. Distributed clustering is carried out on two different levels, that is, the local level and the global level (Fig. 12.31). On the local level, all sites carry out clustering independently from each other. Communication with the central site and determining a global model should reflect an optimum trade-off between complexity and accuracy of the algorithm.

Figure 12.31. System architecture for distributed clustering.

Local models consist of a set of representatives for each locally found cluster. A representative is a good approximation for samples residing on the corresponding local site. The local model is transferred to a central site, where the local models are merged in order to form a global model. The representation of local models should be simple enough so there will be no overload in communications. At the same time, local models should be informative enough to support a high quality of approximate global clustering. The global model is created by analyzing and integrating local representatives. The resulting global clustering is sent back, at the end of the process, to all local sites.

This global-distributed framework may be more precisely specified when we implement a specific clustering algorithm. The density-based clustering algorithm DBSCAN is a good candidate, because it is robust to outliers, easy to implement, supports clusters of different shapes, and allows incremental, online implementation. The main steps of the algorithm are explained in Chapter 9, and the same process is applied locally. To find local clusters, DBSCAN starts with an arbitrary core object p, which is not yet clustered and retrieves all objects density reachable from p. The retrieval of density-reachable objects is performed in iterations until all local samples are analyzed. After having clustered the data locally, we need a small number of representatives that will describe the local clustering result accurately. For determining suitable representatives of the clusters, the concept of specific core points is introduced.

Let C be a local cluster with respect to the given DBSCAN parameters ε and MinPts. Furthermore, let CorC ⊆ C be the set of core points belonging to this cluster. Then ScorC ⊆ C is called a complete set of specific core points of C iff the following conditions are true:

ScorC ⊆ CorC

∀si,sj ⊆ ScorC: si ∉ Neighborhoodε (sj)

∀c ∈ CorC , ∃s ∈ ScorC: c ∈ Neighborhoodε (s)

The ScorC set of points consists of a very small number of specific core points that describe the cluster C. For example, in Figure 12.32a, sites 2 and 3 have only one specific core point, while site 1, because of the cluster shape, has two specific core points. To further simplify the representation of local clusters, the number of specific core points, |ScorC| = K, is used as an input parameter for a further local “clustering step” with an adapted version of K-means. For each cluster C found by DBSCAN, k-means use ScorC points as starting points. The result is K = |ScorC| subclusters and centroids within C.

Figure 12.32. Distributed DBSCAN clustering

(Januzaj et al., 2003).

(a) Local clusters; (b) local representatives; (c) global model with εglobal = 2εlocal.

Each local model LocalModelk consists of a set of mk pairs: a representative r (complete specific core point), and an ε radius value. The number m of pairs transmitted from each site k is determined by the number n of clusters Ci found on site k. Each of these pairs (r, εr) represents a subset of samples that are all located in a corresponding

Online Book Reader

Data Mining - Mehmed Kantardzic [209]

®Online Book Reader