Online Book Reader

Home Category

Choose a category
All
Classic-Fiction

Data Mining - Mehmed Kantardzic [149]

By Root 822 0

the single-link solution. First, x2 and x3 are merged and a cluster {x2, x3} is generated with the minimum distance equal to 1.5. Also, in the second step, x4 and x5 are merged into a new cluster {x4, x5} with a higher merging level of 2.0. Minimal single-link distance is between clusters {x2, x3}, and {x1} is now 2.5, so these two clusters merge after the previous two steps. Finally, the two clusters {x1, x2, x3} and {x4, x5} are merged at the highest level with a minimal complete-link distance of 5.4. The resulting dendrogram is shown in Figure 9.8.

Figure 9.8. Dendrogram by complete-link method for the data set in Figure 9.6.

Selecting, for example, a threshold measure of similarity s = 2.2, we can recognize from the dendograms in Figures 9.7 and 9.8 that the final clusters for single-link and complete-link algorithms are not the same. A single-link algorithm creates only two clusters: {x1, x2, x3} and {x4, x5}, while a complete-link algorithm creates three clusters: {x1}, {x2, x3}, and {x4, x5}.

Unlike traditional agglomerative methods, Chameleon is a clustering algorithm that tries to improve the clustering quality by using a more elaborate criterion when merging two clusters. Two clusters will be merged if the interconnectivity and closeness of the merged clusters is very similar to the interconnectivity and closeness of the two individual clusters before merging.

To form the initial subclusters, Chameleon first creates a graph G = (V, E), where each node v ∈ V represents a data sample, and a weighted edge e(vi, vj) exists between two nodes vi and vj if vj is one of the k-nearest neighbors of vi. The weight of each edge in G represents the closeness between two samples, that is, an edge will weigh more if the two data samples are closer to each other. Chameleon then uses a graph-partition algorithm to recursively partition G into many small, unconnected subgraphs by doing a min-cut on G at each level of recursion. Here, a min-cut on a graph G refers to a partitioning of G into two parts of close, equal size such that the total weight of the edges being cut is minimized. Each subgraph is then treated as an initial subcluster, and the algorithm is repeated until a certain criterion is reached.

In the second phase, the algorithm goes bottom-up. Chameleon determines the similarity between each pair of elementary clusters Ci and Cj according to their relative interconnectivity RI(Ci, Cj) and their relative closeness RC(Ci, Cj). Given that the interconnectivity of a cluster is defined as the total weight of edges that are removed when a min-cut is performed, the relative interconnectivity RI(Ci, Cj) is defined as the ratio between the interconnectivity of the merged cluster Ci and Cj to the average interconnectivity of Ci and Cj. Similarly, the relative closeness RC(Ci, Cj) is defined as the ratio between the closeness of the merged cluster of Ci and Cj to the average internal closeness of Ci and Cj. Here the closeness of a cluster refers to the average weight of the edges that are removed when a min-cut is performed on the cluster.

The similarity function is then computed as a product: RC(Ci, Cj) * RI(Ci, Cj)α where α is a parameter between 0 and 1. A value of 1 for α will give equal weight to both measures while decreasing α will place more emphasis on RI(Ci, Cj). Chameleon can automatically adapt to the internal characteristics of the clusters and it is effective in discovering arbitrarily shaped clusters of varying density. However, the algorithm is not effective for high-dimensional data having O(n2) time complexity for n samples.

9.4 PARTITIONAL CLUSTERING

Every partitional-clustering algorithm obtains a single partition of the data instead of the clustering structure, such as a dendrogram, produced by a hierarchical technique. Partitional methods have the advantage in applications involving large data sets for which the construction of a dendrogram is computationally very complex. The partitional techniques usually produce clusters by optimizing a criterion function defined either locally (on a subset

Online Book Reader

Data Mining - Mehmed Kantardzic [149]

®Online Book Reader