Data Mining - Mehmed Kantardzic [148]
Most agglomerative hierarchical clustering algorithms are variants of the single-link or complete-link algorithms. These two basic algorithms differ only in the way they characterize the similarity between a pair of clusters. In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of samples drawn from the two clusters (one element from the first cluster, the other from the second). In the complete-link algorithm, the distance between two clusters is the maximum of all distances between all pairs drawn from the two clusters. A graphical illustration of these two distance measures is given in Figure 9.5.
Figure 9.5. Distances for a single-link and a complete-link clustering algorithm. (a) Single-link distance; (b) complete-link distance.
In either case, two clusters are merged to form a larger cluster based on minimum-distance criteria. Although the single-link algorithm is computationally simpler, from a practical viewpoint it has been observed that the complete-link algorithm produces more useful hierarchies in most applications.
As explained earlier, the only difference between the single-link and complete-link approaches is in the distance computation. For both, the basic steps of the agglomerative clustering algorithm are the same. These steps are as follows:
1. Place each sample in its own cluster. Construct the list of intercluster distances for all distinct unordered pairs of samples, and sort this list in ascending order.
2. Step through the sorted list of distances, forming for each distinct threshold value dk a graph of the samples where pairs of samples closer than dk are connected into a new cluster by a graph edge. If all the samples are members of a connected graph, stop. Otherwise, repeat this step.
3. The output of the algorithm is a nested hierarchy of graphs, which can be cut at the desired dissimilarity level forming a partition (clusters) identified by simple connected components in the corresponding subgraph.
Let us consider five points {x1, x2, x3, x4, x5} with the following coordinates as a 2-D sample for clustering:
For this example, we selected 2-D points because it is easier to graphically represent these points and to trace all the steps in the clustering algorithm. The points are represented graphically in Figure 9.6.
Figure 9.6. Five two-dimensional samples for clustering.
The distances between these points using the Euclidian measure are
The distances between points as clusters in the first iteration are the same for both single-link and complete-link clustering. Further computation for these two algorithms is different. Using agglomerative single-link clustering, the following steps are performed to create a cluster and to represent the cluster structure as a dendrogram.
First x2 and x3 samples are merged and a cluster {x2, x3} is generated with a minimum distance equal to 1.5. Second, x4 and x5 are merged into a new cluster {x4, x5} with a higher merging level of 2.0. At the same time, the minimum single-link distance between clusters {x2, x3} and {x1} is also 2.0. So, these two clusters merge at the same level of similarity as x4 and x5. Finally, the two clusters {x1, x2, x3} and {x4, x5} are merged at the highest level with a minimum single-link distance of 3.5. The resulting dendrogram is shown in Figure 9.7.
Figure 9.7. Dendrogram by single-link method for the data set in Figure 9.6.
The cluster hierarchy created by using an agglomerative complete-link clustering algorithm is different compared with