Data Mining - Mehmed Kantardzic [156]
Theory and practical applications both show that all approaches in the validation of clustering results have a subjective component. Hence, little in the way of “gold standards” exists in clustering evaluation. Recent studies in cluster analysis suggest that a user of a clustering algorithm should always keep the following issues in mind:
1. Every clustering algorithm will find clusters in a given data set whether they exist or not; the data should, therefore, be subjected to tests for clustering tendency before applying a clustering algorithm, followed by a validation of the clusters generated by the algorithm.
2. There is no best clustering algorithm; therefore, a user is advised to try several algorithms on a given data set.
It is important to remember that cluster analysis is an exploratory tool; the outputs of clustering algorithms only suggest, or sometimes confirm, hypotheses, but never prove any hypothesis about the natural organization of the data.
9.9 REVIEW QUESTIONS AND PROBLEMS
1. Why is the validation of a clustering process highly subjective?
2. What increases the complexity of clustering algorithms?
3.
(a) Using MND distance, distribute the input samples given as 2-D points A(2, 2), B(4, 4), and C(7, 7) into two clusters.
(b) What will be the distribution of samples in the clusters if samples D(1, 1), E(2, 0), and F(0, 0) are added?
4. Given 5-D numeric samples A = (1, 0, 2, 5, 3) and B = (2, 1, 0, 3, −1), find
(a) the Euclidean distance between the points;
(b) the city-block distance;
(c) the Minkowski distance for p = 3; and
(d) the cosine-correlation distance.
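The four distances in Problem 4 can be checked with a short sketch. The first three are instances of the Minkowski metric (p = 2 gives Euclidean, p = 1 city-block); for part (d), taking the distance as 1 minus the cosine similarity is an assumed convention.

```python
import math

A = (1, 0, 2, 5, 3)
B = (2, 1, 0, 3, -1)

def minkowski(x, y, p):
    # General Minkowski distance; p = 2 is Euclidean, p = 1 is city-block.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

euclidean = minkowski(A, B, 2)    # sqrt(26), about 5.10
city_block = minkowski(A, B, 1)   # 10
minkowski3 = minkowski(A, B, 3)   # 82 ** (1/3), about 4.34

# Cosine similarity; 1 - cos is one common convention for the
# "cosine-correlation distance" (assumption, not fixed by the problem).
dot = sum(a * b for a, b in zip(A, B))
norm_a = math.sqrt(sum(a * a for a in A))
norm_b = math.sqrt(sum(b * b for b in B))
cosine_distance = 1 - dot / (norm_a * norm_b)
```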
5. Given 6-D categorical samples C = (A, B, A, B, A, A) and D = (B, B, A, B, B, A), find
(a) the simple matching coefficient (SMC) of similarity between the samples;
(b) Jaccard’s coefficient; and
(c) Rao’s coefficient.
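Since the samples in Problem 5 take only two symbolic values, the three coefficients can be computed after a binary encoding. The mapping A → 1, B → 0 is an assumption; the Jaccard and Rao values depend on which symbol is coded as 1.

```python
# Binary encoding with A -> 1, B -> 0 (an assumed convention).
C = [1, 0, 1, 0, 1, 1]   # (A, B, A, B, A, A)
D = [0, 0, 1, 0, 0, 1]   # (B, B, A, B, B, A)

a = sum(1 for x, y in zip(C, D) if x == 1 and y == 1)   # 1-1 matches
b = sum(1 for x, y in zip(C, D) if x == 1 and y == 0)   # 1-0 mismatches
c = sum(1 for x, y in zip(C, D) if x == 0 and y == 1)   # 0-1 mismatches
d = sum(1 for x, y in zip(C, D) if x == 0 and y == 0)   # 0-0 matches
n = len(C)

smc = (a + d) / n          # simple matching coefficient: all matches / n
jaccard = a / (a + b + c)  # ignores 0-0 matches
rao = a / n                # only 1-1 matches count
```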
6. Given a set of 5-D categorical samples
A = (1, 0, 1, 1, 0)
B = (1, 1, 0, 1, 0)
C = (0, 0, 1, 1, 0)
D = (0, 1, 0, 1, 0)
E = (1, 0, 1, 0, 1)
F = (0, 1, 1, 0, 0)
(a) Apply agglomerative hierarchical clustering using
(i) single-link similarity measure based on Rao’s coefficient; and
(ii) complete-link similarity measure based on SMC.
(b) Plot the dendrograms for the solutions to parts (i) and (ii) of (a).
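The agglomerative procedure for Problem 6 can be sketched in a few lines. This sketch implements part (a)(ii), complete-link merging with the SMC-based distance 1 − SMC; part (a)(i) follows by replacing `max` with `min` and swapping in Rao's coefficient. Note that when two pairs of clusters are equally distant, the final partition depends on the tie-breaking order (here, the first minimal pair found).

```python
from itertools import combinations

samples = {
    "A": (1, 0, 1, 1, 0), "B": (1, 1, 0, 1, 0), "C": (0, 0, 1, 1, 0),
    "D": (0, 1, 0, 1, 0), "E": (1, 0, 1, 0, 1), "F": (0, 1, 1, 0, 0),
}

def smc(x, y):
    # Simple matching coefficient: fraction of positions that agree.
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)

def complete_link_merge(clusters):
    # Merge the pair of clusters with the smallest complete-link
    # distance, i.e. the smallest maximum pairwise (1 - SMC).
    def dist(c1, c2):
        return max(1 - smc(samples[i], samples[j]) for i in c1 for j in c2)
    i, j = min(combinations(range(len(clusters)), 2),
               key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
    merged = clusters[i] | clusters[j]
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

clusters = [{name} for name in samples]
while len(clusters) > 2:          # stop at two clusters for illustration
    clusters = complete_link_merge(clusters)
```

The sequence of merge distances along the way is what the dendrogram in part (b) records on its vertical axis.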
7. Given the samples X1 = {1, 0}, X2 = {0, 1}, X3 = {2, 1}, and X4 = {3, 3}, suppose that the samples are randomly clustered into two clusters C1 = {X1, X3} and C2 = {X2, X4}.
(a) Apply one iteration of the K-means partitional-clustering algorithm, and find a new distribution of samples in clusters. What are the new centroids? How can you prove that the new distribution of samples is better than the initial one?
(b) What is the change in the total square-error?
(c) Apply the second iteration of the K-means algorithm and discuss the changes in clusters.
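One K-means iteration for Problem 7 can be traced directly: compute the centroids of the given random clusters, reassign each sample to its nearest centroid, and compare the total square-error before and after. This is a minimal sketch of that single iteration, not a full K-means loop.

```python
samples = {"X1": (1, 0), "X2": (0, 1), "X3": (2, 1), "X4": (3, 3)}
clusters = [["X1", "X3"], ["X2", "X4"]]   # the given random split

def centroid(names):
    pts = [samples[n] for n in names]
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def total_square_error(clusters):
    # Sum of squared distances of samples to their own cluster centroid.
    return sum(sq_dist(samples[n], centroid(c)) for c in clusters for n in c)

centroids = [centroid(c) for c in clusters]   # M1 = (1.5, 0.5), M2 = (1.5, 2.0)
error_before = total_square_error(clusters)   # 7.5

# Reassign every sample to the nearest current centroid.
new_clusters = [[], []]
for name, p in samples.items():
    k = min(range(2), key=lambda i: sq_dist(p, centroids[i]))
    new_clusters[k].append(name)

error_after = total_square_error(new_clusters)
```

The drop in total square-error after reassignment is exactly the evidence the problem asks for that the new distribution is better than the initial one.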
8. For the samples in Problem 7, apply iterative clustering with the threshold value for the cluster radius T = 2. What are the number of clusters and the distribution of samples after the first iteration?
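A single pass of threshold-based clustering can be sketched as below. This is one common leader-style variant (each sample joins the nearest existing cluster if it lies within T of that cluster's centroid, otherwise it starts a new cluster); the exact variant described in the chapter may differ in detail.

```python
T = 2.0
samples = [("X1", (1, 0)), ("X2", (0, 1)), ("X3", (2, 1)), ("X4", (3, 3))]

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def centroid(members):
    pts = [p for _, p in members]
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))

clusters = []
for name, p in samples:
    # Find the nearest existing cluster, if any.
    best = min(clusters, key=lambda c: dist(p, centroid(c))) if clusters else None
    if best is not None and dist(p, centroid(best)) <= T:
        best.append((name, p))       # within the radius threshold: join
    else:
        clusters.append([(name, p)])  # otherwise start a new cluster
```

With these four samples the pass ends with two clusters, since X4 falls farther than T from the centroid of the cluster built from X1, X2, and X3.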
9. Suppose that the samples in Problem 6 are distributed into two clusters:
C1 = {A, B, E} and C2 = {C, D, F}.
Using the K-nearest neighbor algorithm, find the classification for the following samples: