Online Book Reader

Home Category

Data Mining - Mehmed Kantardzic [156]

By Root 856 0
Applying some of the available cluster methodologies, we assess the outputs. This analysis uses a specific criterion of optimality that usually contains knowledge about the application domain and therefore is subjective. There are three types of validation studies for clustering algorithms. An external assessment of validity compares the discovered structure with an a priori structure. An internal examination of validity tries to determine if the discovered structure is intrinsically appropriate for the data. Both assessments are subjective and domain-dependent. A relative test, as a third approach, compares the two structures obtained either from different cluster methodologies or by using the same methodology but with different clustering parameters, such as the order of input samples. This test measures their relative merit but we still need to resolve the question of selecting the structures for comparison.

Theory and practical applications both show that all approaches in the validation of clustering results have a subjective component. Hence, little in the way of “gold standards” exists in clustering evaluation. Recent studies in cluster analysis suggest that a user of a clustering algorithm should always keep the following issues in mind:

1. Every clustering algorithm will find clusters in a given data set whether they exist or not; the data should, therefore, be subjected to tests for clustering tendency before applying a clustering algorithm, followed by a validation of the clusters generated by the algorithm.

2. There is no best clustering algorithm; therefore, a user is advised to try several algorithms on a given data set.

It is important to remember that cluster analysis is an exploratory tool; the outputs of clustering algorithms only suggest or sometimes confirm hypotheses, but never prove any hypothesis about natural organization of data.

9.9 REVIEW QUESTIONS AND PROBLEMS

1. Why is the validation of a clustering process highly subjective?

2. What increases the complexity of clustering algorithms?

3.

(a) Using MND distance, distribute the input samples given as 2-D points A(2, 2), B(4, 4), and C(7, 7) into two clusters.

(b) What will be the distribution of samples in the clusters if samples D(1, 1), E(2, 0), and F(0, 0) are added?

4. Given 5-D numeric samples A = (1, 0, 2, 5, 3) and B = (2, 1, 0, 3, −1), find

(a) the Eucledian distance between points;

(b) the city-block distance;

(c) the Minkowski distance for p = 3; and

(d) the cosine-correlation distance.

5. Given 6-D categorical samples C = (A, B, A, B, A, A) and D = (B, B, A, B, B, A), find

(a) an SMC of the similarity between samples;

(b) Jaccard’s coefficient; and

(c) Rao’s coefficient

6. Given a set of 5-D categorical samples

A = (1, 0, 1, 1, 0)

B = (1, 1, 0, 1, 0)

C = (0, 0, 1, 1, 0)

D = (0, 1, 0, 1, 0)

E = (1, 0, 1, 0, 1)

F = (0, 1, 1, 0, 0)

(a) Apply agglomerative hierarchical clustering using

(i) single-link similarity measure based on Rao’s coefficient; and

(ii) complete-link similarity measure based on SMC.

(b) Plot the dendrograms for the solutions to parts (i) and (ii) of (a).

7. Given the samples X1 = {1, 0}, X2 = {0, 1}, X3 = {2, 1}, and X4 = {3, 3}, suppose that the samples are randomly clustered into two clusters C1 = {X1, X3} and C2 = {X2, X4}.

(a) Apply one iteration of the K-means partitional-clustering algorithm, and find a new distribution of samples in clusters. What are the new centroids? How can you prove that the new distribution of samples is better than the initial one?

(b) What is the change in the total square-error?

(c) Apply the second iteration of the K-means algorithm and discuss the changes in clusters.

8. For the samples in Problem 7, apply iterative clustering with the threshold value for cluster radius T = 2. What is the number of clusters and samples distribution after the first iteration?

9. Suppose that the samples in Problem 6 are distributed into two clusters:

C1 = {A, B, E} and C2 = {C, D, F}.

Using K-nearest neighbor algorithm, find the classification for the following samples:

Return Main Page Previous Page Next Page

®Online Book Reader