Online Book Reader

Home Category

Choose a category
All
Classic-Fiction

Data Mining_ Concepts and Techniques - Jiawei Han [279]

By Root 1745 0

such objects are far away from the majority of the data, and thus, when assigned to a cluster, they can dramatically distort the mean value of the cluster. This inadvertently affects the assignment of other objects to clusters. This effect is particularly exacerbated due to the use of the squared-error function of Eq. (10.1), as observed in Example 10.2.

A drawback of k-means

Consider six points in 1-D space having the values 1, 2, 3, 8, 9, 10, and 25, respectively. Intuitively, by visual inspection we may imagine the points partitioned into the clusters {1, 2, 3} and {8, 9, 10}, where point 25 is excluded because it appears to be an outlier. How would k-means partition the values? If we apply k-means using k = 2 and Eq. (10.1), the partitioning {{1, 2, 3}, {8, 9, 10, 25}} has the within-cluster variation

given that the mean of cluster {1, 2, 3} is 2 and the mean of {8, 9, 10, 25} is 13. Compare this to the partitioning {{1, 2, 3, 8}, {9, 10, 25}}, for which k-means computes the within-cluster variation as

given that 3.5 is the mean of cluster {1, 2, 3, 8} and 14.67 is the mean of cluster {9, 10, 25}. The latter partitioning has the lowest within-cluster variation; therefore, the k-means method assigns the value 8 to a cluster different from that containing 9 and 10 due to the outlier point 25. Moreover, the center of the second cluster, 14.67, is substantially far from all the members in the cluster.

“How can we modify the k-means algorithm to diminish such sensitivity to outliers?” Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object per cluster. Each remaining object is assigned to the cluster of which the representative object is the most similar. The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object p and its corresponding representative object. That is, an absolute-error criterion is used, defined as

(10.2)

where E is the sum of the absolute error for all objects p in the data set, and oi is the representative object of Ci. This is the basis for the k-medoids method, which groups n objects into k clusters by minimizing the absolute error (Eq. 10.2).

When k = 1, we can find the exact median in O(n2) time. However, when k is a general positive number, the k-medoid problem is NP-hard.

The Partitioning Around Medoids (PAM) algorithm (see Figure 10.5 later) is a popular realization of k-medoids clustering. It tackles the problem in an iterative, greedy way. Like the k-means algorithm, the initial representative objects (called seeds) are chosen arbitrarily. We consider whether replacing a representative object by a nonrepresentative object would improve the clustering quality. All the possible replacements are tried out. The iterative process of replacing representative objects by other objects continues until the quality of the resulting clustering cannot be improved by any replacement. This quality is measured by a cost function of the average dissimilarity between an object and the representative object of its cluster.

Specifically, let o1, …, ok be the current set of representative objects (i.e., medoids). To determine whether a nonrepresentative object, denoted by orandom, is a good replacement for a current medoid oj (1 ≤ j ≤ k), we calculate the distance from every object p to the closest object in the set {o1, …, oj−1, orandom, oj+1, …, ok}, and use the distance to update the cost function. The reassignments of objects to {o1, …, oj−1, orandom oj+1, …, ok} are simple. Suppose object p is currently assigned to a cluster represented by medoid oj (Figure 10.4a or b). Do we need to reassign p to a different cluster if oj is being replaced by orandom? Object p needs to be reassigned to either orandom or some other cluster represented by oi (i ≠ j), whichever is the closest. For example, in Figure 10.4(a), p is closest to oi and therefore is reassigned to oi. In Figure 10.4(b), however, p is closest to orandom

Online Book Reader

Data Mining_ Concepts and Techniques - Jiawei Han [279]

®Online Book Reader