Data Mining: Concepts and Techniques - Jiawei Han [298]
■ Small cluster preservation. If a small category is split into small pieces in a clustering, those pieces are likely to become noise, and thus the small category cannot be discovered from the clustering. The small cluster preservation criterion states that splitting a small category into pieces is more harmful than splitting a large category into pieces. Consider an extreme case. Let D be a data set of n + 2 objects such that, according to ground truth, n objects, denoted by o1, …, on, belong to one category and the other two objects, denoted by on+1, on+2, belong to another category. Suppose clustering 𝒞1 has three clusters, C11 = {o1, …, on}, C12 = {on+1}, and C13 = {on+2}. Let clustering 𝒞2 have three clusters, too, namely C21 = {o1, …, on−1}, C22 = {on}, and C23 = {on+1, on+2}. In other words, 𝒞1 splits the small category and 𝒞2 splits the large category. A clustering quality measure, Q, preserving small clusters should give a higher score to 𝒞2, that is, Q(𝒞2, Cg) > Q(𝒞1, Cg), where Cg denotes the ground truth categorization.
Many clustering quality measures satisfy some of these four criteria. Here, we introduce the BCubed precision and recall metrics, which satisfy all four criteria.
BCubed evaluates the precision and recall for every object in a clustering on a given data set according to ground truth. The precision of an object indicates how many other objects in the same cluster belong to the same category as the object. The recall of an object reflects how many objects of the same category are assigned to the same cluster.
Formally, let D = {o1, …, on} be a set of objects, and let 𝒞 be a clustering on D. Let L(oi) (1 ≤ i ≤ n) be the category of oi given by ground truth, and C(oi) be the cluster_ID of oi in 𝒞. Then, for two objects, oi and oj (1 ≤ i, j ≤ n, i ≠ j), the correctness of the relation between oi and oj in clustering 𝒞 is given by
\[
\text{Correctness}(o_i, o_j) =
\begin{cases}
1 & \text{if } L(o_i) = L(o_j) \Leftrightarrow C(o_i) = C(o_j) \\
0 & \text{otherwise}
\end{cases}
\tag{10.28}
\]
BCubed precision is defined as
\[
\text{Precision BCubed} =
\frac{\displaystyle\sum_{i=1}^{n}
\frac{\sum_{o_j : i \neq j,\, C(o_i) = C(o_j)} \text{Correctness}(o_i, o_j)}
{\|\{o_j \mid i \neq j,\, C(o_i) = C(o_j)\}\|}}{n}
\tag{10.29}
\]
BCubed recall is defined as
\[
\text{Recall BCubed} =
\frac{\displaystyle\sum_{i=1}^{n}
\frac{\sum_{o_j : i \neq j,\, L(o_i) = L(o_j)} \text{Correctness}(o_i, o_j)}
{\|\{o_j \mid i \neq j,\, L(o_i) = L(o_j)\}\|}}{n}
\tag{10.30}
\]
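As a concrete check of these definitions, here is a minimal sketch in Python. The function name bcubed and the list-based encoding of L(oi) and C(oi) are illustrative choices, and treating the 0/0 case of a singleton cluster (or singleton category) as a score of 1.0 is a convention assumed here, not one fixed by the text.

```python
def bcubed(labels, clusters):
    """BCubed precision and recall (Eqs. 10.29-10.30).

    labels[i]   -- ground-truth category L(o_i)
    clusters[i] -- cluster_ID C(o_i)
    """
    n = len(labels)
    precision = recall = 0.0
    for i in range(n):
        # Objects sharing o_i's cluster and o_i's category, respectively.
        same_cluster = [j for j in range(n) if j != i and clusters[j] == clusters[i]]
        same_category = [j for j in range(n) if j != i and labels[j] == labels[i]]
        # For o_j in the same cluster, Correctness (Eq. 10.28) reduces to
        # L(o_i) = L(o_j); for o_j in the same category, to C(o_i) = C(o_j).
        if same_cluster:
            precision += sum(labels[j] == labels[i] for j in same_cluster) / len(same_cluster)
        else:
            precision += 1.0  # convention for singleton clusters (assumption)
        if same_category:
            recall += sum(clusters[j] == clusters[i] for j in same_category) / len(same_category)
        else:
            recall += 1.0  # convention for singleton categories (assumption)
    return precision / n, recall / n
```

For example, with categories ['A', 'A', 'A', 'B', 'B'] and cluster IDs [1, 1, 2, 2, 2], object o3 is penalized in precision (its cluster mixes categories) and in recall (its category is split across clusters).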
Intrinsic Methods
When the ground truth of a data set is not available, we have to use an intrinsic method to assess the clustering quality. In general, intrinsic methods evaluate a clustering by examining how well the clusters are separated and how compact the clusters are. Many intrinsic methods take advantage of a similarity metric between objects in the data set.
The silhouette coefficient is such a measure. For a data set, D, of n objects, suppose D is partitioned into k clusters, C1, …, Ck. For each object o ∈ D, we calculate a(o) as the average distance between o and all other objects in the cluster to which o belongs. Similarly, b(o) is the minimum average distance from o to all clusters to which o does not belong. Formally, suppose o ∈ Ci (1 ≤ i ≤ k); then
\[
a(o) = \frac{\sum_{o' \in C_i,\, o' \neq o} \text{dist}(o, o')}{|C_i| - 1}
\tag{10.31}
\]
and
\[
b(o) = \min_{C_j : 1 \leq j \leq k,\, j \neq i}
\left\{ \frac{\sum_{o' \in C_j} \text{dist}(o, o')}{|C_j|} \right\}
\tag{10.32}
\]
The silhouette coefficient of o is then defined as
\[
s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}}
\tag{10.33}
\]
The value of the silhouette coefficient is between −1 and 1. The value of a(o) reflects the compactness of the cluster to which o belongs. The smaller the value, the more compact the cluster. The value of b(o) captures the degree to which o is separated from other clusters. The larger b(o) is, the more separated o is from other clusters. Therefore, when the silhouette coefficient value of o approaches 1, the cluster containing o is compact and o is far away from other clusters, which is the preferable case. However, when the silhouette coefficient value is negative (i.e., b(o) < a(o)), this means that, in expectation, o is closer to the objects in another cluster than to the objects in the same cluster as o. In many cases, this is a bad situation and should be avoided.
To measure a cluster's fitness within a clustering, we can compute the average silhouette coefficient value of all objects in the cluster. To measure the quality of a clustering, we can use the average silhouette coefficient value of all objects in the data set. The silhouette coefficient and other intrinsic measures can also be used in the elbow method to heuristically derive the number of clusters in a data set, in place of the sum of within-cluster variances.
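The definitions above translate directly into code. The sketch below assumes Euclidean distance for dist(o, o') and assumes every cluster has at least two objects, so that a(o) is always defined; both are assumptions for illustration, not requirements from the text.

```python
import math

def silhouette_coefficients(points, clusters):
    """Per-object silhouette coefficient s(o) (Eqs. 10.31-10.33).

    points[i]   -- coordinate tuple of object o_i
    clusters[i] -- index of the cluster containing o_i
    Assumes >= 2 clusters and every cluster of size >= 2.
    """
    def dist(p, q):  # Euclidean distance (an assumed choice of metric)
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    n = len(points)
    cluster_ids = set(clusters)
    scores = []
    for i in range(n):
        # a(o): average distance to the other members of o's own cluster.
        own = [j for j in range(n) if j != i and clusters[j] == clusters[i]]
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b(o): minimum, over the other clusters, of the average distance to
        # that cluster's members.
        b = min(
            sum(dist(points[i], points[j]) for j in range(n) if clusters[j] == c)
            / sum(1 for j in range(n) if clusters[j] == c)
            for c in cluster_ids if c != clusters[i]
        )
        scores.append((b - a) / max(a, b))
    return scores
```

Averaging the returned values over one cluster's members scores that cluster; averaging over all objects scores the whole clustering. On two tight, well-separated pairs of points, every s(o) comes out close to 1, matching the interpretation above.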
10.7. Summary
■ A cluster is a collection