Data Mining_ Concepts and Techniques - Jiawei Han [297]
10.6.3. Measuring Clustering Quality
Suppose you have assessed the clustering tendency of a given data set. You may have also tried to predetermine the number of clusters in the set. You can now apply one or multiple clustering methods to obtain clusterings of the data set. “How good is the clustering generated by a method, and how can we compare the clusterings generated by different methods?”
We have a few methods to choose from for measuring the quality of a clustering. In general, these methods can be categorized into two groups according to whether ground truth is available. Here, ground truth is the ideal clustering, which is often built by human experts.
If ground truth is available, it can be used by extrinsic methods, which compare the clustering against the ground truth and measure how well the two agree. If the ground truth is unavailable, we can use intrinsic methods, which evaluate the goodness of a clustering by considering how well the clusters are separated. Ground truth can be considered as supervision in the form of “cluster labels.” Hence, extrinsic methods are also known as supervised methods, while intrinsic methods are unsupervised methods.
Let's have a look at simple methods from each category.
Extrinsic Methods
When the ground truth is available, we can compare it with a clustering to assess the clustering. Thus, the core task in extrinsic methods is to assign a score, $Q(\mathcal{C}, \mathcal{C}_g)$, to a clustering, $\mathcal{C}$, given the ground truth, $\mathcal{C}_g$. Whether an extrinsic method is effective largely depends on the measure, $Q$, it uses.
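As a minimal sketch of what such a score can look like, the Python snippet below computes purity, the fraction of objects that carry the majority ground-truth category of their cluster. Purity is used here only as an illustrative choice of measure; the text above does not prescribe it. The function name `purity` and the toy labels are assumptions made for this example.

```python
from collections import Counter


def purity(clustering, ground_truth):
    """Illustrative extrinsic score: the fraction of objects that belong to
    the majority ground-truth category of their assigned cluster.

    clustering[i] is the cluster id of object i and ground_truth[i] is its
    category. Both names are hypothetical and chosen for this sketch.
    """
    if len(clustering) != len(ground_truth):
        raise ValueError("clustering and ground_truth must have equal length")
    if not clustering:
        raise ValueError("at least one object is required")

    # Count ground-truth categories inside each cluster.
    categories_per_cluster = {}
    for cluster_id, category in zip(clustering, ground_truth):
        categories_per_cluster.setdefault(cluster_id, Counter())[category] += 1

    # Each cluster contributes the size of its largest category.
    majority_total = sum(
        max(counts.values()) for counts in categories_per_cluster.values()
    )
    return majority_total / len(clustering)


# Toy data (assumed for illustration): six objects in categories A, B, C.
ground_truth = ["A", "A", "A", "B", "B", "C"]
clustering_1 = [0, 0, 0, 0, 1, 1]  # both clusters mix categories
clustering_2 = [0, 0, 0, 1, 1, 2]  # every cluster holds one category

score_1 = purity(clustering_1, ground_truth)
score_2 = purity(clustering_2, ground_truth)
```

Here `score_2` exceeds `score_1` because every cluster in `clustering_2` is pure. Note that purity also gives a perfect score when every object sits in its own cluster, so a single simple measure like this one does not penalize needless fragmentation.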
In general, a measure Q on clustering quality is effective if it satisfies the following four essential criteria:
■ Cluster homogeneity. This requires that the purer the clusters in a clustering are, the better the clustering. Suppose that ground truth says that the objects in a data set, $D$, can belong to categories $L_1, \ldots, L_n$. Consider clustering, $\mathcal{C}_1$, wherein a cluster $C \in \mathcal{C}_1$ contains objects from two categories $L_i$, $L_j$ ($1 \le i < j \le n$). Also consider clustering $\mathcal{C}_2$, which is identical to $\mathcal{C}_1$ except that $C$ is split into two clusters containing the objects in $L_i$ and $L_j$, respectively. A clustering quality measure, $Q$, respecting cluster homogeneity should give a higher score to $\mathcal{C}_2$ than $\mathcal{C}_1$, that is, $Q(\mathcal{C}_2, \mathcal{C}_g) > Q(\mathcal{C}_1, \mathcal{C}_g)$.
■ Cluster completeness. This is the counterpart of cluster homogeneity. Cluster completeness requires that a clustering assign objects belonging to the same category (according to ground truth) to the same cluster. Consider clustering $\mathcal{C}_1$, which contains clusters $C_1$ and $C_2$, of which the members belong to the same category according to ground truth. Let clustering $\mathcal{C}_2$ be identical to $\mathcal{C}_1$ except that $C_1$ and $C_2$ are merged into one cluster in $\mathcal{C}_2$. Then, a clustering quality measure, $Q$, respecting cluster completeness should give a higher score to $\mathcal{C}_2$, that is, $Q(\mathcal{C}_2, \mathcal{C}_g) > Q(\mathcal{C}_1, \mathcal{C}_g)$.
■ Rag bag. In many practical scenarios, there is often a “rag bag” category containing objects that cannot be merged with other objects. Such a category is often called “miscellaneous,” “other,” and so on. The rag bag criterion states that putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag. Consider a clustering $\mathcal{C}_1$ and a cluster $C \in \mathcal{C}_1$ such that all objects in $C$ except for one, denoted by $o$, belong to the same category according to ground truth. Consider a clustering $\mathcal{C}_2$ identical to $\mathcal{C}_1$ except that $o$ is assigned to a cluster $C' \neq C$ in $\mathcal{C}_2$ such that $C'$ contains objects from various categories according to ground truth, and thus is noisy. In other words, $C'$ in $\mathcal{C}_2$ is a rag bag. Then, a clustering quality measure $Q$ respecting the rag bag criterion