Data Mining_ Concepts and Techniques - Jiawei Han [304]
Table 11.1 Set of Digital Cameras and Their Sales at AllElectronics
CameraSales (units)
A 50
B 1320
C 860
D 270
We can apply the fuzzy set idea on clusters. That is, given a set of objects, a cluster is a fuzzy set of objects. Such a cluster is called a fuzzy cluster. Consequently, a clustering contains multiple fuzzy clusters.
Formally, given a set of objects, o1, …, on, a fuzzy clustering of kfuzzy clusters, C1, …, Ck, can be represented using a partition matrix, , where wij is the membership degree of oi in fuzzy cluster Cj. The partition matrix should satisfy the following three requirements:
■ For each object, oi, and cluster, Cj, . This requirement enforces that a fuzzy cluster is a fuzzy set.
■ For each object, oi, . This requirement ensures that every object participates in the clustering equivalently.
■ For each cluster, Cj, . This requirement ensures that for every cluster, there is at least one object for which the membership value is nonzero.
Fuzzy clusters
Suppose the AllElectronics online store has six reviews. The keywords contained in these reviews are listed in Table 11.2.
We can group the reviews into two fuzzy clusters, C1 and C2. C1 is for “digital camera” and “lens," and C2 is for “computer.” The partition matrix is
Here, we use the keywords “digital camera” and “lens” as the features of cluster C1, and “computer” as the feature of cluster C2. For review, Ri, and cluster, Cj, wij is defined as
In this fuzzy clustering, review R4 belongs to clusters C1 and C2 with membership degrees and , respectively.
Table 11.2 Set of Reviews and the Keywords Used
Review_IDKeywords
R1 digital camera, lens
R2 digital camera
R3 lens
R4 digital camera, lens, computer
R5 computer, CPU
R6 computer, computer game
“How can we evaluate how well a fuzzy clustering describes a data set?” Consider a set of objects, , and a fuzzy clustering of k clusters, . Let be the partition matrix. Let be the centers of clusters , respectively. Here, a center can be defined either as the mean or the medoid, or in other ways specific to the application.
As discussed in Chapter 10, the distance or similarity between an object and the center of the cluster to which the object is assigned can be used to measure how well the object belongs to the cluster. This idea can be extended to fuzzy clustering. For any object, oi, and cluster, Cj, if , then measures how well oi is represented by cj, and thus belongs to cluster Cj. Because an object can participate in more than one cluster, the sum of distances to the corresponding cluster centers weighted by the degrees of membership captures how well the object fits the clustering.
Formally, for an object oi, the sum of the squared error (SSE) is given by
(11.2)
where the parameter controls the influence of the degrees of membership. The larger the value of p, the larger the influence of the degrees of membership. Orthogonally, the SSE for a cluster, Cj, is
(11.3)
Finally, the SSE of the clustering is defined as
(11.4)
The SSE can be used to measure how well a fuzzy clustering fits a data set.
Fuzzy clustering is also called soft clustering because it allows an object to belong to more than one cluster. It is easy to see that traditional (rigid) clustering, which enforces each object to belong to only one cluster exclusively, is a special case of fuzzy clustering. We defer the discussion of how to compute fuzzy clustering to Section 11.1.3.
11.1.2. Probabilistic Model-Based Clusters
“Fuzzy clusters (Section 11.1.1) provide the flexibility of allowing an object to participate in multiple clusters. Is there a general framework to specify clusterings where objects may participate in multiple clusters in a probabilistic way?” In this section, we introduce the general notion of probabilistic model-based clusters to answer this question.
As discussed in Chapter 10, we conduct cluster analysis on a data set