Data Mining: Concepts and Techniques - Jiawei Han [303]

you will have a good grasp of the issues and techniques regarding advanced cluster analysis.

11.1. Probabilistic Model-Based Clustering


In all the cluster analysis methods we have discussed so far, each data object can be assigned to only one of a number of clusters. This cluster assignment rule is required in some applications such as assigning customers to marketing managers. However, in other applications, this rigid requirement may not be desirable. In this section, we demonstrate the need for fuzzy or flexible cluster assignment in some applications, and introduce a general method to compute probabilistic clusters and assignments.

“In what situations may a data object belong to more than one cluster?” Consider Example 11.1.

Example 11.1. Clustering product reviews

AllElectronics has an online store, where customers not only purchase products but also write reviews of them. Reviews are spread unevenly: some products have many, while many others have few or none. Moreover, a review may involve multiple products. As the review editor of AllElectronics, your task is to cluster the reviews.

Ideally, a cluster is about a topic, for example, a group of products, services, or issues that are highly related. Assigning a review to one cluster exclusively would not work well for your task. Suppose there is a cluster for “cameras and camcorders” and another for “computers.” What if a review talks about the compatibility between a camcorder and a computer? The review relates to both clusters; however, it does not exclusively belong to either cluster.

You would like to use a clustering method that allows a review to belong to more than one cluster if the review indeed involves more than one topic. To reflect how strongly a review belongs to a cluster, you want each assignment of a review to a cluster to carry a weight representing the partial membership.


Scenarios in which an object may belong to multiple clusters arise in many applications, as Example 11.2 illustrates.

Example 11.2. Clustering to study user search intent

The AllElectronics online store records all customer browsing and purchasing behavior in a log. An important data mining task is to use the log data to categorize and understand user search intent. For example, consider a user session (a short period in which a user interacts with the online store). Is the user searching for a product, making comparisons among different products, or looking for customer support information? Cluster analysis helps here because it is difficult to predefine user behavior patterns thoroughly. A cluster that contains similar user browsing trajectories may represent similar user behavior.

However, not every session belongs to only one cluster. For example, suppose user sessions involving the purchase of digital cameras form one cluster, and user sessions that compare laptop computers form another cluster. What if a user, in a single session, places an order for a digital camera and also compares several laptop computers? Such a session should belong to both clusters to some extent.


In this section, we systematically study the theme of clustering that allows an object to belong to more than one cluster. We start with the notion of fuzzy clusters in Section 11.1.1. We then generalize the concept to probabilistic model-based clusters in Section 11.1.2. In Section 11.1.3, we introduce the expectation-maximization algorithm, a general framework for mining such clusters.

11.1.1. Fuzzy Clusters

Given a set of objects, $X = \{x_1, \ldots, x_n\}$, a fuzzy set S is a subset of X that allows each object in X to have a membership degree between 0 and 1. Formally, a fuzzy set, S, can be modeled as a function, $F_S : X \rightarrow [0, 1]$.
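The definition above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the helper make_fuzzy_set, the object names, the raw scores, and the cap of 100 are all hypothetical and do not come from the text. The sketch evaluates a membership function on every object of X and checks that each degree lies between 0 and 1.

```python
from typing import Callable, Dict, Hashable, Iterable


def make_fuzzy_set(
    objects: Iterable[Hashable],
    membership: Callable[[Hashable], float],
) -> Dict[Hashable, float]:
    """Evaluate a membership function on every object of X.

    Returns a mapping from each object to its membership degree and
    rejects any degree outside the interval [0, 1].
    """
    fuzzy_set = {}
    for obj in objects:
        degree = membership(obj)
        if not 0.0 <= degree <= 1.0:
            raise ValueError(
                f"membership degree {degree} for {obj!r} is outside [0, 1]"
            )
        fuzzy_set[obj] = degree
    return fuzzy_set


# Hypothetical data: object names and raw scores are made up for illustration.
scores = {"x1": 0, "x2": 25, "x3": 100, "x4": 250}
CAP = 100  # assumed saturation point, not a value taken from the text


def saturating_membership(obj: str) -> float:
    # Linear ramp up to the cap, then held at 1 so the degree stays in [0, 1].
    return min(scores[obj] / CAP, 1.0)


S = make_fuzzy_set(scores, saturating_membership)
print(S)  # {'x1': 0.0, 'x2': 0.25, 'x3': 1.0, 'x4': 1.0}
```

Unlike an ordinary (crisp) subset, which would record only whether each object is in or out, the mapping keeps a graded degree for every object of X. A crisp set is the special case in which every degree is exactly 0 or 1.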

Example 11.3. Fuzzy set

The more digital camera units that are sold, the more popular the camera is. In AllElectronics, we can use the following formula to compute the degree of popularity of a digital camera, o, given the sales of o:

(11.1)

Function pop() defines a fuzzy set of popular digital cameras. For example, suppose the sales of digital cameras at
