Online Book Reader

Home Category

Choose a category
All
Classic-Fiction

Data Mining_ Concepts and Techniques - Jiawei Han [305]

By Root 1580 0

because we assume that the objects in the data set in fact belong to different inherent categories. Recall that clustering tendency analysis (Section 10.6.1) can be used to examine whether a data set contains objects that may lead to meaningful clusters. Here, the inherent categories hidden in the data are latent, which means they cannot be directly observed. Instead, we have to infer them using the data observed. For example, the topics hidden in a set of reviews in the AllElectronics online store are latent because one cannot read the topics directly. However, the topics can be inferred from the reviews because each review is about one or multiple topics.

Therefore, the goal of cluster analysis is to find hidden categories. A data set that is the subject of cluster analysis can be regarded as a sample of the possible instances of the hidden categories, but without any category labels. The clusters derived from cluster analysis are inferred using the data set, and are designed to approach the hidden categories.

Statistically, we can assume that a hidden category is a distribution over the data space, which can be mathematically represented using a probability density function (or distribution function). We call such a hidden category a probabilistic cluster. For a probabilistic cluster, C, its probability density function, f, and a point, o, in the data space, f(o) is the relative likelihood that an instance of C appears at o.

Probabilistic clusters

Suppose the digital cameras sold by AllElectronics can be divided into two categories: C1, a consumer line (e.g., point-and-shoot cameras), and C2, a professional line (e.g., single-lens reflex cameras). Their respective probability density functions, f1 and f2, are shown in Figure 11.1 with respect to the attribute price.

For a price value of, say, $1000, f1 (1000) is the relative likelihood that the price of a consumer-line camera is $1000. Similarly, f2 (1000) is the relative likelihood that the price of a professional-line camera is $1000.

The probability density functions, f1 and f2, cannot be observed directly. Instead, AllElectronics can only infer these distributions by analyzing the prices of the digital cameras it sells. Moreover, a camera often does not come with a well-determined category (e.g., “consumer line” or “professional line”). Instead, such categories are typically based on user background knowledge and can vary. For example, a camera in the prosumer segment may be regarded at the high end of the consumer line by some customers, and the low end of the professional line by others.

As an analyst at AllElectronics, you can consider each category as a probabilistic cluster, and conduct cluster analysis on the price of cameras to approach these categories.

Figure 11.1 The probability density functions of two probabilistic clusters.

Suppose we want to find k probabilistic clusters, , through cluster analysis. For a data set, D, of n objects, we can regard D as a finite sample of the possible instances of the clusters. Conceptually, we can assume that D is formed as follows. Each cluster, Cj, is associated with a probability, ωj, that some instance is sampled from the cluster. It is often assumed that are given as part of the problem setting, and that , which ensures that all objects are generated by the k clusters. Here, parameter ωj captures background knowledge about the relative population of cluster Cj.

We then run the following two steps to generate an object in D. The steps are executed n times in total to generate n objects, , in D.

1. Choose a cluster, Cj, according to probabilities .

2. Choose an instance of Cj according to its probability density function, fj.

The data generation process here is the basic assumption in mixture models. Formally, a mixture model assumes that a set of observed objects is a mixture of instances from multiple probabilistic clusters. Conceptually, each observed object is generated independently by two steps: first choosing a probabilistic cluster according to the probabilities of the clusters, and then

Online Book Reader

Data Mining_ Concepts and Techniques - Jiawei Han [305]

®Online Book Reader