Data Mining: Concepts and Techniques - Jiawei Han

choosing a sample according to the probability density function of the chosen cluster.

Given a data set, D, and k, the number of clusters required, the task of probabilistic model-based cluster analysis is to infer a set of k probabilistic clusters that is most likely to generate D using this data generation process. An important remaining question is how we can measure the likelihood that a set of k probabilistic clusters and their probabilities will generate an observed data set.

Consider a set, C, of k probabilistic clusters, C_1, \ldots, C_k, with probability density functions f_1, \ldots, f_k, respectively, and their probabilities, \omega_1, \ldots, \omega_k. For an object, o, the probability that o is generated by cluster C_j is given by P(o|C_j) = \omega_j f_j(o). Therefore, the probability that o is generated by the set C of clusters is

P(o|C) = \sum_{j=1}^{k} \omega_j f_j(o).  (11.5)

Since the objects are assumed to have been generated independently, for a data set, D = \{o_1, \ldots, o_n\}, of n objects, we have

P(D|C) = \prod_{i=1}^{n} P(o_i|C) = \prod_{i=1}^{n} \sum_{j=1}^{k} \omega_j f_j(o_i).  (11.6)
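The two quantities above can be sketched in a few lines of code. The cluster weights and density functions below are illustrative assumptions (two unit-variance Gaussian-shaped densities), not values from the text; only the structure of Eqs. (11.5) and (11.6) is the point.

```python
import math

# Hypothetical clusters for illustration: each entry is (weight w_j, density f_j).
clusters = [
    (0.6, lambda o: math.exp(-0.5 * (o - 0.0) ** 2) / math.sqrt(2 * math.pi)),
    (0.4, lambda o: math.exp(-0.5 * (o - 5.0) ** 2) / math.sqrt(2 * math.pi)),
]

def p_object(o, clusters):
    """Eq. (11.5): P(o|C) = sum_j w_j * f_j(o)."""
    return sum(w * f(o) for w, f in clusters)

def p_dataset(D, clusters):
    """Eq. (11.6): P(D|C) = prod_i P(o_i|C), using the independence assumption."""
    p = 1.0
    for o in D:
        p *= p_object(o, clusters)
    return p
```

Note that `p_dataset` is a product of per-object probabilities, which is exactly where the independence assumption of Eq. (11.6) enters.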

Now, it is clear that the task of probabilistic model-based cluster analysis on a data set, D, is to find a set C of k probabilistic clusters such that P(D|C) is maximized. Maximizing P(D|C) is often intractable because, in general, the probability density function of a cluster can take an arbitrarily complicated form. To make probabilistic model-based clustering computationally feasible, we often compromise by assuming that the probability density functions are parameterized distributions.

Formally, let O = \{o_1, \ldots, o_n\} denote the n observed objects, and \Theta = \{\theta_1, \ldots, \theta_k\} denote the parameters of the k distributions. Then, for any object, o_i \in O, Eq. (11.5) can be rewritten as

P(o_i|\Theta) = \sum_{j=1}^{k} \omega_j P(o_i|\theta_j),  (11.7)

where P(o_i|\theta_j) is the probability that o_i is generated from the jth distribution using parameter \theta_j. Consequently, Eq. (11.6) can be rewritten as

P(O|\Theta) = \prod_{i=1}^{n} \sum_{j=1}^{k} \omega_j P(o_i|\theta_j).  (11.8)

Using the parameterized probability distribution models, the task of probabilistic model-based cluster analysis is to infer a set of parameters, Θ, that maximizes Eq. (11.8).

Univariate Gaussian mixture model

Let's use univariate Gaussian distributions as an example. That is, we assume that the probability density function of each cluster follows a 1-D Gaussian distribution. Suppose there are k clusters. The two parameters of the jth cluster's probability density function are its center, \mu_j, and its standard deviation, \sigma_j. We denote the parameters as \theta_j = (\mu_j, \sigma_j) and \Theta = \{\theta_1, \ldots, \theta_k\}. Let the data set be D = \{o_1, \ldots, o_n\}, where o_i is a real number. For any point, o_i \in D, we have

P(o_i|\theta_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} e^{-\frac{(o_i - \mu_j)^2}{2\sigma_j^2}}.  (11.9)
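Eq. (11.9) is the standard 1-D Gaussian density, which translates directly into code:

```python
import math

def gaussian_pdf(o, mu, sigma):
    """Eq. (11.9): 1-D Gaussian density with center mu and standard deviation sigma."""
    coeff = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    return coeff * math.exp(-((o - mu) ** 2) / (2 * sigma ** 2))
```

The density peaks at o = \mu_j, where it takes the value 1/(\sqrt{2\pi}\,\sigma_j), and is symmetric about the center.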

Assuming that each cluster has the same probability, that is, \omega_1 = \omega_2 = \cdots = \omega_k = \frac{1}{k}, and plugging Eq. (11.9) into Eq. (11.7), we have

P(o_i|\Theta) = \frac{1}{k} \sum_{j=1}^{k} \frac{1}{\sqrt{2\pi}\,\sigma_j} e^{-\frac{(o_i - \mu_j)^2}{2\sigma_j^2}}.  (11.10)

Applying Eq. (11.8), we have

P(O|\Theta) = \prod_{i=1}^{n} \frac{1}{k} \sum_{j=1}^{k} \frac{1}{\sqrt{2\pi}\,\sigma_j} e^{-\frac{(o_i - \mu_j)^2}{2\sigma_j^2}}.  (11.11)
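As a minimal sketch, the objective in Eq. (11.11) can be evaluated for a candidate \Theta. In practice one works with the log of the product so the likelihood of a large data set does not underflow; the data points and parameter sets below are hypothetical, chosen only to show that parameters matching the data score higher.

```python
import math

def gaussian_pdf(o, mu, sigma):
    # Eq. (11.9): 1-D Gaussian density.
    return math.exp(-((o - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def log_likelihood(D, params):
    """Log of Eq. (11.11); params is a list of (mu_j, sigma_j) pairs.

    Each cluster is given the same probability 1/k, as assumed in Eq. (11.10).
    """
    k = len(params)
    total = 0.0
    for o in D:
        # Eq. (11.10): P(o_i|Theta) as an equal-weight mixture of k Gaussians.
        p = sum(gaussian_pdf(o, mu, s) for mu, s in params) / k
        total += math.log(p)
    return total

# Hypothetical 1-D data with two obvious groups, around 0 and around 10.
D = [-0.1, 0.0, 0.1, 9.9, 10.0, 10.1]
well_matched = [(0.0, 1.0), (10.0, 1.0)]   # centers near the two groups
poorly_matched = [(5.0, 1.0), (5.0, 1.0)]  # both centers between the groups
```

Model-based cluster analysis then amounts to searching for the \Theta that maximizes this quantity, which is what the EM algorithm of the next section does iteratively.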

The task of probabilistic model-based cluster analysis using a univariate Gaussian mixture model is to infer Θ such that Eq. (11.11) is maximized.


11.1.3. Expectation-Maximization Algorithm

“How can we compute fuzzy clusterings and probabilistic model-based clusterings?” In this section, we introduce a principled approach. Let's start with a review of the k-means clustering problem and the k-means algorithm studied in Chapter 10.

It can easily be shown that k-means clustering is a special case of fuzzy clustering (Exercise 11.1). The k-means algorithm iterates until the clustering cannot be improved. Each iteration consists of two steps:

The expectation step (E-step): Given the current cluster centers, each object is assigned to the cluster with a center that is closest to the object. Here, an object is expected to belong to the closest cluster.

The maximization step (M-step): Given the cluster assignment, for each cluster, the algorithm adjusts the center so that the sum of the distances from the objects assigned to the cluster to the new center is minimized. That is, the similarity of the objects assigned to each cluster is maximized.
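The two steps above can be sketched for 1-D points. This is an illustrative toy, not the book's pseudocode; note that the mean minimizes the sum of *squared* distances to the assigned objects, which is the criterion k-means actually optimizes.

```python
def kmeans_iteration(objects, centers):
    """One E-step + M-step of k-means on 1-D points (illustrative sketch)."""
    # E-step: assign each object to the cluster whose center is closest.
    assignments = [[] for _ in centers]
    for o in objects:
        j = min(range(len(centers)), key=lambda j: abs(o - centers[j]))
        assignments[j].append(o)
    # M-step: move each center to the mean of its assigned objects, which
    # minimizes the summed squared distance within the cluster. A center
    # with no assigned objects is left where it is.
    return [sum(a) / len(a) if a else c for a, c in zip(assignments, centers)]
```

Iterating `kmeans_iteration` until the centers stop moving reproduces the k-means behavior described above: for example, starting from centers 0 and 10 on the points {0, 1, 9, 10}, one iteration moves the centers to 0.5 and 9.5, after which they no longer change.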

We can generalize this two-step method to tackle fuzzy clustering and probabilistic model-based clustering. In general, an expectation-maximization (EM) algorithm is a framework for approximating maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models. In the context of fuzzy or probabilistic model-based clustering, an EM algorithm starts with an initial set of parameters and iterates
