Data Mining: Concepts and Techniques - Jiawei Han

of opinions with respect to different customers, which cannot be obtained directly and completely. The task of clustering is to estimate the generative model as accurately as possible using the observed data objects to be clustered.

In practice, we can assume that the data generative models adopt common distribution functions, such as Gaussian distribution or Bernoulli distribution, which are governed by parameters. The task of learning a generative model is then reduced to finding the parameter values for which the model best fits the observed data set.

Generative model

Suppose we are given a set of 1-D points X = {x1, …, xn} for clustering analysis. Let us assume that the data points are generated by a Gaussian distribution,

P(x) = (1/√(2πσ²)) e^(−(x−μ)²/(2σ²))  (10.14)

where the parameters are μ (the mean) and σ2 (the variance).

The probability that a point xi ∈ X is generated by the model is then

P(xi | μ, σ²) = (1/√(2πσ²)) e^(−(xi−μ)²/(2σ²))  (10.15)

Consequently, the likelihood that X is generated by the model is

L(N(μ, σ²) : X) = P(X | μ, σ²) = ∏_{i=1}^n (1/√(2πσ²)) e^(−(xi−μ)²/(2σ²))  (10.16)

The task of learning the generative model is to find the parameters μ and σ2 such that the likelihood is maximized, that is, finding

N(μ₀, σ₀²) = argmax{L(N(μ, σ²) : X)}  (10.17)

where max{L(N(μ, σ²) : X)} is called the maximum likelihood.
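For a 1-D Gaussian, the maximization in Eq. (10.17) has a well-known closed form: μ̂ is the sample mean and σ̂² is the (biased, 1/n) sample variance. The following sketch, with made-up data, computes these estimates and the log of the likelihood in Eq. (10.16):

```python
import math

def gaussian_mle(points):
    """Closed-form maximum likelihood estimates for a 1-D Gaussian:
    the sample mean and the (1/n, not 1/(n-1)) sample variance."""
    n = len(points)
    mu = sum(points) / n
    var = sum((x - mu) ** 2 for x in points) / n
    return mu, var

def log_likelihood(points, mu, var):
    """Log of Eq. (10.16): the sum of per-point Gaussian log-densities."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in points)

data = [1.0, 1.2, 0.8, 1.1, 0.9]      # illustrative sample
mu_hat, var_hat = gaussian_mle(data)  # mu_hat = 1.0, var_hat ≈ 0.02
```

No other choice of μ and σ² achieves a higher likelihood on this data, which is exactly the criterion of Eq. (10.17).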


Given a set of objects, the quality of a cluster formed by all the objects can be measured by the maximum likelihood. For a set of objects partitioned into m clusters C1, …, Cm, the quality can be measured by

Q({C1, …, Cm}) = ∏_{i=1}^m P(Ci)  (10.18)

where P(Ci) is the maximum likelihood of cluster Ci. If we merge two clusters, Cj1 and Cj2, into a cluster Cj1 ∪ Cj2, then the change in quality of the overall clustering is

Q(({C1, …, Cm} − {Cj1, Cj2}) ∪ {Cj1 ∪ Cj2}) − Q({C1, …, Cm}) = ∏_{i=1}^m P(Ci) · (P(Cj1 ∪ Cj2)/(P(Cj1)P(Cj2)) − 1)  (10.19)

When choosing to merge two clusters in hierarchical clustering, ∏_{i=1}^m P(Ci) is constant for any pair of clusters. Therefore, given clusters C1 and C2, the distance between them can be measured by

dist(C1, C2) = −log (P(C1 ∪ C2)/(P(C1)P(C2)))  (10.20)
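Concretely, Eq. (10.20) can be evaluated by fitting a Gaussian to each cluster and to their union. The likelihood helper, the variance floor for near-degenerate clusters, and the sample values below are assumptions for this sketch, not part of the text:

```python
import math

VAR_FLOOR = 0.01  # assumed floor so tiny clusters have a finite density

def max_log_likelihood(points):
    """log P(C) of Eq. (10.16) at the MLE parameters of Eq. (10.17)."""
    n = len(points)
    mu = sum(points) / n
    var = max(sum((x - mu) ** 2 for x in points) / n, VAR_FLOOR)
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in points)

def dist(ci, cj):
    """Eq. (10.20): dist(Ci, Cj) = -log( P(Ci ∪ Cj) / (P(Ci) P(Cj)) ),
    computed in log space to avoid under/overflow."""
    return -(max_log_likelihood(ci + cj)
             - max_log_likelihood(ci) - max_log_likelihood(cj))
```

For example, two adjacent groups of points yield a much smaller distance than two far-apart groups, since no single Gaussian fits the far-apart union well.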

A probabilistic hierarchical clustering method can adopt the agglomerative clustering framework, but use probabilistic models (Eq. 10.20) to measure the distance between clusters.

Upon close observation of Eq. (10.19), we see that merging two clusters may not always lead to an improvement in clustering quality; that is, the ratio P(Cj1 ∪ Cj2)/(P(Cj1)P(Cj2)) may be less than 1. For example, assume that Gaussian distribution functions are used in the model of Figure 10.11. Although merging clusters C1 and C2 results in a cluster that better fits a Gaussian distribution, merging clusters C3 and C4 lowers the clustering quality because no Gaussian function can fit the merged cluster well.

Figure 10.11 Merging clusters in probabilistic hierarchical clustering: (a) merging clusters C1 and C2 leads to an increase in overall cluster quality, but merging clusters C3 and C4 (panels (b) and (c)) does not.

Based on this observation, a probabilistic hierarchical clustering scheme can start with one cluster per object, and merge two clusters, Ci and Cj, if the distance between them is negative. In each iteration, we try to find Ci and Cj so as to maximize log (P(Ci ∪ Cj)/(P(Ci)P(Cj))). The iteration continues as long as log (P(Ci ∪ Cj)/(P(Ci)P(Cj))) > 0, that is, as long as there is an improvement in clustering quality. The pseudocode is given in Figure 10.12.

Figure 10.12 A probabilistic hierarchical clustering algorithm.
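Since the pseudocode of Figure 10.12 is not reproduced here, the sketch below implements the loop just described under stated assumptions: 1-D points, Gaussian cluster models fitted by maximum likelihood, a variance floor so singleton clusters have finite density, and a per-cluster complexity penalty folded into the distance. The penalty is an addition to the book's framework: with plain maximized likelihoods, splitting a cluster can never score worse than merging it, so some penalty or prior is needed in practice for merges to become attractive.

```python
import math

VAR_FLOOR = 0.01   # assumed: keeps singleton clusters' density finite
PENALTY = 5.0      # assumed: BIC-like cost per cluster, rewards merging

def max_log_likelihood(points):
    """log P(C): Gaussian log-likelihood at the MLE parameters."""
    n = len(points)
    mu = sum(points) / n
    var = max(sum((x - mu) ** 2 for x in points) / n, VAR_FLOOR)
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in points)

def penalized_dist(ci, cj):
    """Penalized variant of Eq. (10.20); negative iff merging helps."""
    gain = (max_log_likelihood(ci + cj)
            - max_log_likelihood(ci) - max_log_likelihood(cj))
    return -(gain + PENALTY)

def probabilistic_hac(points):
    """Agglomerative loop: repeatedly merge the pair with the smallest
    (most negative) distance; stop when no distance is negative."""
    clusters = [[x] for x in points]          # start: one cluster per object
    while len(clusters) > 1:
        d, i, j = min((penalized_dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d >= 0:                            # no merge improves quality
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

With these illustrative constants, two well-separated groups of points are merged internally and the loop then stops, matching the text's stopping rule of merging only while some pairwise distance is negative.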

Probabilistic hierarchical clustering methods are easy to understand, and generally have the same efficiency as algorithmic agglomerative hierarchical clustering methods; in fact, they share the same framework. Probabilistic models are more interpretable, but sometimes less flexible than distance metrics. Probabilistic models can handle partially observed data. For example, given a multidimensional data set where some objects have missing values on some dimensions, we can learn a Gaussian model on each dimension independently using the observed values on the dimension. The resulting cluster hierarchy accomplishes the optimization goal of fitting data to the selected probabilistic models.
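The per-dimension handling of partially observed data described above can be sketched as follows; the `None` encoding for missing entries and the sample records are assumptions for illustration:

```python
def fit_gaussians_per_dim(rows):
    """Fit an independent 1-D Gaussian to each dimension by MLE,
    skipping missing values (encoded here as None)."""
    dims = len(rows[0])
    params = []
    for d in range(dims):
        vals = [r[d] for r in rows if r[d] is not None]
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals)
        params.append((mu, var))
    return params
```

Each dimension's estimate uses only the values actually observed on that dimension, so an object with missing entries still contributes to the model wherever it is observed.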

A drawback of using probabilistic hierarchical clustering is that it outputs only one hierarchy with respect to a chosen probabilistic model. It cannot handle the uncertainty of cluster hierarchies. Given a data set, there may exist multiple hierarchies that fit the observed data. Neither algorithmic approaches
