Data Mining - Mehmed Kantardzic [150]

of samples) or globally (defined over all of the samples). Thus, we say that a clustering criterion can be either global or local. A global criterion, such as the Euclidean square-error measure, represents each cluster by a prototype or centroid and assigns the samples to clusters according to the most similar prototypes. A local criterion, such as the minimal mutual neighbor distance (MND), forms clusters by utilizing the local structure or context in the data. Therefore, identifying high-density regions in the data space is a basic criterion for forming clusters.

The most commonly used partitional-clustering strategy is based on the square-error criterion. The general objective is to obtain the partition that, for a fixed number of clusters, minimizes the total square-error. Suppose that the given set of N samples in an n-dimensional space has somehow been partitioned into K clusters {C1, C2, … , CK}. Each cluster Ck has nk samples and each sample is in exactly one cluster, so that Σ nk = N for k = 1, … , K. The mean vector Mk of cluster Ck is defined as the centroid of the cluster:

Mk = (1/nk) Σ xik   (sum over i = 1, … , nk)

where xik is the ith sample belonging to cluster Ck. The square-error for cluster Ck is the sum of the squared Euclidean distances between each sample in Ck and its centroid. This error is also called the within-cluster variation:

ek^2 = Σ ||xik − Mk||^2   (sum over i = 1, … , nk)

The square-error for the entire clustering space containing K clusters is the sum of the within-cluster variations:

EK^2 = Σ ek^2   (sum over k = 1, … , K)

The objective of a square-error clustering method is to find a partition containing K clusters that minimizes the total square-error for a given K.
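These definitions can be checked numerically. The following sketch computes the centroid, the within-cluster variation, and the total square-error for a small hypothetical partition (the samples and the split into two clusters are illustrative, not taken from the text):

```python
import numpy as np

def centroid(cluster):
    """Mean vector Mk of the samples in one cluster."""
    return np.asarray(cluster, dtype=float).mean(axis=0)

def within_cluster_variation(cluster):
    """ek^2: sum of squared Euclidean distances to the centroid."""
    cluster = np.asarray(cluster, dtype=float)
    m = cluster.mean(axis=0)
    return float(((cluster - m) ** 2).sum())

def total_square_error(clusters):
    """EK^2: sum of the within-cluster variations over all K clusters."""
    return sum(within_cluster_variation(c) for c in clusters)

# Hypothetical partition of five 2-D samples into K = 2 clusters
C1 = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
C2 = [[5.0, 5.0], [6.0, 5.0]]
print(total_square_error([C1, C2]))
```

Each sample contributes its squared distance to its own cluster's centroid only, so moving a sample between clusters changes both affected within-cluster variations.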

The K-means partitional-clustering algorithm is the simplest and most commonly used algorithm employing a square-error criterion. It starts with a random initial partition and keeps reassigning the samples to clusters, based on the similarity between samples and clusters, until a convergence criterion is met. Typically, this criterion is met when there is no reassignment of any sample from one cluster to another that will cause a decrease in the total squared error. The K-means algorithm is popular because it is easy to implement, and its time and space complexities are relatively small. A major problem with this algorithm is that it is sensitive to the selection of the initial partition and may converge to a local minimum of the criterion function if the initial partition is not properly chosen.

The simple K-means partitional-clustering algorithm is computationally efficient and gives surprisingly good results if the clusters are compact, hyperspherical in shape, and well separated in the feature space. The basic steps of the K-means algorithm are

1. select an initial partition with K clusters containing randomly chosen samples, and compute the centroids of the clusters;

2. generate a new partition by assigning each sample to the closest cluster center;

3. compute new cluster centers as the centroids of the clusters; and

4. repeat steps 2 and 3 until an optimum value of the criterion function is found (or until the cluster membership stabilizes).
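The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not the book's implementation; in particular, the guard that re-seeds an empty cluster is an implementation choice the text does not discuss:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Basic K-means on an (N, n) sample array; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: random initial partition with every cluster non-empty.
    labels = rng.permutation(np.arange(len(X)) % k)
    for _ in range(max_iter):
        # Compute the centroids of the current clusters (steps 1 and 3).
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j)
            else X[rng.integers(len(X))]   # guard: re-seed an empty cluster
            for j in range(k)
        ])
        # Step 2: assign each sample to the closest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when cluster membership stabilizes.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return centroids, labels
```

Because the initial partition is random, different seeds can yield different local minima of the square-error criterion, which is the sensitivity to initialization mentioned above.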

Let us analyze the steps of the K-means algorithm on the simple data set given in Figure 9.6. Suppose that the required number of clusters is two, and initially, clusters are formed from a random distribution of samples: C1 = {x1, x2, x4} and C2 = {x3, x5}. The centroids for these two clusters are

The within-cluster variations, after the initial random distribution of samples, are

And the total square-error is

When we reassign all samples based on the minimum distance from centroids M1 and M2, the new redistribution of samples inside the clusters will be

New clusters C1 = {x1, x2, x3} and C2 = {x4, x5} have new centroids

The corresponding within-cluster variations and the total square-error are

We can see that after the first iteration the total square-error is significantly reduced (from 27.48 to 6.17). In this simple example, the first iteration was at the same time the final one, because if we analyze the distances between the new centroids and the samples, all samples will be assigned to the same clusters. There is no reassignment of samples, and therefore the algorithm halts.
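The same two-step pattern — initial random partition, then one reassignment by nearest centroid — can be reproduced with hypothetical samples (not the Figure 9.6 data, so the numbers differ from those in the text), using the same initial split C1 = {x1, x2, x4}, C2 = {x3, x5}:

```python
import numpy as np

# Hypothetical 2-D samples and an intentionally poor initial partition:
# C1 = {x1, x2, x4}, C2 = {x3, x5}
X = np.array([[0., 0.], [0., 1.], [1., 0.], [4., 4.], [4., 5.]])
labels = np.array([0, 0, 1, 0, 1])

def total_error(X, labels, k=2):
    """Total square-error EK^2 for the current partition."""
    return sum(float(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum())
               for j in range(k))

centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
e_before = total_error(X, labels)

# One reassignment pass: each sample moves to its nearest centroid.
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)
e_after = total_error(X, labels)
print(e_before, e_after)   # the total square-error drops after the pass
```

As in the text's example, a single reassignment pass already produces the stable partition here; a further pass would leave the cluster membership unchanged.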
