Online Book Reader

Home Category

Data Mining - Mehmed Kantardzic [151]

By Root 677 0
and therefore the algorithm halts.

In summary, the K-means algorithm and its equivalent in an artificial neural networks domain—the Kohonen net—have been applied for clustering on large data sets. The reasons behind the popularity of the K-means algorithm are as follows:

1. Its time complexity is O(n × k × l), where n is the number of samples, k is the number of clusters, and l is the number of iterations taken by the algorithm to converge. Typically, k and l are fixed in advance and so the algorithm has linear time complexity in the size of the data set.

2. Its space complexity is O(k + n), and if it is possible to store all the data in the primary memory, access time to all elements is very fast and the algorithm is very efficient.

3. It is an order-independent algorithm. For a given initial distribution of clusters, it generates the same partition of the data at the end of the partitioning process irrespective of the order in which the samples are presented to the algorithm.

A big frustration in using iterative partitional-clustering programs is the lack of guidelines available for choosing K-number of clusters apart from the ambiguity about the best direction for initial partition, updating the partition, adjusting the number of clusters, and the stopping criterion. The K-means algorithm is very sensitive to noise and outlier data points, because a small number of such data can substantially influence the mean value. Unlike the K-means, the K-mediods method, instead of taking the mean value of the samples, uses the most centrally located object (mediods) in a cluster to be the cluster representative. Because of this, the K-mediods method is less sensitive to noise and outliers. Fuzzy c-means, proposed by Dunn and later improved, is an extension of K-means algorithm where each data point can be a member of multiple clusters with a membership value expressed through fuzzy sets. Despite its drawbacks, k-means remains the most widely used partitional clustering algorithm in practice. The algorithm is simple, easily understandable and reasonably scalable, and can be easily modified to deal with streaming data.

9.5 INCREMENTAL CLUSTERING


There are more and more applications where it is necessary to cluster a large collection of data. The definition of “large” has varied with changes in technology. In the 1960s, “large” meant several-thousand samples for clustering. Now, there are applications where millions of samples of high dimensionality have to be clustered. The algorithms discussed above work on large data sets, where it is possible to accommodate the entire data set in the main memory. However, there are applications where the entire data set cannot be stored in the main memory because of its size. There are currently three possible approaches to solve this problem:

1. The data set can be stored in a secondary memory and subsets of this data are clustered independently, followed by a merging step to yield a clustering of the entire set. We call this approach the divide-and-conquer approach.

2. An incremental-clustering algorithm can be employed. Here, data are stored in the secondary memory and data items are transferred to the main memory one at a time for clustering. Only the cluster representations are stored permanently in the main memory to alleviate space limitations.

3. A parallel implementation of a clustering algorithm may be used where the advantages of parallel computers increase the efficiency of the divide-and-conquer approach.

An incremental-clustering approach is most popular, and we will explain its basic principles. The following are the global steps of the incremental-clustering algorithm.

1. Assign the first data item to the first cluster.

2. Consider the next data item. Either assign this item to one of the existing clusters or assign it to a new cluster. This assignment is done based on some criterion, for example, the distance between the new item and the existing cluster centroids. In that case, after every addition of a new item to an existing cluster, recompute a new value for

Return Main Page Previous Page Next Page

®Online Book Reader