An incremental-clustering algorithm is usually based on the following steps:

1. Assign the first data sample to the first cluster.

2. Consider the next data sample. Either assign it to one of the existing clusters or assign it to a new cluster. The assignment is based on a given criterion, such as the distance between the new sample and the centroids of the existing clusters; after every addition of a new sample to an existing cluster, recompute the value of the centroid.

3. Repeat step 2 until all the data samples are clustered.

The space requirements of the incremental algorithm are very small, since storage is needed only for the centroids of the clusters. Typically, these algorithms are non-iterative, and therefore their time requirements are also small. Even if we introduce iterations into the incremental-clustering algorithm, computational complexity and the corresponding time requirements do not increase significantly. On the other hand, there is one obvious weakness of incremental algorithms that we have to be aware of: most of them do not satisfy one of the most important characteristics of a clustering process, order-independence. An algorithm is order-independent if it generates the same partition for any order in which the data set is presented. Incremental algorithms are very sensitive to the order of samples, and for different orders they may generate totally different partitions.

Let us analyze the incremental-clustering algorithm with the sample set given in Figure 9.6: x1 = (0, 2), x2 = (0, 0), x3 = (1.5, 0), x4 = (5, 0), and x5 = (5, 2). Suppose that the order of samples is x1, x2, x3, x4, x5, and that the threshold on the distance between a sample and a cluster centroid is δ = 3.

1. The first sample x1 will become the first cluster C1 = {x1}. The coordinates of x1 will be the coordinates of the centroid M1 = {0, 2}.

2. Start analysis of the other samples.

(a) The second sample x2 is compared with M1, and the distance d is determined:

d(x2, M1) = (0² + 2²)^(1/2) = 2.0 < 3

Therefore, x2 belongs to the cluster C1. The new centroid will be

M1 = {0, 1}

(b) The third sample x3 is compared with the centroid M1 (still the only centroid):

d(x3, M1) = (1.5² + 1²)^(1/2) = 1.8 < 3 ⇒ x3 ∈ C1 ⇒ C1 = {x1, x2, x3}

and the new centroid becomes M1 = {0.5, 0.66}.

(c) The fourth sample x4 is compared with the centroid M1:

d(x4, M1) = (4.5² + 0.66²)^(1/2) = 4.55 > 3

Because the distance of the sample from the given centroid M1 is larger than the threshold value δ, this sample will create its own cluster C2 = {x4} with the corresponding centroid M2 = {5, 0}.

(d) The fifth sample x5 is compared with both cluster centroids:

d(x5, M1) = (4.5² + 1.34²)^(1/2) = 4.69 > 3

d(x5, M2) = (0² + 2²)^(1/2) = 2.0 < 3

The sample is closer to the centroid M2, and its distance is less than the threshold value δ. Therefore, sample x5 is added to the second cluster C2:

C2 = {x4, x5}, with the new centroid M2 = {5, 1}

3. All samples are analyzed and a final clustering solution of two clusters is obtained:

C1 = {x1, x2, x3} with the centroid M1 = {0.5, 0.66}, and C2 = {x4, x5} with the centroid M2 = {5, 1}.

The reader may check that the result of the incremental-clustering process will not be the same if the order of the samples is different. Usually, this algorithm is not iterative (although it could be) and the clusters generated after all the samples have been analyzed in one iteration are the final clusters. If the iterative approach is used, the centroids of the clusters computed in the previous iteration are used as a basis for the partitioning of samples in the next iteration.
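The whole procedure fits in a short program. The following is a minimal sketch in Python, assuming Euclidean distance and the threshold δ = 3 from the example; the function and variable names are illustrative, not part of the original algorithm description.

```python
import math

def incremental_clustering(samples, delta):
    """Threshold-based incremental clustering (single pass).

    samples: list of coordinate tuples, processed in the given order.
    delta:   distance threshold for joining an existing cluster.
    Returns a list of clusters; each cluster is [members, centroid].
    """
    clusters = []                       # each entry: [list_of_points, centroid]
    for s in samples:
        # Find the nearest existing centroid.
        best, best_d = None, float("inf")
        for c in clusters:
            d = math.dist(s, c[1])
            if d < best_d:
                best, best_d = c, d
        if best is not None and best_d < delta:
            # Assign to the nearest cluster and recompute its centroid.
            best[0].append(s)
            n = len(best[0])
            best[1] = tuple(sum(p[i] for p in best[0]) / n
                            for i in range(len(s)))
        else:
            # Start a new cluster whose centroid is the sample itself.
            clusters.append([[s], s])
    return clusters

# The samples of Figure 9.6 in the order x1..x5; two clusters result.
print(incremental_clustering([(0, 2), (0, 0), (1.5, 0), (5, 0), (5, 2)], 3))
# The same data in a different order (order-dependence may change the result).
print(incremental_clustering([(5, 2), (5, 0), (1.5, 0), (0, 0), (0, 2)], 3))
```

Because the single pass never revisits an assignment, an early centroid position can change which cluster a later sample joins; with less separated data than in this example, different presentation orders can yield different partitions.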

For most partitional-clustering algorithms, including the iterative approach, a summarized representation of a cluster is given through its clustering-feature (CF) vector. This vector is given for every cluster as a triple consisting of the number of points (samples) in the cluster, the centroid of the cluster, and the radius of the cluster. The cluster's radius is defined as the square root of the mean squared distance from the centroid to the points in the cluster (the average within-cluster variation). When a new point is added to or removed from a cluster, the new CF can be computed from the old CF; it is very important that we do not need the set of points in the cluster to compute the new CF.
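One way to realize this incremental update is to maintain the sufficient statistics (count, linear sum, sum of squared norms), from which the centroid and radius of the triple described above follow directly. The class below is a sketch under that assumption; it is an illustrative representation, not the book's own definition.

```python
import math

class ClusteringFeature:
    """Summarizes a cluster as (n, linear sum, squared sum).

    The triple (n, centroid, radius) follows from these statistics,
    and both can be updated point by point without storing the
    cluster's members.
    """
    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim          # linear sum of the points
        self.ss = 0.0                  # sum of squared norms of the points

    def add(self, point, sign=1):      # sign=-1 removes a point
        self.n += sign
        for i, v in enumerate(point):
            self.ls[i] += sign * v
        self.ss += sign * sum(v * v for v in point)

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        # Square root of the mean squared distance from the centroid:
        # E[||x||^2] - ||E[x]||^2.
        c = self.centroid()
        return math.sqrt(max(self.ss / self.n - sum(v * v for v in c), 0.0))

cf = ClusteringFeature(dim=2)
for p in [(0, 2), (0, 0), (1.5, 0)]:   # cluster C1 from the example
    cf.add(p)
print(cf.centroid(), cf.radius())       # centroid ~[0.5, 0.67] and the radius
```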

If the samples contain categorical data, we do not have a method for calculating centroids as representatives of the clusters. In that case, an additional algorithm, called K-nearest neighbor, may be used to estimate distances (or similarities) between samples and existing clusters. The basic steps of the algorithm are:

1. to compute the distances between the new sample and all previous samples, already classified into clusters;

2. to sort the distances in increasing order and select K samples with the smallest distance values; and

3. to apply the voting principle. A new sample will be added (classified) to the largest cluster out of K selected samples.

For example, given six 6-D categorical samples grouped into two clusters, a new sample is compared with all six using a similarity measure suitable for categorical data, such as the simple matching coefficient (SMC); the K samples with the largest similarity values then vote, and the new sample is assigned to the cluster that contains the majority of those K neighbors.
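A compact sketch of these three steps in Python, using the SMC as the similarity measure; the sample values and names below are hypothetical, chosen only to illustrate the voting step.

```python
from collections import Counter

def smc(a, b):
    """Simple matching coefficient: fraction of positions that agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def knn_classify(new_sample, clustered, k=3):
    """clustered: list of (sample, cluster_label) pairs.

    1. compute the similarity of the new sample to every clustered sample,
    2. take the K most similar,
    3. vote: return the cluster label held by most of those K neighbors.
    """
    ranked = sorted(clustered, key=lambda sc: smc(new_sample, sc[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 6-D categorical samples already grouped into two clusters.
data = [
    (("A", "B", "A", "B", "C", "B"), "C1"),
    (("A", "A", "A", "B", "A", "B"), "C1"),
    (("B", "B", "A", "B", "B", "A"), "C1"),
    (("B", "C", "A", "B", "B", "A"), "C2"),
    (("B", "A", "B", "A", "C", "A"), "C2"),
    (("A", "C", "B", "A", "B", "B"), "C2"),
]
y = ("A", "C", "A", "B", "C", "A")
print(knn_classify(y, data, k=3))   # majority vote among the 3 nearest
```

With K = 1, a tie between nearest neighbors from different clusters can leave the decision open, which is one reason an odd K larger than 1 is often preferred.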
