Online Book Reader

Home Category

Choose a category
All
Classic-Fiction

Data Mining - Mehmed Kantardzic [147]

By Root 680 0

of similarity for categorical data, Goodall3, because it shows good performances on average for a variety of experiments with different data sets. That does not mean that some other measures such as Eskin, Lin, Smirnov, or Burnaby will not be more appropriate for a specific data set. The Goodall3 measure, given in Table 9.4, assigns a high similarity if the matching values are infrequent regardless of the frequencies of the other values. The range of Sk(Xk; Yk) for matches in the Goodall3 measure is , with the minimum value being attained if Xk is the only value for attribute Ak and maximum value is attained if Xk occurs only twice.

TABLE 9.4. Goodall3 Similarity Measure for Categorical Attributes

There are some advanced distance measures applicable to categorical data, and also to numerical data, that take into account the effect of the surrounding or neighboring points in the n-dimensional spaces of samples. These surrounding points are called contexts. The similarity between two points, xi and xj, with the given context, is measured using the mutual neighbor distance (MND), which is defined as

where NN(xi, xj) is the neighbor number of xj with respect to xi. If xi is the closest point to xj, then NN(xi, xj) is equal to 1, if it is the second closest point, NN(xi, xj) is equal to 2, and so on. Figures 9.3 and 9.4 give an example of the computation and basic characteristics of the MND measure.

Figure 9.3. A and B are more similar than B and C using the MND measure.

Figure 9.4. After changes in the context, B and C are more similar than A and B using the MND measure.

Points in Figures 9.3 and 9.4, denoted by A, B, C, D, E, and F, are 2-D samples with features x1 and x2. In Figure 9.3, the nearest neighbor of A is B using Euclidian distance, and B’s nearest neighbor is A. So,

If we compute the distance between points B and C, the results will be

Figure 9.4 was obtained from Figure 9.3 by adding three new points D, E, and F (samples in the data set). Now, because the context has changed, the distances between the same points A, B, and C have also changed:

The MND between A and B has increased by introducing additional points close to A, even though A and B have not moved. B and C points become more similar than points A and B. The MND measure is not a metric because it does not satisfy the triangle inequality. Despite this, MND has been successfully applied in several real-world clustering tasks.

In general, based on a distance measure between samples, it is possible to define a distance measure between clusters (set of samples). These measures are an essential part in estimating the quality of a clustering process, and therefore they are part of clustering algorithms. The widely used measures for distance between clusters Ci and Cj are

1. Dmin (Ci, Cj) = min |pi − pj|, where pi ∈ Ci and pj ∈ Cj;

2. Dmean (Ci, Cj) = |mi − mj |, where mi and mj are centriods of Ci and Cj;

3. Davg (Ci, Cj) = 1/(ni nj) ∑ ∑ |pi − pj|, where pi ∈ Ci and pj ∈ Cj, and ni and nj are the numbers of samples in clusters Ci and Cj; and

4. Dmax (Ci, Cj) = max |pi − pj|, where pi ∈ Ci and pj ∈ Cj.

9.3 AGGLOMERATIVE HIERARCHICAL CLUSTERING

In hierarchical-cluster analysis, we do not specify the number of clusters as a part of the input. Namely, the input to a system is (X, s), where X is a set of samples, and s is a measure of similarity. An output from a system is a hierarchy of clusters. Most procedures for hierarchical clustering are not based on the concept of optimization, and the goal is to find some approximate, suboptimal solutions, using iterations for improvement of partitions until convergence. Algorithms of hierarchical cluster analysis are divided into the two categories, divisible algorithms and agglomerative algorithms. A divisible algorithm starts from the entire set of samples X and divides it into a partition of subsets, then divides each subset into smaller sets, and so on. Thus, a divisible algorithm generates a sequence of partitions that is ordered from a coarser one to a finer one. An agglomerative

Online Book Reader

Data Mining - Mehmed Kantardzic [147]

®Online Book Reader