Online Book Reader

Home Category

Choose a category
All
Classic-Fiction

Data Mining - Mehmed Kantardzic [233]

By Root 670 0

criteria, and a new population is formed with those chromosomes that are better and more likely to appear. The operators are applied to the new population, and the cycle is repeated.

13.7 GAS FOR CLUSTERING

Much effort has been undertaken toward applying GAs to provide better solutions than those found by traditional clustering algorithms. The emphasis was on appropriate encoding schemes, specific genetic operators, and corresponding fitness functions. Several encoding schemes have been proposed specifically for data clustering, and the main three types are binary, integer, and real encoding.

The binary encoding solution is usually represented as a binary string of length N, where N is the number of data set samples. Each position of the binary string corresponds to a particular sample. The value of the ith gene is 1 if the ith sample is a prototype of a cluster, and 0 otherwise. For example, the data set s in Table 13.5 can be encoded by means of the string [0100001010], in which samples 2, 7, and 9 are prototypes for clusters C1, C2, and C3. The total number of 1s in the string is equal to an a priori defined number of clusters. Clearly, such an encoding scheme leads to a medoid-based representation in which the cluster prototypes coincide with representative samples from the data set. There is an alternative way of representing a data partition using a binary encoding. The matrix of k × N dimensions is used in which the rows represent clusters, and the columns represent samples. In this case, if the jth sample belongs to the ith cluster then 1 is assigned to (i,j) genotype, whereas the other elements of the same column receive 0. For example, using this representation, the data set in Table 13.5 would be encoded as 3 × 10 matrix in Table 13.6.

TABLE 13.5. Three Clusters Are Defined for a Given Data Set s

TABLE 13.6. Binary Encoded Data Set s Given in Table 13.5

Integer encoding uses a vector of N integer positions, where N is the number of data set samples.

Each position corresponds to a particular sample; that is, the ith position (gene) represents the ith data sample. Assuming that there are k clusters, each gene has a value over the alphabet {1, 2, 3, … , k}. These values define the cluster labels. For example, the integer vector [1111222233] represents the clusters depicted in Table 13.5. Another way of representing a partition by means of an integer-encoding scheme involves using an array of only k elements to provide a medoid-based representation of the data set. In this case, each array element represents the index of the sample xi, i = 1, 2, … , N (with respect to the order the samples appear in the data set) corresponding to the prototype of a given cluster. As an example, the array [1 6 10] may represent a partition in which 1, 6, and 10 are indices of the cluster prototypes (medoids) for the data set in Table 13.5. Integer encoding is usually more computationally efficient than the binary-encoding schemes.

Real encoding is the third encoding scheme where the genotypes are made up of real numbers that represent the coordinates of the cluster centroids. In an n-dimensional space, the first n positions represent the n coordinates of the first centroid; the next n positions represent the coordinates of the second centroid, and so forth. To illustrate this, the genotype [1.5 1.5 10.5 1.5 5.0 5.5] encodes the three centroids: (1.5, 1.5), (10.5, 1.5), and (5.0, 5.5) of clusters C1, C2, and C3 in Table 13.5, respectively.

The second important decision, in applying GAs for clustering, is a selection of appropriate genetic operators. A number of crossover and mutation operators are proposed trying to solve an important problem of the context-insensitivity in GAs. When traditional genetic operators are employed in clustering problems, they usually just manipulate gene values without taking into account their connections with other genes. For example, the crossover operation presented in Figure 13.7 shows how two parents representing the same solution to the clustering problem (different labeling but

Online Book Reader

Data Mining - Mehmed Kantardzic [233]

®Online Book Reader