Data Mining - Mehmed Kantardzic [234]
Figure 13.7. Equal parents produce different offspring through crossover.
Different clustering validity criteria are adapted as fitness functions to evolve data partitions in clustering problems. They depend primarily on encoding schemes but also on a selected set of genetic operators. We will illustrate in this text only one example of a clustering fitness function when a real, centroid-based encoding scheme is used. Fitness function f minimizes the sum of squared Euclidean distances of the samples from their respective cluster means. Formally, the fitness function f(C1,C2, … , Ck) is:
where {C1, C2, … , Ck} is the set of k clusters encoded into the genotype, xi is a sample in a data set, and zj is the centroid of cluster Cj. It is important to stress that this criterion is valid only if the number of clusters k is given in advance, and it minimizes the intra-cluster distances and maximizes the intercluster distances as well. In general, fitness functions are based on the distance between samples and either cluster centroids or medoids. Although these types of functions are widely used, usually they are biased toward the discovery of spherical clusters, which clearly will be inappropriate in many real-world applications. Other approaches are possible, including a density-based fitness function. In practice, the success of a GA to solve a clustering problem is highly dependent upon how it has been designed in terms of encoding scheme, operators, fitness function, selection procedure, and initial population generation.
13.8 REVIEW QUESTIONS AND PROBLEMS
1. Given a binary string that represents a concatenation of four attribute values:
use this example to explain the basic concepts of a GA and their equivalents in natural evolution.
2. If we want to optimize a function f(x) using a GA where the precision requirement for x is six decimal places and the range is [−1, 2], what will be the length of a binary vector (chromosome)?
3. If v1 = (0 0 1 1 0 0 1 1) and v2 = (0 1 0 1 0 1 0 1) are two chromosomes, and suppose that the crossover point is randomly selected after the fifth gene, what are the two resulting offspring?
4. Given the schema (* 1 * 0 0), what are the strings that match with it?
5. What is the number of strings that match with the schema (* * * * * * * *)?
6. The function f(x) = −x2 + 16x is defined on interval [0, 63]. Use two iterations of a GA to establish the approximate solution for a maximum of f(x).
7. For the function f(x) given in Problem number 6, compare three schemata
with respect to order (O), length (L), and fitness (F).
8. Given a parent chromosome (1 1 0 0 0 1 0 0 0 1), what is the potential offspring (give examples) if the mutation probability is
(a) pm = 1.0
(b) pm = 0.5
(c) pm = 0.2
(d) pm = 0.1
(e) pm = 0.0001
9. Explain the basic principles of the building-block hypothesis and its potential applications.
10. Perform a PMC operation for two strings S1 and S2, in which two randomly selected crossing points are given:
11. Search the Web to find the basic characteristics of publicly available or commercial software tools that are based on GAs. Document the results of your search.
13.9 REFERENCES FOR FURTHER STUDY
Fogel, D. B., ed., Evolutionary Computation, IEEE Press, New York, 1998.
The book provides a collection of 30 landmark papers, and it spans the entire history of evolutionary computation—from today’s research back to its very origins more than 40 years ago.
Goldenberg, D. E., Genetic Algorithms in Search, Optimization and Machine