Data Mining - Mehmed Kantardzic [224]
in which each feature’s decimal value is encoded as a gene composed of four bits using a binary coding.
Other encoding schemes, such as Gray coding, can also be used and, when necessary, arrangements can be made for encoding negative, floating-point, or discrete-value numbers. Encoding schemes provide a way of translating problem-specific knowledge directly into the GA framework. This process plays a key role in determining GAs’ performances. Moreover, genetic operators can and should be designed along with the encoding scheme used for a specific application.
A set of all feature values encoded into a bit string represents one chromosome. In GAs we are manipulating not a single chromosome but a set of chromosomes called a population. To initialize a population, we can simply set some pop-size number of chromosomes randomly. The size of the population is also one of the most important choices faced by any user of GAs and may be critical in many applications: Will we reach the approximate solution at all, and if yes, how fast? If the population size is too small, the GA may converge too quickly and maybe to a solution that is only the local optimum; if it is too large, the GA may waste computational resources and the waiting time for an improvement might be too long.
13.2.2 Fitness Evaluation
The next step, after creating a population, is to calculate the fitness value of each member in the population because each chromosome is a candidate for an optimal solution. For a maximization problem, the fitness value fi of the ith member is usually the objective function evaluated at this member (or the point in parameter space). The fitness of a solution is a measure that can be used to compare solutions to determine which is better. The fitness values may be determined from complex analytical formulas, simulation models, or by referring to observations from experiments or real-life problem settings. GAs will work correctly if fitness values are determined appropriately, keeping in mind that a selection of the objective function is highly subjective and problem-dependent.
We usually need fitness values that are positive, so some kind of scaling and/or translation of data may become necessary if the objective function is not strictly positive. Another approach is to use the rankings of members in a population as their fitness values. The advantage of this approach is that the objective function does not need to be accurate, as long as it can provide the correct ranking information.
13.2.3 Selection
In this phase, we have to create a new population from the current generation. The selection operation determines which parent chromosomes participate in producing offspring for the next generation. Usually, members are selected for mating with a selection probability proportional to their fitness values. The most common way to implement this method is to set the selection probability p equal to
where n is the population size and fi is a fitness value for the ith chromosome. The effect of this selection method is to allow members with above-average values to reproduce and replace members with below-average fitness values.
For the selection process (selection of a new population with respect to the probability distribution based on fitness values), a roulette wheel with slots sized according to fitness for each chromosome is used. We construct such a roulette wheel as follows:
1. Calculate the fitness value f(vi) for each chromosome vi.
2. Find the total fitness of the population:
3. Calculate the probability of a selection pi for each chromosome vi:
4. Calculate a cumulative probability qi after each chromosome vi is included:
where q increases from 0 to maximum 1. Value 1 shows that all chromosomes from the population are included into a cumulative probability.
The selection process is based on spinning the roulette wheel pop-size times. Each time, we select a single chromosome for a new population. An implementation could repeat steps 1 and 2 pop-size times:
1. Generate a random number r from the range