Online Book Reader

Home Category

Choose a category
All
Classic-Fiction

Data Mining - Mehmed Kantardzic [109]

By Root 849 0

of samples in a data set). The new gain criterion will have the form

Similarly, Split-info(x) can be altered by regarding the samples with unknown values as an additional group in splitting. If the test x has n outcomes, its Split-info(x) is computed as if the test divided the data set into n + 1 subsets. This modification has a direct influence on the final value of the modified criterion Gain-ratio(x).

Let us explain the modifications of the C4.5 decision-tree methodology applied on one example. The database is similar to the previous one (Table 6.1), only there is now one value missing for Attribute1 denoted by “?” as presented in Table 6.2.

TABLE 6.2. A Simple Flat Database of Examples with One Missing Value

The computation of the gain parameter for Attribute1 is similar to as before, only the missing value corrects some of the previous steps. Eight out of the 13 cases with values for Attribute1 belong to CLASS1 and five cases to CLASS2, so the entropy before splitting is

After using Attribute1 to divide T into three subsets (test x1 represents the selection one of three values A, B, or C), the resulting information is given by

The information gained by this test is now corrected with the factor F (F = 13/14 for our example):

The gain for this test is slightly lower than the previous value of 0.216 bits. The split information, however, is still determined from the entire training set and is larger, since there is an extra category for unknown values.

Additionally, the concept of partitioning must be generalized. With every sample a new parameter, probability, is associated. When a case with known value is assigned from T to subset Ti, the probability of it belonging to Ti is 1, and in all other subsets is 0. When a value is not known, only a weaker probabilistic statement can be made. C4.5 therefore associates with each sample (having a missing value) in each subset Ti a weight w, representing the probability that the case belongs to each subset. To make the solution more general, it is necessary to take into account that the probabilities of samples before splitting are not always equal to one (in subsequent iterations of the decision-tree construction). Therefore, new parameter wnew for missing values after splitting is equal to the old parameter wold before splitting multiplied by the probability that the sample belongs to each subset P(Ti), or more formally:

For example, the record with the missing value, given in the database in Figure 6.7, will be represented in all three subsets after the splitting set T into subsets Ti using test x1 on Attribute1. New weights wi will be equal to probabilities 5/13, 3/13, and 5/13, because the initial (old) value for w is equal to one. The new subsets are given in Figure 6.7. |Ti| can now be reinterpreted in C4.5 not as a number of elements in a set Ti, but as a sum of all weights w for the given set Ti. From Figure 6.7, the new values are computed as: |T1| = 5 + 5/13, |T2| = 3 + 3/13, and |T3| = 5 + 5/13.

Figure 6.7. Results of test x1 are subsets Ti (initial set T is with missing value).

If these subsets are partitioned further by the tests on Attribute2 and Attribute3, the final decision tree for a data set with missing values has the form shown in Figure 6.8

Figure 6.8. Decision tree for the database T with missing values.

The decision tree in Figure 6.8 has much the same structure as before (Fig. 6.6), but because of the ambiguity in final classification, every decision is attached with two parameters in a form (|Ti|/E). |Ti| is the sum of the fractional samples that reach the leaf and E is the number of samples that belong to classes other than the nominated class.

For example, (3.4/0.4) means that 3.4 (or 3 + 5/13) fractional training samples reached the leaf, of which 0.4 (or 5/13) did not belong to the class assigned to the leaf. It is possible to express the |Ti| and E parameters in percentages:

A similar approach is taken in C4.5 when the decision tree is used to classify a sample previously not present in a database, that is, the

Online Book Reader

Data Mining - Mehmed Kantardzic [109]

®Online Book Reader