Data Mining - Mehmed Kantardzic [113]
CART uses the Gini diversity index as a splitting criterion instead of information-based criteria for C4.5. The CART authors favor the Gini criterion over information gain because the Gini can be extended to include symmetrized costs, and it is computed more rapidly than information gain. The Gini index is used to select the feature at each internal node of the decision tree. We define the Gini index for a data set S as follows:
where
c is the number of predefined classes,
Ci are classes for i = 1, … , c − 1,
si is the number of samples belonging to class Ci, and
pi= si/S is a relative frequency of class Ci in the set.
This metric indicates the partition purity of the data set S. For branch prediction where we have two classes the Gini index lies within [0, 0.5]. If all the data in S belong to the same class, Gini S equals the minimum value 0, which means that S is pure. If Gini S equals 0.5, all observations in S are equally distributed among two classes. This decreases as a split favoring one class: for instance a 70/30 distribution produces Gini index of 0.42. If we have more than two classes, the maximum possible value for index increases; for instance, the worst possible diversity for three classes is a 33% split, and it produces a Gini value of 0.67. The Gini coefficient, which ranges from 0 to 1 (for extremely large number of classes), is multiplied by 100 to range between 0 and 100 in some commercial tools.
The quality of a split on a feature into k subsets Si is then computed as the weighted sum of the Gini indices of the resulting subsets:
where
ni is the number of samples in subset Si after splitting, and
n is the total number of samples in the given node.
Thus, Ginisplit is calculated for all possible features, and the feature with minimum Ginisplit is selected as split point. The procedure is repetitive as in C4.5. We may compare the results of CART and C4.5 attribute selection by applying entropy index (C4.5) and Gini index (CART) for splitting Attribute1 in Table 6.1. While we already have results for C4.5 (gain ratio for Attribute 1), Ginisplit index for the same attribute may be calculated as:
Interpretation of the absolute value for Ginisplit index is not important. It is important that relatively this value is lower for AttributeA1 than for the other two attributes in Table 6.1. The reader may easily check this claim. Therefore, based on Gini index Attribute1 is selected for splitting. It is the same result obtained in C4.5 algorithm using an entropy criterion. Although the results are the same in this example, for many other data sets there could be (usually small) differences in results between these two approaches.
CART also supports the twoing splitting criterion, which can be used for multi-class problems. At each node, the classes are separated into two superclasses containing disjoint and mutually exhaustive classes. A splitting criterion for a two-class problem is used to find the attribute, and the two superclasses that optimize the two-class criterion. The approach gives “strategic” splits in the sense that several classes that are similar are grouped together. Although twoing splitting rule allows us to build more balanced trees, this algorithm works slower than Gini rule. For example, if the total number of classes is equal to K, then we will have 2K − 1 possible grouping into two classes. It can be seen that there is a small difference between trees constructed using Gini and trees constructed via twoing rule. The difference can be seen mainly at the bottom of the tree where the variables are less significant in comparison with the top of the tree.
Pruning technique used in CART is called minimal cost complexity