CART uses cost-complexity pruning, while C4.5 uses binomial confidence limits. The proposed approach assumes that the bias in the re-substitution error of a tree increases linearly with the number of leaf nodes. The cost assigned to a subtree is the sum of two terms: the re-substitution error and the number of leaves multiplied by a complexity parameter α. It can be shown that, for every value of α, there exists a unique smallest tree minimizing this cost. Note that, although α runs through a continuum of values, there are at most a finite number of possible subtrees to analyze and prune.
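As a concrete illustration, scikit-learn's DecisionTreeClassifier implements this minimal cost-complexity pruning scheme; the sketch below (the dataset and random_state are illustrative choices, not from the text) enumerates the finite sequence of α thresholds and the corresponding smallest minimizing subtrees:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Enumerate the finite sequence of alpha thresholds: each ccp_alpha is the
# point at which pruning one more "weakest link" minimizes the cost
# R_alpha(T) = R(T) + alpha * |leaves(T)|.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

for alpha in path.ccp_alphas:
    # For each alpha, refit with pruning to obtain the unique smallest
    # subtree minimizing the cost; leaf counts shrink as alpha grows.
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"alpha={alpha:.5f}  leaves={pruned.get_n_leaves()}")
```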

Unlike C4.5, CART does not penalize the splitting criterion during the tree construction if examples have unknown values for the attribute used in the split. The criterion uses only those instances for which the value is known. CART finds several surrogate splits that can be used instead of the original split. During classification, the first surrogate split based on a known attribute value is used. The surrogates cannot be chosen based on the original splitting criterion because the subtree at each node is constructed based on the original split selected. The surrogate splits are therefore chosen to maximize a measure of predictive association with the original split. This procedure works well if there are attributes that are highly correlated with the chosen attribute.
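Library support for surrogate splits varies (scikit-learn, for example, does not provide them), so the following is a minimal sketch of the idea with hypothetical helper names; the agreement score is a simplified stand-in for CART's measure of predictive association:

```python
import numpy as np

def predictive_association(primary_goes_left, candidate_goes_left):
    """Fraction of training cases the candidate split sends to the same
    side as the primary split (simplified stand-in for CART's measure)."""
    return float(np.mean(primary_goes_left == candidate_goes_left))

def choose_surrogates(X, primary_attr, primary_thresh, other_attrs):
    """Rank threshold splits on other attributes by how well they mimic
    the primary split (hypothetical helper; candidate thresholds are
    midpoints of adjacent sorted unique values)."""
    primary = X[:, primary_attr] <= primary_thresh
    surrogates = []
    for a in other_attrs:
        values = np.unique(X[:, a])
        for t in (values[:-1] + values[1:]) / 2.0:
            score = predictive_association(primary, X[:, a] <= t)
            surrogates.append((score, a, t))
    # Best mimics first; at classification time, the first surrogate whose
    # attribute value is known stands in for the primary split.
    return sorted(surrogates, reverse=True)
```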

As its name implies, CART also supports building regression trees. Regression trees are somewhat simpler than classification trees because the growing and pruning criteria used in CART are the same. The regression tree structure is similar to a classification tree, except that each leaf predicts a real number. The re-substitution estimate for pruning the tree is the mean-squared error.
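A minimal sketch of both points, assuming scikit-learn's DecisionTreeRegressor as the CART-style regression tree (the synthetic data and depth limit are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative one-dimensional regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Each leaf predicts a real number: the mean target of its training cases.
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Re-substitution estimate used for pruning: the mean-squared error on
# the training data itself.
resub_mse = np.mean((y - reg.predict(X)) ** 2)
print(f"re-substitution MSE: {resub_mse:.4f}")
```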

Among the main advantages of the CART method is its robustness to outliers and noisy data. Usually the splitting algorithm will isolate outliers in an individual node or nodes. Another important practical property of CART is that the structure of its classification or regression trees is invariant with respect to monotone transformations of independent variables: one can replace any variable with its logarithm or square root, and the structure of the tree will not change (a quick empirical check appears after this paragraph). One of the disadvantages of CART is that it may produce unstable decision trees. Insignificant modifications of the learning samples, such as eliminating several observations, can lead to radical changes in the decision tree: a significant increase or decrease in tree complexity, and changes in splitting variables and values.
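A small sketch checking the invariance claim, assuming strictly positive features so the logarithm is defined; with tie-breaking fixed by random_state, both runs should report the same number of leaves and identical training predictions, even though the numeric split thresholds differ:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0.1, 10.0, size=(300, 2))   # strictly positive features
y = (X[:, 0] * X[:, 1] > 20.0).astype(int)  # illustrative labels

# Fit one tree on the raw features and one on their logarithms.
t_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
t_log = DecisionTreeClassifier(random_state=0).fit(np.log(X), y)

# The thresholds differ, but the induced partition of the data does not.
print(t_raw.get_n_leaves(), t_log.get_n_leaves())
print(np.array_equal(t_raw.predict(X), t_log.predict(np.log(X))))
```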

C4.5 and CART are two popular algorithms for decision tree induction; however, their corresponding splitting criteria, information gain and the Gini index, are considered to be skew-sensitive. In other words, they are not applicable, or at least not successfully applied, in cases where classes are not equally distributed in training and testing data sets. It therefore becomes important to design a decision tree-splitting criterion that captures the divergence in distributions without being dominated by the class priors. One proposed solution is the Hellinger distance as a decision tree-splitting criterion. Recent experimental results show that this distance measure is skew-insensitive.

For application as a decision tree-splitting criterion, we assume a countable space, so all continuous features are discretized into p partitions or bins. Assuming a two-class problem (class+ and class−), let X+ be the samples belonging to class+ and X− the samples belonging to class−. We are then essentially interested in calculating the "distance" between the normalized frequency distributions of X+ and X−, aggregated over all p partitions. The Hellinger distance between X+ and X− is:

$$d_H(X_+, X_-) = \sqrt{\sum_{j=1}^{p}\left(\sqrt{\frac{|X_{+j}|}{|X_+|}} - \sqrt{\frac{|X_{-j}|}{|X_-|}}\right)^{2}}$$

where |X+j| and |X−j| are the numbers of class+ and class− samples, respectively, that fall into partition j.

This formulation is strongly skew-insensitive. Experiments with real-world data show that the proposed measure may be successfully applied to cases where the classes are highly imbalanced, that is, where |X+| ≪ |X−|. It essentially captures the divergence between the feature value distributions given the two different classes.
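The formula above transcribes directly into code; the sketch below evaluates the criterion for a single feature, with the bin count and the helper name hellinger_distance as illustrative choices:

```python
import numpy as np

def hellinger_distance(f, y, bins=10):
    """Hellinger distance between the per-class distributions of one
    discretized feature f, for binary labels y (bin count is arbitrary)."""
    edges = np.histogram_bin_edges(f, bins=bins)
    pos, _ = np.histogram(f[y == 1], bins=edges)  # class+ counts per bin
    neg, _ = np.histogram(f[y == 0], bins=edges)  # class- counts per bin
    # Square roots of the normalized frequencies, then the distance.
    p = np.sqrt(pos / pos.sum())
    q = np.sqrt(neg / neg.sum())
    return float(np.sqrt(np.sum((p - q) ** 2)))
```

A tree grower would evaluate this distance for each candidate feature (and each candidate discretization or threshold) and select the split with the largest divergence between the two class distributions.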

Recent developments in the field extend technology toward multivariate
