The classes must be sharply delineated: A case either does or does not belong to a particular class. It is expected that there will be far more samples than classes.

4. Sufficient Data. Inductive generalization given in the form of a decision tree proceeds by identifying patterns in data. The approach is valid only if a sufficient number of robust patterns can be distinguished from chance coincidences. Because this differentiation usually depends on statistical tests, there must be a sufficient number of samples for these tests to be effective. The amount of data required is affected by factors such as the number of properties and classes and the complexity of the classification model; as these factors increase, more data are needed to construct a reliable model.

5. “Logical” Classification Models. These methods construct only classifiers that can be expressed as decision trees or decision rules. These forms essentially restrict the description of a class to a logical expression whose primitives are statements about the values of particular attributes. Some applications require weighted attributes or arithmetic combinations of attributes for a reliable description of classes. In these situations logical models become very complex and, in general, they are not effective.

6.2 C4.5 ALGORITHM: GENERATING A DECISION TREE


The most important part of the C4.5 algorithm is the process of generating an initial decision tree from the set of training samples. As a result, the algorithm generates a classifier in the form of a decision tree: a structure with two types of nodes—a leaf indicating a class, or a decision node specifying some tests to be carried out on a single-attribute value, with one branch and a subtree for each possible outcome of the test.
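To make the two node types concrete, here is a minimal sketch in Python (not from the book; the class names Leaf and DecisionNode are illustrative choices):

    # Minimal sketch of the two decision-tree node types described above.
    from dataclasses import dataclass, field
    from typing import Dict, Union

    @dataclass
    class Leaf:
        # A leaf simply names the class assigned to samples that reach it.
        label: str

    @dataclass
    class DecisionNode:
        # A decision node tests a single attribute; each possible outcome
        # of the test selects one branch (a subtree).
        attribute: str
        branches: Dict[str, Union["DecisionNode", Leaf]] = field(default_factory=dict)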

A decision tree can be used to classify a new sample by starting at the root of the tree and moving through it until a leaf is encountered. At each non-leaf decision node, the outcome of the node's test is determined for the sample's attribute values, and attention shifts to the root of the subtree corresponding to that outcome. For example, if the classification model of the problem is given by the decision tree in Figure 6.3a, and the sample for classification is the one in Figure 6.3b, then the algorithm will follow the path through the nodes A, C, and F (a leaf node) until it makes the final classification decision: CLASS2.

Figure 6.3. Classification of a new sample based on the decision-tree model. (a) Decision tree; (b) An example for classification.
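The traversal just described can be written as a short loop. The sketch below builds on the hypothetical Leaf and DecisionNode classes above; the example tree is not the actual tree of Figure 6.3 (which is not reproduced here), only a stand-in with the same shape of decision:

    # Sketch of decision-tree classification: follow the branch matching
    # the sample's value for each tested attribute until a leaf is reached.
    def classify(node, sample):
        while isinstance(node, DecisionNode):
            node = node.branches[sample[node.attribute]]
        return node.label

    # Hypothetical tree (not Figure 6.3a itself): root A, inner node C.
    tree = DecisionNode("A", {
        "a1": Leaf("CLASS1"),
        "a2": DecisionNode("C", {
            "c1": Leaf("CLASS1"),
            "c2": Leaf("CLASS2"),   # this leaf plays the role of node F
        }),
    })
    print(classify(tree, {"A": "a2", "C": "c2"}))   # -> CLASS2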

The skeleton of the C4.5 algorithm is based on Hunt’s Concept Learning System (CLS) method for constructing a decision tree from a set T of training samples. Let the classes be denoted as {C1, C2, … , Ck}. There are three possibilities for the content of the set T:

1. T contains one or more samples, all belonging to a single class Cj. The decision tree for T is a leaf identifying class Cj.

2. T contains no samples. The decision tree is again a leaf, but the class to be associated with the leaf must be determined from information other than T. The C4.5 algorithm uses as a criterion the most frequent class at the parent of the given node.

3. T contains samples that belong to a mixture of classes. In this situation, the idea is to refine T into subsets of samples that each head toward a single-class collection of samples. An appropriate test based on a single attribute is chosen, with one or more mutually exclusive outcomes {O1, O2, … , On}. T is partitioned into subsets T1, T2, … , Tn, where Ti contains all the samples in T that have outcome Oi of the chosen test. The decision tree for T consists of a decision node identifying the test and one branch for each possible outcome (examples of this type of node are nodes A, B, and C in the decision tree in Fig. 6.3a).

The same tree-building procedure is applied recursively to each subset of training samples, so that the ith branch leads to the decision tree constructed from the subset Ti of the training samples. The successive division of the set of training samples proceeds until all the subsets consist of samples belonging to a single class.
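As a rough sketch of this recursive skeleton, the Python function below implements the three cases; choose_test is a hypothetical placeholder for the attribute-selection criterion (the entropy-based criterion C4.5 actually uses is described later in the chapter), and samples are assumed to be dictionaries mapping attribute names to values, with the class label under the key "class":

    # Sketch of Hunt's recursive tree-building skeleton (illustrative,
    # not the book's code). Reuses the Leaf/DecisionNode classes above.
    from collections import Counter

    def choose_test(samples, attributes):
        # Placeholder: take the first attribute. C4.5 instead selects
        # the attribute whose test maximizes an entropy-based criterion.
        return attributes[0]

    def build_tree(samples, attributes, parent_majority=None):
        # Case 2: T contains no samples -> leaf labeled with the most
        # frequent class at the parent node.
        if not samples:
            return Leaf(parent_majority)

        labels = [s["class"] for s in samples]
        majority = Counter(labels).most_common(1)[0][0]

        # Case 1: all samples belong to a single class Cj -> leaf for Cj.
        # (Running out of attributes is handled the same way, as a guard.)
        if len(set(labels)) == 1 or not attributes:
            return Leaf(majority)

        # Case 3: mixed classes -> test on a single attribute, partition T
        # by the test's outcomes, and build each subtree recursively.
        attr = choose_test(samples, attributes)
        remaining = [a for a in attributes if a != attr]
        node = DecisionNode(attr)
        for outcome in sorted({s[attr] for s in samples}):
            subset = [s for s in samples if s[attr] == outcome]
            node.branches[outcome] = build_tree(subset, remaining, majority)
        return node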
