Data Mining - Mehmed Kantardzic [166]
Samples in the training data set share prefixes. For example, the second sample carries attribute values (a1, b2, c1) in the F-list and shares a common prefix a1 with the first sample. An additional branch from the node a1 will be inserted in the tree with new nodes b2 and c1. A new class label B with the count equal to 1 is also inserted at the end of the new path. The final FP tree for the database T is given in Figure 10.7a.
Figure 10.7. FP tree for the database in Table 10.3. (a) nonmerged FT tree; (b) FT tree after merging d3 nodes.
After analyzing all the samples and constructing an FP tree, the set of CAR can be generated by dividing all rules into subsets without overlap. In our example it will be four subsets: (1) the rules having d3 value; (2) the rules having c1 but no d3; (3) the rules having b2 but neither d3 nor c1; and (4) the rules having only a1. CMAR finds these subsets one by one.
To find the subset of rules having d3, CMAR traverses nodes having the attribute value d3 and looks “upward” the FP tree to collect d3-projected samples. In our example, there are three samples represented in the FP tree, and they are (a1, b2, c1, d3):C, (a1, b2, d3):C, and (d3):A. The problem of finding all FPs in the training set can be reduced to mining FPs in the d3-projected database. In our example, in the d3-projected database, since the pattern (a1, b2, d3) occurs twice its support is equal to the required threshold value 2. Also, the rule based on this FP, (a1, b2, d3) → C has a confidence of 100% (above the threshold value), and that is the only rule generated in the given projection of the database.
After a search for rules having d3 value, all the nodes of d3 and their corresponding class labels are merged into their parent nodes at the FP tree. The FP tree is shrunk as shown in Figure 10.7b. The remaining set of rules can be mined similarly by repeating the previous procedures for a c1-projected database, then for the b2-projected database, and finally for the a1-projected database. In this analysis, (a1, c1) is an FP with support 3, but all rules are with confidence less than the threshold value. The same conclusions can be drawn for pattern (a1, b2) and for (a1). Therefore, the only association rule generated through the training process with the database T is (a1, b2, d3) → C with support equal to 2 and 100% confidence.
When a set of rules is selected for classification, CMAR is ready to classify new samples. For the new sample, CMAR collects the subset of rules matching the sample from the total set of rules. Trivially, if all the rules have the same class, CMAR simply assigns that label to the new sample. If the rules are not consistent in the class label, CMAR divides the rules into groups according to the class label and yields the label of the “strongest” group. To compare the strength of groups, it is necessary to measure the “combined effect” of each group. Intuitively, if the rules in a group are highly positively correlated and have good support, the group should have a strong effect. CMAR uses the strongest rule in the group as its representative, that is, the rule with highest χ2 test value (adopted for this algorithm for a simplified computation). Preliminary experiments have shown that CMAR outperforms the C4.5 algorithm in terms of average accuracy, efficiency, and scalability.
10.7 MULTIDIMENSIONAL ASSOCIATION–RULES MINING
A multidimensional transactional database DB has the schema
where ID is a unique identification of each transaction, Ai are structured attributes in the database, and items are sets of items connected with the given transaction. The information in each tuple t = (id, a1, a2, … , an, items-t) can be partitioned into two: