Data Mining - Mehmed Kantardzic
given threshold value (3) is satisfied only by the frequent itemset {(c, 3), (p, 3)}, or, simplified, {c, p}. All other itemsets containing p fall below the threshold value.

The next subset of frequent itemsets consists of those containing m but not p. The FP tree yields the paths {(f,4), (c,3), (a,3), (m,2)} and {(f,4), (c,3), (a,3), (b,1), (m,1)}, with the corresponding accumulated samples {(f,2), (c,2), (a,2), (m,2)} and {(f,1), (c,1), (a,1), (b,1), (m,1)}. Analyzing these samples, we discover the frequent itemset {(f,3), (c,3), (a,3), (m,3)} or, simplified, {f, c, a, m}.

Repeating the same process for subsets (3) to (6) in our example, additional frequent itemsets could be mined. These are the itemsets {f, c, a} and {f, c}, but they are already subsets of the frequent itemset {f, c, a, m}. Therefore, the final solution of the FP growth method is the set of frequent itemsets, which is, in our example, {{c, p}, {f, c, a, m}}.
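The accumulation step above can be sketched in a few lines of code. The snippet below is a minimal sketch, not a full FP growth implementation: it takes m's two prefix paths and their counts exactly as given in the text, accumulates item counts, and keeps the items that meet the threshold:

```python
from collections import Counter

# Prefix paths for item m, with the accumulated count of m along each path,
# taken directly from the FP-tree paths described in the text.
cond_base_m = [(("f", "c", "a"), 2), (("f", "c", "a", "b"), 1)]
min_support = 3

# Accumulate item counts across m's conditional pattern base.
counts = Counter()
for path, count in cond_base_m:
    for item in path:
        counts[item] += count

# Items frequent within this base, together with m, form the frequent itemset.
frequent_with_m = sorted(i for i, c in counts.items() if c >= min_support) + ["m"]
print(frequent_with_m)  # ['a', 'c', 'f', 'm'], i.e., {f, c, a, m}
```

Item b accumulates a count of only 1 and is discarded, which is why it does not appear in the resulting itemset.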

Experiments have shown that the FP growth algorithm is faster than the Apriori algorithm by about one order of magnitude. Several optimization techniques have been added to the FP growth algorithm, and versions of it exist for mining sequences and for mining patterns under constraints.

10.6 ASSOCIATIVE-CLASSIFICATION METHOD


CMAR is a classification method adapted from the FP growth method for the generation of frequent itemsets. The main reason we include the CMAR methodology in this chapter is its FP growth roots, but it also offers the possibility of comparing CMAR's accuracy and efficiency with those of the C4.5 methodology.

Suppose data samples are given with n attributes (A1, A2, … , An). Attributes can be categorical or continuous. For a continuous attribute, we assume that its values are discretized into intervals in the preprocessing phase. A training data set T is a set of samples such that for each sample there exists a class label associated with it. Let C = {c1, c2, … , cm} be a finite set of class labels.

In general, a pattern P = {a1, a2, … , ak} is a set of attribute values for different attributes (1 ≤ k ≤ n). A sample is said to match the pattern P if it has all the attribute values given in the pattern. For a rule R: P → c, the number of data samples matching pattern P and having class label c is called the support of rule R, denoted sup(R). The ratio of the number of samples matching pattern P and having class label c versus the total number of samples matching pattern P is called the confidence of R, denoted conf(R). The associative-classification method (CMAR) consists of two phases:

1. rule generation or training, and

2. classification or testing.

In the first, rule-generation phase, CMAR computes the complete set of rules of the form R: P → c such that sup(R) and conf(R) pass the given thresholds. For a given support threshold and confidence threshold, the associative-classification method finds the complete set of class-association rules (CAR) passing the thresholds. In the testing phase, when a new (unclassified) sample arrives, the classifier, represented by a set of association rules, selects the rule that matches the sample and has the highest confidence, and uses it to predict the class of the new sample.
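The testing phase described above, in its simplest form, can be sketched as follows. The rule set and the sample are purely illustrative (the confidences are invented for the example); the selection logic, picking the matching rule with the highest confidence, is what the text describes:

```python
# Illustrative mined rules: (pattern, predicted class, confidence).
rules = [
    ({"A": "a1"}, "c1", 0.75),
    ({"A": "a1", "B": "b2"}, "c2", 0.80),
    ({"B": "b1"}, "c1", 0.90),
]

def classify(sample, rules):
    # Collect the rules whose pattern the sample matches,
    # then return the class of the highest-confidence one.
    matching = [(conf, label) for pattern, label, conf in rules
                if all(sample.get(a) == v for a, v in pattern.items())]
    if not matching:
        return None  # no rule covers the sample
    return max(matching)[1]

print(classify({"A": "a1", "B": "b2"}, rules))  # 'c2'
```

The sample matches the first two rules; the second has the higher confidence (0.80), so its class c2 is predicted.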

We will illustrate the basic steps of the algorithm through one simple example. Suppose that for a given training data set T, as shown in Table 10.3, the support threshold is 2 and the confidence threshold is 70%.

TABLE 10.3. Training Database T for the CMAR Algorithm

First, CMAR scans the training data set and finds the set of attribute values whose support passes the threshold (occurring at least twice in our database). One simple approach is to sort each attribute and to find all frequent values. For our database T, this is the set F = {a1, b2, c1, d3}, and it is called a frequent itemset. All other attribute values fail the support threshold. Then, CMAR sorts the attribute values in F in support-descending order, that is, F-list = (a1, b2, c1, d3).
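This first scan can be sketched as below. Since Table 10.3 is not reproduced here, the samples are hypothetical, chosen so that the frequent values and their support ordering match the F-list given in the text:

```python
from collections import Counter

# Hypothetical samples standing in for Table 10.3 (attribute -> value).
samples = [
    {"A": "a1", "B": "b2", "C": "c1", "D": "d3"},
    {"A": "a1", "B": "b2", "C": "c1", "D": "d3"},
    {"A": "a1", "B": "b2", "C": "c1", "D": "d1"},
    {"A": "a1", "B": "b2", "C": "c2", "D": "d2"},
    {"A": "a1", "B": "b1", "C": "c3", "D": "d4"},
]
support_threshold = 2

# Count every attribute value across the data set, keep the frequent ones,
# and order them by descending support to obtain the F-list.
counts = Counter(v for s in samples for v in s.values())
f_list = [v for v, c in counts.most_common() if c >= support_threshold]
print(f_list)  # ['a1', 'b2', 'c1', 'd3']
```

Here a1 occurs 5 times, b2 occurs 4 times, c1 occurs 3 times, and d3 occurs twice; all other values occur once and fail the threshold, reproducing F-list = (a1, b2, c1, d3).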

Now, CMAR scans the training data set again to construct an FP tree. The FP tree is a prefix tree with respect to the F-list. For each sample in a training data
