Data Mining - Mehmed Kantardzic [164]
10.5 FP GROWTH METHOD
Let us define one of the most important problems with scalability of the Apriori algorithm. To generate one FP of length 100, such as {a1, a2, … , a100}, the number of candidates that has to be generated will be at least
and it will require hundreds of database scans. The complexity of the computation increases exponentially. That is only one of the many factors that influence the development of several new algorithms for association-rule mining.
FP growth method is an efficient way of mining frequent itemsets in large databases. The algorithm mines frequent itemsets without the time-consuming candidate-generation process that is essential for Apriori. When the database is large, FP growth first performs a database projection of the frequent items; it then switches to mining the main memory by constructing a compact data structure called the FP tree. For an explanation of the algorithm, we will use the transactional database in Table 10.2 and the minimum support threshold of 3.
TABLE 10.2. The Transactional Database T
TID Itemset
01 f, a, c, d, g, i, m, p
02 a, b, c, f, l, m, o
03 b, f, h, j, o
04 b, c, k, s, p
05 a, f, c, e, l, p, m, n
First, a scan of the database T derives a list L of frequent items occurring three or more than three times in the database. These are the items (with their supports):
The items are listed in descending order of frequency. This ordering is important since each path of the FP tree will follow this order.
Second, the root of the tree, labeled ROOT, is created. The database T is scanned a second time. The scan of the first transaction leads to the construction of the first branch of the FP tree: {(f,1), (c,1), (a,1), (m,1), (p,1)}. Only those items that are in the list of frequent items L are selected. The indices for nodes in the branch (all are 1) represent the cumulative number of samples at this node in the tree, and of course, after the first sample, all are 1. The order of the nodes is not as in the sample but as in the list of frequent items L. For the second transaction, because it shares items f, c, and a it shares the prefix {f, c, a} with the previous branch and extends to the new branch {(f, 2), (c, 2), (a, 2), (m, 1), (p, 1)}, increasing the indices for the common prefix by one. The new intermediate version of the FP tree, after two samples from the database, is given in Figure 10.6a. The remaining transactions can be inserted similarly, and the final FP tree is given in Figure 10.6b.
Figure 10.6. FP tree for the database T in Table 10.2. (a) FP tree after two samples; (b) final FT tree.
To facilitate tree traversal, an item header table is built, in which each item in list L connects nodes in the FP tree with its values through node links. All f nodes are connected in one list, all c nodes in the other, and so on. For simplicity of representation only the list for b nodes is given in Figure 10.6b. Using the compact-tree structure, the FP growth algorithm mines the complete set of frequent itemsets.
According to the list L of frequent items, the complete set of frequent itemsets can be divided into subsets (six for our example) without overlap: (1) frequent itemsets having item p (the end of list L); (2) the itemsets having item m but not p; (3) the frequent itemsets with b and without both m and p; … ; and (6) the large itemsets only with f. This classification is valid for our example, but the same principles can be applied to other databases and other L lists.
Based on node-link connection, we collect all the transactions that p participates in by starting from the header table of p and following p’s node links. In our example, two paths will be selected in the FP tree: {(f,4), (c,3), (a,3), (m,2), (p,2)} and {(c,1), (b,1), (p,1)}, where samples with a frequent item p are {(f,2), (c,2), (a,2), (m,2), (p,2)}, and {(c,1), (b,1), (p,1)}. The