Data Mining - Mehmed Kantardzic
5. Given a transactional database X:

X:  TID   Items
    T01   A, B, C, D
    T02   A, C, D, F
    T03   C, D, E, G, A
    T04   A, D, F, B
    T05   B, C, G
    T06   D, F, G
    T07   A, B, G
    T08   C, D, F, G

Using threshold values of support = 25% and confidence = 60%,

(a) find all large itemsets in database X;

(b) find strong association rules for database X;

(c) analyze misleading associations for the rule set obtained in (b).
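The support and confidence definitions can be checked mechanically before working parts (a)–(c) by hand. A minimal Python sketch (the transactions are copied from database X above; the helper names are my own):

```python
# Database X from the exercise, one set of items per transaction ID.
X = {
    "T01": {"A", "B", "C", "D"},
    "T02": {"A", "C", "D", "F"},
    "T03": {"C", "D", "E", "G", "A"},
    "T04": {"A", "D", "F", "B"},
    "T05": {"B", "C", "G"},
    "T06": {"D", "F", "G"},
    "T07": {"A", "B", "G"},
    "T08": {"C", "D", "F", "G"},
}

def support(db, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db.values()) / len(db)

def confidence(db, lhs, rhs):
    """conf(lhs -> rhs) = supp(lhs U rhs) / supp(lhs)."""
    return support(db, lhs | rhs) / support(db, lhs)

# One candidate rule as an example: A -> D.
print(support(X, {"A", "D"}))       # 0.5  (4 of 8 transactions)
print(confidence(X, {"A"}, {"D"}))  # 0.8  (4 of the 5 transactions with A)
```

Here conf(A → D) = supp(A, D)/supp(A) = (4/8)/(5/8) = 0.8, so this particular rule clears the 60% confidence threshold.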

6. Given a transactional database Y:

Y:  TID   Items
    T01   A1, B1, C2
    T02   A2, C1, D1
    T03   B2, C2, E2
    T04   B1, C1, E1
    T05   A3, C3, E2
    T06   C1, D2, E2

Using the threshold values for support s = 30% and confidence c = 60%,

(a) find all large itemsets in database Y;

(b) if itemsets are organized in a hierarchy so that A = {A1, A2, A3},

B = {B1, B2}, C = {C1, C2, C3}, D = {D1, D2}, and E = {E1, E2}, find large itemsets that are defined on the conceptual level including a hierarchy of items;

(c) find strong association rules for large itemsets in (b).
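For part (b), the concept-level counting can be prototyped by mapping each item to its parent category before computing supports. A hedged sketch (database Y and the hierarchy are from the exercise; it relies on the fact that each parent category is the item's leading letter):

```python
# Database Y from the exercise.
Y = {
    "T01": {"A1", "B1", "C2"},
    "T02": {"A2", "C1", "D1"},
    "T03": {"B2", "C2", "E2"},
    "T04": {"B1", "C1", "E1"},
    "T05": {"A3", "C3", "E2"},
    "T06": {"C1", "D2", "E2"},
}

def to_concept_level(db):
    """Replace each item such as 'C2' by its parent category 'C'."""
    return {tid: {item[0] for item in items} for tid, items in db.items()}

Yc = to_concept_level(Y)
for category in "ABCDE":
    count = sum(category in t for t in Yc.values())
    print(category, count, count / len(Yc))
```

At the concept level every transaction contains C, so category-level supports can only grow; this is why large itemsets may appear in part (b) that have no large counterpart at the item level.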

7. Implement the Apriori algorithm and discover large itemsets in a transactional database.
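One possible shape for this implementation is the classic level-wise loop: count 1-itemsets, then repeatedly join, prune, and count. A sketch, not a tuned implementation:

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for all large itemsets.

    transactions: list of sets; min_count: minimum absolute support count.
    """
    # Pass 1: large 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(large)
    k = 2
    while large:
        # Join step: unite large (k-1)-itemsets that differ in one item.
        candidates = {a | b for a, b in combinations(large, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be large (Apriori property).
        candidates = {c for c in candidates
                      if all(frozenset(s) in large for s in combinations(c, k - 1))}
        counted = {c: sum(c <= t for t in transactions) for c in candidates}
        large = {s: n for s, n in counted.items() if n >= min_count}
        result.update(large)
        k += 1
    return result

# Example: database X of exercise 5, with support 25% of 8 transactions = 2.
db = [{"A", "B", "C", "D"}, {"A", "C", "D", "F"}, {"C", "D", "E", "G", "A"},
      {"A", "D", "F", "B"}, {"B", "C", "G"}, {"D", "F", "G"}, {"A", "B", "G"},
      {"C", "D", "F", "G"}]
large_itemsets = apriori(db, 2)
```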

8. Search the Web to find the basic characteristics of publicly available or commercial software tools for association-rule discovery. Document the results of your search.

9. Given a simple transactional database, find the FP-tree for this database if

(a) support threshold is 5;

(b) support threshold is 3.

TID   Items
1     a b c d
2     a c d f
3     c d e g a
4     a d f b
5     b c g
6     d f g
7     a b g
8     c d f g
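A compact FP-tree builder can serve as a starting point for both thresholds; this sketch omits the header-link table that full FP-growth implementations maintain (the class and function names are my own):

```python
class FPNode:
    """One node of the FP-tree."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # First pass: global item counts, dropping items below the threshold.
    freq = {}
    for t in transactions:
        for item in t:
            freq[item] = freq.get(item, 0) + 1
    freq = {i: c for i, c in freq.items() if c >= min_support}
    # Second pass: insert each filtered, frequency-ordered transaction.
    root = FPNode(None, None)
    for t in transactions:
        path = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in path:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, freq

# The database of exercise 9 above.
db = [{"a", "b", "c", "d"}, {"a", "c", "d", "f"}, {"c", "d", "e", "g", "a"},
      {"a", "d", "f", "b"}, {"b", "c", "g"}, {"d", "f", "g"}, {"a", "b", "g"},
      {"c", "d", "f", "g"}]
root, freq = build_fp_tree(db, 5)
```

At threshold 5 only items a, c, d, and g survive the first pass, so every branch of the tree is built from that reduced alphabet; rebuilding with threshold 3 admits b and f as well.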

10. Given a simple transaction database:

TID   Items
1     X Z V
2     X Y U
3     Y Z V
4     Z V W

Using two iterations of the Apriori algorithm find large 2-itemsets if required support is s ≥ 50%. Show all steps of the algorithm.
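The two iterations can be verified mechanically; the sketch below mirrors the C1 → L1 → C2 → L2 steps (transactions copied from the table above):

```python
from itertools import combinations

db = [{"X", "Z", "V"}, {"X", "Y", "U"}, {"Y", "Z", "V"}, {"Z", "V", "W"}]
min_count = 2  # s >= 50% of 4 transactions

# Iteration 1: C1 is every item; L1 keeps those in at least 2 transactions.
c1 = {i for t in db for i in t}
l1 = {i for i in c1 if sum(i in t for t in db) >= min_count}

# Iteration 2: C2 is every pair drawn from L1; L2 keeps the large pairs.
c2 = [set(p) for p in combinations(sorted(l1), 2)]
l2 = [p for p in c2 if sum(p <= t for t in db) >= min_count]

print(sorted(l1))  # ['V', 'X', 'Y', 'Z']
print(l2)          # the only large 2-itemset is {V, Z}
```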

11. Given a frequent itemset {A, B, C, D, E}, how many possible association rules exist?

12. What are the frequent itemsets with a minimum support of 3 for the given set of transactions?

TID   Items
101   A, B, C, D, E
102   A, C, D
103   D, E
104   B, C, E
105   A, B, D, E
106   A, B
107   B, D, E
108   A, B, D
109   A, D
110   D, E
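With only five distinct items, frequent itemsets can also be enumerated by brute force rather than by level-wise search. A sketch:

```python
from itertools import combinations

# The transactions of the exercise.
db = {
    101: {"A", "B", "C", "D", "E"}, 102: {"A", "C", "D"}, 103: {"D", "E"},
    104: {"B", "C", "E"}, 105: {"A", "B", "D", "E"}, 106: {"A", "B"},
    107: {"B", "D", "E"}, 108: {"A", "B", "D"}, 109: {"A", "D"}, 110: {"D", "E"},
}

def frequent_itemsets(db, min_count):
    """Count every subset of the item alphabet directly (fine for 5 items)."""
    items = sorted(set().union(*db.values()))
    out = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            n = sum(set(combo) <= t for t in db.values())
            if n >= min_count:
                out[frozenset(combo)] = n
    return out

freq = frequent_itemsets(db, 3)
```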

13. Conviction is a measure for analyzing the quality of association rules. The formula for conviction CV in terms of probabilities is given as:

CV(A → B) = P(A) · P(¬B) / P(A, ¬B)

or, in terms of the support and confidence of an association rule:

CV(A → B) = (1 − s(B)) / (1 − c(A → B))

What are the basic characteristics of the conviction measure? Explain the meaning of some characteristic values.
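As a quick numeric illustration of the support/confidence form of the measure (the function name is my own):

```python
def conviction(supp_rhs, conf_rule):
    """CV(A -> B) = (1 - supp(B)) / (1 - conf(A -> B)).

    Diverges when the rule never fails (confidence 1)."""
    if conf_rule == 1:
        return float("inf")
    return (1 - supp_rhs) / (1 - conf_rule)

# Independence: conf(A -> B) equals supp(B), so conviction is exactly 1.
print(conviction(0.5, 0.5))   # 1.0
# A rule that fails half as often as chance predicts: conviction 2.
print(conviction(0.5, 0.75))  # 2.0
```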

14. Consider the dataset given in the table below.

Customer ID   Transaction ID   Items
418           234145           {X, Z}
345           543789           {U, V, W, X, Y, Z}
323           965157           {U, W, Y}
418           489651           {V, X, Z}
567           748965           {U, Y}
567           325687           {W, X, Y}
323           147895           {X, Y, Z}
635           617851           {U, Z}
345           824697           {V, Y}
635           102458           {V, W, X}

(a) Compute the support for itemsets {Y}, {X, Z}, and {X, Y, Z} by treating each transaction ID as a market basket.

(b) Use the results from part (a) to compute the confidence for rules XZ → Y and Y → XZ.

(c) Repeat part (a) by treating each customer ID as a market basket. Each item should be treated as a binary variable (1 if an item appears in at least one transaction bought by the customer, and 0 otherwise).

(d) Use the results from part (c) to compute the confidence for rules XZ → Y and Y → XZ.

(e) Find the FP-tree for this database if the support threshold is 5.
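Parts (a)–(d) hinge on what counts as a basket; both granularities can be cross-checked explicitly (rows copied from the table; helper names are my own):

```python
# (customer_id, transaction_id, items) rows copied from the table.
rows = [
    (418, 234145, {"X", "Z"}), (345, 543789, {"U", "V", "W", "X", "Y", "Z"}),
    (323, 965157, {"U", "W", "Y"}), (418, 489651, {"V", "X", "Z"}),
    (567, 748965, {"U", "Y"}), (567, 325687, {"W", "X", "Y"}),
    (323, 147895, {"X", "Y", "Z"}), (635, 617851, {"U", "Z"}),
    (345, 824697, {"V", "Y"}), (635, 102458, {"V", "W", "X"}),
]

def support(baskets, itemset):
    return sum(itemset <= b for b in baskets) / len(baskets)

# (a) Each transaction ID is one basket.
tx_baskets = [items for _, _, items in rows]

# (c) Each customer ID is one basket: the union of that customer's items.
by_customer = {}
for cust, _, items in rows:
    by_customer.setdefault(cust, set()).update(items)
cust_baskets = list(by_customer.values())

targets = ({"Y"}, {"X", "Z"}, {"X", "Y", "Z"})
print([support(tx_baskets, s) for s in targets])    # supports for (a)
print([support(cust_baskets, s) for s in targets])  # supports for (c)
```

The confidences in (b) and (d) then follow as ratios, e.g. conf(XZ → Y) = supp(X, Y, Z)/supp(X, Z) at either granularity.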

15. A collection of market-basket data has 100,000 frequent items and 1,000,000 infrequent items. Each pair of frequent items appears 100 times; each pair consisting of one frequent and one infrequent item appears 10 times; and each pair of infrequent items appears once. Answer each of the following questions. Your answers only have to be correct to within 1%, and for convenience, you may optionally use scientific notation, for example, 3.14 × 10^8 instead of 314,000,000.

(a) What is the total number of pair occurrences? That is, what is the sum of the counts of all pairs?

(b) We did not state the support threshold, but the given information lets us put bounds on the support threshold s. What are the tightest upper and lower bounds on s?
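Part (a) is a direct count over the three kinds of pairs; the arithmetic can be sketched in Python:

```python
from math import comb

n_freq, n_infreq = 100_000, 1_000_000

# Sum pair counts over the three kinds of pairs described above.
total = (100 * comb(n_freq, 2)      # both items frequent
         + 10 * n_freq * n_infreq   # one frequent, one infrequent
         + 1 * comb(n_infreq, 2))   # both items infrequent
print(f"{total:.3e}")  # close to 2 x 10^12
```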

10.9 REFERENCES FOR FURTHER STUDY


Adamo, J., Data Mining for Association Rules and Sequential Patterns, Springer, Berlin, 2001.

This book presents a collection of algorithms for data
