Data Mining_ Concepts and Techniques - Jiawei Han [167]
Given two itemsets, A and B, the max_confidence measure of A and B is defined as
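$$\mathrm{max\_conf}(A, B) = \max\{P(A \mid B),\ P(B \mid A)\}. \tag{6.10}$$
The max_conf measure is the maximum confidence of the two association rules, "$A \Rightarrow B$" and "$B \Rightarrow A$."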
Given two itemsets, A and B, the Kulczynski measure of A and B (abbreviated as Kulc) is defined as
$$\mathrm{Kulc}(A, B) = \frac{1}{2}\bigl(P(A \mid B) + P(B \mid A)\bigr). \tag{6.11}$$
It was proposed in 1927 by Polish mathematician S. Kulczynski. It can be viewed as an average of two confidence measures. That is, it is the average of two conditional probabilities: the probability of itemset B given itemset A, and the probability of itemset A given itemset B.
Finally, given two itemsets, A and B, the cosine measure of A and B is defined as
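$$\mathrm{cosine}(A, B) = \frac{P(A \cup B)}{\sqrt{P(A) \times P(B)}} = \frac{\mathrm{sup}(A \cup B)}{\sqrt{\mathrm{sup}(A) \times \mathrm{sup}(B)}} = \sqrt{P(A \mid B) \times P(B \mid A)}. \tag{6.12}$$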
The cosine measure can be viewed as a harmonized lift measure: The two formulae are similar except that for cosine, the square root is taken on the product of the probabilities of A and B. This is an important difference, however, because by taking the square root, the cosine value is only influenced by the supports of A, B, and $A \cup B$, and not by the total number of transactions.
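To make the effect of the square root concrete, the following sketch rewrites both measures in terms of absolute counts. The symbols $n_A$, $n_B$, and $n_{AB}$ (the numbers of transactions containing A, B, and both) and $N$ (the total number of transactions) are notation assumed here for illustration, with $P(X) = n_X / N$; lift is taken in its usual form, $P(A \cup B) / (P(A)P(B))$.

```latex
\begin{align*}
\mathrm{lift}(A, B)   &= \frac{n_{AB}/N}{(n_A/N)\,(n_B/N)}
                       = N \cdot \frac{n_{AB}}{n_A\, n_B}, \\
\mathrm{cosine}(A, B) &= \frac{n_{AB}/N}{\sqrt{(n_A/N)\,(n_B/N)}}
                       = \frac{n_{AB}}{\sqrt{n_A\, n_B}}.
\end{align*}
```

Under these assumptions, a factor of $N$ survives in lift but cancels in cosine, which is why only cosine is unaffected by the total number of transactions.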
Each of these four measures (all_conf, max_conf, Kulc, and cosine) has the following property: Its value is only influenced by the supports of A, B, and $A \cup B$, or more exactly, by the conditional probabilities $P(A \mid B)$ and $P(B \mid A)$, but not by the total number of transactions. Another common property is that each measure ranges from 0 to 1, and the higher the value, the closer the relationship between A and B.
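As a minimal sketch of this property, the four measures can be computed from the two conditional probabilities alone. The function name `null_invariant_measures`, its parameter names, and the counts used below are illustrative assumptions, not part of the original text.

```python
import math


def null_invariant_measures(sup_a, sup_b, sup_ab):
    """Compute all_conf, max_conf, Kulc, and cosine from supports.

    sup_a, sup_b, sup_ab are the supports of A, B, and A ∪ B. They may be
    absolute counts or relative frequencies, as long as all three use the
    same scale (hypothetical helper, for illustration only).
    """
    if sup_a <= 0 or sup_b <= 0:
        raise ValueError("sup_a and sup_b must be positive")
    p_b_given_a = sup_ab / sup_a  # confidence of A => B
    p_a_given_b = sup_ab / sup_b  # confidence of B => A
    return {
        "all_conf": min(p_a_given_b, p_b_given_a),
        "max_conf": max(p_a_given_b, p_b_given_a),
        "kulc": 0.5 * (p_a_given_b + p_b_given_a),
        "cosine": math.sqrt(p_a_given_b * p_b_given_a),
    }


# Illustrative counts: 11,000 transactions contain A, 11,000 contain B,
# and 10,000 contain both.
from_counts = null_invariant_measures(11_000, 11_000, 10_000)

# The same itemsets expressed as relative supports under two different
# (assumed) totals. The total cancels in every ratio.
for total in (12_100, 112_000):
    from_relative = null_invariant_measures(
        11_000 / total, 11_000 / total, 10_000 / total
    )
    for name, value in from_counts.items():
        assert math.isclose(value, from_relative[name])

for name, value in from_counts.items():
    print(f"{name}: {value:.2f}")
```

With these counts every measure comes out at about 0.91, whichever total is assumed, because the total never enters the two conditional probabilities.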
Now, together with lift and $\chi^2$, we have introduced six pattern evaluation measures in total. You may wonder, “Which is the best in assessing the discovered pattern relationships?” To answer this question, we examine their performance on some typical data sets.
Comparison of six pattern evaluation measures on typical data sets
The relationships between the purchases of two items, milk and coffee, can be examined by summarizing their purchase history in Table 6.8, a 2 × 2 contingency table, where an entry such as mc represents the number of transactions containing both milk and coffee.
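Table 6.8 2 × 2 Contingency Table for Two Items

|  | milk | $\overline{milk}$ | $\Sigma_{row}$ |
|---|---|---|---|
| coffee | $mc$ | $\overline{m}c$ | $c$ |
| $\overline{coffee}$ | $m\overline{c}$ | $\overline{mc}$ | $\overline{c}$ |
| $\Sigma_{col}$ | $m$ | $\overline{m}$ | $\Sigma$ |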
Table 6.9 shows a set of transactional data sets with their corresponding contingency tables and the associated values for each of the six evaluation measures. Let's first examine the first four data sets, D1 through D4. From the table, we see that m and c are positively associated in D1 and D2, negatively associated in D3, and neutral in D4. For D1 and D2, m and c are positively associated because $mc$ (10,000) is considerably greater than $\overline{m}c$ (1000) and $m\overline{c}$ (1000). Intuitively, for people who bought milk ($m = 10{,}000 + 1000 = 11{,}000$), it is very likely that they also bought coffee ($mc/m = 10/11 \approx 91\%$), and vice versa.
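Table 6.9 Comparison of Six Pattern Evaluation Measures Using Contingency Tables for a Variety of Data Sets

| Data Set | $mc$ | $\overline{m}c$ | $m\overline{c}$ | $\overline{mc}$ | $\chi^2$ | lift | all_conf. | max_conf. | Kulc. | cosine |
|---|---|---|---|---|---|---|---|---|---|---|
| D1 | 10,000 | 1000 | 1000 | 100,000 | 90557 | 9.26 | 0.91 | 0.91 | 0.91 | 0.91 |
| D2 | 10,000 | 1000 | 1000 | 100 | 0 | 1 | 0.91 | 0.91 | 0.91 | 0.91 |
| D3 | 100 | 1000 | 1000 | 100,000 | 670 | 8.44 | 0.09 | 0.09 | 0.09 | 0.09 |
| D4 | 1000 | 1000 | 1000 | 100,000 | 24740 | 25.75 | 0.5 | 0.5 | 0.5 | 0.5 |
| D5 | 1000 | 100 | 10,000 | 100,000 | 8173 | 9.18 | 0.09 | 0.91 | 0.5 | 0.29 |
| D6 | 1000 | 10 | 100,000 | 100,000 | 965 | 1.97 | 0.01 | 0.99 | 0.5 | 0.10 |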
The results of the four newly introduced measures show that m and c are strongly positively associated in both data sets by producing a measure value of 0.91. However, lift and $\chi^2$ generate dramatically different measure values for D1 and D2 due to their sensitivity to $\overline{mc}$, the number of transactions containing neither milk nor coffee. In fact, in many real-world scenarios, $\overline{mc}$ is usually huge and unstable. For example, in a market basket database, the total number of transactions could fluctuate on a daily basis and overwhelmingly exceed the number of transactions containing any particular itemset. Therefore, a good interestingness measure should not be affected by transactions that do not contain the itemsets of interest; otherwise, it would generate unstable results, as illustrated in D1 and D2.
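The contrast between D1 and D2 can be checked with a short sketch that computes lift and the Pearson $\chi^2$ statistic directly from the four cells of a 2 × 2 contingency table. The function name `lift_and_chi_square` and its parameter names are assumptions made for this illustration; the cell counts are those listed for D1 and D2 in Table 6.9.

```python
def lift_and_chi_square(mc, not_m_c, m_not_c, not_m_not_c):
    """Return (lift, chi_square) for a 2 x 2 milk/coffee contingency table.

    mc          -- transactions with both milk and coffee
    not_m_c     -- transactions with coffee but no milk
    m_not_c     -- transactions with milk but no coffee
    not_m_not_c -- transactions with neither item
    (Hypothetical helper, for illustration only.)
    """
    n = mc + not_m_c + m_not_c + not_m_not_c
    m = mc + m_not_c   # all transactions containing milk
    c = mc + not_m_c   # all transactions containing coffee

    # lift = P(m and c) / (P(m) * P(c)), rearranged to a single division
    lift = (mc * n) / (m * c)

    # Pearson chi-square: sum of (observed - expected)^2 / expected over
    # the four cells, with expected = row total * column total / n
    milk_totals = (m, n - m)
    coffee_totals = (c, n - c)
    observed = ((mc, m_not_c), (not_m_c, not_m_not_c))
    chi_square = 0.0
    for i, row_total in enumerate(milk_totals):
        for j, col_total in enumerate(coffee_totals):
            expected = row_total * col_total / n
            chi_square += (observed[i][j] - expected) ** 2 / expected
    return lift, chi_square


# D1 and D2 differ only in the last cell (neither milk nor coffee).
for name, cells in (
    ("D1", (10_000, 1_000, 1_000, 100_000)),
    ("D2", (10_000, 1_000, 1_000, 100)),
):
    lift, chi_square = lift_and_chi_square(*cells)
    print(f"{name}: lift = {lift:.2f}, chi-square = {chi_square:.0f}")
```

Changing only the count of transactions that contain neither item moves lift from about 9.26 to 1 and $\chi^2$ from about 90557 to 0, while the conditional probabilities $P(c \mid m)$ and $P(m \mid c)$ stay at 10/11 in both data sets.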
Similarly, in D3, the four new measures correctly show that m and c are strongly negatively associated because the m to c ratio equals the