Data Mining: Concepts and Techniques - Jiawei Han
Equation (6.8) is equivalent to P(B|A)/P(B), or conf(A ⇒ B)/sup(B), which is also referred to as the lift of the association (or correlation) rule A ⇒ B. In other words, it assesses the degree to which the occurrence of one "lifts" the occurrence of the other. For example, if A corresponds to the sale of computer games and B corresponds to the sale of videos, then given the current market conditions, the sale of games is said to increase or "lift" the likelihood of the sale of videos by a factor of the value returned by Eq. (6.8).
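As a quick numerical check of this equivalence, the sketch below computes lift both ways, once directly from Eq. (6.8) and once as conf(A ⇒ B)/sup(B). The probabilities used are illustrative placeholders, not values from the text:

```python
# Illustrative (hypothetical) supports; any consistent values would do.
p_a = 0.3    # sup(A) = P(A)
p_b = 0.5    # sup(B) = P(B)
p_ab = 0.2   # sup(A ∪ B) = P(A ∪ B), fraction of transactions with both

# Eq. (6.8): lift(A, B) = P(A ∪ B) / (P(A) P(B))
lift_direct = p_ab / (p_a * p_b)

# Equivalent form: conf(A => B) / sup(B) = P(B|A) / P(B)
conf_a_b = p_ab / p_a
lift_via_conf = conf_a_b / p_b

print(lift_direct, lift_via_conf)  # the two values are identical
```

Both expressions agree because conf(A ⇒ B) = P(A ∪ B)/P(A), so dividing by sup(B) recovers Eq. (6.8).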
Let's go back to the computer game and video data of Example 6.7.
Example 6.8 Correlation analysis using lift
To help filter out misleading "strong" associations of the form A ⇒ B from the data of Example 6.7, we need to study how the two itemsets, A and B, are correlated. Let ¬game refer to the transactions of Example 6.7 that do not contain computer games, and ¬video refer to those that do not contain videos. The transactions can be summarized in a contingency table, as shown in Table 6.6.
Table 6.6 2 × 2 Contingency Table Summarizing the Transactions with Respect to Game and Video Purchases

             game    ¬game   Σ_row
  video      4000    3500    7500
  ¬video     2000     500    2500
  Σ_col      6000    4000   10,000
From the table, we can see that the probability of purchasing a computer game is P({game}) = 0.60, the probability of purchasing a video is P({video}) = 0.75, and the probability of purchasing both is P({game, video}) = 0.40. By Eq. (6.8), the lift of Rule (6.6) is P({game, video})/(P({game}) × P({video})) = 0.40/(0.60 × 0.75) = 0.89. Because this value is less than 1, there is a negative correlation between the occurrence of {game} and {video}. The numerator is the likelihood of a customer purchasing both, while the denominator is what the likelihood would have been if the two purchases were completely independent. Such a negative correlation cannot be identified by a support–confidence framework.
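The lift computation above can be reproduced directly from the Table 6.6 counts; a minimal sketch:

```python
# Counts from Table 6.6 (game/video transactions).
n = 10_000        # total transactions
n_game = 6_000    # transactions containing a computer game
n_video = 7_500   # transactions containing a video
n_both = 4_000    # transactions containing both

p_game = n_game / n      # 0.60
p_video = n_video / n    # 0.75
p_both = n_both / n      # 0.40

# Eq. (6.8): lift < 1 indicates negative correlation.
lift = p_both / (p_game * p_video)
print(round(lift, 2))  # 0.89
```

Since 0.89 < 1, the code reaches the same conclusion as the text: {game} and {video} are negatively correlated.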
The second correlation measure that we study is the χ² measure, which was introduced in Chapter 3 (Eq. 3.1). To compute the χ² value, we take the squared difference between the observed and expected value for a slot (A and B pair) in the contingency table, divided by the expected value. This amount is summed for all slots of the contingency table. Let's perform a χ² analysis of Example 6.8.
Example 6.9 Correlation analysis using χ²
To compute the correlation using χ² analysis for nominal data, we need the observed value and expected value (displayed in parentheses) for each slot of the contingency table, as shown in Table 6.7. From the table, we can compute the χ² value as follows:

χ² = Σ (observed − expected)²/expected
   = (4000 − 4500)²/4500 + (3500 − 3000)²/3000 + (2000 − 1500)²/1500 + (500 − 1000)²/1000
   = 55.56 + 83.33 + 166.67 + 250.00 = 555.6.
Because the χ² value is greater than 1, and the observed value of the slot (game, video) = 4000, which is less than the expected value of 4500, buying game and buying video are negatively correlated. This is consistent with the conclusion derived from the analysis of the lift measure in Example 6.8.
Table 6.7 Table 6.6 Contingency Table, Now with the Expected Values

             game          ¬game         Σ_row
  video      4000 (4500)   3500 (3000)   7500
  ¬video     2000 (1500)    500 (1000)   2500
  Σ_col      6000          4000          10,000
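The expected counts in Table 6.7 follow from the independence assumption (row total × column total / grand total), and the χ² statistic sums the normalized squared deviations over all four slots. A short sketch of that computation:

```python
# Observed counts from Table 6.6, rows = (video, ¬video), cols = (game, ¬game).
observed = [[4000, 3500],
            [2000,  500]]

grand = sum(sum(row) for row in observed)           # 10,000
row_totals = [sum(row) for row in observed]         # [7500, 2500]
col_totals = [sum(col) for col in zip(*observed)]   # [6000, 4000]

chi2 = 0.0
for i in range(2):
    for j in range(2):
        # Expected count under independence: row_total * col_total / grand.
        expected = row_totals[i] * col_totals[j] / grand
        chi2 += (observed[i][j] - expected) ** 2 / expected

print(round(chi2, 2))  # 555.56
```

The result matches the hand computation in the text (555.6 after rounding), and the (game, video) slot's observed 4000 < expected 4500 signals the negative correlation.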
6.3.3. A Comparison of Pattern Evaluation Measures
The above discussion shows that instead of using the simple support–confidence framework to evaluate frequent patterns, other measures, such as lift and χ², often disclose more intrinsic pattern relationships. How effective are these measures? Should we also consider other alternatives?
Researchers have studied many pattern evaluation measures even before the start of in-depth research on scalable methods for mining frequent patterns. Recently, several other pattern evaluation measures have attracted interest. In this subsection, we present four such measures: all_confidence, max_confidence, Kulczynski, and cosine. We'll then compare their effectiveness with respect to one another and with respect to the lift and χ² measures.
Given two itemsets, A and B, the all_confidence measure of A and B is defined as

all_conf(A, B) = sup(A ∪ B) / max{sup(A), sup(B)}  (6.9)

where max{sup(A), sup(B)} is the maximum support of the itemsets A and B. Thus, all_conf(A, B) is also the minimum