Online Book Reader

Data Mining_ Concepts and Techniques - Jiawei Han [168]

By Root 1710 0

mc to m ratio, that is, 100/1100 = 9.1%. However, lift and $\chi^2$ both contradict this in an incorrect way: Their values for D2 are between those for D1 and D3.

For data set D4, both lift and $\chi^2$ indicate a highly positive association between m and c, whereas the others indicate a “neutral” association because the ratio of mc to $\bar{m}c$ equals the ratio of mc to $m\bar{c}$, which is 1. This means that if a customer buys coffee (or milk), the probability that he or she will also purchase milk (or coffee) is exactly 50%.

“Why are lift and $\chi^2$ so poor at distinguishing pattern association relationships in the previous transactional data sets?” To answer this, we have to consider the null-transactions. A null-transaction is a transaction that does not contain any of the itemsets being examined. In our example, $\overline{mc}$ represents the number of null-transactions. Lift and $\chi^2$ have difficulty distinguishing interesting pattern association relationships because they are both strongly influenced by $\overline{mc}$. Typically, the number of null-transactions can outweigh the number of individual purchases because, for example, many people may buy neither milk nor coffee. On the other hand, the other four measures are good indicators of interesting pattern associations because their definitions remove the influence of $\overline{mc}$ (i.e., they are not influenced by the number of null-transactions).

This discussion shows that it is highly desirable to have a measure whose value is independent of the number of null-transactions. A measure is null-invariant if its value is free from the influence of null-transactions. Null-invariance is an important property for measuring association patterns in large transaction databases. Among the six measures discussed in this subsection, only lift and $\chi^2$ are not null-invariant.
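The effect of null-transactions can be checked numerically. The following is a minimal Python sketch, not code from the book: the function name `association_measures` and the counts are illustrative assumptions, chosen so that two data sets differ only in their number of null-transactions. It computes the six measures from a 2 × 2 contingency table, using the standard conditional-probability forms of all_confidence (minimum), max_confidence (maximum), Kulczynski (average), and cosine (geometric mean).

```python
from math import sqrt


def association_measures(mc, coffee_only, milk_only, nulls):
    """Six association measures from 2x2 contingency counts.

    mc:          transactions containing both milk and coffee
    coffee_only: transactions with coffee but no milk
    milk_only:   transactions with milk but no coffee
    nulls:       transactions with neither item (null-transactions)
    """
    n = mc + coffee_only + milk_only + nulls
    sup_m = mc + milk_only      # transactions containing milk
    sup_c = mc + coffee_only    # transactions containing coffee
    p_c_given_m = mc / sup_m
    p_m_given_c = mc / sup_c

    # Pearson chi-square over the four cells (observed vs. expected counts).
    observed = [mc, coffee_only, milk_only, nulls]
    expected = [
        sup_m * sup_c / n,
        (n - sup_m) * sup_c / n,
        sup_m * (n - sup_c) / n,
        (n - sup_m) * (n - sup_c) / n,
    ]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    return {
        "lift": mc * n / (sup_m * sup_c),   # depends on n, hence on nulls
        "chi2": chi2,                       # depends on n, hence on nulls
        "all_conf": min(p_c_given_m, p_m_given_c),
        "max_conf": max(p_c_given_m, p_m_given_c),
        "kulc": 0.5 * (p_c_given_m + p_m_given_c),
        "cosine": sqrt(p_c_given_m * p_m_given_c),
    }


# Illustrative counts: identical except for the number of null-transactions.
few_nulls = association_measures(10_000, 1_000, 1_000, 100)
many_nulls = association_measures(10_000, 1_000, 1_000, 100_000)

for name in few_nulls:
    print(f"{name:9s} {few_nulls[name]:12.3f} {many_nulls[name]:12.3f}")
```

Under these assumed counts, the four null-invariant measures are identical in both columns, while lift and chi-square change sharply as the null count grows.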

“Among the all_confidence, max_confidence, Kulczynski, and cosine measures, which is best at indicating interesting pattern relationships?”

To answer this question, we introduce the imbalance ratio (IR), which assesses the imbalance of two itemsets, A and B, in rule implications. It is defined as

$$IR(A, B) = \frac{|sup(A) - sup(B)|}{sup(A) + sup(B) - sup(A \cup B)} \tag{6.13}$$

where the numerator is the absolute value of the difference between the supports of the itemsets A and B, and the denominator is the number of transactions containing A or B. If the two directional implications between A and B are the same, then IR(A, B) will be zero. Otherwise, the larger the difference between the two, the larger the imbalance ratio. This ratio is independent of the number of null-transactions and independent of the total number of transactions.
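As a rough illustration of the definition above, the imbalance ratio can be computed directly from three support counts. This is a sketch under stated assumptions: the function name `imbalance_ratio` and the example counts are hypothetical, and `sup_ab` stands for the number of transactions containing both A and B.

```python
def imbalance_ratio(sup_a, sup_b, sup_ab):
    """Imbalance ratio of itemsets A and B.

    sup_a:  number of transactions containing A
    sup_b:  number of transactions containing B
    sup_ab: number of transactions containing both A and B
    """
    # Inclusion-exclusion: transactions containing A or B.
    containing_a_or_b = sup_a + sup_b - sup_ab
    if containing_a_or_b <= 0:
        raise ValueError("no transaction contains A or B")
    return abs(sup_a - sup_b) / containing_a_or_b


# Hypothetical counts: equal supports give a perfectly balanced case.
print(imbalance_ratio(2_000, 2_000, 1_000))

# Hypothetical counts: A is ten times as frequent as B, so IR approaches 1.
print(imbalance_ratio(11_000, 1_100, 1_000))
```

The function takes no total transaction count and no null count, which mirrors the stated independence from both: scaling all three supports by the same factor leaves the ratio unchanged.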

Let's continue examining the remaining data sets in Example 6.10.

Comparing null-invariant measures in pattern evaluation

Although the four measures introduced in this section are null-invariant, they may present dramatically different values on some subtly different data sets. Let's examine data sets D5 and D6, shown earlier in Table 6.9, where the two events m and c have unbalanced conditional probabilities. That is, the ratio of mc to c is greater than 0.9. This means that knowing that c occurs should strongly suggest that m occurs also. The ratio of mc to m is less than 0.1, indicating that m implies that c is quite unlikely to occur. The all_confidence and cosine measures view both cases as negatively associated and the Kulc measure views both as neutral. The max_confidence measure claims strong positive associations for these cases. The measures give very diverse results!

“Which measure intuitively reflects the true relationship between the purchase of milk and coffee?” Due to the “balanced” skewness of the data, it is difficult to argue whether the two data sets have positive or negative association. From one point of view, only $mc/(mc + m\bar{c}) = 1000/(1000 + 10{,}000) = 9.09\%$ of milk-related transactions contain coffee in D5 and this percentage is $1000/(1000 + 100{,}000) = 0.99\%$ in D6, both indicating a negative association. On the other hand, 90.9% of transactions in D5 (i.e., $mc/(mc + \bar{m}c) = 1000/(1000 + 100)$) and 99% in D6 (i.e., $1000/(1000 + 10)$) containing coffee contain milk as well, which indicates a positive association between milk and coffee. The two viewpoints lead to very different conclusions.
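The disagreement among the four null-invariant measures can be reproduced with a few lines of Python. This sketch assumes the D5 and D6 counts are those of Table 6.9 (mc = 1000 in both; coffee without milk = 100 and 10; milk without coffee = 10,000 and 100,000); if your copy of the table differs, treat the numbers as illustrative. The names `DATASETS` and `null_invariant_measures` are introduced here for the example only.

```python
from math import sqrt

# Assumed counts per data set: (mc, coffee without milk, milk without coffee).
# Null-transactions are omitted because none of these measures uses them.
DATASETS = {
    "D5": (1_000, 100, 10_000),
    "D6": (1_000, 10, 100_000),
}


def null_invariant_measures(mc, coffee_only, milk_only):
    """Directional confidences and the four null-invariant measures."""
    p_c_given_m = mc / (mc + milk_only)    # share of milk buyers who buy coffee
    p_m_given_c = mc / (mc + coffee_only)  # share of coffee buyers who buy milk
    return {
        "P(c|m)": p_c_given_m,
        "P(m|c)": p_m_given_c,
        "all_conf": min(p_c_given_m, p_m_given_c),
        "max_conf": max(p_c_given_m, p_m_given_c),
        "kulc": 0.5 * (p_c_given_m + p_m_given_c),
        "cosine": sqrt(p_c_given_m * p_m_given_c),
    }


for name, counts in DATASETS.items():
    row = null_invariant_measures(*counts)
    print(name, {key: round(value, 3) for key, value in row.items()})
```

Under these assumed counts, Kulc sits at 0.5 for both data sets, max_confidence is above 0.9, and all_confidence is below 0.1, which matches the neutral, positive, and negative readings described in the text.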

For such “balanced” skewness, it could be fair to treat it
