Data Mining_ Concepts and Techniques - Jiawei Han [181]
Using Statistical Theory to Disclose Exceptional Behavior
It is possible to discover quantitative association rules that disclose exceptional behavior, where “exceptional” is defined based on a statistical theory. For example, the following association rule may indicate exceptional behavior:
sex = female ⇒ wage: mean = $7.90/hr (overall mean = $9.02/hr)    (7.9)
This rule states that the average wage for females is only $7.90/hr. This rule is (subjectively) interesting because it reveals a group of people earning a significantly lower wage than the average wage of $9.02/hr. (If the average wage was close to $7.90/hr, then the fact that females also earn $7.90/hr would be “uninteresting.”)
An integral aspect of our definition involves applying statistical tests to confirm the validity of our rules. That is, Rule (7.9) is only accepted if a statistical test (in this case, a Z-test) confirms that with high confidence it can be inferred that the mean wage of the female population is indeed lower than the mean wage of the rest of the population. (The above rule was mined from a real database based on a 1985 U.S. census.)
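The validation step can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the original study. It assumes a large-sample two-sample Z-test with a one-sided alternative (the subset mean is lower than the complement mean). The function name `z_test_lower_mean` and the synthetic wage samples are hypothetical and are not drawn from the census data mentioned above.

```python
import math
import random
from statistics import mean, variance


def z_test_lower_mean(subset, complement):
    """Two-sample Z-test for H1: mean(subset) < mean(complement).

    Returns (z, p), where p is the one-sided (lower-tail) p-value.
    Assumes both samples are large enough for the normal approximation.
    """
    n1, n2 = len(subset), len(complement)
    # Standard error of the difference between the two sample means.
    se = math.sqrt(variance(subset) / n1 + variance(complement) / n2)
    z = (mean(subset) - mean(complement)) / se
    # Standard normal CDF at z, computed from the error function.
    p = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return z, p


# Synthetic hourly wages (hypothetical values, fixed seed for repeatability).
rng = random.Random(0)
subset_wages = [rng.gauss(7.9, 2.0) for _ in range(200)]
complement_wages = [rng.gauss(9.5, 2.0) for _ in range(200)]

z, p = z_test_lower_mean(subset_wages, complement_wages)
alpha = 0.05  # significance level (an assumed choice)
rule_accepted = p < alpha
print(f"z = {z:.2f}, p = {p:.3g}, rule accepted: {rule_accepted}")
```

A rule such as (7.9) would be kept only when `rule_accepted` is true; otherwise the observed gap between the two means could plausibly be due to chance.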
An association rule under the new definition is a rule of the form:
population_subset ⇒ mean of values for the subset    (7.10)
where the mean of the subset is significantly different from the mean of its complement in the database (and this is validated by an appropriate statistical test).
7.2.4. Mining Rare Patterns and Negative Patterns
All the methods presented so far in this chapter have been for mining frequent patterns. Sometimes, however, it is interesting to find patterns that are rare instead of frequent, or patterns that reflect a negative correlation between items. These patterns are respectively referred to as rare patterns and negative patterns. In this subsection, we consider various ways of defining rare patterns and negative patterns, which are also useful to mine.
Rare patterns and negative patterns
In jewelry sales data, sales of diamond watches are rare; however, patterns involving the selling of diamond watches could be interesting. In supermarket data, if we find that customers frequently buy Coca-Cola Classic or Diet Coke but not both, then buying Coca-Cola Classic and buying Diet Coke together is considered a negative (correlated) pattern. In car sales data, a dealer sells a few fuel-thirsty vehicles (e.g., SUVs) to a given customer, and then later sells hybrid mini-cars to the same customer. Even though buying SUVs and buying hybrid mini-cars may be negatively correlated events, it can be interesting to discover and examine such exceptional cases.
An infrequent (or rare) pattern is a pattern with a frequency support that is below (or far below) a user-specified minimum support threshold. However, since the occurrence frequencies of the majority of itemsets are usually below or even far below the minimum support threshold, it is desirable in practice for users to specify other conditions for rare patterns. For example, if we want to find patterns containing at least one item with a value that is over $500, we should specify such a constraint explicitly. Efficient mining of such itemsets is discussed under mining multidimensional associations (Section 7.2.1), where the strategy is to adopt multiple (e.g., item- or group-based) minimum support thresholds. Other applicable methods are discussed under constraint-based pattern mining (Section 7.3), where user-specified constraints are pushed deep into the iterative mining process.
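To make the role of the explicit constraint concrete, the following brute-force sketch lists itemsets whose support is below a threshold and that contain at least one item priced over $500. It only illustrates the definition and is not one of the efficient methods of Sections 7.2.1 or 7.3. The function name `rare_itemsets`, its parameters, and the toy jewelry data are all hypothetical.

```python
from itertools import combinations


def rare_itemsets(transactions, prices, max_sup, price_floor=500, max_size=2):
    """Enumerate rare itemsets that satisfy a price constraint.

    An itemset is reported if it occurs at least once, its relative support
    is below max_sup, and at least one of its items costs more than
    price_floor. Brute force: suitable for tiny examples only.
    """
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    result = {}
    for k in range(1, max_size + 1):
        for candidate in combinations(items, k):
            # Apply the user constraint first to skip irrelevant candidates.
            if not any(prices[item] > price_floor for item in candidate):
                continue
            count = sum(1 for t in transactions if set(candidate) <= t)
            support = count / n
            if 0 < support < max_sup:
                result[candidate] = support
    return result


# Toy transaction data (hypothetical).
transactions = [
    {"ring", "polish"},
    {"ring", "box"},
    {"diamond_watch", "box"},
    {"ring", "polish", "box"},
    {"polish"},
    {"ring"},
    {"diamond_watch", "ring"},
    {"box", "polish"},
    {"ring", "box"},
    {"polish", "ring"},
]
prices = {"ring": 120, "polish": 8, "box": 15, "diamond_watch": 2500}

rare = rare_itemsets(transactions, prices, max_sup=0.3)
for itemset, support in sorted(rare.items()):
    print(itemset, support)
```

Without the price constraint, every low-support combination of inexpensive items would also be reported, which is why a support threshold alone is rarely enough to define the rare patterns worth examining.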
There are various ways we could define a negative pattern. We will consider three such definitions.
Definition 7.1
If itemsets X and Y are both frequent but rarely occur together (i.e., sup(X ∪ Y) < sup(X) × sup(Y)), then itemsets X and Y are negatively correlated, and the pattern X ∪ Y is a negatively correlated pattern. If sup(X ∪ Y) ≪ sup(X) × sup(Y), then X and Y are strongly negatively correlated, and the pattern X ∪ Y is