Data Mining_ Concepts and Techniques - Jiawei Han [344]
Contexts may be specified at different levels of granularity. Suppose AllElectronics maintains customer information at a more detailed level for the attributes age, postal_code, number_of_transactions_per_year, and annual_total_transaction_amount. We can still group customers on age and postal_code, and then mine outliers in each group. What if the number of customers falling into a group is very small or even zero? For a customer, c, if the corresponding context contains very few or even no other customers, the evaluation of whether c is an outlier using the exact context is unreliable or even impossible.
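The grouping step and the small-group problem described above can be sketched as follows. This is a minimal illustration rather than a prescribed method: the record fields (`id`, `age`, `postal_code`, `amount`), the minimum group size, and the z-score threshold are all hypothetical choices, and a simple z-score test stands in for whatever outlier detector is applied within each group.

```python
from collections import defaultdict
import statistics

def outliers_by_context(customers, min_group_size=5, z_threshold=2.0):
    """Group customers by (age, postal_code) and flag outliers per group.

    Returns (flagged_ids, unreliable_ids). Customers whose context holds
    fewer than min_group_size members cannot be evaluated reliably and are
    reported separately. Both thresholds are illustrative assumptions.
    """
    groups = defaultdict(list)
    for c in customers:
        groups[(c["age"], c["postal_code"])].append(c)

    flagged, unreliable = [], []
    for members in groups.values():
        if len(members) < min_group_size:
            unreliable.extend(m["id"] for m in members)
            continue
        amounts = [m["amount"] for m in members]
        mean = statistics.fmean(amounts)
        std = statistics.pstdev(amounts)
        if std == 0:
            continue  # identical behavior within the group: nothing to flag
        for m in members:
            if abs(m["amount"] - mean) / std > z_threshold:
                flagged.append(m["id"])
    return flagged, unreliable

# Hypothetical data: six customers share one context, one customer is alone.
amounts = [100, 110, 90, 105, 95, 1000]
customers = [
    {"id": f"c{k + 1}", "age": 30, "postal_code": "10001", "amount": a}
    for k, a in enumerate(amounts)
]
customers.append({"id": "c7", "age": 83, "postal_code": "99999", "amount": 400})

flagged, unreliable = outliers_by_context(customers)
```

Here customer `c6` is flagged within its group, while `c7` is reported as unreliable because its exact context contains no other customers, which is the situation that motivates generalizing contexts.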
To overcome this challenge, we can assume that customers of similar age who live in the same area should have similar normal behavior. This assumption helps to generalize contexts and makes outlier detection more effective. For example, using a set of training data, we may learn a mixture model, U, of the data on the contextual attributes, and another mixture model, V, of the data on the behavior attributes. A mapping p(V_i | U_j) is also learned to capture the probability that a data object o belonging to cluster U_j on the contextual attributes is generated by cluster V_i on the behavior attributes. The outlier score can then be calculated as
S(o) = \sum_{U_j} p(o \in U_j) \sum_{V_i} p(o \in V_i) \, p(V_i \mid U_j) \qquad (12.20)
Thus, the contextual outlier problem is transformed into outlier detection using mixture models.
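To make the mixture-model scoring concrete, the sketch below computes a score of the form S(o) = sum over U_j of p(o in U_j) times the sum over V_i of p(o in V_i) p(V_i | U_j). Several things here are assumptions for illustration: both mixtures are one-dimensional Gaussian mixtures with hand-picked parameters (in practice they would be learned from training data, for example with EM), the mapping table is invented, and a low score is read as suggesting a contextual outlier.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def memberships(x, components):
    """Posterior p(o in component k) for a 1-D Gaussian mixture.

    components is a list of (weight, mean, std) tuples.
    """
    dens = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(dens)
    return [d / total for d in dens]

def outlier_score(context_x, behavior_x, U, V, mapping):
    """S(o) = sum_j p(o in U_j) * sum_i p(o in V_i) * p(V_i | U_j).

    mapping[j][i] holds p(V_i | U_j).
    """
    p_u = memberships(context_x, U)
    p_v = memberships(behavior_x, V)
    return sum(
        p_u[j] * sum(p_v[i] * mapping[j][i] for i in range(len(V)))
        for j in range(len(U))
    )

# Hypothetical mixtures: U over age, V over annual transaction amount.
U = [(0.5, 25.0, 5.0), (0.5, 60.0, 8.0)]            # younger, older
V = [(0.5, 1000.0, 300.0), (0.5, 10000.0, 2000.0)]  # low, high spending
mapping = [[0.9, 0.1],   # p(V_i | U_0): younger customers mostly spend little
           [0.2, 0.8]]   # p(V_i | U_1): older customers mostly spend more

typical = outlier_score(24.0, 900.0, U, V, mapping)
unusual = outlier_score(24.0, 11000.0, U, V, mapping)
```

With these made-up parameters, the young low spender scores close to 0.9 and the young high spender close to 0.1, so the second object is the contextual outlier candidate under the stated reading of the score.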
12.7.2. Modeling Normal Behavior with Respect to Contexts
In some applications, it is inconvenient or infeasible to clearly partition the data into contexts. For example, consider the situation where the online store of AllElectronics records customer browsing behavior in a search log. For each customer, the data log contains the sequence of products searched for and browsed by the customer. AllElectronics is interested in contextual outlier behavior, such as a customer suddenly purchasing a product that is unrelated to those she recently browsed. However, in this application, contexts cannot be easily specified because it is unclear how many products browsed earlier should be considered as the context, and this number will likely differ for each product.
This second category of contextual outlier detection methods models the normal behavior with respect to contexts. Using a training data set, such a method trains a model that predicts the expected behavior attribute values with respect to the contextual attribute values. To determine whether a data object is a contextual outlier, we can then apply the model to the contextual attributes of the object. If the behavior attribute values of the object significantly deviate from the values predicted by the model, then the object can be declared a contextual outlier.
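As a minimal sketch of this idea, the example below fits a simple linear regression that predicts one behavior attribute from one contextual attribute, then declares an object a contextual outlier when its residual exceeds k residual standard deviations. The attribute choice (income as context, spending as behavior), the training values, and the threshold k = 3 are all illustrative assumptions; any prediction model could take the place of the fitted line.

```python
import math
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def residual_std(xs, ys, slope, intercept):
    """Standard deviation of training residuals (n - 2 degrees of freedom)."""
    sq = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sq / (len(xs) - 2))

def is_contextual_outlier(context, behavior, slope, intercept, sigma, k=3.0):
    """True if the behavior deviates from the prediction by more than k * sigma."""
    predicted = slope * context + intercept
    return abs(behavior - predicted) > k * sigma

# Hypothetical training data: income (context) and spending (behavior),
# both in thousands of dollars per year.
income = [20, 30, 40, 50, 60, 70, 80, 90]
spending = [5.2, 7.4, 10.1, 12.3, 15.2, 17.4, 20.1, 22.6]

slope, intercept = fit_line(income, spending)
sigma = residual_std(income, spending, slope, intercept)

low_income_big_spender = is_contextual_outlier(25, 22.0, slope, intercept, sigma)
high_income_big_spender = is_contextual_outlier(85, 21.0, slope, intercept, sigma)
```

Similar spending levels lead to different verdicts here: spending of about 21 to 22 is unremarkable at an income of 85 but deviates strongly from the prediction at an income of 25, and no explicit partition into contexts was needed.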
By using a prediction model that links the contexts and behavior, these methods avoid the explicit identification of specific contexts. A number of classification and prediction techniques can be used to build such models, such as regression, Markov models, and finite state automata. Interested readers are referred to Chapters 8 and 9 on classification and to the bibliographic notes (Section 12.11) for further details.
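For the search-log scenario, one of the model families mentioned above can be sketched with a first-order Markov model: transition counts between consecutive products are learned from training sequences, and a purchase is scored by the smoothed probability of following the most recently browsed product. The product names, the add-one smoothing, and the use of only the last browsed item as context are assumptions made for brevity, not the method prescribed by the text.

```python
from collections import Counter, defaultdict

def train_transitions(sequences):
    """Count how often each product is followed by each other product."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def transition_prob(counts, vocab_size, prev, nxt, alpha=1.0):
    """Smoothed estimate of p(nxt | prev); alpha is an additive smoothing constant."""
    row = counts.get(prev, Counter())
    return (row[nxt] + alpha) / (sum(row.values()) + alpha * vocab_size)

# Hypothetical browsing sequences from a search log.
logs = [
    ["laptop", "mouse", "laptop_bag"],
    ["laptop", "mouse", "keyboard"],
    ["laptop", "mouse", "laptop_bag"],
    ["camera", "lens", "tripod"],
    ["camera", "lens", "memory_card"],
]
vocab = {item for seq in logs for item in seq}
counts = train_transitions(logs)

p_expected = transition_prob(counts, len(vocab), "mouse", "laptop_bag")
p_unrelated = transition_prob(counts, len(vocab), "mouse", "tripod")
```

A purchase whose probability falls below some chosen threshold (itself an application-specific assumption) would be reported as a contextual outlier; here a tripod bought right after browsing a mouse is much less probable than a laptop bag.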
In summary, contextual outlier detection enhances conventional outlier detection by considering contexts, which are important in many applications. We may be able to detect outliers that cannot be detected otherwise. Consider a credit card user whose income level is low but whose expenditure patterns are similar to those of millionaires. This user can be detected as a contextual outlier if the income level is used to define context. Such a user may not be detected as an outlier without contextual information because she does share expenditure patterns with many millionaires. Considering contexts in outlier detection can also help to avoid false alarms. Without considering the context, a millionaire's purchase transaction may be falsely detected as an outlier if the majority of customers in