Data Mining_ Concepts and Techniques - Jiawei Han [343]
Figure 12.14 Detecting outliers by semi-supervised learning.
Classification-based methods can incorporate human domain knowledge into the detection process by learning from the labeled samples. Once the classification model is constructed, the outlier detection process is fast. It only needs to compare the objects to be examined against the model learned from the training data. The quality of classification-based methods heavily depends on the availability and quality of the training set. In many applications, it is difficult to obtain representative and high-quality training data, which limits the applicability of classification-based methods.
12.7. Mining Contextual and Collective Outliers
An object in a given data set is a contextual outlier (or conditional outlier) if it deviates significantly with respect to a specific context of the object (Section 12.1). The context is defined using contextual attributes. These depend heavily on the application, and are often provided by users as part of the contextual outlier detection task. Contextual attributes can include spatial attributes, time, network locations, and sophisticated structured attributes. In addition, behavioral attributes define characteristics of the object, and are used to evaluate whether the object is an outlier in the context to which it belongs.
Contextual outliers
To determine whether the temperature of a location is exceptional (i.e., an outlier), the attributes specifying information about the location can serve as contextual attributes. These attributes may be spatial attributes (e.g., longitude and latitude) or location attributes in a graph or network. The attribute time can also be used. In customer-relationship management, whether a customer is an outlier may depend on other customers with similar profiles. Here, the attributes defining customer profiles provide the context for outlier detection.
In comparison to outlier detection in general, identifying contextual outliers requires analyzing the corresponding contextual information. Contextual outlier detection methods can be divided into two categories according to whether the contexts can be clearly identified.
12.7.1. Transforming Contextual Outlier Detection to Conventional Outlier Detection
This category of methods is for situations where the contexts can be clearly identified. The idea is to transform the contextual outlier detection problem into a typical outlier detection problem. Specifically, for a given data object, we can evaluate whether the object is an outlier in two steps. In the first step, we identify the context of the object using the contextual attributes. In the second step, we calculate the outlier score for the object in the context using a conventional outlier detection method.
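The two steps above can be sketched as follows. This is a minimal illustration rather than a prescribed implementation: the function name `contextual_outlier_scores`, the sample `readings`, and the attribute names `region` and `temp_c` are hypothetical. The per-context absolute z-score stands in for "a conventional outlier detection method"; any other conventional detector could be substituted in step 2.

```python
from collections import defaultdict
from statistics import mean, pstdev


def contextual_outlier_scores(records, context_attrs, behavior_attr):
    """Score each record by the absolute z-score of its behavioral value
    within the group of records that share its contextual attribute values."""
    # Step 1: identify the context of each object by grouping on the
    # contextual attributes.
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[a] for a in context_attrs)
        groups[key].append(rec[behavior_attr])

    # Summarize each context by its mean and (population) standard deviation.
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}

    # Step 2: compute the outlier score of each object within its own context.
    scores = []
    for rec in records:
        key = tuple(rec[a] for a in context_attrs)
        mu, sigma = stats[key]
        # A context with no spread gives no evidence of deviation.
        scores.append(abs(rec[behavior_attr] - mu) / sigma if sigma > 0 else 0.0)
    return scores


# Hypothetical temperature readings: 20 degrees is ordinary in the "south"
# context but deviates strongly in the "north" context.
readings = [
    {"region": "north", "temp_c": -5.0},
    {"region": "north", "temp_c": -4.0},
    {"region": "north", "temp_c": -6.0},
    {"region": "north", "temp_c": -5.0},
    {"region": "north", "temp_c": 20.0},
    {"region": "south", "temp_c": 20.0},
    {"region": "south", "temp_c": 21.0},
    {"region": "south", "temp_c": 19.0},
    {"region": "south", "temp_c": 20.0},
    {"region": "south", "temp_c": 22.0},
]
scores = contextual_outlier_scores(readings, ["region"], "temp_c")
```

With this data, the same behavioral value (20) receives a high score in the north context and a low score in the south context, which is the distinction that a context-free detector would miss.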
Contextual outlier detection when the context can be clearly identified
In customer-relationship management, we can detect outlier customers in the context of customer groups. Suppose AllElectronics maintains customer information on four attributes, namely age_group (i.e., under 25, 25-45, 45-65, and over 65), postal_code, number_of_transactions_per_year, and annual_total_transaction_amount. The attributes age_group and postal_code serve as contextual attributes, and the attributes number_of_transactions_per_year and annual_total_transaction_amount are behavioral attributes.
To detect contextual outliers in this setting, for a customer, c, we can first locate the context of c using the attributes age_group and postal_code. We can then compare c with the other customers