Data Mining_ Concepts and Techniques - Jiawei Han [343]


approach, we find a large cluster, C, and a small cluster, C1. Because some objects in C carry the label “normal,” we can treat all objects in this cluster (including those without labels) as normal objects. We use the one-class model of this cluster to identify normal objects in outlier detection. Similarly, because some objects in cluster C1 carry the label “outlier,” we declare all objects in C1 to be outliers. Any object that does not fall into the model for C (e.g., a) is considered an outlier as well.
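The label-spreading idea in this paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cluster assignments are taken as given (any clustering method could have produced them), the function name propagate_cluster_labels and the dictionary layout are hypothetical, and a majority vote is assumed when a cluster contains conflicting labels, which the text does not discuss.

```python
from collections import Counter

def propagate_cluster_labels(cluster_of, known_labels):
    """Spread the few known labels to every object in the same cluster.

    cluster_of:   dict mapping object id -> cluster id, or None when the
                  object does not fall into any cluster model (hypothetical layout).
    known_labels: dict mapping object id -> "normal" or "outlier" for the
                  small labeled subset.
    """
    # Count the known labels observed inside each cluster.
    votes = {}
    for obj, label in known_labels.items():
        cid = cluster_of.get(obj)
        if cid is not None:
            votes.setdefault(cid, Counter())[label] += 1

    result = {}
    for obj, cid in cluster_of.items():
        if cid is None or cid not in votes:
            # Assumption: objects outside every labeled cluster are outliers,
            # mirroring the treatment of object a in the text.
            result[obj] = "outlier"
        else:
            # Assumption: majority vote decides a cluster's label.
            result[obj] = votes[cid].most_common(1)[0][0]
    return result

# Toy data: large cluster C, small cluster C1, and an isolated object a.
cluster_of = {"x1": "C", "x2": "C", "x3": "C", "y1": "C1", "y2": "C1", "a": None}
known_labels = {"x1": "normal", "y1": "outlier"}
labels = propagate_cluster_labels(cluster_of, known_labels)
```

In the toy data, the unlabeled members of C inherit "normal", the unlabeled member of C1 inherits "outlier", and a is flagged because it belongs to no cluster.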

Figure 12.14 Detecting outliers by semi-supervised learning.

Classification-based methods can incorporate human domain knowledge into the detection process by learning from the labeled samples. Once the classification model is constructed, the outlier detection process is fast. It only needs to compare the objects to be examined against the model learned from the training data. The quality of classification-based methods heavily depends on the availability and quality of the training set. In many applications, it is difficult to obtain representative and high-quality training data, which limits the applicability of classification-based methods.

12.7. Mining Contextual and Collective Outliers

An object in a given data set is a contextual outlier (or conditional outlier) if it deviates significantly with respect to a specific context of the object (Section 12.1). The context is defined using contextual attributes. These depend heavily on the application, and are often provided by users as part of the contextual outlier detection task. Contextual attributes can include spatial attributes, time, network locations, and sophisticated structured attributes. In addition, behavioral attributes define characteristics of the object, and are used to evaluate whether the object is an outlier in the context to which it belongs.

Contextual outliers

To determine whether the temperature of a location is exceptional (i.e., an outlier), the attributes specifying information about the location can serve as contextual attributes. These attributes may be spatial attributes (e.g., longitude and latitude) or location attributes in a graph or network. The attribute time can also be used. In customer-relationship management, whether a customer is an outlier may depend on other customers with similar profiles. Here, the attributes defining customer profiles provide the context for outlier detection.

In comparison to outlier detection in general, identifying contextual outliers requires analyzing the corresponding contextual information. Contextual outlier detection methods can be divided into two categories according to whether the contexts can be clearly identified.

12.7.1. Transforming Contextual Outlier Detection to Conventional Outlier Detection

This category of methods is for situations where the contexts can be clearly identified. The idea is to transform the contextual outlier detection problem into a typical outlier detection problem. Specifically, for a given data object, we can evaluate whether the object is an outlier in two steps. In the first step, we identify the context of the object using the contextual attributes. In the second step, we calculate the outlier score for the object in the context using a conventional outlier detection method.
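The two steps can be sketched as follows. This is a minimal illustration, not the book's algorithm: the z-score within each context group stands in for "a conventional outlier detection method," and the function name contextual_outlier_scores and the field names region and amount are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def contextual_outlier_scores(records, contextual, behavioral):
    """Return one outlier score per record, computed inside its own context.

    records:    list of dicts (hypothetical layout).
    contextual: attribute names that define the context.
    behavioral: numeric attribute names evaluated within the context.
    """
    # Step 1: identify each record's context from its contextual attributes.
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[a] for a in contextual)].append(r)

    # Step 2: score each record against the peers in its context.
    # Assumption: the score is the largest absolute z-score over the
    # behavioral attributes; any conventional method could be used instead.
    scores = []
    for r in records:
        peers = groups[tuple(r[a] for a in contextual)]
        score = 0.0
        for attr in behavioral:
            values = [p[attr] for p in peers]
            mu, sigma = mean(values), pstdev(values)
            if sigma > 0:  # all-equal contexts carry no deviation signal
                score = max(score, abs(r[attr] - mu) / sigma)
        scores.append(score)
    return scores

# The same value (50) is unusual in context "A" but ordinary in context "B".
records = (
    [{"region": "A", "amount": 10} for _ in range(4)]
    + [{"region": "A", "amount": 50}]
    + [{"region": "B", "amount": 50}, {"region": "B", "amount": 50}]
)
scores = contextual_outlier_scores(records, ["region"], ["amount"])
```

Scoring within the group rather than over the whole data set is what makes the result contextual: the record with amount 50 in region A scores highest, while the records with the same amount in region B score zero.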

Contextual outlier detection when the context can be clearly identified

In customer-relationship management, we can detect outlier customers in the context of customer groups. Suppose AllElectronics maintains customer information on four attributes, namely age_group (i.e., under 25, 25-45, 45-65, and over 65), postal_code, number_of_transactions_per_year, and annual_total_transaction_amount. The attributes age_group and postal_code serve as contextual attributes, and the attributes number_of_transactions_per_year and annual_total_transaction_amount are behavioral attributes.

To detect contextual outliers in this setting, for a customer, c, we can first locate the context of c using the attributes age_group and postal_code. We can then compare c with the other customers
