Data Mining_ Concepts and Techniques - Jiawei Han [349]
■ Challenges in outlier detection include finding appropriate data models, the dependence of outlier detection systems on the application involved, finding ways to distinguish outliers from noise, and providing justification for identifying outliers as such.
■ Outlier detection methods can be categorized according to whether the sample of data for analysis is given with expert-provided labels that can be used to build an outlier detection model. In this case, the detection methods are supervised, semi-supervised, or unsupervised. Alternatively, outlier detection methods may be organized according to their assumptions regarding normal objects versus outliers. This categorization includes statistical methods, proximity-based methods, and clustering-based methods.
■ Statistical outlier detection methods (or model-based methods) assume that the normal data objects follow a statistical model, where data not following the model are considered outliers. Such methods may be parametric (they assume that the data are generated by a parametric distribution) or nonparametric (they learn a model for the data, rather than assuming one a priori). Parametric methods for multivariate data may employ the Mahalanobis distance, the χ2-statistic, or a mixture of multiple parametric models. Histograms and kernel density estimation are examples of nonparametric methods.
■ Proximity-based outlier detection methods assume that an object is an outlier if the proximity of the object to its nearest neighbors significantly deviates from the proximity of most of the other objects to their neighbors in the same data set. Distance-based outlier detection methods consult the neighborhood of an object, defined by a given radius. An object is an outlier if its neighborhood does not have enough other points. In density-based outlier detection methods, an object is an outlier if its density is relatively much lower than that of its neighbors.
■ Clustering-based outlier detection methods assume that the normal data objects belong to large and dense clusters, whereas outliers belong to small or sparse clusters, or do not belong to any clusters.
■ Classification-based outlier detection methods often use a one-class model. That is, a classifier is built to describe only the normal class. Any samples that do not belong to the normal class are regarded as outliers.
■ Contextual outlier detection and collective outlier detection explore structures in the data. In contextual outlier detection, the structures are defined as contexts using contextual attributes. In collective outlier detection, the structures are implicit and are explored as part of the mining process. To detect such outliers, one approach transforms the problem into one of conventional outlier detection. Another approach models the structures directly.
■ Outlier detection methods for high-dimensional data can be divided into three main approaches. These include extending conventional outlier detection, finding outliers in subspaces, and modeling high-dimensional outliers.
12.10. Exercises
12.1 Give an application example where global outliers, contextual outliers, and collective outliers are all interesting. What are the attributes, and what are the contextual and behavioral attributes? How is the relationship among objects modeled in collective outlier detection?
12.2 Give an application example of where the border between normal objects and outliers is often unclear, so that the degree to which an object is an outlier has to be well estimated.
12.3 Adapt a simple semi-supervised method for outlier detection. Discuss the scenario where you have (a) only some labeled examples of normal objects, and (b) only some labeled examples of outliers.
12.4 Using an equal-depth histogram, design a way to assign an object an outlier score.
12.5 Consider the nested loop approach to mining distance-based outliers (Figure 12.6). Suppose the objects in a data set are arranged randomly, that