Data Mining_ Concepts and Techniques - Jiawei Han [347]
To reduce dimensionality, general feature selection and extraction methods may be used or extended for outlier detection. For example, principal components analysis (PCA) can be used to extract a lower-dimensional space. Heuristically, the principal components with low variance are preferred because, on such dimensions, normal objects are likely close to each other and outliers often deviate from the majority.
By extending conventional outlier detection methods, we can reuse much of the experience gained from research in the field. These new methods, however, are limited. First, they cannot detect outliers with respect to subspaces and thus have limited interpretability. Second, dimensionality reduction is feasible only if there exists a lower-dimensional space where normal objects and outliers are well separated. This assumption may not hold true.
12.8.2. Finding Outliers in Subspaces
Another approach for outlier detection in high-dimensional data is to search for outliers in various subspaces. A unique advantage is that, if an object is found to be an outlier in a subspace of much lower dimensionality, the subspace provides critical information for interpreting why and to what extent the object is an outlier. This insight is highly valuable in applications with high-dimensional data due to the overwhelming number of dimensions.
Outliers in subspaces
As a customer-relationship manager at AllElectronics, you are interested in finding outlier customers. AllElectronics maintains an extensive customer information database, which contains many attributes and the transaction history of customers. The database is high dimensional.
Suppose you find that a customer, Alice, is an outlier in a lower-dimensional subspace that contains the dimensions average_transaction_amount and purchase_frequency, such that her average transaction amount is substantially larger than the majority of the customers, and her purchase frequency is dramatically lower. The subspace itself speaks for why and to what extent Alice is an outlier. Using this information, you strategically decide to approach Alice by suggesting options that could improve her purchase frequency at AllElectronics.
“How can we detect outliers in subspaces?” We use a grid-based subspace outlier detection method to illustrate. The major ideas are as follows. We consider projections of the data onto various subspaces. If, in a subspace, we find an area that has a density that is much lower than average, then the area may contain outliers. To find such projections, we first discretize the data into a grid in an equal-depth way. That is, each dimension is partitioned into ϕ equal-depth ranges, where each range contains a fraction, f, of the objects . Equal-depth partitioning is used because data along different dimensions may have different localities. An equal-width partitioning of the space may not be able to reflect such differences in locality.
Next, we search for regions defined by ranges in subspaces that are significantly sparse. To quantify what we mean by “significantly sparse,” let's consider a k-dimensional cube formed by k ranges on k dimensions. Suppose the data set contains n objects. If the objects are independently distributed, the expected number of objects falling into a k-dimensional region is . The standard deviation of the number of points in a k-dimensional region is . Suppose a specific k-dimensional cube C has n(C) objects. We can define the sparsity coefficient of C as
(12.22)
If , then C contains fewer objects than expected. The smaller the value of S(C) (i.e., the more negative), the sparser C is and the more likely the objects in C are outliers in the subspace.
By assuming S(C) follows a normal distribution, we can use normal distribution tables to determine the probabilistic significance level for an object that deviates dramatically from the average for an a priori assumption of the data following a uniform distribution. In general, the assumption of uniform