Data Mining: Concepts and Techniques - Jiawei Han
Figure 12.3 Using a boxplot to visualize outliers.
Another simple statistical method for univariate outlier detection using the normal distribution is Grubbs' test (also known as the maximum normed residual test). For each object x in a data set, we define a z-score as

z = \frac{|x - \bar{x}|}{s}  (12.4)

where \bar{x} is the mean, and s is the standard deviation of the input data. An object x is an outlier if

z \geq \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}  (12.5)

where t^2_{\alpha/(2N),\,N-2} is the value taken by a t-distribution at a significance level of \alpha/(2N), and N is the number of objects in the data set.
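As an illustration, Grubbs' test can be sketched in a few lines of Python. This is a minimal sketch using NumPy and SciPy; the function name `grubbs_outliers` and the sample data are our own, not from the text:

```python
import numpy as np
from scipy import stats

def grubbs_outliers(x, alpha=0.05):
    """Flag univariate outliers with Grubbs' test (Eqs. 12.4-12.5).

    Returns a boolean mask: True where the z-score of Eq. (12.4)
    exceeds the critical value of Eq. (12.5) at significance alpha.
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    z = np.abs(x - x.mean()) / x.std(ddof=1)            # Eq. (12.4)
    # Critical value from the t-distribution, Eq. (12.5)
    t2 = stats.t.ppf(1 - alpha / (2 * N), N - 2) ** 2
    crit = (N - 1) / np.sqrt(N) * np.sqrt(t2 / (N - 2 + t2))
    return z > crit

data = [5.0, 5.2, 4.9, 5.1, 5.0, 4.8, 12.0]  # 12.0 is an obvious outlier
print(grubbs_outliers(data))
```

Note that ddof=1 gives the sample standard deviation, which is what the test statistic requires.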
Detection of Multivariate Outliers
Data involving two or more attributes or variables are multivariate data. Many univariate outlier detection methods can be extended to handle multivariate data. The central idea is to transform the multivariate outlier detection task into a univariate outlier detection problem. Here, we use two examples to illustrate this idea.
Multivariate outlier detection using the Mahalanobis distance
For a multivariate data set, let \bar{o} be the mean vector. For an object, o, in the data set, the Mahalanobis distance from o to \bar{o} is

MDist(o, \bar{o}) = (o - \bar{o})^T S^{-1} (o - \bar{o})  (12.6)

where S is the covariance matrix.

MDist(o, \bar{o}) is a univariate variable, and thus Grubbs' test can be applied to this measure. Therefore, we can transform the multivariate outlier detection task as follows:
1. Calculate the mean vector \bar{o} from the multivariate data set.
2. For each object o, calculate MDist(o, \bar{o}), the Mahalanobis distance from o to \bar{o}.
3. Detect outliers in the transformed univariate data set, \{MDist(o, \bar{o}) \mid o \in D\}.
4. If MDist(o, \bar{o}) is determined to be an outlier, then o is regarded as an outlier as well.
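Steps 1-3 above can be sketched as follows (a minimal NumPy sketch; the function name and toy data are illustrative assumptions, not from the text). Step 4 would then run a univariate detector, such as Grubbs' test, on the returned distances:

```python
import numpy as np

def mahalanobis_distances(X):
    """Squared Mahalanobis distance from each row of X to the mean
    vector, per Eq. (12.6): (o - o_bar)^T S^{-1} (o - o_bar)."""
    X = np.asarray(X, dtype=float)
    o_bar = X.mean(axis=0)                           # step 1: mean vector
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance
    diff = X - o_bar
    # steps 2-3: one distance per object -> a univariate data set
    return np.einsum('ij,jk,ik->i', diff, S_inv, diff)

X = np.array([[1., 2.], [2., 1.], [3., 3.], [2., 2.], [10., 10.]])
d = mahalanobis_distances(X)
print(d)  # the last object, far from the rest, gets the largest distance
```

Unlike plain Euclidean distance, this measure accounts for the correlation between attributes through the covariance matrix S.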
Our second example uses the χ2-statistic to measure the distance between an object and the mean of the input data set.
Multivariate outlier detection using the χ2-statistic
The χ2-statistic can also be used to capture multivariate outliers under the assumption of a normal distribution. For an object, o, the χ2-statistic is

\chi^2 = \sum_{i=1}^{n} \frac{(o_i - E_i)^2}{E_i}  (12.7)

where o_i is the value of o on the ith dimension, E_i is the mean of the ith dimension among all objects, and n is the dimensionality. If the χ2-statistic is large, the object is an outlier.
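Equation (12.7) can be computed directly, as in this minimal NumPy sketch (the function name and toy data are illustrative; we also assume positive-valued attributes so the division by E_i is well defined):

```python
import numpy as np

def chi_square_stat(X):
    """Per-object chi-square statistic, Eq. (12.7):
    sum_i (o_i - E_i)^2 / E_i, where E_i is the mean of dimension i."""
    X = np.asarray(X, dtype=float)
    E = X.mean(axis=0)                  # per-dimension means E_i
    return (((X - E) ** 2) / E).sum(axis=1)

X = [[1., 2.], [3., 2.], [2., 2.], [2., 8.]]
print(chi_square_stat(X))  # the last object has the largest statistic
```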
Using a Mixture of Parametric Distributions
Assuming that the data are generated by a single normal distribution works well in many situations. However, this assumption may be overly simplistic when the actual data distribution is complex. In such cases, we instead assume that the data were generated by a mixture of parametric distributions.
Multivariate outlier detection using multiple parametric distributions
Consider the data set in Figure 12.4. There are two big clusters, C1 and C2. To assume that the data are generated by a normal distribution would not work well here. The estimated mean is located between the two clusters and not inside any cluster. The objects between the two clusters cannot be detected as outliers since they are close to the mean.
Figure 12.4 A complex data set.
To overcome this problem, we can instead assume that the normal data objects are generated by multiple normal distributions, two in this case. That is, we assume two normal distributions, \Theta_1(\mu_1, \sigma_1) and \Theta_2(\mu_2, \sigma_2). For any object, o, in the data set, the probability that o is generated by the mixture of the two distributions is given by

Pr(o \mid \Theta_1, \Theta_2) = f_{\Theta_1}(o) + f_{\Theta_2}(o),

where f_{\Theta_1} and f_{\Theta_2} are the probability density functions of \Theta_1 and \Theta_2, respectively. We can use the expectation-maximization (EM) algorithm (Chapter 11) to learn the parameters \mu_1, \sigma_1, \mu_2, \sigma_2 from the data, as we do in mixture models for clustering. Each cluster is represented by a learned normal distribution. An object, o, is detected as an outlier if it does not belong