Data Mining: Concepts and Techniques - Jiawei Han
Figure 12.3 Using a boxplot to visualize outliers.
Another simple statistical method for univariate outlier detection using the normal distribution is Grubbs' test (also known as the maximum normed residual test). For each object x in a data set, we define a z-score as

z = \frac{|x - \bar{x}|}{s}  (12.4)

where \bar{x} is the mean, and s is the standard deviation of the input data. An object x is an outlier if

z \geq \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}  (12.5)

where t^2_{\alpha/(2N),\,N-2} is the value taken by a t-distribution at a significance level of \alpha/(2N), and N is the number of objects in the data set.
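As an illustration, Grubbs' test can be sketched in a few lines of Python. This is a minimal sketch using NumPy and SciPy; the function name `grubbs_outliers` and the sample data are our own, not from the text:

```python
import numpy as np
from scipy import stats

def grubbs_outliers(x, alpha=0.05):
    """Flag univariate outliers with Grubbs' test (Eqs. 12.4-12.5).

    Returns a boolean mask: True where the z-score of Eq. (12.4)
    exceeds the critical value of Eq. (12.5) at significance alpha.
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    z = np.abs(x - x.mean()) / x.std(ddof=1)            # Eq. (12.4)
    # Critical value from the t-distribution, Eq. (12.5)
    t2 = stats.t.ppf(1 - alpha / (2 * N), N - 2) ** 2
    crit = (N - 1) / np.sqrt(N) * np.sqrt(t2 / (N - 2 + t2))
    return z > crit

data = [5.0, 5.2, 4.9, 5.1, 5.0, 4.8, 12.0]  # 12.0 is an obvious outlier
print(grubbs_outliers(data))
```

Note that ddof=1 gives the sample standard deviation, which is what the test statistic requires.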
Detection of Multivariate Outliers
Data involving two or more attributes or variables are multivariate data. Many univariate outlier detection methods can be extended to handle multivariate data. The central idea is to transform the multivariate outlier detection task into a univariate outlier detection problem. Here, we use two examples to illustrate this idea.
Multivariate outlier detection using the Mahalanobis distance
For a multivariate data set, let \bar{o} be the mean vector. For an object, o, in the data set, the Mahalanobis distance from o to \bar{o} is

MDist(o, \bar{o}) = (o - \bar{o})^T S^{-1} (o - \bar{o})  (12.6)

where S is the covariance matrix.

MDist(o, \bar{o}) is a univariate variable, and thus Grubbs' test can be applied to this measure. Therefore, we can transform the multivariate outlier detection task as follows:
1. Calculate the mean vector \bar{o} from the multivariate data set.
2. For each object o, calculate MDist(o, \bar{o}), the Mahalanobis distance from o to \bar{o}.
3. Detect outliers in the transformed univariate data set, \{MDist(o, \bar{o}) \mid o \in D\}.
4. If MDist(o, \bar{o}) is determined to be an outlier, then o is regarded as an outlier as well.
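Steps 1-3 above can be sketched as follows (a minimal NumPy sketch; the function name and toy data are illustrative assumptions, not from the text). Step 4 would then run a univariate detector, such as Grubbs' test, on the returned distances:

```python
import numpy as np

def mahalanobis_distances(X):
    """Squared Mahalanobis distance from each row of X to the mean
    vector, per Eq. (12.6): (o - o_bar)^T S^{-1} (o - o_bar)."""
    X = np.asarray(X, dtype=float)
    o_bar = X.mean(axis=0)                           # step 1: mean vector
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance
    diff = X - o_bar
    # steps 2-3: one distance per object -> a univariate data set
    return np.einsum('ij,jk,ik->i', diff, S_inv, diff)

X = np.array([[1., 2.], [2., 1.], [3., 3.], [2., 2.], [10., 10.]])
d = mahalanobis_distances(X)
print(d)  # the last object, far from the rest, gets the largest distance
```

Unlike plain Euclidean distance, this measure accounts for the correlation between attributes through the covariance matrix S.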
Our second example uses the χ2-statistic to measure the distance between an object and the mean of the input data set.
Multivariate outlier detection using the χ2-statistic
The χ2-statistic can also be used to capture multivariate outliers under the assumption of a normal distribution. For an object, o, the χ2-statistic is

\chi^2 = \sum_{i=1}^{n} \frac{(o_i - E_i)^2}{E_i}  (12.7)

where o_i is the value of o on the ith dimension, E_i is the mean of the ith dimension among all objects, and n is the dimensionality. If the χ2-statistic is large, the object is an outlier.
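Equation (12.7) can be computed directly, as in this minimal NumPy sketch (the function name and toy data are illustrative; we also assume positive-valued attributes so the division by E_i is well defined):

```python
import numpy as np

def chi_square_stat(X):
    """Per-object chi-square statistic, Eq. (12.7):
    sum_i (o_i - E_i)^2 / E_i, where E_i is the mean of dimension i."""
    X = np.asarray(X, dtype=float)
    E = X.mean(axis=0)                  # per-dimension means E_i
    return (((X - E) ** 2) / E).sum(axis=1)

X = [[1., 2.], [3., 2.], [2., 2.], [2., 8.]]
print(chi_square_stat(X))  # the last object has the largest statistic
```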
Using a Mixture of Parametric Distributions
Assuming that the data are generated by a single normal distribution works well in many situations. However, this assumption may be overly simplistic when the actual data distribution is complex. In such cases, we instead assume that the data were generated by a mixture of parametric distributions.
Multivariate outlier detection using multiple parametric distributions
Consider the data set in Figure 12.4. There are two big clusters, C1 and C2. To assume that the data are generated by a normal distribution would not work well here. The estimated mean is located between the two clusters and not inside any cluster. The objects between the two clusters cannot be detected as outliers since they are close to the mean.
Figure 12.4 A complex data set.
To overcome this problem, we can instead assume that the normal data objects are generated by multiple normal distributions, two in this case. That is, we assume two normal distributions, \Theta_1(\mu_1, \sigma_1) and \Theta_2(\mu_2, \sigma_2). For any object, o, in the data set, the probability that o is generated by the mixture of the two distributions is given by

Pr(o \mid \Theta_1, \Theta_2) = f_{\Theta_1}(o) + f_{\Theta_2}(o),

where f_{\Theta_1} and f_{\Theta_2} are the probability density functions of \Theta_1 and \Theta_2, respectively. We can use the expectation-maximization (EM) algorithm (Chapter 11) to learn the parameters \mu_1, \sigma_1, \mu_2, \sigma_2 from the data, as we do in mixture models for clustering. Each cluster is represented by a learned normal distribution. An object, o, is detected as an outlier if it does not belong