Data Mining: Concepts and Techniques - Jiawei Han
As with statistical methods for clustering, statistical methods for outlier detection make assumptions about data normality. They assume that the normal objects in a data set are generated by a stochastic process (a generative model). Consequently, normal objects occur in regions of high probability for the stochastic model, and objects in the regions of low probability are outliers.
The general idea behind statistical methods for outlier detection is to learn a generative model fitting the given data set, and then identify those objects in low-probability regions of the model as outliers. However, there are many different ways to learn generative models. In general, statistical methods for outlier detection can be divided into two major categories: parametric methods and nonparametric methods, according to how the models are specified and learned.
A parametric method assumes that the normal data objects are generated by a parametric distribution with parameter Θ. The probability density function of the parametric distribution gives the probability that object x is generated by the distribution. The smaller this value, the more likely x is an outlier.
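For instance, with a normal model the density f(x | μ, σ²) can be evaluated directly, and smaller values indicate more outlying points. A minimal sketch (the numbers are illustrative, not from the text):

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x: smaller values suggest x is more outlying."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Under N(0, 1), a point near the mean has much higher density than one far away.
print(normal_pdf(0.5, 0.0, 1.0) > normal_pdf(4.0, 0.0, 1.0))  # → True
```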
A nonparametric method does not assume an a priori statistical model. Instead, a nonparametric method tries to determine the model from the input data. Note that most nonparametric methods do not assume that the model is completely parameter-free. (Such an assumption would make learning the model from data nearly impossible.) Instead, nonparametric methods often take the position that the number and nature of the parameters are flexible and not fixed in advance. Examples of nonparametric methods include histograms and kernel density estimation.
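As a tiny illustration of the nonparametric idea, a histogram can serve directly as the learned model: values that fall in sparsely populated bins are suspect. This is an illustrative sketch only; the bin width and the 15% cutoff are my own choices, not from the text.

```python
from collections import Counter

def histogram_outliers(data, bin_width, min_fraction=0.15):
    """Build a histogram of the data, then flag values that fall in bins
    holding less than min_fraction of all observations."""
    bins = Counter(int(x // bin_width) for x in data)
    n = len(data)
    return [x for x in data if bins[int(x // bin_width)] / n < min_fraction]

values = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
print(histogram_outliers(values, bin_width=1.0))  # → [24.0]
```

Here 24.0 sits alone in its bin (1 of 10 observations, below the 15% cutoff), while every other value shares a well-populated bin.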
12.3.1. Parametric Methods
In this subsection, we introduce several simple yet practical parametric methods for outlier detection. We first discuss methods for univariate data based on normal distribution. We then discuss how to handle multivariate data using multiple parametric distributions.
Detection of Univariate Outliers Based on Normal Distribution
Data involving only one attribute or variable are called univariate data. For simplicity, we often assume that such data are generated by a normal distribution. We can then learn the parameters of the normal distribution from the input data and identify the points with low probability as outliers, as the following example shows.
Example 12.8: Univariate outlier detection using maximum likelihood
Suppose a city's average temperature values in July over the last 10 years are, in ascending order, 24.0°C, 28.9°C, 28.9°C, 29.0°C, 29.1°C, 29.1°C, 29.2°C, 29.2°C, 29.3°C, and 29.4°C. Let's assume that the average temperature follows a normal distribution, which is determined by two parameters: the mean, μ, and the standard deviation, σ.
We can use the maximum likelihood method to estimate the parameters μ and σ. That is, we maximize the log-likelihood function

$$\ln L(\mu, \sigma^2) = \sum_{i=1}^{n} \ln f(x_i \mid \mu, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2, \tag{12.1}$$
where n is the total number of samples, which is 10 in this example.
Taking derivatives with respect to μ and σ² and solving the resulting system of first-order conditions leads to the following maximum likelihood estimates:
$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \tag{12.2}$$

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2. \tag{12.3}$$
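For completeness, the estimates in Eqs. (12.2) and (12.3) follow from setting the first-order conditions of the log-likelihood to zero:

$$\frac{\partial \ln L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0 \;\Longrightarrow\; \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i,$$

$$\frac{\partial \ln L}{\partial (\sigma^2)} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 = 0 \;\Longrightarrow\; \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2.$$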
In this example, we have

$$\hat{\mu} = \frac{24.0 + 28.9 + 28.9 + 29.0 + 29.1 + 29.1 + 29.2 + 29.2 + 29.3 + 29.4}{10} = 28.61.$$

Accordingly, we have

$$\hat{\sigma}^2 = \frac{1}{10}\sum_{i=1}^{10}(x_i - 28.61)^2 \approx 2.38, \qquad \hat{\sigma} \approx 1.54.$$

The most deviating value, 24.0°C, is 4.61°C below the estimated mean. We know that the region $\hat{\mu} \pm 3\hat{\sigma}$ contains 99.7% of the data under the assumption of normal distribution. Because $4.61 / 1.54 \approx 2.99$, the value 24.0°C lies almost $3\hat{\sigma}$ below the mean; the probability that the normal distribution generates a value at least this far below the mean is about 0.14%, which is less than 0.15%, and thus 24.0°C can be identified as an outlier.
Example 12.8 elaborates a simple yet practical outlier detection method. It simply labels any object as an outlier if it is more than 3σ away from the mean of the estimated distribution, where σ is the standard deviation.
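The rule from Example 12.8 can be sketched in a few lines of plain Python. This is a minimal sketch with function names of my own choosing; it flags values whose one-sided tail probability under the fitted normal model falls below 0.15%, the tail mass at the 3σ boundary.

```python
import math

def normal_tail_prob(x, mu, sigma):
    """One-sided probability that N(mu, sigma^2) generates a value
    at least as far from mu as x is."""
    z = abs(x - mu) / sigma
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

def detect_outliers(data, cutoff=0.0015):
    """Fit N(mu, sigma^2) by maximum likelihood (Eqs. 12.2 and 12.3),
    then flag values whose tail probability is below the cutoff
    (0.15%, i.e. roughly the 3-sigma rule)."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return [x for x in data if normal_tail_prob(x, mu, sigma) < cutoff]

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
print(detect_outliers(temps))  # → [24.0]
```

Using the tail probability rather than a hard |x − μ| > 3σ test keeps the criterion aligned with the "less than 0.15%" argument in the example, since 24.0°C deviates by just under 3σ̂ from the estimated mean.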
Such straightforward methods for statistical outlier detection can also be used in visualization. For example, the boxplot method (described in Chapter 2) plots the univariate input data using