graphical or visualization techniques,
statistical-based techniques,
distance-based techniques, and
model-based techniques.
Examples of visualization methods include the boxplot (1-D), scatter plot (2-D), and spin plot (3-D), and they will be explained in the following chapters. Data-visualization methods that are useful for outlier detection in one to three dimensions are much weaker for multidimensional data because of the lack of adequate visualization methodologies for n-dimensional spaces. An illustrative example of a visualization of 2-D samples and visual detection of outliers is given in Figures 2.6 and 2.7. The main limitations of the approach are that it is time-consuming and that the judgment of what constitutes an outlier remains subjective.
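As a minimal sketch of how such visual inspection might be carried out in practice, the following Python fragment draws a boxplot and a scatter plot for a small synthetic sample. The use of matplotlib, the synthetic data, and the 1.5 × IQR whisker convention (matplotlib's boxplot default) are assumptions of this sketch, not prescriptions from the text.

```python
# Sketch of visual outlier inspection for 1-D and 2-D data.
# Assumes numpy and matplotlib are available; data are synthetic.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)            # 1-D sample around 50
x = np.append(x, [156, 139, -67])      # a few injected extreme values
y = rng.normal(50, 10, x.size)         # second dimension for the scatter plot

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Boxplot (1-D): points beyond the whiskers are drawn individually as suspects.
ax1.boxplot(x, vert=False)
ax1.set_title("Boxplot (1-D)")

# Scatter plot (2-D): outliers appear far from the main cloud of samples.
ax2.scatter(x, y, s=10)
ax2.set_title("Scatter plot (2-D)")

plt.tight_layout()
plt.show()
```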
Figure 2.6. Outliers for univariate data based on mean value and standard deviation.
Figure 2.7. Two-dimensional data set with one outlying sample.
Statistically based outlier detection methods can be divided between univariate methods, proposed in earlier works in this field, and multivariate methods, which usually form most of the current body of research. Statistical methods either assume a known underlying distribution of the observations or, at least, they are based on statistical estimates of unknown distribution parameters. These methods flag as outliers those observations that deviate from the model assumptions. The approach is often unsuitable for high-dimensional data sets and for arbitrary data sets without prior knowledge of the underlying data distribution.
Most of the earliest univariate methods for outlier detection rely on the assumption of an underlying known distribution of the data, which is assumed to be independent and identically distributed. Moreover, many discordance tests for detecting univariate outliers further assume that the distribution parameters and the type of expected outliers are also known. Although traditionally the normal distribution has been used as the target distribution, this definition can easily be extended to any unimodal symmetric distribution with a positive density function. Traditionally, the sample mean and the sample variance give good estimates of data location and data shape if the data are not contaminated by outliers. When the database is contaminated, these parameters may deviate and significantly degrade the outlier-detection performance. Needless to say, in real-world data-mining applications, these assumptions are often violated.
The simplest approach to outlier detection for 1-D samples is based on traditional unimodal statistics. Assuming that the distribution of values is given, it is necessary to find basic statistical parameters such as the mean value and the variance. Based on these values and the expected (or predicted) number of outliers, it is possible to establish a threshold value as a function of the variance. All samples outside the threshold value are candidates for outliers, as presented in Figure 2.6. The main problem with this simple methodology is the a priori assumption about the data distribution; in most real-world examples, the data distribution may not be known.
For example, if the given data set represents the feature age with 20 different values:
then, the corresponding statistical parameters are
If we select the threshold value for normal distribution of data as
then, all data that are out of the range [−54.1, 131.2] will be potential outliers. Additional knowledge of the characteristics of the feature (age is always greater than 0) may further reduce the range to [0, 131.2]. In our example there are three values that are outliers based on the given criteria: 156, 139, and −67. With high probability we can conclude that all three of them are typing errors (data entered with additional digits or an additional "−" sign).
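To make the procedure concrete, a minimal Python sketch follows. The 20-value age list is a hypothetical sample constructed around the three extreme values named above, not the book's actual data, and the threshold of mean ± 2 standard deviations is one common choice rather than the book's stated formula.

```python
# Sketch of the simple mean/standard-deviation thresholding described above.
# The data and the mean +/- 2 std threshold are illustrative assumptions.
import statistics

age = [3, 56, 23, 39, 156, 52, 41, 22, 9, 28,
       139, 31, 55, 20, -67, 37, 11, 55, 45, 37]   # hypothetical feature values

mean = statistics.mean(age)
std = statistics.pstdev(age)          # population standard deviation

low, high = mean - 2 * std, mean + 2 * std
# Domain knowledge: age cannot be negative, so the lower bound is clipped at 0.
low = max(low, 0.0)

outliers = [v for v in age if not (low <= v <= high)]
print(f"range = [{low:.1f}, {high:.1f}], outliers = {outliers}")
```

Running the sketch flags the same kind of extreme values (here 156, 139, and −67) as potential outliers, illustrating how the threshold rule combined with simple domain knowledge isolates likely data-entry errors.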
An additional single-dimensional method is Grubbs’