■ Level-1 cell property: Given any possible point, x, of C, and any possible point, y, in a level-1 cell, dist(x, y) ≤ r.
■ Level-2 cell property: Given any possible point, x, of C, and any point, y, outside of C and the level-1 cells, if dist(x, y) ≤ r, then y is in a level-2 cell.
Let a be the number of objects in cell C, b1 be the total number of objects in the level-1 cells, and b2 be the total number of objects in the level-2 cells. We can apply the following rules.
■ Level-1 cell pruning rule: Based on the level-1 cell property, if a + b1 > ⌈πn⌉ (where n is the total number of objects in the data set), then every object o in C is not a DB(r, π)-outlier, because all the objects in C and the level-1 cells are in the r-neighborhood of o, and there are at least ⌈πn⌉ such neighbors.
■ Level-2 cell pruning rule: Based on the level-2 cell property, if a + b1 + b2 < ⌈πn⌉ + 1, then all objects in C are DB(r, π)-outliers, because each of their r-neighborhoods has fewer than ⌈πn⌉ other objects.
Using the preceding two rules, the CELL method organizes objects into groups using a grid: all objects in a cell form a group. For a cell satisfying one of the two rules, we can determine at once that all of its objects are outliers or that all of them are nonoutliers, and thus do not need to check those objects one by one. Moreover, to apply the two rules, we need only examine a limited number of cells close to the target cell rather than the whole data set.
Many objects can therefore be determined to be either nonoutliers or outliers, and we need only check the objects that cannot be pruned by the two rules. Even for such an object, o, we need only compute the distance between o and the objects in the level-2 cells with respect to o. This is because all objects in the level-1 cells are within distance r of o, while all objects outside the level-1 and level-2 cells must be at a distance of more than r from o and thus cannot be in the r-neighborhood of o.
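To make the scheme concrete, here is a minimal sketch in Python for two-dimensional data. It is not the book's implementation: the function name cell_outliers, the grid bookkeeping, and the neighbor-ring counting are simplifying assumptions, and the level-2 extent of three cells is specific to two dimensions, following the method's grid construction in which each cell has a diagonal of length r/2.

import math
from collections import defaultdict
from itertools import product

def cell_outliers(points, r, pi):
    # Sketch of CELL-style pruning for DB(r, pi)-outliers in 2-D.
    # Cells have diagonal r/2, hence side length r / (2 * sqrt(2));
    # level-1 cells lie within 1 cell of C, level-2 cells within 3 cells.
    n = len(points)
    threshold = math.ceil(pi * n)        # the ceil(pi * n) used by both rules
    side = r / (2 * math.sqrt(2))

    cells = defaultdict(list)            # group the objects by grid cell
    for p in points:
        cells[(int(p[0] // side), int(p[1] // side))].append(p)

    def ring_count(c, lo, hi):
        # Number of objects in cells at Chebyshev distance lo..hi from c.
        return sum(len(cells.get((c[0] + dx, c[1] + dy), []))
                   for dx, dy in product(range(-hi, hi + 1), repeat=2)
                   if max(abs(dx), abs(dy)) >= lo)

    outliers = []
    for c, objs in cells.items():
        a, b1, b2 = len(objs), ring_count(c, 1, 1), ring_count(c, 2, 3)
        if a + b1 > threshold:           # level-1 pruning: all nonoutliers
            continue
        if a + b1 + b2 < threshold + 1:  # level-2 pruning: all outliers
            outliers.extend(objs)
            continue
        # Undecided cell: objects in C and the level-1 cells are
        # automatically within r of o, so only level-2 objects need
        # explicit distance checks.
        level2 = [q for dx, dy in product(range(-3, 4), repeat=2)
                  if max(abs(dx), abs(dy)) >= 2
                  for q in cells.get((c[0] + dx, c[1] + dy), [])]
        for o in objs:
            neighbors = (a - 1) + b1 + sum(1 for q in level2
                                           if math.dist(o, q) <= r)
            if neighbors < threshold:
                outliers.append(o)
    return outliers

Note how the two pruning rules decide whole cells at once; the final loop touches only objects in cells that neither rule could decide, and even then measures distances only to level-2 objects, mirroring the argument above.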
When the data set is very large, so that most of the data are stored on disk, the CELL method may incur many random disk accesses, which are costly. An alternative method uses a very small amount of main memory (about 1% of the data set's size) to mine all outliers by scanning the data set three times. First, a sample, S, of the given data set, D, is created using sampling with replacement. Each object in S is treated as the centroid of a partition, and the objects in D are assigned to the partitions based on distance. These steps are completed in one scan of D. Candidate outliers are identified in a second scan, and after a third scan all DB(r, π)-outliers have been found.
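As a rough, hedged sketch of the first scan only (the candidate-identification and verification passes are omitted), the following Python fragment draws the sample with replacement and assigns every object to the partition of its nearest sampled centroid. The function name first_scan and its parameters are illustrative, not part of the published algorithm.

import math
import random

def first_scan(D, sample_size, seed=0):
    # Draw S from D by sampling with replacement; each sampled object
    # becomes the centroid of a partition.
    rng = random.Random(seed)
    S = [D[rng.randrange(len(D))] for _ in range(sample_size)]
    partitions = {i: [] for i in range(len(S))}
    # One full scan of D: assign each object to its nearest centroid.
    for o in D:
        i = min(range(len(S)), key=lambda j: math.dist(o, S[j]))
        partitions[i].append(o)
    return S, partitions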
12.4.3. Density-Based Outlier Detection
Distance-based outliers, such as DB(r, π)-outliers, are just one type of outlier. Specifically, distance-based outlier detection takes a global view of the data set. Such outliers can be regarded as “global outliers” for two reasons:
■ A DB(r, π)-outlier, for example, is far (as quantified by parameter r) from at least (1 − π) × 100% of the objects in the data set. In other words, such an outlier is remote from the majority of the data.
■ To detect distance-based outliers, we need two global parameters, r and π, which are applied to every outlier object. A minimal sketch of this definition follows the list.
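The sketch below spells out how the two global parameters act, under the convention used by the pruning rules above: an object is a DB(r, π)-outlier when fewer than ⌈πn⌉ other objects lie within distance r of it. The function name is illustrative, and applying this check to every object costs time quadratic in |D|, which is exactly the cost the CELL method tries to avoid.

import math

def is_db_outlier(o, D, r, pi):
    # o is a DB(r, pi)-outlier if fewer than ceil(pi * |D|) other
    # objects lie within distance r of o; r and pi are global.
    neighbors = sum(1 for q in D if q is not o and math.dist(o, q) <= r)
    return neighbors < math.ceil(pi * len(D))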
Many real-world data sets demonstrate a more complex structure, where objects may be considered outliers with respect to their local neighborhoods, rather than with respect to the global data distribution. Let's look at an example.
Local proximity-based outliers
Consider the data points in Figure 12.8. There are two clusters: C1 is dense, and C2 is sparse. Object o3 can be detected as a distance-based outlier because it is far from the majority of the data set.
Now, let's consider objects o1 and o2. Are they outliers? On the one hand, the distance from o1 and o2 to the objects in the dense cluster, C1, is smaller than the average distance between an object in cluster C2 and its nearest neighbor. Thus, o1 and o2 are not distance-based outliers. In fact, if we were to categorize o1 and o2 as DB(r, π)-outliers, we would have to classify all the objects in cluster C2 as DB(r, π)-outliers as well.
[Figure 12.8: Two clusters, C1 (dense) and C2 (sparse), with outlier objects o1, o2, and o3.]
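A small numeric sketch can make the contrast concrete. The coordinates below are invented to mimic Figure 12.8 (a dense cluster standing in for C1, a sparse one for C2, and a point standing in for o1); they are not from the book. Measured globally, o1's nearest-neighbor distance is smaller than the average nearest-neighbor distance inside C2, so no global radius r can flag o1 without also flagging all of C2; measured against its local neighborhood C1, o1 stands out by an order of magnitude.

import math

def nn_dist(p, points):
    # Distance from p to its nearest neighbor among `points`.
    return min(math.dist(p, q) for q in points if q != p)

# Invented data mimicking Figure 12.8: C1 is dense, C2 is sparse.
C1 = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]
C2 = [(5.0, 5.0), (6.5, 5.0), (5.0, 6.5), (6.5, 6.5)]
o1 = (0.6, 0.6)                      # near C1 by global standards
data = C1 + C2 + [o1]

# Globally, o1's nearest-neighbor distance is smaller than the average
# nearest-neighbor distance inside the sparse cluster C2 ...
print(nn_dist(o1, data))                              # ~0.71
print(sum(nn_dist(p, C2) for p in C2) / len(C2))      # 1.5

# ... but relative to its local neighborhood C1, o1 is conspicuously far.
print(sum(nn_dist(p, C1) for p in C1) / len(C1))      # ~0.07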