Online Book Reader

Home Category

Choose a category
All
Classic-Fiction

Data Mining_ Concepts and Techniques - Jiawei Han [337]

By Root 1399 0

density is relatively much lower than that of its neighbors.

Let's start with distance-based outliers.

12.4.1. Distance-Based Outlier Detection and a Nested Loop Method

A representative method of proximity-based outlier detection uses the concept of distance-based outliers. For a set, D, of data objects to be analyzed, a user can specify a distance threshold, r, to define a reasonable neighborhood of an object. For each object, o, we can examine the number of other objects in the r-neighborhood of o. If most of the objects in D are far from o, that is, not in the r-neighborhood of o, then o can be regarded as an outlier.

Formally, let r be a distance threshold and π be a fraction threshold. An object, o, is a -outlier if

(12.10)

where is a distance measure.

Equivalently, we can determine whether an object, o, is a -outlier by checking the distance between o and its k-nearest neighbor, ok, where . Object o is an outlier if , because in such a case, there are fewer than k objects except for o that are in the r-neighborhood of o.

“How can we compute -outliers?” A straightforward approach is to use nested loops to check the r-neighborhood for every object, as shown in Figure 12.6. For any object, oi , we calculate the distance between oi and the other object, and count the number of other objects in the r-neighborhood of oi. Once we find other objects within a distance r from oi, the inner loop can be terminated because oi already violates (Eq. 12.10), and thus is not a -outlier. On the other hand, if the inner loop completes for oi, this means that oi has less than neighbors in a radius of r, and thus is a -outlier.

Figure 12.6 Nested loop algorithm for DB(r, π)-outlier detection.

The straightforward nested loop approach takes O(n2) time. Surprisingly, the actual CPU runtime is often linear with respect to the data set size. For most nonoutlier objects, the inner loop terminates early when the number of outliers in the data set is small, which should be the case most of the time. Correspondingly, only a small fraction of the data set is examined.

When mining large data sets where the complete set of objects cannot be held in main memory, the nested loop approach is still costly. Suppose the main memory has m pages for the mining. Instead of conducting the inner loop object by object, in such a case, the outer loop uses m − 1 pages to hold as many objects as possible and uses the remaining one page to run the inner loop. The inner loop cannot stop until all objects in the m − 1 pages are identified as not being outliers, which is very unlikely to happen. Correspondingly, it is likely that the algorithm has to incur input/output (I/O) cost, where b is the number of objects that can be held in one page.

The major cost in the nested loop method comes from two aspects. First, to check whether an object is an outlier, the nested loop method tests the object against the whole data set. To improve, we need to explore how to determine the outlierness of an object from the neighbors that are close to the object. Second, the nested loop method checks objects one by one. To improve, we should try to group objects according to their proximity, and check the outlierness of objects group by group most of the time. Section 12.4.2 introduces how to implement the preceding ideas.

12.4.2. A Grid-Based Method

CELL is a grid-based method for distance-based outlier detection. In this method, the data space is partitioned into a multidimensional grid, where each cell is a hypercube that has a diagonal of length , where r is a distance threshold parameter. In other words, if there are l dimensions, the length of each edge of a cell is .

Consider a 2-D data set, for example. Figure 12.7 shows part of the grid. The length of each edge of a cell is .

Figure 12.7 Grids in the CELL method.

Consider the cell C in Figure 12.7. The neighboring cells of C can be divided into two groups. The cells immediately next to C constitute the level-1 cells (labeled “1” in the figure), and the cells one or two cells away from C in any direction

Online Book Reader

Data Mining_ Concepts and Techniques - Jiawei Han [337]

®Online Book Reader