Data Mining_ Concepts and Techniques - Jiawei Han [339]
On the other hand, o1 and o2 can be identified as outliers when they are considered locally with respect to cluster C1 because o1 and o2 deviate significantly from the objects in C1. Moreover, o1 and o2 are also far from the objects in C2.
To summarize, distance-based outlier detection methods cannot capture local outliers like o1 and o2. Note that the distance between object o4 and its nearest neighbors is much greater than the distance between o1 and its nearest neighbors. However, because o4 is local to cluster C2(which is sparse), o4 is not considered a local outlier.
“How can we formulate the local outliers as illustrated inExample 12.14?” The critical idea here is that we need to compare the density around an object with the density around its local neighbors. The basic assumption of density-based outlier detection methods is that the density around a nonoutlier object is similar to the density around its neighbors, while the density around an outlier object is significantly different from the density around its neighbors.
Based on the preceding, density-based outlier detection methods use the relative density of an object against its neighbors to indicate the degree to which an object is an outlier.
Now, let's consider how to measure the relative density of an object, o, given a set of objects, D. The k-distance of o, denoted by distk(o), is the distance, dist( o, p), between o and another object, p , such that
■ There are at least k objects o′ {o} such that dist (o, o′) (o, p).
■ There are at most objects o″ { o } such that dist (o, o″)(o, p).
In other words, is the distance between o and its k-nearest neighbor. Consequently, the k-distance neighborhood of o contains all objects of which the distance to o is not greater than , the k-distance of o, denoted by
(12.11)
Note that Nk(o) may contain more than k objects because multiple objects may each be the same distance away from o.
We can use the average distance from the objects in Nk(o) to o as the measure of the local density of o. However, such a straightforward measure has a problem: If o has very close neighbors o′ such that (o, o′) is very small, the statistical fluctuations of the distance measure can be undesirably high. To overcome this problem, we can switch to the following reachability distance measure by adding a smoothing effect.
For two objects, o and o′, the reachability distance from o′ to o is (o o′) if (o, o′) (o), and (o) otherwise. That is,
(12.12)
Here, k is a user-specified parameter that controls the smoothing effect. Essentially, k specifies the minimum neighborhood to be examined to determine the local density of an object. Importantly, reachability distance is not symmetric, that is, in general, .
Now, we can define the local reachability density of an object, o, as
(12.13)
There is a critical difference between the density measure here for outlier detection and that in density-based clustering (Section 12.5). In density-based clustering, to determine whether an object can be considered a core object in a density-based cluster, we use two parameters: a radius parameter, r, to specify the range of the neighborhood, and the minimum number of points in the r-neighborhood. Both parameters are global and are applied to every object. In contrast, as motivated by the observation that relative density is the key to finding local outliers, we use the parameter k to quantify the neighborhood and do not need to specify the minimum number of objects in the neighborhood as a requirement of density. We instead calculate the local reachability density for an object and compare it with that of its neighbors to quantify the degree to which the object is considered an outlier.
Specifically, we define the local outlier factor of an object o as
(12.14)
In other words, the local outlier factor is the average of the ratio of the local reachability density of o and those of o's k-nearest neighbors. The lower the local reachability density of o (i.e., the smaller the item ) and the higher