Online Book Reader

Home Category

Data Mining_ Concepts and Techniques - Jiawei Han [340]

By Root 1576 0
the local reachability densities of the k-nearest neighbors of o, the higher the LOF value is. This exactly captures a local outlier of which the local density is relatively low compared to the local densities of its k-nearest neighbors.

The local outlier factor has some nice properties. First, for an object deep within a consistent cluster, such as the points in the center of cluster C2 in Figure 12.8, the local outlier factor is close to 1. This property ensures that objects inside clusters, no matter whether the cluster is dense or sparse, will not be mislabeled as outliers.

Second, for an object o, the meaning of LOF(o) is easy to understand. Consider the objects in Figure 12.9, for example. For object o, let

(12.15)

be the minimum reachability distance from o to its k-nearest neighbors. Similarly, we can define

(12.16)

Figure 12.9 A property of LOF(o).

We also consider the neighbors of o's k-nearest neighbors. Let

(12.17)

and

(12.18)

Then, it can be shown that LOF (o) is bounded as

(12.19)

This result clearly shows that LOF captures the relative density of an object.

12.5. Clustering-Based Approaches


The notion of outliers is highly related to that of clusters. Clustering-based approaches detect outliers by examining the relationship between objects and clusters. Intuitively, an outlier is an object that belongs to a small and remote cluster, or does not belong to any cluster.

This leads to three general approaches to clustering-based outlier detection. Consider an object.

■ Does the object belong to any cluster? If not, then it is identified as an outlier.

■ Is there a large distance between the object and the cluster to which it is closest? If yes, it is an outlier.

■ Is the object part of a small or sparse cluster? If yes, then all the objects in that cluster are outliers.

Let's look at examples of each of these approaches.

Detecting outliers as objects that do not belong to any cluster

Gregarious animals (e.g., goats and deer) live and move in flocks. Using outlier detection, we can identify outliers as animals that are not part of a flock. Such animals may be either lost or wounded.

In Figure 12.10, each point represents an animal living in a group. Using a density-based clustering method, such as DBSCAN, we note that the black points belong to clusters. The white point, a, does not belong to any cluster, and thus is declared an outlier.

Figure 12.10 Object a is an outlier because it does not belong to any cluster.

The second approach to clustering-based outlier detection considers the distance between an object and the cluster to which it is closest. If the distance is large, then the object is likely an outlier with respect to the cluster. Thus, this approach detects individual outliers with respect to clusters.

Clustering-based outlier detection using distance to the closest cluster

Using the k-means clustering method, we can partition the data points shown in Figure 12.11 into three clusters, as shown using different symbols. The center of each cluster is marked with a +.

Figure 12.11 Outliers (a, b, c) are far from the clusters to which they are closest (with respect to the cluster centers).

For each object, o, we can assign an outlier score to the object according to the distance between the object and the center that is closest to the object. Suppose the closest center to o is co; then the distance between o and co is (o, co), and the average distance between co and the objects assigned to o is . The ratio measures how (o, co) stands out from the average. The larger the ratio, the farther away o is relative from the center, and the more likely o is an outlier. In Figure 12.11, points a, b, and c are relatively far away from their corresponding centers, and thus are suspected of being outliers.


This approach can also be used for intrusion detection, as described in Example 12.17.

Intrusion detection by clustering-based outlier detection

A bootstrap method was developed to detect intrusions in TCP connection data by considering the similarity

Return Main Page Previous Page Next Page

®Online Book Reader