Data Mining_ Concepts and Techniques - Jiawei Han [51]
(2.16)
Another well-known measure is the Manhattan (or city block) distance, named so because it is the distance in blocks between any two points in a city (such as 2 blocks down and 3 blocks over for a total of 5 blocks). It is defined as
(2.17)
Both the Euclidean and the Manhattan distance satisfy the following mathematical properties:
Non-negativity: : Distance is a non-negative number.
Identity of indiscernibles: : The distance of an object to itself is 0.
Symmetry: : Distance is a symmetric function.
Triangle inequality: : Going directly from object i to object j in space is no more than making a detour over any other object k.
A measure that satisfies these conditions is known as metric. Please note that the non-negativity property is implied by the other three properties.
Euclidean distance and Manhattan distance
Let x1 = (1, 2) and x2 = (3, 5) represent two objects as shown in Figure 2.23. The Euclidean distance between the two is . The Manhattan distance between the two is 2 + 3 = 5.
Figure 2.23 Euclidean, Manhattan, and supremum distances between two objects.
Minkowski distance is a generalization of the Euclidean and Manhattan distances. It is defined as
(2.18)
where h is a real number such that h ≥ 1. (Such a distance is also called Lp norm in some literature, where the symbol p refers to our notation of h. We have kept p as the number of attributes to be consistent with the rest of this chapter.) It represents the Manhattan distance when h = 1 (i.e., L1 norm) and Euclidean distance when h = 2 (i.e., L2 norm).
The supremum distance (also referred to as Lmax, L∞ norm and as the Chebyshev distance) is a generalization of the Minkowski distance for h → ∞. To compute it, we find the attribute f that gives the maximum difference in values between the two objects. This difference is the supremum distance, defined more formally as:
(2.19)
The L∞ norm is also known as the uniform norm.
Supremum distance
Let's use the same two objects, x1 = (1, 2) and x2 = (3, 5), as in Figure 2.23. The second attribute gives the greatest difference between values for the objects, which is 5 − 2 = 3. This is the supremum distance between both objects.
If each attribute is assigned a weight according to its perceived importance, the weighted Euclidean distance can be computed as
(2.20)
Weighting can also be applied to other distance measures as well.
2.4.5. Proximity Measures for Ordinal Attributes
The values of an ordinal attribute have a meaningful order or ranking about them, yet the magnitude between successive values is unknown (Section 2.1.4). An example includes the sequence small, medium, large for a size attribute. Ordinal attributes may also be obtained from the discretization of numeric attributes by splitting the value range into a finite number of categories. These categories are organized into ranks. That is, the range of a numeric attribute can be mapped to an ordinal attribute f having Mf states. For example, the range of the interval-scaled attribute temperature (in Celsius) can be organized into the following states: −30 to −10, −10 to 10, 10 to 30, representing the categories cold temperature, moderate temperature, and warm temperature, respectively. Let M represent the number of possible states that an ordinal attribute can have. These ordered states define the ranking .
“How are ordinal attributes handled?” The treatment of ordinal attributes is quite similar to that of numeric attributes when computing dissimilarity between objects. Suppose that f is an attribute from a set of ordinal attributes describing n objects. The dissimilarity computation with respect to f involves the following steps:
1. The value of f for the ith object is xif, and f has Mf ordered states, representing the ranking . Replace each xif by its corresponding rank, .
2. Since each ordinal attribute can have a different number of states, it is often necessary to