Data Mining_ Concepts and Techniques - Jiawei Han [50]
Table 2.3 Contingency Table for Binary Attributes

                    Object j
                    1        0        sum
Object i    1       q        r        q + r
            0       s        t        s + t
            sum     q + s    r + t    p
Recall that for symmetric binary attributes, each state is equally valuable. Dissimilarity that is based on symmetric binary attributes is called symmetric binary dissimilarity. If objects i and j are described by symmetric binary attributes, then the dissimilarity between i and j is
$$d(i, j) = \frac{r + s}{q + r + s + t}. \tag{2.13}$$
For asymmetric binary attributes, the two states are not equally important, such as the positive (1) and negative (0) outcomes of a disease test. Given two asymmetric binary attributes, the agreement of two 1s (a positive match) is then considered more significant than that of two 0s (a negative match). Therefore, such binary attributes are often considered “monary” (having one state). The dissimilarity based on these attributes is called asymmetric binary dissimilarity, where the number of negative matches, t, is considered unimportant and is thus ignored in the following computation:
$$d(i, j) = \frac{r + s}{q + r + s}. \tag{2.14}$$
Complementarily, we can measure the difference between two binary attributes based on the notion of similarity instead of dissimilarity. For example, the asymmetric binary similarity between the objects i and j can be computed as
$$sim(i, j) = \frac{q}{q + r + s} = 1 - d(i, j). \tag{2.15}$$
The coefficient sim(i, j) of Eq. (2.15) is called the Jaccard coefficient and is popularly referenced in the literature.
When both symmetric and asymmetric binary attributes occur in the same data set, the mixed attributes approach described in Section 2.4.6 can be applied.
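The three measures in Eqs. (2.13) through (2.15) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the example vectors are hypothetical, objects are assumed to be equal-length lists of 0s and 1s, and the handling of two all-zero objects (dissimilarity 0) is an assumption that the text does not address.

```python
def contingency_counts(x, y):
    """Return (q, r, s, t) as laid out in Table 2.3 for two 0/1 vectors."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("objects must be non-empty and of equal length")
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # both 1
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # i is 1, j is 0
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # i is 0, j is 1
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)  # both 0
    return q, r, s, t


def symmetric_dissimilarity(x, y):
    """Eq. (2.13): mismatches over all attributes."""
    q, r, s, t = contingency_counts(x, y)
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(x, y):
    """Eq. (2.14): negative matches (t) are ignored."""
    q, r, s, _ = contingency_counts(x, y)
    denom = q + r + s
    # Assumption: two all-zero objects are treated as identical.
    return (r + s) / denom if denom else 0.0


def jaccard_similarity(x, y):
    """Eq. (2.15): sim(i, j) = 1 - d(i, j) for asymmetric attributes."""
    return 1.0 - asymmetric_dissimilarity(x, y)


# Hypothetical objects described by six binary attributes.
x = [1, 1, 0, 0, 1, 0]
y = [1, 0, 1, 0, 0, 0]

print(contingency_counts(x, y))        # (1, 2, 1, 2)
print(symmetric_dissimilarity(x, y))   # 0.5
print(asymmetric_dissimilarity(x, y))  # 0.75
print(jaccard_similarity(x, y))        # 0.25
```

Note how dropping the two negative matches raises the dissimilarity from 0.5 to 0.75: the objects share only one positive attribute, which is what the asymmetric measure emphasizes.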
Dissimilarity between binary attributes
Suppose that a patient record table (Table 2.4) contains the attributes name, gender, fever, cough, test-1, test-2, test-3, and test-4, where name is an object identifier, gender is a symmetric attribute, and the remaining attributes are asymmetric binary.
Table 2.4 Relational Table Where Patients Are Described by Binary Attributes

name   gender   fever   cough   test-1   test-2   test-3   test-4
Jack   M        Y       N       P        N        N        N
Jim    M        Y       Y       N        N        N        N
Mary   F        Y       N       P        N        P        N
⋮      ⋮        ⋮       ⋮       ⋮        ⋮        ⋮        ⋮
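For asymmetric attribute values, let the values Y (yes) and P (positive) be set to 1, and the value N (no or negative) be set to 0. Suppose that the distance between objects (patients) is computed based only on the asymmetric attributes. According to Eq. (2.14), which divides the number of mismatches, r + s, by q + r + s, the distance between each pair of the three patients (Jack, Mary, and Jim) is
$$d(\text{Jack}, \text{Jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67,$$
$$d(\text{Jack}, \text{Mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33,$$
$$d(\text{Jim}, \text{Mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75.$$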
These measurements suggest that Jim and Mary are unlikely to have a similar disease because they have the highest dissimilarity value among the three pairs. Of the three patients, Jack and Mary are the most likely to have a similar disease.
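The pairwise comparison for the three patients in Table 2.4 can be reproduced with a short Python sketch. The variable and function names are illustrative assumptions; only the six asymmetric attributes (fever, cough, test-1 through test-4) are encoded, with Y and P mapped to 1 and N mapped to 0 as described above.

```python
from itertools import combinations

# Asymmetric attributes only: fever, cough, test-1, test-2, test-3, test-4.
patients = {
    "Jack": ["Y", "N", "P", "N", "N", "N"],
    "Jim":  ["Y", "Y", "N", "N", "N", "N"],
    "Mary": ["Y", "N", "P", "N", "P", "N"],
}


def encode(values):
    """Map Y (yes) and P (positive) to 1, and N to 0."""
    return [1 if v in ("Y", "P") else 0 for v in values]


def asymmetric_dissimilarity(x, y):
    """Mismatches (r + s) over q + r + s; negative matches are ignored."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    denom = q + mismatches
    # Assumption: two all-zero records are treated as identical.
    return mismatches / denom if denom else 0.0


distances = {
    (a, b): round(asymmetric_dissimilarity(encode(patients[a]),
                                           encode(patients[b])), 2)
    for a, b in combinations(patients, 2)
}

for pair, d in distances.items():
    print(pair, d)

most_similar = min(distances, key=distances.get)
least_similar = max(distances, key=distances.get)
print("most similar:", most_similar)    # ('Jack', 'Mary')
print("least similar:", least_similar)  # ('Jim', 'Mary')
```

The rounded values are 0.67 for Jack and Jim, 0.33 for Jack and Mary, and 0.75 for Jim and Mary.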
2.4.4. Dissimilarity of Numeric Data: Minkowski Distance
In this section, we describe distance measures that are commonly used for computing the dissimilarity of objects described by numeric attributes. These measures include the Euclidean, Manhattan, and Minkowski distances.
In some cases, the data are normalized before applying distance calculations. This involves transforming the data to fall within a smaller or common range, such as [−1, 1] or [0.0, 1.0]. Consider a height attribute, for example, which could be measured in either meters or inches. In general, expressing an attribute in smaller units will lead to a larger range for that attribute, and thus tend to give such attributes greater effect or “weight.” Normalizing the data attempts to give all attributes an equal weight. It may or may not be useful in a particular application. Methods for normalizing data are discussed in detail in Chapter 3 on data preprocessing.
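To illustrate the effect of units, the sketch below rescales a hypothetical height attribute to [0.0, 1.0]. It assumes min-max scaling, which is only one of several possible normalization methods; the function name and the sample heights are made up for illustration.

```python
import math


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute maps to the lower bound.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


# Hypothetical heights, first in meters, then converted to inches.
heights_m = [1.50, 1.65, 1.80, 1.95]
heights_in = [h * 39.3701 for h in heights_m]

# The raw ranges differ by the conversion factor between the units.
range_m = max(heights_m) - min(heights_m)
range_in = max(heights_in) - min(heights_in)
print(range_m, range_in)

# After normalization, both versions of the attribute coincide.
norm_m = min_max_normalize(heights_m)
norm_in = min_max_normalize(heights_in)
print(norm_m)
print(norm_in)
```

Before scaling, the inch version spans a range roughly 39 times larger than the meter version and would dominate a distance computed over several attributes. After scaling, the two are the same up to floating-point rounding, so the choice of unit no longer affects the attribute's weight.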
The most popular distance measure is Euclidean distance (i.e., straight line or “as the