Data Mining_ Concepts and Techniques - Jiawei Han [49]
$$\mathrm{sim}(i, j) = 1 - d(i, j) \tag{2.10}$$
where sim(i, j) is the similarity between objects i and j. Throughout the rest of this chapter, we will also comment on measures of similarity.
A data matrix is made up of two entities or “things,” namely rows (for objects) and columns (for attributes). Therefore, the data matrix is often called a two-mode matrix. The dissimilarity matrix contains one kind of entity (dissimilarities) and so is called a one-mode matrix. Many clustering and nearest-neighbor algorithms operate on a dissimilarity matrix. Data in the form of a data matrix can be transformed into a dissimilarity matrix before applying such algorithms.
2.4.2. Proximity Measures for Nominal Attributes
A nominal attribute can take on two or more states (Section 2.1.2). For example, map_color is a nominal attribute that may have, say, five states: red, yellow, green, pink, and blue.
Let the number of states of a nominal attribute be M. The states can be denoted by letters, symbols, or a set of integers, such as 1, 2, …, M. Notice that such integers are used just for data handling and do not represent any specific ordering.
“How is dissimilarity computed between objects described by nominal attributes?” The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:
$$d(i, j) = \frac{p - m}{p} \tag{2.11}$$
where m is the number of matches (i.e., the number of attributes for which i and j are in the same state), and p is the total number of attributes describing the objects. Weights can be assigned to increase the effect of m or to assign greater weight to the matches in attributes having a larger number of states.
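As a minimal sketch of Eq. (2.11), the following Python function counts the matches between two objects given as equal-length sequences of nominal states. The function name `nominal_dissimilarity` and the sample objects are illustrative assumptions, not part of the text, and the optional weighting mentioned above is left out.

```python
from typing import Hashable, Sequence


def nominal_dissimilarity(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """Return d(i, j) = (p - m) / p for two objects with nominal attributes."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    p = len(x)  # total number of attributes describing the objects
    if p == 0:
        raise ValueError("at least one attribute is required")
    # m counts the attributes on which both objects are in the same state
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p


# Hypothetical objects described by three nominal attributes each
obj_i = ("red", "circle", "small")
obj_j = ("red", "square", "small")
print(nominal_dissimilarity(obj_i, obj_j))  # one mismatch out of three attributes
```

States are compared only for equality, so it does not matter whether they are stored as strings, symbols, or integer codes.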
Dissimilarity between nominal attributes
Suppose that we have the sample data of Table 2.2, except that only the object-identifier and the attribute test-1 are available, where test-1 is nominal. (We will use test-2 and test-3 in later examples.) Let's compute the dissimilarity matrix (Eq. 2.9), that is,
$$
\begin{bmatrix}
0 & & & \\
d(2, 1) & 0 & & \\
d(3, 1) & d(3, 2) & 0 & \\
d(4, 1) & d(4, 2) & d(4, 3) & 0
\end{bmatrix}
$$
Since here we have one nominal attribute, test-1, we set p = 1 in Eq. (2.11) so that d(i, j) evaluates to 0 if objects i and j match, and 1 if the objects differ. Thus, we get
$$
\begin{bmatrix}
0 & & & \\
1 & 0 & & \\
1 & 1 & 0 & \\
0 & 1 & 1 & 0
\end{bmatrix}
$$
From this, we see that all objects are dissimilar except objects 1 and 4 (i.e., d(4, 1) = 0).
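The same result can be checked with a short Python sketch. It builds the lower-triangular dissimilarity matrix for the test-1 column of Table 2.2; the helper name `dissimilarity_matrix` is an assumption made for illustration.

```python
# test-1 values for objects 1 to 4, taken from Table 2.2
test_1 = ["code A", "code B", "code C", "code A"]


def dissimilarity_matrix(values):
    """Lower-triangular matrix for a single nominal attribute (p = 1)."""
    n = len(values)
    # With p = 1, d(i, j) is 0 when the states match and 1 when they differ
    return [
        [0 if values[i] == values[j] else 1 for j in range(i + 1)]
        for i in range(n)
    ]


matrix = dissimilarity_matrix(test_1)
for row in matrix:
    print(row)
# [0]
# [1, 0]
# [1, 1, 0]
# [0, 1, 1, 0]
```

Row and column indices in the code start at 0, so the entry `matrix[3][0]` corresponds to d(4, 1).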
Table 2.2 A Sample Data Table Containing Attributes of Mixed Type
Object Identifier    test-1 (nominal)    test-2 (ordinal)    test-3 (numeric)
1                    code A              excellent           45
2                    code B              fair                22
3                    code C              good                64
4                    code A              excellent           28
Alternatively, similarity can be computed as
$$\mathrm{sim}(i, j) = 1 - d(i, j) = \frac{m}{p} \tag{2.12}$$
Proximity between objects described by nominal attributes can be computed using an alternative encoding scheme. Nominal attributes can be encoded using asymmetric binary attributes by creating a new binary attribute for each of the M states. For an object with a given state value, the binary attribute representing that state is set to 1, while the remaining binary attributes are set to 0. For example, to encode the nominal attribute map_color, a binary attribute can be created for each of the five colors previously listed. For an object having the color yellow, the yellow attribute is set to 1, while the remaining four attributes are set to 0. Proximity measures for this form of encoding can be calculated using the methods discussed in the next subsection.
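The encoding described above can be sketched in Python as follows. The list `MAP_COLOR_STATES` uses the five colors given earlier for map_color; the function name `encode_nominal` is a hypothetical helper, not something defined in the text.

```python
# The five states of map_color listed earlier; the order is arbitrary
MAP_COLOR_STATES = ["red", "yellow", "green", "pink", "blue"]


def encode_nominal(value, states):
    """Encode one nominal value as M binary attributes, one per state."""
    if value not in states:
        raise ValueError(f"unknown state: {value!r}")
    # The attribute for the object's own state is 1; all others are 0
    return [1 if value == state else 0 for state in states]


print(encode_nominal("yellow", MAP_COLOR_STATES))  # [0, 1, 0, 0, 0]
```

Each encoded object has exactly one attribute set to 1, which is why the resulting binary attributes are treated as asymmetric: the shared 0 entries carry little information about how alike two objects are.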
2.4.3. Proximity Measures for Binary Attributes
Let's look at dissimilarity and similarity measures for objects described by either symmetric or asymmetric binary attributes.
Recall that a binary attribute has only one of two states: 0 and 1, where 0 means that the attribute is absent, and 1 means that it is present (Section 2.1.3). Given the attribute smoker describing a patient, for instance, 1 indicates that the patient smokes, while 0 indicates that the patient does not. Treating binary attributes as if they are numeric can be misleading. Therefore, methods specific to binary data are necessary for computing dissimilarity.
“So, how can we compute the dissimilarity between two binary attributes?” One approach involves computing a dissimilarity matrix from the given binary data. If all binary attributes are thought of as having the