Data Mining_ Concepts and Techniques - Jiawei Han [52]
$$z_{if} = \frac{r_{if} - 1}{M_f - 1} \tag{2.21}$$
3. Dissimilarity can then be computed using any of the distance measures described in Section 2.4.4 for numeric attributes, using $z_{if}$ to represent the value of attribute $f$ for the $i$th object.
Dissimilarity between ordinal attributes
Suppose that we have the sample data shown earlier in Table 2.2, except that this time only the object-identifier and the continuous ordinal attribute, test-2, are available. There are three states for test-2: fair, good, and excellent, that is, $M_f = 3$. For step 1, if we replace each value for test-2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3, respectively. Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0. For step 3, we can use, say, the Euclidean distance (Eq. 2.16), which results in the following dissimilarity matrix:
$$\begin{bmatrix} 0 & & & \\ 1.0 & 0 & & \\ 0.5 & 0.5 & 0 & \\ 0 & 1.0 & 0.5 & 0 \end{bmatrix}$$
Therefore, objects 1 and 2 are the most dissimilar, as are objects 2 and 4 (i.e., d(2, 1) = 1.0 and d(4, 2) = 1.0). This makes intuitive sense since objects 1 and 4 are both excellent. Object 2 is fair, which is at the opposite end of the range of values for test-2.
Similarity values for ordinal attributes can be interpreted from dissimilarity as $sim(i, j) = 1 - d(i, j)$.
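The three steps of the example can be sketched in Python as follows. This is a minimal sketch rather than the book's own code: the function name `ordinal_dissimilarity` and its parameters are hypothetical, and it assumes the attribute has at least two states so that the normalization is defined.

```python
def ordinal_dissimilarity(values, ordered_states):
    """Return the full n x n dissimilarity matrix for one ordinal attribute.

    values         : the attribute value of each object, in object order
    ordered_states : the M_f states of the attribute, lowest first
    """
    m = len(ordered_states)  # M_f, assumed to be at least 2
    # Step 1: replace each value by its rank 1..M_f.
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    # Step 2: map each rank onto [0.0, 1.0].
    z = [(rank[v] - 1) / (m - 1) for v in values]
    # Step 3: for a single attribute, Euclidean distance reduces to |z_i - z_j|.
    return [[abs(zi - zj) for zj in z] for zi in z]


# test-2 values of the four objects, as described in the example.
test2 = ["excellent", "fair", "good", "excellent"]
d = ordinal_dissimilarity(test2, ["fair", "good", "excellent"])

# Similarity read off the dissimilarity as 1 - d(i, j).
sim = [[1 - x for x in row] for row in d]
```

Indices are zero-based here, so `d[1][0]` corresponds to d(2, 1) in the text.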
2.4.6. Dissimilarity for Attributes of Mixed Types
Sections 2.4.2 through 2.4.5 discussed how to compute the dissimilarity between objects described by attributes of the same type, where these types may be either nominal, symmetric binary, asymmetric binary, numeric, or ordinal. However, in many real databases, objects are described by a mixture of attribute types. In general, a database can contain all of these attribute types.
“So, how can we compute the dissimilarity between objects of mixed attribute types?” One approach is to group each type of attribute together, performing separate data mining (e.g., clustering) analysis for each type. This is feasible if these analyses derive compatible results. However, in real applications, it is unlikely that a separate analysis per attribute type will generate compatible results.
A preferable approach is to process all attribute types together, performing a single analysis. One such technique combines the different attributes into a single dissimilarity matrix, bringing all of the meaningful attributes onto a common scale of the interval [0.0, 1.0].
Suppose that the data set contains p attributes of mixed type. The dissimilarity d(i, j) between objects i and j is defined as
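$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \tag{2.22}$$
where the indicator $\delta_{ij}^{(f)} = 0$ if either (1) $x_{if}$ or $x_{jf}$ is missing (i.e., there is no measurement of attribute $f$ for object $i$ or object $j$), or (2) $x_{if} = x_{jf} = 0$ and attribute $f$ is asymmetric binary; otherwise, $\delta_{ij}^{(f)} = 1$. The contribution of attribute $f$ to the dissimilarity between $i$ and $j$ (i.e., $d_{ij}^{(f)}$) is computed dependent on its type:
■ If f is numeric: $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$, where $h$ runs over all nonmissing objects for attribute $f$.
■ If f is nominal or binary: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$; otherwise, $d_{ij}^{(f)} = 1$.
■ If f is ordinal: compute the ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as numeric.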
These steps are identical to what we have already seen for each of the individual attribute types. The only difference is for numeric attributes, where we normalize so that the values map to the interval [0.0, 1.0]. Thus, the dissimilarity between objects can be computed even when the attributes describing the objects are of different types.
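The combined measure can be sketched in Python as shown here. This is an illustrative sketch under stated assumptions, not a reference implementation: the function name `mixed_dissimilarity`, the type labels, and the use of `None` for a missing value are all hypothetical choices. The sample rows stand in for Table 2.2; the test-2 values follow the ordinal example, while the test-1 and test-3 values are assumed here for illustration. When no attribute is comparable for a pair, the measure is undefined, and the sketch returns NaN as one possible convention.

```python
def mixed_dissimilarity(rows, kinds, states=None):
    """Return the full n x n matrix of d(i, j) for mixed attribute types.

    rows   : list of dicts, attribute name -> value (None marks a missing value)
    kinds  : dict, attribute name -> "numeric", "nominal", "binary",
             "asymmetric_binary", or "ordinal"
    states : dict, ordinal attribute name -> list of states, lowest first
    """
    states = states or {}
    # Pre-scale numeric and ordinal attributes so that |a - b| lies in [0, 1].
    scaled = {}
    for f, kind in kinds.items():
        if kind == "ordinal":
            order = states[f]
            denom = max(len(order) - 1, 1)  # M_f - 1, guarded for one state
            # order.index(v) equals rank - 1, so this is (r - 1) / (M_f - 1).
            scaled[f] = [None if r[f] is None else order.index(r[f]) / denom
                         for r in rows]
        elif kind == "numeric":
            present = [r[f] for r in rows if r[f] is not None]
            lo = min(present, default=0.0)
            span = (max(present, default=0.0) - lo) or 1.0  # avoid dividing by zero
            scaled[f] = [None if r[f] is None else (r[f] - lo) / span
                         for r in rows]

    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            num = den = 0.0
            for f, kind in kinds.items():
                a, b = rows[i][f], rows[j][f]
                if a is None or b is None:
                    continue  # indicator is 0: a value is missing
                if kind == "asymmetric_binary" and a == 0 and b == 0:
                    continue  # indicator is 0: negative match carries no weight
                if kind in ("numeric", "ordinal"):
                    contrib = abs(scaled[f][i] - scaled[f][j])
                else:  # nominal, or binary compared by equality
                    contrib = 0.0 if a == b else 1.0
                num += contrib
                den += 1.0
            d[i][j] = d[j][i] = num / den if den else float("nan")
    return d


# Illustrative rows (test-1 and test-3 values are assumptions).
rows = [
    {"test-1": "code A", "test-2": "excellent", "test-3": 45},
    {"test-1": "code B", "test-2": "fair", "test-3": 22},
    {"test-1": "code C", "test-2": "good", "test-3": 64},
    {"test-1": "code A", "test-2": "excellent", "test-3": 28},
]
kinds = {"test-1": "nominal", "test-2": "ordinal", "test-3": "numeric"}
d = mixed_dissimilarity(rows, kinds, {"test-2": ["fair", "good", "excellent"]})
```

Scaling each numeric attribute by its range up front gives the same contribution as dividing the absolute difference by the range inside the loop, and it lets numeric and ordinal attributes share one code path.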
Dissimilarity between attributes of mixed type
Let's compute a dissimilarity matrix for the objects in Table 2.2. Now we will consider all of the attributes, which are of different types. In Examples 2.17 through 2.21, we worked out the dissimilarity matrices for each of the individual attributes. The procedures we followed for test-1 (which is nominal) and test-2 (which is ordinal) are the same as outlined earlier for processing attributes of mixed types. Therefore, we can use the dissimilarity matrices obtained for test-1 and test-2