Data Mining_ Concepts and Techniques - Jiawei Han [53]
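We can now use the dissimilarity matrices for the three attributes in our computation of Eq. (2.22). The indicator $\delta_{ij}^{(f)} = 1$ for each of the three attributes, $f$. We get, for example, $d(3, 1) = \frac{1(1) + 1(0.50) + 1(0.45)}{3} = 0.65$. The resulting dissimilarity matrix obtained for the data described by the three attributes of mixed types is:
$$\begin{bmatrix} 0 & & & \\ 0.85 & 0 & & \\ 0.65 & 0.83 & 0 & \\ 0.13 & 0.71 & 0.79 & 0 \end{bmatrix}$$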
From Table 2.2, we can intuitively guess that objects 1 and 4 are the most similar, based on their values for test-1 and test-2. This is confirmed by the dissimilarity matrix, where d(4, 1) is the lowest value for any pair of different objects. Similarly, the matrix indicates that objects 1 and 2 are the least similar.
2.4.7. Cosine Similarity
A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document. Thus, each document is an object represented by what is called a term-frequency vector. For example, in Table 2.5, we see that Document1 contains five instances of the word team, while hockey occurs three times. The word coach is absent from the entire document, as indicated by a count value of 0. Such data can be highly asymmetric.
Table 2.5 Document Vector or Term-Frequency Vector
Document   team  coach  hockey  baseball  soccer  penalty  score  win  loss  season
Document1     5      0       3         0       2        0      0    2     0       0
Document2     3      0       2         0       1        1      0    1     0       1
Document3     0      7       0         2       1        0      0    3     0       0
Document4     0      1       0         0       1        2      2    0     3       0
Term-frequency vectors are typically very long and sparse (i.e., they have many 0 values). Applications using such structures include information retrieval, text document clustering, biological taxonomy, and gene feature mapping. The traditional distance measures that we have studied in this chapter do not work well for such sparse numeric data. For example, two term-frequency vectors may have many 0 values in common, meaning that the corresponding documents do not share many words, but this does not make them similar. We need a measure that will focus on the words that the two documents do have in common, and the occurrence frequency of such words. In other words, we need a measure for numeric data that ignores zero-matches.
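To make the problem concrete, the following Python sketch compares Euclidean distance for two pairs of vectors. Here `doc1` and `doc2` are Document1 and Document2 from Table 2.5, while `short_a` and `short_b` are hypothetical vectors introduced only for illustration, as are the helper names `euclidean_distance` and `shared_terms`.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def shared_terms(x, y):
    """Number of positions where both vectors are nonzero (shared words)."""
    return sum(1 for a, b in zip(x, y) if a != 0 and b != 0)

# Document1 and Document2 from Table 2.5
doc1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
doc2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

# Hypothetical short documents (not from the text) with no word in common
short_a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
short_b = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(shared_terms(doc1, doc2), euclidean_distance(doc1, doc2))
print(shared_terms(short_a, short_b), euclidean_distance(short_a, short_b))
```

Under these assumptions, the two hypothetical documents share no words yet lie closer together (distance about 1.41) than Document1 and Document2 (distance 3.0), which share four words. The many positions where both vectors are 0 contribute nothing to the distance, so they make unrelated sparse vectors look close.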
Cosine similarity is a measure of similarity that can be used to compare documents or, say, give a ranking of documents with respect to a given vector of query words. Let x and y be two vectors for comparison. Using the cosine measure as a similarity function, we have
$$\mathrm{sim}(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} \qquad (2.23)$$
where ||x|| is the Euclidean norm of vector $x = (x_1, x_2, \ldots, x_p)$, defined as $\sqrt{x_1^2 + x_2^2 + \cdots + x_p^2}$. Conceptually, it is the length of the vector. Similarly, ||y|| is the Euclidean norm of vector y. The measure computes the cosine of the angle between vectors x and y. A cosine value of 0 means that the two vectors are at 90 degrees to each other (orthogonal) and have no match. The closer the cosine value to 1, the smaller the angle and the greater the match between vectors. Note that because the cosine similarity measure does not obey all of the properties of Section 2.4.4 defining metric measures, it is referred to as a nonmetric measure.
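The cosine measure can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `cosine_similarity` and the choice to raise an error for an all-zero vector (where the measure is undefined) are assumptions made here, not part of the text.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))       # x . y
    norm_x = math.sqrt(sum(a * a for a in x))    # ||x||
    norm_y = math.sqrt(sum(b * b for b in y))    # ||y||
    if norm_x == 0 or norm_y == 0:
        # Assumption: treat a zero vector as an error, since the angle is undefined
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

# Orthogonal vectors: no nonzero position in common
print(cosine_similarity([1, 0], [0, 1]))

# Same direction, different length
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
```

The first call yields 0 and the second yields a value of 1 up to floating-point rounding. The second case also hints at why the measure is nonmetric: two distinct vectors that differ only in length are treated as a perfect match, so a distance derived from it could be 0 for objects that are not identical.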
Cosine similarity between two term-frequency vectors
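Suppose that x and y are the first two term-frequency vectors in Table 2.5. That is, x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0) and y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1). How similar are x and y? Using Eq. (2.23) to compute the cosine similarity between the two vectors, we get:
$$\begin{aligned}
x \cdot y &= 5 \times 3 + 0 \times 0 + 3 \times 2 + 0 \times 0 + 2 \times 1 + 0 \times 1 + 0 \times 0 + 2 \times 1 + 0 \times 0 + 0 \times 1 = 25 \\
\|x\| &= \sqrt{5^2 + 0^2 + 3^2 + 0^2 + 2^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2} = \sqrt{42} \approx 6.48 \\
\|y\| &= \sqrt{3^2 + 0^2 + 2^2 + 0^2 + 1^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2} = \sqrt{17} \approx 4.12 \\
\mathrm{sim}(x, y) &= \frac{25}{6.48 \times 4.12} \approx 0.94
\end{aligned}$$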
Therefore, if we were using the cosine similarity measure to compare these documents, they would be considered quite similar.
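The same computation can be repeated for every pair of documents in Table 2.5. The Python sketch below is illustrative only: the names `documents`, `cosine_similarity`, and `pairwise` are chosen here, and the function assumes that no vector is all zeros, which holds for this table.

```python
import math
from itertools import combinations

# Term-frequency vectors from Table 2.5 (columns: team, coach, hockey,
# baseball, soccer, penalty, score, win, loss, season)
documents = {
    "Document1": [5, 0, 3, 0, 2, 0, 0, 2, 0, 0],
    "Document2": [3, 0, 2, 0, 1, 1, 0, 1, 0, 1],
    "Document3": [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
    "Document4": [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
}

def cosine_similarity(x, y):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Similarity for each unordered pair of documents
pairwise = {
    (a, b): cosine_similarity(documents[a], documents[b])
    for a, b in combinations(documents, 2)
}

# Print pairs from most to least similar
for (a, b), sim in sorted(pairwise.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{a} vs {b}: {sim:.2f}")
```

Document1 and Document2 come out as the most similar pair, at about 0.94, consistent with the conclusion that they are quite similar. Every other pair shares at most two words and scores far lower.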
When attributes are binary-valued, the cosine similarity function can be interpreted in terms of shared features or attributes. Suppose an object x possesses the ith attribute if x_i = 1. Then x^t ⋅ y is the number of attributes possessed (i.e., shared) by both x and y, and ||x|| ||y|| is the geometric mean of the number of attributes