Data Mining - Mehmed Kantardzic [146]
TABLE 9.2. The 2 × 2 Contingency Table
The meaning of the table parameters a, b, c, and d, which are given in Figure 6.2, is as follows:
1. a is the number of binary attributes of samples xi and xj such that xik = xjk = 1.
2. b is the number of binary attributes of samples xi and xj such that xik = 1 and xjk = 0.
3. c is the number of binary attributes of samples xi and xj such that xik = 0 and xjk = 1.
4. d is the number of binary attributes of samples xi and xj such that xik = xjk = 0.
For example, if xi and xj are 8-D vectors with binary feature values
then the values of the parameters introduced are
Several similarity measures for samples with binary features are proposed using the values in the 2 × 2 contingency table. Some of them are
1. simple matching coefficient (SMC)
2. Jaccard Coefficient
3. Rao’s Coefficient
For the previously given 8-D samples xi and xj these measures of similarity will be ssmc(xi, xj) = 5/8, sjc(xi, xj) = 2/5, and src(xi, xj) = 2/8.
How to measure distances between values when categorical data are not binary? The simplest way to find similarity between two categorical attributes is to assign a similarity of 1 if the values are identical and a similarity of 0 if the values are not identical. For two multivariate categorical data points, the similarity between them will be directly proportional to the number of attributes in which they match. This simple measure is also known as the overlap measure in the literature. One obvious drawback of the overlap measure is that it does not distinguish between the different values taken by an attribute. All matches, as well as mismatches, are treated as equal.
This observation has motivated researchers to come up with data-driven similarity measures for categorical attributes. Such measures take into account the frequency distribution of different attribute values in a given data set to define similarity between two categorical attribute values. Intuitively, the use of additional information would lead to a better performance. There are two main characteristics of categorical data that are included in new measures of similarity (distance):
1. number of values taken by each attribute, nk (one attribute might take several hundred possible values, while another attribute might take very few values); and
2. distribution fk(x), which refers to the distribution of frequency of values taken by an attribute in the given data set.
Almost all similarity measures assign a similarity value between two d-dimensional samples X and Y belonging to the data set D as follows:
where Sk(Xk, Yk) is the per-attribute similarity between two values for the categorical attribute Ak. The quantity wk denotes the weight assigned to the attribute Ak. To understand how different measures calculate the per-attribute similarity, Sk(Xk; Yk), consider a categorical attribute A, which takes one of the values{a, b, c, d}. The per-attribute similarity computation is equivalent to constructing the (symmetric) matrix shown in Table 9.3.
TABLE 9.3. Similarity Matrix for a Single Categorical Attribute
Essentially, in determining the similarity between two values, any categorical measure is filling the entries of this matrix. For example, the overlap measure sets the diagonal entries to 1 and the off-diagonal entries to 0, that is, the similarity is 1 if the values match and 0 if the values mismatch. Additionally, measures may use the following information in computing a similarity value (all the measures in this paper use only this information):
1. f(a), f(b), f(c), and f(d), the frequencies of the values in the data set;
2. N, the size of the data set; and
3. n, the number of values taken by the attribute (4 in the case above).
We will present, as an illustrative example, only one additional measure