Data Mining - Mehmed Kantardzic [145]
Quantitative features can be subdivided as
1. continuous values (e.g., real numbers where Pj ⊆ R),
2. discrete values (e.g., binary numbers Pj = {0,1}, or integers Pj ⊆ Z), and
3. interval values (e.g., Pj = {xij ≤ 20, 20 < xij < 40, xij ≥ 40}).
Qualitative features can be
1. nominal or unordered (e.g., color is “blue” or “red”), and
2. ordinal (e.g., military rank with values “general” and “colonel”).
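As a small illustration of the interval-value case, a continuous feature can be mapped onto the intervals given above. The sketch below is hypothetical (the function name and return strings are not from the text; only the thresholds 20 and 40 are):

```python
# Map a continuous value x onto one of the interval values
# Pj = {x <= 20, 20 < x < 40, x >= 40} from the example above.
def interval_value(x):
    if x <= 20:
        return "x <= 20"
    elif x < 40:
        return "20 < x < 40"
    else:
        return "x >= 40"

print(interval_value(15))   # -> "x <= 20"
print(interval_value(35))   # -> "20 < x < 40"
print(interval_value(40))   # -> "x >= 40"
```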
Since similarity is fundamental to the definition of a cluster, a measure of the similarity between two patterns drawn from the same feature space is essential to most clustering algorithms. This measure must be chosen very carefully because the quality of a clustering process depends on this decision. It is most common to calculate, instead of the similarity measure, the dissimilarity between two samples using a distance measure defined on the feature space. A distance measure may be a metric or a quasi-metric on the sample space, and it is used to quantify the dissimilarity of samples.
The word “similarity” in clustering means that the value of s(x, x′) is large when x and x′ are two similar samples, and small when they are not. Moreover, a similarity measure s is symmetric:

s(x, x′) = s(x′, x), ∀ x, x′ ∈ X
For most clustering techniques, we say that a similarity measure is normalized:

0 ≤ s(x, x′) ≤ 1, ∀ x, x′ ∈ X
Very often a measure of dissimilarity is used instead of a similarity measure. A dissimilarity measure is denoted by d(x, x′), ∀x, x′ ∈ X. Dissimilarity is frequently called a distance. A distance d(x, x′) is small when x and x′ are similar; if x and x′ are not similar, d(x, x′) is large. We assume without loss of generality that

d(x, x) = 0, ∀ x ∈ X
A distance measure is also symmetric:

d(x, x′) = d(x′, x), ∀ x, x′ ∈ X
and if it is accepted as a metric distance measure, then the triangle inequality is required:

d(x, x″) ≤ d(x, x′) + d(x′, x″), ∀ x, x′, x″ ∈ X
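These metric properties can be verified numerically. The sketch below checks them for the Euclidean distance (defined in the text that follows) on a few hypothetical 2-D sample points:

```python
import math

def euclidean(x, y):
    # d(x, y) = (sum over k of (x_k - y_k)^2)^(1/2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical sample points, chosen only for illustration.
a, b, c = (1.0, 0.0), (0.0, 1.0), (3.0, 4.0)

# d(x, x) = 0
assert euclidean(a, a) == 0.0
# Symmetry: d(x, x') = d(x', x)
assert euclidean(a, b) == euclidean(b, a)
# Triangle inequality: d(x, x'') <= d(x, x') + d(x', x'')
assert euclidean(a, c) <= euclidean(a, b) + euclidean(b, c)
```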
The most well-known metric distance measure is the Euclidean distance in an m-dimensional feature space:

d2(xi, xj) = (Σk=1..m (xik − xjk)²)^(1/2)
Another metric that is frequently used is called the L1 metric or city block distance:

d1(xi, xj) = Σk=1..m |xik − xjk|
and finally, the Minkowski metric includes the Euclidean distance and the city block distance as special cases:

dp(xi, xj) = (Σk=1..m |xik − xjk|^p)^(1/p)
It is obvious that when p = 1, d coincides with the L1 distance, and when p = 2, d is identical to the Euclidean metric. For example, for 4-D vectors x1 = {1, 0, 1, 0} and x2 = {2, 1, −3, −1}, these distance measures are d1 = 1 + 1 + 4 + 1 = 7, d2 = (1 + 1 + 16 + 1)^(1/2) = 4.36, and d3 = (1 + 1 + 64 + 1)^(1/3) = 4.06.
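The worked example above can be reproduced with a short Minkowski-distance sketch; the function name is an illustration, not code from the text:

```python
def minkowski(x, y, p):
    # d_p(x, y) = (sum over k of |x_k - y_k|^p)^(1/p)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# The 4-D example vectors from the text.
x1 = [1, 0, 1, 0]
x2 = [2, 1, -3, -1]

print(minkowski(x1, x2, 1))            # 7.0  (city block / L1)
print(round(minkowski(x1, x2, 2), 2))  # 4.36 (Euclidean)
print(round(minkowski(x1, x2, 3), 2))  # 4.06
```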
The Euclidean n-dimensional space model offers not only the Euclidean distance but also other measures of similarity. One of them is called the cosine-correlation:

scos(xi, xj) = (Σk=1..m xik·xjk) / (Σk=1..m xik² · Σk=1..m xjk²)^(1/2)
It is easy to see that −1 ≤ scos(xi, xj) ≤ 1.
For the previously given vectors x1 and x2, the corresponding cosine measure of similarity is scos(x1, x2) = (2 + 0 − 3 + 0)/(2^(1/2) · 15^(1/2)) = −0.18.
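The cosine computation for x1 and x2 can be checked directly; the helper name below is an illustration, not code from the text:

```python
import math

def cosine_similarity(x, y):
    # s_cos(x, y) = (x . y) / (||x|| * ||y||)
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# The 4-D example vectors from the text.
x1 = [1, 0, 1, 0]
x2 = [2, 1, -3, -1]

print(round(cosine_similarity(x1, x2), 2))  # -0.18
```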
Computing distances or measures of similarity between samples that have some or all features that are noncontinuous is problematic, since the different types of features are not comparable and one standard measure is not applicable. In practice, different distance measures are used for different features of heterogeneous samples. Let us explain one possible distance measure for binary data. Assume that each sample is represented by the n-dimensional vector xi,