Data Mining - Mehmed Kantardzic [145]
Quantitative features can be subdivided as
1. continuous values (e.g., real numbers where Pj ⊆ R),
2. discrete values (e.g., binary numbers Pj = {0,1}, or integers Pj ⊆ Z), and
3. interval values (e.g., Pj = {xij ≤ 20, 20 < xij < 40, xij ≥ 40}).
Qualitative features can be
1. nominal or unordered (e.g., color is “blue” or “red”), and
2. ordinal (e.g., military rank with values “general” and “colonel”).
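As a small illustration of the interval-value case, a continuous feature can be mapped onto the intervals given above. The sketch below is hypothetical (the function name and return strings are not from the text; only the thresholds 20 and 40 are):

```python
# Map a continuous value x onto one of the interval values
# Pj = {x <= 20, 20 < x < 40, x >= 40} from the example above.
def interval_value(x):
    if x <= 20:
        return "x <= 20"
    elif x < 40:
        return "20 < x < 40"
    else:
        return "x >= 40"

print(interval_value(15))   # -> "x <= 20"
print(interval_value(35))   # -> "20 < x < 40"
print(interval_value(40))   # -> "x >= 40"
```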
Since similarity is fundamental to the definition of a cluster, a measure of the similarity between two patterns drawn from the same feature space is essential to most clustering algorithms. This measure must be chosen very carefully because the quality of a clustering process depends on this decision. It is most common to calculate, instead of the similarity measure, the dissimilarity between two samples using a distance measure defined on the feature space. A distance measure may be a metric or a quasi-metric on the sample space, and it is used to quantify the dissimilarity of samples.
The word “similarity” in clustering means that the value of s(x, x′) is large when x and x′ are two similar samples, and small when they are not. Moreover, a similarity measure s is symmetric:

s(x, x′) = s(x′, x), ∀ x, x′ ∈ X
For most clustering techniques, we say that a similarity measure is normalized:

0 ≤ s(x, x′) ≤ 1, ∀ x, x′ ∈ X
Very often a measure of dissimilarity is used instead of a similarity measure. A dissimilarity measure is denoted by d(x, x′), ∀x, x′ ∈ X. Dissimilarity is frequently called a distance. A distance d(x, x′) is small when x and x′ are similar; if x and x′ are not similar, d(x, x′) is large. We assume without loss of generality that

d(x, x) = 0, ∀ x ∈ X
A distance measure is also symmetric:

d(x, x′) = d(x′, x), ∀ x, x′ ∈ X
and if it is accepted as a metric distance measure, then the triangle inequality is required:

d(x, x″) ≤ d(x, x′) + d(x′, x″), ∀ x, x′, x″ ∈ X
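These metric properties can be verified numerically. The sketch below checks them for the Euclidean distance (defined in the text that follows) on a few hypothetical 2-D sample points:

```python
import math

def euclidean(x, y):
    # d(x, y) = (sum over k of (x_k - y_k)^2)^(1/2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical sample points, chosen only for illustration.
a, b, c = (1.0, 0.0), (0.0, 1.0), (3.0, 4.0)

# d(x, x) = 0
assert euclidean(a, a) == 0.0
# Symmetry: d(x, x') = d(x', x)
assert euclidean(a, b) == euclidean(b, a)
# Triangle inequality: d(x, x'') <= d(x, x') + d(x', x'')
assert euclidean(a, c) <= euclidean(a, b) + euclidean(b, c)
```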
The most well-known metric distance measure is the Euclidean distance in an m-dimensional feature space:

d2(xi, xj) = (Σk=1..m (xik − xjk)²)^(1/2)
Another metric that is frequently used is called the L1 metric or city block distance:

d1(xi, xj) = Σk=1..m |xik − xjk|
and finally, the Minkowski metric includes the Euclidean distance and the city block distance as special cases:

dp(xi, xj) = (Σk=1..m |xik − xjk|^p)^(1/p)
It is obvious that when p = 1, d coincides with the L1 distance, and when p = 2, d is identical to the Euclidean metric. For example, for 4-D vectors x1 = {1, 0, 1, 0} and x2 = {2, 1, −3, −1}, these distance measures are d1 = 1 + 1 + 4 + 1 = 7, d2 = (1 + 1 + 16 + 1)^(1/2) = 4.36, and d3 = (1 + 1 + 64 + 1)^(1/3) = 4.06.
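The worked example above can be reproduced with a short Minkowski-distance sketch; the function name is an illustration, not code from the text:

```python
def minkowski(x, y, p):
    # d_p(x, y) = (sum over k of |x_k - y_k|^p)^(1/p)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# The 4-D example vectors from the text.
x1 = [1, 0, 1, 0]
x2 = [2, 1, -3, -1]

print(minkowski(x1, x2, 1))            # 7.0  (city block / L1)
print(round(minkowski(x1, x2, 2), 2))  # 4.36 (Euclidean)
print(round(minkowski(x1, x2, 3), 2))  # 4.06
```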
The Euclidean n-dimensional space model offers not only the Euclidean distance but also other measures of similarity. One of them is called the cosine-correlation:

scos(xi, xj) = (Σk=1..m xik·xjk) / (Σk=1..m xik² · Σk=1..m xjk²)^(1/2)
It is easy to see that −1 ≤ scos(xi, xj) ≤ 1.
For the previously given vectors x1 and x2, the corresponding cosine measure of similarity is scos(x1, x2) = (2 + 0 − 3 + 0)/(2^(1/2) · 15^(1/2)) = −0.18.
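The cosine computation for x1 and x2 can be checked directly; the helper name below is an illustration, not code from the text:

```python
import math

def cosine_similarity(x, y):
    # s_cos(x, y) = (x . y) / (||x|| * ||y||)
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# The 4-D example vectors from the text.
x1 = [1, 0, 1, 0]
x2 = [2, 1, -3, -1]

print(round(cosine_similarity(x1, x2), 2))  # -0.18
```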
Computing distances or measures of similarity between samples that have some or all features that are noncontinuous is problematic, since the different types of features are not comparable and one standard measure is not applicable. In practice, different distance measures are used for different features of heterogeneous samples. Let us explain one possible distance measure for binary data. Assume that each sample is represented by the n-dimensional vector xi,