Online Book Reader

Home Category

Choose a category
All
Classic-Fiction

Data Mining_ Concepts and Techniques - Jiawei Han [48]

By Root 1382 0

to others. Knowledge of object similarities can also be used in nearest-neighbor classification schemes where a given object (e.g., a patient) is assigned a class label (relating to, say, a diagnosis) based on its similarity toward other objects in the model.

This section presents similarity and dissimilarity measures, which are referred to as measures of proximity. Similarity and dissimilarity are related. A similarity measure for two objects, i and j, will typically return the value 0 if the objects are unalike. The higher the similarity value, the greater the similarity between objects. (Typically, a value of 1 indicates complete similarity, that is, the objects are identical.) A dissimilarity measure works the opposite way. It returns a value of 0 if the objects are the same (and therefore, far from being dissimilar). The higher the dissimilarity value, the more dissimilar the two objects are.

In Section 2.4.1 we present two data structures that are commonly used in the above types of applications: the data matrix (used to store the data objects) and the dissimilarity matrix (used to store dissimilarity values for pairs of objects). We also switch to a different notation for data objects than previously used in this chapter since now we are dealing with objects described by more than one attribute. We then discuss how object dissimilarity can be computed for objects described by nominal attributes (Section 2.4.2), by binary attributes (Section 2.4.3), by numeric attributes (Section 2.4.4), by ordinal attributes (Section 2.4.5), or by combinations of these attribute types (Section 2.4.6). Section 2.4.7 provides similarity measures for very long and sparse data vectors, such as term-frequency vectors representing documents in information retrieval. Knowing how to compute dissimilarity is useful in studying attributes and will also be referenced in later topics on clustering (Chapter 10 and Chapter 11), outlier analysis (Chapter 12), and nearest-neighbor classification (Chapter 9).

2.4.1. Data Matrix versus Dissimilarity Matrix

In Section 2.2, we looked at ways of studying the central tendency, dispersion, and spread of observed values for some attribute X. Our objects there were one-dimensional, that is, described by a single attribute. In this section, we talk about objects described by multiple attributes. Therefore, we need a change in notation. Suppose that we have n objects (e.g., persons, items, or courses) described by p attributes (also called measurements or features, such as age, height, weight, or gender). The objects are , , and so on, where xij is the value for object xi of the jth attribute. For brevity, we hereafter refer to object xi as object i. The objects may be tuples in a relational database, and are also referred to as data samples or feature vectors.

Main memory-based clustering and nearest-neighbor algorithms typically operate on either of the following two data structures:

■ Data matrix (or object-by-attribute structure): This structure stores the n data objects in the form of a relational table, or n-by-p matrix (n objects × p attributes):

(2.8)

Each row corresponds to an object. As part of our notation, we may use f to index through the p attributes.

■ Dissimilarity matrix (or object-by-object structure): This structure stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table:

(2.9)

where d(i, j) is the measured dissimilarity or “difference” between objects i and j. In general, d(i, j) is a non-negative number that is close to 0 when objects i and j are highly similar or “near” each other, and becomes larger the more they differ. Note that d(i, i) = 0; that is, the difference between an object and itself is 0. Furthermore, d(i, j) = d(j, i). (For readability, we do not show the d(j, i) entries; the matrix is symmetric.) Measures of dissimilarity are discussed throughout the remainder of this chapter.

Measures of similarity can often be expressed as a function of measures of dissimilarity. For example, for nominal

Online Book Reader

Data Mining_ Concepts and Techniques - Jiawei Han [48]

®Online Book Reader