Online Book Reader

Home Category

Data Mining - Mehmed Kantardzic [143]

By Root 886 0
that form a partition, or a structure of partitions, of the data set. One additional result of cluster analysis is a generalized description of every cluster, and this is especially important for a deeper analysis of the data set’s characteristics.

9.1 CLUSTERING CONCEPTS


Organizing data into sensible groupings is one of the most fundamental approaches of understanding and learning. Cluster analysis is the formal study of methods and algorithms for natural grouping, or clustering, of objects according to measured or perceived intrinsic characteristics or similarities. Samples for clustering are represented as a vector of measurements, or more formally, as a point in a multidimensional space. Samples within a valid cluster are more similar to each other than they are to a sample belonging to a different cluster. Clustering methodology is particularly appropriate for the exploration of interrelationships among samples to make a preliminary assessment of the sample structure. Human performances are competitive with automatic-clustering procedures in one, two, or three dimensions, but most real problems involve clustering in higher dimensions. It is very difficult for humans to intuitively interpret data embedded in a high-dimensional space.

Table 9.1 shows a simple example of clustering information for nine customers, distributed across three clusters. Two features describe customers: The first feature is the number of items the customers bought, and the second feature shows the price they paid for each.

TABLE 9.1. Sample Set of Clusters Consisting of Similar Objects

Number of Items Price

Cluster 1 2 1700

3 2000

4 2300

Cluster 2 10 1800

12 2100

11 2500

Cluster 3 2 100

3 200

3 350

Customers in Cluster 1 purchase a few high-priced items; customers in Cluster 2 purchase many high-priced items; and customers in Cluster 3 purchase few low-priced items. Even this simple example and interpretation of a cluster’s characteristics shows that clustering analysis (in some references also called unsupervised classification) refers to situations in which the objective is to construct decision boundaries (classification surfaces) based on unlabeled training data set. The samples in these data sets have only input dimensions, and the learning process is classified as unsupervised.

Clustering is a very difficult problem because data can reveal clusters with different shapes and sizes in an n-dimensional data space. To compound the problem further, the number of clusters in the data often depends on the resolution (fine vs. coarse) with which we view the data. The next example illustrates these problems through the process of clustering points in the Euclidean two-dimensional (2-D) space. Figure 9.1a shows a set of points (samples in a 2-D space) scattered on a 2-D plane. Let us analyze the problem of dividing the points into a number of groups. The number of groups N is not given beforehand. Figure 9.1b shows the natural clusters bordered by broken curves. Since the number of clusters is not given, we have another partition of four clusters in Figure 9.1c that is as natural as the groups in Figure 9.1b. This kind of arbitrariness for the number of clusters is a major problem in clustering.

Figure 9.1. Cluster analysis of points in a 2D-space. (a) Initial data; (b) three clusters of data; (c) four clusters of data.

Note that the above clusters can be recognized by sight. For a set of points in a higher dimensional Euclidean space, we cannot recognize clusters visually. Accordingly, we need an objective criterion for clustering. To describe this criterion, we have to introduce a more formalized approach in describing the basic concepts and the clustering process.

An input to a cluster analysis can be described as an ordered pair (X, s), or (X, d), where X is a set of object descriptions represented with samples, and s and d are measures for similarity or dissimilarity (distance) between samples, respectively. Output from the clustering system is a partition Λ = {G1, G2, … , GN }, where Gk, k = 1, … , N is a crisp subset

Return Main Page Previous Page Next Page

®Online Book Reader