Data Mining_ Concepts and Techniques - Jiawei Han [45]
Figure 2.11 Some frequently used 2-D space-filling curves.
Note that the windows do not have to be rectangular. For example, the circle segment technique uses windows in the shape of segments of a circle, as illustrated in Figure 2.12. This technique can ease the comparison of dimensions because the dimension windows are located side by side and form a circle.
Figure 2.12 The circle segment technique. (a) Representing a data record in circle segments. (b) Laying out pixels in circle segments.
2.3.2. Geometric Projection Visualization Techniques
A drawback of pixel-oriented visualization techniques is that they cannot help us much in understanding the distribution of data in a multidimensional space. For example, they do not show whether there is a dense area in a multidimensional subspace. Geometric projection techniques help users find interesting projections of multidimensional data sets. The central challenge the geometric projection techniques try to address is how to visualize a high-dimensional space on a 2-D display.
A scatter plot displays 2-D data points using Cartesian coordinates. A third dimension can be added using different colors or shapes to represent different data points. Figure 2.13 shows an example, where X and Y are two spatial attributes and the third dimension is represented by different shapes. Through this visualization, we can see that points of types “+” and “×” tend to be colocated.
Figure 2.13 Visualization of a 2-D data set using a scatter plot. Source: www.cs.sfu.ca/jpei/publications/rareevent-geoinformatica06.pdf.
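The idea of encoding a third, categorical dimension as a marker shape can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce Figure 2.13: the function name `group_by_marker`, the marker list, and the sample points are all assumptions made for the example.

```python
def group_by_marker(points, markers=("+", "x", "o", "s")):
    """Group (x, y, category) points by the marker shape assigned to each category.

    Markers are handed out in order of first appearance of a category.
    Returns a dict mapping marker -> list of (x, y) positions.
    """
    marker_of = {}   # category -> marker shape
    groups = {}      # marker shape -> list of (x, y)
    for x, y, category in points:
        if category not in marker_of:
            if len(marker_of) >= len(markers):
                raise ValueError("more categories than available markers")
            marker_of[category] = markers[len(marker_of)]
        groups.setdefault(marker_of[category], []).append((x, y))
    return groups


# Hypothetical sample data: two spatial attributes plus a type label.
sample = [(1, 2, "a"), (3, 4, "b"), (5, 6, "a")]
groups = group_by_marker(sample)
```

Each group can then be drawn with its own shape by whatever plotting tool is at hand, so that colocated types become visible as interleaved shapes.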
A 3-D scatter plot uses three axes in a Cartesian coordinate system. If it also uses color, it can display up to 4-D data points (Figure 2.14).
Figure 2.14 Visualization of a 3-D data set using a scatter plot. Source: http://upload.wikimedia.org/wikipedia/commons/c/c4/Scatter_plot.jpg.
For data sets with more than four dimensions, scatter plots are usually ineffective. The scatter-plot matrix technique is a useful extension to the scatter plot. For an n-dimensional data set, a scatter-plot matrix is an n × n grid of 2-D scatter plots that provides a visualization of each dimension with every other dimension. Figure 2.15 shows an example, which visualizes the Iris data set. The data set consists of 150 samples, 50 from each of three species of Iris flowers. There are five dimensions in the data set: sepal length, sepal width, petal length, petal width, and species.
Figure 2.15 Visualization of the Iris data set using a scatter-plot matrix. Source: http://support.sas.com/documentation/cdl/en/grstatproc/61948/HTML/default/images/gsgscmat.gif.
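The layout of a scatter-plot matrix can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `scatter_matrix_panels` is hypothetical, records are assumed to be numeric tuples of equal length, and the convention that the column index selects the x dimension and the row index selects the y dimension is one common choice rather than something the text prescribes.

```python
from itertools import product


def scatter_matrix_panels(records):
    """Build the panels of an n x n scatter-plot matrix.

    records: non-empty list of equal-length numeric tuples (n dimensions each).
    Returns a dict mapping (row, col) -> list of (x, y) points, where
    x comes from dimension `col` and y from dimension `row` (assumed convention).
    """
    n = len(records[0])
    panels = {}
    for row, col in product(range(n), repeat=2):
        panels[(row, col)] = [(rec[col], rec[row]) for rec in records]
    return panels


# Hypothetical 3-dimensional sample with two records.
sample = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
panels = scatter_matrix_panels(sample)
```

The number of panels grows as n × n, so a five-dimensional data set such as Iris already needs 25 panels, each of which shrinks as more dimensions are added.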
The scatter-plot matrix becomes less effective as the dimensionality increases. Another popular technique, called parallel coordinates, can handle higher dimensionality. To visualize n-dimensional data points, the parallel coordinates technique draws n equally spaced axes, one for each dimension, parallel to one of the display axes. A data record is represented by a polygonal line that intersects each axis at the point corresponding to the associated dimension value (Figure 2.16).
Figure 2.16 Here is a visualization that uses parallel coordinates. Source: www.stat.columbia.edu/~cook/movabletype/archives/2007/10/parallel_coordi.thml.
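The construction of the polygonal lines can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `parallel_coordinates` is hypothetical, the axes are assumed to be vertical, and each axis is assumed to be scaled by min-max normalization of its dimension, which the text does not specify.

```python
def parallel_coordinates(records, width=1.0, height=1.0):
    """Compute one polyline per record for a parallel coordinates display.

    records: non-empty list of equal-length numeric tuples with n >= 2 dimensions.
    The n vertical axes are equally spaced across `width`. Each value is
    mapped to a height on its axis by min-max scaling (an assumption);
    a dimension with a single distinct value is placed at mid-height.
    Returns a list of polylines, each a list of n (x, y) vertices.
    """
    n = len(records[0])
    lows = [min(rec[d] for rec in records) for d in range(n)]
    highs = [max(rec[d] for rec in records) for d in range(n)]
    xs = [width * d / (n - 1) for d in range(n)]  # equally spaced axis positions

    polylines = []
    for rec in records:
        line = []
        for d in range(n):
            span = highs[d] - lows[d]
            t = (rec[d] - lows[d]) / span if span else 0.5
            line.append((xs[d], height * t))
        polylines.append(line)
    return polylines


# Hypothetical 3-dimensional sample; the third dimension is constant.
sample = [(0, 10, 5), (10, 20, 5), (5, 15, 5)]
lines = parallel_coordinates(sample)
```

Each record contributes n − 1 line segments, so the total number of segments drawn is proportional to the number of records times the number of dimensions.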
A major limitation of the parallel coordinates technique is that it cannot effectively show a data set with many records. Even for a data set of several thousand records, visual clutter and overlap often reduce the readability of the visualization and make patterns hard to find.
2.3.3. Icon-Based Visualization