Online Book Reader

Data Mining - Mehmed Kantardzic [252]

visualization. Traditional visual metaphors are used for a single or a small number of dimensions, and they include:

1. bar charts that show aggregations and frequencies;

2. histograms that show the distribution of variable values;

3. line charts for understanding trends in order;

4. pie charts for visualizing fractions of a total;

5. scatter plots for bivariate analysis.

Color-coding is one of the most common traditional IVT methods for displaying a 1-D set of values, where each value is represented by a different color. This representation becomes a continuous tonal variation of color when the values of a dimension are real numbers. Normally, a color spectrum from blue to red is chosen, representing a natural variation from “cool” to “hot,” in other words, from the lowest to the highest values.
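As a rough illustration of this idea, the following Python sketch maps real values onto a blue-to-red ramp. The function names and the sample values are hypothetical, and the simple linear blend between pure blue and pure red is an assumption made for brevity; real tools typically use richer color spectra.

```python
def value_to_rgb(value, lo, hi):
    """Map a value in [lo, hi] to an (r, g, b) tuple: blue = low, red = high."""
    if hi == lo:
        t = 0.5  # degenerate range: every value gets the midpoint color
    else:
        t = (value - lo) / (hi - lo)
    t = min(1.0, max(0.0, t))  # clamp values that fall outside the range
    red = round(255 * t)
    blue = round(255 * (1 - t))
    return (red, 0, blue)


def color_code(values):
    """Color-code a 1-D set of values using its own minimum and maximum."""
    lo, hi = min(values), max(values)
    return [value_to_rgb(v, lo, hi) for v in values]


# Hypothetical sample values for one dimension.
temperatures = [12.0, 18.5, 25.0]
colors = color_code(temperatures)
for value, rgb in zip(temperatures, colors):
    print(value, rgb)
```

The smallest value receives pure blue and the largest pure red; everything in between is a proportional blend of the two.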

With the development of large data warehouses, data cubes became a very popular IVT. A data cube, the raw-data structure in a multidimensional database, organizes information along a sequence of categories. The categorizing variables are called dimensions. The data, called measures, are stored in cells along given dimensions. The cube dimensions are organized into hierarchies and usually include a dimension representing time. The hierarchical levels for the time dimension may be year, quarter, month, day, and hour. Similar hierarchies could be defined for other dimensions given in a data warehouse. Multidimensional databases in modern data warehouses automatically aggregate measures across hierarchical dimensions; they support hierarchical navigation, expand and collapse dimensions, enable drill down, drill up, or drill across, and facilitate comparisons through time. For transaction information in a database, the cube dimensions might be product, store, department, customer number, region, month, and year. The dimensions are predefined indices of a cube cell, and the measures in a cell are roll-ups or aggregations over the transactions. They are usually sums but may include functions such as average, standard deviation, and percentage.

For example, the values for the dimensions in a database may be

1. region: north, south, east, west;

2. product: shoes, shirts;

3. month: January, February, March, … , December.

Then, the cell corresponding to (north, shirts, February) is the total sales of shirts for the northern region for the month of February.
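A minimal sketch of such a cube in Python may help make the cell and roll-up ideas concrete. The transaction amounts, the `build_cube` and `roll_up` helpers, and the choice of a plain dictionary keyed by (region, product, month) are all assumptions for illustration, not the structure used by any particular multidimensional database.

```python
from collections import defaultdict

# Hypothetical transactions: (region, product, month, sales amount).
transactions = [
    ("north", "shirts", "February", 120.0),
    ("north", "shirts", "February", 80.0),
    ("north", "shoes", "February", 200.0),
    ("south", "shirts", "January", 50.0),
    ("north", "shirts", "March", 70.0),
]

DIMENSIONS = ("region", "product", "month")


def build_cube(rows):
    """Sum the sales measure into cells indexed by the three dimensions."""
    cube = defaultdict(float)
    for region, product, month, amount in rows:
        cube[(region, product, month)] += amount
    return dict(cube)


def roll_up(cube, keep):
    """Aggregate the cube over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    result = defaultdict(float)
    for cell, total in cube.items():
        result[tuple(cell[i] for i in positions)] += total
    return dict(result)


cube = build_cube(transactions)
print(cube[("north", "shirts", "February")])  # total February shirt sales, north

by_region = roll_up(cube, ["region"])          # drill up to region totals
print(by_region)
```

Here the measure is a sum, the most common case; an average or a standard deviation would need each cell to keep counts or the underlying values as well.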

Novel IVT can simultaneously represent large data sets with many dimensions on one screen. These new techniques are widely classified as

1. geometric-projection techniques,

2. icon-based techniques,

3. pixel-oriented techniques, and

4. hierarchical techniques.

Geometric-projection techniques aim to find interesting projections of multidimensional data sets. We will present some illustrative examples of these techniques.

The Scatter-Plot Matrix Technique is an approach that is very often available in new data-mining software tools. A grid of 2-D scatter plots is the standard means of extending a standard 2-D scatter plot to higher dimensions. If you have 10-D data, a 10 × 10 array of scatter plots is used to provide a visualization of each dimension versus every other dimension. This is useful for looking at all possible two-way interactions or correlations between dimensions. Positive and negative correlations, but only between two dimensions, can be seen easily. The standard display quickly becomes inadequate for extremely large numbers of dimensions, and user interactions of zooming and panning are needed to interpret the scatter plots effectively.
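The data behind such a grid can be sketched in a few lines of Python. The helper names and the small 3-D sample set below are hypothetical; the sketch only assembles the point lists for each panel and computes the Pearson correlation that each panel would reveal visually, leaving the actual drawing to whatever plotting tool is at hand.

```python
from math import sqrt


def scatter_matrix_panels(samples):
    """Return an n x n grid of panels.

    Panel [i][j] holds the (x, y) points that plot dimension j on the
    horizontal axis against dimension i on the vertical axis.
    """
    n = len(samples[0])
    return [[[(s[j], s[i]) for s in samples] for j in range(n)]
            for i in range(n)]


def pearson(points):
    """Pearson correlation of the (x, y) points in one panel."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant dimension has no defined correlation
    return cov / sqrt(var_x * var_y)


# Hypothetical 3-D samples: dimension 1 rises with dimension 0, dimension 2 falls.
samples = [(1, 2, 9), (2, 4, 7), (3, 6, 5), (4, 8, 3)]
panels = scatter_matrix_panels(samples)

for i, row in enumerate(panels):
    for j, points in enumerate(row):
        print(f"dim {i} vs dim {j}: r = {pearson(points):+.2f}")
```

For n dimensions the grid has n × n panels, which is why the display becomes crowded so quickly as n grows.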

The Survey Plot is a simple technique that extends an n-dimensional point (sample) into a line graph. Each dimension of the sample is represented on a separate axis, on which the dimension’s value is drawn as a line of proportional length extending from the center of the axis. The principles of representation are given in Figure 15.1.

Figure 15.1. A 4-dimensional survey plot.

This visualization of n-dimensional data allows you to see correlations between any two variables, especially when the data are sorted according to a particular dimension.
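A text-only sketch in Python can approximate the layout of a survey plot. It assumes non-negative values, scales each dimension by its own maximum, and draws each value as a run of `#` characters centered on that dimension's axis; the function name, the character rendering, and the sample data are all hypothetical.

```python
def survey_plot_rows(samples, sort_by=None, width=8):
    """Render samples as rows of centered bars, one column per dimension.

    Assumes non-negative values; each bar's length is proportional to
    the value divided by the maximum of that dimension.
    """
    n_dims = len(samples[0])
    if sort_by is not None:
        samples = sorted(samples, key=lambda s: s[sort_by])
    highs = [max(s[d] for s in samples) for d in range(n_dims)]
    rows = []
    for s in samples:
        cells = []
        for d in range(n_dims):
            frac = s[d] / highs[d] if highs[d] > 0 else 0.0
            length = round(frac * width)
            # center() pads both sides, so the bar grows out from the axis center
            cells.append(("#" * length).center(width))
        rows.append(" | ".join(cells))
    return rows


# Hypothetical 4-D samples, sorted by the first dimension before plotting.
samples = [(4, 40, 1, 8), (1, 10, 4, 2), (2, 20, 3, 4), (3, 30, 2, 6)]
rows = survey_plot_rows(samples, sort_by=0)
for row in rows:
    print(row)
```

After sorting on the first dimension, columns that widen together down the rows suggest a positive correlation with it, while a column that narrows suggests a negative one.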
