Data Mining_ Concepts and Techniques - Jiawei Han [42]
Variance and standard deviation
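In Example 2.6, we found x̄ = $58,000 using Eq. (2.1) for the mean. To determine the variance and standard deviation of the data from that example (values in thousands of dollars), we set N = 12 and use Eq. (2.6) to obtain
$$\sigma^2 = \frac{1}{12}\left(30^2 + 36^2 + 47^2 + \cdots + 110^2\right) - 58^2 \approx 379.17$$
$$\sigma \approx \sqrt{379.17} \approx 19.47$$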
The basic properties of the standard deviation, σ, as a measure of spread are as follows:
■ σ measures spread about the mean and should be considered only when the mean is chosen as the measure of center.
■ σ = 0 only when there is no spread, that is, when all observations have the same value. Otherwise, σ > 0.
Importantly, an observation is unlikely to be more than several standard deviations away from the mean. Mathematically, using Chebyshev's inequality, it can be shown that, for any k > 1, at least (1 − 1/k²) × 100% of the observations are no more than k standard deviations from the mean. Therefore, the standard deviation is a good indicator of the spread of a data set.
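The Chebyshev bound can be checked numerically on a small sample. The following is a minimal sketch, not part of the original text: the function names and the twelve sample values are assumptions chosen for illustration, and the standard deviation is computed with the population formula (dividing by N).

```python
import math

def fraction_within(data, k):
    """Fraction of observations within k population standard deviations of the mean."""
    n = len(data)
    mean = sum(data) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    if sigma == 0:
        return 1.0  # no spread: every observation equals the mean
    return sum(1 for x in data if abs(x - mean) <= k * sigma) / n

def chebyshev_bound(k):
    """Lower bound 1 - 1/k^2 guaranteed by Chebyshev's inequality (useful for k > 1)."""
    return 1 - 1 / k ** 2

# Illustrative sample values (an assumption for this sketch).
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

for k in (1.5, 2, 3):
    print(k, fraction_within(sample, k), chebyshev_bound(k))
```

For each k, the observed fraction should be at least as large as the bound. The bound holds for any distribution, so it is usually quite loose for real data.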
The computation of the variance and standard deviation is scalable in large databases.
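One way to see why the computation scales is that the variance can be obtained from three running totals (count, sum, and sum of squares), which can be accumulated in a single pass and combined across data partitions. The sketch below assumes this sum-of-squares form of the variance; the class name `RunningStats` and the sample values are hypothetical. In floating point, this form can lose precision when the mean is large relative to the spread, so treat it as an illustration rather than a production routine.

```python
import math

class RunningStats:
    """Accumulates count, sum, and sum of squares in one pass (hypothetical helper)."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def merge(self, other):
        """Combine the totals of two partitions without revisiting their data."""
        merged = RunningStats()
        merged.n = self.n + other.n
        merged.total = self.total + other.total
        merged.total_sq = self.total_sq + other.total_sq
        return merged

    def variance(self):
        """Population variance: mean of squares minus square of the mean."""
        mean = self.total / self.n
        return self.total_sq / self.n - mean * mean

    def std_dev(self):
        # Clamp at zero to guard against tiny negative values from rounding.
        return math.sqrt(max(self.variance(), 0.0))

# Usage: two partitions processed independently, then merged.
part_a, part_b = RunningStats(), RunningStats()
for x in [30, 36, 47, 50, 52, 52]:      # illustrative values (assumption)
    part_a.add(x)
for x in [56, 60, 63, 70, 70, 110]:     # illustrative values (assumption)
    part_b.add(x)
combined = part_a.merge(part_b)
print(combined.variance(), combined.std_dev())
```

Because `merge` only adds three numbers per partition, each partition can be scanned separately and the results combined at the end.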
2.2.3. Graphic Displays of Basic Statistical Descriptions of Data
In this section, we study graphic displays of basic statistical descriptions. These include quantile plots, quantile–quantile plots, histograms, and scatter plots. Such graphs are helpful for the visual inspection of data, which is useful for data preprocessing. The first three of these show univariate distributions (i.e., data for one attribute), while scatter plots show bivariate distributions (i.e., involving two attributes).
Quantile Plot
In this and the following subsections, we cover common graphic displays of data distributions. A quantile plot is a simple and effective way to have a first look at a univariate data distribution. First, it displays all of the data for the given attribute (allowing the user to assess both the overall behavior and unusual occurrences). Second, it plots quantile information (see Section 2.2.2). Let xi, for i = 1 to N, be the data sorted in increasing order so that x1 is the smallest observation and xN is the largest for some ordinal or numeric attribute X. Each observation, xi, is paired with a fraction, fi, which indicates that approximately fi × 100% of the data are below the value, xi. We say “approximately” because there may not be a value with exactly a fraction, fi, of the data below xi. Note that the 0.25 quantile corresponds to quartile Q1, the 0.50 quantile is the median, and the 0.75 quantile is Q3.
Let
$$f_i = \frac{i - 0.5}{N}. \tag{2.7}$$
These numbers increase in equal steps of 1/N, ranging from 1/(2N) (which is slightly above 0) to 1 − 1/(2N) (which is slightly below 1). On a quantile plot, xi is graphed against fi. This allows us to compare different distributions based on their quantiles. For example, given the quantile plots of sales data for two different time periods, we can compare their Q1, median, Q3, and other fi values at a glance.
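The pairing of sorted observations with their f values can be sketched in a few lines. This is a minimal illustration assuming f is computed as (i − 0.5)/N for the i-th smallest of N observations; the function name `quantile_plot_points` and the four sample values are hypothetical. The resulting pairs are what a plotting library would draw, with f on the horizontal axis and the data value on the vertical axis.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs, where x_i is the i-th smallest value
    and f_i = (i - 0.5) / N."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

# Illustrative input (an assumption), deliberately unsorted.
points = quantile_plot_points([74, 40, 47, 43])
for f, x in points:
    print(f, x)
```

With N = 4, the f values step by 1/4, starting at 1/8 and ending at 7/8, so neither 0 nor 1 is ever reached.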
Quantile plot
Figure 2.4 shows a quantile plot for the unit price data of Table 2.1.
Figure 2.4 A quantile plot for the unit price data of Table 2.1.
Table 2.1 A Set of Unit Price Data for Items Sold at a Branch of AllElectronics
Unit price ($)    Count of items sold
40                275
43                300
47                250
–                 –
74                360
75                515
78                540
–                 –
115               320
117               270
120               350
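When data arrive as (value, count) pairs, as in Table 2.1, the f values for a quantile plot can be derived from cumulative counts without expanding every item. The sketch below is an illustration under stated assumptions: the function name `grouped_quantile_ranges` is hypothetical, f is taken as (i − 0.5)/N, and only the nine rows visible in the table are used. The rows elided with dashes are not reconstructed, so the output does not reproduce Figure 2.4.

```python
def grouped_quantile_ranges(rows):
    """For (value, count) rows, return (value, f_first, f_last), where f_first
    and f_last are the f values of the first and last item sold at that value."""
    rows = sorted(rows)
    total = sum(count for _, count in rows)
    result = []
    seen = 0  # number of items with a smaller value
    for value, count in rows:
        f_first = (seen + 1 - 0.5) / total
        f_last = (seen + count - 0.5) / total
        result.append((value, f_first, f_last))
        seen += count
    return result

# Only the rows visible in Table 2.1; elided rows are omitted (assumption).
visible_rows = [
    (40, 275), (43, 300), (47, 250),
    (74, 360), (75, 515), (78, 540),
    (115, 320), (117, 270), (120, 350),
]

ranges = grouped_quantile_ranges(visible_rows)
for value, f_first, f_last in ranges:
    print(value, round(f_first, 4), round(f_last, 4))
```

Each unit price then appears on the plot as a horizontal run of points between its first and last f value, which is why repeated prices show up as flat segments.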
Quantile–Quantile Plot
A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate distribution against the corresponding quantiles of another. It is a powerful visualization tool in that it allows the user to view whether there is a shift in going from one distribution to another.
Suppose that we have two sets of observations for the attribute or variable unit price, taken from two different branch locations. Let x1, …, xN be the data from the first branch, and y1, …, yM be the data from the second, where each data set is sorted in increasing order. If M = N (i.e., the number of points in each set is the same), then we simply plot yi against xi, where yi and xi are both (i − 0.5)/N quantiles of their respective data sets. If M < N (i.e., the second branch has fewer observations than the first), there can be only M points on the q-q plot. Here, yi is the (i − 0.5)/M quantile of the y data, which is plotted against