Data Mining_ Concepts and Techniques - Jiawei Han [41]
The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data. This distance is called the interquartile range (IQR) and is defined as
(2.5)
Interquartile range
The quartiles are the three values that split the sorted data set into four equal parts. The data of Example 2.6 contain 12 observations, already sorted in increasing order. Thus, the quartiles for this data are the third, sixth, and ninth values, respectively, in the sorted list. Therefore, Q1 = $47,000 and Q3 is $63,000. Thus, the interquartile range is IQR = 63 − 47 = $16,000. (Note that the sixth value is a median, $52,000, although this data set has two medians since the number of data values is even.)
Five-Number Summary, Boxplots, and Outliers
No single numeric measure of spread (e.g., IQR) is very useful for describing skewed distributions. Have a look at the symmetric and skewed data distributions of Figure 2.1. In the symmetric distribution, the median (and other measures of central tendency) splits the data into equal-size halves. This does not occur for skewed distributions. Therefore, it is more informative to also provide the two quartiles Q1 and Q3, along with the median. A common rule of thumb for identifying suspected outliers is to single out values falling at least 1.5 × IQR above the third quartile or below the first quartile.
Because Q1, the median, and Q3 together contain no information about the endpoints (e.g., tails) of the data, a fuller summary of the shape of a distribution can be obtained by providing the lowest and highest data values as well. This is known as the five-number summary. The five-number summary of a distribution consists of the median (Q2), the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order of Minimum, Q1, Median, Q3, Maximum.
Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the five-number summary as follows:
■ Typically, the ends of the box are at the quartiles so that the box length is the interquartile range.
■ The median is marked by a line within the box.
■ Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations.
When dealing with a moderate number of observations, it is worthwhile to plot potential outliers individually. To do this in a boxplot, the whiskers are extended to the extreme low and high observations only if these values are less than 1.5 × IQR beyond the quartiles. Otherwise, the whiskers terminate at the most extreme observations occurring within 1.5 × IQR of the quartiles. The remaining cases are plotted individually. Boxplots can be used in the comparisons of several sets of compatible data.
Boxplot
Figure 2.3 shows boxplots for unit price data for items sold at four branches of AllElectronics during a given time period. For branch 1, we see that the median price of items sold is $80, Q1 is $60, and Q3 is $100. Notice that two outlying observations for this branch were plotted individually, as their values of 175 and 202 are more than 1.5 times the IQR here of 40.
Figure 2.3 Boxplot for the unit price data for items sold at four branches of AllElectronics during a given time period.
Boxplots can be computed in O(n log n) time. Approximate boxplots can be computed in linear or sublinear time depending on the quality guarantee required.
Variance and Standard Deviation
Variance and standard deviation are measures of data dispersion. They indicate how spread out a data distribution is. A low standard deviation means that the data observations tend to be very close to the mean, while a high standard deviation indicates that the data are spread out over a large range of values.
The variance of N observations, , for a numeric attribute X is
(2.6)
where is the mean value of the observations, as defined in Eq. (2.1). The standard deviation, σ, of the observations is the square