Data Mining_ Concepts and Techniques - Jiawei Han [55]
agefrequency
1–5 200
6–15 450
16–20 300
21–50 1500
51–80 700
81–110 44
Compute an approximate median value for the data.
2.4 Suppose that a hospital tested the age and body fat data for 18 randomly selected adults with the following results:
age 23 23 27 27 39 41 47 49 50
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2
age 52 54 54 56 57 58 58 60 61
%fat 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7
(a) Calculate the mean, median, and standard deviation of age and %fat.
(b) Draw the boxplots for age and %fat.
(c) Draw a scatter plot and a q-q plot based on these two variables.
2.5 Briefly outline how to compute the dissimilarity between objects described by the following:
(a) Nominal attributes
(b) Asymmetric binary attributes
(c) Numeric attributes
(d) Term-frequency vectors
2.6 Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using q = 3.
(d) Compute the supremum distance between the two objects.
2.7 The median is one of the most important holistic measures in data analysis. Propose several methods for median approximation. Analyze their respective complexity under different parameter settings and decide to what extent the real value can be approximated. Moreover, suggest a heuristic strategy to balance between accuracy and complexity and then apply it to all methods you have given.
2.8 It is important to define or select similarity measures in data analysis. However, there is no commonly accepted subjective similarity measure. Results can vary depending on the similarity measures used. Nonetheless, seemingly different similarity measures may be equivalent after some transformation.
Suppose we have the following 2-D data set:
A1A2
x1 1.5 1.7
x2 2 1.9
x3 1.6 1.8
x4 1.2 1.5
x5 1.5 1.0
(a) Consider the data as 2-D data points. Given a new data point, x = (1.4, 1.6) as a query, rank the database points based on similarity with the query using Euclidean distance, Manhattan distance, supremum distance, and cosine similarity.
(b) Normalize the data set to make the norm of each data point equal to 1. Use Euclidean distance on the transformed data to rank the data points.
2.7. Bibliographic Notes
Methods for descriptive data summarization have been studied in the statistics literature long before the onset of computers. Good summaries of statistical descriptive data mining methods include Freedman, Pisani, and Purves [FPP07] and Devore [Dev95]. For statistics-based visualization of data using boxplots, quantile plots, quantile–quantile plots, scatter plots, and loess curves, see Cleveland [Cle93].
Pioneering work on data visualization techniques is described in The Visual Display of Quantitative Information[Tuf83], Envisioning Information[Tuf90] and Visual Explanations: Images and Quantities, Evidence and Narrative[Tuf97], all by Tufte, in addition to Graphics and Graphic Information Processing by Bertin [Ber81], Visualizing Data by Cleveland [Cle93] and Information Visualization in Data Mining and Knowledge Discovery edited by Fayyad, Grinstein, and Wierse [FGW01].
Major conferences and symposiums on visualization include ACM Human Factors in Computing Systems (CHI), Visualization, and the International Symposium on Information Visualization. Research on visualization is also published in Transactions on Visualization and Computer Graphics, Journal of Computational and Graphical Statistics, and IEEE Computer Graphics and Applications.
Many graphical user interfaces and visualization tools have been developed and can be found in various data mining products. Several books on data mining (e.g., Data Mining Solutions by Westphal and Blaxton [WB98]) present many good examples and visual snapshots. For a survey of visualization techniques, see “Visual techniques for exploring databases” by Keim [Kei97].