Data Mining_ Concepts and Techniques - Jiawei Han [80]
3.12 ChiMerge [Ker92] is a supervised, bottom-up (i.e., merge-based) data discretization method. It relies on χ2 analysis: Adjacent intervals with the least χ2 values are merged together until the chosen stopping criterion satisfies.
(a) Briefly describe how ChiMerge works.
(b) Take the IRIS data set, obtained from the University of California–Irvine Machine Learning Data Repository (www.ics.uci.edu/~mlearn/MLRepository.html), as a data set to be discretized. Perform data discretization for each of the four numeric attributes using the ChiMerge method. (Let the stopping criteria be: max-interval = 6). You need to write a small program to do this to avoid clumsy numerical computation. Submit your simple analysis and your test results: split-points, final intervals, and the documented source program.
3.13 Propose an algorithm, in pseudocode or in your favorite programming language, for the following:
(a) The automatic generation of a concept hierarchy for nominal data based on the number of distinct values of attributes in the given schema.
(b) The automatic generation of a concept hierarchy for numeric data based on the equal-width partitioning rule.
(c) The automatic generation of a concept hierarchy for numeric data based on the equal-frequency partitioning rule.
3.14 Robust data loading poses a challenge in database systems because the input data are often dirty. In many cases, an input record may miss multiple values; some records could be contaminated, with some data values out of range or of a different data type than expected. Work out an automated data cleaning and loading algorithm so that the erroneous data will be marked and contaminated data will not be mistakenly inserted into the database during data loading.
3.8. Bibliographic Notes
Data preprocessing is discussed in a number of textbooks, including English [Eng99], Pyle [Pyl99], Loshin [Los01], Redman [Red01] and Dasu and Johnson [DJ03]. More specific references to individual preprocessing techniques are given later.
For discussion regarding data quality, see Redman [Red92]; Wang, Storey, and Firth [WSF95]; Wand and Wang [WW96]; Ballou and Tayi [BT99]; and Olson [Ols03]. Potter's Wheel (control.cx.berkely.edu/abc), the interactive data cleaning tool described in Section 3.2.3, is presented in Raman and Hellerstein [RH01]. An example of the development of declarative languages for the specification of data transformation operators is given in Galhardas et al. [GFS+01]. The handling of missing attribute values is discussed in Friedman [Fri77]; Breiman, Friedman, Olshen, and Stone [BFOS84]; and Quinlan [Qui89]. Hua and Pei [HP07] presented a heuristic approach to cleaning disguised missing data, where such data are captured when users falsely select default values on forms (e.g., “January 1” for birthdate) when they do not want to disclose personal information.
A method for the detection of outlier or “garbage” patterns in a handwritten character database is given in Guyon, Matic, and Vapnik [GMV96]. Binning and data normalization are treated in many texts, including Kennedy et al. [KLV+98], Weiss and Indurkhya [WI98] and Pyle [Pyl99]. Systems that include attribute (or feature) construction include BACON by Langley, Simon, Bradshaw, and Zytkow [LSBZ87]; Stagger by Schlimmer [Sch86]; FRINGE by Pagallo [Pag89]; and AQ17-DCI by Bloedorn and Michalski [BM98]. Attribute construction is also described in Liu and Motoda [LM98a] and [LM98b]. Dasu et al. built a BELLMAN system and proposed a set of interesting methods for building a data quality browser by mining database structures [DJMS02].
A good survey of data reduction techniques can be found in Barbarà et al. [BDF+97]. For algorithms on data cubes and their precomputation, see Sarawagi and Stonebraker [SS94]; Agarwal et al. [AAD+96]; Harinarayan, Rajaraman, and Ullman [HRU96]; Ross and Srivastava [RS97]; and Zhao, Deshpande, and Naughton