Data Mining - Mehmed Kantardzic [37]
Kennedy, R. L. et al., Solving Data Mining Problems through Pattern Recognition, Prentice Hall, Upper Saddle River, NJ, 1998.
The book takes a practical approach to overall data-mining project development. The rigorous, multistep methodology includes defining the data set; collecting, preparing, and preprocessing data; choosing the appropriate technique and tuning the parameters; and training, testing, and troubleshooting.
Weiss, S. M., N. Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann Publishers, San Francisco, CA, 1998.
This book focuses on the data-preprocessing phase in successful data-mining applications. Preparation and organization of data and development of an overall strategy for data mining are not only time-consuming processes, but also fundamental requirements in real-world data mining. The simple presentation of topics with a large number of examples is an additional strength of the book.
3 DATA REDUCTION
Chapter Objectives
Identify the differences among dimensionality-reduction techniques based on reduction of features, cases, and values.
Explain the advantages of data reduction in the preprocessing phase of a data-mining process.
Understand the basic principles of feature-selection and feature-composition tasks using corresponding statistical methods.
Apply and compare the entropy-based technique and principal component analysis (PCA) for feature ranking.
Understand the basic principles and implement ChiMerge and bin-based techniques for reduction of discrete values.
Distinguish between case-reduction approaches based on incremental samples and those based on average samples.
For small or moderate data sets, the preprocessing steps described in the previous chapter are usually sufficient preparation for data mining. For very large data sets, it is increasingly likely that an additional intermediate step, data reduction, should be performed before applying data-mining techniques. While large data sets have the potential for better mining results, there is no guarantee that they will yield better knowledge than small data sets. Given multidimensional data, a central question is whether it can be determined, prior to searching for all data-mining solutions in all dimensions, that the method has exhausted its potential for mining and discovery in a reduced data set. More commonly, a general solution may be deduced from a subset of the available features or cases, and it will remain the same even when the search space is enlarged.
The main theme for simplifying the data in this step is dimension reduction, and the main question is whether some of these prepared and preprocessed data can be discarded without sacrificing the quality of results. There is one additional question about techniques for data reduction: Can the prepared data be reviewed and a subset found in a reasonable amount of time and space? If the complexity of algorithms for data reduction increases exponentially, then there is little to gain in reducing dimensions in big data. In this chapter, we will present basic and relatively efficient techniques for dimension reduction applicable to different data-mining problems.
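The concern about exponential complexity can be made concrete with a small sketch. It assumes, purely for illustration, a reduction method that evaluates every non-empty subset of features exhaustively; the text does not name such a method, and the function names below are hypothetical.

```python
from itertools import combinations


def count_feature_subsets(n_features: int) -> int:
    """Number of non-empty feature subsets an exhaustive search must evaluate."""
    return 2 ** n_features - 1


def enumerate_subsets(features):
    """Yield every non-empty subset of `features`, smallest subsets first."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


if __name__ == "__main__":
    # The count doubles with every added feature, so exhaustive search
    # becomes impractical long before "several hundred" features.
    for n in (5, 10, 20, 40):
        print(f"{n:>3} features -> {count_feature_subsets(n):,} candidate subsets")
```

Under this assumption, each additional feature doubles the number of candidate subsets, which is why the techniques in this chapter favor ranking or heuristic selection over exhaustive enumeration.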
3.1 DIMENSIONS OF LARGE DATA SETS
The choice of data representation, together with the selection, reduction, or transformation of features, is probably the most important factor determining the quality of a data-mining solution. Besides influencing the nature of a data-mining algorithm, it can determine whether the problem is solvable at all, or how powerful the resulting data-mining model is. A large number of features can make the available samples of data relatively insufficient for mining. In practice, the number of features can be as many as several hundred. If we have only a few hundred samples