9. Develop a software tool for feature ranking based on means and variances. The input data set is represented in the form of a flat file with several features. (A sketch of one such ranking follows this exercise list.)

10. Develop a software tool for ranking features using an entropy measure. The input data set is represented in the form of a flat file with several features.

11. Implement the ChiMerge algorithm for automated discretization of selected features in a flat input file. (A minimal sketch follows this exercise list.)

12. Given the data set F = {4, 2, 1, 6, 4, 3, 1, 7, 2, 2}, apply two iterations of the bin method for value reduction using best cutoffs. The initial number of bins is 3. What are the final medians of the bins, and what is the total minimized error?

13. Assume you have 100 values that are all different, and use equal width discretization with 10 bins. (A small demonstration of both discretization schemes follows this exercise list.)

(a) What is the largest number of records that could appear in one bin?

(b) What is the smallest number of records that could appear in one bin?

(c) If you use equal height discretization with 10 bins, what is the largest number of records that can appear in one bin?

(d) If you use equal height discretization with 10 bins, what is the smallest number of records that can appear in one bin?

(e) Now assume that the maximum value frequency is 20. What is the largest number of records that could appear in one bin with equal width discretization (10 bins)?

(f) What about with equal height discretization (10 bins)?
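For exercise 9, the sketch below illustrates one common formulation of the means-and-variances test for a two-class problem: each feature is scored by the distance between its class means relative to the standard error of that difference. The flat-file layout (comma-separated values with the class label in the last column) and the function and file names are assumptions made for illustration, not a prescribed format.

```python
# Minimal sketch: rank features by the separation of class means relative to
# the standard error of the difference (two-class case). The CSV layout
# (last column = class label) is an illustrative assumption.
import csv
import math
from collections import defaultdict

def load_flat_file(path):
    """Read a comma-separated flat file; the last column is the class label."""
    rows, labels = [], []
    with open(path, newline="") as f:
        for rec in csv.reader(f):
            rows.append([float(v) for v in rec[:-1]])
            labels.append(rec[-1])
    return rows, labels

def mean_variance_scores(rows, labels):
    """Score each feature by |mean_A - mean_B| / sqrt(var_A/n_A + var_B/n_B)."""
    by_class = defaultdict(list)
    for row, lab in zip(rows, labels):
        by_class[lab].append(row)
    a, b = list(by_class.values())[:2]            # assumes exactly two classes
    scores = []
    for j in range(len(rows[0])):
        xa = [r[j] for r in a]
        xb = [r[j] for r in b]
        ma, mb = sum(xa) / len(xa), sum(xb) / len(xb)
        va = sum((x - ma) ** 2 for x in xa) / max(len(xa) - 1, 1)
        vb = sum((x - mb) ** 2 for x in xb) / max(len(xb) - 1, 1)
        se = math.sqrt(va / len(xa) + vb / len(xb)) or 1e-12
        scores.append(abs(ma - mb) / se)
    return scores

# Example usage (hypothetical file name):
# rows, labels = load_flat_file("features.csv")
# ranking = sorted(enumerate(mean_variance_scores(rows, labels)),
#                  key=lambda t: t[1], reverse=True)
# print(ranking)   # feature indices ordered from most to least relevant
```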
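For exercise 11, the following sketch outlines ChiMerge for a single numeric feature: every distinct value starts as its own interval, the chi-square statistic is computed for each pair of adjacent intervals from their class-frequency counts, and the pair with the smallest value is merged until all remaining adjacent pairs exceed a chosen threshold. The input layout (a list of (value, class) pairs) and the default threshold are illustrative assumptions; a complete tool would read the data from the flat file and take the threshold from a chi-square table.

```python
# Minimal ChiMerge sketch for one numeric feature. The (value, class_label)
# input layout and the default threshold are illustrative assumptions.
from collections import Counter

def chi2(counts_a, counts_b):
    """Chi-square statistic for two adjacent intervals given their class counts."""
    classes = set(counts_a) | set(counts_b)
    total = sum(counts_a.values()) + sum(counts_b.values())
    stat = 0.0
    for counts in (counts_a, counts_b):
        row_total = sum(counts.values())
        for c in classes:
            col_total = counts_a.get(c, 0) + counts_b.get(c, 0)
            expected = row_total * col_total / total or 0.1   # guard against zero expected counts
            stat += (counts.get(c, 0) - expected) ** 2 / expected
    return stat

def chimerge(samples, threshold=4.61):
    """Merge adjacent intervals until every adjacent chi-square exceeds the threshold."""
    samples = sorted(samples)                      # sort by feature value
    intervals = []                                 # one interval per distinct value
    for value, label in samples:
        if intervals and intervals[-1][0] == value:
            intervals[-1][1][label] += 1
        else:
            intervals.append((value, Counter({label: 1})))
    while len(intervals) > 1:
        stats = [chi2(intervals[i][1], intervals[i + 1][1])
                 for i in range(len(intervals) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] > threshold:
            break                                  # all adjacent pairs are distinct enough
        merged = (intervals[i][0], intervals[i][1] + intervals[i + 1][1])
        intervals[i:i + 2] = [merged]
    return [v for v, _ in intervals]               # lower boundaries of the final intervals

# Example usage with made-up data:
# cuts = chimerge([(1, "A"), (3, "A"), (7, "B"), (8, "B"), (9, "A")], threshold=2.7)
# print(cuts)
```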
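Exercise 13 contrasts equal-width and equal-height (equal-frequency) discretization. The short sketch below builds both kinds of bins for a list of values and reports the per-bin record counts, which can be used to check intuition for parts (a) through (d); the sample data are made up, and the case of repeated values in parts (e) and (f) is not handled.

```python
# Minimal sketch contrasting equal-width and equal-height discretization.
# The sample data at the bottom are made up purely for illustration.
def equal_width_counts(values, n_bins=10):
    """Count how many records fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)   # clamp the maximum value
        counts[idx] += 1
    return counts

def equal_height_counts(values, n_bins=10):
    """Count records per bin when sorted values are split into equal-frequency bins."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    counts, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        counts.append(end - start)
        start = end
    return counts

# Example usage: 100 distinct values, heavily skewed toward small numbers.
values = [i ** 2 for i in range(100)]
print(equal_width_counts(values))    # bins near zero hold most of the records
print(equal_height_counts(values))   # every bin holds exactly 10 records
```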

3.10 REFERENCES FOR FURTHER STUDY


Fodor, I. K., A Survey of Dimension Reduction Techniques, LLNL Technical Report, June 2002.

The author reviews PCA and FA, the two most widely used linear dimension-reduction methods based on second-order statistics. However, many data sets of interest are not realizations of Gaussian distributions. For those cases, higher-order dimension-reduction methods, which use information not contained in the covariance matrix, are more appropriate. The survey also covers ICA and the method of random projections.

Liu, H., H. Motoda, eds., Instance Selection and Construction for Data Mining, Kluwer Academic Publishers, Boston, MA, 2001.

Many different approaches have been used to address the data-explosion issue, such as algorithm scale-up and data reduction. Instance, sample, or tuple selection pertains to methods that select or search for a representative portion of data that can fulfill a data-mining task as if the whole data were used. This book brings researchers and practitioners together to report new developments and applications in instance-selection techniques, to share hard-learned experiences in order to avoid similar pitfalls, and to shed light on future development.

Liu, H., H. Motoda, Feature Selection for Knowledge Discovery and Data Mining, (Second Printing), Kluwer Academic Publishers, Boston, MA, 2000.

The book offers an overview of feature-selection methods and provides a general framework in order to examine these methods and categorize them. The book uses simple examples to show the essence of methods and suggests guidelines for using different methods under various circumstances.

Liu, H., H. Motoda, Computational Methods of Feature Selection, CRC Press, Boston, MA, 2007.

The book offers excellent surveys, practical guidance, and comprehensive tutorials from leading experts. It paints a picture of the state-of-the-art techniques that can boost the capabilities of many existing data-mining tools, and it presents novel developments in feature selection that have emerged in recent years, including causal feature selection and Relief. The book contains real-world case studies from a variety of areas, including text classification, web mining, and bioinformatics.

Saul, L. K., et al., Spectral Methods for Dimensionality Reduction, in Semisupervised Learning, B. Schölkopf, O. Chapelle and A. Zien, eds., MIT Press, Cambridge, MA, 2005.

Spectral methods have recently emerged as a powerful tool for nonlinear dimensionality reduction and manifold learning. These methods are able to reveal low-dimensional structure in high-dimensional data from the top or bottom eigenvectors of specially constructed matrices. To analyze data that lie on a low-dimensional
