Data Mining - Mehmed Kantardzic [2]
new topics such as ensemble learning, graph mining, temporal, spatial, distributed, and privacy preserving data mining;
new algorithms such as Classification and Regression Trees (CART), Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Balanced and Iterative Reducing and Clustering Using Hierarchies (BIRCH), PageRank, AdaBoost, support vector machines (SVM), Kohonen self-organizing maps (SOM), and latent semantic indexing (LSI);
more details on practical aspects and business understanding of a data-mining process, discussing important problems of validation, deployment, data understanding, causality, security, and privacy; and
some quantitative measures and methods for comparison of data-mining models such as ROC curve, lift chart, ROI chart, McNemar’s test, and K-fold cross validation paired t-test.
Keeping in mind the educational aspect of the book, many new exercises have been added. The bibliography and appendices have been updated to include work that has appeared in the last few years, as well as to reflect the change in emphasis when a new topic gained importance.
I would like to thank all my colleagues all over the world who used the first edition of the book for their classes and who sent me support, encouragement, and suggestions to put together this revised version. My sincere thanks are due to all my colleagues and students in the Data Mining Lab and Computer Science Department for their reviews of this edition, and numerous helpful suggestions. Special thanks go to graduate students Brent Wenerstrom, Chamila Walgampaya, and Wael Emara for patience in proofreading this new edition and for useful discussions about the content of new chapters, numerous corrections, and additions. To Dr. Joung Woo Ryu, who helped me enormously in the preparation of the final version of the text and all additional figures and tables, I would like to express my deepest gratitude.
I believe this book can serve as a valuable guide to the field for undergraduate and graduate students, researchers, and practitioners. I hope that the wide range of topics covered will allow readers to appreciate the extent of the impact of data mining on modern business, science, and even society as a whole.
MEHMED KANTARDZIC
Louisville
July 2011
PREFACE TO THE FIRST EDITION
The modern technologies of computers, networks, and sensors have made data collection and organization an almost effortless task. However, the captured data need to be converted into information and knowledge to become useful. Traditionally, the task of extracting useful information from recorded data has been performed by analysts; however, the increasing volume of data in modern businesses and sciences calls for computer-based methods for this task. As data sets have grown in size and complexity, there has been an inevitable shift away from direct hands-on data analysis toward indirect, automatic data analysis in which the analyst works via more complex and sophisticated tools. The entire process of applying computer-based methodology, including new techniques for knowledge discovery from data, is often called data mining.
The importance of data mining arises from the fact that the modern world is a data-driven world. We are surrounded by data, numerical and otherwise, which must be analyzed and processed to convert it into information that informs, instructs, answers, or otherwise aids understanding and decision making. In the age of the Internet, intranets, data warehouses, and data marts, the fundamental paradigms of classical data analysis are ripe for change. Very large collections of data, millions or even hundreds of millions of individual records, are now being stored in centralized data warehouses, allowing analysts to make