Data Mining - Mehmed Kantardzic [102]
Cherkassky, V., F. Mulier, Learning from Data: Concepts, Theory and Methods, John Wiley, New York, 1998.
The book provides a unified treatment of the principles and methods for learning dependencies from data. It establishes a general conceptual framework in which various learning methods from statistics, machine learning, and other disciplines can be applied, showing that a few fundamental principles underlie most new methods being proposed today. An additional strength of this primarily theoretical book is its large number of case studies and examples, which simplify the concepts of statistical learning theory and make them easier to understand.
Hand, D., H. Mannila, P. Smyth, Principles of Data Mining, MIT Press, Cambridge, MA, 2001.
The book consists of three sections. The first, foundations, provides a tutorial overview of the principles underlying data-mining algorithms and their applications. The second section, data-mining algorithms, shows how algorithms are constructed to solve specific problems in a principled manner. The third section shows how all of the preceding analyses fit together when applied to real-world data-mining problems.
Nisbet, R., J. Elder, G. Miner, Handbook of Statistical Analysis and Data Mining Applications, Elsevier Inc., Amsterdam, 2009.
This comprehensive professional reference guides business analysts, scientists, engineers, and researchers (both academic and industrial) through all stages of data analysis, model building, and implementation. The handbook helps readers discern technical and business problems, understand the strengths and weaknesses of modern data-mining algorithms, and employ the right statistical methods for practical application. It shows how to address massive and complex data sets with novel statistical approaches and how to evaluate analyses and solutions objectively. It offers clear, intuitive explanations of the principles and tools for solving problems using modern analytic techniques, and discusses their application to real problems in ways accessible and beneficial to practitioners across industries, from science and engineering to medicine, academia, and commerce. The handbook brings together, in a single resource, all the information a beginner will need to understand the tools and issues in data mining and to build successful data-mining solutions.
6
DECISION TREES AND DECISION RULES
Chapter Objectives
Analyze the characteristics of a logic-based approach to classification problems.
Describe the differences between decision-tree and decision-rule representations in a final classification model.
Explain in depth the C4.5 algorithm for generating decision trees and decision rules.
Identify the required changes in the C4.5 algorithm when missing values exist in the training or testing data set.
Introduce the basic characteristics of the Classification and Regression Trees (CART) algorithm and the Gini index.
Know when and how to use pruning techniques to reduce the complexity of decision trees and decision rules.
Summarize the limitations of representing a classification model by decision trees and decision rules.
Decision trees and decision rules are data-mining methodologies applied in many real-world applications as a powerful solution to classification problems. To begin with, therefore, let us briefly summarize the basic principles of classification. In general, classification is a process of learning a function that maps a data item into one of several predefined classes. Every classification algorithm based on inductive learning takes as input a set of samples, each consisting of a vector of attribute values (also called a feature vector) and a corresponding class. The goal of learning is to create a classification model, known as a classifier, which will predict the class of some entity (a given sample) from the values of its available input attributes. In other words, classification is the process of assigning a discrete label value (class) to an unlabeled record, and a classifier