Data Mining - Mehmed Kantardzic [21]
Olson, D., Y. Shi, Introduction to Business Data Mining, McGraw-Hill, Englewood Cliffs, NJ, 2007.
Introduction to Business Data Mining was developed to introduce students, as opposed to professional practitioners or engineering students, to the fundamental concepts of data mining. Most importantly, this text shows readers how to gather and analyze large sets of data to gain useful business understanding. The authors’ team has had extensive experience with the quantitative analysis of business as well as with data-mining analysis. They have both taught this material and used their own graduate students to prepare the text’s data-mining reports. Using real-world vignettes and their extensive knowledge of this new subject, David Olson and Yong Shi have created a text that demonstrates data-mining processes and techniques needed for business applications.
Westphal, C., T. Blaxton, Data Mining Solutions: Methods and Tools for Solving Real-World Problems, John Wiley, New York, 1998.
This introductory book gives a refreshing “out-of-the-box” approach to data mining that will help the reader to maximize time and problem-solving resources, and prepare for the next wave of data-mining visualization techniques. An extensive coverage of data-mining software tools is valuable to readers who are planning to set up their own data-mining environment.
2
PREPARING THE DATA
Chapter Objectives
Analyze basic representations and characteristics of raw and large data sets.
Apply different normalization techniques on numerical attributes.
Recognize different techniques for data preparation, including attribute transformation.
Compare different methods for elimination of missing values.
Construct a method for uniform representation of time-dependent data.
Compare different techniques for outlier detection.
Implement some data preprocessing techniques.
2.1 REPRESENTATION OF RAW DATA
Data samples introduced as rows in Figure 1.4 are basic components in a data-mining process. Every sample is described with several features, and there are different types of values for every feature. We will start with the two most common types: numeric and categorical. Numeric values include real-value variables or integer variables such as age, speed, or length. A feature with numeric values has two important properties: Its values have an order relation (2 < 5 and 5 < 7) and a distance relation (d [2.3, 4.2] = 1.9).
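The two defining properties of numeric features can be illustrated with a short sketch (the `distance` helper below is an illustrative name, not a function from the text):

```python
def distance(a, b):
    """Absolute distance between two numeric feature values."""
    return abs(a - b)

# Order relation: numeric values can be ranked.
assert 2 < 5 < 7

# Distance relation: d(2.3, 4.2) = 1.9
d = distance(2.3, 4.2)
print(round(d, 9))  # 1.9 (rounded to absorb floating-point error)
```

Categorical values, by contrast, support only the equality test (`"Blue" == "Blue"`, `"Red" != "Black"`); no order or distance is defined for them.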
In contrast, categorical (often called symbolic) variables have neither of these two relations. The two values of a categorical variable can be either equal or not equal: They only support an equality relation (Blue = Blue, or Red ≠ Black). Examples of variables of this type are eye color, sex, or country of citizenship. A categorical variable with two values can be converted, in principle, to a numeric binary variable with two values: 0 or 1. A categorical variable with n values can be converted into n binary numeric variables, namely, one binary variable for each categorical value. These coded categorical variables are known as “dummy variables” in statistics. For example, if the variable eye color has four values (black, blue, green, and brown), they can be coded with four binary digits.
Feature Value    Code
Black            1000
Blue             0100
Green            0010
Brown            0001
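The dummy-variable coding shown in the table above can be sketched as a small Python function (`one_hot` and `eye_colors` are illustrative names, not from the text): each categorical value maps to a vector of n binary digits with a single 1 in the position of that value.

```python
def one_hot(value, categories):
    """Encode a categorical value as n binary dummy variables,
    one per possible category value."""
    return [1 if value == c else 0 for c in categories]

eye_colors = ["Black", "Blue", "Green", "Brown"]
print(one_hot("Green", eye_colors))  # [0, 0, 1, 0], i.e., code 0010
```

Note that the n resulting binary variables are mutually exclusive: exactly one of them is 1 for any given sample, which is why n categorical values require n dummy variables rather than log2(n) bits.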
Another way of classifying a variable, based on its values, is to look at it as a continuous variable or a discrete variable.
Continuous variables are also known as quantitative or metric variables. They are measured using either an interval scale or a ratio scale. Both scales allow the underlying variable to be defined or measured theoretically with infinite precision. The difference between these two scales lies in how the 0 point is defined in the scale. The 0 point in the interval scale is placed arbitrarily, and thus it does not indicate the complete absence of whatever is being measured. The best example of the interval scale is the temperature scale, where 0 degrees Fahrenheit