Data Mining - Mehmed Kantardzic [5]
Data mining is an iterative process within which progress is defined by discovery, through either automatic or manual methods. Data mining is most useful in an exploratory analysis scenario in which there are no predetermined notions about what will constitute an “interesting” outcome. Data mining is the search for new, valuable, and nontrivial information in large volumes of data. It is a cooperative effort of humans and computers. Best results are achieved by balancing the knowledge of human experts in describing problems and goals with the search capabilities of computers.
In practice, the two primary goals of data mining tend to be prediction and description. Prediction involves using some variables or fields in the data set to predict unknown or future values of other variables of interest. Description, on the other hand, focuses on finding patterns describing the data that can be interpreted by humans. Therefore, it is possible to put data-mining activities into one of two categories:
1. predictive data mining, which produces the model of the system described by the given data set, or
2. descriptive data mining, which produces new, nontrivial information based on the available data set.
On the predictive end of the spectrum, the goal of data mining is to produce a model, expressed as executable code, which can be used to perform classification, prediction, estimation, or other similar tasks. On the descriptive end of the spectrum, the goal is to gain an understanding of the analyzed system by uncovering patterns and relationships in large data sets. The relative importance of prediction and description for particular data-mining applications can vary considerably. The goals of prediction and description are achieved by using data-mining techniques, explained later in this book, for the following primary data-mining tasks:
1. Classification. Discovery of a predictive learning function that classifies a data item into one of several predefined classes.
2. Regression. Discovery of a predictive learning function that maps a data item to a real-valued prediction variable.
3. Clustering. A common descriptive task in which one seeks to identify a finite set of categories or clusters to describe the data.
4. Summarization. An additional descriptive task that involves methods for finding a compact description for a set (or subset) of data.
5. Dependency Modeling. Finding a local model that describes significant dependencies between variables or between the values of a feature in a data set or in a part of a data set.
6. Change and Deviation Detection. Discovering the most significant changes in the data set.
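To make the six tasks above more concrete, the following Python sketch applies one deliberately simple technique to each of them. It is an illustration only, not the method developed in this book: the toy data set (hours studied, exam score, pass/fail label), all function names, and the choice of techniques (nearest class mean, least-squares line, two-means clustering, mean and standard deviation, Pearson correlation, z-score threshold) are assumptions made for the example. The correlation coefficient in particular is only a crude stand-in for dependency modeling.

```python
from statistics import mean, pstdev

# Hypothetical toy data: (hours_studied, exam_score, outcome) per student.
records = [
    (1.0, 52.0, "fail"), (2.0, 55.0, "fail"), (3.0, 61.0, "fail"),
    (6.0, 74.0, "pass"), (7.0, 79.0, "pass"), (8.0, 83.0, "pass"),
]
hours = [r[0] for r in records]
scores = [r[1] for r in records]


def classify(x):
    """Classification: assign x to the predefined class with the nearest mean."""
    labels = {r[2] for r in records}
    centroids = {c: mean(r[0] for r in records if r[2] == c) for c in labels}
    return min(centroids, key=lambda c: abs(x - centroids[c]))


def fit_line(xs, ys):
    """Regression: least-squares line mapping x to a real-valued prediction."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def two_means(xs, iterations=10):
    """Clustering: split 1-D values into two groups (basic k-means, k = 2).

    Assumes xs contains at least two distinct values.
    """
    c0, c1 = min(xs), max(xs)
    for _ in range(iterations):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = mean(g0), mean(g1)
    return sorted(g0), sorted(g1)


def summarize(xs):
    """Summarization: a compact description of a set of values."""
    return {"count": len(xs), "mean": mean(xs), "stdev": pstdev(xs)}


def pearson(xs, ys):
    """Dependency modeling (crude stand-in): linear correlation of two variables."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def deviations(xs, threshold=2.0):
    """Deviation detection: values more than `threshold` std. devs. from the mean."""
    m, s = mean(xs), pstdev(xs)
    return [x for x in xs if abs(x - m) > threshold * s]


if __name__ == "__main__":
    slope, intercept = fit_line(hours, scores)
    print("class for 6.5 hours:", classify(6.5))
    print("predicted score for 5 hours:", round(slope * 5 + intercept, 1))
    print("clusters of hours:", two_means(hours))
    print("summary of scores:", summarize(scores))
    print("correlation hours/scores:", round(pearson(hours, scores), 3))
    print("deviating scores:", deviations(scores + [150.0]))
```

The first two functions are predictive: they return a model or decision that can be applied to unseen data items. The remaining four are descriptive: they report structure, compact summaries, relationships, or unusual values in the data already at hand.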
A more formal approach, with graphical interpretation of data-mining tasks for complex and large data sets and illustrative examples, is given in Chapter 4. The introductory classifications and definitions are included here only to give the reader a sense of the wide spectrum of problems and tasks that may be solved using data-mining technology.
The success of a data-mining engagement depends largely on the amount of energy, knowledge, and creativity that the designer puts into it. In essence, data mining is like solving a puzzle. The individual pieces of the puzzle are not complex structures in and of themselves. Taken as a collective whole, however, they can constitute very elaborate systems. As you try to unravel these systems, you will probably get frustrated, start forcing parts together, and generally become annoyed at the entire process, but once you know how to work with the pieces, you realize that it was not really that hard in the first place. The same analogy can be applied