Online Book Reader


Data Mining - Mehmed Kantardzic [7]

have led to the data. In place of the statistical emphasis on models, machine learning tends to emphasize algorithms. This is hardly surprising; the very word “learning” contains the notion of a process, an implicit algorithm.

Basic modeling principles in data mining also have roots in control theory, which is primarily applied to engineering systems and industrial processes. The problem of determining a mathematical model for an unknown system (also referred to as the target system) by observing its input–output data pairs is generally referred to as system identification. The purposes of system identification are multiple and, from the standpoint of data mining, the most important are to predict a system’s behavior and to explain the interaction and relationships between the variables of a system.

System identification generally involves two top-down steps:

1. Structure Identification. In this step, we need to apply a priori knowledge about the target system to determine a class of models within which the search for the most suitable model is to be conducted. Usually this class of models is denoted by a parameterized function y = f(u,t), where y is the model’s output, u is an input vector, and t is a parameter vector. The determination of the function f is problem-dependent, and the function is based on the designer’s experience, intuition, and the laws of nature governing the target system.

2. Parameter Identification. In the second step, once the structure of the model is known, we apply optimization techniques to determine the parameter vector t* such that the resulting model y* = f(u,t*) describes the system appropriately.
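As a minimal sketch of the two steps, assume (purely for illustration) that structure identification has already yielded a linear model with a scalar input, y = f(u,t) = t[0] + t[1]·u. Parameter identification then reduces to ordinary least squares, which has a closed-form solution. The structure, the data values, and the function names below are assumptions made for this example, not part of the original text.

```python
# Step 1 (structure identification), assumed here: a linear model
# y = f(u, t) = t[0] + t[1] * u with scalar input u and parameter vector t.
def f(u, t):
    return t[0] + t[1] * u


def identify_parameters(us, ys):
    """Step 2 (parameter identification): closed-form least-squares
    estimate t* = (intercept, slope) for the linear structure above."""
    n = len(us)
    mean_u = sum(us) / n
    mean_y = sum(ys) / n
    s_uu = sum((u - mean_u) ** 2 for u in us)
    s_uy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    slope = s_uy / s_uu
    intercept = mean_y - slope * mean_u
    return (intercept, slope)


# Hypothetical input-output pairs observed from the target system.
# They were generated from y = 1 + 2u without noise.
us = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

t_star = identify_parameters(us, ys)      # estimated parameter vector t*
y_star = [f(u, t_star) for u in us]       # model outputs y* = f(u, t*)
```

With noisy data or a nonlinear structure, the closed-form solution would typically be replaced by a numerical optimizer, but the division of labor between the two steps stays the same.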

In general, system identification is not a one-pass process: Both structure and parameter identification need to be done repeatedly until a satisfactory model is found. This iterative process is represented graphically in Figure 1.1. Typical steps in every iteration are as follows:

1. Specify and parameterize a class of formalized (mathematical) models, y* = f(u,t*), representing the system to be identified.

2. Perform parameter identification to choose the parameters that best fit the available data set, that is, the parameters for which the difference y − y* is minimal.

3. Conduct validation tests to see whether the identified model responds correctly to an unseen data set (often referred to as the test, validation, or checking data set).

4. Terminate the process once the results of the validation test are satisfactory.

Figure 1.1. Block diagram for parameter identification.
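The four steps of an iteration can be sketched as a simple loop. The example below is only an illustration under stated assumptions: the candidate structures are polynomials of increasing degree, parameters are fitted by least squares through the normal equations, and validation uses the mean squared error on a held-out data set. The data, the tolerance, and all function names are hypothetical.

```python
def fit_polynomial(us, ys, degree):
    """Parameter identification: least-squares polynomial coefficients,
    obtained by solving the normal equations A t = b."""
    m = degree + 1
    a = [[sum(u ** (i + j) for u in us) for j in range(m)] for i in range(m)]
    b = [sum(y * u ** i for u, y in zip(us, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, m):
            factor = a[row][col] / a[col][col]
            for k in range(col, m):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    t = [0.0] * m
    for i in reversed(range(m)):
        t[i] = (b[i] - sum(a[i][k] * t[k] for k in range(i + 1, m))) / a[i][i]
    return t


def predict(u, t):
    """Model output y* = f(u, t*) for the polynomial structure."""
    return sum(c * u ** i for i, c in enumerate(t))


def mean_squared_error(us, ys, t):
    return sum((y - predict(u, t)) ** 2 for u, y in zip(us, ys)) / len(us)


def identify_system(train, valid, max_degree=5, tolerance=1e-6):
    train_u, train_y = train
    valid_u, valid_y = valid
    for degree in range(max_degree + 1):                     # 1. specify a structure
        t_star = fit_polynomial(train_u, train_y, degree)    # 2. fit parameters
        error = mean_squared_error(valid_u, valid_y, t_star)  # 3. validate
        if error <= tolerance:                               # 4. terminate
            return degree, t_star, error
    return None  # no candidate structure passed validation


def target(u):
    """Hypothetical target system, unknown to the identification loop."""
    return 1.0 + 2.0 * u - 0.5 * u ** 2


train_u = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
valid_u = [-1.5, 0.5, 2.5]
train = (train_u, [target(u) for u in train_u])
valid = (valid_u, [target(u) for u in valid_u])

result = identify_system(train, valid)
```

On this noise-free data the loop rejects the constant and linear structures and stops at the quadratic one. With real data the tolerance would have to reflect the noise level, and the normal equations would usually give way to a numerically more robust solver.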

If we do not have any a priori knowledge about the target system, then structure identification becomes difficult, and we have to select the structure by trial and error. While we know a great deal about the structures of most engineering systems and industrial processes, in the vast majority of target systems where we apply data-mining techniques, these structures are totally unknown, or they are so complex that it is impossible to obtain an adequate mathematical model. Therefore, new techniques were developed for parameter identification, and today they are part of the spectrum of data-mining techniques.

Finally, we can distinguish between how the terms “model” and “pattern” are interpreted in data mining. A model is a “large-scale” structure, perhaps summarizing relationships over many (sometimes all) cases, whereas a pattern is a local structure, satisfied by few cases or in a small region of a data space. It is also worth noting here that the word “pattern,” as it is used in pattern recognition, has a rather different meaning from the one it has in data mining. In pattern recognition it refers to the vector of measurements characterizing a particular object, which is a point in a multidimensional data space. In data mining, a pattern is simply a local model. In this book we refer to n-dimensional vectors of data as samples.

1.3 DATA-MINING PROCESS


Without trying to cover all possible approaches and all different views about data mining as a discipline, let us start with one possible, sufficiently broad definition of data mining:

Data mining is
