Online Book Reader

Data Mining - Mehmed Kantardzic [60]

functions, Y = f(X, w); instead, it must be assumed or demonstrated by arguments outside the results of inductive-learning analysis. For example, it is well known that people in Florida are on average older than in the rest of the United States. This observation may be supported by inductive-learning dependencies, but it does not imply that the climate in Florida causes people to live longer. The cause is totally different: people move there when they retire, and that is a likely reason, though perhaps not the only one, why people are older in Florida than elsewhere. A similar misinterpretation could be based on the data analysis of life expectancy for married versus single men. Statistics show that married men live longer than single men. But do not rush to sensational causal conclusions, such as that marriage is good for one's health and increases life expectancy. It can be argued that males with physical problems and/or socially deviant patterns of behavior are less likely to get married, and this could be one of the possible explanations of why married men live longer. Unobservable factors such as a person's health and social behavior are more likely the cause of changed life expectancy than the observed variable, marital status. These illustrations should lead us to understand that inductive-learning processes build a model of dependencies, but these dependencies should not automatically be interpreted as causal relations. Only experts in the domain where the data are collected may suggest additional, deeper semantics of discovered dependencies.
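
The married-versus-single illustration can be reproduced with a small simulation. The following is only a sketch under invented assumptions: the hidden `healthy` factor, the marriage probabilities, and the lifespan figures are all hypothetical numbers chosen for illustration, not data from any study. Marital status has no effect on lifespan in this model, yet the married group still shows a longer average life.

```python
import random


def simulate(n=20000, seed=0):
    """Return (mean lifespan of married, mean lifespan of single) people.

    All numbers below are made up for illustration. Lifespan depends only
    on the hidden factor `healthy`, never on marital status.
    """
    rng = random.Random(seed)
    married, single = [], []
    for _ in range(n):
        healthy = rng.random() < 0.5            # unobserved factor
        p_marry = 0.8 if healthy else 0.4       # health influences marriage
        is_married = rng.random() < p_marry
        lifespan = rng.gauss(80 if healthy else 70, 5)  # health influences lifespan
        (married if is_married else single).append(lifespan)
    return sum(married) / len(married), sum(single) / len(single)


married_mean, single_mean = simulate()
print(f"married: {married_mean:.1f}, single: {single_mean:.1f}")
```

A learning method given only marital status and lifespan would find the dependency, but nothing in the data says which way, or whether, the causation runs.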

Let us return again to the learning machine and its task of system modeling. The problem encountered by the learning machine is to select a function from the set of functions this machine supports, which best approximates the system’s responses. The learning machine is limited to observing a finite number of samples n in order to make this selection. The finite number of samples, which we call a training data set, is denoted by (Xi, yi), where i = 1, … , n. The quality of an approximation produced by the learning machine is measured by the loss function L(y, f[X, w]), where

y is the output produced by the system,

X is a set of inputs,

f(X, w) is the output produced by the learning machine for a selected approximating function, and

w is the set of parameters in the approximating functions.

L measures the difference between the output produced by the system, yi, and the output produced by the learning machine, f(Xi, w), for every input point Xi. By convention, the loss function is nonnegative, so that large positive values correspond to a poor approximation, and small positive values close to 0 indicate a good approximation. The expected value of the loss is called the risk functional R(w):

$$R(w) = \iint L(y, f(X, w)) \, p(X, y) \, dX \, dy$$

where L(y, f[X, w]) is a loss function and p(X, y) is a probability distribution of samples. The R(w) value, for a selected approximating function, depends only on the set of parameters w. Inductive learning can now be defined as the process of estimating the function f(X, wopt), which minimizes the risk functional R(w) over the set of functions supported by the learning machine, using only the training data set and without knowing the probability distribution p(X, y). With finite data, we cannot expect to find f(X, wopt) exactly, so we denote by f(X, w*opt) the estimate of the optimal solution f(X, wopt) obtained with finite training data using some learning procedure.
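
Because p(X, y) is unknown, R(w) cannot be computed directly. One common workaround, used here as an assumption rather than something stated in the text above, is to replace the expectation with the average loss over the training samples (often called the empirical risk). The sketch below uses a hypothetical linear approximating function, a squared-error loss, and a tiny made-up training set, and it picks the best w from a short list of candidates instead of running a real optimizer.

```python
def f(x, w):
    """Hypothetical approximating function: the line w[0] + w[1] * x."""
    return w[0] + w[1] * x


def squared_loss(y, y_hat):
    """Nonnegative loss: 0 for a perfect fit, larger for poorer fits."""
    return (y - y_hat) ** 2


def empirical_risk(samples, w, loss=squared_loss):
    """Average loss over the finite training set (Xi, yi), i = 1..n."""
    return sum(loss(y, f(x, w)) for x, y in samples) / len(samples)


# Made-up training data generated from y = 1 + 2x, for illustration only.
training = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

# Candidate parameter sets w = (intercept, slope).
candidates = [(0.0, 1.0), (1.0, 2.0), (2.0, 1.0)]

# The estimate is the candidate with the smallest average training loss.
w_star = min(candidates, key=lambda w: empirical_risk(training, w))
print(w_star, empirical_risk(training, w_star))
```

The candidate list stands in for "the set of functions supported by the learning machine"; a practical learning procedure would search a continuous parameter space instead.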

For common learning problems such as classification or regression, the nature of the loss function and the interpretation of the risk functional are different. In a two-class classification problem, where the output of the system takes on only two symbolic values, y ∈ {0, 1}, corresponding to the two classes, a commonly used loss function measures the classification error:

$$L(y, f(X, w)) = \begin{cases} 0 & \text{if } y = f(X, w) \\ 1 & \text{if } y \neq f(X, w) \end{cases}$$
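
A loss that measures classification error can be sketched in a few lines. The threshold classifier and the five labeled samples below are hypothetical, chosen only to make the arithmetic easy to follow; averaging the 0/1 loss over the samples gives the fraction that is misclassified.

```python
def zero_one_loss(y, y_hat):
    """Classification-error loss: 0 when the classes match, 1 otherwise."""
    return 0 if y == y_hat else 1


def classify(x, w):
    """Hypothetical one-parameter classifier: class 1 when x >= w."""
    return 1 if x >= w else 0


def misclassification_rate(samples, w):
    """Average 0/1 loss over the samples, i.e. the fraction misclassified."""
    errors = sum(zero_one_loss(y, classify(x, w)) for x, y in samples)
    return errors / len(samples)


# Made-up (x, class) pairs for illustration only.
samples = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.5, 1)]

print(misclassification_rate(samples, 0.5))  # threshold separates the classes
print(misclassification_rate(samples, 0.7))  # threshold misses two class-1 points
```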

Using this loss function, the risk functional quantifies the probability of misclassification. Inductive learning becomes a problem of finding the classifier function f(X, w), which minimizes the probability of misclassification using only the training
