Data Mining - Mehmed Kantardzic [60]
Let us return again to the learning machine and its task of system modeling. The problem encountered by the learning machine is to select, from the set of functions the machine supports, the function that best approximates the system's responses. The learning machine is limited to observing a finite number of samples n in order to make this selection. This finite set of samples, which we call a training data set, is denoted by (Xi, yi), where i = 1, … , n. The quality of an approximation produced by the learning machine is measured by the loss function L(y, f[X, w]), where
y is the output produced by the system,
X is a set of inputs,
f(X, w) is the output produced by the learning machine for a selected approximating function, and
w is the set of parameters in the approximating functions.
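The objects defined above can be made concrete with a small sketch (an illustration, not code from the text): an approximating function f(X, w) with parameters w, and a nonnegative loss comparing the system output y with the machine's output. Squared loss is used here as one common concrete choice; the function form w0 + w1*x and all values are assumptions for the example.

```python
def f(x, w):
    """Approximating function f(X, w) = w0 + w1*x with parameters w = (w0, w1)."""
    return w[0] + w[1] * x

def squared_loss(y, y_hat):
    """L(y, f[X, w]) = (y - f[X, w])^2: nonnegative, 0 only on an exact match."""
    return (y - y_hat) ** 2

w = (1.0, 2.0)
perfect = squared_loss(5.0, f(2.0, w))  # f(2.0, w) = 5.0, exact match -> 0.0
poor = squared_loss(6.0, f(2.0, w))     # system output differs by 1 -> 1.0
print(perfect, poor)
```

Changing the parameters w changes the machine's outputs and hence the losses; learning is the search over w that makes these losses small.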
L measures the difference between the output produced by the system, yi, and the output produced by the learning machine, f(Xi, w), for every input point Xi. By convention, the loss function is nonnegative: large positive values correspond to a poor approximation, and values close to 0 indicate a good one. The expected value of the loss is called the risk functional R(w):

R(w) = ∫ L(y, f[X, w]) p(X, y) dX dy

where L(y, f[X, w]) is the loss function and p(X, y) is the probability distribution of samples. For a selected approximating function, the value of R(w) depends only on the set of parameters w. Inductive learning can now be defined as the process of estimating the function f(X, wopt) that minimizes the risk functional R(w) over the set of functions supported by the learning machine, using only the training data set and without knowing the probability distribution p(X, y). With finite data we cannot expect to find f(X, wopt) exactly, so we denote by w* the estimate of the parameters of the optimal solution wopt obtained with finite training data using some learning procedure.
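Since p(X, y) is unknown, a learning procedure can only minimize the average loss over the n training samples (the empirical risk) in place of R(w). A minimal sketch of this idea, assuming a linear approximating function f(X, w) = w0 + w1*x, squared loss, and synthetic data standing in for the unknown system:

```python
import numpy as np

# Illustration (not from the text): empirical risk minimization with squared
# loss. The "system" y = 2x + 1 + noise plays the role of the unknown
# distribution p(X, y); we observe only n = 50 samples (x_i, y_i).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, 50)

# Design matrix for the approximating function f(X, w) = w0 + w1*x.
A = np.column_stack([np.ones_like(x), x])

# w* = argmin_w (1/n) * sum_i (y_i - f(x_i, w))^2, solved by least squares.
w_star, *_ = np.linalg.lstsq(A, y, rcond=None)

# The empirical risk: the average loss over the finite training set.
empirical_risk = np.mean((y - A @ w_star) ** 2)
print(w_star, empirical_risk)
```

The recovered parameters w* approximate the system's true coefficients (1.0, 2.0), and the empirical risk approaches the noise variance; how well the empirical minimizer approximates wopt for the true R(w) is exactly the question statistical learning theory addresses.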
For common learning problems such as classification and regression, the nature of the loss function and the interpretation of the risk functional differ. In a two-class classification problem, where the output of the system takes on only two symbolic values, y = {0, 1}, corresponding to the two classes, a commonly used loss function measures the classification error:

L(y, f[X, w]) = 0 if y = f[X, w], and 1 if y ≠ f[X, w]
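A short sketch of this 0/1 loss and the empirical risk it induces (an illustration with made-up labels, not data from the text): averaging the loss over the training samples gives the fraction of misclassified points.

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """0/1 loss per sample: 0 on a correct prediction, 1 on an error."""
    return (np.asarray(y_true) != np.asarray(y_pred)).astype(int)

y_true = np.array([0, 1, 1, 0, 1, 0])  # system outputs y_i for two classes
y_pred = np.array([0, 1, 0, 0, 1, 1])  # learning-machine outputs f(X_i, w)

losses = zero_one_loss(y_true, y_pred)
empirical_risk = losses.mean()  # fraction misclassified: 2 errors out of 6
print(losses, empirical_risk)
```

With this loss, the empirical risk is an estimate of the probability of misclassification, which is precisely the quantity the risk functional measures for classification.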
Using this loss function, the risk functional quantifies the probability of misclassification. Inductive learning then becomes the problem of finding the classifier function f(X, w) that minimizes the probability of misclassification using only the training data set.