Online Book Reader

Home Category

Data Mining - Mehmed Kantardzic [63]

By Root 801 0
At a point n = h where the growth starts to slow down, it is a characteristic of a set of functions. If h is finite, then the G(n) function does not grow linearly for enough large training data sets, and it is bounded by a logarithmic function. If G(n) is only linear, then h → ∞, and no valid generalization through selected approximating functions is possible. The finiteness of h provides necessary and sufficient conditions for the quick convergence of risk functions, consistency of ERM, and potentially good generalization in the inductive-learning process. These requirements place analytic constraints on the ability of the learned model to generalize, expressed through the empirical risk. All theoretical results in the SLT use the VC dimension defined on the set of loss functions. But, it has also been proven that the VC dimension for theoretical loss functions is equal to the VC dimension for approximating functions in typical, inductive-learning tasks such as classification or regression.

Figure 4.5. Behavior of the growth function G(n).

The ERM inductive principle is intended for relatively large data sets, namely, when the ratio n/h is large and the empirical risk converges close to the true risk. However, if n/h is small, namely, when the ratio n/h is less than 20, then a modification of the ERM principle is necessary. The inductive principle called SRM provides a formal mechanism for choosing a model with optimal complexity in finite and small data sets. According to SRM, solving a learning problem with a finite data set requires a priori specification of a structure on a set of approximating functions. For example, a set of functions S1 is a subset of S2, and S2 is a subset of S3. The set of approximating functions S1 has the lowest complexity, but the complexity increases with each new superset S2, S3, … , Sk. A simplified graphical representation of the structure is given in Figure 4.6.

Figure 4.6. Structure of a set of approximating functions.

For a given data set, the optimal model estimation is performed following two steps:

1. selecting an element of a structure having optimal complexity, and

2. estimating the model based on the set of approximating functions defined in a selected element of the structure.

Through these two steps the SRM provides a quantitative characterization of the trade-off between the complexity of approximating functions and the quality of fitting the training data. As the complexity increases (increase of the index k for Sk), the minimum empirical risk decreases, and the quality of fitting the data improves. But estimated true risk, measured through the additional testing data set, has a convex form, and in one moment it moves in a direction opposite that of the empirical risk, as shown in Figure 4.7. The SRM chooses an optimal element of the structure that yields the minimal guaranteed bound on the true risk.

Figure 4.7. Empirical and true risk as a function of h (model complexity).

In practice, to implement the SRM approach, it is necessary to be able to

1. calculate or estimate the VC dimension for any element Sk of the structure, and then

2. minimize the empirical risk for each element of the structure.

For most practical inductive-learning methods that use nonlinear approximating functions, finding the VC dimension analytically is difficult, as is the nonlinear optimization of empirical risk. Therefore, rigorous application of the SRM principle cannot only be difficult but, in many cases, impossible with nonlinear approximations. This does not, however, imply that the SLT is impractical. There are various heuristic procedures that are often used to implement SRM implicitly. Examples of such heuristics include early stopping rules and weight initialization, which are often used in artificial neural networks. These heuristics will be explained together with different learning methods in the following chapters. The choice of an SRM-optimization strategy suitable for a given learning problem depends on the type of approximating functions supported by the learning machine.

Return Main Page Previous Page Next Page

®Online Book Reader