The logistic function is nonlinear and differentiable, allowing the backpropagation algorithm to model classification problems that are linearly inseparable.

We compute the output values, Oj, for each hidden layer, up to and including the output layer, which gives the network's prediction. In practice, it is a good idea to cache (i.e., save) the intermediate output values at each unit as they are required again later when backpropagating the error. This trick can substantially reduce the amount of computation required.
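As a concrete illustration, the forward pass can be sketched in Python as follows. This is a minimal sketch, not the book's code: the network representation (a list of layer sizes plus nested weight and bias lists) and all names are assumptions made for the example.

import math

def logistic(x):
    # Logistic (sigmoid) squashing function: maps any net input into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def forward(layers, weights, biases, x):
    """Propagate input tuple x forward, caching every unit's output Oj.

    layers  : list of unit counts per layer, e.g. [3, 2, 1]
    weights : weights[l][i][j] connects unit i of layer l to unit j of layer l+1
    biases  : biases[l][j] is the bias of unit j in layer l+1
    """
    outputs = [list(x)]                      # layer 0: the input values themselves
    for l in range(len(layers) - 1):
        prev = outputs[-1]
        layer_out = []
        for j in range(layers[l + 1]):
            # Net input Ij = sum_i wij * Oi + theta_j
            net = sum(weights[l][i][j] * prev[i] for i in range(layers[l])) + biases[l][j]
            layer_out.append(logistic(net))  # Oj = 1 / (1 + e^(-Ij))
        outputs.append(layer_out)
    return outputs                           # cached outputs are reused when backpropagating the error

The returned list of per-layer outputs is exactly the cache described above: it is needed again when the error is propagated backward.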

Backpropagate the error: The error is propagated backward by updating the weights and biases to reflect the error of the network's prediction. For a unit j in the output layer, the error Errj is computed by

Errj = Oj(1 − Oj)(Tj − Oj)    (9.6)

where Oj is the actual output of unit j, and Tj is the known target value of the given training tuple. Note that Oj(1 − Oj) is the derivative of the logistic function.

To compute the error of a hidden layer unit j, the weighted sum of the errors of the units connected to unit j in the next layer is considered. The error of a hidden layer unit j is

Errj = Oj(1 − Oj) Σk Errk wjk    (9.7)

where wjk is the weight of the connection from unit j to a unit k in the next higher layer, and Errk is the error of unit k.
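Equations (9.6) and (9.7) translate directly into code. The sketch below assumes the per-layer outputs cached during the forward pass; next_weights[j][k] holds the weight wjk from hidden unit j to unit k in the next higher layer, and all names are illustrative.

def output_layer_errors(outputs, targets):
    # Eq. (9.6): Errj = Oj (1 - Oj)(Tj - Oj) for each unit j in the output layer.
    return [o * (1.0 - o) * (t - o) for o, t in zip(outputs, targets)]

def hidden_layer_errors(hidden_outputs, next_weights, next_errors):
    # Eq. (9.7): Errj = Oj (1 - Oj) * sum_k (Errk * wjk),
    # where k ranges over the units in the next higher layer.
    errors = []
    for j, o in enumerate(hidden_outputs):
        backpropagated = sum(next_errors[k] * next_weights[j][k]
                             for k in range(len(next_errors)))
        errors.append(o * (1.0 - o) * backpropagated)
    return errors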

The weights and biases are updated to reflect the propagated errors. Weights are updated by the following equations, where Δwij is the change in weight wij:

Δwij = (l) Errj Oi    (9.8)

wij = wij + Δwij    (9.9)

“What is l in Eq. (9.8)?” The variable l is the learning rate, a constant typically having a value between 0.0 and 1.0. Backpropagation learns using a gradient descent method to search for a set of weights that fits the training data so as to minimize the mean-squared distance between the network's class prediction and the known target value of the tuples.1 The learning rate helps avoid getting stuck at a local minimum in decision space (i.e., where the weights appear to converge, but are not the optimum solution) and encourages finding the global minimum. If the learning rate is too small, then learning will occur at a very slow pace. If the learning rate is too large, then oscillation between inadequate solutions may occur. A rule of thumb is to set the learning rate to 1/t, where t is the number of iterations through the training set so far.
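Putting Eqs. (9.8) and (9.9) into code is equally direct. In the sketch below, weights[i][j] connects unit i to unit j in the next layer, prev_outputs[i] is the cached Oi, and l is the learning rate; the 1/t schedule is shown only as one possible choice, and every name here is an assumption for the example.

def update_weights(weights, prev_outputs, errors, l):
    # Eq. (9.8): delta_wij = (l) * Errj * Oi
    # Eq. (9.9): wij = wij + delta_wij
    for i in range(len(weights)):
        for j in range(len(errors)):
            weights[i][j] += l * errors[j] * prev_outputs[i]

def learning_rate(t):
    # Rule-of-thumb schedule: l = 1/t, where t counts passes over the training set so far.
    return 1.0 / max(t, 1)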

1A method of gradient descent was also used for training Bayesian belief networks, as described in Section 9.1.2.

Biases are updated by the following equations, where Δθj is the change in bias θj:

Δθj = (l) Errj    (9.10)

θj = θj + Δθj    (9.11)

Note that here we are updating the weights and biases after the presentation of each tuple. This is referred to as case updating. Alternatively, the weight and bias increments could be accumulated in variables, so that the weights and biases are updated after all the tuples in the training set have been presented. This latter strategy is called epoch updating, where one iteration through the training set is an epoch. In theory, the mathematical derivation of backpropagation employs epoch updating, yet in practice, case updating is more common because it tends to yield more accurate results.
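The difference between the two strategies is only where the increments are applied. The rough sketch below contrasts them using the bias update of Eqs. (9.10) and (9.11); it takes the per-tuple error vectors as given, whereas in a real run each tuple's errors would be recomputed with the parameters current at that moment, which is what makes case and epoch updating behave differently.

def case_update_biases(biases, errors_per_tuple, l):
    # Case updating: apply Eqs. (9.10)-(9.11) after the presentation of each tuple.
    for errors in errors_per_tuple:          # one error vector per training tuple
        for j, err in enumerate(errors):
            biases[j] += l * err              # theta_j = theta_j + (l) Errj

def epoch_update_biases(biases, errors_per_tuple, l):
    # Epoch updating: accumulate the increments, apply them once per epoch.
    increments = [0.0] * len(biases)
    for errors in errors_per_tuple:
        for j, err in enumerate(errors):
            increments[j] += l * err
    for j in range(len(biases)):
        biases[j] += increments[j]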

Terminating condition: Training stops when

■ All Δwij in the previous epoch are so small as to be below some specified threshold, or

■ The percentage of tuples misclassified in the previous epoch is below some threshold, or

■ A prespecified number of epochs has expired.

In practice, several hundreds of thousands of epochs may be required before the weights will converge.
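These three conditions are typically checked together at the end of each epoch, as in the outline below. This is only a sketch: train_one_epoch is a hypothetical caller-supplied function that presents every training tuple once and returns the largest weight change and the misclassification rate for that epoch, and the threshold values are illustrative rather than prescribed by the text.

def train_until_converged(train_one_epoch, weight_threshold=1e-4,
                          error_threshold=0.01, max_epochs=500_000):
    """Run epochs until one of the three terminating conditions holds."""
    for epoch in range(1, max_epochs + 1):
        largest_delta_w, misclassified = train_one_epoch(epoch)
        if largest_delta_w < weight_threshold:   # all delta_wij below the threshold
            return epoch
        if misclassified < error_threshold:      # misclassification rate below the threshold
            return epoch
    return max_epochs                            # prespecified number of epochs expired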

“How efficient is backpropagation?” The computational efficiency depends on the time spent training the network. Given |D| tuples and w weights, each epoch requires O(|D| × w) time. However, in the worst-case scenario, the number of epochs can be exponential in n, the number of inputs. In practice, the time required for the networks to converge is highly variable. A number of techniques exist that help speed up the training time. For example, a technique known as simulated annealing can be used, which also ensures convergence to a global optimum.

Sample calculations for learning by the backpropagation algorithm

Figure 9.5 shows a multilayer feed-forward neural network. Let the learning rate be 0.9. The initial

