Online Book Reader

Data Mining - Mehmed Kantardzic [125]

is connected to all the nodes (neurons) in the previous layer. Data flow through the network progresses in a forward direction, from left to right and on a layer-by-layer basis.

Figure 7.10. A graph of a multilayered-perceptron architecture with two hidden layers.

MLPs have been applied successfully to solve some difficult and diverse problems by training the network in a supervised manner with a highly popular algorithm known as the error backpropagation algorithm. This algorithm is based on the error-correction learning rule and it may be viewed as its generalization. Basically, error backpropagation learning consists of two phases performed through the different layers of the network: a forward pass and a backward pass.

In the forward pass, a training sample (input data vector) is applied to the input nodes of the network, and its effect propagates through the network layer by layer. Finally, a set of outputs is produced as the actual response of the network. During the forward phase, the synaptic weights of the network are all fixed. During the backward phase, on the other hand, the weights are all adjusted in accordance with an error-correction rule. Specifically, the actual response of the network is subtracted from a desired (target) response, which is a part of the training sample, to produce an error signal. This error signal is then propagated backward through the network, against the direction of synaptic connections. The synaptic weights are adjusted to make the actual response of the network closer to the desired response.
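The forward pass and the resulting error signal can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than code from the book: the 2-2-1 layer sizes, the weight values, the sample, and the helper names `log_sigmoid` and `forward_pass` are all hypothetical, bias terms are omitted, and a log-sigmoid activation is assumed for every neuron.

```python
import math

def log_sigmoid(v):
    """Log-sigmoid activation: 1 / (1 + e^-v)."""
    return 1.0 / (1.0 + math.exp(-v))

def forward_pass(x, layers):
    """Propagate input vector x through the network layer by layer.

    layers is a list of weight matrices; layers[k][j][i] is the weight from
    node i of the previous layer to neuron j of layer k. The weights are only
    read here, never changed. Returns the outputs of every layer; the last
    entry is the actual response of the network.
    """
    outputs = [list(x)]
    for weights in layers:
        prev = outputs[-1]
        outputs.append([
            log_sigmoid(sum(w_ji * x_i for w_ji, x_i in zip(row, prev)))
            for row in weights
        ])
    return outputs

# Hypothetical 2-2-1 network; weights chosen arbitrarily for illustration.
layers = [
    [[0.5, -0.5], [0.3, 0.8]],  # hidden layer: 2 neurons, 2 inputs each
    [[1.0, -1.0]],              # output layer: 1 neuron, 2 inputs
]
sample_x, sample_d = [1.0, 0.0], [1.0]  # input vector and desired response

activations = forward_pass(sample_x, layers)
y = activations[-1]
# Error signal: desired response minus actual response, per output neuron.
errors = [d_j - y_j for d_j, y_j in zip(sample_d, y)]
print("network response:", y)
print("error signal:", errors)
```

The function returns every layer's outputs, not just the final response, because a backward pass needs the intermediate values when it adjusts the weights.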

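Formalization of the backpropagation algorithm starts with the assumption that an error signal exists at the output of a neuron j at iteration n (i.e., presentation of the nth training sample). This error is defined by

$$e_j(n) = d_j(n) - y_j(n)$$

where $d_j(n)$ is the desired (target) response and $y_j(n)$ is the actual response of neuron j.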

We define the instantaneous value of the error energy for neuron j as $\tfrac{1}{2} e_j^2(n)$. The total error energy for the entire network is obtained by summing the instantaneous values over all neurons in the output layer. These are the only “visible” neurons for which the error signal can be calculated directly. We may thus write

$$E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n)$$

where the set C includes all neurons in the output layer of the network. Let N denote the total number of samples contained in the training set. The average squared error energy is obtained by summing $E(n)$ over all n and then normalizing with respect to the size N, as shown by

$$E_{av} = \frac{1}{N} \sum_{n=1}^{N} E(n)$$

The average error energy $E_{av}$ is a function of all the free parameters of the network. For a given training set, $E_{av}$ represents the cost function as a measure of learning performance. The objective of the learning process is to adjust the free parameters of the network to minimize $E_{av}$. To perform this minimization, the weights are updated on a sample-by-sample basis until one complete presentation of the entire training set (one epoch) has been processed.
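The error measures described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the book: the function names and the sample targets and outputs are made up, and it assumes the error of each output neuron is the desired response minus the actual response.

```python
def instantaneous_error_energy(desired, actual):
    """E(n): half the sum of squared errors over the output-layer neurons."""
    return 0.5 * sum((d_j - y_j) ** 2 for d_j, y_j in zip(desired, actual))

def average_error_energy(desired_set, actual_set):
    """E_av: E(n) summed over all N samples, then divided by N."""
    n_samples = len(desired_set)
    total = sum(
        instantaneous_error_energy(d, y)
        for d, y in zip(desired_set, actual_set)
    )
    return total / n_samples

# Made-up targets and network outputs: N = 3 samples, two output neurons each.
desired_set = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
actual_set = [[0.8, 0.1], [0.3, 0.7], [0.9, 0.6]]

e_av = average_error_energy(desired_set, actual_set)
print("average error energy:", e_av)
```

Because the average depends on the network outputs, and those depend on every weight, changing any single weight changes this cost. That dependence is what the minimization exploits.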

To minimize the function $E_{av}$, we use two additional relations for node-level processing, which were explained earlier in this chapter:

$$v_j(n) = \sum_{i=1}^{m} w_{ji}(n)\, x_i(n)$$

and

$$y_j(n) = \varphi(v_j(n))$$

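where m is the number of inputs for the jth neuron. Also, we use the symbol v as shorthand for the previously defined variable net. The backpropagation algorithm applies a correction $\Delta w_{ji}(n)$ to the synaptic weight $w_{ji}(n)$, which is proportional to the partial derivative $\partial E(n)/\partial w_{ji}(n)$. Using the chain rule of differentiation, this partial derivative can be expressed as

$$\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)} \cdot \frac{\partial e_j(n)}{\partial y_j(n)} \cdot \frac{\partial y_j(n)}{\partial v_j(n)} \cdot \frac{\partial v_j(n)}{\partial w_{ji}(n)}$$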

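The partial derivative $\partial E(n)/\partial w_{ji}(n)$ represents a sensitivity factor, determining the direction of search in weight space. Knowing that the relations

$$\frac{\partial E(n)}{\partial e_j(n)} = e_j(n), \quad \frac{\partial e_j(n)}{\partial y_j(n)} = -1, \quad \frac{\partial y_j(n)}{\partial v_j(n)} = \varphi'(v_j(n)), \quad \frac{\partial v_j(n)}{\partial w_{ji}(n)} = x_i(n)$$

are valid, we can express the partial derivative $\partial E(n)/\partial w_{ji}(n)$ in the form

$$\frac{\partial E(n)}{\partial w_{ji}(n)} = -e_j(n)\, \varphi'(v_j(n))\, x_i(n)$$

Here $x_i(n)$ denotes the ith input to neuron j.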
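The resulting expression, ∂E(n)/∂w_ji(n) = −e_j(n) φ′(v_j(n)) x_i(n), can be checked numerically for a single output neuron. The sketch below is an illustration under stated assumptions, not part of the book: it assumes a log-sigmoid activation, the error defined as desired minus actual response, and hypothetical weights, inputs, and function names. It compares the analytic gradient with a central finite-difference estimate.

```python
import math

def phi(v):
    """Log-sigmoid activation (assumed here for illustration)."""
    return 1.0 / (1.0 + math.exp(-v))

def phi_prime(v):
    """Derivative of the log-sigmoid: phi(v) * (1 - phi(v))."""
    y = phi(v)
    return y * (1.0 - y)

def error_energy(w, x, d):
    """E(n) for a single output neuron with weights w, inputs x, target d."""
    v = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return 0.5 * (d - phi(v)) ** 2

def analytic_gradient(w, x, d):
    """dE/dw_ji = -e_j * phi'(v_j) * x_i, one entry per input i."""
    v = sum(w_i * x_i for w_i, x_i in zip(w, x))
    e = d - phi(v)
    return [-e * phi_prime(v) * x_i for x_i in x]

def numeric_gradient(w, x, d, h=1e-6):
    """Central finite differences, used only as an independent check."""
    grad = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += h
        w_minus[i] -= h
        diff = error_energy(w_plus, x, d) - error_energy(w_minus, x, d)
        grad.append(diff / (2 * h))
    return grad

# Hypothetical weights, inputs, and desired response for one neuron.
w = [0.4, -0.2, 0.1]
x = [1.0, 0.5, -1.5]
d = 1.0

g_analytic = analytic_gradient(w, x, d)
g_numeric = numeric_gradient(w, x, d)
for a, b in zip(g_analytic, g_numeric):
    print(f"analytic {a:+.8f}   numeric {b:+.8f}")
```

If the two columns agree to several decimal places, the chain-rule factors were combined with the correct signs. A mismatch usually points to a dropped minus sign or a wrong activation derivative.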

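The correction $\Delta w_{ji}(n)$ applied to $w_{ji}(n)$ is defined by the delta rule

$$\Delta w_{ji}(n) = -\eta\, \frac{\partial E(n)}{\partial w_{ji}(n)} = \eta\, e_j(n)\, \varphi'(v_j(n))\, x_i(n)$$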

where η is the learning-rate parameter of the backpropagation algorithm. The minus sign accounts for gradient descent in weight space, that is, a direction of weight change that reduces the value of $E(n)$. The need for $\varphi'(v_j(n))$ in the learning process explains why we prefer continuous, differentiable functions such as the log-sigmoid and the hyperbolic tangent as a standard-activation
