Asian Science Citation Index is committed to provide an authoritative, trusted and significant information by the coverage of the most important and influential journals to meet the needs of the global scientific community.  
ASCI Database
308-Lasani Town,
Sargodha Road,
Faisalabad, Pakistan
Fax: +92-41-8815544
Contact Via Web
Suggest a Journal
Journal of Computer Science
Year: 2009  |  Volume: 5  |  Issue: 11  |  Page No.: 849 - 856

A Modified Conjugate Gradient Formula for Back Propagation Neural Network Algorithm

Abbas Y. Al Bayati, Najmaddin A. Sulaiman and Gulnar W. Sadiq    

Abstract: Problem statement: The Conjugate Gradient (CG) algorithm which usually used for solving nonlinear functions is presented and is combined with the modified Back Propagation (BP) algorithm yielding a new fast training multilayer algorithm. Approach: This study consisted of determination of new search directions by exploiting the information calculated by gradient descent as well as the previous search direction. The proposed algorithm improved the training efficiency of BP algorithm by adaptively modifying the initial search direction. Results: Performance of the proposed algorithm was demonstrated by comparing it with the Neural Network (NN) algorithm for the chosen test functions. Conclusion: The numerical results showed that number of iterations required by the proposed algorithm to converge was less than the both standard CG and NN algorithms. The proposed algorithm improved the training efficiency of BP-NN algorithms by adaptively modifying the initial search direction.

Eq. 6 and gradient for gain value by using Eq. 9.

Step 2:Calculate the gradient descent on error with respect to the weights and gains values.

Step 3:Use the gradient weight vector and gradient of gain calculated in step 1 to calculate the new weight vector and vector of new gain values for use in the next epoch.

In general, the objective of a learning process in neural network is to find a weight vector w which minimizes the difference between the actual output and the desired output. Namely:

(1)

Suppose for a particular input pattern o0 and let the input layer is layer 0. The desired output is the teacher pattern t = [t1…tn]T and the actual output is , where L denotes the output layer. Define an error function on that pattern as:

(2)

The overall error on the training set is simply the sum, across patterns, of the pattern error E. Consider a multilayer feed Forward Neural Network (FNN)[1] has one output layer and one input layer with one or more hidden layers. Each layer has a set of units, nodes, or neurons. It is usually assumed that each layer is fully connected with a previous layer without direct connections between layers which are not consecutive. Each connection has a weight. Let be the activation of the ith node of layer s and let be the column vector of the activation values in the layer s and the input layer as layer 0. Let be the weight on the connection from the ith node in layer s -1 to the jth node in layer s and let be the column vector of weights from layer s -1 to the jth node of layer s. The net input to the jth node of layer s is defined as and let be the column vector of the net input values in layers. The activation of a node is given by a function of its net input:

(3)

Where:

f
= Any function with bounded derivative
= A real value called the gain of the node

Note that this activation function becomes the usual logistic activation function if .

By introducing "gain variation" term in this activation function, then only updating formulas for are changed while others are the same as the standard back propagation. To simplify the calculation, taken from the Eq. 2 we then can perform gradient descent on E with respect to. The chain rule yields:

(4)

where, . In particular, the first three factors of (4) indicate that:

(5)

As we noted that the iterative formula (5) for is the same as standard BP[11] except for the appearance of the value gain. By combining (4) and (5) yields the learning rule for weights:

(6)

where, η is a small positive constant called ‘step length’ or ‘learning rate’ and the search direction or gradient vector is . In this approach, at step n is the calculation for gradient of error g(n) is modified by including the variation of gain value to yield:

(7)

The gradient descent on error with respect to the gain can also be calculated by using the chain rule as previously described; it is easy to compute as:

(8)

Then the gradient descent rule for the gain value becomes:

(9)

At the end of each iteration the new gain value is updated using a simple gradient based method as given by the formula:

(10)

For more details of this method and its implementations[12].

CG-type method: In order to overcome the difficulties of the standard delta rule learning technique, where this turning through the steepest descent direction. Which causes the zigzag problematic, where it increases the turning time. To improve the global rate of convergence of the standard delta rule learning technique, we have suggested the new delta rule based on CG type method. The CG method is an iterative optimization approach, the gradient is n-components vector and the gradient direction is called the steepest ascent or descent (the Fletcher-Reeves method tries to exploit the fact that for a quadratic function of n variables, n linear searches along the mutually conjugate directions will locate the minimum). But this direction has a local property and not a global one. Thus any method which makes use of the gradient vector, can be expected to give the minimum point. The steps of CG algorithm are:

Algorithm (CG-based NN-algorithm):

Select the initial weights randomly with small values and set p = 1
Select the next training pair from the training set [Tp] apply the input vector [xp] to the network input and specify the desired output vector. Then calculate the actual outputs of the network by using these two formulae:


Where:

= Summation of multiple weight with inputs for j
= The actual output of the network j by using function
= It represents input patterns, p number of pattern and j is number of net

And:

netj = The result of summation of j node
xi = Input pattern
wij = Weight between node j and i
Calculate the errors between the actual output of the network and the desired output by using these steps

Start with xi. If (i = 1) then d1 = -g1. Obtain (μi) using line search and update the object (x) using the formula: . Increment i by 1 and calculate the gradient vector gn to obtain βi. Update the direction di using the formula and find the corresponding step size (μi) by updating the object (x) to minimize the value of object (x) and adjust the weights of the network in a way that minimizes the error.

When (xi+1) is minimum terminate the iterations; if not, continue to repeat step (3) for each vector in the training set until the total error for the entire set is acceptable low, where di is the update direction of the object at the iteration and di, the initial direction, is chosen to be the steepest descent direction, μi is the step size along the direction di; βi is chosen to insure that the direction vector is conjugate to all previous directions and gi is the gradient of the activation function of the iteration. There is a number of different choices for βi.

In this study, Fletcher and Reeves has been used. It is one of the most commonly used and its formula is:

(11)

The other type of βi is Al Assady and Al Bayati and Polak and Ribere. The formula of βi of these two types are:

(12)

(13)

Following we are going to investigate and develop a new formula for the extended FR formula which is based on the non-quadratic models, this will increase the rate of convergence of the In fact, these quantities are well-known and derived with respect of quadratic models, certainly the FR formula has a standard FRCG-method. Now most of the currently used optimization methods use a local quadratic representation of the objective function. But the use of quadratic model may be inadequate to incorporate all the information so that more general models than quadratic are proposed as a basis for CG algorithms.

It is shown that CG methods with in the numerator of βi has strong global convergence theorems with exact and inexact line searches but has poor performance in practice. On the other hand the CG methods with in the numerator of βi has uncertain global convergence for general non-linear functions, but has good performance in practice. Therefore CG methods has been frequently modified and improved by many authors, for example[14-16] have proposed further modifications of the conjugate gradient methods which are based on some non-quadratic models.

Here we generalize the FRCG by considering more general function than quadratic which we call it as quasi-sigmoid function, our goal is to preserve and increase the convergence properties of FR algorithm and to force its performance in practice.

A generalized FRCG algorithm (based on non-quadratic model): If q(x) is a quadratic function defined by:

(14)

Where:

G = nxn symmetric and positive definite matrix
b = =Constant vector in Rn and c is constant

Then we say that f is Defined as a nonlinear scaling of q(x) if the following conditions hold:

(15)

And:

(16)

The following proportions are immediately derived from the above conditions:

Every contour line of q(x) is a contour line of f
If x• is minimize of q(x) then it's also a minimize of f

In this area there are various published works. The special polynomial case:

(17)

where, ε1, ε2 scalars are has been investigated by[16]. A rational model has been developed by[15] where:

(18)

Another rational model was considered by[14] where:

(19)

A new extended CG-formula: Here we consider the new model function defined by:

(20)

We call it as quasi sigmoid function where q>0 with further assumption that:

(21)

From Eq. 20 we have:

(22)

Return to Eq. 20 and solve it for q assuming that where the remainder . It is clear that and set .

Then:

And:

(23)

Use (21) and (23) to compute Ft using only function values:

(24)

where, assuming

Now define ρi as follows:

(25)

Then we can compute the value of ρi using function value at two points xk+1 and xk from Eq. 24:

(26)

Where:

Our objective is to investigate an extended FRCG method applied to function F(q(x)) obtained by non-linear scaling of q(x). The extended FRCG method can be done by modifying the search directions as follows:

(27)

And for i≥0:

(28)

Where:

(29)

Where:

is the search direction applied to and .

We can show that the original FRCG method and extended FRCG algorithm defined in (27-29) generates the same set of directions and same sequence of points {xi}by using the following theorem:

Theorem 1: Given an identical starting point x0∈Rn.

The method of FRCG applied to f(x) = q(x) and the extended (ERFCG) method defined by (27-29) and applied to f(x) = F(q(x)) with ρi defined by (26) generate identical set of directions and identical sequence of points {xi}.

Proof: The prove is by induction for k = 0 we have:

Suppose: , to prove for i+1

Hence the two methods generates the same set of directions.

Our proposed algorithm known as (NEW) begins the minimization process with an initial estimate w0 and an initial search direction as:

(30)

Then, for every epoch by using our proposed method in Eq. 7 the search direction at (n+1)th iteration is calculated as:

(31)

where the scalar βn is to be determined by the requirement that dn and dn+1 must fulfill the conjugacy property[3] and (ρn ) is defined by Eq. 26.

Outlines of the new algorithm:

Step 1:Initializing the weight vector randomly, the gradient vector g0 = 0 and gain value as one. Let the first search direction d0 = g0. Set β0 = 0, epoch = 1 and n = 1. Let Nt is the number of weight parameters Select a Convergence Tolerance CT.

Step 2:At step n, evaluate gradient vector gn(cn) with respect to gain vector cn and calculate gain vector.

Step 3:Evaluate E(wn). If E(wn)<CT then STOP training ELSE go to step 4.

Step 4:Calculate a new search direction: , where is defined in Eq. 25 for I = n-1.

Step 5:For the first iteration, check if n>1 THEN with the function of gain, update:

ELSE go to step 6

Step 6:If [(epoch +1)/Nt] = 0 THEN ‘restart’ the gradient vector with dn = gn-1(cn-1) ELSE go to step 7.

Step 7:Calculate the optimal value for learning rate η* by using line search technique such as:

Step 8:Update , where dn is a descent direction.

Step 9:Evaluate new gradient vector gn+1(cn+1) with respect to gain value (cn+1).

Step 10:Calculate new search direction:

Step 11:Set n = n+1 and go to step 2.

RESULTS

The performance criterion used in this research focuses on the speed of convergence, measured in number of iterations and CPU time. The test problems used to verify our new proposed algorithm are taken from the open literature by Prechelt[11]. The numerical results have been carried out on a Pentium IV using MATLAB version 7.0. On our test problems, four algorithms have been computed. The first algorithm is the standard CG with Fletcher-Reeves update (traincgf). The other algorithm is a standard CG algorithm computed from the NAG library (CGFR). The third one is the Nawi et al.[12] NN-algorithm computed by using their algorithm with their computer program listed in[12]. Finally, the new proposed ECG based NN algorithm (NEW) which is implanted by modifying NN-algorithm. This algorithm uses a major change in calculating the gradient step by in forcing the value of the new parameter ρi which is given in Eq. 25 of this study. We have noticed that if the value of this parameter is one, then, this new algorithm will coincide with Nawi et al.[12] algorithm but for positive values the new algorithm produces better numerical results because of the use of the non-quadratic model in the derivation of the CG-formula To compare the performance of the proposed algorithm with respect to others: standard optimization algorithms from the (MATLAB NN-toolbox) network parameters such as network size and architecture (number of nodes, hidden layers etc), values for the initial weights and gain parameters were kept same. For our test problem the neural network had one hidden layer with five hidden nodes and sigmoid activation function was used for all nodes. All algorithms were tested using the same initial weights that were initialized randomly from range [0, 1] and received the input patterns for training in the same sequence.

Default values were used for the heuristic parameters, of the above algorithms, unless stated otherwise. For the purpose of comparison, all tested algorithms were fixed with the values of learning rate = 0.3 and momentum term = 0.4. The initial value used for the gain parameter was one. The results of all these four algorithms will be presented as Table 1 and 2 which summarize the performance of the algorithms for simulations that have reached solution. All algorithms were trained with 100 trials, if an algorithm fails to converge, it is considered that it fails to train the FNN, but its epochs, CPU time and generalization accuracy are not included in the statistical analysis of the algorithms.

Test problem (1) Fisher[13]: Iris classification problem: This is a classical classification dataset made famous by Fisher[13], who used it to illustrate principles of discriminate analysis.


Table 1: Comparisons of four different algorithms for test problem (1)
Note: NOE: Number Of Epochs; CPU: Time in seconds per epochs CT: Time of Convergence

Table 2: Comparisons of four different algorithms for test problem (2)

This is perhaps the best-known database to be found in the pattern recognition literature. Fisher's study is a classic in the field and is referenced frequently to this day. The selected architecture of the FNN is (4-5-3-3) with target error was set as (0.01) and the maximum epochs to (2000).

Table 1 shows that the proposed algorithm reached the target error after only about (25) epochs as opposed to the standard CGFR at about (39) epochs and clearly we see that there is an improvement ratio, nearly (3.6), for the number of epochs compare to NN-toolbox and almost (3.681) for the convergence time. The following main computer program used with Nawi et al.[12] routine is:

% main learning program
P = [0 1 2 3 4 5 6 7 8 9 10];
T=[0 1 2 3 4 3 2 1 2 3 4];
net=Newff(minmax(P),[5 1],{'tansig' 'purelin'});
Y=Sim(net,P); plot(P,T,P,Y,'o')
net.trainParam.epochs = 50;
net=Train(net,P,T);
Y=Sim(net,P);

Test problem (2) Fisher[13]: Winconsin breast cancer problem: This dataset was created based on the ‘breast cancer Wisconsin’ problem dataset from[12]. This problem tries to diagnosis of breast cancer by trying to classify a tumor as either benign or malignant based on cell descriptions gathered by microscopic examination. The selected architecture of the FNN is (9-5-2-2). The target error is set as to (0.015) and the maximum epochs to (2000).

In Table 2, it is worth noticing that the performance of the new algorithm since it take only (33) epochs to reach the target error compare to Nawi et al.[12] (39) epochs and to CGFR at about (65) epochs and worst for traincgf that need about (71) epochs to converge. Still the proposed algorithm outperforms others three algorithms with a considerable improvements.

DISCUSSION

In this study, we have introduced neural network algorithm based on several optimization update. The new algorithm is compared with three well known standard CG and NN algorithms using the Iris Classification Problem and Winconsin Breast Cancer Problem. Our numerical results indicate that the new technique has an improvements of about (10-15%) NOE; CPU and CT tools.

CONCLUSION

In this research, new fast learning algorithm for neural networks which are based on CG updates with adaptive gain training algorithm (NEW) is introduced. The proposed algorithm improved the training efficiency of BP-NN algorithms by adaptively modifying the search direction. The initial search direction is modified by introducing the gain value. The proposed algorithm is generic and easy to implement in all commonly used gradient based optimization processes. The simulation results showed that the proposed algorithm is robust and has a potential to significantly enhance the computational efficiency of the training process.

" target="_blank">View Fulltext    |   Related Articles   |   Back
 
 
   
 
 
 
  Related Articles

No Article Found
 
 
 
Copyright   |   Desclaimer   |    Privacy Policy   |   Browsers   |   Accessibility