Eq. 6 and gradient for gain value by using Eq. 9. Step 2:Calculate the gradient descent on error with respect to the weights and gains values. Step 3:Use the gradient weight vector and gradient of gain calculated in step 1 to calculate the new weight vector and vector of new gain values for use in the next epoch. In general, the objective of a learning process in neural network is to find a weight vector w which minimizes the difference between the actual output and the desired output. Namely:
Suppose for a particular input pattern o^{0} and let the input layer
is layer 0. The desired output is the teacher pattern t = [t_{1}…t_{n}]^{T}
and the actual output is ,
where L denotes the output layer. Define an error function on that pattern as:
The overall error on the training set is simply the sum, across patterns, of
the pattern error E. Consider a multilayer feed Forward Neural Network (FNN)^{[1]}
has one output layer and one input layer with one or more hidden layers. Each
layer has a set of units, nodes, or neurons. It is usually assumed that each
layer is fully connected with a previous layer without direct connections between
layers which are not consecutive. Each connection has a weight. Let be
the activation of the i^{th} node of layer s and let be
the column vector of the activation values in the layer s and the input layer
as layer 0. Let be
the weight on the connection from the i^{th} node in layer s 1 to the
j^{th }node in layer s and let be
the column vector of weights from layer s 1 to the j^{th} node of layer
s. The net input to the j^{th} node of layer s is defined as and
let be
the column vector of the net input values in layers. The activation of a node
is given by a function of its net input:
Where:
f 
= 
Any function with bounded derivative 

= 
A real value called the gain of the node 
Note that this activation function becomes the usual logistic activation function
if .
By introducing "gain variation" term in this activation function,
then only updating formulas for are
changed while others are the same as the standard back propagation. To simplify
the calculation, taken from the Eq. 2 we then can perform
gradient descent on E with respect to. The
chain rule yields:
where, .
In particular, the first three factors of (4) indicate that:
As we noted that the iterative formula (5) for is
the same as standard BP^{[11]} except for the appearance of the value
gain. By combining (4) and (5) yields the learning rule for weights:
where, η is a small positive constant called ‘step length’ or ‘learning
rate’ and the search direction or gradient vector is .
In this approach, at step n is the calculation for gradient of error g^{(n)}
is modified by including the variation of gain value to yield:
The gradient descent on error with respect to the gain can also be calculated by using the chain rule as previously described; it is easy to compute as: Then the gradient descent rule for the gain value becomes: At the end of each iteration the new gain value is updated using a simple gradient based method as given by the formula: For more details of this method and its implementations^{[12]}. CGtype method: In order to overcome the difficulties of the standard delta rule learning technique, where this turning through the steepest descent direction. Which causes the zigzag problematic, where it increases the turning time. To improve the global rate of convergence of the standard delta rule learning technique, we have suggested the new delta rule based on CG type method. The CG method is an iterative optimization approach, the gradient is ncomponents vector and the gradient direction is called the steepest ascent or descent (the FletcherReeves method tries to exploit the fact that for a quadratic function of n variables, n linear searches along the mutually conjugate directions will locate the minimum). But this direction has a local property and not a global one. Thus any method which makes use of the gradient vector, can be expected to give the minimum point. The steps of CG algorithm are: Algorithm (CGbased NNalgorithm): • 
Select the initial weights randomly with small values and
set p = 1 
• 
Select the next training pair from the training set [T_{p}]
apply the input vector [x_{p}] to the network input and specify
the desired output vector. Then calculate the actual outputs of the network
by using these two formulae: 
Where:

= 
Summation of multiple weight with inputs for j 

= 
The actual output of the network j by using function 

= 
It represents input patterns, p number of pattern and j is number of net 
And:
net_{j} 
= 
The result of summation of j node 
x_{i} 
= 
Input pattern 
w_{ij} 
= 
Weight between node j and i 
• 
Calculate the errors between the actual output of the network
and the desired output by using these steps 
Start with x_{i}. If (i = 1) then d_{1} = g_{1}. Obtain
(μi) using line search and update the object (x) using the formula: .
Increment i by 1 and calculate the gradient vector g_{n} to obtain β_{i}.
Update the direction d_{i} using the formula and
find the corresponding step size (μ_{i}) by updating the object
(x) to minimize the value of object (x) and adjust the weights of the network
in a way that minimizes the error.
When (x_{i+1}) is minimum terminate the iterations; if not, continue to repeat step (3) for each vector in the training set until the total error for the entire set is acceptable low, where d_{i} is the update direction of the object at the iteration and d_{i}, the initial direction, is chosen to be the steepest descent direction, μi is the step size along the direction d_{i}; β_{i} is chosen to insure that the direction vector is conjugate to all previous directions and g_{i} is the gradient of the activation function of the iteration. There is a number of different choices for β_{i}. In this study, Fletcher and Reeves has been used. It is one of the most commonly used and its formula is: The other type of β_{i} is Al Assady and Al Bayati and Polak and Ribere. The formula of β_{i} of these two types are:
Following we are going to investigate and develop a new formula for the extended FR formula which is based on the nonquadratic models, this will increase the rate of convergence of the In fact, these quantities are wellknown and derived with respect of quadratic models, certainly the FR formula has a standard FRCGmethod. Now most of the currently used optimization methods use a local quadratic representation of the objective function. But the use of quadratic model may be inadequate to incorporate all the information so that more general models than quadratic are proposed as a basis for CG algorithms.
It is shown that CG methods with in
the numerator of β_{i} has strong global convergence theorems with
exact and inexact line searches but has poor performance in practice. On the
other hand the CG methods with in
the numerator of β_{i} has uncertain global convergence for general
nonlinear functions, but has good performance in practice. Therefore CG methods
has been frequently modified and improved by many authors, for example^{[1416]}
have proposed further modifications of the conjugate gradient methods which
are based on some nonquadratic models.
Here we generalize the FRCG by considering more general function than quadratic which we call it as quasisigmoid function, our goal is to preserve and increase the convergence properties of FR algorithm and to force its performance in practice. A generalized FRCG algorithm (based on nonquadratic model): If q(x) is a quadratic function defined by:
Where:
G 
= 
nxn symmetric and positive definite matrix 
b 
= 
=Constant vector in R^{n} and c is constant 
Then we say that f is Defined as a nonlinear scaling of q(x) if the following conditions hold: And: The following proportions are immediately derived from the above conditions: • 
Every contour line of q(x) is a contour line of f 
• 
If x• is minimize of q(x) then it's also a minimize of
f 
In this area there are various published works. The special polynomial case: where, ε_{1}, ε_{2} scalars are has been investigated by^{[16]}. A rational model has been developed by^{[15]} where: Another rational model was considered by^{[14]} where: A new extended CGformula: Here we consider the new model function defined by: We call it as quasi sigmoid function where q>0 with further assumption that: From Eq. 20 we have:
Return to Eq. 20 and solve it for q assuming that where
the remainder .
It is clear that and
set .
Then:
And: Use (21) and (23) to compute F^{t} using only function values:
where, assuming
Now define ρ_{i} as follows: Then we can compute the value of ρ_{i} using function value at two points x_{k+1} and x_{k} from Eq. 24:
Where:
Our objective is to investigate an extended FRCG method applied to function F(q(x)) obtained by nonlinear scaling of q(x). The extended FRCG method can be done by modifying the search directions as follows: And for i≥0: Where: Where:
is
the search direction applied to and
.
We can show that the original FRCG method and extended FRCG algorithm defined in (2729) generates the same set of directions and same sequence of points {x_{i}}by using the following theorem: Theorem 1: Given an identical starting point x_{0∈}R^{n}. The method of FRCG applied to f(x) = q(x) and the extended (ERFCG) method defined by (2729) and applied to f(x) = F(q(x)) with ρ_{i} defined by (26) generate identical set of directions and identical sequence of points {x_{i}}. Proof: The prove is by induction for k = 0 we have:
Suppose: ,
to prove for i+1
Hence the two methods generates the same set of directions. Our proposed algorithm known as (NEW) begins the minimization process with an initial estimate w_{0} and an initial search direction as: Then, for every epoch by using our proposed method in Eq. 7 the search direction at (n+1)^{th }iteration is calculated as: where the scalar β_{n} is to be determined by the requirement that d_{n} and d^{n+1} must fulfill the conjugacy property^{[3]} and (ρ_{n} ) is defined by Eq. 26. Outlines of the new algorithm: Step 1:Initializing the weight vector randomly, the gradient vector g_{0} = 0 and gain value as one. Let the first search direction d_{0} = g_{0}. Set β_{0} = 0, epoch = 1 and n = 1. Let N_{t} is the number of weight parameters Select a Convergence Tolerance CT. Step 2:At step n, evaluate gradient vector g_{n}(c_{n}) with respect to gain vector c_{n} and calculate gain vector. Step 3:Evaluate E(w_{n}). If E(w_{n})<CT then STOP training ELSE go to step 4.
Step 4:Calculate a new search direction: ,
where
is defined in Eq. 25 for I = n1.
Step 5:For the first iteration, check if n>1 THEN with the function of gain, update:
ELSE go to step 6
Step 6:If [(epoch +1)/Nt] = 0 THEN ‘restart’ the gradient vector with d_{n} = g^{n1}(c_{n1}) ELSE go to step 7. Step 7:Calculate the optimal value for learning rate η* by using line search technique such as:
Step 8:Update ,
where d_{n} is a descent direction.
Step 9:Evaluate new gradient vector g_{n+1}(c_{n+1}) with respect to gain value (c_{n+1}). Step 10:Calculate new search direction:
Step 11:Set n = n+1 and go to step 2. RESULTS The performance criterion used in this research focuses on the speed of convergence, measured in number of iterations and CPU time. The test problems used to verify our new proposed algorithm are taken from the open literature by Prechelt^{[11]}. The numerical results have been carried out on a Pentium IV using MATLAB version 7.0. On our test problems, four algorithms have been computed. The first algorithm is the standard CG with FletcherReeves update (traincgf). The other algorithm is a standard CG algorithm computed from the NAG library (CGFR). The third one is the Nawi et al.^{[12]} NNalgorithm computed by using their algorithm with their computer program listed in^{[12]}. Finally, the new proposed ECG based NN algorithm (NEW) which is implanted by modifying NNalgorithm. This algorithm uses a major change in calculating the gradient step by in forcing the value of the new parameter ρ_{i} which is given in Eq. 25 of this study. We have noticed that if the value of this parameter is one, then, this new algorithm will coincide with Nawi et al.^{[12]} algorithm but for positive values the new algorithm produces better numerical results because of the use of the nonquadratic model in the derivation of the CGformula To compare the performance of the proposed algorithm with respect to others: standard optimization algorithms from the (MATLAB NNtoolbox) network parameters such as network size and architecture (number of nodes, hidden layers etc), values for the initial weights and gain parameters were kept same. For our test problem the neural network had one hidden layer with five hidden nodes and sigmoid activation function was used for all nodes. All algorithms were tested using the same initial weights that were initialized randomly from range [0, 1] and received the input patterns for training in the same sequence. Default values were used for the heuristic parameters, of the above algorithms, unless stated otherwise. For the purpose of comparison, all tested algorithms were fixed with the values of learning rate = 0.3 and momentum term = 0.4. The initial value used for the gain parameter was one. The results of all these four algorithms will be presented as Table 1 and 2 which summarize the performance of the algorithms for simulations that have reached solution. All algorithms were trained with 100 trials, if an algorithm fails to converge, it is considered that it fails to train the FNN, but its epochs, CPU time and generalization accuracy are not included in the statistical analysis of the algorithms. Test problem (1) Fisher^{[13]}: Iris classification problem: This is a classical classification dataset made famous by Fisher^{[13]}, who used it to illustrate principles of discriminate analysis.
Table 1: 
Comparisons of four different algorithms for test problem
(1) 

Note: NOE: Number Of Epochs; CPU: Time in seconds per
epochs CT: Time of Convergence 
Table 2: 
Comparisons of four different algorithms for test problem
(2) 

This is perhaps the bestknown database to be found in the pattern recognition literature. Fisher's study is a classic in the field and is referenced frequently to this day. The selected architecture of the FNN is (4533) with target error was set as (0.01) and the maximum epochs to (2000). Table 1 shows that the proposed algorithm reached the target error after only about (25) epochs as opposed to the standard CGFR at about (39) epochs and clearly we see that there is an improvement ratio, nearly (3.6), for the number of epochs compare to NNtoolbox and almost (3.681) for the convergence time. The following main computer program used with Nawi et al.^{[12]} routine is:
% main learning program
P = [0 1 2 3 4 5 6 7 8 9 10];
T=[0 1 2 3 4 3 2 1 2 3 4];
net=Newff(minmax(P),[5 1],{'tansig' 'purelin'});
Y=Sim(net,P); plot(P,T,P,Y,'o')
net.trainParam.epochs = 50;
net=Train(net,P,T);
Y=Sim(net,P);
Test problem (2) Fisher^{[13]}: Winconsin breast cancer problem: This dataset was created based on the ‘breast cancer Wisconsin’ problem dataset from^{[12]}. This problem tries to diagnosis of breast cancer by trying to classify a tumor as either benign or malignant based on cell descriptions gathered by microscopic examination. The selected architecture of the FNN is (9522). The target error is set as to (0.015) and the maximum epochs to (2000). In Table 2, it is worth noticing that the performance of the new algorithm since it take only (33) epochs to reach the target error compare to Nawi et al.^{[12]} (39) epochs and to CGFR at about (65) epochs and worst for traincgf that need about (71) epochs to converge. Still the proposed algorithm outperforms others three algorithms with a considerable improvements. DISCUSSION In this study, we have introduced neural network algorithm based on several optimization update. The new algorithm is compared with three well known standard CG and NN algorithms using the Iris Classification Problem and Winconsin Breast Cancer Problem. Our numerical results indicate that the new technique has an improvements of about (1015%) NOE; CPU and CT tools. CONCLUSION In this research, new fast learning algorithm for neural networks which are based on CG updates with adaptive gain training algorithm (NEW) is introduced. The proposed algorithm improved the training efficiency of BPNN algorithms by adaptively modifying the search direction. The initial search direction is modified by introducing the gain value. The proposed algorithm is generic and easy to implement in all commonly used gradient based optimization processes. The simulation results showed that the proposed algorithm is robust and has a potential to significantly enhance the computational efficiency of the training process. " target="_blank">View Fulltext
 Related
Articles  Back
