
Information Technology Journal

Year: 2008 | Volume: 7 | Issue: 4 | Page No.: 639-646
DOI: 10.3923/itj.2008.639.646
Research on Structure Learning of Product Unit Neural Networks by Particle Swarm Optimization
Xian-Hui Wang, Zheng Qin, Xing-Chen Heng and Yu Liu

Abstract: In this study, we propose a new method for learning the structure of Product Unit Neural Networks (PUNN). The technique is based on the Particle Swarm Optimization (PSO) algorithm: it optimizes the network structure and the weights of the PUNN simultaneously on a standard data set. Moreover, the number of hidden layer units of the PUNN is decided by the training set rather than prefixed by the designer's prior knowledge. The particle encoding scheme is simple and effective. The fitness function considers not only the mean square error between the network output and the desired output, but also the number of hidden layer units, so the resulting network can alleviate over-fitting. Experimental results indicate that the PSPUNN algorithm can achieve a rational architecture for the PUNN from standard data sets and that the resulting networks obtain strong generalization abilities.


How to cite this article
Xian-Hui Wang, Zheng Qin, Xing-Chen Heng and Yu Liu, 2008. Research on Structure Learning of Product Unit Neural Networks by Particle Swarm Optimization. Information Technology Journal, 7: 639-646.

Keywords: structure learning, particle swarm optimization, Product unit neural network and cascade correlation

INTRODUCTION

Before using a neural network to solve a practical problem, the designer usually has to prefix the network structure according to prior knowledge. This is extremely difficult, because the optimal network structure can only be set with a deep understanding of the problem. The structure of a neural network embodies its degree of complexity: on the one hand, an overly simple network cannot fully fit the training data due to its limited flexibility; on the other hand, an overly complex network yields weak generalization ability because its excessive flexibility makes it fit the noise in the training data. Therefore, how to learn and optimize the structure of a neural network from the training data set has become an international research hotspot. Methods for learning and optimizing network structure include growth, pruning and evolutionary algorithms. Fahlman and Lebiere (1990) presented the famous Cascade Correlation (CC) algorithm, a supervised training algorithm that constructs a close-to-minimal network during the training process.

Neurons receive and process signals from other neurons, yield output signals and then send those output signals on to other neurons. Ordinary neurons, also called Summation Units (SUs), sum their input signals, while Product Units (PUs) multiply them. For input signals, product units can form higher-order combinations and increase information capability. Product Unit Neural Networks (PUNN) (Durbin and Rumelhart, 1990; Schmitt, 2001; Ismail and Engelbrecht, 2002) have a strong information storage ability, which means they can represent complex information with a simple network structure. Compared to Summation Unit Neural Networks (SUNN), such as the BP network, a PUNN has the advantage of a simpler network topology and fewer connection weights; at the same time, it shows superior performance in application fields such as pattern classification (Fischer, 2002; Martinez-Estudillo et al., 2006).

The learning algorithm of a PUNN is rather complicated relative to that of a SUNN. Leerink et al. (1995) studied PUNN training algorithms extensively. When the BP algorithm is used to train a PUNN, training encounters many obstacles on account of local minima. Different researchers have adopted different algorithms to train PUNNs: Janson and Frenzel (1993) suggested using a Genetic Algorithm (GA); Ismail and Engelbrecht (2000) trained PUNNs by testing various global optimization algorithms, for instance the Particle Swarm Optimization (PSO) algorithm and the Genetic Algorithm (GA); Van den Bergh and Engelbrecht (2001) proposed the Cooperative Particle Swarm Optimization (CPSO) algorithm. Among these global optimization algorithms, PSO has a relatively better effect in training PUNNs (Engelbrecht and Ismail, 1999). Therefore, in this study we use the PSO algorithm to learn and optimize the PUNN structure and concurrently train the PUNN.

PRODUCT UNIT NEURAL NETWORKS

Durbin and Rumelhart (1990) first used Product Units (PUs) to extend the multilayer perceptron. A Product Unit is different from the common neuron, which is called a Summation Unit (SU). All neurons receive input signals from other neurons, produce output signals and then send the output signals to other neurons. A PU combines its input signals according to $\prod_{i=1}^{N} x_i^{w_i}$, while the form of an SU is $\sum_{i=1}^{N} w_i x_i$. Here, N represents the total number of neurons connected to it, $x_i$ is the i-th input signal of the connected neuron and $w_i$ is the i-th connection weight.

All layers of a Product Unit Neural Network (PUNN) can be constructed from PUs, or from an alternation of PUs and SUs. A PUNN may have several hidden layers. A common PUNN topology is a three-layer feed-forward network. Figure 1 shows a typical three-layer feed-forward PUNN, where the hidden layer consists of PUs and the output layer consists of SUs.

Suppose the PUNN's input layer has D nodes, the hidden layer has M nodes and the output layer has C nodes; suppose the connection weight vectors between input and hidden layer nodes are u and the connection weight vectors between hidden and output layer nodes are w. If the p-th input pattern is applied to the input nodes, the k-th output node produces the output $y_{k,p}$ (with a bias $\theta_k$). We can calculate $y_{k,p}$ by the following formula:

$$y_{k,p} = \sum_{j=1}^{M} w_{kj} \left( \prod_{i=1}^{D} x_{i,p}^{u_{ji}} \right) + \theta_k \qquad (1)$$

Fig. 1: The architecture of product unit neural networks
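To make Eq. 1 concrete, the following is a minimal sketch of the forward pass of such a three-layer PUNN (D inputs, M product units in the hidden layer, C summation units in the output layer). The function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def punn_forward(x, U, W, theta):
    """Forward pass of a three-layer PUNN (Eq. 1), illustrative sketch.

    x     : (D,)   input pattern (assumed positive, e.g. normalized to (1, 3))
    U     : (M, D) weights between input and hidden (product) units
    W     : (C, M) weights between hidden and output (summation) units
    theta : (C,)   output biases
    """
    # Each product unit j computes prod_i x_i ** u_ji
    hidden = np.prod(x[np.newaxis, :] ** U, axis=1)   # shape (M,)
    # Each summation output k computes sum_j w_kj * hidden_j + theta_k
    return W @ hidden + theta                          # shape (C,)

# Tiny usage example with made-up numbers
y = punn_forward(np.array([1.5, 2.0]), np.ones((3, 2)), np.ones((2, 3)), np.zeros(2))
```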

PARTICLE SWARM OPTIMIZATION

Particle Swarm Optimization (PSO), a population-based evolutionary computation technique inspired by the simulation of social behavior, was first introduced by Kennedy and Eberhart (1995). Since it is simple and effective, it has been successfully applied to many fields such as non-linear optimization, neural networks and pattern recognition. PSO is a population-based iterative search algorithm. A swarm consists of N particles moving around in a D-dimensional search space. Each particle adjusts its flight toward a promising area according to its own flying experience and the social information shared among particles. The position of the i-th particle at iteration t is represented by Xi(t) = (xi1, xi2,...,xiD), which is used to evaluate the quality of the particle. During the search the particle successively adjusts its position toward the global optimum according to two factors: the best position encountered by itself (pbest), denoted Pi = (pi1, pi2,...,piD), and the best position encountered by the whole swarm (gbest), denoted Pg = (pg1, pg2,...,pgD). Its velocity at iteration t is represented by Vi(t) = (vi1, vi2,...,viD). The position at the next iteration is calculated according to the following equations:

$$v_{id}(t+1) = \lambda \left[ w \, v_{id}(t) + c_1 \, rand() \, \big(p_{id} - x_{id}(t)\big) + c_2 \, rand() \, \big(p_{gd} - x_{id}(t)\big) \right] \qquad (2)$$

$$x_{id}(t+1) = x_{id}(t) + v_{id}(t+1) \qquad (3)$$

where c1 and c2 are two positive constants, called the cognitive learning rate and the social learning rate, respectively; rand() is a random function in the range [0,1]; w is the inertia factor (Shi and Eberhart, 1998) and λ is the constriction factor (Clerc, 1999). In addition, the velocities of the particles are confined within [Vmin, Vmax]D. If an element of the velocity exceeds the threshold Vmin or Vmax, it is set equal to the corresponding threshold. The pseudo code of PSO is as follows:

Begin PSO
End PSO
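Since the body of the pseudo code is not reproduced above, the following is a minimal Python sketch of the canonical PSO loop described by Eq. 2 and 3, including the inertia factor, constriction factor and velocity clamping. The objective, initialization ranges and default parameter values are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def pso(fitness, dim, n_particles=10, iters=2000,
        w=0.9, c1=2.0, c2=2.0, lam=1.0, v_min=-2.0, v_max=2.0):
    """Minimal PSO sketch following Eq. 2-3; minimizes `fitness`."""
    X = np.random.uniform(-1, 1, (n_particles, dim))          # positions
    V = np.random.uniform(v_min, v_max, (n_particles, dim))   # velocities
    P = X.copy()                                               # personal bests (pbest)
    p_fit = np.array([fitness(x) for x in X])
    g = P[np.argmin(p_fit)].copy()                             # global best (gbest)
    for t in range(iters):
        r1 = np.random.rand(n_particles, dim)
        r2 = np.random.rand(n_particles, dim)
        V = lam * (w * V + c1 * r1 * (P - X) + c2 * r2 * (g - X))   # Eq. 2
        V = np.clip(V, v_min, v_max)                           # confine velocities
        X = X + V                                              # Eq. 3
        f = np.array([fitness(x) for x in X])
        improved = f < p_fit
        P[improved], p_fit[improved] = X[improved], f[improved]
        g = P[np.argmin(p_fit)].copy()
    return g, p_fit.min()
```

A linearly decreasing inertia factor (0.9 to 0.4, as used in the experiments later in the paper) could be substituted for the constant w above.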

PSPUNN ALGORITHM

Hornik et al. (1989) proved that any measurable function can be approximated to any desired degree of accuracy by a three-layer network with only one hidden layer, provided sufficient hidden units are available. Therefore, the PUNN structure adopts the three-layer feed-forward network shown in Fig. 1. The hidden layer consists of PUs and the output layer consists of SUs. Suppose all layers use linear activation functions. From the PU formula $\prod_{i=1}^{N} x_i^{w_i}$ we know that if $x_i$ is negative and $w_i$ is not an integer, the result has an imaginary part. In order to avoid this, the input $x_i$ is normalized by the following formula.

$$x_i' = \frac{2\,(x_i - x_i^{\min})}{x_i^{\max} - x_i^{\min}} - 1 + 2 \qquad (4)$$

Because $0^{-1/2}$ is undefined, a bias greater than or equal to 2 is added; we adopt 2, so after preprocessing the normalized value of $x_i$ lies in (1, 3).
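The following is a minimal sketch of this preprocessing as I read Eq. 4: each feature is scaled to (-1, 1) and then shifted by the bias of 2 so that all values lie in (1, 3). The exact scaling form is an assumption inferred from the text.

```python
import numpy as np

def normalize_inputs(X, bias=2.0):
    """Scale each feature column of X to [-1, 1] and add `bias`, giving values in [1, 3].

    Assumes the per-feature min/max scaling implied by Eq. 4.
    """
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    scaled = 2.0 * (X - x_min) / (x_max - x_min) - 1.0   # in [-1, 1]
    return scaled + bias                                  # in [1, 3]
```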

In a neural network, the number of hidden layer units determines its computational complexity. The task of training is therefore to determine three groups of parameters: the weights connecting the input and hidden layers (denoted W_IH), the number of hidden layer units (denoted Num_H) and the weights connecting the hidden and output layers (denoted W_HO). Since the output layer consists of SUs with a linear activation function, determining W_HO can be formulated as solving a system of linear algebraic equations once the outputs of the hidden layer are given. Thus, only Num_H and the weights W_IH are encoded into a particle for stochastic searching. In each iteration, each particle determines the weights W_IH and Num_H. The outputs of the hidden layer are then computed and, from them and the expected target patterns, W_HO is obtained by Singular Value Decomposition (SVD). In this study, we formulate neural network training as an optimization problem and then employ the PSO algorithm to solve it. When adopting PSO, we must tackle two issues: the representation of a particle and the design of the fitness function.
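Because the output layer is linear, W_HO is the solution of a linear least-squares problem once the hidden-layer outputs H and target patterns T are known. The following is a minimal sketch of that step using an SVD-based solver (NumPy's lstsq is SVD-based internally); the matrix shapes and bias handling are illustrative assumptions.

```python
import numpy as np

def solve_output_weights(H, T):
    """Solve the linear output-layer weights W_HO (and biases) by least squares / SVD.

    H : (P, M) hidden-layer outputs for the P training patterns
    T : (P, C) desired target patterns
    Returns an array of shape (M + 1, C); the last row holds the output biases.
    """
    H_aug = np.hstack([H, np.ones((H.shape[0], 1))])   # append a bias column
    W_HO, *_ = np.linalg.lstsq(H_aug, T, rcond=None)   # SVD-based solution
    return W_HO
```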

The encoding representation of a single particle: HUniti denotes the weight vector that connects the i-th hidden unit to the input layer, where the number of elements of this vector equals the number of input units. Flagi indicates whether or not the i-th hidden unit is involved in the network, i.e., whether or not HUniti is included in the calculation. If Flagi ≥ 0, the i-th hidden unit is included in the network; otherwise the i-th hidden unit is removed from the network (Fig. 2).

Fig. 2: Structure of a particle

In this way, particles can be interpreted as networks with various numbers of hidden units even though every particle has a fixed length. Different particles thus correspond to networks with different numbers of hidden units.
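The following is a minimal sketch of how such a fixed-length particle (Fig. 2) might be decoded into a variable-size network: the first Nmaxhidden entries are taken as the Flag values and the remainder as the HUnit weight vectors, and units with Flag < 0 are dropped. The exact memory layout within a particle is an assumption for illustration.

```python
import numpy as np

def decode_particle(particle, n_inputs, max_hidden):
    """Decode a particle into (active hidden-unit weights, number of hidden units).

    Layout assumed: [Flag_1..Flag_maxH | HUnit_1 | ... | HUnit_maxH],
    where each HUnit_i holds n_inputs weights (see Fig. 2).
    """
    flags = particle[:max_hidden]
    weights = particle[max_hidden:].reshape(max_hidden, n_inputs)  # W_IH candidates
    active = flags >= 0                       # Flag_i >= 0 keeps hidden unit i
    return weights[active], int(active.sum())

# A particle for a network with up to 3 hidden units and 2 inputs
p = np.array([0.4, -1.2, 0.7,  0.1, 0.9,  0.3, 0.2,  -0.5, 0.8])
W_IH, num_h = decode_particle(p, n_inputs=2, max_hidden=3)   # num_h == 2
```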

This is a hybrid representation in which the first part consists of discrete variables and the second part of continuous ones. It is well known that PSO performs very well on continuous variables. Kennedy and Eberhart (1997) proposed a method to handle discrete variables with a small modification of the original PSO. The only difference is in the position update equation, which is modified to:

$$x_{id} = \begin{cases} 1 & \text{if } rand() < S(v_{id}) \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

where S is a sigmoid function and rand() is a random function in the range (0,1); Vmin and Vmax are usually set to -6 and 6. Here, the velocity takes on a new meaning: each vid represents the probability of bit xid taking the value 1. Suppose the previous best position pid is 1 in some bit; then (pid - xid) may be 1 or 0. When xid is the same as pid (xid = pid = 1), (pid - xid) equals 0, which contributes nothing to the change of velocity according to formula 2. When xid differs from pid (xid = 0, pid = 1), (pid - xid) equals 1, which increases the velocity. On the other hand, if the previous best position pid is 0 and xid differs from pid, the velocity is decreased, which increases the probability of xid being 0 at the next step. In a word, formulas 2 and 5 increase the probability that each bit matches the corresponding bit of the best position. Therefore, the essence of discrete PSO is the same as that of continuous PSO: both drive the particles to gather toward the best particle. If we incorporate the above discrete PSO into neural network training, a particle must be updated by two different formulas at each iteration: the first part is updated by formula 5 and the second part by formula 3, which complicates the algorithm.
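For reference, the following is a minimal sketch of Kennedy and Eberhart's binary position update (Eq. 5): the velocity is clamped, squashed by a sigmoid and interpreted as the probability that the corresponding bit becomes 1. The function name is illustrative.

```python
import numpy as np

def binary_position_update(v, v_min=-6.0, v_max=6.0):
    """Binary PSO position update (Eq. 5): x_id = 1 with probability S(v_id)."""
    v = np.clip(v, v_min, v_max)
    prob_one = 1.0 / (1.0 + np.exp(-v))           # sigmoid S(v_id)
    return (np.random.rand(*v.shape) < prob_one).astype(int)
```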

Usually, an evolutionary algorithm uses a consistent encoding method, that is, either all discrete variables or all continuous variables. Since discrete and continuous PSO have the same essence, we still use formulas 2 and 3 as the velocity and position update equations, respectively, to handle the discrete problem. When the fitness function is evaluated, xid is given a different interpretation:

mid = hardlim (xid)

The hardlim function is defined as follows:

$$hardlim(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$

mid is used to calculate the fitness according to the problem to be solved, while xid remains a continuous variable. The following experiment demonstrates the validity of our proposal. The discrete test problem (Kennedy and Eberhart, 1997) is used to compare the performance of our proposal with that of Eberhart's discrete PSO algorithm.

$$f(u) = \sum_{i=1}^{3} u_i^2, \qquad u_i \in [-5.12, 5.11] \qquad (7)$$

Each ui is represented by 10 binary bits. Since there are three ui, the dimension of a particle is 30. Every 10 binary bits are transformed into a real value. For example, for the 10-bit string s = 1010001111, ui is 1.4300.

The transform function is as follows:

$$u_i = \frac{decimal(s) - 512}{100} \qquad (8)$$

where the decimal function converts a binary number into a decimal number. In fitness evaluation, the binary bits mid are first formed from xid according to formula 6 and then transformed according to formula 8. The resulting values are substituted into formula 7 to calculate the fitness.
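Tying formulas 6-8 together, the following is a minimal sketch of the evaluation pipeline for this test problem: the continuous particle coordinates are thresholded by hardlim, each 10-bit group is converted to a real value and the test function is evaluated. The test function here is assumed to be De Jong's sphere over the three u_i, which is consistent with the 10-bit encoding and the 1.43 example, but that exact form is my assumption.

```python
import numpy as np

def hardlim(x):
    """Formula 6: 1 if x >= 0, else 0."""
    return (np.asarray(x) >= 0).astype(int)

def bits_to_real(bits):
    """Formula 8: convert a 10-bit group to a real value in roughly [-5.12, 5.11]."""
    decimal = int("".join(map(str, bits)), 2)
    return (decimal - 512) / 100.0

def discrete_fitness(x):
    """Formula 7 (assumed De Jong sphere): f(u) = sum(u_i^2) over three 10-bit u_i."""
    bits = hardlim(x)                                   # 30 continuous coords -> 30 bits
    u = [bits_to_real(bits[i:i + 10]) for i in range(0, 30, 10)]
    return sum(ui ** 2 for ui in u)

# The 10-bit string 1010001111 decodes to (655 - 512) / 100 = 1.43, as in the text
assert abs(bits_to_real([1, 0, 1, 0, 0, 0, 1, 1, 1, 1]) - 1.43) < 1e-9
```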

As shown in Fig. 3, our proposal can effectively handle discrete problems. Furthermore, we conducted many experiments with different values of Vmin and Vmax. The experimental results show that Vmin and Vmax have no special requirements and their values do not influence performance, so the first part of a particle can be assigned the same values as the second part. Therefore, with our proposal the two parts of a particle are treated consistently: the training algorithm can use the same velocity and position update equations and the same parameters, which is simple and effective.

The design of the fitness function: In our training algorithm, Flagi has a new meaning: Flagi ≥ 0 means that the i-th hidden unit is included in the network, otherwise the i-th hidden unit is removed from the network. According to the following formula, the resulting network is measured on the training set, where PN input-target patterns are given.

Fig. 3: The fitness against iterations

$$F = \frac{1}{PN} \sum_{p=1}^{PN} (t_p - o_p)^2 + k \cdot \frac{N_{hidden}}{N_{maxhidden}} \qquad (9)$$

where tp and op are the desired output and the network output for pattern p, respectively; k is a constant; Nhidden is the number of hidden units involved in the network and Nmaxhidden is the predefined maximum possible number of hidden units. The above formula takes into account both approximation accuracy and the scale of the network, so the resulting PUNNs can alleviate over-fitting. The parameter k balances approximation accuracy against structural complexity. The value of k is decided during the experiments according to prior knowledge of the problem; in the experiments of this paper, k = 2 proved suitable.
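The following is a minimal sketch of this fitness as I read Eq. 9: the mean squared error over all patterns plus a penalty proportional to the fraction of hidden units switched on, weighted by k. The exact combination of the two terms is my assumption from the text.

```python
import numpy as np

def pspunn_fitness(outputs, targets, n_hidden, n_max_hidden, k=2.0):
    """Assumed form of the fitness in Eq. 9: MSE plus a structural-size penalty.

    outputs, targets : (P, C) network outputs and desired outputs
    n_hidden         : number of hidden units the particle switches on
    n_max_hidden     : predefined maximum number of hidden units
    k                : trade-off between accuracy and structural complexity
    """
    mse = np.mean((targets - outputs) ** 2)
    penalty = k * n_hidden / n_max_hidden
    return mse + penalty            # lower is better
```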

EXPERIMENTAL SET-UP

The data sets used in this section were obtained from the UCI repository of machine learning databases (Blake and Merz, 1998). On every standard data set we run experiments with three algorithms. Since all three algorithms use three-layer neural networks, we evaluate them with the following quality indices: (a) the number of hidden layer units, which reflects the structural complexity of the neural network and (b) the generalization ability of the neural network, i.e., the correct classification rate on the test portion of the standard data sets. Generalization ability is an important technical index for judging the quality of a neural network.

Three different algorithms:

Algorithm 1: PSPUNN: This is the algorithm proposed by the authors, a training algorithm that uses PSO to allocate both the structure and the weights of the PUNN. PSPUNN is a neural network structure optimization algorithm; here Nmaxhidden represents the maximum possible number of hidden layer units and Nhidden represents the final number of hidden layer units obtained by PSPUNN.

Algorithm 2: PSO-fixation: This is a training algorithm that uses PSO to optimize only the weights of the PUNN. Here, the structure of the PUNN is prefixed by the designer's prior knowledge. Since the PUNN structure used here is the three-layer feed-forward network of Fig. 1, it suffices to fix the number of hidden layer units. We run two fixed-structure experiments: PSO-FHmax and PSO-FHmin. PSO-FHmax uses a PUNN with Nmaxhidden hidden units; PSO-FHmin uses a PUNN with Nhidden hidden units.

Algorithm 3: Cascade correlation algorithm: The CC (Cascade Correlation) algorithm is a famous classical neural network structure optimization algorithm. In this paper, it is used to allocate the structure and weights of a PUNN based on the standard data sets. The neural network inferred by the CC algorithm is called the CC network. The CC network can be seen as a modified three-layer feed-forward neural network: compared with a common three-layer feed-forward network, it has connections among the hidden layer nodes. CC hidden represents the final number of hidden layer units obtained by the CC algorithm.

In the PUNN, from the PU formula $\prod_{i=1}^{N} x_i^{w_i}$ we know that if $x_i$ is negative and $w_i$ is not an integer, the result has an imaginary part. In order to avoid this, all the experimental data sets in the paper are standardized by formula 4. The experiments use 4 standard data sets (Table 1).

The PSO parameters are set as follows: the inertia factor w decreases linearly from 0.9 to 0.4, learning rates c1 = c2 = 2, Vmin = -2, Vmax = 2, the maximum number of iterations is 2000 and the population size is 10. Nmaxhidden represents the maximum possible number of hidden layer neurons; Nhidden represents the final number of hidden layer units obtained by PSPUNN.

The CC algorithm's parameters are set as follows: the input layer uses a linear activation function, the hidden layer uses the sigmoid function and the output layer uses a linear function. The maximum number of hidden layer units is MaxHUnit_CC = 40.

Table 1: UCI experiment data sets information

The condition that terminates the growth of the CC algorithm's hidden layer is as follows: on the training set of the corresponding data set, if the correct classification rate of CC is greater than or equal to that of PSPUNN, or CC hidden = MaxHUnit_CC, the CC algorithm stops adding hidden layer units. The other experimental parameters use CC's default settings (Fahlman and Lebiere, 1990).

In this study, we use 3-fold cross-validation for all experiments, so every experiment is run 3 times. First, every data set is divided into three parts; in each run, one part serves as the test set and the other two parts as the training set, and each run selects a different part as the test set. Train Accuracy and Test Accuracy denote the average correct classification rate over the 3 runs on the training set and the test set, respectively; for PSPUNN and CC, the number of hidden layer units is the average over the 3 runs, rounded down to an integer. The computer resources are as follows: an Intel Celeron 2.4 GHz CPU with 512 MB RAM.
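The following is a minimal sketch of this 3-fold protocol: each run uses one third of the data for testing and two thirds for training, and the accuracies are averaged over the three runs. The train_fn and test_fn callables are placeholders standing in for the actual PSPUNN, PSO-Fixation or CC training and evaluation code.

```python
import numpy as np

def three_fold_accuracy(X, y, train_fn, test_fn, seed=0):
    """Average train/test accuracy over the 3-fold protocol described above.

    train_fn(X_tr, y_tr) -> model; test_fn(model, X, y) -> accuracy.
    X, y are NumPy arrays; both callables are placeholders for the real algorithms.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, 3)
    train_acc, test_acc = [], []
    for i in range(3):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(3) if j != i])
        model = train_fn(X[train_idx], y[train_idx])
        train_acc.append(test_fn(model, X[train_idx], y[train_idx]))
        test_acc.append(test_fn(model, X[test_idx], y[test_idx]))
    return np.mean(train_acc), np.mean(test_acc)
```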

EXPERIMENTAL RESULTS

Comparison of PSPUNN and PSO-fixation: Tables 2 and 3 show that PSPUNN can automatically allocate a reasonable network structure and that the resulting network has considerably higher prediction accuracy on the test set. Figure 4 compares the results of PSPUNN and PSO-Fixation on the different data sets. In contrast, the PUNN trained by PSO-FHmax, because of the higher structural complexity of the network, has poor accuracy on the test set. The PUNN trained by PSO-FHmin, whose number of hidden layer neurons is set to the value Nhidden obtained by PSPUNN, has better generalization ability than that of PSO-FHmax.

Table 2: The result of PSPUNN algorithm

Table 3: The result of PSO-Fixation algorithm (Including PSO-FHmax and PSO-FHmin)


Fig. 4: Comparison of the results of PSPUNN and PSO-Fixation on different data sets: (a) Iris, (b) Wine, (c) New-thyroid and (d) Page-blocks




Fig. 5: Comparison of the results of PSPUNN and CC on different data sets: (a) Iris, (b) Wine, (c) New-thyroid and (d) Page-blocks

Fig. 6: The fitness values against iterations

Fig. 7: The No. of hidden units against iterations

This demonstrates that PSPUNN has the ability to realize a rational structural allocation of the PUNN automatically and dynamically. The above experiments show that PSPUNN is a simple and effective algorithm and that the network it infers has stronger generalization ability.

Figure 6 shows the fitness evolution curve when PSPUNN trained PUNNs on the Wine data set. Figure 7 shows the corresponding average number of hidden units of the PUNNs formed according to the particles at each iteration. As shown in Fig. 7, in early iterations the structures of the PUNNs were intensively explored; the structures gradually became stable along the iterations, and in later iterations the algorithm focused on adjusting the weights.

Comparison of PSPUNN and CC: PSPUNN is the algorithm proposed in this study, which learns and optimizes the structure of the neural network, while CC (the Cascade Correlation algorithm) is a famous classical algorithm for optimizing neural network structure.

Table 4: The result of CC algorithm

Tables 2 and 4 show that PSPUNN has a better ability to optimize the neural network structure than CC across the tests on the four data sets. Figure 5 compares the results of PSPUNN and CC on the different data sets. On the standard data sets, PSPUNN automatically builds a smaller network structure than CC (comparing the number of hidden layer units) and the resulting network also shows better generalization ability.

CONCLUSIONS

In this research, an approach (PSPUNN) to learning the structure of a PUNN is proposed; it can configure the architecture and weights of the PUNN simultaneously, depending only on the training set and using PSO. During the PSO iterations, every particle determines the weights connecting the input and hidden layers and the number of hidden layer units, and then the weights connecting the hidden and output layers are obtained by Singular Value Decomposition (SVD); consequently, a PUNN is constructed. The fitness function takes into account not only the mean square error between the network output and the desired output, but also the number of hidden units; therefore, the resulting network can alleviate over-fitting. Experimental comparison among the PSPUNN, PSO-Fixation and CC algorithms shows that the PSPUNN algorithm can automatically allocate a smaller network structure than the CC algorithm and that the resulting networks obtain strong generalization abilities.

ACKNOWLEDGMENTS

This research was funded by the National Basic Research 973 Program of China (No. 2004CB719401) and also supported by the National Defense 11th-Five-Year Preliminary Research fund.

REFERENCES

  • Blake, C.L. and C.J. Merz, 1998. UCI Repository of Machine Learning Databases. 1st Edn., University of California, Irvine, CA


  • Clerc, M., 1999. The swarm and the queen: Towards a deterministic and adaptive particle swarm optimization. Proceedings of the Congress on Evolutionary Computation, July 6-9, 1999, Washington, DC., USA., pp: 1951-1957.


  • Durbin, R. and D.E. Rumelhart, 1990. Product Units: A Computationally Powerful and Biologically Plausible Extension to Backpropagation Networks. Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufmann Publishers, 1: 133-142.


  • Engelbrecht, A.P. and A. Ismail, 1999. Training product unit neural networks. Stability Control Theor. Appl., 2: 59-74.


  • Fahlman, S.E. and C. Lebiere, 1990. The Cascade-Correlation Learning Architecture. In: Advances in Neural Information Systems II, Touretzky, D.S. (Ed.). Morgan Kaufmann Publishers, Morgan, pp: 524-532


  • Fischer, M.M., 2002. A novel modular product unit neural network for modeling constrained spatial interaction flows. Proceedings of the Congress on Evolutionary Computation, Volume 2, May 12-17, 2002, Honolulu, HI., pp: 1215-1220.


  • Hornik, K., M. Stinchcombe and H. White, 1989. Multilayer feedforward networks are universal approximators. Neural Networks, 2: 359-366.


  • Ismail, A. and A.P. Engelbrecht, 2000. Global optimization algorithms for training product unit neural networks. Proceedings of the International Joint Conference Neural Networks, Volume 1, July 24-27, 2000, Como, Italy, pp: 132-137.


  • Ismail, A. and A.P. Engelbrecht, 2002. Pruning product unit neural networks. Proceedings of the International Joint Conference on Neural Networks, May 12-17, 2002, IEEE Press, Piscataway, NJ., USA., pp: 257-262.


  • Janson, D.J. and J.F. Frenzel, 1993. Training product unit neural networks with genetic algorithms. IEEE Expert, 8: 26-33.


  • Kennedy, J. and R. Eberhart, 1995. Particle swarm optimization. Proceedings of the International Conference on Neural Networks, Volume 4, November 27-December 1, 1995, Perth, WA., USA., pp: 1942-1948.


  • Kennedy, J. and R. Eberhart, 1997. A discrete binary version of the particle swarm algorithm. IEEE Int. Conf. Syst. Man Cybernet., 5: 4104-4108.


  • Leerink, L.R., C.L. Giles, B.G. Horne and M.A. Jabri, 1995. Learning with product units. Adv. Neural Proc. Syst., 7: 537-544.


  • Martinez-Estudillo, F.J. and C. Hervas-Martinez, 2006. Evolutionary product-unit neural networks for classification. Lecture Notes Comput. Sci., 4224: 1320-1328.


  • Schmitt, M., 2001. Product unit neural networks with constant depth and superlinear VC dimension. LNCS., 21: 253-258.


  • Shi, Y. and R. Eberhart, 1998. A modified particle swarm optimizer. Proceedings of the World Congress on Computational Intelligence and IEEE International Conference on Evolutionary Computation, May 4-9, 1998, Anchorage, AK., pp: 69-73.


  • Van den Bergh, F. and A.P. Engelbrecht, 2001. Training product unit networks using cooperative particle swarm optimisers. Neural Networks, 1: 126-131.
