INTRODUCTION
Combinatorial optimization problems are expressed by a cost function, with or
without constraints, to be minimized or maximized over a finite or countable
definition set. Metaheuristics provide approximate solutions to large problem
instances but require intensive scientific computation, and their resolution is
limited by the capacities of the available resources. Suitable distributed parallel
models therefore decrease the computing time and improve the quality of the obtained
solutions. Exploiting several workstations is also an opportunity for parallel
computing, but it poses new problems (resource heterogeneity, node failure).
These new challenges are overcome with the use of high-performance clusters
and grid computing. Grid computing is an emerging computing model that provides
the ability to perform higher-throughput computing by taking advantage of many
networked computers, modelling a virtual computer architecture that is able to
distribute process execution across a parallel infrastructure. Grid computing
is often confused with cluster computing. The key difference is that a cluster
is a single set of compute nodes sitting in one location, while a grid is composed
of many clusters and other kinds of resources (Parsopoulos
et al., 2004).
In this study, we analyse the performance of a parallel distributed model of a
bio-inspired algorithm: the Particle Swarm Optimization (PSO) metaheuristic.
The application solves a task scheduling problem in a hybrid flow shop production
system. We deploy a cluster of nodes based on a resource manager (Condor) (Pruyne
et al., 1995) coupled with the Parallel Virtual Machine (PVM) communication
libraries (Requile, 1995).
THE PSO ALGORITHM
In the PSO algorithm, each particle is characterized by:
• The cost of the objective function for its current position or the best position acquired previously: pbest
• The knowledge of its neighbours
• The best position acquired among all the particles in the swarm, previous and current: gbest
The PSO is initialized with a set of random particles (solutions) and then
searches for optima by updating generations (Hu et al.,
2003).

Fig. 1: 
Pseudo code of PSO algorithm 
In each step, each particle is updated and makes a compromise between three possible choices (Fig. 1):
• To continue following its own current velocity
• To return towards its best obtained position
• To move towards the best obtained position of the swarm
Each particle updates its velocity and position according to pbest and gbest.
The velocity restriction constant V_{max} was also included in the
algorithm: if the sum of the three parts exceeds
a constant value, the particle's velocity is clamped to a maximum velocity
V_{max} specified by the user. This mechanism prevents the phenomenon
of swarm explosion (LiPing et al., 2005).
The pseudocode of the procedure is as follows (Kennedy
et al., 1995):
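As a concrete illustration of the pseudocode in Fig. 1, the following is a minimal single-swarm PSO sketch in Python. The sphere objective, the inertia weight w and the coefficients c1 and c2 are illustrative assumptions, not the settings used in this study:

```python
import random

def pso(f, dim=2, n_particles=20, iters=100,
        w=0.7, c1=1.5, c2=1.5, v_max=0.5):
    """Minimal PSO: each particle tracks pbest; the swarm tracks gbest."""
    pos = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_cost = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_cost[i])
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                # compromise between inertia, pbest attraction, gbest attraction
                vel[i][d] = (w * vel[i][d]
                             + c1 * random.random() * (pbest[i][d] - pos[i][d])
                             + c2 * random.random() * (gbest[d] - pos[i][d]))
                # clamp to V_max to prevent swarm explosion
                vel[i][d] = max(-v_max, min(v_max, vel[i][d]))
                pos[i][d] += vel[i][d]
            cost = f(pos[i])
            if cost < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], cost
                if cost < gbest_cost:
                    gbest, gbest_cost = pos[i][:], cost
    return gbest, gbest_cost

# a standard benchmark objective, used here only for illustration
sphere = lambda x: sum(v * v for v in x)
```

Calling `pso(sphere)` returns the best position found and its cost; on this smooth objective the swarm converges near the origin within the 100 iterations.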
PARALLEL DISTRIBUTED STRATEGY OF PSO ALGORITHM
The initial swarm is divided into subswarms and distributed to the cluster's
compute nodes (Fig. 2). Each processor executes the PSO algorithm
independently and accelerates convergence to the global optimum by sending its
best solution (fitness) at regular iteration intervals to a nearby node in
a ring topology. This approach was applied to a multiobjective optimization problem
(Parsopoulos et al., 2004).
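The ring exchange can be sketched as a simple index rotation: node i receives the best (cost, solution) pair of node i-1 and keeps the better of the two. The tuple contents below are illustrative placeholders:

```python
def migrate_ring(bests):
    """One migration step on a ring: node i sends its best (cost, solution)
    pair to node (i+1) mod N; each receiver keeps the better of the two."""
    n = len(bests)
    incoming = [bests[(i - 1) % n] for i in range(n)]  # what each node receives
    return [min(bests[i], incoming[i]) for i in range(n)]

# each entry is (cost, solution); lower cost is better
bests = [(5.0, "a"), (2.0, "b"), (9.0, "c"), (7.0, "d")]
```

After one step, node 2 adopts node 1's better solution while nodes already holding a better value keep their own.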

Fig. 2: 
Parallel Distributed PSO model 

Fig. 3: 
Parallel distributed PSO: C_{max} 
In this study, we investigated this approach with the aim of measuring parallel
efficiency by studying the impact of the number of subswarms and particles
on solution quality.
In order to minimize internodal communication, which often forms the performance
bottleneck on networked machines, we used a master/slave communication model
with a minimal amount of communication. In our application, the master node
is used exclusively for distributing subswarms and for coordinating the particle
queue. The slaves execute the PSO algorithm in parallel, migrate their best solutions
to a nearby slave and finally send the solution to the master (Fig.
3). We used a cluster of 14 nodes and PVM library routines for sending and
receiving messages.
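The study itself uses PVM on a physical cluster; as a rough single-machine sketch of the same master/slave pattern, Python's multiprocessing can stand in, with each slave stubbed as a random search rather than a full PSO run:

```python
from multiprocessing import Pool
import random

def run_subswarm(args):
    """Slave: run an independent search on one subswarm and return its best
    cost. Stubbed here as random search on the sphere function for brevity."""
    seed, n_particles = args
    rng = random.Random(seed)
    sphere = lambda x: sum(v * v for v in x)
    return min(sphere([rng.uniform(-5, 5), rng.uniform(-5, 5)])
               for _ in range(n_particles))

def master(n_slaves=4, swarm_size=2048):
    """Master: split the swarm evenly, farm the subswarms out to slave
    processes, then gather and reduce their best solutions."""
    size = swarm_size // n_slaves
    with Pool(n_slaves) as pool:
        bests = pool.map(run_subswarm, [(i, size) for i in range(n_slaves)])
    return min(bests)
```

In PVM the `pool.map` call would correspond to explicit `pvm_send`/`pvm_recv` message exchanges between the master and each slave task.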
Table 1: 
The characteristics of the cluster 

The main characteristics of our cluster are reported in Table
1.
EXPERIMENTAL TESTS AND RESULTS
Serial execution results are shown in Table 2.
In this case, the objective function corresponds to C_{max}.
The performance of a parallel program is related to its execution time: by measuring how long the parallel program needs to solve our problem, we can directly measure its effectiveness. There are two important performance metrics: speedup and parallel efficiency.
Speedup is the ratio of the serial execution time T_{s} to the parallel execution time T_{p} on N processors. Parallel efficiency is the ratio between the speedup and the number N of processors; it is an indication of scalability. Ideal speedup equals the number of processors, with a maximum parallel efficiency of 1 (100%).
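These two metrics translate directly into code; for example, a serial time of 120 s and a parallel time of 20 s on 8 processors give a speedup of 6 and an efficiency of 0.75 (the timing values are illustrative):

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T_s / T_p."""
    return t_serial / t_parallel

def parallel_efficiency(t_serial, t_parallel, n_procs):
    """Parallel efficiency E = S / N; the ideal value is 1 (100%)."""
    return speedup(t_serial, t_parallel) / n_procs
```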
Our infrastructure is optimized for large problem instances. The swarm is initialized with a population of 2048 particles and then divided into subswarms as follows:
• 2 subswarms of 1024 particles
• 4 subswarms of 512 particles
• 8 subswarms of 256 particles
• 14 subswarms of 146 particles
The swarm is divided four times; each time, a cluster of 2, 4, 8 or 14 nodes evaluates the parallel implementation. These experiments aim to show whether it is better to choose a parallel optimization with many subswarms of reduced size rather than a few subswarms with large populations. For each case, 10 independent experiments were performed; the C_{max} cost is computed and the speedup and the parallel efficiency are measured.
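Assuming an even split, the subswarm sizes used in these experiments follow from integer division of the 2048-particle swarm (the 14-way split leaves a remainder of 4 particles; how they are handled is not stated):

```python
def subswarm_size(swarm_size, n_subswarms):
    """Particles per subswarm when the swarm is split as evenly as possible."""
    return swarm_size // n_subswarms

sizes = [subswarm_size(2048, n) for n in (2, 4, 8, 14)]  # [1024, 512, 256, 146]
```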
According to the numerical results, using 2 subswarms or more improves
the performance of the parallel algorithm in some cases (Fig. 4).
This improvement reaches its optimum when using 4 subswarms of 512
particles.
Table 2: 
Serial execution algorithm results 


Fig. 4: 
Parallel distributed PSO: speedup 

Fig. 5: 
Parallel distributed PSO: effectiveness 
This implies that cooperation between subswarms, by sending and receiving the
C_{max} cost at regular intervals, has advantages.
However, reducing each subswarm's size degrades solution quality: in the case of 14 subswarms, each one holds only 146 particles. Thus, when many subswarms with a reduced number of particles are used, the algorithm evolves within a restricted search space, in spite of the exchanges.
Concerning the parallel performance, the speedup shows a good acceleration of the parallel algorithm, followed by a performance degradation when using a cluster of 8 or 10 nodes (Fig. 5).

Fig. 6: 
Parallel distributed (PSO,GA): C_{max} 
Although the experiments were carried out on a high-performance cluster, we observed that parallelization becomes ineffective with 10 compute nodes. Indeed, when the speedup is lower than 1, the computational time of the parallel algorithm exceeds that of the serial algorithm (Fig. 6).
This increase in time is due to a bottleneck: the main performance bottleneck is the communication latency between processors.
To compare these results with those obtained by the implementation
of the parallel genetic algorithm model with migration in a ring topology (Hao
et al., 1998), we applied that model to solve the same scheduling
problem. Those experiments were carried out in a shared-memory configuration.
The parallel implementation resulted in improved performance; however, the number of threads remains limited within a shared memory.
We therefore adapted this scheme to our hardware configuration. With a large population size, the genetic algorithm searches the solution space more thoroughly. In a master/slave paradigm, the master initializes the known parameters of the algorithm: population size, crossover rate, mutation rate and migration frequency.
Then, as in the previous scheme, the master divides the population into multiple subpopulations; each computational node carries out genetic operations on its own chromosome set and communicates only with its neighbours in a ring topology.
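A minimal sketch of one GA generation and of the ring migration step, assuming bit-string chromosomes, tournament selection, one-point crossover and bit-flip mutation (the paper does not specify its exact operators):

```python
import random

def ga_subpop_step(pop, fitness, rng, mutation_rate=0.1):
    """One generation on a subpopulation of bit-string chromosomes:
    tournament selection, one-point crossover, bit-flip mutation."""
    def tournament():
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b
    new_pop = []
    while len(new_pop) < len(pop):
        p1, p2 = tournament(), tournament()
        cut = rng.randrange(1, len(p1))               # one-point crossover
        child = p1[:cut] + p2[cut:]
        child = [bit ^ (rng.random() < mutation_rate) for bit in child]
        new_pop.append(child)
    return new_pop

def migrate(subpops, fitness):
    """Ring migration: each subpopulation sends a copy of its best
    chromosome to the next one, replacing that neighbour's worst."""
    bests = [max(p, key=fitness) for p in subpops]
    for i, p in enumerate(subpops):
        worst = min(range(len(p)), key=lambda j: fitness(p[j]))
        p[worst] = bests[(i - 1) % len(subpops)][:]
    return subpops
```

In the cluster setting, each node would run `ga_subpop_step` locally and exchange only the best chromosomes with its ring neighbour at the migration frequency set by the master.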
As can be seen, the results obtained using 2, 4, 6, 8, 10, 12 and 14 subpopulations follow the same trend: C_{max} grows gradually as the population size is reduced (Fig. 7).

Fig. 7: 
Parallel distributed (PSO,GA): speedup 

Fig. 8: 
Parallel distributed (PSO, GA): Parallel effectiveness 
This model presents an acceptable acceleration (Fig. 8) and a parallel efficiency that decreases more slowly than in the PSO model, but we note that the parallel distributed PSO model gave better results.
CONCLUSION
This study has presented a parallel implementation of the Particle Swarm Optimization algorithm, validated on an industrial scheduling problem. PSO is an approach to problems whose solutions can be represented as points in an n-dimensional solution space. Many improvements to the original PSO model have been proposed by adjusting the algorithm's parameters. Parallel optimization, in turn, uses multiple computers or processors with the aim of reducing the elapsed time, so that a good acceleration can be obtained. Both single-node and parallel implementations of the algorithm have been developed and applied on a high-performance cluster with a fault-tolerant strategy to obtain an efficient and robust parallel scheme. Two widely used metrics have been used for the evaluation of the results and for comparison with the corresponding results of the parallel distributed genetic algorithm approach. The results show that cooperation between computational nodes produces improvements, but that too many processors decrease the parallel efficiency; also, the PSO model outperforms the genetic algorithm model. Lastly, distributed system technology and grid computing offer a number of potential uses and benefits for parallel and distributed algorithms and for a wide range of computational problems. Nevertheless, several parameters should be studied beforehand: the parallel model, the topology, the population size and the infrastructure used all play a dominant role in the solution quality and the parallel efficiency. Future research includes the parallelization of another bio-inspired algorithm: ant colony optimization.
ACKNOWLEDGMENTS
We thank the University of Science and Technology of Oran and, in particular, the Laboratory of Simulations and Modelling of Industrial Systems, which placed at our disposal the necessary material and a suitable environment for carrying out this study.