INTRODUCTION
Refactoring is a very useful technique for improving the quality of existing computational codes (Fowler, 1999).
During the refactoring process, a series of small steps is applied to change the internal structure of the code, while the functionality of the system and the external behavior of the code remain unaffected (Griswold, 1991). Refactoring improves the structure of a code, making it easier to maintain and to extend. The activity of refactoring therefore has two general categories of benefits: maintainability and extensibility of a code. The first category can be achieved by breaking large, monolithic routines into a set of individually concise, well-named, single-purpose methods, by moving a method to a more appropriate class or by removing misleading comments. Bug fixing then becomes effortless and the modified code is more readable and easier to understand and grasp (Robert, 2008).
Extensibility means that the capabilities of the application can be extended more easily by using recognizable design patterns, which provide flexibility where none may have existed before (Kerievsky, 2004).
Some types of refactoring techniques are illustrated in Table 1. Some of these can be applied only to certain languages or language types; a longer list can be found in Fowler's Refactoring book (Fowler, 1999). This study presents the authors' work on refactoring C++ code, through a Message Passing Interface implementation of a new bit-reversal algorithm on multiprocessors. The new code outperforms an existing one included in the Fast Fourier Transform (FFT) program (Cooley and Tukey, 1965). Further, Sarkar et al. (1986), Peng et al. (2009), Jing et al. (2010), He et al. (2011) and Arioua et al. (2012) presented applications of FFT algorithms to real-life problems. For an efficient implementation
of the Discrete Fourier Transform (DFT) on an array of 2^N samples, the FFT can be regarded as an optimized computational algorithm. The DFT of N data samples can be defined mathematically as (Eq. 1):

X_k = \sum_{n=0}^{N-1} x_n\, W_N^{nk}, \qquad k = 0, 1, 2, \ldots, N-1

Hence (Eq. 2):

W_N = e^{-2\pi i / N}

Herein, W_N is the Nth primitive root of unity, x is the original sequence and X is the FFT of x, with k = 0, 1, 2, ..., N-1.
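To make the O(N²) cost discussed below concrete, the following is a minimal C++ sketch of the direct evaluation of Eq. 1 (an illustration only; the function name, the sign convention W_N = e^{-2πi/N} and the use of double precision are assumptions of this sketch):

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Direct evaluation of Eq. 1: X[k] = sum_n x[n] * W_N^{nk}, with
// W_N = exp(-2*pi*i/N).  This is the O(N^2) matrix-vector product that the
// FFT reorganizes into an O(N log N) computation.
std::vector<std::complex<double>> naive_dft(const std::vector<std::complex<double>>& x)
{
    const std::size_t N = x.size();
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double>> X(N);
    for (std::size_t k = 0; k < N; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t n = 0; n < N; ++n) {
            const double angle = -2.0 * pi * static_cast<double>(n * k) / static_cast<double>(N);
            sum += x[n] * std::complex<double>(std::cos(angle), std::sin(angle));
        }
        X[k] = sum;                       // one output coefficient per row of W
    }
    return X;
}
```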
Table 1: Types of refactoring techniques adopted from Fowler (1999)
W_N^{nk} defines an NxN matrix and, in general, X, x and W are complex numbers, so the transform can be computed directly as a matrix-vector multiplication. This direct method requires O(N²) complex multiplications. The FFT instead enables fast DFT calculation: it works by dividing the input points into two sets (k sets in general), calculating the FFT of the smaller sets recursively and combining them to obtain the Fourier transform of the original set. Rearranging the operations in this way reduces the cost from O(N²) to O(N log N), which makes a huge difference for large N. The FFT is calculated in two steps: the first step transforms the original data array into bit-reversed order, and this so-called bit-reversal reordering of the data makes the mathematical calculations of the second part much easier.
If the bit-reversal is not done efficiently, it can consume a substantial fraction of the total FFT computational time. For example, the FFT is widely used in computer-based numerical techniques for solving electronic structure problems, where the bit-reversal algorithm can be considered a main part of any atomic structure software. If the bit-reversal code is not properly designed, the FFT could take about 10-30% of the overall computational time, delaying the calculations (Saad et al., 2010). Therefore, the bit-reversal program must be carefully designed to enhance the quality of the code. In this study, a performance analysis of a parallel implementation of a new bit-reversal algorithm (Kumar et al., 1994) on multiprocessors has been developed via code refactoring of existing library programs recently used in the computational process of solving the electronic atomic structure of an online atomic database (Tahat and Salah, 2011; Tahat et al., 2010; Tahat and Tahat, 2011).
The main goal of this study was thus to improve the numerical techniques used to solve Schrodinger's time-dependent equation, achieving higher speed and efficiency.
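For orientation, a minimal C++ sketch of the conventional serial bit-reversal permutation is given below. It illustrates the reordering step being discussed, not the new parallel algorithm presented in the appendices; the function name and the power-of-two length requirement are assumptions of this sketch:

```cpp
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

// Conventional serial bit-reversal reordering of an array whose length n is a
// power of two: element i is swapped with the element whose index has the
// binary digits of i written in reverse order, so each pair is swapped once.
void bit_reverse_permute(std::vector<std::complex<double>>& a)
{
    const std::size_t n = a.size();
    std::size_t j = 0;
    for (std::size_t i = 1; i < n; ++i) {
        std::size_t bit = n >> 1;
        while (j & bit) {                  // ripple the "carry" of the
            j ^= bit;                      // bit-reversed increment of j
            bit >>= 1;
        }
        j |= bit;
        if (i < j)
            std::swap(a[i], a[j]);
    }
}
```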
THEORY AND METHODS
Discrete Fourier transform using assumptions: Equations 1 and 2 are rewritten using a complex-valued function ψ(X) instead of X, with periodic boundary conditions ψ(X+L) = ψ(X), as shown in Fig. 1 together with the discrete function values; the propagation of the wave function is taken along the X direction in the range X ∈ [0, L]. Also, ψ(X) has been discretized on N mesh points equally spaced by ΔX, i.e., X_j = jΔX with j = 0, ..., N-1 (Fig. 1). Here L, X, j, N and X_j represent, respectively, the length of the box in which ψ is defined (in Å), the spatial coordinate (in atomic units), a counter index, the total number of spatial mesh points of the sample and the discretized coordinate along X. The wave functions ψ(X) in coordinate and momentum space must satisfy the boundary conditions ψ(X+L) = ψ(X) in the interval X ∈ [0, L], so that ψ(0) = ψ(L) and ψ(±πN/L) = 0, respectively; the meaning of each term (Fig. 1a, b), (L, X, j, N, X_j), is the same as before.
The expression of Eq. 3 shows that the DFT represents the wave function ψ(X) as a linear combination of trigonometric components [exp(iKX) = cos(KX) + i sin(KX)] with different wave numbers (Eq. 3):

\psi(X_j) = \sum_{m} \tilde{\psi}_m \exp(i K_m X_j)

Here, K_m represents the discrete wave numbers, defined as (Eq. 4):

K_m = \frac{2\pi m}{L}

Equation 3 has periodicity L, with the expansion coefficients ψ̃_m given by (Eq. 5):

\tilde{\psi}_m = \frac{1}{N} \sum_{j=0}^{N-1} \psi(X_j) \exp(-i K_m X_j)
Fig. 1(a-b): Discrete function ψ(X) values and boundary conditions, (a) Discrete function ψ(X) values along the X axis on N mesh points with equal mesh spacing ΔX = L/N, using a discrete step X_j = jΔX (j = 0, ..., N-1) and (b) Discrete function ψ(K) values at different wave numbers (K) in the range [-π/ΔX, π/ΔX] at ΔX = L/N and even N, using a discrete step ΔK = 2π/L. Here, (m) denotes the wave numbers for the higher indices and the blank (open) symbols indicate periodic boundary conditions in both coordinate and momentum space
Accuracy of the assumed Fourier expansion: The choice of wave numbers in Eq. 4 guarantees that ψ(X) has periodicity L. The wave numbers K_m are separated by:

\Delta K = \frac{2\pi}{L}

therefore, equivalent wave numbers are expected to appear, since, owing to the discrete sampling, they are all equivalent in real space (e.g., high wave numbers oscillate more rapidly but return to the same values as their lower wave number counterparts at the points X_j). Mathematically, the discrete mesh points in real space cannot represent higher wave numbers; physically, the smallest-magnitude wave numbers are the ones involved in these assumptions, since they represent the lowest-energy states. For that reason, the wave numbers for the higher indices in Eq. 4 must be folded back by 2πN/L. For simplicity, N has been assumed to be an even number. From Fig. 1b, all wave numbers then lie in the range:

K \in \left[ -\frac{\pi}{\Delta X}, \frac{\pi}{\Delta X} \right]
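The folding of the upper half of the wave numbers back into this range can be sketched in code as follows (a hypothetical helper, assuming N even, X_j = jΔX and K_m = 2πm/L as defined above):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Build the mesh X_j = j*dX and the folded wave numbers K_m for a periodic
// box of length L sampled on N (even) points, as assumed in Fig. 1:
// K_m = 2*pi*m/L for m < N/2, shifted by -2*pi*N/L for the upper half.
void build_mesh(double L, std::size_t N,
                std::vector<double>& X, std::vector<double>& K)
{
    const double pi = std::acos(-1.0);
    const double dX = L / static_cast<double>(N);
    X.assign(N, 0.0);
    K.assign(N, 0.0);
    for (std::size_t j = 0; j < N; ++j)
        X[j] = static_cast<double>(j) * dX;
    for (std::size_t m = 0; m < N; ++m) {
        double Km = 2.0 * pi * static_cast<double>(m) / L;
        if (m >= N / 2)
            Km -= 2.0 * pi * static_cast<double>(N) / L;  // fold back by 2*pi*N/L
        K[m] = Km;                                        // now in [-pi/dX, pi/dX)
    }
}
```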
Considering the discrete function ψ_j as a vector in an N-dimensional vector space, |ψ⟩ = (ψ_0, ψ_1, ψ_2, ..., ψ_{N-1}), is a convenient way to prove the correctness of Eq. 5. The plane-wave basis set of Eq. 6 is defined in this vector space:

b_m(X_j) = \frac{1}{\sqrt{N}} \exp(i K_m X_j)

Equation 6 is orthonormal, i.e., the inner products of the basis functions are (Eq. 7):

\langle b_m | b_n \rangle = \delta_{mn}

For m ≠ n, the sum of the geometric series is carried out; otherwise (m = n), all N summands equal 1/N (Eq. 8 and 9).
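Written out explicitly under the definitions of Eq. 4 and 6, this standard geometric-series identity (reproduced here for completeness, not transcribed from the original Eq. 8 and 9) reads:

\langle b_m | b_n \rangle
  = \frac{1}{N}\sum_{j=0}^{N-1} e^{\,i (K_n - K_m) X_j}
  = \frac{1}{N}\sum_{j=0}^{N-1} \left[ e^{\,2\pi i (n-m)/N} \right]^{j}
  = \begin{cases}
      \dfrac{1}{N}\,\dfrac{1 - e^{\,2\pi i (n-m)}}{1 - e^{\,2\pi i (n-m)/N}} = 0, & m \neq n,\\[1ex]
      \dfrac{1}{N}\cdot N = 1, & m = n.
    \end{cases}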
For a complete basis set, a discrete function ψ_j in the N-dimensional vector space can be represented as a linear combination of the N basis functions b_m(X_j); in particular, suppose the function is expanded as (Eq. 10):

\psi_j = \sum_{m=0}^{N-1} c_m\, b_m(X_j)

Multiplying both sides by ⟨b_m| and using the orthonormality condition, Eq. 7, leads to:

c_m = \langle b_m | \psi \rangle = \sum_{j=0}^{N-1} b_m^{*}(X_j)\, \psi_j

Therefore, the Fourier coefficients ψ̃_m in Eq. 5 can be readily obtained from Eq. 10 by substituting the definitions of the basis functions and of the inner product. This leads to Eq. 14, and identification of the expansion coefficients in Eq. 3 and 5 follows from comparing Eq. 14 with Eq. 3.
Solving the time-dependent Schrodinger equation using the spectral method: Much of the computational effort in electronic structure calculations has traditionally been focused on solving the time-dependent Schrodinger equation (Eq. 15):

i \frac{\partial \psi}{\partial t} = \hat{H}\, \psi

For the purpose of solving this equation numerically, the DFT and the FFT are the two required computable types of Fourier transformation. Here, \hat{H} represents the Hamiltonian operator, consisting of the sum of the kinetic operator T and the potential operator V (Eq. 16):

\hat{H} = T + V
The present work obtains such a numerical solution by refactoring existing software based on a numerical technique, the so-called spectral method (Canuto et al., 2006), which depends on Fourier transformation (Bracewell, 2000). Here, the exposition of the Fourier transform of discretely sampled data and of the fast Fourier transform given by Press et al. (1992) is followed.
Consider the time-dependent Schrodinger equation (Eq. 15), implemented in one dimension in atomic units (Eq. 17):

i \frac{\partial \psi(X, t)}{\partial t} = \hat{H}\, \psi(X, t)

Here, the Hamiltonian operator can be defined in one dimension from the kinetic and potential-energy operators, within the pseudopotential approximation (Heine, 1970) and the momentum-space formalism (Ihm et al., 1979), in which plane waves are used as the basis set for the eigenfunctions:

\hat{H} = -\frac{1}{2}\frac{\partial^{2}}{\partial X^{2}} + V(X)
In momentum space, the kinetic energy operator is diagonal and hence it trivially applies to a single trial eigenvector, whereas the calculation of the potential energy would involve a multiplication in real space as well as an expensive convolution in momentum space. To see that the kinetic energy operator is diagonal in the Fourier (or momentum) space, in the present work T was applied to the wave function in its Fourier representation, using the assumed discrete Fourier transform of Eq. 5:

T\, \psi(X_j) = \sum_{m} \frac{K_m^{2}}{2}\, \tilde{\psi}_m \exp(i K_m X_j)

i.e., the kinetic energy operator multiplies the Fourier coefficients of the wave function by the factor K_m²/2:

(\widetilde{T\psi})_m = \frac{K_m^{2}}{2}\, \tilde{\psi}_m

On the other hand, the potential energy operator is diagonal in real space, which means that it multiplies the wave function by the factor V_j = V(X_j):

(V\psi)_j = V_j\, \psi_j
This suggests an efficient algorithm for the time evolution of the wave function, in which the kinetic and potential energy operators are applied in the spaces where they are diagonal, momentum and real space, respectively. The Trotter expansion (Xiao-Ping and Broughton, 1987) can then be used.
Trotter expansion and spectral method: Recalling the Trotter expansion technique (Xiao-Ping and Broughton, 1987), its use in the current simulation to propagate the wave function and compute the energy leads to (Eq. 22):

\exp(-i\hat{H}\Delta t) \simeq \exp\!\left(-i V \tfrac{\Delta t}{2}\right)\, \exp(-i T \Delta t)\, \exp\!\left(-i V \tfrac{\Delta t}{2}\right) + O(\Delta t^{3})

where exp(-iVΔt/2) can be defined as the time evolution operator arising from the potential energy V; it is easily applied in real space as:

\left[\exp\!\left(-i V \tfrac{\Delta t}{2}\right)\psi\right]_j = \exp\!\left(-i V_j \tfrac{\Delta t}{2}\right)\psi_j

On the other hand, the operator exp(-iTΔt) can be defined as the time evolution operator arising from the kinetic energy T; it is applied in Fourier space as:

\left[\widetilde{\exp(-i T \Delta t)\,\psi}\right]_m = \exp\!\left(-i \tfrac{K_m^{2}}{2}\Delta t\right)\tilde{\psi}_m
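The resulting split-operator time step can be outlined in code as below. This is an illustrative serial sketch only: the function names, the forward/inverse transform calls fft()/ifft() and the e^{-iĤΔt} sign convention are our assumptions, not the refactored library routines:

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Assumed forward/inverse transform pair (any radix-2 FFT implementation,
// e.g., the one sketched in the next section, plus its inverse).
void fft(std::vector<std::complex<double>>& a);
void ifft(std::vector<std::complex<double>>& a);

// One split-operator (Trotter) step of length dt: half potential kick in real
// space -> FFT -> kinetic phase in momentum space -> inverse FFT -> second
// half potential kick.  V[j] = V(X_j); K[m] are the folded wave numbers.
void split_step(std::vector<std::complex<double>>& psi,
                const std::vector<double>& V,
                const std::vector<double>& K,
                double dt)
{
    const std::complex<double> I(0.0, 1.0);
    const std::size_t N = psi.size();

    for (std::size_t j = 0; j < N; ++j)            // exp(-i V dt/2) in real space
        psi[j] *= std::exp(-I * V[j] * (dt / 2.0));

    fft(psi);                                      // to momentum space
    for (std::size_t m = 0; m < N; ++m)            // exp(-i (K_m^2/2) dt), kinetic part
        psi[m] *= std::exp(-I * (K[m] * K[m] / 2.0) * dt);
    ifft(psi);                                     // back to real space

    for (std::size_t j = 0; j < N; ++j)            // second half potential kick
        psi[j] *= std::exp(-I * V[j] * (dt / 2.0));
}
```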
Fast Fourier transforms: Since the computation of each of the N Fourier coefficients ψ̃_m involves a summation over N terms, the computational time grows as O(N²); the computational cost associated with the DFT is therefore one of the most important obstacles to implementing the spectral method. Using the FFT algorithm reduces this complexity to O(N log N) (Cooley and Tukey, 1965). The FFT allows cheap transformation of eigenvectors between real and momentum space, so the potential energy operator can also be applied efficiently, making the quantum dynamics simulation less intensive from the computational point of view. Therefore, the FFT has to be involved in the current numerical solution. Equation 20 can be rewritten in terms of the forward (F) and inverse (F^{-1}) Fourier transformation operators (Press et al., 1992) as follows:
\psi(X, t+\Delta t) = \exp\!\left(-i V \tfrac{\Delta t}{2}\right) F^{-1}\!\left[\exp\!\left(-i \tfrac{K^{2}}{2}\Delta t\right) F\!\left[\exp\!\left(-i V \tfrac{\Delta t}{2}\right)\psi(X, t)\right]\right] \quad (\text{Eq. 23})

where the backward (F^{-1}, Eq. 24) and forward (F, Eq. 25) transforms are given by Eq. 3 and 5, respectively.
The FFT program being refactored in this work follows the Danielson-Lanczos procedure (Danielson and Lanczos, 1942); a more detailed description of each step of the adopted FFT can be found in Press et al. (1992). In short, the input wave function values of the FFT array are first re-ordered by applying the bit-reversal operation to each wave function index. The Danielson-Lanczos recurrences, Eq. 26 and 27, are then applied recursively, starting from the smaller sub-arrays and working upwards:

\tilde{\psi}_m = \tilde{\psi}_m^{0} + W_N^{m}\,\tilde{\psi}_m^{1}

\tilde{\psi}_{m+N/2} = \tilde{\psi}_m^{0} - W_N^{m}\,\tilde{\psi}_m^{1}

where ψ^0 and ψ^1 represent the N/2-element Fourier transforms of the even and odd sub-arrays, respectively. Since there are log₂N recursive steps, the number of complex floating-point operations in the FFT algorithm is of order 2N log₂N.
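As an illustration of the Danielson-Lanczos recombination (and of where the bit-reversal reordering enters), a compact serial radix-2 FFT is sketched below; this is the generic textbook form, under the same sign convention as Eq. 1, and is not the refactored library code:

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Bit-reversal reordering, as sketched in the Introduction.
void bit_reverse_permute(std::vector<std::complex<double>>& a);

// In-place iterative radix-2 FFT (length must be a power of two).  The input
// is first put into bit-reversed order; the Danielson-Lanczos butterflies are
// then applied on sub-arrays of length 2, 4, ..., N, combining the even and
// odd half-transforms as in Eq. 26 and 27.
void fft(std::vector<std::complex<double>>& a)
{
    const std::size_t n = a.size();
    const double pi = std::acos(-1.0);

    bit_reverse_permute(a);

    for (std::size_t len = 2; len <= n; len <<= 1) {
        const double ang = -2.0 * pi / static_cast<double>(len);
        const std::complex<double> wlen(std::cos(ang), std::sin(ang));
        for (std::size_t i = 0; i < n; i += len) {
            std::complex<double> w(1.0, 0.0);
            for (std::size_t k = 0; k < len / 2; ++k) {
                const std::complex<double> even = a[i + k];
                const std::complex<double> odd  = w * a[i + k + len / 2];
                a[i + k]           = even + odd;   // Danielson-Lanczos combination
                a[i + k + len / 2] = even - odd;
                w *= wlen;
            }
        }
    }
}
```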
Computing the energy: The total energy can be computed from a set of equations based on the time-dependent Schrodinger equation of Eq. 17, in which the total energy appears as a conserved quantity. Thus, by discretizing:

E = \langle \psi | \hat{H} | \psi \rangle = \langle T \rangle + \langle V \rangle

Here, the expansion of the wave function in terms of its Fourier components (Eq. 3) is required for calculating the first term (i.e., the kinetic energy ⟨T⟩). Finally, by substituting Eq. 33 into Eq. 28, Eq. 29 can be rewritten as the working expression for the energy.
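Under the conventions of Eq. 3 and 5, and assuming the discrete normalization \sum_j |\psi_j|^2 = 1 (an assumption of this sketch, not a transcription of the original Eq. 34-36), the conserved total energy takes the form:

E = \langle T \rangle + \langle V \rangle
  = N \sum_{m} \frac{K_m^{2}}{2}\, |\tilde{\psi}_m|^{2} + \sum_{j=0}^{N-1} V_j\, |\psi_j|^{2}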
Parallel algorithm for the spectral method: Equations 23 and 36 present the algorithm for the spectral method, which is the same for the sequential and parallel codes. Parallel implementation of the algorithm involves parallelizing each of the following four steps: (1) the nonlinear step, (2) the forward transform F of Eq. 25, (3) the linear step and (4) the backward transform F^{-1} of Eq. 24. The difficulty of parallelizing spectral method algorithms arises in steps 2 and 4, because there are nontrivial data dependences over the entire index range 0 ≤ l ≤ N, involving the forward and backward Fourier transforms (FFT and butterfly transform, BFT), respectively. Steps 1 and 3, on the other hand, can be evolved trivially because of the natural independence of the data. The problem of parallelizing the one-dimensional FFT has been of great interest for vector (Averbuch et al., 1990; Swartztrauber, 1987) and distributed-memory computers (Dubey et al., 1994). Moreover, communication issues and the memory hierarchy are the two most important factors to be recognized in the parallelized algorithms in order to achieve an enhanced speed up of the one-dimensional FFT.
These one-dimensional algorithms are architecture dependent and involve efficient methods for the data rearrangement and transposition phases (Averbuch et al., 1990). The complete parallel algorithm consists of the following steps:
•  Nonlinear step
•  Row FFT
•  Multiplication by
•  Bit-reversal algorithm with respect to the column FFT
•  Linear step (transposed linear operator)
•  Column BFT
•  Multiplication by the complex conjugate ( * )
•  Row BFT
The parallelism is due to the row and column sub-array FFTs in steps 2, 4, 6 and 8, in addition to the independent operations in steps 1, 3, 5 and 7.
Distributed-memory approach: The Message Passing Interface (MPI) (Karniadakis and Kirby, 2003) can be considered a tool for distributed parallel computing that has become a de facto standard (Gropp et al., 1998). In distributed parallel programming, different processors work on completely independent data and explicitly use send and receive library calls to communicate data between processors (Snir et al., 1996; Nehra et al., 2007).
To implement the distributed parallel algorithm for the time-dependent Schrodinger equation, in this work the rows of the FFT arrays are distributed among 4 processors. Fast communication between them is the main requirement for a successful implementation of the algorithm. The large performance cost in this algorithm is due to the redistribution of data between the row and column operations: if the row and column computational stages yield a significant speed up compared with the communication expense incurred in redistributing the matrix data, the algorithm will be successful. When speeding up the computation of the current parallel code, the bit-reversal problem can be solved by swapping columns and/or rows of a matrix of size X(n)(n) in the FFT arrays. In particular, the bit-reversal swapping itself can be done in a fixed time, so the difficulty lies in the communication between the parallel processes, which takes longer than the bit-reversal itself. The present work proposes a solution to this problem based on a new bit-reversal procedure for FFT arrays, which resolves the conflict between the speed of the bit-reversal swapping and the communication between parallel processes in multiprocessor environments; the new algorithm is designed to match the communication pattern of the parallel processes. In parallel programming, the need for communication between tasks depends upon the nature of the proposed tasks.
To implement the final step, the scatter method is adopted. In the scatter operation, a single node sends a unique message of size m to every other node (also called a one-to-all personalized communication), in order to distribute the data among the processors. During the gather operation, a unique message from each node is collected by a single node; the gather operation can thus be defined as the inverse of the scatter operation and can be executed as such (Fig. 2).
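A minimal MPI sketch of the scatter/gather pattern of Fig. 2 is given below (illustrative only; the matrix size, the row-block layout and the variable names are assumptions, and it is not the appendix code):

```cpp
#include <mpi.h>
#include <vector>

// Scatter the rows of an N x N matrix (stored row-major on rank 0) so that
// each of the `size` processes receives N/size contiguous rows, works on
// them locally, and returns them with MPI_Gather (the inverse step, Fig. 2).
int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 8;                        // matrix dimension (assumed divisible by size)
    const int rowsPerProc = N / size;

    std::vector<double> matrix;             // full matrix, meaningful only on rank 0
    if (rank == 0) {
        matrix.resize(N * N);
        for (int i = 0; i < N * N; ++i) matrix[i] = static_cast<double>(i + 1);
    }

    std::vector<double> local(rowsPerProc * N);      // this rank's block of rows
    MPI_Scatter(matrix.data(), rowsPerProc * N, MPI_DOUBLE,
                local.data(),  rowsPerProc * N, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    // ... local work on the rows (e.g., row FFTs, bit-reversal swaps) ...

    MPI_Gather(local.data(),  rowsPerProc * N, MPI_DOUBLE,
               matrix.data(), rowsPerProc * N, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```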
Example: In order to explain the proposed algorithm simply, a comparison of the newly developed bit-reversal algorithm and the currently used algorithm is presented in Appendices 1 and 2. Appendix 1 reports how to distribute the following array of input data, A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], onto two processors based on the new algorithm. The same array can also be distributed onto 4 processors, as shown in Fig. 3. Further, Appendix 3 presents the parallel source code of the bit-reversal algorithm.
Advanced example of distributing data on 4 processors using the new bit-reversal algorithm: Breaking the problem into discrete "chunks" of work must be the first step of designing a parallel program; these chunks can then be distributed to multiple tasks (Kumar et al., 1994).
Fig. 2: The scatter method (Scatter and Gather)
The steps of distributing the data of an 8x8 matrix over 4 processors are presented in Fig. 4, based on four proposed chunks, as follows:
•  The one-dimensional array filled with the input data is arranged in columns, as shown in Fig. 4a
•  The bit reversal of each column index is found, as shown in Fig. 5a, and the corresponding columns are swapped, as shown in Fig. 5b (a small code sketch of this column step follows the list); the combined result of these two operations is illustrated in Fig. 4b
•  The data are distributed to the specified processors, as shown in Fig. 4c
•  The bit reversal of each row index is found and the corresponding rows are swapped; the final result of this last step is shown in Fig. 4d
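A small serial sketch of the column step of Fig. 5 (bit-reverse the column indices of an 8x8 matrix and swap the corresponding column pairs) is shown below; it is our illustration of the figure, with hypothetical helper names, not the appendix implementation:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Reverse the lowest `bits` binary digits of index i (e.g., for 8 columns,
// bits = 3: 1 <-> 4 and 3 <-> 6, while 0, 2, 5 and 7 map to themselves).
std::size_t reverse_bits(std::size_t i, unsigned bits)
{
    std::size_t r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (r << 1) | (i & 1u);
        i >>= 1;
    }
    return r;
}

// Swap the columns of an n x n row-major matrix into bit-reversed order
// (the step illustrated in Fig. 5a-b for n = 8, i.e., bits = 3).
void bit_reverse_columns(std::vector<double>& m, std::size_t n, unsigned bits)
{
    for (std::size_t c = 0; c < n; ++c) {
        const std::size_t rc = reverse_bits(c, bits);
        if (c < rc)                                  // swap each pair only once
            for (std::size_t row = 0; row < n; ++row)
                std::swap(m[row * n + c], m[row * n + rc]);
    }
}
```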
Fig. 3: Distributing the vector A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] over 4 processors
On the other hand, the number of processes must be a power of two, 2^n (n = 0, 1, 2, 3, ..., etc.), for any arrangement (e.g., 2, 4, 8, 16, 32, 64, 128, 256, etc.) (Kumar et al., 1994). Herein, the data of the (8x8) matrix can also be distributed over 8 processors; Fig. 6 presents the resulting redistributed data.
Fig. 4(a-d): Example of an 8x8 matrix, (a) Arrange the table, (b) Take the bit reversal of the columns and then swap the columns, (c) Distribute the data over several processes (4 processes) and (d) Communication between the processors: take the bit reversal of the rows and then swap the rows
Fig. 5(a-b): Explanation of bit reversal and swap of columns, (a) Bit-reversal of columns and (b) Swap between the columns
Fig. 6: Distributing the data of the matrix (8x8) into 8 processes
RESULTS
After refactoring the existing code and including the new bit-reversal algorithm for reordering the input data and performing the FFT matrix operations in the spectral method, the one-processor implementation of the bit-reversal parallel code has become 8.7 to 17.3% faster than the sequential code (Table 5). The produced parallel code has been tested using a portable implementation of MPI, called MPICH (http://www.mcs.anl.gov/research/projects/mpich2/).
Program input parameters:

•  Matrix dimension
•  A(1, 1) = 0, 1, 2 or any random number
•  Fill row/column
•  Number of processors (n)
Program outputs: Tables 2-5 present the computed results of the distributed-memory MPI implementation of the new bit-reversal algorithm for square FFT matrices of different size X(n)(n). They have been produced and tested using three methods for computing the parallel speed up, in order to measure the program scalability and to show how the program scales as more processors are used.
Speed up for various matrix sizes X(n)(n) at different fractions (p) and a fixed number of processors (n = 4): As shown in Table 2, with four processors (n = 4) the speed up increases with the working set size X(n). The measured results also show that the speed up eventually decreases for larger X(n) at the various values of the parallel fraction p. The maximum and minimum dimensionless speed up ratios were 3.8352 and 2.9110, occurring at p = 0.9856, X(n) = 2x2 and at p = 0.8753, X(n) = 19x19, respectively. The speed up values can be very close to each other at similar values of the parallel fraction p, regardless of the matrix size; for example, the speed up of 3.7352 at X(n) = 3x3, p = 0.9763 is very close to the speed up of 3.7123 at X(n) = 4x4, p = 0.9741. The measured efficiency is closely related to the speed up at fixed n and depends proportionally on the matrix size; all values were less than one and ranged between 0.9588 and 0.7277. The communication cost for the MPI code as a function of efficiency, T1/E(n), came close to perfect speed up, showing excellent agreement with the expected ideal cost, which can be calculated by multiplying the number of processors used (n = 4) by the parallel processing time (T4).
Speed up at fixed fraction (p) and fixed matrix size X(n)(n) on different numbers of processors (n): For a fixed parallel fraction (p = 0.9545), the speed up and the cost increase rapidly with a larger number of processors for the same matrix size (8x8), with decreasing efficiency. The maximum speed up was 20.3134 and occurred at the highest number of processors, n = 256 (Table 3).
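These tabulated values are consistent with Amdahl's law for a parallel fraction p (the law invoked again in the Discussion); as a worked check against the maximum above, with p = 0.9545 and n = 256:

S(n) = \frac{1}{(1 - p) + p/n}, \qquad
S(256) = \frac{1}{0.0455 + 0.9545/256} \approx 20.31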
Speed up at large matrix sizes X(n)(n) with various fractions (p): The results of Table 4 and 5 reveal that the speed up increases as the problem size X(n) becomes larger, for high values of the parallel fraction (e.g., p ≈ 0.9554), with an increasing number of processors.
Table 2: Results for the distributed-memory MPI implementation for (2x2) ≤ X(n)(n) ≤ (19x19), at n = 4, A(1, 1) = 0, filling by columns, repeat = 1000
Table 3: Speed up for implementing a matrix of size (8x8) at (p = 0.9545; n = 4)
Table 4: Speed up at large matrix size
Table 5: Output of running the re-factored code after including the new bit-reversal algorithm
However, the speed up eventually decreases for larger X(n) on the same processors at lower values of the parallel fraction (p = 0.8753). Generally speaking, all measured speed ups for the large matrices increased when increasing the number of processors, depending on p. In all cases p was less than one and the maximum speed up was around 20.
DISCUSSION
Parallel performance analysis: The results reported in this work indicate that the distributed-memory parallel implementation of the bit-reversal of FFT arrays over the range (2x2) ≤ X(n)(n) ≤ (19x19) produces an important speed up of the calculations, with maximum values at high parallel fraction, beyond which the communication cost increases and the computation/communication ratio decreases. The communication cost for the MPI code is a result of the communication stages; the less than perfect speed up is thus due to the volume of data communicated between processors in the redistribution stages and not to the constant sharing of small sub-array data. Many studies in the literature have covered this topic; for example, Yan and Zhang (1999) focused on investigating the relationship between cost-effective and profit-effective parallel computing, and in the present work the cost results match their ideal result. With four processors (n = 4), the speed up increased with the working set size X(n), depending on p. This is due both to the faster computation stages and to the reduced volume of data communicated between individual processors in the redistribution stage. Dividing the problem among four processors is not beneficial for small problem sizes, because there is not enough work to distribute. In the meantime, part of the speed up can be attributed to the data independence retained in the processors' local caches between stages of data rearrangement. These observations are in excellent agreement with the qualitative trends of Amdahl's law (Amdahl, 1967) and with empirical data on speed up for parallel computing, where the maximum speed up is attained at a given problem size. Therefore, the current results show that if the fraction p is not high enough for effective scaling, there is no point in further tuning until it has been increased (e.g., the 8x8 matrix at p = 0.9545); increasing the number of processors will not affect the value of the speed up at very low fractions (e.g., p = 0.1). For this reason, parallel computing will only be useful either for small numbers of processors (n = 4) or for problems with very high values of p. The present results also show good agreement with those reported by Gustafson (1988). The efficiency estimates how effectively the processors are used in solving the problem, compared with the effort wasted in communication and synchronization, and it must range between 0 and 1; this work showed that all measured values of efficiency were indeed less than one, in good agreement with the expected efficiency according to Alba's work (Alba, 2002).
Finally, a comparison between spectral-method techniques for solving the time-dependent Schrodinger equation can be found in Feit et al. (1980), Pshenichnov and Sokolov (1964) and Zhirnov and Turev (1980). The difference between those studies and the current work is the present use of distributed-memory parallel programming to solve a critical part of the fast Fourier transform, which reduces the time needed to solve the equation; this has led to an increase in the speed up, approaching the ideal value of Amdahl's law (Amdahl, 1967), in which S(n) equals the number of processors (n). In this work, all speed ups measured on 4 processors have been very close to 4. On the other hand, this work could be retested using the shared-memory approach to solving the Schrodinger equation instead of distributed parallel programming.
The spectral method commonly used in the numerical solution of the time-dependent Schrodinger equation often proves very slow in serial versions, even on the fastest workstations. This study has successfully implemented and tested the parallelization of a new bit-reversal algorithm, shown to be appropriate for multiprocessors, that outperforms an existing one. The distributed bit-reversal FFT speed up is a function of the number of processors (n), which reduces both the computational time and the communication volume between individual processors. The speed up of the parallel algorithm depends strongly on reductions in both the communication time (Tn) and the contention in the multiprocessor. Thus, MPI has been shown to be an optimal method of implementing parallelized bit-reversal algorithms.
APPENDIX
Appendix 1: Bit-reversal algorithm after refactoring
Appendix 2: Bit-reversal algorithm before refactoring
Appendix 3: C++ (MPI) code