INTRODUCTION
The Fast Fourier Transform (FFT) and its inverse (IFFT) play a significant role in many digital signal processing applications. They have been applied in a wide range of fields and applications such as Asymmetrical Digital Subscriber Line (ADSL), Digital Audio Broadcasting (DAB), Digital Video Broadcasting (DVB) and Orthogonal Frequency Division Multiplexing (OFDM) systems.
Recently, OFDM has become a key modulation technique for both broadband wireless
and wireline applications (Van Nee and Prasad, 2000;
Keller and Hanzo, 2000). It has been advocated for Wireless
Local Area (WLAN) and Metropolitan Area (WMAN) applications; it incorporates a large number of orthogonally selected carriers to transmit a high-data-rate stream in parallel in the frequency domain. The problem of Inter-Symbol Interference (ISI) introduced by a multipath channel confines the data rate in conventional single-carrier systems; however, it is significantly reduced in OFDM owing to the parallel transmission of data through multiple carriers (Arioua et al., 2009). The subcarriers' orthogonality in OFDM allows the subcarrier spectra to be densely packed in the frequency domain, resulting in high spectral efficiency. Therefore, multipath immunity and spectral efficiency are two major features of OFDM systems (Verma et al., 2009).
The fast-increasing demand for OFDM-based applications, including future UWB systems and wireless LAN and MAN, makes processing speed a significant factor in Fast Fourier Transform (FFT) architecture design. Indeed, these applications need the FFT and IFFT processors to perform real-time operations for modulation and demodulation of signals. Hence, the study of high-performance VLSI FFT architectures becomes increasingly important.
The FFT/IFFT processor is one of the most complex and computation-intensive modules in the PHY layer of various communication systems (IEEE 802.11a, IEEE 802.11n and IEEE 802.16, etc.) (Li, 2003). The main constraints nowadays for such processors are execution time and power consumption (He and Torkelson, 1996). One of the most essential arithmetic operations used in FFTs is complex multiplication. It is often the most expensive arithmetic operation and one of the dominant factors in determining performance in terms of speed and power consumption. As observed by Widhe (1997), the complex multiplier may consume more than 70% of the power in an FFT/IFFT processor. Therefore, an effective design of the FFT processor is vital in low-power applications. The aim here is to reduce the multiplication complexity and develop an architecture for use in an OFDM-based communication system (IEEE 802.16).
One method proposed in this study to reduce the complexity is to replace the complex multiplication with less expensive real and constant multiplications (Arioua et al., 2010; Jiang et al., 2004). We applied this method to an efficient 16-point FFT based on a 4-point module. The 16-point FFT processor has to satisfy the timing constraint required by the communication system specification employed. This imposes either implementing a highly parallel or parallel-pipelined architecture (Kannan and Srivatsa, 2007) or using a very high operating frequency. However, both options lead to high area usage and power consumption. Therefore, it is essential to build a simple and efficient architecture that keeps the area and power consumption as low as possible while meeting the timing constraint. The performance analysis of the proposed architecture reveals its superiority compared to other efficient butterfly approaches (Li and Wanhammar, 2002; Li et al., 2001; Shanine, 2009).
FFT ALGORITHM
The FFT/IFFT is derived from the Discrete Fourier Transform (DFT). A direct DFT computes each of the N output points one by one, whereas in the FFT/IFFT intermediate computations are shared and performed simultaneously, which saves a considerable amount of time.
The N-point Discrete Fourier Transform is defined as:

X(k) = Σ_{n=0}^{N-1} x(n)·W_{N}^{nk}, k = 0, 1, …, N-1

Where:
W_{N} = e^{-j2π/N} is the twiddle factor, X(k) is the kth harmonic and x(n) is the nth input sample. Direct DFT calculation requires a computational complexity of O(N^{2}). By using the Cooley-Tukey FFT algorithm, the complexity can be reduced to O(N·log_{r}N) (He and Torkelson, 1996; Li and Wanhammar, 2002).
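As a concrete illustration of the O(N^{2}) direct computation, the definition transcribes directly into a few lines of Python (illustrative only, not part of the hardware design):

```python
import cmath

def dft(x):
    """Direct N-point DFT: X(k) = sum_n x(n) * W_N^(n*k), with the
    twiddle factor W_N = e^(-j*2*pi/N). Costs O(N^2) complex operations."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)  # twiddle factor base W_N
    return [sum(x[n] * W ** (n * k) for n in range(N)) for k in range(N)]
```

For example, the DFT of a unit impulse is flat: dft([1, 0, 0, 0]) gives four values all equal to 1 (up to rounding).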
Cooley-Tukey FFT algorithm: The Cooley-Tukey FFT is the most universal of all FFT algorithms, because any factorization of N is possible (Cooley and Tukey, 1965; Meyer-Baese, 2007). The most popular Cooley-Tukey FFTs are those where the transform length is a power of a basis r, i.e., N = r^{S}. These algorithms are referred to as radix-r algorithms. The most commonly used are those of basis r = 2 (radix-2) and r = 4 (radix-4). These algorithms and others, such as radix-2^{i} and split-radix, have been developed on top of the basic Cooley-Tukey algorithm to further reduce the computational complexity (Chang and Nguyen, 2006).
For r = 2 and S stages (N = 2^{S}), for instance, the index mapping of the Cooley-Tukey algorithm gives:

n = (N/2)·n_{1} + n_{2}, n_{1} ∈ {0, 1}, n_{2} ∈ {0, 1, …, N/2-1}

And:

k = k_{1} + 2·k_{2}, k_{1} ∈ {0, 1}, k_{2} ∈ {0, 1, …, N/2-1}
The Cooley-Tukey algorithm is based on a divide-and-conquer approach in the frequency domain and is therefore referred to as Decimation-In-Frequency (DIF) FFT. The DFT formula is split into two summations:

X(k) = Σ_{n=0}^{N/2-1} x(n)·W_{N}^{nk} + Σ_{n=N/2}^{N-1} x(n)·W_{N}^{nk}

X(k) can then be decimated into even- and odd-indexed frequency samples:

X(2k) = Σ_{n=0}^{N/2-1} [x(n) + x(n+N/2)]·W_{N/2}^{nk}

X(2k+1) = Σ_{n=0}^{N/2-1} [x(n) - x(n+N/2)]·W_{N}^{n}·W_{N/2}^{nk}

for k = 0, 1, …, N/2-1. The computational procedure can be repeated through decimation of the N/2-point DFTs X(2k) and X(2k+1). The entire algorithm involves log_{2}N stages, where each stage involves N/2 operation units (butterflies). The computation of the N-point DFT via the decimation-in-frequency FFT, as in the decimation-in-time algorithm, requires (N/2)·log_{2}N complex multiplications and N·log_{2}N complex additions (Petrov and Glesner, 2005).
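The decimation can be sketched as a recursive routine: each level forms the sums x(n)+x(n+N/2) and the twiddled differences (x(n)-x(n+N/2))·W_{N}^{n}, then recurses on each half (a floating-point sketch, not the VHDL design):

```python
import cmath

def dif_fft(x):
    """Radix-2 DIF FFT (Cooley-Tukey). Even outputs X(2k) come from the
    sums x(n)+x(n+N/2); odd outputs X(2k+1) come from the twiddled
    differences (x(n)-x(n+N/2))*W_N^n. N must be a power of 2."""
    N = len(x)
    if N == 1:
        return list(x)
    half = N // 2
    W = cmath.exp(-2j * cmath.pi / N)
    a = [x[n] + x[n + half] for n in range(half)]             # butterfly: addition
    b = [(x[n] - x[n + half]) * W ** n for n in range(half)]  # subtraction, then twiddle
    X = [0] * N
    X[0::2] = dif_fft(a)   # X(2k)
    X[1::2] = dif_fft(b)   # X(2k+1)
    return X
```

For any power-of-two length the result matches the direct O(N^{2}) DFT up to rounding, while using only (N/2)·log_{2}N complex multiplications.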
In general, the other fast algorithms such as radix-4, radix-8, radix-2^{2} and split-radix (based on the same approach) recursively divide the FFT computation into odd- and even-half parts and then extract as many common twiddle factors as possible. The number of required real additions and multiplications is generally used to compare the efficiency of different FFT algorithms.

Fig. 1: 
16-point DIF FFT algorithm based on radix-2 
As indicated by the multiplicative comparison of Shanine (2009), the split-radix is computationally better than other FFT algorithms, owing to its larger share of trivial multiplications with twiddle factors W^{n}, i.e., ±1 and ±j. Nevertheless, the split-radix is an irregular algorithm by nature, because it combines two algorithms, radix-2 and radix-4, used respectively for the even-half and the odd-half operations. This architecture therefore leads to L-shaped butterfly units, which affects the delay of the pipeline path and makes it unbalanced (Chang and Nguyen, 2006).
16-point FFT module: The flow graph of the complete decimation-in-frequency decomposition of the 16-point DFT computation based on radix-2 is represented in Fig. 1. The basic operation in the signal flow graph is the butterfly operation; it is a 2-point DFT computation, as shown in Fig. 2.
The FFT computation of the 16-point radix-2 based architecture is achieved in four stages. The variables x(0) to x(15) denote the input values for the FFT computation and X(0) to X(15) denote the output values. In the butterfly process shown in Fig. 2, the upward arrow executes an addition operation while the downward arrow executes a subtraction. The subtracted value is multiplied by the twiddle factor W_{N} before being passed to the next stage; these computations are accomplished concurrently. The complex multiplication with the twiddle factor requires four real multiplications and two add/subtract operations (Li, 2003; Li and Wanhammar, 2002; Petrov and Glesner, 2005).
Complex multiplication: Complex multiplication is an expensive operation, because its computation needs large chip area and power consumption.

Fig. 2: 
Basic butterfly computation 
We can reduce the multiplicative complexity of the twiddle factor inside the butterfly processor to only three real multiplications and three add/subtract operations. The complex twiddle factor multiplication is:

(X + jY)·(C + jS) = (XC - YS) + j(XS + YC)

However, the complex multiplication can be simplified by sharing the product C·(X + Y):

Re = C·(X + Y) - Y·(C + S) (6)

And:

Im = C·(X + Y) - X·(C - S) (7)
The above implementation of reduced complex multiplication is useful for general complex multiplication. In our FFT implementation, however, the twiddle factor coefficients are known in advance, i.e., C and S in Equations 6 and 7 are precomputed and stored in a memory table. Therefore, it suffices to store the three constants C, C+S and C-S. Those constants can be saved as Canonical Signed Digits (CSD) to implement complex multiplication with a carry-save tree (Jiang et al., 2004; Kannan and Srivatsa, 2007). Thus, the area and power consumption of the complex multiplier can both be reduced.
The storage operation is used to simplify the complex multiplication. For instance, the complex multiplication with a factor of the form:

C·(1 ± j), e.g., W_{16}^{2} = (1 - j)/√2

requires only two real multiplications rather than three. Moreover, the complex multiplication can be reduced further with an efficient number representation in fixed-point arithmetic (He and Torkelson, 1996).
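A minimal sketch of the three-multiplication scheme with the stored constants C, C+S and C-S, plus the two-multiplication case for factors of the form C·(1 - j), follows (floating point here for clarity; the processor stores the constants in fixed point):

```python
def twiddle_mult(X, Y, C, C_plus_S, C_minus_S):
    """(X + jY) * (C + jS) with 3 real multiplications, 1 addition and
    2 subtractions; C+S and C-S are precomputed and stored alongside C."""
    Z = C * (X + Y)          # shared product
    re = Z - Y * C_plus_S    # equals X*C - Y*S
    im = Z - X * C_minus_S   # equals X*S + Y*C
    return re, im

def twiddle_mult_diag(X, Y, C):
    """(X + jY) * C*(1 - j), e.g. W_16^2 with C = 1/sqrt(2):
    only 2 real multiplications are needed."""
    return C * (X + Y), C * (Y - X)
```

Expanding Z - Y·(C+S) gives CX + CY - YC - YS = XC - YS, the real part; expanding Z - X·(C-S) gives CY + XS, the imaginary part, confirming Equations 6 and 7.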
The algorithm of complex multiplication implemented in this work uses three multiplications, one addition and two subtractions, as shown in Fig. 3. This is done at the cost of an additional memory table. In the Hardware Description Language (VHDL) program, the twiddle factor multiplier was implemented using component instantiations of three lpm_mult and three lpm_add_sub modules from the Altera library. It is worth noting that LPM modules are supported by most EDA vendors; LPM provides an architecture-independent library of logic functions or modules that are parameterized to achieve scalability and adaptability (Qi, 2010).
In the 16-point FFT processor using the radix-2 algorithm, multiplications with the W_{16}^{4} = -j and W_{16}^{0} = 1 factors are trivial: multiplication with W_{16}^{4} can be done simply by swapping the real and imaginary parts and changing a sign (Li et al., 2001; Petrov and Glesner, 2005). The remaining complex multiplications in this FFT scheme are nontrivial; W_{16}^{2} was implemented with two real multiplications and the other nontrivial coefficients with three. Therefore, the total number of real multiplications is 24 (for 10 complex multiplications).
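This count can be cross-checked with a short script that walks the radix-2 DIF stages, assuming W^{0}- and W^{4}-type factors are free, W^{2}-type factors cost two real multiplications and all other factors cost three:

```python
def real_mults_radix2(N=16):
    """Count real multiplications of an N-point radix-2 DIF FFT under the
    assumed cost model: trivial factors (exponent multiple of N/4) are free,
    factors of the form +-(1 -+ j)/sqrt(2) (exponent multiple of N/8) cost
    2 real multiplications, all others cost 3 (the 3M scheme above)."""
    total = 0
    size = N
    while size > 1:                    # log2(N) stages
        for _ in range(N // size):     # sub-DFT blocks in this stage
            for k in range(size // 2): # one butterfly per input pair
                exp = k * (N // size)  # twiddle exponent as a power of W_N
                if exp % (N // 4) == 0:
                    continue           # W^0, W^4, ...: trivial
                total += 2 if exp % (N // 8) == 0 else 3
        size //= 2
    return total

print(real_mults_radix2(16))  # -> 24
```

The script reproduces the 24 real multiplications (10 nontrivial complex multiplications) stated above for N = 16.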
The radix-2 algorithm is attractive for its simplicity but has the disadvantage of not being suited to large-point FFT calculation, due to its high multiplier requirement. To reduce the power consumption in the 16-point FFT, the number of multiplications must be reduced. For this reason, we propose a design methodology for a simple and efficient 16-point FFT architecture that keeps the area and power consumption as low as possible.

Fig. 3: 
Implementation of complex computation 
16-POINT FFT/IFFT ARCHITECTURAL DESIGN
Mathematical formulation: The proposed 16-point FFT/IFFT architecture internally uses one 4-point module for computation. The decomposition of the FFT/IFFT can be formulated in the following way, starting from the DFT:

X(k) = Σ_{n=0}^{N-1} x(n)·W_{N}^{nk} (9)

We suppose N = 4T, k = s + Tt and n = l + 4m, where s, l ∈ {0, 1, 2, 3} and m, t ∈ {0, 1, …, T-1}.

Applying these values in Eq. 9, we obtain:

X(s + Tt) = Σ_{l=0}^{3} W_{4}^{lt}·W_{N}^{ls}·[Σ_{m=0}^{T-1} x(l + 4m)·W_{T}^{ms}] (10)

For T = 4, N = 16. The 16-point FFT can be expressed as:

X(s + 4t) = Σ_{l=0}^{3} W_{4}^{lt}·W_{16}^{ls}·[Σ_{m=0}^{3} x(l + 4m)·W_{4}^{ms}] (11)
Equation 11 shows that the computation of the 16-point FFT requires calculating the 4-point FFT twice. First, the 4-point FFT of the appropriate data slot is calculated as described in Eq. 10; the output is then multiplied by the 4×4 inter-dimensional constant coefficients and the 4-point FFT of the resultant data is computed once again, with the fitting data reordering.
The 4-point FFT module uses the radix-2 algorithm for computation. The multiplication with the twiddle factor W_{4}^{1} = -j can be done by swapping and sign inversion (Li et al., 2001; Maharatna et al., 2000). Therefore, the proposed processor computes complex multiplications exclusively with the inter-dimensional constant coefficients; this computation is processed by a multiplier unit, as shown in Fig. 4.
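The two-pass structure of Eq. 11 (4-point FFTs over m, multiplication by the inter-dimensional constants W_{16}^{ls}, then 4-point FFTs over l) can be sketched in floating-point Python; the fft4 helper below is a hypothetical software stand-in for the hardware 4-point module and uses only the trivial twiddle W_{4}^{1} = -j:

```python
import cmath

def fft4(x):
    """4-point DFT via two radix-2 stages; the only twiddle, W_4^1 = -j,
    is a swap-and-negate, so no real multiplications are required."""
    a0, a1 = x[0] + x[2], x[1] + x[3]
    b0, b1 = x[0] - x[2], (x[1] - x[3]) * -1j  # W_4^1 = -j: trivial
    return [a0 + a1, b0 + b1, a0 - a1, b0 - b1]

def fft16(x):
    """16-point FFT per Eq. 11: inner 4-point FFTs, multiplication by the
    4x4 inter-dimensional constants W_16^(l*s), then outer 4-point FFTs."""
    W16 = [cmath.exp(-2j * cmath.pi * p / 16) for p in range(16)]
    # inner 4-point FFTs over m: G[l][s] = sum_m x(l + 4m) * W_4^(m*s)
    G = [fft4([x[l + 4 * m] for m in range(4)]) for l in range(4)]
    # inter-dimensional constant multiplication: H[l][s] = G[l][s] * W_16^(l*s)
    H = [[G[l][s] * W16[(l * s) % 16] for s in range(4)] for l in range(4)]
    # outer 4-point FFTs over l: X(s + 4t) = sum_l H[l][s] * W_4^(l*t)
    X = [0] * 16
    for s in range(4):
        col = fft4([H[l][s] for l in range(4)])
        for t in range(4):
            X[s + 4 * t] = col[t]
    return X
```

The output agrees with a direct 16-point DFT, confirming that all nontrivial multiplications are confined to the 4×4 inter-dimensional constants.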
Architectural design: Many communication systems require high throughput and continuous input/output data. A pipeline architecture is well suited to these ends (Umar et al., 2004); it is also an ideal choice for implementing high-speed, long-size FFTs due to its regular structure and simple control (Wang et al., 2010). The efficiency of a pipeline FFT processor can be improved by optimizing the structure and saving hardware resources (Zhi et al., 2009). The proposed architecture of the 16-point FFT/IFFT module is illustrated in Fig. 4. Our design consists of an essential unit, the butterfly unit, which is the kernel of the 16-point FFT processor, as expressed in Eq. 11. It has a two-stage pipelined structure carrying out trivial complex multiplications and eventually processes the 4-point FFT.

Fig. 4: 
The proposed 16-point FFT design 
This block requires a data memory (input buffer) of size N/2 (N = 4) to store the input serial data in four parallel vectors, so that the data are arranged for computation by the FFT processor according to Eq. 11, since the input data arrive in natural order.
After the computation of the initial input data sequences in the 4-point FFT, the demultiplexer routes the data either to a serial-parallel block in the first FFT computation or to the 16-point FFT output in the second computation. In the first case, the data are arranged for multiplication in the multiplier unit according to Eq. 11.
The arranged data undergo an operation of inter-dimensional complex multiplication. The multiplier unit should normally perform complex multiplications with 16 elements. In practice, however, the reduction approach mentioned above leads to a significant reduction in complex multiplications at the cost of far less expensive additions. Indeed, adders require less hardware on FPGA platforms and consume much less power than multipliers of the same word length. Moreover, they have fewer glitches.
At the circuit level, multiplication by the W^{0} elements can be replaced by a simple connection and multiplication by W_{16}^{4} can be done by swapping and sign inversion. The nontrivial multiplications with W_{16}^{6} and W_{16}^{2} can be implemented with two real multiplications each; however, to reduce the multiplication complexity further, multiplication with the W_{16}^{6} and W_{16}^{2} coefficients (1/√2 constant) was done by add and shift operations (Jung et al., 2003), as shown in Fig. 5 and 6. Applying shift-and-add operations with these twiddle factors inside the multiplier unit instead of complex multiplications reduces area usage as well as power consumption.
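For instance, multiplication by the 1/√2 constant can be approximated purely with shifts and adds. The particular bit pattern below (2^{-1} + 2^{-3} + 2^{-4} + 2^{-6} + 2^{-8}) is an assumption for illustration, since the exact CSD encoding used in the design is not given here:

```python
import math

def mul_inv_sqrt2(v):
    """Approximate v / sqrt(2) using only shifts and adds:
    1/sqrt(2) ~= 2^-1 + 2^-3 + 2^-4 + 2^-6 + 2^-8 = 0.70703125
    (a hypothetical encoding; absolute coefficient error < 1e-4)."""
    return (v >> 1) + (v >> 3) + (v >> 4) + (v >> 6) + (v >> 8)

v = 1 << 15                          # a 16-bit sample value
print(mul_inv_sqrt2(v), round(v / math.sqrt(2)))  # -> 23168 23170
```

The shift-and-add result stays within a few least-significant bits of the exact product, which is below the quantization noise of the fixed-point datapath.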

Fig. 5: 
Internal operations inside the multiplier unit 

Fig. 6: 
Switch and shift-and-add operations inside the multiplier unit 
The complex multiplications with the other nontrivial twiddle factors W_{16}^{1}, W_{16}^{3} and W_{16}^{9} were implemented with three real multiplications each. The employed multiplier unit thus allows us to cut down the number of complex multiplications: the total was reduced to 6 multiplications. Consequently, the total number of real multiplications in our design is 16.
An important result of the proposed 16-point architecture is the smaller number of multiplications used when compared to the reduced 16-point radix-2 FFT (24 real multiplications), the reduced 16-point radix-4 FFT (20 real multiplications) and the attractive 16-point split-radix architecture (Li and Wanhammar, 2002; Li et al., 2001; Wang et al., 2010; Takahashi, 2001).
PERFORMANCE OF THE PROPOSED ARCHITECTURE
16-point FFT implementation: The 16-point FFT computation with the proposed architecture was first coded in VHDL using the Quartus software tool from Altera and then simulated and synthesized on the low-cost Altera Cyclone II EP2C35F672C6 device. The purpose is to determine the resource usage of the proposed design. The functional and timing simulations were successfully performed. The main hardware resources of this design are as follows: the total logic elements used are 1146 and the total embedded 9-bit multiplier elements are 6, while the maximum frequency is 75.93 MHz.
The proposed structure combines three resource reductions: the complex multiplication reduction inside the multiplier unit, the use of a pipeline architecture, and the elimination of complex multiplication for some coefficients inside the multiplier unit.
From the algorithmic perspective, the proposed architecture requires fewer arithmetic operations than the conventional Cooley-Tukey algorithm and than efficient pipeline implementations of radix-2, radix-4 and split-radix. This comparison is shown in Table 1.
It shows that the simple 16-point FFT architecture proposed in this work requires only 13, 66, 80 and 80% of the real multiplications used, respectively, in the Cooley-Tukey, 16-point radix-2 (3M3A: three multiplications and three additions), 16-point radix-4 and 16-point split-radix architectures.
To verify the accuracy of computation of the FFT core, we simulated the calculation of the 16-point FFT in functional simulation mode. The output of the implemented FFT block approximately matches the output of an FFT function written in Matlab representing a theoretical example of FFT calculation. The same input signal (a square signal) was given to both the Matlab and Quartus tools for simulation. The inputs and outputs are plotted in Fig. 7. Another, random signal was also given to the Matlab and Quartus tools for simulation, to validate that the FFT results are correct, as shown in Fig. 7 and 8.

Fig. 7: 
16-point FFT simulation 

Fig. 8: 
Simulation results of the FFT block’s response in Quartus
tool 
Table 1: 
Comparison of the proposed architecture with different pipeline architectures 

The 16-point FFT processor was described in VHDL using fixed-point arithmetic, while the Matlab simulation uses floating-point calculations. Consequently, the resulting VHDL simulation cannot exactly match the floating-point one at the maximum value of the lobes, due to the loss of precision during the computation. In these simulations, we used only 8 bits to represent an FFT point. Using 16 bits, we would obtain an output signal close to the reference one in Matlab shown in Fig. 7. In addition, published results (Umar et al., 2004) of 16-bit fixed-point and 16-bit floating-point FFTs both give highly accurate and comparable results. Therefore, 16-bit fixed-point arithmetic is more suitable for implementing the proposed FFT architecture.
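The effect of the word length can be reproduced in a small experiment that quantizes the input samples to 8 and 16 bits and compares the DFT outputs against a floating-point reference (only input quantization is modeled here, not the full fixed-point datapath of the processor):

```python
import cmath, random

def dft(x):
    """Direct N-point DFT used as the floating-point reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N)
                for n in range(N)) for k in range(N)]

def quantize(v, bits):
    """Round a value in [-1, 1) to a signed fixed-point grid of `bits` bits."""
    scale = 1 << (bits - 1)
    return round(v * scale) / scale

random.seed(1)
x = [complex(random.uniform(-1, 1), 0) for _ in range(16)]
ref = dft(x)
for bits in (8, 16):
    xq = [complex(quantize(v.real, bits), 0) for v in x]
    err = max(abs(a - b) for a, b in zip(dft(xq), ref))
    print(bits, err)  # error shrinks by orders of magnitude from 8 to 16 bits
```

Consistent with the observation above, the 16-bit representation brings the output within a small fraction of the 8-bit error.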
The proposed architecture is most useful when the computation of a large FFT/IFFT is needed: 16-point processor elements can then be used as building blocks for these FFTs/IFFTs, such as the 256-point FFT for an OFDM-based communication system (IEEE 802.16).
CONCLUSION
The number of multiplications has been used in this study as a key metric for comparing FFT algorithms, since it has a large impact on resource usage and power consumption. The efficient 16-point FFT/IFFT architecture proposed in this study gives an advantage in terms of resource usage and power dissipation, using a complex multiplication reduction approach and a pipelined method. The simulation results show that the proposed architecture significantly reduces the number of operations inside the processor compared to other efficient processors. The proposed processor can be integrated with other components to serve as a building block of stand-alone processors (128-, 256-, 1024-point FFT) applied to OFDM-based Wireless Broadband Communication (WiMAX).