INTRODUCTION
With the development of deep-submicron CMOS technologies and the increase
in complexity of VLSI chips, the market for portable applications, digital signal
processors and ASIC implementations has focused significant effort on the design
of low-power systems. Low-power circuits have many advantages over those that
do not employ power-saving strategies. First, in digital systems for portable
applications such as personal communications, hearing aids and personal digital
assistants, they allow the use of lighter batteries and/or prolong battery life
(Chong et al., 2005). Second, low-power techniques
decrease the costs of cooling and packaging. Moreover, circuit reliability deteriorates
with increased heat dissipation, so low-power techniques can improve the robustness
of CMOS circuits. As an essential logic component in microprocessors and digital
signal processing systems (Mosch et al., 2000),
a multiplier contributes significantly to the overall system power consumption.
In this study, we present a multiplier that uses several novel techniques to
minimize its power consumption. Digital multipliers are a major source of power
dissipation in digital signal processors. The array architecture is a popular way
to implement these multipliers because of its regular, compact structure. High power
dissipation in these structures is mainly due to the switching of a large number
of gates during multiplication.
In addition, much power is dissipated by the large number of spurious
transitions on internal nodes. Recent research on signal transition
activity indicates that array multipliers have an architectural disadvantage
in this respect. This is mainly due to non-uniform path delays in the structure,
which result in multiple signal transitions on internal nodes before they settle
to a final value. These multiple transitions are spurious or redundant and,
consequently, dissipate unnecessary power. In fact, in a recent study of an array
multiplier, almost 50% of the dynamic power was consumed by these spurious transitions
(Chen and Chu, 2007).
In the signal processing offered in modern audio applications, multipliers
are certainly among the most power-hungry processing units. At the same time,
they are very frequently used components in Application-Specific Integrated
Circuits (ASICs) and fundamental blocks in Digital Signal Processors (DSPs).
Being rather complex combinational modules with numerous unbalanced reconvergent
paths, multipliers suffer particularly from spurious switching activity generation
and propagation, which can even dominate the total dynamic consumption (Alioto
and Palumbo, 2002). In trying to optimize the efficiency of multipliers,
many past works (Chang et al., 2005; Shams et al., 2002) investigated only
the basic constitutive cell, namely the full-adder. This approach overlooks the
previously mentioned aspect of glitch propagation and does not take
wire parasitics into account either. The easiest solution to reduce spurious
activity propagation is certainly pipelining. Yet, the large power and area
overheads due to the introduction of Flip-Flops (FFs) limit its use to high-speed
implementations (Sulistyo and Ha, 2003). Apart
from that, three fundamental approaches have been proposed in the literature
so far to abate glitch generation and propagation in parallel multipliers, namely:
• Shortening full-adder chains
• Equalizing internal delays
• Aligning sum and carry signals
The first technique consists in rearranging the full-adder cells in order to
carry out the same operation within shorter paths. The advantage is that fewer
glitches are generated and propagated. When this can be done with no extra logic,
as in the Wallace tree introduced by Wallace (1964), the
energy efficiency increases with no limitation other than a growing
routing complexity. Yet, a large proportion of spurious activity still remains.
In the second technique, the delays of the internal signals are equalized by
redesigning the full adders (Mahant-Shetti et al.,
1999). The efficiency generally depends on parasitics and process variations.
The third technique consists in aligning the internal signals by means
of self-timed circuits (Sobelman and Raatz, 1995). For
example, in Chong et al. (2005) and Carbognani
et al. (2006), an independent delay line triggers special cells that
implement the functionality of both a full-adder and a latch. These circuits
present superior glitch suppression (Sulistyo and Ha, 2003).
However, a large energy overhead and strong process dependence represent a heavy
burden.
Two more general techniques for glitch suppression, which do not specifically
address multiplier architectures, have been proposed. The first one acts on
transistor sizes to adjust the cell delays in order to balance reconverging
paths, hence reducing glitch generation. The second one implements a
special resistive cell to increase internal ramp times. Compared to these two
low-power strategies, the technique introduced here presents the following
advantages:
• It limits the area increase, which is significant in Uppalapati et al. (2005)
• It does not require the large, power-consuming transistors needed by Agrawal (1997)
• It is more robust to process and voltage variation
This study confirms the power efficiency of the Wallace tree over
other traditional structures by presenting a comprehensive study of spurious
activity propagation. The effect of transistor sizing is also evaluated: in
low-frequency, low-voltage applications, minimum-size devices decrease the switching
capacitance without leading to large crossover currents (Buergin
et al., 2006). Based on these results, a new multiplier architecture,
called TG-Multiplier, is introduced that reduces spurious activity further compared
with both traditional and recently published architectures. At the same time,
TG-Multiplier has positive effects on leakage reduction and it is robust to
process variation and voltage scaling, without imposing any overhead in terms
of energy. The introduced technique combines static CMOS with transmission gates
that abate glitches via Resistance-Capacitance (RC) equivalent low-pass filtering.
Additionally, it guarantees limited overhead in propagation delay and area,
hence finding potential application in low-frequency portable devices, such
as hearing aids.
SIGNED MULTIPLICATION
Given two unsigned binary n-bit-wide numbers X and Y, the multiplication operation is defined as follows:
Z = X \cdot Y = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} X_{i} Y_{j} 2^{i+j}    (1)
where Z represents the product, X_{i} the ith bit of the multiplicand and Y_{j} the jth bit of the multiplier. The modified Baugh-Wooley algorithm allows the conversion from unsigned to signed multiplication.
When Booth radix-4 recoding is applied, Eq. 1 transforms into the following:
Z = \sum_{j=0}^{n/2-1} Yb_{j} X 2^{2j}    (2)
where Yb_{j} ∈ {-2, -1, 0, 1, 2} represents the jth operand of the multiplier
after Booth recoding. As can be noticed from Eq. 1 and 2,
Booth recoding allows the number of partial products to be halved, hence halving
the number of additions. Yet, the precalculation of Yb_{j} and the multiplication of
the multiplicand by -2, -1 and 2 require extra logic, which is paid for in terms
of power dissipation and area occupation.
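As an illustration of Eq. 1 and 2, the following Python sketch first sums the partial products of an unsigned multiplication and then repeats the computation with radix-4 Booth digits. It is a minimal sketch rather than the implementation used in this study: the function names are hypothetical, n is assumed to be even and the multiplier is assumed to be an n-bit two's-complement value, so that the recoded digits fall in {-2, -1, 0, 1, 2}.

```python
def partial_product_sum(x: int, y: int, n: int) -> int:
    """Unsigned product as the double sum of X_i * Y_j * 2^(i+j) (Eq. 1)."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += (((x >> i) & 1) * ((y >> j) & 1)) << (i + j)
    return total


def booth_radix4_digits(y: int, n: int) -> list[int]:
    """Recode an n-bit two's-complement multiplier into n/2 radix-4 digits."""
    if n % 2:
        raise ValueError("this sketch assumes an even word length n")
    pattern = y & ((1 << n) - 1)          # n-bit two's-complement pattern of y
    bits = [(pattern >> k) & 1 for k in range(n)]
    digits = []
    prev = 0                              # implicit bit y_{-1} = 0
    for j in range(0, n, 2):
        # Each digit looks at three bits: y_{2j+1}, y_{2j}, y_{2j-1}
        digits.append(-2 * bits[j + 1] + bits[j] + prev)
        prev = bits[j + 1]
    return digits


def booth_multiply(x: int, y: int, n: int) -> int:
    """Sum of the n/2 Booth partial products Yb_j * X * 2^(2j) (Eq. 2)."""
    return sum((d * x) << (2 * j) for j, d in enumerate(booth_radix4_digits(y, n)))


# Example: 7 = -1 + 2 * 4, so the 4-bit multiplier 7 recodes to [-1, 2]
print(booth_radix4_digits(7, 4))
print(booth_multiply(5, -3, 8))
```

Only n/2 partial products are summed in `booth_multiply`, compared with n rows in `partial_product_sum`, at the price of the digit precalculation and the handling of the multiples -2X, -X and 2X.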
MULTIPLIER ARCHITECTURES
Traditional multiplier architectures: Equations 1 and
2 suggest a matrix of full- and half-adders; the way these
cells are connected together defines the specific multiplier architecture. The
most widespread architectures are the following:
• Array multiplier
• Carry-save multiplier (CSM)
• Wallace tree
Array multiplier: Advances in VLSI technology have made it possible
to build combinational multipliers, in which extra logic allows the product to be
computed in one step by arrays of simple combinational elements that implement the
add and shift operations. The array multiplier is an efficient layout of a
combinational multiplier. It may be pipelined
to decrease the clock period at the expense of latency. Consider the multiplication
of two unsigned binary integers
X = x_{n-1}…x_{1}x_{0}
Y = y_{n-1}…y_{1}y_{0}

The product P can be expressed as:
P = \sum_{i=0}^{n-1} 2^{i} \left( \sum_{j=0}^{n-1} x_{i} y_{j} 2^{j} \right)
or
P = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} x_{i} y_{j} 2^{i+j}
Each of the n^{2} 1-bit product terms x_{i}y_{j} can be computed by an AND gate. An n x n array of AND gates can compute all the x_{i}y_{j} terms simultaneously. The terms are summed by an array of n(n-1) full adders. The resulting circuit is similar to a two-dimensional ripple-carry adder. The shifts implied by the 2^{i} and 2^{j} factors are implemented by the spatial displacements of the adders along the x and y dimensions.
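The structure described above can be sketched at bit level as follows. This is a minimal behavioural model under stated assumptions, not the circuit of Fig. 1: the function names are hypothetical and the rows are accumulated with a simple row-by-row ripple of the carries, which is only one possible arrangement of the adder array.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One-bit full adder: returns (sum, carry_out)."""
    return a ^ b ^ cin, (a & b) | (cin & (a ^ b))


def array_multiply(x: int, y: int, n: int) -> tuple[int, int]:
    """Multiply two unsigned n-bit integers with an AND array and full adders.

    Returns (product, number_of_full_adders_used).
    """
    xb = [(x >> i) & 1 for i in range(n)]
    yb = [(y >> j) & 1 for j in range(n)]
    # n x n array of AND gates: pp[j][i] = x_i AND y_j
    pp = [[xb[i] & yb[j] for i in range(n)] for j in range(n)]

    acc = [0] * (2 * n)            # running sum, one entry per product bit
    acc[:n] = pp[0]                # row 0 needs no adders
    adders = 0
    for j in range(1, n):          # each further row is displaced by j positions
        carry = 0
        for i in range(n):
            acc[i + j], carry = full_adder(acc[i + j], pp[j][i], carry)
            adders += 1
        acc[j + n] = carry         # carry out of the row
    product = sum(bit << k for k, bit in enumerate(acc))
    return product, adders


print(array_multiply(13, 11, 4))
```

For n = 4 the model uses 4 x 3 = 12 full-adder evaluations, in line with the n(n-1) count given in the text; the index offset `i + j` plays the role of the spatial displacement that implements the 2^{i} and 2^{j} shifts.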
The array multiplier offers lower speed but occupies a smaller area than the
Wallace-tree multiplier. As shown for the array multiplier in Fig. 1a and
b, multiplicand bits are simultaneously input to all the partial
product generators at every stage. All full adders start computing at the same
time without waiting for the propagation of sum and carry signals from the previous
stage. As a result, the outputs of the multiplier have many glitches. Since
the signal transition activity directly influences the dynamic power dissipation,
these spurious transitions at the outputs waste power. Furthermore,
as shown in Fig. 2, since these spurious transitions are propagated
to the next stage continuously, their number grows stage by stage like a
snowball. This causes a significant increase in power dissipation.
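The way spurious transitions arise when all adders start computing at once can be illustrated with a small unit-delay simulation. The sketch below is an assumption-laden toy model, not the simulation flow used in this study: it considers a single n-bit ripple-carry row, assumes that every full adder updates its sum and carry outputs exactly one time step after its inputs change and uses hypothetical function names. It counts all output transitions and subtracts those that are needed to reach the final value.

```python
def adder_stage(a: int, b: int, cin: int) -> tuple[int, int]:
    """One-bit full adder: returns (sum, carry_out)."""
    return a ^ b ^ cin, (a & b) | (cin & (a ^ b))


def settled_outputs(a: int, b: int, n: int) -> list[int]:
    """Steady-state sum and carry bits of an n-bit ripple-carry row."""
    sums, carries, carry = [], [], 0
    for i in range(n):
        s, carry = adder_stage((a >> i) & 1, (b >> i) & 1, carry)
        sums.append(s)
        carries.append(carry)
    return sums + carries


def count_transitions(a_old, b_old, a_new, b_new, n):
    """Return (total, spurious) output transitions after an input change."""
    state = settled_outputs(a_old, b_old, n)
    final = settled_outputs(a_new, b_new, n)
    useful = sum(p != q for p, q in zip(state, final))
    total = 0
    for _ in range(n + 1):                 # n steps suffice for the carry to ripple
        sums, carries = [], []
        for i in range(n):
            # Carry input is the neighbour's carry from the previous time step
            cin = state[n + i - 1] if i > 0 else 0
            s, c = adder_stage((a_new >> i) & 1, (b_new >> i) & 1, cin)
            sums.append(s)
            carries.append(c)
        new_state = sums + carries
        total += sum(p != q for p, q in zip(state, new_state))
        state = new_state
    assert state == final                  # the row has settled
    return total, total - useful


# Worst case for a 4-bit row: the carry ripples through every stage
print(count_transitions(0b0000, 0b0000, 0b1111, 0b0001, 4))
```

In this example three sum bits first rise and then fall again as the carry arrives one stage at a time, so 6 of the 10 counted transitions are spurious. In a multiplier these glitches feed the next row of adders, which is the snowball effect described above.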

Fig. 1: 
(a) Array multiplier, row ripple and (b) array multiplier, column
ripple 

Fig. 2: 
Spurious transition in multiplier 
Carry save multiplier: A carry-save adder (CSA) is just a set of one-bit full
adders without any carry chaining. Therefore, an n-bit CSA receives three n-bit
operands, namely A(n-1)…A(0), B(n-1)…B(0) and CIN(n-1)…CIN(0),
and generates two n-bit result values, SUM(n-1)…SUM(0) and COUT(n-1)…COUT(0).
The most important application of a carry-save adder is to accumulate the partial
products in integer multiplication. The CSM is a very regular structure, in
which the carry bits descend a row while propagating from the least significant
to the most significant bit. Booth recoding has been introduced to speed up
the operation of multiplication. The number of partial products is halved at
the expense of some extra logic inside the Booth encoder. Tree multipliers
arrange the full-adders differently from array multipliers such as
the CSM. In particular, in the Wallace-tree multiplier the AND terms are added
all at once before entering the full-adder matrix shown in Fig.
3. This results in an irregular architecture, which allows the longest path
to be shortened up to the final addition. The latter can be carried out according
to well-known adder topologies.
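The carry-save adder described at the start of this subsection can be sketched on whole words as follows. The operand names follow the text (A, B, CIN, SUM, COUT); the function name is hypothetical and Python integers stand in for the n-bit vectors.

```python
def carry_save_add(a: int, b: int, cin: int) -> tuple[int, int]:
    """Bitwise carry-save addition of three operands.

    Every bit position is an independent full adder, so no carry ripples
    from one position to the next. Returns (SUM, COUT), where COUT has not
    yet been shifted to its weight: a + b + cin == SUM + 2 * COUT.
    """
    sum_bits = a ^ b ^ cin
    carry_bits = (a & b) | (a & cin) | (b & cin)
    return sum_bits, carry_bits


s, c = carry_save_add(0b1011, 0b0110, 0b1101)
print(s, c, s + 2 * c)
```

Because the three operands are reduced to two without any carry propagation, the delay of this step is that of a single full adder regardless of the word length; a carry-propagate adder is needed only once, for the final two vectors.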
The final unit in a parallel multiplier is a fast adder, which adds the
sum and carry bit vectors output by the partial-product reduction tree (PPRT).
Many different fast adders suit parallel multipliers, such as carry-lookahead,
carry-skip and carry-save adders. Glitches are mainly responsible
for the differences in power dissipation among traditional multiplier architectures.
Assuming that all input signals arrive at the same time, the spurious activity
originates from the following:
• Different delays of sum and carry bits in the full-adders
• Uneven collection of the terms in the full-adders
• Irregularity of the multiplier architecture
While the first point is applicable to all the previously mentioned architectures,
the second one holds only for the CSM structures, whereas the Wallace tree suffers
more from the third one.

Fig. 3: 
Wallace carry save multiplier 
In the fundamental skeleton of self-timed multipliers, standard full-adders
are replaced by so-called latch-adder cells, which combine the functionality
of a full-adder and a latch. The output is retained until the enable signal
arrives, hence limiting the spurious activity to the final ripple-carry adder
(RCA). Latch-adders are, however, ratioed cells. Therefore, transistor sizing
is critical and depends on the technology; as a consequence, minimum-size devices
cannot be used. Additionally, the switching of the enable transistors entails a large
energy overhead.
WALLACE MULTIPLIER
Wallace (1964) showed that the partial products can
be added in a fast way by using multiple levels of CSAs, as shown in Fig.
4. In each level of the tree, the numbers are grouped into threes and a CSA
is used to add the numbers in each group. The process continues until there
are only two numbers left to be added. To add these numbers, a carry-propagate
adder is used. Each level reduces the number of terms to be added by a factor
of 3/2, since every CSA replaces three terms with two.
This method can be used to sum up all the bits of the partial products in each column.
The summation is independent and simultaneous because the modified Booth encoders
work in parallel, so that all bits of the partial products arrive at the
adder tree at the same time. Thus, the Wallace tree structure increases the
speed of the multiplication by introducing parallelism.
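The level-by-level reduction can be made concrete by counting the operands that remain after each level. The following sketch assumes the grouping rule stated above (every full group of three operands is replaced by two, leftover operands pass through unchanged); the function name is hypothetical.

```python
def wallace_operand_counts(num_operands: int) -> list[int]:
    """Operand counts before the first level and after each CSA level."""
    counts = [num_operands]
    while counts[-1] > 2:
        groups, leftover = divmod(counts[-1], 3)
        counts.append(2 * groups + leftover)   # each CSA turns 3 operands into 2
    return counts


for n in (4, 8, 16, 32):
    counts = wallace_operand_counts(n)
    print(n, counts, "levels:", len(counts) - 1)
```

For 8 partial products the counts are 8, 6, 4, 3, 2, i.e., four CSA levels before the final carry-propagate addition. The number of levels grows only logarithmically with the number of operands, which is the source of the speed advantage.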

Fig. 4: 
Wallace multiplier 

Fig. 5: 
Wallace tree with 3 :2 counter 

Fig. 6: 
Structure of 4:2 compressors 
The Wallace tree was
first constructed by using 3:2 counters (carry-save adders), as shown in Fig.
5. A 3:2 counter is also called a 3:2 compressor, which has three inputs and two
outputs. This counter has a maximum delay of two XOR gates. The Wallace tree uses
3:2 counters to sum up all the partial products with the same weight and produce
two bits: the carry bit with the weight of n + 1 and the
sum bit with the weight of n. Compressors are mostly used in multipliers to
reduce the operands while adding terms of partial products. A compressor C_{i}
is a combinational device that compresses N input lines in position i to
2 output lines, i.e., sum and carry. The 4:2 compressor has 4 input lines i1,
i2, i3 and i4 that must be summed and two output lines s and c, which are
the results of the compression. The additional lines are the input and output
carries, as in Fig. 6. A design is developed for a multiplier which generates the product
of two numbers using purely combinational logic, i.e., in one gating step.
Using straightforward diode-transistor logic,
it appears presently possible to obtain products. Multiplication of binary fractions
is normally implemented as the addition of a number of summands, each some simple
multiple of the multiplicand, chosen from a limited set of available multiples
on the basis of one or more multiplier digits.
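The behaviour of the 4:2 compressor described above can be checked exhaustively with a short model. The sketch assumes the common construction from two cascaded full adders, which is not necessarily the exact structure of Fig. 6; the signal names i1 to i4, s and c follow the text, whereas `cin`, `cout` and the function names are hypothetical.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """One-bit full adder (3:2 counter): returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(i1: int, i2: int, i3: int, i4: int, cin: int):
    """4:2 compressor built from two full adders: returns (s, c, cout)."""
    s1, cout = full_adder(i1, i2, i3)   # cout does not depend on cin
    s, c = full_adder(s1, i4, cin)
    return s, c, cout


def check_compressor() -> bool:
    """Verify i1+i2+i3+i4+cin == s + 2*(c + cout) for all 32 input patterns."""
    for i1, i2, i3, i4, cin in product((0, 1), repeat=5):
        s, c, cout = compressor_4_2(i1, i2, i3, i4, cin)
        if i1 + i2 + i3 + i4 + cin != s + 2 * (c + cout):
            return False
    return True


print(check_compressor())
```

Since `cout` is produced by the first full adder only, the carry passed to the neighbouring column never ripples further than one position, which keeps the compressor row free of long carry chains.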

Fig. 7: 
Proposed TG multiplier 
The multiplier unit requires
a great deal of equipment, amounting perhaps to 10% of the total semiconductor
complement of a very large modern computer but, because of its simplicity,
probably costing rather less than 10% of the cost of the computer. In a sense, this equipment is used inefficiently. It is useful only for arithmetic
operations and, even in these, circuits with delay times of 30 nsec are used
only about once per microsecond. However, some mismatch between propagation
delay and repetition rate is apparently inherent in the type of circuit postulated
and equally bad mismatches could probably be found in many present computers.
If the word length is increased, the equipment cost rises as the square of the
word length and the time as the logarithm of the word length. The inefficiency
is not intolerable.
Proposed architecture: The results of the previous sections can be summarized
in the following four statements:
• Spurious activity limits multiplier efficiency
• The Wallace tree reduces glitch generation and propagation
• Minimum-size transistors increase energy efficiency
• A more sophisticated approach (Chong et al., 2005) indeed succeeds in decreasing the spurious activity at the expense, however, of a large energy overhead and technology-dependent transistor-level techniques
TG-Multiplier (TG-Mult) is a simple architecture based on the Wallace
tree with minimum-size transistors. The AND gates that create the terms
are implemented in level-restoring static CMOS, which presents purely capacitive
inputs, hence decoupling the multiplier from the input drivers. The full-adder
matrix makes use of standard 18-transistor transmission-gate cells, the advantages
of which will be discussed later. The full-adder cells in the final RCA are
again level-restoring static CMOS gates to recover the driving capability. Therefore,
from outside, an electrical behavior similar to that of a standard CMOS Wallace
tree (Carbognani et al., 2008) is maintained. The proposed
TG-Multiplier and its equivalent circuit are shown in
Fig. 7. Since transmission gates pass full-swing signals, the static CMOS gates
in the RCA do not need to restore the level of the full-adder matrix output signals.
TG-Mult reduces activity by 23 and 29% compared to the two Wallace-tree implementations.
SIMULATION RESULTS
Measurements of dynamic power confirm the results of transistor-level simulations
(Table 1 in terms of relative benefits), although simulated
results tend to underestimate the measured consumption (accuracy ranges from
15 to 30%). According to measurements, the Wallace-tree multiplier dissipates
less energy than the reference CSM. Further area savings are possible by implementing
minimum-size transistors. All of these multipliers are synthesized using Xilinx tools.
Simulation results for the array multiplier, carry save multiplier, Wallace
tree multiplier and TG multiplier are shown in Fig. 8-11,
respectively.
The power reports of the array multiplier, carry save multiplier, Wallace tree multiplier
and TG multiplier are shown in Fig. 12-15,
respectively. The power reports are obtained with the Xilinx XPower analysis tool,
using the VCD file generated during ModelSim functional
verification, and the results are tabulated in Table 1.
Table 1: 
Comparison of power, delay and area of various multipliers 


Fig. 8: 
Simulation results for array multiplier 

Fig. 9: 
Simulation results for carry save multiplier 

Fig. 10: 
Simulation results for Wallace tree multiplier 

Fig. 11: 
Simulation results for TG multiplier 

Fig. 12: 
Power report of array multiplier 

Fig. 13: 
Power analysis of carry save multiplier 

Fig. 14: 
Power analysis of Wallace tree multiplier 

Fig. 15: 
Power analysis of TG multiplier 
The following two reasons allow TG-Multiplier to be robust against leakage:
• The implementation of minimum-size devices
• The reduction of the number of Vdd-to-ground paths
As in the Wallace tree, the implementation of minimum-size devices results in
an increase of the transistor channel resistances, hence decreasing the
subthreshold currents. Substantial further power savings are due to the transmission-gate
full adders, which reduce the number of Vdd-to-ground paths compared to CMOS
mirror full-adders. Finally, the novel multiplier architecture is much more
robust to process parameter and place-and-route variations than other
glitch-suppressing techniques. Compared to delay balancing and self-timed circuits,
the new structure does not rely on the propagation delay of single cells. Therefore,
a limited variation of transistor channel resistance or internal node capacitance
may affect the RC time constant, but not the overall low-pass filtering property
of transmission gates.
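The low-pass filtering argument can be illustrated with a first-order RC model. The sketch below is a simplified, hypothetical illustration rather than a model of the actual cells: it assumes a single-pole RC stage, an ideal rectangular pulse applied to a discharged node and a switching threshold of the following gate at half the supply voltage. Times are normalized to the time constant, and the function names are not taken from this study.

```python
import math


def peak_fraction(pulse_width: float, tau: float) -> float:
    """Peak output of a single-pole RC low-pass, as a fraction of the input
    swing, for an ideal rectangular pulse of the given width."""
    return 1.0 - math.exp(-pulse_width / tau)


def is_suppressed(pulse_width: float, tau: float, threshold: float = 0.5) -> bool:
    """True if the filtered pulse never reaches the assumed switching threshold."""
    return peak_fraction(pulse_width, tau) < threshold


tau = 1.0                                   # normalized RC time constant
for width in (0.2, 0.5, 1.0, 5.0):
    print(width, round(peak_fraction(width, tau), 3), is_suppressed(width, tau))

# Narrowest pulse that still reaches half the swing: tau * ln(2)
print(tau * math.log(2))
```

In this model a glitch much shorter than the time constant is attenuated below the threshold, whereas a settled signal passes almost unchanged. A moderate shift in R or C only moves the cut-off width `tau * ln(2)` proportionally; it does not remove the filtering behaviour, which matches the robustness argument above.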
CONCLUSION
Multiplier energy efficiency is the result of careful trade-offs among several, often contrasting, factors, from the architectural down to the transistor level. The new multiplier structure introduced in this work (TG-Mult) succeeds in reducing spurious switching activity significantly without compromising the benefits through energy-hungry add-on subcircuits. Transmission gates combined with level-restoring static CMOS gates suppress glitches via RC low-pass filtering, while preserving unaltered driving capabilities. Measurements indicate that the proposed TG-based low-power multiplier achieves considerable energy savings over both a regular Wallace architecture and a Wallace tree featuring minimum-size devices.