Abstract: Supporting multiple video and image coding standards on a single platform is a recent trend in multimedia technology, and the two-dimensional forward and inverse Discrete Cosine Transform (2-D DCT/IDCT) serves as the transform core. The 2-D DCT/IDCT core should therefore support multiple transforms and be implemented with low-cost hardware while keeping the performance required for real-time applications. In this study, fast generalized multistandard DCT/IDCT algorithms with low complexity are developed for the 2x2, 4x4 and 8x8 forward/inverse transforms in H.264/Advanced Video Coding (AVC), the 8x8 forward/inverse transform in the Audio Video Coding Standard (AVS), the 4x4 and 8x8 forward/inverse transforms in VC-1 and the 8x8 forward/inverse discrete cosine transform in JPEG and MPEG-1/2/4. These generalized algorithms are implemented with low-cost hardware by using a hardware sharing design and shared factorization of the coefficient multiplications. Furthermore, multiplications are avoided by using only additions and shifts. The multistandard 2-D DCT/IDCT is built from the proposed 1-D DCT/IDCT sharing architecture and a robust transpose buffer. The pipelined hardware architectures are described in VHDL, synthesized, verified and implemented on a low-cost FPGA. Compared with a previous design, the proposed multistandard 2-D DCT/IDCT reduces the number of additions by 29.7% while the number of shifts remains almost the same, and increases the maximum frequency and throughput by 18.2%.
INTRODUCTION
Several image and video coding standards use the forward and inverse discrete cosine transform (DCT/IDCT) as the key component of compression. Transform coding is an efficient technique for energy compaction in image and video coding. The DCT transforms an image from the spatial domain to the frequency domain, which removes spatial correlation. The IDCT performs the inverse operation, converting frequency-domain data back to the spatial domain. The 8x8 2-D DCT/IDCT is employed in JPEG (ISO and IEC, 2009), MPEG-1/2 (ISO and IEC, 1993; Video Coding Standard, 1995) and the MPEG-4 standard (ISO/IEC 14496-2, 2004). The H.264/AVC standard (Wiegand et al., 2003) uses 4x4 and 8x8 integer transforms: the 4x4 integer transform provides fast data compression, while the 8x8 integer transform achieves better energy compaction in high-quality video (Malvar et al., 2003; Wien, 2003). For these reasons, the High profile of the H.264/AVC Fidelity Range Extension (FRExt) selects adaptively between the 4x4 and 8x8 transforms (Marpe et al., 2005). The VC-1 standard (SMPTE Standard, 2006), developed by Microsoft Corporation and standardized by the Society of Motion Picture and Television Engineers (SMPTE), uses 4x4 and 8x8 integer transforms for video coding. The AVS video standard, developed by the China Audio Video Coding Standard Working Group (Gao et al., 2004; Yu et al., 2009), uses an 8x8 integer transform to achieve high efficiency and to code HDTV data. Nowadays, mobile devices tend to combine many functions such as Video on Demand (VOD), Digital Multimedia Broadcasting (DMB), Portable Multimedia Player (PMP), cell phone, camera and so on. For this reason, it is necessary to support the widely used video compression standards on a single system-on-chip (SoC) platform.
In recent years, there has been growing interest in developing multistandard DCT/IDCT architectures for advanced multimedia applications. However, implementing multiple transforms on a single chip increases the area and power consumption, which has a negative impact on the overall system. Hence, the objective is to find an implementation method for multiple transforms that achieves high performance with low-cost hardware. Circuit sharing is an efficient method for reducing hardware cost, so that the area of the integrated multitransform is smaller than the total area of the separate single transforms.
Multiple transforms supporting JPEG, MPEG, VC-1 and H.264 video decoders were proposed by Lee and Cho (2008). Fan and Su (2008) and Su and Fan (2008) presented hardware sharing architectures between H.264/AVC and VC-1 and between H.264/AVC and AVS, respectively, in which the inverse integer transform matrices are decomposed by sparse matrix factorization. A highly parallel architecture for all transforms of H.264/AVC was proposed by Li et al. (2008), where matrix decomposition is used in the inverse transform architecture to save circuitry. Huang et al. (2008) proposed a 2-D transform architecture with a unique kernel to support H.264/AVC, JPEG and MPEG-1/2/4. A flexible transform processor design was proposed by Park et al. (2006) to perform the IDCT in MPEG and the integer transform in H.264/AVC. Qi et al. (2010) proposed multistandard inverse transforms for MPEG-2, MPEG-4 ASP, H.264/AVC and VC-1 video decoders. Unlike previous designs, the multiple inverse transform design presented by Fan et al. (2011) supports the JPEG, MPEG-1/2/4, H.264/AVC, AVS and VC-1 video coding standards.
In this study, the proposed hardware architecture of the fast DCT/IDCT supports the same multistandard video coding as Fan et al. (2011). The architecture has low computational complexity and supports three transform sizes: 2x2, 4x4 and 8x8. We have used the matrix decomposition method to develop the generalized algorithms for the multistandard DCT/IDCT, and we have reduced the number of multiplications as far as possible. The shared factorization method is used to implement these multiplications with a minimum number of shifts and additions. Our multistandard DCT/IDCT architecture is based on a hardware sharing design and provides the flexibility to add other video standards. An Altera Cyclone II FPGA has been used for our pipelined hardware implementations.
REVIEWS OF DCT/IDCT ALGORITHMS
4x4 transform algorithms: The 4x4 forward and inverse transforms are defined respectively in Eq. 1 and 2 (Hwangbo and Kyung, 2010) as:
(1) |
(2) |
where, X is a 4x4 residual block input to the forward transform and W is an inversely quantized 4x4 block input to the inverse transform.
The 1-D 4x4 transform matrices Cf4_VC-1 and Ci4_VC-1 of VC-1 standard are given as:
(3) |
In the H.264/AVC standard, the 1-D 4x4 transform matrices Cf4_AVC and Ci4_AVC are given as:
(4) |
(5) |
The 4x4 forward and inverse Hadamard transforms are defined in Eq. 6 and 7 as:
(6) |
(7) |
where, WD is the 4x4 block containing DC components for each of the 16 4x4 submacroblocks and ZD is an inversely quantized 4x4 DC block. The transform matrix H4 is given as:
(8) |
The 2x2 forward and inverse Hadamard transforms are defined in Eq. 9 and 10 as:
(9) |
(10) |
where, WD is the block of 2x2 DC chroma coefficients and YD is the block after transformation, ZD is an inversely quantized 2x2 DC chroma block. The transform matrix H2 is given as:
(11) |
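As a small illustration of the Hadamard transforms described above, the following sketch applies a Hadamard kernel on both sides of a block. The matrix entries are the usual H.264/AVC Hadamard kernels, written here from the standard because Eq. 8 and 11 did not survive in this text; the two-sided form `h * block * h`, the helper names and the sample block are assumptions made for illustration only.

```python
# Hadamard kernels commonly used for H.264/AVC DC coefficients
# (assumed values; the paper's Eq. 8 and 11 are not reproduced here).
H2 = [[1, 1],
      [1, -1]]
H4 = [[1, 1, 1, 1],
      [1, 1, -1, -1],
      [1, -1, -1, 1],
      [1, -1, 1, -1]]


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def hadamard_2d(block, h):
    """Two-sided transform h * block * h (h is symmetric)."""
    return matmul(matmul(h, block), h)


wd = [[1, 2], [3, 4]]        # made-up 2x2 DC chroma block
yd = hadamard_2d(wd, H2)     # forward transform
back = hadamard_2d(yd, H2)   # same kernel again: 4 * wd, scaling left out
```

Because both kernels are symmetric and orthogonal up to a scale factor, the same adder network can serve the forward and inverse directions; only the normalization, which is left out of this sketch, differs.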
8x8 transform algorithms: The 8x8 forward and inverse transforms are defined respectively in Eq. 12 and 13 (Hwangbo and Kyung, 2010) as:
(12) |
(13) |
where, X is an 8x8 residual block input to the forward transform and W is an inversely quantized 8x8 block input to the inverse transform.
In the H.264/AVC standard, the 1-D 8x8 integer transform matrix C8_AVC is given as:
(14) |
The 1-D 8x8 integer transform matrix C8_AVS of AVS standard is given as:
(15) |
In the VC-1 standard, the 1-D 8x8 integer transform matrix C8_VC-1 is given as:
(16) |
According to Lee and Cho (2008), the 1-D 8x8 integer transform matrix C8_MPEG1/2/4 of JPEG and MPEG1/2/4 standards is given as:
(17) |
PROPOSED GENERALIZED ALGORITHMS FOR MULTISTANDARD DCT/IDCT
Proposed 1-D 4x4 DCT/IDCT generalized algorithm: In Eq. 1 and 2, the 1-D 4x4 DCT matrices
(18) |
(19) |
Where:
(20) |
where a, f and g are the coefficients in DCT/IDCT matrix for different transform standards. Table 1 presents these coefficients of T4 matrix.
Table 1: The 4x4 DCT and IDCT coefficients for different coding standards
In Eq. 20, the T4 matrix can be decomposed as the product of 4 matrices expressed as follows:
(21) |
(22) |
Where:
(23) |
Let B42 be the 2x2 submatrix of B4 given in Eq. 24:
(24) |
The hardware implementation of B42 requires 4 multiplications and 2 additions as shown below:
(25) |
where, I0 and I1 are the inputs, O0 and O1 are the outputs.
We can rewrite Eq. 25 as follows:
(26) |
Unlike Eq. 25, the hardware implementation of B42 based on Eq. 26 requires only 3 multiplications and 3 additions.
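The trade of one multiplication for one addition can be sketched as follows. Since Eq. 24 to 26 are not reproduced in this text, the form of B42 used here, `[[f, g], [g, -f]]`, is an assumption chosen to be consistent with the (f-g) factor mentioned in the text; the function names are hypothetical.

```python
def b42_direct(i0, i1, f, g):
    """Direct form: 4 multiplications and 2 additions."""
    o0 = f * i0 + g * i1
    o1 = g * i0 - f * i1
    return o0, o1


def b42_fast(i0, i1, f, g):
    """Factored form: 3 multiplications and 3 additions.

    (f - g) and (f + g) are fixed constants per standard, so they are
    not counted as run-time additions.
    """
    t = g * (i0 + i1)          # shared term: 1 multiplication, 1 addition
    o0 = t + (f - g) * i0      # 1 multiplication, 1 addition
    o1 = t - (f + g) * i1      # 1 multiplication, 1 addition
    return o0, o1
```

With the commonly cited 4x4 coefficients (f, g) = (2, 1) for H.264/AVC and (22, 10) for VC-1, both functions return the same outputs, and f - g is non-negative in both cases.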
According to Eq. 24,
(27) |
Fig. 1: The FGA of B42
In the same way of B42, the hardware implementation of
(28) |
In this study, we use only unsigned coefficient multiplication. Therefore, we have used (f-g) instead of (g-f) in Eq. 26 and 28, because (f-g≥0) in the different video standards. Figure 1 and 2 show respectively the hardware implementation of B42 and
Without matrix decomposition, the hardware implementation of T4 needs 16 multiplications and 12 additions. With the matrix decomposition given in Eq. 21, it needs only 5 multiplications and 9 additions. Indeed, according to Eq. 23, A4 requires 4 additions, C4 requires 2 multiplications and B4 needs just 5 additions and 3 multiplications thanks to Eq. 26, while D4 does not require any operation because it only permutes the outputs. In the same way, the hardware implementation of
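The operation counts quoted above for T4 can be checked with a short sketch. The layout of T4 below is an assumption, since Eq. 20 is not reproduced in this text; it is chosen so that (a, f, g) = (1, 2, 1) yields the well-known H.264/AVC 4x4 forward matrix. The stage comments map loosely onto A4, B4 and C4, and the function names are hypothetical.

```python
def t4_direct(x, a, f, g):
    """Plain matrix-vector product: 16 multiplications, 12 additions."""
    t4 = [[a, a, a, a],
          [f, g, -g, -f],
          [a, -a, -a, a],
          [g, -f, f, -g]]
    return [sum(c * v for c, v in zip(row, x)) for row in t4]


def t4_fast(x, a, f, g):
    """Butterfly form: 5 multiplications and 9 additions in total."""
    # First butterfly stage: 4 additions.
    s0, s1 = x[0] + x[3], x[1] + x[2]
    d0, d1 = x[0] - x[3], x[1] - x[2]
    # Even part: 2 additions.
    e0, e1 = s0 + s1, s0 - s1
    # Odd part, factored as for B42: 3 multiplications, 3 additions.
    t = g * (d0 + d1)
    y1 = t + (f - g) * d0
    y3 = t - (f + g) * d1
    # Scaling of the even outputs by a: 2 multiplications.
    y0, y2 = a * e0, a * e1
    return [y0, y1, y2, y3]
```

The final output ordering corresponds to the permutation stage, which costs no arithmetic.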
Proposed 1-D 8x8 DCT/IDCT generalized algorithm: In Eq. 12 and 13, the 1-D 8x8 DCT matrix
(29) |
Where:
(30) |
Fig. 2: The FGA of
where a~g are the coefficients in DCT/IDCT matrix for different transform standards. Table 2 presents these coefficients of T8 matrix.
In Eq. 30, the T8 matrix can be decomposed as the product of 5 matrices expressed as follows:
(31) |
(32) |
Where:
(33) |
Table 2: The 8x8 DCT/IDCT coefficients for different coding standards
In Eq. 33 and 24, B42 is a submatrix of C8. According to Eq. 26, the hardware implementation of B42 requires 3 multiplications and 3 additions. Thus, the hardware implementation of C8 requires 3 multiplications and 5 additions.
Let B84 be the 4x4 submatrix of B8 given in Eq. 34:
(34) |
B84 can be decomposed as the sum of 4 matrices expressed as follows:
(35) |
Where:
(36) |
The hardware implementations of S1, S2, S3 and S4 can be expressed, respectively as Eq. 37, 38, 39 and 40; each one needs just 3 multiplications and 3 additions:
(37) |
(38) |
(39) |
(40) |
(41) |
where, I0, I1, I2 and I3 are the inputs, O0, O1, O2 and O3 are the outputs.
The direct hardware implementation of B84 needs 16 multiplications and 12 additions, whereas with Eq. 35 B84 requires just 12 multiplications and 16 additions. We have thus replaced 4 multiplications by 4 additions, which is a net gain because the multiplierless implementation of each multiplication needs more than one addition. Consequently, the hardware implementation of B8 needs 12 multiplications and 20 additions. In the same way as B4, C8 requires 3 multiplications and 5 additions. D8 needs 2 multiplications and A8 needs 8 additions. E8 does not require any operation, because it only permutes the outputs. Therefore, the hardware implementation of T8 requires 17 multiplications and 33 additions, instead of 64 multiplications and 56 additions without matrix decomposition. In the same way, according to Eq. 32, the hardware implementation of
(42) |
Thus, the hardware implementations of
(43) |
(44) |
(45) |
(46) |
(47) |
where, I0, I1, I2 and I3 are the inputs, O0, O1, O2 and O3 are the outputs.
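The per-stage operation counts quoted above for the T8 decomposition can be tallied in a few lines. The numbers are taken directly from the text; the dictionary layout and variable names are only an illustrative bookkeeping aid.

```python
# (multiplications, additions) per factor of T8, as stated in the text.
stages_8x8 = {
    "A8": (0, 8),
    "B8": (12, 20),
    "C8": (3, 5),
    "D8": (2, 0),
    "E8": (0, 0),   # pure output permutation
}

mults = sum(m for m, _ in stages_8x8.values())
adds = sum(a for _, a in stages_8x8.values())

# A direct 8x8 matrix-vector product: 8 rows of 8 products and 7 sums.
direct_mults = 8 * 8
direct_adds = 8 * 7

print(mults, adds, direct_mults, direct_adds)
```

The tally reproduces the 17 multiplications and 33 additions stated for T8, against 64 and 56 for the direct product.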
PROPOSED HARDWARE SHARING DESIGN OF FAST MULTISTANDARD DCT/IDCT
Proposed sharing hardware design of multistandard 1-D DCT/IDCT: According to Eq. 1 and 12, the 1-D of 4x4 and 8x8 DCT are computed respectively as follows:
(48) |
In the same way, in Eq. 2 and 13, the 1-D of 4x4 and 8x8 IDCT are computed respectively as follows:
(49) |
According to Eq. 48, 18 and 29, the 1-D 4x4 and 8x8 DCT can be implemented respectively as the hardware implementation of T4 and T8.
In the same way, according to Eq. 49, 19 and 29, the 1-D 4x4 and 8x8 IDCT can be implemented respectively as the hardware implementation of
In Eq. 21, the product of A4, B4 and C4 is expressed as follows:
(50) |
According to Eq. 31, the product of B8, C8 and D8 is expressed as follows:
(51) |
In Eq. 50 and 51, we note that
Fig. 3: The proposed sharing FGA of pipelined multistandard 1-D DCT
As indicated by Fig. 3, the proposed sharing FGA of the multistandard 1-D DCT contains the 1-D 2x2, 4x4 and 8x8 DCT. This sharing FGA includes 8 operation modes, each of which is selected by the signal Sel. We have developed the proposed sharing FGA of the pipelined multistandard 1-D DCT with only 4 stages. Furthermore, we have balanced speed against the number of stages by optimizing the critical paths of the pipeline. In particular, we have avoided placing a multiplication block in series with other operators in stages 3 and 4. In Fig. 3, we have saved 6 paths between stage 2 and stage 3 in the lower half of the FGA by using common paths, in order to reduce the number of pipeline registers. In the same way, we have developed the proposed sharing FGA of the multistandard 1-D IDCT presented in Fig. 4. Table 3 lists the selection signals Sel of the multiplexers and multiplication blocks for the multiple operation modes.
In Fig. 3 and 4, the multiplication blocks are implemented with a multiplierless method, in which adders and shifters are used instead of multipliers.
Fig. 4: The proposed sharing FGA of pipelined multistandard 1-D IDCT
Table 3: Selection signals of multiplexers for multiple operation modes of the proposed multistandard 1-D DCT/IDCT
Each multiplication block supports all modes of multistandard video coding. For this purpose, we have used the shared factorizations shown in Table 4 and 5 to reduce the number of additions and shifts.
Table 4: Shared factorizations for multistandard 1-D DCT coefficients and complexity
Table 5: Shared factorizations for multistandard 1-D IDCT coefficients and complexity
Furthermore, we have limited the latency of the multiplierless computation to a maximum delay of 3 additions for each multiplication block, in order to increase the maximum frequency of the pipeline.
As indicated by Table 4 and 5, we have minimized the hardware required for the coefficient multiplications, in terms of additions and shifts, by searching for an efficient shared factorization between the different coefficient values of the multistandard transform while keeping a maximum delay of 3 additions for each coefficient multiplication. Taking the coefficient a as an example, Fig. 5 presents its flow graph, which requires only 4 additions and 5 shifts for its hardware implementation. The signal Sel selects the output for the active video standard.
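The idea of a shared, multiplierless constant multiplication can be sketched as below. This is not the factorization of Table 4 or Fig. 5, which are not reproduced in this text; the three constants (8, 12 and 17), the mode names and the function name are illustrative assumptions. The point is only that one shifted partial product can be reused by several modes.

```python
def mul_shared(x, mode):
    """Multiply x by a mode-dependent constant using shifts and adds only.

    The partial product 8*x is computed once and shared by all modes;
    `mode` plays the role of a selection signal such as Sel.
    """
    x8 = x << 3                    # shared term: 8*x (one shift)
    if mode == "x8":
        return x8                  # 8*x: no addition
    if mode == "x12":
        return x8 + (x << 2)       # 12*x = 8*x + 4*x: one addition
    if mode == "x17":
        return (x8 << 1) + x       # 17*x = 16*x + x: one addition
    raise ValueError("unknown mode: " + mode)
```

In hardware the shifts are free wiring, so the cost of such a block is dominated by the adders and the output multiplexer.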
Proposed multistandard 2-D DCT/IDCT sharing hardware design: We have performed the 2-D DCT/IDCT hardware implementation by using the row-column approach (Chang et al., 2000; Hyesook et al., 2000).
Fig. 5: Flow graph of (a block) coefficient multiplication
Fig. 6: The 2-D DCT/IDCT architecture
Fig. 7: Low area architecture of transpose buffer supporting multistandard transform
As indicated by Fig. 6, the hardware architecture of the 2-D DCT/IDCT requires two 1-D DCT/IDCT blocks and one transpose buffer block that connects the two 1-D DCT/IDCT architectures. In this work, we have used 16-bit arithmetic accuracy in these architectures, as in Fan et al. (2011).
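The row-column approach can be sketched in software as two passes of the same 1-D transform separated by a transposition. The 1-D kernel used below is the well-known H.264/AVC 4x4 forward matrix, taken from the standard rather than from this text; the helper names are hypothetical and scaling and quantization are left out.

```python
# H.264/AVC 4x4 forward core matrix (assumed here as the 1-D kernel).
C4 = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]


def transform_1d(vec, c=C4):
    """1-D transform of one line: c * vec."""
    return [sum(cij * v for cij, v in zip(row, vec)) for row in c]


def transpose(m):
    """Swap rows and columns, as the transpose buffer does."""
    return [list(col) for col in zip(*m)]


def transform_2d_row_column(block):
    """First 1-D pass on rows, transpose, second 1-D pass, transpose back."""
    pass1 = [transform_1d(row) for row in block]             # X * C^T
    pass2 = [transform_1d(row) for row in transpose(pass1)]  # C * X^T * C^T
    return transpose(pass2)                                  # C * X * C^T


def transform_2d_matrix(block, c=C4):
    """Reference result computed as the matrix product C * X * C^T."""
    n = len(c)
    cx = [[sum(c[i][k] * block[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    return [[sum(cx[i][k] * c[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

Both routes give the same coefficients, which is why a single 1-D datapath design can be instantiated twice around a transpose buffer.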
In the 2-D DCT/IDCT architecture, we have used an efficient low-area architecture for the transpose buffer that supports the 2x2, 4x4 and 8x8 modes while keeping high throughput in the full pipeline. For the 2-D 2x2, 4x4 and 8x8 DCT/IDCT, we obtain respectively 2, 4 and 8 output pixels at every clock cycle in the full pipeline without interruption. Figure 7 shows the low-area transpose buffer architecture, which contains a single memory of 64 words, each 16 bits wide.
For the 8x8 operation mode, the memory is organized as an 8x8 array of 16-bit words that can receive inputs from either the horizontal or the vertical direction and shifts data in that same direction for 8 clock cycles. At the beginning, the control unit delays the memory operation by 4 clock cycles so that the 1-D row DCT/IDCT pipeline is full. The first 1-D DCT/IDCT architecture then writes its output values line by line in the vertical direction until the transpose buffer is full. Thereafter, while the second 1-D DCT/IDCT architecture reads its input values column by column from one direction of the memory (horizontal or vertical), the first 1-D DCT/IDCT architecture simultaneously writes its output values line by line in the same direction for 8 clock cycles. The control unit generates the control signals that manage this operation: they shift the data and switch the select lines of the multiplexer and demultiplexer so that the direction of data flow changes after every 8 clock cycles.
We have also designed the control unit to support the 2-D 2x2 and 4x4 DCT/IDCT. It delays the memory operation by 1 and 3 clock cycles for the 2x2 and 4x4 modes, respectively, so that the 1-D row DCT/IDCT pipeline is full. The control unit also switches the direction of data flow after every 2 and 4 clock cycles for the 2x2 and 4x4 modes, respectively.
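The alternating-direction behaviour of a single-memory transpose buffer can be modelled with the sketch below. It is a behavioural approximation under stated assumptions, not the circuit of Fig. 7: one cycle is modelled as one line read plus one line write, the initial pipeline delays are ignored, and the function name is hypothetical.

```python
def transpose_stream(blocks, n):
    """Stream n x n blocks through one n x n memory, one line per cycle.

    While a line of the previous block is read out, a row of the new
    block is written into the line just freed. The access direction
    flips after every n cycles, so the memory is never duplicated.
    """
    mem = [[0] * n for _ in range(n)]
    horizontal = True
    out = []
    for k, blk in enumerate(blocks + [None]):   # extra pass flushes the buffer
        result = []
        for c in range(n):
            if k > 0:                           # read line c (previous block)
                if horizontal:
                    result.append(mem[c][:])
                else:
                    result.append([mem[r][c] for r in range(n)])
            if blk is not None:                 # write row c of the new block
                for j in range(n):
                    if horizontal:
                        mem[c][j] = blk[c][j]
                    else:
                        mem[j][c] = blk[c][j]
        if k > 0:
            out.append(result)
        horizontal = not horizontal             # switch direction every n cycles
    return out
```

Each output block is the transpose of the corresponding input block, and the same function covers the 2x2, 4x4 and 8x8 modes by changing `n`.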
IMPLEMENTATION RESULTS AND PERFORMANCE ANALYSIS
We have used an Altera Cyclone II (EP2C35F672C6) FPGA for the hardware implementation of the proposed multistandard 1-D and 2-D DCT/IDCT; the shifters are implemented by wiring. Table 6 compares the architectures of different 1-D forward and inverse transforms for H.264/AVC, AVS, VC-1, JPEG and MPEG-1/2/4. In terms of computational complexity, the number of additions of our proposed 1-D DCT/IDCT design is reduced by 29.7% compared to the 1-D IDCT design presented by Fan et al. (2011). This reduction is obtained thanks to the proposed generalized algorithms for the multistandard DCT/IDCT, which decrease the number of coefficient multiplications, and to the efficient shared factorization between the different values of each coefficient multiplication. On the other hand, the number of shifters required by our DCT/IDCT design remains almost the same as in Fan et al. (2011). In terms of speed, the maximum frequency and throughput of the proposed 1-D DCT/IDCT design are increased by 24% in comparison with Fan et al. (2011); note that the architecture presented by Fan et al. (2011) used TSMC 0.18-μm CMOS standard cell technology for synthesis and chip layout. This increase in speed is explained by our optimization of the critical paths and by the reduction of the latency of the multiplierless computation to a maximum delay of 3 additions in each multiplication block. Furthermore, our proposed multistandard 1-D DCT/IDCT design requires 19 multiplexers while the architecture presented by Fan et al. (2011) needs 8. The advantage of the proposed architecture, however, is that other standards can be added without adding any multiplexer, whereas the architecture of Fan et al. (2011) does not allow any other standard to be added.
The proposed multistandard 2-D DCT/IDCT supports JPEG/MPEG-1/2/4, H.264/AVC 2x2/4x4/8x8, VC-1 4x4/8x8 and AVS 8x8. Fan et al. (2011) also used the row-column approach (Chang et al., 2000; Hyesook et al., 2000) in their 2-D DCT/IDCT architecture. As shown in Table 7, the proposed multistandard 2-D DCT/IDCT reduces the number of additions by 29.7% compared to Fan et al. (2011), while the number of shifts remains almost the same.
Table 6: Comparison for various multistandard 1-D DCT/IDCT architectures
Table 7: Comparison for various multistandard 2-D DCT/IDCT architectures
Like Fan et al. (2011), we obtain 2, 4 and 8 output pixels at every clock cycle in the 2x2, 4x4 and 8x8 modes of our 2-D DCT/IDCT, respectively, in the full pipeline without interruption. The maximum frequency and throughput of the proposed multistandard 2-D DCT/IDCT are increased by 18.2% in comparison with the architecture presented by Fan et al. (2011), thanks to the efficiency of the proposed multistandard 1-D DCT/IDCT architecture.
CONCLUSION
In this study, new generalized multistandard 1-D DCT/IDCT algorithms and their hardware sharing architectures have been proposed for H.264/AVC, AVS, VC-1, JPEG and MPEG-1/2/4 by using matrix decomposition, optimization of the critical paths and shared factorization of the coefficient multiplications using only shifts and additions. The proposed multistandard 1-D DCT/IDCT achieves lower complexity and higher speed than the previously published design. The proposed multistandard 2-D DCT/IDCT is based on the 1-D hardware sharing architecture and operates at a clock rate of up to 147 MHz. Furthermore, the proposed multistandard 2-D DCT/IDCT sharing design provides the flexibility to support other video standards by upgrading only the shared factorization of each coefficient multiplication block.