HOME JOURNALS CONTACT

Research Journal of Information Technology

Year: 2014 | Volume: 6 | Issue: 1 | Page No.: 27-45
DOI: 10.17311/rjit.2014.27.45
Low Complexity and Low-Cost Hardware Sharing Design of Fast Multistandard 2-D DCT/IDCT for Image/Video Coding
Mohamed El Aakif, Said Belkouch and Moha M`rabet Hassani

Abstract: The multistandard video and image codec using a single platform is the recent trend in multimedia technologies, in which the Two-Dimensional forward and inverse Discrete Cosine Transform (2-D DCT/IDCT) is used as a transform core. Therefore, the 2-D DCT/IDCT core should support multiple transforms and that to be implemented with low cost hardware, while keeping the required performances for real time applications. In this study, a fast multistandard DCT/IDCT generalized algorithms are developed with low complexity for 2x2, 4x4 and 8x8 forward/inverse transforms in H.264/Advanced Video Coding (AVC), 8x8 forward/inverse transform in Audio Video Coding Standard (AVS), 4x4 and 8x8 forward/inverse transforms in VC-1 and 8x8 forward/inverse discrete cosine transform in JPEG and MPEG-1/2/4. The hardware implementation of these generalized algorithms is performed with low cost hardware, in which the hardware sharing design and shared factorization of coefficient multiplications are used. Furthermore the multiplications are avoided by using additions and shifts. The multistandard 2-D DCT/IDCT is achieved with the proposed 1-D DCT/IDCT sharing architecture and a robust transpose buffer. These pipelined hardware architectures are described in VHDL hardware language, synthesized, verified and implemented on low cost FPGA. The implementation results of the proposed multistandard 2-D DCT/IDCT show a decrease of computational complexity by 29.7% of additions while the number of shifts remains almost the same and an increase of maximum frequency and throughput by 18.2% in comparison with other previous design.

Fulltext PDF Fulltext HTML

How to cite this article
Mohamed El Aakif, Said Belkouch and Moha M`rabet Hassani, 2014. Low Complexity and Low-Cost Hardware Sharing Design of Fast Multistandard 2-D DCT/IDCT for Image/Video Coding. Research Journal of Information Technology, 6: 27-45.

Keywords: shared factorization, hardware sharing design, multiplierless, fast algorithm and Multistandard DCT/IDCT

INTRODUCTION

Actually, several standards of image and video coding use the forward and inverse discrete cosine transform (DCT/IDCT) as the key component in image and video compression. The transform coding is an efficient technique for energy compaction in image and video coding. Indeed, the DCT is used to transform image from the spatial domain to the frequency domain which leads to remove the spatial correlation. The IDCT is used to perform the inverse operation of DCT by decoding the data in frequency domain. The 8x8 2-D DCT/IDCT is employed in JPEG (ISO and IEC, 2009), MPEG-1/2 (ISO and IEC, 1993; Video Coding Standard, 1995) and MPEG-4 standard (ISO/IEC 14496-2, 2004). The H.264/AVC standard (Wiegand et al., 2003) uses 4x4 and 8x8 integer transforms, the 4x4 integer transform is used to perform fast data compression, while the 8x8 integer transform is employed to achieve better energy compaction in high quality video (Malvar et al., 2003; Wien, 2003). For these reasons, the high profile of H.264/AVC Fidelity Range Extension (FRExt) leads to select an adaptive mode amongst 4x4 and 8x8 transforms (Marpe et al., 2005). The VC-1 standard (SMPTE Standard, 2006) uses 4x4 and 8x8 integer transforms for the video coding, this video standard is developed by Microsoft Corporation and standardized by the Society of Motion Picture and Television Engineers (SMPTE). The AVS video standard is developed by China Audio Video Coding Standard Working Group (Gao et al., 2004; Yu et al., 2009), in which an 8x8 integer transform is used to achieve high efficiency and to be applied to code HDTV data. Nowadays, the trend of the mobile devices is to have different functions such as Video on Demand (VOD), Digital Multimedia Broadcasting (DMB), Portable Multimedia Player (PMP), cell phone, camera and so on. Due to this reason, it is necessary to support the extensively used video compression standards in a single system-on-chip (SoC) platform.

In recent years, there is a growing interest to develop multistandard DCT/IDCT architectures for advanced multimedia applications. Therefore, the hardware implementation of multiple transforms into a single chip increases the area and power consumption which has a negative impact on the overall system. Hence, the objective is to find a suitable implementation method of multiple transforms that achieves high performances with low cost hardware. The circuit sharing is an efficient method for the hardware cost reduction, so that the area of the integrated multitransform to be smaller than the total areas of different single transforms.

The multiple transforms that support the JPEG, MPEG, VC-1 and H.264 video decoder are proposed by Lee and Cho (2008). Fan and Su (2008) and Su and Fan (2008) presented respectively hardware sharing architectures between H.264/AVC and VC-1 and between H.264/AVC and AVS, in which the inverse integer transform matrices are decomposed by using the sparse matrix factorizations. A high parallel architecture for all transforms of H.264/AVC is proposed by Li et al. (2008), where the matrix decomposition is used in inverse transform architecture for circuit saving. Huang et al. (2008) proposed the 2-D transform architecture with a unique kernel to support H.264/AVC, JPEG and MPEG-1/2/4. A flexible transform processor design is proposed by Park et al. (2006) to achieve IDCT in MPEG and integer transform in H.264/AVC. Qi et al. (2010) proposed the multistandard inverse transforms for MPEG-2, MPEG-4 ASP, H.264/AVC and VC-1 video decoder. Unlike previous designs, a multiple inverse transforms design that support JPEG, MPEG-1/2/4, H.264/AVC, AVS and VC-1 standard video coding is presented by Fan et al. (2011).

In this study, the proposed hardware architecture of fast DCT/IDCT supports multistandard video coding as Fan et al. (2011). This architecture is proposed with low computational complexity; it also supports three types of variable-sized transform such as 2x2, 4x4 and 8x8 transform. Indeed, we have used matrix decomposition method to develop the generalized algorithms for multistandard DCT/IDCT. Moreover, we have reduced as maximum possible the number of multiplications. The shared factorization method is used to implement these multiplications with a minimum number of shifts and additions. Our architecture of multistandard DCT/IDCT is based on hardware sharing design, while providing the flexibility for adding other standard video. The Altera Cyclone II FPGA has been used to perform our pipelined hardware implementations.

REVIEWS OF DCT/IDCT ALGORITHMS

4x4 transform algorithms: The 4x4 forward and inverse transforms are defined respectively in Eq. 1 and 2 (Hwangbo and Kyung, 2010) as:

(1)

(2)

where, X is a 4x4 residual block input to the forward transform and W’ is an inversely quantized 4x4 block input to the inverse transform.

The 1-D 4x4 transform matrices Cf4_VC-1 and Ci4_VC-1 of VC-1 standard are given as:

(3)

In the H.264/AVC standard, the 1-D 4x4 transform matrices Cf4_AVC and Ci4_AVC are given as:

(4)

(5)

The 4x4 forward and inverse Hadamard transforms are defined in Eq. 6 and 7 as:

(6)

(7)

where, WD is the 4x4 block containing DC components for each of the 16 4x4 submacroblocks and ZD is an inversely quantized 4x4 DC block. The transform matrix H4 is given as:

(8)

The 2x2 forward and inverse Hadamard transforms are defined in Eq. 9 and 10 as:

(9)

(10)

where, WD is the block of 2x2 DC chroma coefficients and YD is the block after transformation, ZD is an inversely quantized 2x2 DC chroma block. The transform matrix H2 is given as:

(11)

8x8 transform algorithms: The 8x8 forward and inverse transforms are defined respectively in Eq. 12 and 13 (Hwangbo and Kyung, 2010) as:

(12)

(13)

where, X is a 8x8 residual block input to the forward transform and W’ is an inversely quantized 8x8 block input to the inverse transform.

In the H.264/AVC standard, the 1-D 8x8 integer transform matrix C8_AVC is given as:

(14)

The 1-D 8x8 integer transform matrix C8_AVS of AVS standard is given as:

(15)

In AVS standard, the 1-D 8x8 integer transform matrix C8_VC-1 is given as:

(16)

According to Lee and Cho (2008), the 1-D 8x8 integer transform matrix C8_MPEG1/2/4 of JPEG and MPEG1/2/4 standards is given as:

(17)

PROPOSED GENERALIZED ALGORITHMS FOR MULTISTANDARD DCT/IDCT

Proposed 1-D 4x4 DCT/IDCT generalized algorithm: In Eq. 1 and 2, the 1-D 4x4 DCT matrices and and IDCT matrices Ci4 and H4 can be respectively expressed as follows:

(18)

(19)

Where:

(20)

where a, f and g are the coefficients in DCT/IDCT matrix for different transform standards. Table 1 presents these coefficients of T4 matrix.

Table 1: The 4x4 DCT and IDCT coefficients for different coding standards

In Eq. 20, the T4 matrix can be decomposed as the product of 4 matrices expressed as follows:

(21)

(22)

Where:

(23)

Let B42 the 2x2 submatrix of B4 given in Eq. 24:

(24)

The hardware implementation of B42 requires 4 multiplications and 2 additions as shown below:

(25)

where, I0 and I1 are the inputs, O0 and O1 are the outputs.

We can rewrite Eq. 25 as follows:

(26)

Unlike Eq. 25, the hardware implementation of B42 based on Eq. 26 requires only 3 multiplications and 3 additions. According to Eq. 24, is expressed as follows:

(27)

Fig. 1: The FGA of B42

In the same way of B42, the hardware implementation of requires just 3 multiplications and 3 additions as shown below:

(28)

In this study, we use just the unsigned coefficient multiplication. Therefore, we have used in Eq. 26 and 28 (f-g) instead of (g-f), because (f-g≥0) in different video standards. Figure 1 and 2 show respectively the hardware implementation of B42 and as the Flow-Graph Algorithm (FGA).

We need 16 multiplications and 12 additions for the hardware implementation of T4, when there is no matrix decomposition. Whereas, we need just 5 multiplications and 9 additions if we use the matrix decomposition indicated in Eq. 21. Indeed, according to Eq. 23, A4 requires 4 additions, C4 requires 2 multiplications, B4 needs just 5 additions and 3 multiplications thanks to Eq. 26, while D4 do not require any operation which only permutes the outputs. In the same way, the hardware implementation of requires just 5 multiplications and 9 additions.

Proposed 1-D 8x8 DCT/IDCT generalized algorithm: In Eq. 12 and 13, the 1-D 8x8 DCT matrix and IDCT matrix C8 can be respectively expressed as follows:

(29)

Where:

(30)

Fig. 2: The FGA of

where a~g are the coefficients in DCT/IDCT matrix for different transform standards. Table 2 presents these coefficients of T8 matrix.

In Eq. 30, the T8 matrix can be decomposed as the product of 5 matrices expressed as follows:

(31)

(32)

Where:

(33)

Table 2: The 8x8 DCT/IDCT coefficients for different coding standards

In Eq. 33 and 24, B42 is a submatrix of C8. According to Eq. 26, the hardware implementation of B42 requires 3 multiplications and 3 additions. Thus, the hardware implementation of C8 requires 3 multiplications and 5 additions.

Let B84 the 4x4 submatrix of B8 given in Eq. 34:

(34)

B84 can be decomposed as the sum of 4 matrices expressed as follows:

(35)

Where:

(36)

The hardware implementations of S1, S2, S3 and S4 can be expressed, respectively as Eq. 37, 38, 39 and 40; each one needs just 3 multiplications and 3 additions:

(37)

(38)

(39)

(40)

(41)

where, I0, I1, I2 and I3 are the inputs, O0, O1, O2 and O3 are the outputs.

The direct hardware implementation of B84 needs 16 multiplications and 12 additions. Whereas, thanks to Eq. 35 B84 requires just 12 multiplications and 16 additions. So we have replaced 4 multiplications by 4 additions, this is a gain in term of additions, because the hardware implementation of each multiplication needs more of one addition. Thus, the hardware implementation of B8 needs 12 multiplications and 20 additions. In the same way as B4, C8 requires 3 multiplications and 5 additions. D8 needs 2 multiplications and A8 needs 8 additions. E8 do not require any operation, because it only permutes the outputs. Therefore, the hardware implementation of T8 requires 17 multiplications and 33 additions, instead of 64 multiplications and 56 additions if there is no decomposition matrix. In the same way, according to Eq. 32, the hardware implementation of requires just 17 multiplications and 33 additions. Indeed, is a submatrix of and the hardware implementation of is expressed in Eq. 28. On the other hand, is a submatrix of , so according to Eq. 35 can be expressed as follows:

(42)

Thus, the hardware implementations of can be expressed as follows:

(43)

(44)

(45)

(46)

(47)

where, I0, I1, I2 and I3 are the inputs, O0, O1, O2 and O3 are the outputs.

PROPOSED HARDWARE SHARING DESIGN OF FAST MULTISTANDARD DCT/IDCT

Proposed sharing hardware design of multistandard 1-D DCT/IDCT: According to Eq. 1 and 12, the 1-D of 4x4 and 8x8 DCT are computed respectively as follows:

(48)

In the same way, in Eq. 2 and 13, the 1-D of 4x4 and 8x8 IDCT are computed respectively as follows:

(49)

According to Eq. 48, 18 and 29, the 1-D of 4x4 and 8x8 DCT implementation can be achieved respectively as the hardware implementation of T4 and T8.

In the same way, according to Eq. 49, 19 and 29, the 1-D of 4x4 and 8x8 IDCT implementation can be achieved respectively as the hardware implementation of and .

In Eq. 21, the product of A4, B4 and C4 is expressed as follows:

(50)

According to Eq. 31, the product of B8, C8 and D8 is expressed as follows:

(51)

In Eq. 50 and 51, we note that is a submatrix of . According to Eq. 21, D4 serves just to permute the outputs. Therefore, in hardware implementation of T4 and T8, the FGA of T4 is a sub block of the one of T8. In Eq. 9, 21 and 23, we note that is a submatrix of B4, thus the FGA of is a sub block of the one of T4. From the decomposition of T4 and T8 matrices; we have developed the sharing FGA of 1-D 2x2, 4x4 and 8x8 DCT which is presented in Fig. 3. In the same way, from the decomposition of and matrices; we have developed the sharing FGA of 1-D 2x2, 4x4 and 8x8 IDCT which is presented in Fig. 4.

Fig. 3: The proposed sharing FGA of pipelined multistandard 1-D DCT

As indicated by Fig. 3, the proposed sharing FGA of multistandard 1-D DCT contains 1-D 2x2, 4x4 and 8x8 DCT. This sharing FGA includes 8 operation modes, each mode can be selected by the signal Sel. We have developed the proposed sharing FGA of pipelined multistandard 1-D DCT with 4 stages only. Furthermore, we have balanced between the speed and the number of stages by optimizing the critical paths in pipelined operation. Indeed, we have avoided series of a block multiplication with other operators in stage 3 and 4. In Fig. 3, we have saved 6 paths between stage 2 and stage 3 in the lower half of FGA, by using the common paths; this is in order to reduce the required hardware of registers in pipelined operation. In the same way, we have developed the proposed sharing FGA of multistandard 1-D IDCT presented in Fig. 4. Table 3 lists the selection signals Sel of multiplexers and multiplication blocks for multiple operational modes.

In Fig. 3 and 4, we have performed the hardware implementation of multiplication blocks by using multiplierless method, in which adders and shifters are used instead of multipliers.

Fig. 4: The proposed sharing FGA of pipelined multistandard 1-D IDCT

Table 3: Selection signals of multiplexers for multiple operation modes of the proposed multistandard 1-D DCT/IDCT

Each multiplication block supports all modes of multistandard video coding, for this purpose we have used the shared factorizations as shown in Table 4 and 5 to reduce the amounts of additions and shifts.

Table 4: Shared factorizations for multistandard 1-D DCT coefficients and complexity

Table 5: Shared factorizations for multistandard 1-D IDCT coefficients and complexity

Furthermore, we have reduced the latency of the multiplierless computation to a maximum of 3 additions delay for each block multiplication. This is in order to increase the maximal frequency in pipelined operation.

As indicated by Table 4 and 5, we have minimized the required hardware of coefficient multiplications in term of additions and shifts, by looking for an efficient shared factorization between different values of multistandard transform, while keeping a maximum of 3 additions delay for each coefficient multiplication. Taking ‘a’ as an example of coefficient multiplication, Fig. 5 presents the flow graph of this coefficient, in which 4 additions and 5 shifts are only required for its hardware implementation. The signal ‘Sel’ is used to select the output mode of multistandard video.

Proposed multistandard 2-D DCT/IDCT sharing hardware design: We have performed the 2-D DCT/IDCT hardware implementation by using the row-column approach (Chang et al., 2000; Hyesook et al., 2000).

Fig. 5: Flow graph of (‘a’ block) coefficient multiplication

Fig. 6: The 2-D DCT/IDCT architecture

Fig. 7: Low area architecture of transpose buffer supporting multistandard transform

As indicated by Fig. 6, the hardware architecture of 2-D DCT/IDCT requires two blocks of 1-D DCT/IDCT and one block for transpose buffer which is used to connect the two 1-D DCT/IDCT architectures. In this work, we have used in these architectures 16-bit arithmetic accuracy as (Fan et al., 2011).

In the 2-D DCT/IDCT architecture, we have used efficient low area architecture for the transpose buffer supporting 2x2, 4x4 and 8x8 modes, while keeping the high-throughput in the full pipeline. Indeed, for 2-D 2x2, 4x4 and 8x8 DCT/IDCT, we obtain respectively 2, 4 and 8 output pixels of our 2-D DCT/IDCT at every clock cycle in the full pipeline without interruption. Figure 7 shows the low area transpose buffer architecture which contains one memory of 64-word 16-bit wide.

For 8x8 operation mode, the 8x8 array of 16-bit wide words is achieved to receive the inputs from the horizontal as well as from the vertical direction and to shift data in the same direction during every 8 clock cycles. At the beginning, the control unit delays the block memory operation by 4 clock cycles in order to achieve the full pipeline in 1-D row DCT/IDCT architecture. Then, the first 1-D DCT/IDCT architecture writes the output values line by line in vertical direction until the full pipeline in transpose buffer. Thereafter, when the second 1-D DCT/IDCT architecture reads the input values column by column from one direction of memory (Horizontal or Vertical), the first 1-D DCT/IDCT architecture writes simultaneously the output values line by line in the same direction (Horizontal or Vertical) during 8 clock cycles. The control unit block generates the control signals to manage the operation of this architecture. Indeed, these control signals allow to shift data and to switch the select lines of the multiplexer and demultiplexer, this is in order to switch the direction of data flow after every 8 clock cycles.

We have also performed the control unit to support 2-D 2x2 and 4x4 DCT/IDCT. Indeed, it delays the block memory operation for 2x2 and 4x4 modes of DCT/IDCT respectively by 1 and 3 clock cycles in order to achieve the full pipeline in 1-D row DCT/IDCT architecture. The control unit switches also the direction of data flow respectively after every 2 and 4 clock cycles for 2x2 and 4x4 modes.

IMPLEMENTATION RESULTS AND PERFORMANCE ANALYSIS

We have used Altera Cyclone II (EP2C35F672C6) FPGA, to perform the hardware implementation of the proposed multistandard 1-D and 2-D DCT/IDCT; we have also implemented the shifters by wiring. Table 6 shows the architecture comparison of different 1-D forward and inverse transforms for H.264/AVC, AVS, VC-1, JPEG and MPEG-1/2/4. In term of computational complexity, the amount of additions of our proposed 1-D DCT/IDCT design is reduced by 29.7% compared to 1-D IDCT design presented by Fan et al. (2011), this reduction is obtained thanks to the efficiency of our proposed generalized algorithms supporting multistandard DCT/IDCT which can decrease the amount of coefficient multiplications and that the shared factorization is achieved efficiently between different values of each coefficient multiplication. On the other hand, the required shifters of our DCT/IDCT design remain almost the same in comparison with Fan et al. (2011). In term of speed, the maximum frequency and throughput of the proposed 1-D DCT/IDCT design are increased by 24% in comparison with Fan et al. (2011), such as the architecture presented by Fan et al. (2011) used TSMC 0.18-μm CMOS standard cell technology for syntheses and chip layout. This increase of speed is explained by our optimization of critical paths and that the latency of the multiplierless computation is reduced to a maximum of 3 additions delay in each block multiplication. Furthermore, our proposed multistandard 1-D DCT/IDCT design requires 19 multiplexers while the architecture presented by Fan et al. (2011) needs 8, but the advantage of the proposed architecture is that there is the possibility to add other standards without adding any multiplexer. Whereas, there is no possibility to add any other standard in the architecture of Fan et al. (2011).

The proposed multistandard 2-D DCT/IDCT supports JPEG/MPEG-1/2/4, H.264/AVC 2x2/4x4/8x8, VC-1 4x4/8x8 and AVS 8x8. Fan et al. (2011) also used the row-column approach (Chang et al., 2000; Hyesook et al., 2000) in 2-D DCT/IDCT architecture. As shown in Table 7, the computational complexity of the proposed multistandard 2-D DCT/IDCT is decreased by 29.7% of additions while the number of shifts remains almost the same compared to Fan et al. (2011).

Table 6: Comparison for various multistandard 1-D DCT/IDCT architectures

Table 7: Comparison for various multistandard 2-D DCT/IDCT architectures

At every clock cycle, we obtain as Fan et al. (2011) 2, 4 and 8 output pixels respectively at 2x2, 4x4 and 8x8 mode of our 2-D DCT/IDCT in the full pipeline without interruption. The maximum frequency and throughput of the proposed multistandard 2-D DCT/IDCT is increased by 18.2 % in comparison with other architecture presented by Fan et al. (2011), this is obtained thanks to the efficiency of the proposed multistandard 1-D DCT/IDCT architecture.

CONCLUSION

In this study, the new multistandard 1-D DCT/IDCT generalized algorithms and their hardware sharing architectures have been proposed for H.264/AVC, AVS, VC-1, JPEG and MPEG 1/2/4 by using matrix decompositions, optimization of critical paths and shared factorization of coefficient multiplications using only shifts and additions. The proposed multistandard 1-D DCT/IDCT is achieved with low complexity and high speed compared to previous published design. The proposed multistandard 2-D DCT/IDCT is based on 1-D hardware sharing architecture and operates up to 147 MHz clock rate. Furthermore the proposed multistandard 2-D DCT/IDCT sharing design provides the flexibility to support other standard video, by upgrading only the shared factorization of each coefficient multiplication block.

REFERENCES

  • Chang, T.S., C.S. Kung and C.W. Jen, 2000. A simple processor core design for DCT/IDCT. Trans. Circ. Syst. Video Technol., 10: 439-447.
    CrossRef    Direct Link    


  • ISO and IEC, 2009. Coding of still pictures. ISO/IEC JTC 1/SC 29/WG 1. http://www.itscj.ipsj.or.jp/pr/sc29/29w02901_2009-10.pdf.


  • Fan, C.P. and G.A. Su, 2008. Efficient low-cost sharing design of fast 1-D inverse integer transform algorithms for H.264/AVC and VC-1. Signal Process. Lett., 15: 926-929.
    CrossRef    Direct Link    


  • Fan, C.P., C.H. Fang, C.W. Chang and S.J. Hsu, 2011. Fast multiple inverse transforms with low-cost hardware sharing design for multistandard video decoding. Trans. Circ. Syst. II, Exp. Briefs, 58: 517-521.
    CrossRef    Direct Link    


  • Gao, W., C. Reader, F. Wu, Y. He and L. Yu et al., 2004. AVS-The Chinese next-generation video coding standard. Las Vegas, pp: 1-10. http://www.avs.org.cn/reference/AVS%20NAB%20Paper%20Final03.pdf.


  • Huang, C.Y., L.F. Chen and Y.K. Lai, 2008. A high-speed 2-D transform architecture with unique kernel for multi-standard video applications. Proceedings of the International Symposium on Circuits and Systems, May 18-21, 2008, Seattle, WA., pp: 21-24.


  • Hwangbo, W. and C.M. Kyung, 2010. A multitransform architecture for H.264/AVC high-profile coders. Trans. Multimed., 12: 157-167.
    CrossRef    Direct Link    


  • Hyesook, L., V. Piuri and E.E. Swartzlander, 2000. A serial-parallel architecture for two- dimensional discrete cosine and inverse discrete cosine transforms. Trans. Comput., 49: 1297-1309.
    CrossRef    Direct Link    


  • Lee, S. and K. Cho, 2008. Architecture of transform circuit for video decoder supporting multiple standards. Elect. Lett., 44: 274-275.
    CrossRef    Direct Link    


  • Li, Y., Y. He and S.L. Mei, 2008. A highly parallel joint VLSI architecture for transforms in H.264/AVC. J. Signal Process. Syst., 50: 19-32.
    CrossRef    Direct Link    


  • Malvar, H.S., A. Hallapuro, M. Karczewicz and L. Kerofsky, 2003. Low-complexity transform and quantization in H.264/AVC. Trans. Circ. Syst. Video Technol., 13: 598-603.
    CrossRef    Direct Link    


  • Marpe, D., T. Wiegand and S. Gordon, 2005. H.264/MPEG4-AVC fidelity range extensions: Tools, profiles, performance and application areas. Proceedings of the International Conference on Image Processing, September 11-14, 2005, Genova, Italy, pp: 593-596.


  • Park, J.H., S.H. Lee, K.S. Lim, J.H. Kim and S. Kim, 2006. A flexible transform processor architecture for multi-CODECs (JPEG, MPEG-2, 4 and H.264). Proceedings of the International Symposium on Circuits Systems, May 21-24, 2006, Island of Kos, pp: 5347-5350.


  • Qi, H., Q. Huang and W. Gao, 2010. A low-cost very large scale integration architecture for multistandard inverse transform. IEEE Trans. Circ. Syst. II, Exp. Briefs, 57: 551-555.
    CrossRef    Direct Link    


  • SMPTE Standard, 2006. VC-1 compressed video bitstream format and decoding process: Amendment 2. SMPTE 421M, The Society of Motion Picture and Television Engineers, Barker Avenue, White Plains, New York, USA.


  • Su, G.A. and C.P. Fan, 2008. Low-cost hardware-sharing architecture of fast 1-D inverse transforms for H.264/AVC and AVS applications. IEEE Trans. Circ. Syst. II, Exp. Briefs, 55: 1249-1253.
    CrossRef    Direct Link    


  • ISO and IEC, 1993. Information Technology-coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s-Part 2: Video. ISO/IEC 11172-2 MPEG-1. http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=22411.


  • Video Coding Standard, 1995. MPEG2 video IS. ISO/IEC 13818-2: 1995 (E). http://www.ece.cmu.edu/~ece796/documents/MPEG-2_Video_IS.doc.


  • ISO/IEC 14496-2, 2004. Information technology-coding of audio-visual objects-Part 2: Visual. ISO/IEC 14496-2 MPEG-4. http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=39259.


  • Wiegand, T., G. Sullivan and A. Luthra, 2003. Draft ITU-T recommendation and final draft international standard of joint video specification (ITU-T rec. H.264/ISO/IEC 14496-10 AVC. Proceedings of the 8th International Meeting on Joint Video Team (JVC) of ISO/IEC MPEG and ITU-T VCEG, May 23-27, 2003, Geneva, Switzerland pp: 1-156.


  • Wien, M., 2003. Variable block-size transform for H.264/AVC. IEEE Trans Circ. Syst. Video Technol., 13: 604-613.
    CrossRef    Direct Link    


  • Yu, L., S. Chen and J. Wang, 2009. Overview of AVS-video coding standards. Signal Process. Image Commun., 24: 247-262.
    CrossRef    Direct Link    

  • © Science Alert. All Rights Reserved