Article  <  Archive  <  Home
Low-Complexity Multi-size Cyclic-Shifter for QC-LDPC Codes
vol. 39, no. 3, June. 2017, pp. 319-325.
http://dx.doi.org/10.4218/etrij.17.0116.0341

This is an Open Access article distributed under the term of Korea Open Government License (KOGL) Type 4: Source Indication + Commercial Use Prohibition + Change Prohibition (http://www.kogl.or.kr/news/dataView.do?dataIdx=97).
• Abstract
• ### Abstract

The decoding process of a quasi-cyclic low-density parity check code requires a unique type of rotator. These rotators, called multi-size cyclic-shifters (MSCSs), rotate input data with various sizes, where the size is the amount of data to be rotated. This paper proposes a low-complexity MSCS structure for the case when the sizes have a nontrivial common divisor. By combining the strong points of two previous structures, the proposed structure achieves the smallest area. The experimental results show that the area reduction was more than 14.7% when the proposed structure was applied to IEEE 802.16e as an example.
• Authors

• Full Text
• ## I. Introduction

Low-density parity-check (LDPC) codes have been adopted in many communication standards since their rediscovery [1], [2]. Among the various kinds of LDPC codes, quasi-cyclic LDPC (QC-LDPC) codes are widely used because of their easy encoder/decoder implementation. The decoding process of a QC-LDPC code requires rotation operations because of its cyclic characteristic. Unlike an ordinary rotator, however, the rotator in the QC-LDPC decoder should support various sizes, where the size is the amount of data to be rotated.

A rotator with varying sizes is called a multi-size cyclic-shifter (MSCS) [3]. An MSCS block occupies just tens of thousands of gates, but a highly parallel QC-LDPC decoder usually includes around a dozen MSCS blocks [4], [5]. The area of an MSCS block, therefore, affects the total area of a decoder significantly, making area reduction a main subject of research on MSCS structures.

An MSCS block can be simply implemented by assigning a multiple-input multiplexer (MUX) to each output, but this structure requires wide-input MUXs. To overcome the overhead incurred by wide-input MUXs, a pre-rotator structure has been proposed, where small pre-rotators reduce the input width of the MUXs substantially [3]. Another kind of MSCS structure is based on normal barrel rotators. To support various sizes, two barrel rotators are placed in parallel [4], [6] or series [7]. The other kind is based on a Benes network, which can route any N × N permutation [8], [9]. A Benes network is efficient for permutation but inappropriate for optimization with rotation properties.

Some previous structures were optimized using the characteristic of the required rotation sizes. In some QC-LDPC codes used in communication standards, the rotation sizes are multiples of a common divisor. In this paper, this characteristic is called the size distribution property. For example, the QC-LDPC codes in IEEE 802.16e and IEEE 802.22 require rotation operations using sizes that are multiples of 4 [10], [11]. This property has been exploited by the pre-rotator structure and the rotator-in-series structure [3], [7].

In this paper, we propose a new low-complexity MSCS structure that is optimized using the size distribution property. The proposed structure combines the structures of [3] and [9]: the input data are processed by a pre-rotator as in [3] and then the remaining shifts are performed by a few small Benes networks rather than a MUX-network. With this combination, the proposed structure can take advantage of both structures, that is, the efficiency of a Benes network and pre-rotation optimization.

This paper is organized as follows. Section II describes the MSCS operation and previous MSCS structures. The proposed structure is presented in Section III, and the experimental results are compared in Section VI. Section V draws the conclusion.

## II. Multi-size Cyclic Shifter

An MSCS block rotates a specified amount of data. If N w-bit data (d0, d1, d2, … , dN−1) are input with a size z and shift amount s, the first z input data are rotated by s to produce the output (ds, ds+1, … , dz−1, d0, d1, … , ds−1). The output data beyond the specified size z are redundant. The sizes and shift amounts are determined by the adopted QC-LDPC codes.

One of the simplest ways to implement an MSCS block is to use a MUX-network. Because an output datum can be selected from N input data, an N-to-1 (w-bit) MUX is used to connect N inputs to an output. This structure is simple but requires many wide-input MUXs, which are usually implemented by a large number of 2-to-1 MUXs.

To improve the MUX-network structure, a pre-rotator can be introduced that exploits the size distribution property, as shown in Fig. 1(a) [3]. In the pre-rotator structure, the input data are rotated by pre-rotators for which the amount of input data is the greatest common divisor (GCD) of the rotation sizes, and then the outputs of the pre-rotators are processed by a MUX-network. This structure reduces the input width of each MUX substantially, but the MUX-network still occupies a significant area. If the GCD is g and N = g × Ng, the structure consists of Ng g-input pre-rotators and NNg-to-1 MUXs. The equivalent number of 2-to-1 MUXs is {⌈log2 g⌉ + (Ng − 1)} × Nw, where the first term is for the pre-rotators and the second term is for the MUX-network. The delay is (⌈log2 g⌉ + ⌈log2 Ng⌉) × TMUX, where TMUX is the delay of a 2-to-1 MUX.

Another class of MSCS structures is implemented with two barrel rotators. The two barrel rotators can be placed in parallel or in series, as shown in Figs. 1(b) and (c). In the rotator-in-parallel structure, the two barrel rotators rotate the input data to the left (the original direction) by s and to the right (the opposite direction) by z − s, respectively [6]. The results of the barrel rotators are then combined by MUXs to become the output data. In the rotator-in-series structure, the first barrel rotator rotates the input data by s, and the second one rotates the result of the first one in the same direction by N − z [7]. The output data are also selected from the barrel rotator outputs.

#### Previous structures: (a) pre-rotator and MUX-network structure [3], (b) rotator-in-parallel structure [6], (c) rotator-in-series structure [7], and (d) Benes-network structure [9].

The rotator-based structures have been optimized by the size distribution property or a redundant MUX elimination. In the rotator-in-series structure, if the GCD of the required sizes is a power of two, that is, 2m, the shift amount of the second barrel rotator N − z is always a multiple of 2m. This property eliminates m MUX stages of the second barrel rotator. With this elimination, the number of MUXs is (2 ⌈log2 N⌉ − m + 1) × Nw, where the last term in the parenthesis is for the last selection MUX. In the rotator-in-parallel structure, the data rotated around at each MUX stage of the rotators are not selected at the last selection stage [6]. The MUXs for the redundant data are not necessary and can be removed. With this elimination, the number of MUXs is {2 ⌈log2 N⌉ − (2⌈log2 N⌉+1 − 1)/N + 1} × Nw, where the subtraction term is due to the elimination.

The other class of MSCS structure is based on the Benes network, as shown in Fig. 1(d). The Benes network can perform any permutation on the input data with N/2 × (2log2 N − 1) 2 × 2 switches when N is a power of two [8]. A 2 × 2 switch consists of two 2-input MUXs, operating in the BAR or CROSS state, as shown in Fig. 2(a). In [9], the authors proposed a control signal generation scheme for the MSCS operation and a Benes network with 3 × 3 switches or 5 × 5 switches for the case when N = 3 × 2n or N = 5 × 2n. The 3 × 3 switch proposed in [9] consists of three 2 × 2 switches, as shown in Fig. 2(b).

#### Benes network switches [9]: (a) 2 × 2 switch and (b) 3 × 3 switch.

The Benes network for N = 3 × 2n consists of one stage of 3 × 3 switches at the center and n stages of 2 × 2 switches at the front and rear, respectively. An example for N = 12 is shown in Fig. 3. In these cases, the number of MUXs is (N/2 × 2 × 2n + N/3 × 6) × w = (2n + 2) × Nw and the delay is (2n + 3) × TMUX. In the two equations, the first term in the parentheses is for the n stages of the 2 × 2 switches in the front and rear, and the second term is for the center stage of the 3 × 3 switches.

#### Benes network when N = 12 [9].

The critical path of the 3 × 3 switch proposed in [9] includes three 2-to-1 MUXs. A faster 3 × 3 switch was proposed in [12], which has two 2-to-1 MUXs in the critical path. With these 3 × 3 switches, the total delay can be reduced to (2n + 2) × TMUX for N = 3 × 2n.

The previous structures are compared in Table 1 for N = 3 × 2n and GCD g = 2m. In the first column, MN, RIP, RIS, BN, and F3 are the MUX-network structure of [3], the rotator-in-parallel structure of [6], the rotator-in-series structure of [7], the Benes-network structure of [9], and the Benes-network structure with the fast 3 × 3 switch of [12], respectively. The number of MUXs, shown in the second column, is normalized by Nw, and the delay in the third column is normalized by TMUX. When m is small, the BN structure shows the least number of MUXs, but other structures are better with large m. This is because the BN structure itself is efficient but cannot exploit the size distribution property.

### Number of MUXs and delay for N = 3 × 2n and g = 2m.

StructuresNumber of MUXs (× Nw)Delay (× TMUX)
MN(3 × 2n−m + m − 1)(n + 2)
RIP(2n + 7/3 − 1/N)(n + 3)
RIS(2nm + 5)(2nm + 5)
BN(2n + 2)(2n + 3)
F3(2n + 2)(2n + 2)
FC(2nm + 2)(2nm + 2)

## III. Proposed Fine–Coarse Structure

In this section, we propose an MSCS structure based on the Benes network, which adopts pre-rotators to exploit the size distribution property. The proposed structure replaces the MUX-network in Fig. 1(a) with the Benes networks in Fig. 1(d). This combination compensates for the weakness of the two previous structures. The overhead of the MUX-network is reduced by the adoption of the Benes network, and the size distribution property, which has not been exploited by a Benes-network-based structure, can be used by the pre-rotation scheme. The proposed structure is called the fine–coarse (FC) structure because the coarse rotation of the Benes networks follows the fine rotation of the pre-rotators.

The FC structure consists of Ng g-input pre-rotators and g Ng-input Benes networks, as shown in Fig. 4. A rotation scheme with pre-rotators was introduced in [3], but the scheme was described for MUX-network control without a detailed mathematical analysis. In this paper, the scheme is modified for rotation control of the Benes network and a mathematical analysis is presented in the appendix.

#### Proposed FC structure.

In the FC structure, the input data are rotated in two steps. The rotation amount can be described by s = sB·g + sp, where 0 ≤ sp < g. The rotation by sp is performed by the pre-rotators, and the rotation by sB is done by the Benes networks. At the first step, each set of g input data are grouped and rotated by sp within the group. The kth output datum of each pre-rotator is then gathered by the kth Benes network. In the Benes network, the gathered data are rotated by sB with a rotation size of zg = z/g. Because a group has g data, the data are actually rotated by sBg. As a result, a rotation by s is performed.

With this scheme, however, some data are rotated by s − g. This insufficient rotation is caused by the inside-group rotation at the pre-rotation step. At this step, some data should be moved over the group boundary, but they are not. To address this problem, the kth Benes network rotates the gathered data by sB + 1 when k + spg.

The rotation process of the FC structure is illustrated in Fig. 5, where N = 16, z = 12, g = 4, and s = 5 with sp = 1 and sB = 1. In the figure, a small circle represents a datum, and the number in a circle is the index of the corresponding datum. There are four pre-rotators, each of which groups four input data and rotates them by sp = 1 within the group. The kth Benes network with k = 0, 1, or 2 then rotates the kth output data of the pre-rotators by sB = 1 because k + sp < g, but the third Benes network rotates the input data by sB + 1 = 2 because k + spg. The figure shows that the rotation by s = 5 is performed well. The ith output data with iz are redundant and indicated by question marks in the figure.

#### Rotation example when N = 16, z = 12, g = 4, and s = 5.

The FC structure reduces the number of MUXs by reducing the number of MUX stages. In a Benes network, two stages are removed as N is decreased by half. In the FC structure, the amount of input data of a Benes network is N/g, so the number of MUX stages is fewer by about 2log2 g than in a Benes network with N input data. Because the number of MUX stages in the pre-rotators is ⌈log2 g⌉, the resulting reduction is about log2 g stages for the MUXs.

As an example for a detailed analysis, let us assume that N = 3 × 2n and g = 2m. The number of MUXs of the BN structure is (2n + 2) × Nw, as shown in Table 1. In the FC structure, a pre-rotator occupies m × 2m × wMUXs, so the number of MUXs for Ng pre-rotators is m × 2m × w × Ng = mNw. A Benes networks includes {2(n − m) + 2} × NgwMUXs, so the number of MUXs for g Benes networks is {2(n − m) + 2} × Nw. The total number of MUXs is mNw + {2(n − m) + 2} × Nw = (2n − m + 2) × Nw. Compared to the BN structure, mNw MUXs are reduced in the new structure, as shown at the last row of Table 1.

Along with the reduction in the number of MUXs, the FC structure reduces the delay time. In the above example, the delay time of the pre-rotator is mTMUX, and that of a Benes network is {2(n − m) + 2} × TMUX with the fast 3 × 3 switch in [12]. The total delay time is (2n − m + 2) × TMUX, which is less than that of the F3 structure by mTMUX.

For comparison in a real application, Table 2 shows the number of MUXs and the delay of the FC and previous structures when they are applied to IEEE 802.16e. The terms in the first column follow those of Table 1. Tables 1 and 2 show that the FC structure is the smallest one with a comparable delay.

### Comparison for IEEE 802.16e.

StructuresNumber of MUXsDelay
MN25 × 96w7 TMUX
RIP12.3 × 96w8 TMUX
RIS13 × 96w13 TMUX
BN12 × 96w13 TMUX
F312 × 96w12 TMUX
FC10 × 96w10 TMUX

## IV. Experimental Results

To compare the synthesis results, the MSCS structures were implemented in the register transfer level for IEEE 802.16e and synthesized by the Cadence RTLCompiler with a 0.25 μm process library. The constraints were specified so that the synthesized circuits operate as fast as possible with 2-input MUXs preserved in the datapath. The data bit width w was assumed to be 8.

The synthesis results are shown in Table 3, with the same terms in the first column as those of Table 1. The second column of the table shows the area results converted to the number of equivalent 2-input NAND gates. The delay of each entire circuit is presented in the third column with the delay of the datapath part shown in the fourth column. The delay time is measured in picoseconds.

### Synthesis results.

StructuresGate countsDelay (ps)Datapath delay (ps)
MN57,9902,6151,946
RIP28,8842,4032,013
RIS33,7153,1342,722
BN33,0394,2362,788
F331,7023,6232,720
FC24,6273,0092,521

Within the previous structures, the RIP structure is smaller and faster than the Benes-network-based structures (BN and F3). In Tables 1 and 2, the Benes-network-based structures require fewer MUXs, but the RIP structure shows less area because it uses simpler control signals. The control signals required for the rotator-based structures are just shift or size values such as s or z − s. In contrast, the Benes-network-based structures transfer the shift and size values into the switch control signals, increasing the delay time and circuit complexity. The fast 3 × 3 switch in [12] compensates for this inferiority to some degree, but the compensation is not sufficient.

The proposed FC structure exceeds the limit of the Benes-network-based structures, achieving the smallest area and comparable delay time. Compared to the smallest previous structure, the area is reduced by around 14.7%. This reduction is somewhat less than the expectation from Table 2, but still significant. Within the Benes-network-based structures, the area reduction is 25.5% from BN and 22.3% from F3.

In terms of delay, the FC structure is not the fastest one, but is located about in the middle of the structures, faster than three previous structures and slower than two. If we do not consider the MN structure because of its much larger area, the FC structure is the second fastest one.

## V. Conclusion

This Paper proposed a low-complexity MSCS structure. By combining previous structures, the proposed FC structure achieves the smallest area. The area reduction is around 14.7% compared with the smallest conventional structure. The delay of the FC structure is comparable with other structures. Because a QC-LDPC decoder requires many MSCS blocks, the area reduction can contribute significantly to reducing the total decoder area. The experimental results are shown for IEEE 802.16e, but the FC structure can be applied to any QC-LDPC decoder when the required rotation sizes have a common divisor.

## Appendix

This appendix describes the process mentioned in Section III mathematically and shows its correctness. Let us define input data di and output data oi, where 0 ≤ i < N. Rotation by s with size z can be defined as follows:

 $o i = { d i + s if i + s < z , d i + s − z if i + s ≥ z .$ (1)

Output oi are not defined for i > z. If dj,k is the kth datum of the jth group after input data grouping, then di = dj,k, where i = j ·g + k and 0 ≤ k < g. The output data of the pre-rotation step ej,k can be written as follows:

 $e j , k = { d j , k + s p if k + s p < g , d j , k + s p − g if k + s p ≥ g .$ (2)

For a value of k, the data ej,k are gathered by the kth Benes network and rotated by sB or sB + 1 to be fj,k. The kth Benes network with k + sp < g (the first case of (2)) performs the rotation by sB as follows:

 $f j , k = { e j + s B , k if j + s B < z g , e j + s B − z g , k if j + s B ≥ z g .$ (3)

The kth Benes network with k + spg (the second case of (2)) performs the rotation by sB + 1 as follows:

 $f j , k = { e j + s B + 1 , k if j + s B + 1 < z g , e j + s B + 1 − z g , k if j + s B + 1 ≥ z g .$ (4)

For output data oi, oi = fj,k, where i = j · g + k and 0 ≤ k < g.

With the above equations, we can check whether (1) is satisfied. Four cases arise from the values of j and k in oi = fj,k, and one case is analyzed in this paper because of its length. If k + spg and j + sB + 1 ≥ zg, i + s = j · g + k + sB · g + sp = (j + sB + 1) · g + k + spgz, which belongs to the second case of (1). With this j and k,

 $o i = f j , k = e j + s B + 1 − z g , k = d j + s B + 1 − z g , k + s p − g = d ( j + s B + 1 − z g ) g + k + s p − g = d j g + k + s B g + s p − z = d i + s − z .$ (5)

Equation (5) satisfies the second case of (1).

## Footnotes

Hyeong-Ju Kang (hjkang@koreatech.ac.kr) is with the School of Computer Science and Engineering, Korea University of Technology and Education, Cheonan, Rep. of Korea.

Byung-Do Yang (corresponding author, bdyang@cbnu.ac.kr) is with the Department of Electronics Engineering, Chungbuk National University, Cheongju, Rep. of Korea.

This work was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2012R1A1A1011221).

• References
• ### References

[1]

D.J.C. MacKay and R.M. Neal, “Near Shannon Limit Performance of Low Density Parity Check Codes,” Electron. Lett., vol. 32, no. 18, Aug. 1996, pp. 1645–1646.

[2]

D.J.C. Mackay, “Good Error Correcting Codes Based on Very Sparse Matrices,” IEEE Trans. Inform. Theory, vol. 45, no. 2, Mar. 1999, pp. 399–431.

[3]

M. Rovini, G. Gentile, and L. Fanucci, “Multi-size Circular Shifting Networks for Decoders of Structured LDPC Codes,” Electron. Lett., vol. 43, no. 17, Aug. 2007, pp. 938–940.

[4]

X. Peng et al., “A 115 mW 1 Gbps QC-LDPC Decoder ASIC for WiMAX in 65 nm CMOS,” IEEE Asian Solid-State Circuits Conf., Jeju, Rep. of Korea, Nov. 14–16, 2011, pp. 317–320.

[5]

Y. Cui et al., “Ultra Low Power QC-LDPC Decoder with High Parallelism,” Int. SOC Conf., Taipei, Taiwan, Sept. 26–28, 2011, pp. 142–145.

[6]

X. Chen, S. Lin, and V. Akella, “QSN–a Simple Circular-Shift Network for Reconfigurable Quasi-Cyclic LDPC Decoders,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 10, Oct. 2010, pp. 782–786.

[7]

B. Xiang et al., “An 847–955 Mb/s 342–397 mW Dual-Path Fully-Overlapped QC-LDPC Decoders for WiMAX System in 0.13 μm CMOS,” IEEE J. Solid-State Circuits, vol. 46, no. 6, June 2011, pp. 1416–1432.

[8]

V.E. Benes, “Optimal Rearrangeable Multistage Connecting Networks,” Bell Syst. Tech. J., vol. 43, no. 4, July 1964, pp. 1641–1656.

[9]

D. Oh and K.K. Parhi, “Low-Complexity Switch Network for Reconfigurable LDPC Decoders,” IEEE Trans. Very Large Scale Integr. Syst., vol. 18, no. 1, Jan. 2010, pp. 85–94.

[10]

IEEE Std. 802.16, IEEE Standard for Air Interface for Broadband Wireless Access Systems, 2012.

[11]

IEEE Std. 802.22, IEEE Standard for Information Technology–Local and Metropolitan Area Networks–Specific Requirements–Part 22: Cognitive Wireless RAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications: Policies and Procedures for Operation in the TV Bands, 2011.

[12]

H.-J. Kang, “Multi-size Circular Shifter Based on Benes Network with High-Speed 3 × 3 Switch,” J. Korea Inst. Inform. Commun. Eng., vol. 19, no. 11, Nov. 2015, pp. 2637–2642.

• Cited by

• Metrics
• ### Metrics

439
461
Viewed

#### Citations

0
0
• Figure / Table

### Number of MUXs and delay for N = 3 × 2n and g = 2m.

StructuresNumber of MUXs (× Nw)Delay (× TMUX)
MN(3 × 2n−m + m − 1)(n + 2)
RIP(2n + 7/3 − 1/N)(n + 3)
RIS(2nm + 5)(2nm + 5)
BN(2n + 2)(2n + 3)
F3(2n + 2)(2n + 2)
FC(2nm + 2)(2nm + 2)

### Comparison for IEEE 802.16e.

StructuresNumber of MUXsDelay
MN25 × 96w7 TMUX
RIP12.3 × 96w8 TMUX
RIS13 × 96w13 TMUX
BN12 × 96w13 TMUX
F312 × 96w12 TMUX
FC10 × 96w10 TMUX

### Synthesis results.

StructuresGate countsDelay (ps)Datapath delay (ps)
MN57,9902,6151,946
RIP28,8842,4032,013
RIS33,7153,1342,722
BN33,0394,2362,788
F331,7023,6232,720
FC24,6273,0092,521