## I. Introduction

Low-density parity-check (LDPC) codes have been adopted in many communication standards since their rediscovery [1], [2]. Among the various kinds of LDPC codes, quasi-cyclic LDPC (QC-LDPC) codes are widely used because of their easy encoder/decoder implementation. The decoding process of a QC-LDPC code requires rotation operations because of its cyclic characteristic. Unlike an ordinary rotator, however, the rotator in the QC-LDPC decoder should support various sizes, where the size is the amount of data to be rotated.

A rotator with varying sizes is called a multi-size cyclic-shifter (MSCS) [3]. An MSCS block occupies just tens of thousands of gates, but a highly parallel QC-LDPC decoder usually includes around a dozen MSCS blocks [4], [5]. The area of an MSCS block, therefore, affects the total area of a decoder significantly, making area reduction a main subject of research on MSCS structures.

An MSCS block can be simply implemented by assigning a multiple-input multiplexer (MUX) to each output, but this structure requires wide-input MUXs. To overcome the overhead incurred by wide-input MUXs, a pre-rotator structure has been proposed, where small pre-rotators reduce the input width of the MUXs substantially [3]. Another kind of MSCS structure is based on normal barrel rotators. To support various sizes, two barrel rotators are placed in parallel [4], [6] or series [7]. The other kind is based on a Benes network, which can route any *N* × *N* permutation [8], [9]. A Benes network is efficient for permutation but inappropriate for optimization with rotation properties.

Some previous structures were optimized using the characteristic of the required rotation sizes. In some QC-LDPC codes used in communication standards, the rotation sizes are multiples of a common divisor. In this paper, this characteristic is called the size distribution property. For example, the QC-LDPC codes in IEEE 802.16e and IEEE 802.22 require rotation operations using sizes that are multiples of 4 [10], [11]. This property has been exploited by the pre-rotator structure and the rotator-in-series structure [3], [7].

In this paper, we propose a new low-complexity MSCS structure that is optimized using the size distribution property. The proposed structure combines the structures of [3] and [9]: the input data are processed by a pre-rotator as in [3] and then the remaining shifts are performed by a few small Benes networks rather than a MUX-network. With this combination, the proposed structure can take advantage of both structures, that is, the efficiency of a Benes network and pre-rotation optimization.

This paper is organized as follows. Section II describes the MSCS operation and previous MSCS structures. The proposed structure is presented in Section III, and the experimental results are compared in Section VI. Section V draws the conclusion.

## II. Multi-size Cyclic Shifter

An MSCS block rotates a specified amount of data. If *N w*-bit data (*d*_{0}, *d*_{1}, *d*_{2}, … , *d*_{N−1}) are input with a size *z* and shift amount *s*, the first *z* input data are rotated by *s* to produce the output (*d _{s}*,

*d*

_{s+1}, … ,

*d*

_{z−1},

*d*

_{0},

*d*

_{1}, … ,

*d*

_{s−1}). The output data beyond the specified size

*z*are redundant. The sizes and shift amounts are determined by the adopted QC-LDPC codes.

One of the simplest ways to implement an MSCS block is to use a MUX-network. Because an output datum can be selected from N input data, an *N*-to-1 (*w*-bit) MUX is used to connect *N* inputs to an output. This structure is simple but requires many wide-input MUXs, which are usually implemented by a large number of 2-to-1 MUXs.

To improve the MUX-network structure, a pre-rotator can be introduced that exploits the size distribution property, as shown in Fig. 1(a) [3]. In the pre-rotator structure, the input data are rotated by pre-rotators for which the amount of input data is the greatest common divisor (GCD) of the rotation sizes, and then the outputs of the pre-rotators are processed by a MUX-network. This structure reduces the input width of each MUX substantially, but the MUX-network still occupies a significant area. If the GCD is *g* and *N* = *g* × *N _{g}*, the structure consists of

*N*

_{g}*g*-input pre-rotators and

*NN*-to-1 MUXs. The equivalent number of 2-to-1 MUXs is {⌈log

_{g}_{2}

*g*⌉ + (

*N*− 1)} ×

_{g}*Nw*, where the first term is for the pre-rotators and the second term is for the MUX-network. The delay is (⌈log

_{2}

*g*⌉ + ⌈log

_{2}

*N*⌉) ×

_{g}*T*

_{MUX}, where

*T*

_{MUX}is the delay of a 2-to-1 MUX.

Another class of MSCS structures is implemented with two barrel rotators. The two barrel rotators can be placed in parallel or in series, as shown in Figs. 1(b) and (c). In the rotator-in-parallel structure, the two barrel rotators rotate the input data to the left (the original direction) by *s* and to the right (the opposite direction) by *z − s*, respectively [6]. The results of the barrel rotators are then combined by MUXs to become the output data. In the rotator-in-series structure, the first barrel rotator rotates the input data by *s*, and the second one rotates the result of the first one in the same direction by *N − z* [7]. The output data are also selected from the barrel rotator outputs.

The rotator-based structures have been optimized by the size distribution property or a redundant MUX elimination. In the rotator-in-series structure, if the GCD of the required sizes is a power of two, that is, 2* ^{m}*, the shift amount of the second barrel rotator

*N − z*is always a multiple of 2

*. This property eliminates*

^{m}*m*MUX stages of the second barrel rotator. With this elimination, the number of MUXs is (2 ⌈log

_{2}

*N*⌉ −

*m*+ 1) ×

*Nw*, where the last term in the parenthesis is for the last selection MUX. In the rotator-in-parallel structure, the data rotated around at each MUX stage of the rotators are not selected at the last selection stage [6]. The MUXs for the redundant data are not necessary and can be removed. With this elimination, the number of MUXs is {2 ⌈log

_{2}

*N*⌉ − (2

^{⌈log2 N⌉+1}− 1)/

*N*+ 1} ×

*Nw*, where the subtraction term is due to the elimination.

The other class of MSCS structure is based on the Benes network, as shown in Fig. 1(d). The Benes network can perform any permutation on the input data with *N*/2 × (2log_{2} *N* − 1) 2 × 2 switches when *N* is a power of two [8]. A 2 × 2 switch consists of two 2-input MUXs, operating in the BAR or CROSS state, as shown in Fig. 2(a). In [9], the authors proposed a control signal generation scheme for the MSCS operation and a Benes network with 3 × 3 switches or 5 × 5 switches for the case when *N* = 3 × 2* ^{n}* or

*N*= 5 × 2

*. The 3 × 3 switch proposed in [9] consists of three 2 × 2 switches, as shown in Fig. 2(b).*

^{n}The Benes network for *N* = 3 × 2* ^{n}* consists of one stage of 3 × 3 switches at the center and

*n*stages of 2 × 2 switches at the front and rear, respectively. An example for

*N*= 12 is shown in Fig. 3. In these cases, the number of MUXs is (

*N*/2 × 2 × 2

*n*+

*N*/3 × 6) ×

*w*= (2

*n*+ 2) ×

*Nw*and the delay is (2

*n*+ 3) ×

*T*

_{MUX}. In the two equations, the first term in the parentheses is for the

*n*stages of the 2 × 2 switches in the front and rear, and the second term is for the center stage of the 3 × 3 switches.

The critical path of the 3 × 3 switch proposed in [9] includes three 2-to-1 MUXs. A faster 3 × 3 switch was proposed in [12], which has two 2-to-1 MUXs in the critical path. With these 3 × 3 switches, the total delay can be reduced to (2*n* + 2) × *T*_{MUX} for *N* = 3 × 2* ^{n}*.

The previous structures are compared in Table 1 for *N* = 3 × 2* ^{n}* and GCD

*g*= 2

*. In the first column, MN, RIP, RIS, BN, and F3 are the MUX-network structure of [3], the rotator-in-parallel structure of [6], the rotator-in-series structure of [7], the Benes-network structure of [9], and the Benes-network structure with the fast 3 × 3 switch of [12], respectively. The number of MUXs, shown in the second column, is normalized by*

^{m}*Nw*, and the delay in the third column is normalized by

*T*

_{MUX}. When

*m*is small, the BN structure shows the least number of MUXs, but other structures are better with large

*m*. This is because the BN structure itself is efficient but cannot exploit the size distribution property.

## III. Proposed Fine–Coarse Structure

In this section, we propose an MSCS structure based on the Benes network, which adopts pre-rotators to exploit the size distribution property. The proposed structure replaces the MUX-network in Fig. 1(a) with the Benes networks in Fig. 1(d). This combination compensates for the weakness of the two previous structures. The overhead of the MUX-network is reduced by the adoption of the Benes network, and the size distribution property, which has not been exploited by a Benes-network-based structure, can be used by the pre-rotation scheme. The proposed structure is called the fine–coarse (FC) structure because the coarse rotation of the Benes networks follows the fine rotation of the pre-rotators.

The FC structure consists of *N _{g}*

*g*-input pre-rotators and

*g*

*N*-input Benes networks, as shown in Fig. 4. A rotation scheme with pre-rotators was introduced in [3], but the scheme was described for MUX-network control without a detailed mathematical analysis. In this paper, the scheme is modified for rotation control of the Benes network and a mathematical analysis is presented in the appendix.

_{g}In the FC structure, the input data are rotated in two steps. The rotation amount can be described by *s* = *s*_{B}·*g* + *s*_{p}, where 0 ≤ *s*_{p} < *g*. The rotation by *s*_{p} is performed by the pre-rotators, and the rotation by *s*_{B} is done by the Benes networks. At the first step, each set of *g* input data are grouped and rotated by *s*_{p} within the group. The *k*th output datum of each pre-rotator is then gathered by the *k*th Benes network. In the Benes network, the gathered data are rotated by *s*_{B} with a rotation size of *z _{g}* =

*z*/

*g*. Because a group has

*g*data, the data are actually rotated by

*s*

_{B}

*g*. As a result, a rotation by

*s*is performed.

With this scheme, however, some data are rotated by *s − g*. This insufficient rotation is caused by the inside-group rotation at the pre-rotation step. At this step, some data should be moved over the group boundary, but they are not. To address this problem, the *k*th Benes network rotates the gathered data by *s*_{B} + 1 when *k* + *s*_{p} ≥ *g*.

The rotation process of the FC structure is illustrated in Fig. 5, where *N* = 16, *z* = 12, *g* = 4, and *s* = 5 with *s*_{p} = 1 and *s*_{B} = 1. In the figure, a small circle represents a datum, and the number in a circle is the index of the corresponding datum. There are four pre-rotators, each of which groups four input data and rotates them by *s*_{p} = 1 within the group. The *k*th Benes network with *k* = 0, 1, or 2 then rotates the *k*th output data of the pre-rotators by *s*_{B} = 1 because *k* + *s*_{p} < *g*, but the third Benes network rotates the input data by *s*_{B} + 1 = 2 because *k* + *s*_{p} ≥ *g*. The figure shows that the rotation by *s* = 5 is performed well. The *i*th output data with *i* ≥ *z* are redundant and indicated by question marks in the figure.

The FC structure reduces the number of MUXs by reducing the number of MUX stages. In a Benes network, two stages are removed as *N* is decreased by half. In the FC structure, the amount of input data of a Benes network is *N*/*g*, so the number of MUX stages is fewer by about 2log_{2} *g* than in a Benes network with *N* input data. Because the number of MUX stages in the pre-rotators is ⌈log_{2} *g*⌉, the resulting reduction is about log_{2} *g* stages for the MUXs.

As an example for a detailed analysis, let us assume that *N* = 3 × 2* ^{n}* and

*g*= 2

*. The number of MUXs of the BN structure is (2*

^{m}*n*+ 2) ×

*Nw*, as shown in Table 1. In the FC structure, a pre-rotator occupies

*m*× 2m × wMUXs, so the number of MUXs for

*N*pre-rotators is m × 2

_{g}*×*

^{m}*w*×

*N*=

_{g}*mNw*. A Benes networks includes {2(

*n − m*) + 2} ×

*N*MUXs, so the number of MUXs for g Benes networks is {2(

_{g}w*n − m*) + 2} ×

*Nw*. The total number of MUXs is

*mNw*+ {2(

*n − m*) + 2} ×

*Nw*= (2

*n − m*+ 2) ×

*Nw*. Compared to the BN structure,

*mNw*MUXs are reduced in the new structure, as shown at the last row of Table 1.

Along with the reduction in the number of MUXs, the FC structure reduces the delay time. In the above example, the delay time of the pre-rotator is *mT*_{MUX}, and that of a Benes network is {2(*n − m*) + 2} × *T*_{MUX} with the fast 3 × 3 switch in [12]. The total delay time is (2*n − m* + 2) × *T*_{MUX}, which is less than that of the F3 structure by *mT*_{MUX}.

For comparison in a real application, Table 2 shows the number of MUXs and the delay of the FC and previous structures when they are applied to IEEE 802.16e. The terms in the first column follow those of Table 1. Tables 1 and 2 show that the FC structure is the smallest one with a comparable delay.

## IV. Experimental Results

To compare the synthesis results, the MSCS structures were implemented in the register transfer level for IEEE 802.16e and synthesized by the Cadence RTLCompiler with a 0.25 μm process library. The constraints were specified so that the synthesized circuits operate as fast as possible with 2-input MUXs preserved in the datapath. The data bit width *w* was assumed to be 8.

The synthesis results are shown in Table 3, with the same terms in the first column as those of Table 1. The second column of the table shows the area results converted to the number of equivalent 2-input NAND gates. The delay of each entire circuit is presented in the third column with the delay of the datapath part shown in the fourth column. The delay time is measured in picoseconds.

##### Table 3.

Structures | Gate counts | Delay (ps) | Datapath delay (ps) |
---|---|---|---|

MN | 57,990 | 2,615 | 1,946 |

RIP | 28,884 | 2,403 | 2,013 |

RIS | 33,715 | 3,134 | 2,722 |

BN | 33,039 | 4,236 | 2,788 |

F3 | 31,702 | 3,623 | 2,720 |

FC | 24,627 | 3,009 | 2,521 |

Within the previous structures, the RIP structure is smaller and faster than the Benes-network-based structures (BN and F3). In Tables 1 and 2, the Benes-network-based structures require fewer MUXs, but the RIP structure shows less area because it uses simpler control signals. The control signals required for the rotator-based structures are just shift or size values such as *s* or *z − s*. In contrast, the Benes-network-based structures transfer the shift and size values into the switch control signals, increasing the delay time and circuit complexity. The fast 3 × 3 switch in [12] compensates for this inferiority to some degree, but the compensation is not sufficient.

The proposed FC structure exceeds the limit of the Benes-network-based structures, achieving the smallest area and comparable delay time. Compared to the smallest previous structure, the area is reduced by around 14.7%. This reduction is somewhat less than the expectation from Table 2, but still significant. Within the Benes-network-based structures, the area reduction is 25.5% from BN and 22.3% from F3.

In terms of delay, the FC structure is not the fastest one, but is located about in the middle of the structures, faster than three previous structures and slower than two. If we do not consider the MN structure because of its much larger area, the FC structure is the second fastest one.

## V. Conclusion

This Paper proposed a low-complexity MSCS structure. By combining previous structures, the proposed FC structure achieves the smallest area. The area reduction is around 14.7% compared with the smallest conventional structure. The delay of the FC structure is comparable with other structures. Because a QC-LDPC decoder requires many MSCS blocks, the area reduction can contribute significantly to reducing the total decoder area. The experimental results are shown for IEEE 802.16e, but the FC structure can be applied to any QC-LDPC decoder when the required rotation sizes have a common divisor.

## Appendix

This appendix describes the process mentioned in Section III mathematically and shows its correctness. Let us define input data *d _{i}* and output data

*o*, where 0 ≤

_{i}*i*<

*N*. Rotation by

*s*with size

*z*can be defined as follows:

(1) |

Output *o _{i}* are not defined for

*i*>

*z*. If

*d*is the

_{j,k}*k*th datum of the

*j*th group after input data grouping, then

*d*=

_{i}*d*, where

_{j,k}*i*=

*j*·

*g*+

*k*and 0 ≤

*k*<

*g*. The output data of the pre-rotation step

*e*can be written as follows:

_{j,k}

(2) |

For a value of *k*, the data *e _{j,k}* are gathered by the

*k*th Benes network and rotated by

*s*

_{B}or

*s*

_{B}+ 1 to be

*f*. The

_{j,k}*k*th Benes network with

*k*+

*s*

_{p}<

*g*(the first case of (2)) performs the rotation by

*s*

_{B}as follows:

(3) |

The *k*th Benes network with *k* + *s*_{p} ≥ *g* (the second case of (2)) performs the rotation by *s*_{B} + 1 as follows:

(4) |

For output data *o _{i}*,

*o*=

_{i}*f*, where

_{j,k}*i*=

*j*·

*g*+

*k*and 0 ≤

*k*<

*g*.

With the above equations, we can check whether (1) is satisfied. Four cases arise from the values of *j* and *k* in *o _{i}* =

*f*, and one case is analyzed in this paper because of its length. If

_{j,k}*k*+

*s*

_{p}≥

*g*and

*j*+

*s*

_{B}+ 1 ≥

*z*,

_{g}*i*+

*s*=

*j*·

*g*+

*k*+

*s*

_{B}·

*g*+

*s*

_{p}= (

*j*+

*s*

_{B}+ 1) ·

*g*+

*k*+

*s*

_{p}−

*g*≥

*z*, which belongs to the second case of (1). With this

*j*and

*k*,

Equation (5) satisfies the second case of (1).