Low-Complexity Massive MIMO Detectors Based on Richardson Method
vol. 39, no. 3, June 2017, pp. 326-335.
http://dx.doi.org/10.4218/etrij.17.0116.0732

This is an Open Access article distributed under the terms of the Korea Open Government License (KOGL) Type 4: Source Indication + Commercial Use Prohibition + Change Prohibition (http://www.kogl.or.kr/news/dataView.do?dataIdx=97).
### Abstract

In the uplink transmission of massive (or large-scale) multi-input multi-output (MIMO) systems, large-dimensional signal detection and its hardware design are challenging issues owing to the high computational complexity. In this paper, we propose low-complexity hardware architectures of Richardson iterative method-based massive MIMO detectors. We present two types of massive MIMO detectors: directly mapped (type1) and reformulated (type2) Richardson iterative methods. In the proposed Richardson method (type2), the matrix-by-matrix multiplications are reformulated as matrix-vector multiplications, thus reducing the computational complexity from $O(U^2)$ to $O(U)$. Both massive MIMO detectors are implemented using a 65 nm CMOS process and compared in terms of detection performance under different channel conditions (high-mobility and flat fading channels). The hardware implementation results confirm that the proposed type1 Richardson method-based detector achieves up to 50% power savings over the proposed type2 detector under a flat fading channel. The type2 detector achieves a 37% power savings compared to the type1 detector under a high-mobility channel.
## I. Introduction

Multiple-input multiple-output (MIMO) technology [1] has been widely adopted in many wireless communication standards including 3GPP LTE, LTE-Advanced, and IEEE 802.11n [2]–[4]. As conventional MIMO systems are approaching their throughput limits, multi-user (MU) MIMO and massive (or large-scale) MIMO have become the most promising candidates for greater amounts of data transmission over wireless networks. In massive MIMO systems, the base stations (BSs) are equipped with hundreds of antennas serving tens of users or mobile stations (MSs) [5]. It has been theoretically proven that a massive MIMO system can increase the spectrum efficiency and energy efficiency of a wireless channel by approximately two and three orders of magnitude, respectively [6], [7]. The simplest linear detectors typically demonstrate near-optimal performance [8]. However, owing to the large number of antennas at a BS, signal detection is one of the most complex and essential components in the uplink of massive MIMO systems [9], [10].

The hardware implementation of a massive MIMO detector is of particular interest [5]. Owing to their high computational complexity, optimal massive MIMO detectors such as the maximum-likelihood (ML) detector are considered virtually infeasible [5]. Previous massive MIMO detection approaches were based primarily on approximate matrix inversion [10] or on iterative methods for solving a system of linear equations [11]–[17]. The hardware architectures have typically adopted Neumann series-based approximate inversion [5], [10], which significantly reduces the computational complexity at the cost of lower detection performance than stationary iterative approaches [11]–[16]. Although many detection methods and architectures of massive MIMO detectors have been proposed in the literature, a comprehensive analysis of the detection performance and hardware cost has never been reported.

In this paper, we present a low-complexity massive MIMO detector architecture based on the Richardson iterative method. Through a comprehensive literature survey, we compared different iterative methods in terms of detection performance and computational complexity. Based on this analysis, the Richardson iterative method was selected as the most hardware-friendly approach with a reasonable BER performance. First, we present an efficient hardware architecture based on the conventional Richardson iteration (type1). In the conventional architecture, it is found that the complex matrix-by-matrix multiplications in the Richardson iterative method can be reformulated as simpler matrix-vector multiplications with considerably reduced complexity. Based on this reformulation, a low-complexity Richardson method-based massive MIMO detector (type2) is proposed. Both types of massive MIMO detectors are implemented and compared in terms of detection performance and hardware cost (power/energy consumption) under the following two wireless channel conditions: a high-mobility channel (time-variant and frequency-selective channel) and a flat fading channel.

## II. Background

### 1. Uplink System Model

A massive (or large-scale) MU MIMO uplink system in which a BS with B antennas communicates with U (≪ B) single-antenna users can be modeled as follows:

 $y = H s + n .$ (1)

Here, y corresponds to a B-dimensional complex-valued symbol vector received at the BS, H is a complex-valued B × U (tall and skinny) uplink channel matrix, s is a U-dimensional complex-valued transmitted symbol vector that contains the transmitted symbols for all U users, and n represents additive noise at the BS.

For signal detection in the uplink system, the transmitted symbol vector s can be estimated by minimum mean square error (MMSE) detection as presented in (2).

 $\hat{s} = ( H^H H + \sigma^2 I )^{-1} H^H y = W^{-1} \hat{y} ,$ (2)

where $\hat{s}$ is the estimate of the transmitted symbol vector, $\sigma^2$ is the additive white Gaussian noise power, $I$ is a U × U identity matrix, $W = H^H H + \sigma^2 I$ denotes the MMSE filtering matrix, and $\hat{y} = H^H y$ is the matched-filter output.

### 2. Approximate Matrix Inversion and Iterative Signal Detection Methods

Because the exact inverse of the $W$ matrix in (2) incurs a significant number of computations with $O(U^3)$, an approximate matrix inversion using Neumann series terms [8] is proposed to reduce the computational cost. Using only the first two terms of the Neumann series, the computational complexity is reduced from $O(U^3)$ to $O(U^2)$ [10]. However, the BER performance of the Neumann series approximation with two terms is considerably worse than that of the exact matrix inversion [11], [13], [14]. For low-complexity and near-optimal signal detection of massive MIMO systems, stationary iterative methods have been proposed [11]–[17]. In stationary iterative methods, the U-dimensional transmitted symbol vector is approximated by the solution of the ith iteration. The iterative solutions of the Richardson [11], [12], successive over-relaxation (SOR) [13], [14], and Jacobi-based [15] iterative methods are presented in the following:

 $s^{(i)} = s^{(i-1)} + \omega \{ \hat{y} - ( H^H H + \sigma^2 I ) s^{(i-1)} \} ,$ (3)

 $( ( 1 / \omega ) D + L ) s^{(i)} = \hat{y} + \{ ( 1 / \omega - 1 ) D - L^H \} s^{(i-1)} ,$ (4)

 $s^{(i)} = \omega ( - D^{-1} E s^{(i-1)} + D^{-1} \hat{y} ) + ( 1 - \omega ) s^{(i-1)} ,$ (5)

where $s^{(i)}$ is the estimate of the transmitted symbol vector at the ith iteration, $\omega$ is a constant scale factor, $D$ is the main diagonal of the MMSE filtering matrix $W = H^H H + \sigma^2 I$, $L$ is the strictly lower triangular part of $W$, and $E$ is the off-diagonal part of $W$ ($E = W - D$).
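The Richardson update in (3) can be illustrated with a small numerical sketch. The 4 × 2 channel, the received vector, the noise power, and the step size $\omega = 1/\mathrm{trace}(W)$ below are illustrative assumptions chosen only so that the iteration converges quickly; they are not the values used in the paper.

```python
# Sketch of the Richardson iteration in (3) on a toy 4 x 2 uplink (pure Python).
# The channel, received vector, noise power, and omega are illustrative assumptions.

def hermitian(m):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[m[r][c].conjugate() for r in range(len(m))] for c in range(len(m[0]))]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def norm(v):
    return sum(abs(x) ** 2 for x in v) ** 0.5

H = [[1 + 0j, 0.2j],
     [0.1 + 0j, 1 - 0.1j],
     [0.3j, 0.2 + 0j],
     [0.5 + 0j, -0.4 + 0j]]
y = [1 + 1j, -1 + 1j, 0.5 - 0.5j, 0.2 + 0j]
sigma2 = 0.1
U = len(H[0])

HH = hermitian(H)
G = matmul(HH, H)                                   # Gram matrix H^H H
W = [[G[r][c] + (sigma2 if r == c else 0) for c in range(U)] for r in range(U)]
y_hat = matvec(HH, y)                               # matched-filter output

# Assumed step size: 1 / trace(W) lies below 2 / lambda_max(W) for a Hermitian
# positive definite W, which is sufficient for the iteration to converge.
omega = 1.0 / sum(W[k][k].real for k in range(U))

def richardson(num_iter):
    s = [0j] * U                                    # zero initial solution
    for _ in range(num_iter):
        Ws = matvec(W, s)
        s = [s_k + omega * (yh_k - ws_k) for s_k, yh_k, ws_k in zip(s, y_hat, Ws)]
    return s

def residual(s):
    """Norm of y_hat - W s, which is zero at the exact MMSE solution."""
    return norm([yh_k - ws_k for yh_k, ws_k in zip(y_hat, matvec(W, s))])

for n in (0, 5, 50):
    print(n, residual(richardson(n)))
```

The residual $\hat{y} - W s^{(i)}$ shrinks with each iteration, so $s^{(i)}$ approaches the MMSE estimate in (2) without any matrix inversion or division.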

First, to identify a hardware-friendly and highly accurate detection method, approximate matrix inversion and iterative detection methods were compared in terms of both BER performance and computational complexity; the BER results are presented in Fig. 1. To compare the BER performances of various approaches under the same condition, a B × U = 128 × 16 massive MIMO uplink system with a Rayleigh flat fading channel model and 64-QAM modulation [8], [10], [17] was assumed. Compared to the Neumann series approximation using two terms, the Richardson, SOR, and Gauss-Seidel iterative methods demonstrated considerably superior BER performances when the number of iterations n = 5. Table 1 compares the computational complexities of the different massive MIMO detection methods. In terms of multiplication and addition, the Neumann (2-term), Richardson, SOR, Gauss-Seidel, and Jacobi methods demonstrated similar complexities of $O(U^2)$. Among the operations (multiplication, addition, and division) indicated in Table 1, division was the most expensive operation. Complex division requires an approximately 9-times greater area [18] than complex multiplication. Because the remainder of the operations (multiplication and addition) are comparable and the Richardson method is the only one that does not require a division operation, the Richardson method was selected as the low-complexity approach in this paper.

### Table 1. Computational complexities of detection methods.

| Method | Multiplications | Additions | Divisions |
|---|---|---|---|
| Neumann [10] (3-terms) | $4U^3 + (2B + 1)U^2 + (4B - 1)U$ | $4U^3 + 2BU^2 + 4BU$ | $U$ |
| Neumann [10] (2-terms) | $(2B + 1)U^2 + (4B - 1)U$ | $2BU^2 + 4BU$ | $U$ |
| Richardson [11] | $(4B + 4n)U^2 + 2BU$ | $(4B + 4n - 2)U^2 + 2BU$ | $0$ |
| SOR [13] | $(4B + 4n - 2)U^2 + 2(Bn + 1)U$ | $(4B + 4n - 3)U^2 + 2(B - 3n)U + 2n - 2$ | $2nU$ |
| Gauss-Seidel [14] (SOR, ω = 1) | $(4B + 4n - 2)U^2 + 2(B - 2n + 1)U$ | $(4B + 4n - 4)U^2 + 2(B - 4n)U + 2n - 2$ | $2nU$ |
| Jacobi [15] | $(4B + 4n + 1)U^2 + 2BU$ | $(4B + 4n - 2)U^2 + 2(Bn)U$ | $U$ |

## III. Architectures of Massive MIMO Detector Based on Conventional and Proposed Richardson Iterative Methods

### 1. Hardware Architecture Based on the Conventional Richardson Iterative Method

Figure 2 presents the overall architecture of the proposed massive MIMO detector based on (3) (type1). Here, a B × U = 128 × 16 massive MIMO uplink system employing 64-QAM modulation is assumed. As indicated in the figure, the architecture is divided primarily into a preprocessing block and iteration block.

### Fig. 2. Overall architecture of the massive MIMO detector based on the Richardson method (type1).

A. Preprocessing Block

The preprocessing block is composed of the GRAM and MF (matched filter) modules as presented in Fig. 3. Each row (G_ROW01, G_ROW02, … , G_ROW15, G_ROW16) of the systolic array is used to compute one row of the lower triangular part of the 16 × 16 $G$ (= $H^H H$) matrix. First, using the two types of processing elements (D-PE and OD-PE) presented in Figs. 4(a) and (b), a diagonal element ($G_{i,i}$) and a lower off-diagonal element ($G_{i,j}$ where i > j) of the GRAM matrix are computed, respectively. As illustrated in Fig. 3, the GRAM matrix computation is performed during 159 (T159) clock cycles from T0 ($H_{1,1}$). The matched filter computation ($\hat{y} = H^H y$) is performed using the MF module in Fig. 2. Here, the ith processing element (MF01 to MF16) of the MF module generates the ith element of the $\hat{y}$ vector. The nth element of the mth column vector in the $H$ matrix (H_col_m) and the nth element of the $y$ vector are delivered to each processing element of the MF module, and those two elements ($\overline{h_{m,n}}$ and $y_n$) are multiplied. Then, each product term ($\overline{h_{m,n}} \cdot y_n$) is accumulated during B (= 128) clock cycles.
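The split between diagonal and lower off-diagonal processing elements relies on the Hermitian symmetry of $G = H^H H$: only the lower triangle has to be computed. The sketch below illustrates that idea in software; the functions `diag_pe` and `offdiag_pe` are hypothetical stand-ins named after the D-PE and OD-PE, and the 3 × 2 channel is an arbitrary example, so this is not a model of the systolic array timing.

```python
# Sketch: build G = H^H H from its lower triangle only (Hermitian symmetry).
# diag_pe / offdiag_pe are hypothetical stand-ins for the D-PE and OD-PE.

def diag_pe(col):
    """Diagonal entry G[i][i]: sum of squared magnitudes, always real."""
    return sum(h.real * h.real + h.imag * h.imag for h in col)

def offdiag_pe(col_i, col_j):
    """Lower entry G[i][j], i > j: sum over antennas of conj(h[b][i]) * h[b][j]."""
    return sum(a.conjugate() * b for a, b in zip(col_i, col_j))

def gram_lower(H):
    U = len(H[0])
    cols = [[row[u] for row in H] for u in range(U)]
    G = [[0j] * U for _ in range(U)]
    for i in range(U):
        G[i][i] = complex(diag_pe(cols[i]))
        for j in range(i):
            G[i][j] = offdiag_pe(cols[i], cols[j])
            G[j][i] = G[i][j].conjugate()   # upper triangle follows by symmetry
    return G

# Arbitrary 3-antenna, 2-user example channel (illustrative values only).
H = [[1 + 1j, 2 - 1j],
     [0.5j, 1 + 0j],
     [-1 + 0j, 1j]]
G = gram_lower(H)
for row in G:
    print(row)
```

Only U diagonal terms and U(U − 1)/2 off-diagonal terms are computed explicitly, which for U = 16 is 136 of the 256 entries.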

### Fig. 4. Hardware architecture of processing elements in the GRAM module (a) D-PE and (b) OD-PE.

B. Iteration Block

The computation process of the ith iteration ($s^{(i)}$) is illustrated in Fig. 5. First, each GS module (GS01 to GS16) computes $G \cdot s^{(i-1)}$ (= $H^H H \cdot s^{(i-1)}$) in (3). Then, 16 elements of $\omega \{ \hat{y} - ( G + \sigma^2 I ) s^{(i-1)} \}$ are simultaneously computed in one clock cycle. The constant $\omega$ is set to a value between $(0.5)^7$ and $(0.5)^{10}$, which provides the best convergence in our BER performance simulations. Finally, the adder array (AA) performs 16 additions ($s^{(i-1)} + \omega \{ \hat{y} - ( H^H H + \sigma^2 I ) s^{(i-1)} \}$) in one clock cycle. Each iteration of the iteration block requires 19 clock cycles (= 17 (GS module) + 1 (GS) + 1 (AA)) to complete. Because the outputs of the preprocessing module ($G$ matrix and $\hat{y}$ vector) can be reused from the second iteration, the total latency of n iterations is (19n + 144) clock cycles.

### 2. Proposed Reformulation and Iteration Skipping Approach Based on the Richardson Method

As presented in the previous section, the GRAM matrix multiplication ($H^H H$) is the most computationally intensive operation in the Richardson method. The GRAM matrix multiplication requires approximately 90.5% of the multiplications and additions in (3) when n = 5, B = 128, and U = 16. To reduce the computational complexity of the matrix-by-matrix multiplication, the ith iteration of the Richardson method in (3) can be reformulated as follows:

 $s^{(i)} = s^{(i-1)} + \omega \{ H^H ( y - H s^{(i-1)} ) - \sigma^2 \cdot s^{(i-1)} \} .$ (6)

Although (6) is equivalent to (3), the expensive matrix-by-matrix multiplication ($H^H H$) in (3) is replaced with the matrix-vector multiplication $H^H ( y - H s^{(i-1)} )$ in (6) by modifying the order of operations. The computational complexities of (3) and (6) are compared in Table 2. For example, when n = 5, B = 128, and U = 16, the computational complexity of the proposed reformulation (type2) can be reduced to approximately 58.5% of that of (3) (type1).
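The equivalence of (3) and (6) follows from $\hat{y} - H^H H s = H^H ( y - H s )$. As a quick numerical check, the sketch below evaluates one update both ways on an arbitrary 3 × 2 example; all numeric values, including the step size, are illustrative assumptions.

```python
# Sketch: one Richardson update computed as in (3) and as in (6).
# All numeric values are illustrative assumptions.

def hermitian(m):
    return [[m[r][c].conjugate() for r in range(len(m))] for c in range(len(m[0]))]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def step_type1(G, y_hat, s, omega, sigma2):
    """One update of (3), using the precomputed Gram matrix G = H^H H."""
    Gs = matvec(G, s)
    return [s_k + omega * (yh - gs - sigma2 * s_k)
            for s_k, yh, gs in zip(s, y_hat, Gs)]

def step_type2(H, HH, y, s, omega, sigma2):
    """One update of (6): two matrix-vector products, no Gram matrix."""
    Hs = matvec(H, s)
    r = [y_b - hs_b for y_b, hs_b in zip(y, Hs)]    # y - H s
    HHr = matvec(HH, r)                             # H^H (y - H s)
    return [s_k + omega * (v - sigma2 * s_k) for s_k, v in zip(s, HHr)]

H = [[1 + 1j, 0.5 - 1j],
     [0.2j, -1 + 0j],
     [2 + 0j, 1j]]
y = [1 - 1j, 0.5 + 0j, -2j]
s_prev = [0.3 + 0.1j, -0.2 + 0.4j]
omega, sigma2 = 0.5 ** 3, 0.1

HH = hermitian(H)
G = matmul(HH, H)
y_hat = matvec(HH, y)

s_a = step_type1(G, y_hat, s_prev, omega, sigma2)
s_b = step_type2(H, HH, y, s_prev, omega, sigma2)
max_diff = max(abs(a - b) for a, b in zip(s_a, s_b))
print(max_diff)
```

The two results agree up to floating-point rounding; only the order of operations, and therefore the operation count, differs.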

### Table 2. Computational complexity of detection methods.

| Method | Multiplications | Additions |
|---|---|---|
| Richardson (type1) | $(4B + 4n)U^2 + 2BU$ | $(4B + 4n - 2)U^2 + 2BU$ |
| Richardson (type2) | $(8BU + 2U)n$ | $(8BU + U)n$ |
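As a worked example, the operation counts of the two Richardson variants can be evaluated for the configuration quoted in the text (n = 5, B = 128, U = 16). The function names below are introduced only for this sketch.

```python
# Worked example: operation counts of the type1 and type2 Richardson detectors
# for the configuration quoted in the text.

def type1_ops(B, U, n):
    """Multiplications and additions of the conventional Richardson method (3)."""
    mults = (4 * B + 4 * n) * U ** 2 + 2 * B * U
    adds = (4 * B + 4 * n - 2) * U ** 2 + 2 * B * U
    return mults, adds

def type2_ops(B, U, n):
    """Multiplications and additions of the reformulated Richardson method (6)."""
    mults = (8 * B * U + 2 * U) * n
    adds = (8 * B * U + U) * n
    return mults, adds

B, U, n = 128, 16, 5
m1, a1 = type1_ops(B, U, n)
m2, a2 = type2_ops(B, U, n)

print("type1:", m1, "multiplications,", a1, "additions")
print("type2:", m2, "multiplications,", a2, "additions")
print("multiplication ratio (%):", round(100 * m2 / m1, 1))
```

For these parameters, type2 needs 82,080 multiplications against 140,288 for type1, a ratio of about 58.5%, in line with the figure quoted above. Note that the type2 count grows linearly with n, so the advantage narrows as the number of iterations increases.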

For a low-complexity hardware implementation of the Richardson iterative detector, another important issue is to reduce the number of iterations. In the conventional Richardson method [11], the initial solution $s^{(0)}$ is set to a U-dimensional zero vector. To force the BER curve to converge more quickly, a zone-based initial solution is proposed [12]. In the zone-based initial solution approach, the real and imaginary parts in the U-dimensional initial solution $s^{(0)}$ are determined by detecting the sign of $\mathrm{Re}[\tilde{s}_k]$ and $\mathrm{Im}[\tilde{s}_k]$ in (7).

 $\mathrm{Re\,(or\ Im)} [ \tilde{s}_k ] = \mathrm{Re\,(or\ Im)} [ \hat{y}_k ] - W_{k,k} \cdot z ,$ (7)

where $z$ is a constant integer and $W_{k,k}$ is the kth diagonal element of the MMSE filtering matrix $W$. The proposed iteration-skipping approach is derived from the initial solution that is set to the zero vector. If we assume that $s^{(0)} = 0$ in (6), the solution of the first iteration always becomes a constant-scaled matched filter output vector ($s^{(1)} = \omega H^H y$). Because the solution of the first iteration is fixed, we can directly set $s^{(0)} = \omega H^H y$ to skip the first iteration. In Fig. 6, the BER simulation results of the proposed initial iteration-skipping approach and conventional approaches are presented. In this figure, raw BER refers to “the BER before any detection algorithm is applied to restore the transmitted vector from the received signal vector y.” Owing to the skipping of the first iteration in Fig. 6, the BER result of the proposed approach after (n − 1) iterations is the same as that of the zero initial solution after n iterations. Compared to the zone-based initial solution with n iterations, the proposed method with n iterations shows minor BER performance degradation. However, in terms of computational cost, the zone-based initial solution approach requires U multiplications and 3U additions in (7), whereas the proposed approach requires only U additions for the constant multiplication of $\omega$ (= $(0.5)^7$ to $(0.5)^{10}$) and the matched filter output ($H^H y$).
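The iteration-skipping argument can be checked numerically: starting (6) from $\omega H^H y$ and running n − 1 iterations should give the same vector as starting from zero and running n iterations. The sketch below does this on an arbitrary 3 × 2 example; the channel, received vector, and noise power are illustrative assumptions, while $\omega = (0.5)^7$ is taken from the range quoted in the text.

```python
# Sketch: iteration skipping. Starting (6) from omega * H^H y and running
# n - 1 iterations matches n iterations from the zero vector.
# Channel, received vector, and noise power are illustrative assumptions.

def hermitian(m):
    return [[m[r][c].conjugate() for r in range(len(m))] for c in range(len(m[0]))]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def richardson_type2(H, y, s0, omega, sigma2, num_iter):
    """Run num_iter updates of (6) starting from s0."""
    HH = hermitian(H)
    s = list(s0)
    for _ in range(num_iter):
        r = [y_b - hs_b for y_b, hs_b in zip(y, matvec(H, s))]   # y - H s
        v = matvec(HH, r)                                        # H^H (y - H s)
        s = [s_k + omega * (v_k - sigma2 * s_k) for s_k, v_k in zip(s, v)]
    return s

H = [[1 + 1j, 0.5 - 1j],
     [0.2j, -1 + 0j],
     [2 + 0j, 1j]]
y = [1 - 1j, 0.5 + 0j, -2j]
omega, sigma2, n = 0.5 ** 7, 0.1, 4
U = len(H[0])

# Conventional start: zero vector, n iterations.
zero_start = richardson_type2(H, y, [0j] * U, omega, sigma2, n)

# Proposed start: scaled matched-filter output, n - 1 iterations.
s_init = [omega * v for v in matvec(hermitian(H), y)]
skipped = richardson_type2(H, y, s_init, omega, sigma2, n - 1)

max_diff = max(abs(a - b) for a, b in zip(zero_start, skipped))
print(max_diff)
```

Because $\omega$ is a power of two, forming the initial vector $\omega H^H y$ needs no general-purpose multiplier, which is why the skipped iteration comes at very little cost.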

### 3. Proposed Type2 Richardson Detector Architecture

Applying the proposed reformulation and the initial iteration-skipping approach, a low-complexity massive MIMO detector architecture based on the Richardson method (type2) is presented in Fig. 7(a). Here, a B × U = 128 × 16 massive MIMO uplink system employing 64-QAM modulation is considered. To determine the optimal parallel order of the hardware architecture, a trade-off between area and latency [19] is considered in Table 3. Because the parallel order of four has the smallest latency gate count product (M), the parallel order of four is employed in the proposed architecture. The timing diagram of the proposed architecture is presented in Fig. 7(b). First, $( y - H s^{(i-1)} )$ is computed using the HS and SUB (subtractor) modules; then, $H^H ( y - H s^{(i-1)} )$ is computed in the MAC (multiply-accumulate) module. Finally, the remaining $s^{(i-1)} + \omega \{ H^H ( y - H s^{(i-1)} ) - \sigma^2 \cdot s^{(i-1)} \}$ computation is performed in the subtractor array (SA) and adder array (AA). Because the latency of each iteration is 36 clock cycles as presented in Fig. 7(b), n iterations require 36n clock cycles.

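### Table 3. Comparison of latency gate count product for various parallel orders.

| Parallel order | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| Detection latency (cycles, n = 4) | 528 | 272 | 144 | 80 | 48 |
| Estimated gate count (@ 415 MHz) | 404K | 738K | 1.32M | 2.59M | 5.0M |
| Latency gate count product (M) | 213 | 201 | 191 | 207.2 | 240 |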
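The selection of the parallel order can be reproduced from the latency and gate count figures in Table 3. The short sketch below multiplies the two rows (gate counts expressed in millions) and picks the minimum; the variable names are introduced only for this illustration.

```python
# Worked example: latency x gate count product per parallel order (Table 3).
# Gate counts are expressed in millions of gates.

parallel_order = [1, 2, 4, 8, 16]
latency_cycles = [528, 272, 144, 80, 48]         # detection latency for n = 4
gate_count_m = [0.404, 0.738, 1.32, 2.59, 5.0]   # estimated gate count (M)

products = {
    p: lat * gates
    for p, lat, gates in zip(parallel_order, latency_cycles, gate_count_m)
}
best = min(products, key=products.get)

for p in parallel_order:
    print(p, round(products[p], 1))
print("smallest product at parallel order", best)
```

The product is smallest for a parallel order of four, which matches the choice made for the proposed architecture.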

### Fig. 7. Proposed Richardson method-based hardware architecture and its timing diagram: (a) overall architecture of the proposed Richardson-based massive MIMO detector, (b) timing diagram of the proposed Richardson-based massive MIMO detector, and (c) computation pattern of the MAC module.

A. Computations in HS and SUB modules

To compute $( y - H s^{(i-1)} )$, the HS modules and the following SUB modules are used. As illustrated in Fig. 8(a), each HS module (HS1 to HS4) performs a vector multiplication using a row of the $H$ matrix and the $s^{(i-1)}$ vector. In Fig. 8(a), the black multiplier and adder are a complex multiplier (four parallel real multipliers and two adders) and a complex adder ((the number of inputs − 1) × 2 real adders for the real and imaginary part computations), respectively. Because the $H$ matrix has 128 rows (= 4 × 32), four parallel vector multiplications of the HS modules are performed during 32 clock cycles, as presented in Fig. 7(b). During the 32 clock cycles, the input H_row_(n + 1) of the HS(n + 1) module (where n = 0 to 3) is set to the (32n + 1)th to (32n + 32)th rows of the $H$ matrix. Using four parallel rows of the $H s^{(i-1)}$ output and four elements of the $y$ vector (y_1 to y_4), four elements of the $( y - H s^{(i-1)} )$ vector are simultaneously computed in the SUB modules during the 32 clock cycles.

### Fig. 8. Submodules of the proposed Richardson-based massive MIMO detector: (a) HS module, SUB module, and MU of the MAC module and (b) accumulator of the MAC module.

B. Multiply-Accumulate Module

Using the output of the SUB modules, which is $( y - H s^{(i-1)} )$, the MAC module performs the matrix-vector multiplication $H^H ( y - H s^{(i-1)} )$. The hardware architectures of the multiply unit (MU) and accumulator (ACC) inside the MAC module are illustrated in Figs. 8(a) and 8(b), respectively. The timing diagram of the proposed type2 Richardson detector is also illustrated in Fig. 7(b). In each clock cycle (T01 to T32), the four parallel MUs (MU1 to MU4) output the 64 (= 4 MUs × 16 rows) product terms ($h_{m,n} \cdot Y_m$) presented in Fig. 7(c). Here, each product term ($h_{m,n} \cdot Y_m$) is the multiplication of the (m, n) element of the $H$ matrix ($h_{m,n}$) and the mth element of the $( y - H s^{(i-1)} )$ vector. To compute $H^H ( y - H s^{(i-1)} )$, the four product terms of each row (in Fig. 7(c)) must be accumulated using the ACC in Fig. 8(b). As presented in Fig. 7(c), the sum of the four product terms is repeatedly accumulated during the 32 clock cycles (T02 to T33). Specifically, four partial summation terms ($h_{k,m} \cdot Y_k$, $h_{32+k,m} \cdot Y_{32+k}$, $h_{64+k,m} \cdot Y_{64+k}$, $h_{96+k,m} \cdot Y_{96+k}$) and the accumulated summation term ACM(k − 1) in Fig. 8(b) are added at the kth clock cycle (k = 1 to 32).
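The accumulation pattern of the MAC module can be mimicked in software as a rough sketch: in cycle k, the four lanes contribute the rows k, 32 + k, 64 + k, and 96 + k, and 16 accumulators collect the partial sums. The synthetic channel and residual vector below are assumptions chosen only to make the check exact, and the conjugation required by $H^H$ is written out explicitly; the sketch does not model pipeline registers or the one-cycle offset between the MU and ACC stages.

```python
# Sketch: 4-lane, 32-cycle accumulation of H^H r, with r = y - H s.
# H and r are synthetic test data (assumptions), not channel measurements.

B, U, LANES = 128, 16, 4
ROWS_PER_LANE = B // LANES            # 32 clock cycles per matrix-vector product

H = [[complex((3 * b + 5 * u) % 7 - 3, (2 * b + u) % 5 - 2) for u in range(U)]
     for b in range(B)]
r = [complex(b % 4 - 1.5, b % 3 - 1) for b in range(B)]

def mac_schedule(H, r):
    """Accumulate H^H r the way the lanes are scheduled in the text."""
    acc = [0j] * U                    # one accumulator per user (16 rows)
    for k in range(ROWS_PER_LANE):    # one pass of this loop = one clock cycle
        for m in range(U):
            partial = 0j
            for lane in range(LANES):
                b = lane * ROWS_PER_LANE + k          # rows k, 32+k, 64+k, 96+k
                partial += H[b][m].conjugate() * r[b]
            acc[m] += partial         # ACM(k) = ACM(k - 1) + four product terms
    return acc

# Reference: direct evaluation of H^H r.
direct = [sum(H[b][m].conjugate() * r[b] for b in range(B)) for m in range(U)]
scheduled = mac_schedule(H, r)
max_diff = max(abs(a - b) for a, b in zip(direct, scheduled))
print(max_diff)
```

Each cycle produces 4 × 16 = 64 product terms, consistent with the count given in the text, and after 32 cycles every accumulator holds one element of $H^H ( y - H s^{(i-1)} )$.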

C. Remaining Computations in the ith Iteration

Using the $H^H ( y - H s^{(i-1)} )$ output of the MAC module, the SA modules, ω-scaling modules, and AA modules perform the remaining computations of the ith iteration. First, the SAs compute the subtraction of two 16-dimensional vectors, which is implemented using 32 (= 16 rows × 2 (real and imaginary parts)) real multipliers ($\sigma^2 \cdot s^{(i-1)}$) and 32 subtractors ($H^H ( y - H s^{(i-1)} ) - \sigma^2 \cdot s^{(i-1)}$). To remove the multiplication $\sigma^2 \cdot s^{(i-1)}$ from the critical path of the following subtraction $H^H ( y - H s^{(i-1)} ) - \sigma^2 \cdot s^{(i-1)}$, $\sigma^2 \cdot s^{(i-1)}$ is computed before the outputs of the MAC module are available. Then, to compute $\omega \{ H^H ( y - H s^{(i-1)} ) - \sigma^2 \cdot s^{(i-1)} \}$, a ω-scaling module is used. Because the relaxation parameter $\omega$ is set to a value between $(0.5)^7$ and $(0.5)^{10}$, each ω-scaling module can be implemented using only two adders (for the real and imaginary parts) as presented in Fig. 7(c). Finally, the AA performs the addition of the two 16-dimensional vectors ($s^{(i-1)} + \omega \{ H^H ( y - H s^{(i-1)} ) - \sigma^2 \cdot s^{(i-1)} \}$).

## IV. Numerical Results

### 1. Hardware Comparisons of Massive MIMO Detectors

Two types of Richardson massive MIMO detectors (type1 and type2) were implemented using a 65-nm CMOS standard library. In Table 4, the proposed Richardson detectors are compared with conventional detectors [8], [10], [17]. Those detectors are based on the same modulation order (64-QAM) and the same or similar dimensions ((B, U) = (128, 16), (128, 8)). The conventional works [10], [17] are FPGA-based designs, and their power results have not been reported. An ASIC-based soft-output Neumann series-based detector architecture (3-term) is proposed in [8], which demonstrates the highest throughput with the greatest hardware overhead in Table 4. To evaluate the tradeoff between power consumption and throughput, normalized power [22] is introduced. When we compare the proposed architectures with the Neumann series-based detector (3-term) [8], the proposed type1 and type2 architectures demonstrate 34% and 75% normalized power savings, respectively. Compared to the Neumann series-based detector (2-term) [10] with the same (B, U) = (128, 16), the proposed works with four iterations (n = 4) demonstrate lower latency and higher throughput. Though the conjugate gradient least square (CGLS)-based soft-output detector assumes a lower number of users (U = 8), the proposed works outperform the CGLS-based architecture in terms of both latency and throughput. When we compare the two proposed architectures assuming that the $H$ matrix is updated for every signal detection, the type2 architecture demonstrates superior latency, throughput, and area compared to the type1 architecture. However, in our observation, the better architectural choice between the two types of MIMO detectors depends on the channel conditions, which are discussed in the following two subsections.

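### Table 4. Hardware comparisons of massive MIMO detectors.

| Architecture | [10] | [17] | [8] | Proposed work (type1) | Proposed work (type2) |
|---|---|---|---|---|---|
| FPGA/ASIC | FPGA | FPGA | ASIC | ASIC | ASIC |
| Process | N/A | N/A | 45 nm | 65 nm | 65 nm |
| Algorithm | Neumann series (2-term) | CGLS | Neumann series (3-term) | RC (type1) | RC (type2) |
| (B, U) | (128, 16) | (128, 8) | (128, 8) | (128, 16) | (128, 16) |
| Modulation (QAM) | 64 | 64 | 64 | 64 | 64 |
| Max operating frequency (MHz) | 222.4 | 412 | 1,000 | 467 | 415 |
| Latency (cycles) | 246 | 951 (n = 3) | N/A | 220 (n = 4) | 144 (n = 4) |
| Max throughput (Mbps) | 128.64 | 20 | 3,800 | 203.778 | 276.66 |
| Area (gate count) | N/A | N/A | 12.6M | 2.21M | 1.32M |
| Power @ max. oper. freq. (mW) / Scaled power† | N/A | N/A | 8,000 | 892.2/406.1 | 477.8/217.5 |
| Normalized power* (nJ/bit) | N/A | N/A | 2.105 | 1.38 | 0.544 |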

† Scaled to 45 nm process technology [21].

* Normalized considering supply voltage and adopted technology [22].
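The latency and throughput entries of the two proposed detectors can be cross-checked with a short calculation. The latency expressions (19n + 144 and 36n cycles) are taken from Section III; the throughput relation used below, one detected vector of U 64-QAM symbols (6 bits each) per latency period, is an assumption of this sketch that happens to reproduce the tabulated values closely, not a formula stated in the paper.

```python
# Worked example: latency and throughput of the proposed detectors (Table 4).
# The throughput relation is an assumption of this sketch:
#   throughput = f_clk * U * bits_per_symbol / latency_cycles

def latency_type1(n):
    """Type1 latency in cycles: preprocessing plus 19 cycles per iteration."""
    return 19 * n + 144

def latency_type2(n):
    """Type2 latency in cycles: 36 cycles per iteration."""
    return 36 * n

def throughput_mbps(freq_mhz, users, bits_per_symbol, latency_cycles):
    return freq_mhz * users * bits_per_symbol / latency_cycles

n, U, bits = 4, 16, 6                 # four iterations, 16 users, 64-QAM
lat1, lat2 = latency_type1(n), latency_type2(n)
tp1 = throughput_mbps(467, U, bits, lat1)
tp2 = throughput_mbps(415, U, bits, lat2)

print("type1:", lat1, "cycles,", round(tp1, 2), "Mbps")
print("type2:", lat2, "cycles,", round(tp2, 2), "Mbps")
```

Under this assumption the computed throughputs fall within about 0.01 Mbps of the 203.778 Mbps and 276.66 Mbps reported for type1 and type2, respectively.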

### 2. Hardware Comparisons in the Case of Rayleigh Flat Fading Channel

Because the Rayleigh flat fading channel model has been widely used in massive MIMO research [23], the two proposed detectors were compared under a Rayleigh flat fading channel. In the Rayleigh flat fading channel, the estimated channel matrix $H$ is generally assumed to be invariant during an orthogonal frequency-division multiplexing (OFDM) symbol duration [23]. In Fig. 9(a), the computation process of the proposed architecture is presented during an OFDM symbol duration. Here, $T_{\mathrm{OFDM}}$ denotes the length of the OFDM symbol duration, $T_H$ is the time required to estimate the $H$ matrix, and $T_D$ is the processing time to detect all data symbols inside the OFDM symbol. Because an estimated $H$ matrix (obtained during the $H$ estimation phase $T_H$) can be used to detect all the data symbols ($D_1$, $D_2$, … , $D_{n-1}$, $D_n$), the GRAM matrix $G$ must be computed only once during $T_{\mathrm{OFDM}}$. If we use the proposed type1 architecture in the Rayleigh flat fading channel, the outputs of the GRAM module (in Fig. 2) for a fixed $H$ matrix can be reused during $T_D$. As presented in Fig. 9(b), the control logic of the GRAM module, which updates the channel information, is used to reduce the dynamic power in the proposed type1 detector. When the estimated $H$ matrix is considered invariant in the channel estimator [23], a logical “1” on the GRAM_HOLD bit is sent to the GRAM module of the type1 architecture. After the $G$ matrix is computed, the GRAM_HOLD bit “1” directs the GRAM module to hold and reuse the GRAM matrix results during $T_D$. In the case of the type2 detector, $s^{(i-1)}$ is updated every iteration, which means that the type2 detector must perform all the computations in (6) regardless of whether the $H$ matrix has changed.

### Fig. 9. (a) OFDM symbol structure and overall simulation process of the massive MIMO system for a flat fading channel and (b) control logic of the GRAM module for updating the channel.

To determine what type of the Richardson method-based detector is superior in terms of energy/power consumption, the simulation environments were set as follows:

1. The channel estimation was performed using a MATLAB simulator. Each element of the $H$ matrix was quantized to 15 bits.

2. In Fig. 9(a), $T_{\mathrm{OFDM}}$ was assumed to be 32 μs. The number of data subcarriers was 28 and the period of each subcarrier was 1 μs.

3. The simulation was performed for 40 consecutive OFDM symbol durations (for 1,120 received $y$ vectors).

The power consumption results in Table 5 were obtained from gate-level netlist simulations using PrimeTime PX [24]. In terms of energy/power consumption, the proposed type1 architecture indicates 61% lower energy consumption and 50% lower power consumption compared to the proposed type2 architecture, which is primarily due to the reuse of the GRAM matrix multiplication ($H^H H$) results in the type1 architecture.

### Table 5. Numerical results assuming Rayleigh flat fading channel condition.

| Channel | Rayleigh flat fading channel | Rayleigh flat fading channel |
|---|---|---|
| Method | Richardson type1 | Richardson type2 |
| Power @ operating freq. (mW) | 237 | 477.8 |
| Energy per effective bit (nJ/bit) | 5.5 | 11.1 |

### 3. Hardware Comparisons in the Case of High-Mobility Channel

When users are highly mobile, the wireless channel becomes time variant and frequency selective even within one OFDM symbol because of the Doppler spread effect [23]. To realize reliable signal detection, channel estimation and interpolation must be considered within an OFDM symbol as presented in Fig. 10. Because the H matrix is time varying under a high-mobility channel, the H matrix must be periodically updated by an H interpolation process [25].

### Fig. 10. OFDM symbol structure and overall simulation process of the massive MIMO system for a high-mobility channel.

Assuming high-mobility channel conditions, the BER performance and hardware cost of the MIMO detectors can be considered. Fig. 11 displays the BER simulation results of the proposed type1 and type2 detectors under a high-mobility channel, where the center frequency $f_0$ is 2.4 GHz and the user velocity v = 500 km/h [26]. Except for the Doppler spread effect, the BER simulation environment is essentially the same as the one presented in Section IV.2. In Fig. 11, as the computation process of the type2 architecture remains the same regardless of the $H$ matrix update, its detection performance continues to be reliable. However, because the GRAM module of the type1 architecture updates the $G$ matrix only when the $H$ matrix is updated, the BER performance of the type1 architecture improves with an increasing number of $H$ matrix updates. In Fig. 11, type1 indicates a BER comparable to that of type2 for six $H$ matrix updates during one OFDM symbol duration. As presented in Fig. 11, to maintain the BER performance under a higher center frequency (for example, 5 GHz or 10 GHz) or a higher velocity, the channel matrix $H$ must be updated more frequently. In this situation, the type2 architecture is expected to demonstrate considerably improved BER performance and energy efficiency compared to the type1 architecture. Table 6 presents the power/energy consumption of the proposed architectures under a high-mobility channel. Under the same BER performance (type2 and type1 with six updates), type2 indicates a 37% power savings compared to type1. Based on the numerical results presented in this work, we can conclude that the type1 massive MIMO detector demonstrates improved energy efficiency under the Rayleigh flat fading channel; however, the proposed type2 detector displays superior energy efficiency under the high-mobility channel condition.

### Table 6. Numerical results assuming high-mobility channel condition.

Channel: Rayleigh flat fading channel

| Richardson type | Power @ operating freq. (mW) | Energy per effective bit (nJ/bit) |
|---|---|---|
| Type2 | 477.8 | 11.1 |
| Type1 with a static channel | 237 | 5.5 |
| Type1 with 3 time update | 353.3 | 8.2 |
| Type1 with 6 time update | 758.4 | 17.62 |
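The power savings quoted in the text can be recomputed from the power column of the table above. The helper `saving_percent` is introduced only for this sketch.

```python
# Worked example: relative power savings from the reported power figures (mW).

def saving_percent(power_low, power_high):
    """Power saved by the lower-power design relative to the higher-power one."""
    return 100.0 * (1.0 - power_low / power_high)

power_mw = {
    "type2": 477.8,
    "type1_static": 237.0,
    "type1_3_updates": 353.3,
    "type1_6_updates": 758.4,
}

# High-mobility case: type2 against type1 with six H-matrix updates.
high_mobility = saving_percent(power_mw["type2"], power_mw["type1_6_updates"])

# Static (flat fading) case: type1 against type2.
flat_fading = saving_percent(power_mw["type1_static"], power_mw["type2"])

print(round(high_mobility, 1), round(flat_fading, 1))
```

The two values come out at roughly 37% and 50%, matching the power savings stated for the high-mobility and flat fading channels.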

## V. Conclusion

This paper proposed a low-complexity massive MIMO detector based on the Richardson iterative method. First, we presented a massive MIMO detector architecture (type1) based on the conventional Richardson iteration, which demonstrated considerably reduced detection latency compared to the conventional Neumann series-based architecture. Then, an efficient reformulation of the Richardson iteration was proposed to reduce the computational complexity (type2) and its hardware architecture was presented. To compare both architectures in terms of energy/power consumption, two extreme channel conditions were considered. In the Rayleigh flat fading channel condition, the type1 detector achieved a 50% power reduction compared to the type2 detector. Under a high-mobility channel condition, the type2 architecture indicated a 37% power saving compared to the type1 architecture.

## Footnotes

Byunggi Kang (byunggi.kang@gmail.com), Ji-Hwan Yoon (improma@korea.ac.kr), and Jongsun Park (corresponding author, jongsun@korea.ac.kr) are with the School of Electrical Engineering, Korea University, Seoul, Rep. of Korea.

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (2016R1A2B4015329).

### References

[1] H. Sampath et al., “A Fourth-Generation MIMO-OFDM Broadband Wireless System: Design, Performance, and Field Trial Results,” IEEE Commun. Mag., vol. 40, no. 9, Sept. 2002, pp. 143–149.

[2] S. Sesia, I. Toufik, and M. Baker, “Introduction and Background,” LTE, the UMTS Long Term Evolution: From Theory to Practice, 1st ed, New York, USA: Wiley, 2009, pp. 1–21.

[3] 3rd Generation Partnership Project; Technical Specification Group Radio Access Network; Evolved Universal Terrestrial Radio Access (EUTRA); Physical Layer Procedures (Release 10), TS 36.213 version 10.10.0, July 2013.

[4] IEEE Draft Standard Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) specifications: Amendment 4: Enhancements for Higher Throughput, P802.11n D3.00, Sept. 2007.

[5] K. Zheng et al., “Survey of Large-Scale MIMO Systems,” IEEE Commun. Surveys Tutorials, vol. 17, no. 3, 2015, pp. 1738–1760.

[6] F. Rusek et al., “Scaling up MIMO: Opportunities and Challenges with Very Large Arrays,” IEEE Signal Process. Mag., vol. 30, no. 1, Jan. 2013, pp. 40–60.

[7] M. Wu et al., “Large-Scale MIMO Detection for 3GPP LTE: Algorithms and FPGA Implementations,” IEEE J. Sel. Topics Signal Process., vol. 8, no. 5, Oct. 2014, pp. 916–929.

[8] X. Qin, Z. Yan, and G. He, “A Near-Optimal Detection Scheme Based on Joint Steepest Descent and Jacobi Method for Uplink Massive MIMO Systems,” IEEE Commun. Lett., vol. 20, no. 2, Feb. 2016, pp. 276–279.

[9] T.L. Marzetta, “Noncooperative Cellular Wireless with Unlimited Numbers of Base Station Antennas,” IEEE Trans. Wireless Commun., vol. 9, no. 11, Nov. 2010, pp. 3590–3600.

[10] M. Wu et al., “Approximate Matrix Inversion for High-Throughput Data Detection in the Large-Scale MIMO Uplink,” IEEE Int. Symp. Circuits Syst., Beijing, China, May 19–23, 2013, pp. 2155–2158.

[11] X. Gao et al., “Low-Complexity Near-Optimal Signal Detection for Uplink Large-Scale MIMO Systems,” IET Electron. Lett., vol. 50, no. 18, Aug. 2014, pp. 1326–1328.

[12] X. Gao et al., “Low-Complexity MMSE Signal Detection Based on Richardson Method for Large-Scale MIMO Systems,” IEEE Veh. Technol. Conf., Seoul, Rep. of Korea, Sept. 14–17, 2014, pp. 1–5.

[13] X. Gao et al., “Matrix Inversion-Less Signal Detection Using SOR Method for Uplink Large-Scale MIMO Systems,” IEEE Global Commun. Conf., Austin, TX, USA, Dec. 8–12, 2014, pp. 3291–3295.

[14] L. Dai et al., “Low-Complexity Soft-Output Signal Detection Based on Gauss–Seidel Method for Uplink Multiuser Large-Scale MIMO Systems,” IEEE Trans. Veh. Technol., vol. 64, no. 10, Oct. 2015, pp. 4839–4845.

[15] J. Zhou, Y. Ye, and J. Hu, “Biased MMSE Soft-Output Detection Based on Jacobi Method in Massive MIMO,” IEEE Int. Conf. Commun. Problem-Solving, Beijing, China, Dec. 5–7, 2014, pp. 442–445.

[16] B. Yin et al., “Conjugate Gradient-Based Soft-Output Detection and Precoding in Massive MIMO Systems,” IEEE Global Commun. Conf., Austin, TX, USA, Dec. 8–12, 2014, pp. 3696–3701.

[17] B. Yin et al., “VLSI Design of Large-Scale Soft-Output MIMO Detection Using Conjugate Gradients,” IEEE Int. Symp. Circuits Syst., Melbourne, Australia, May 24–27, 2014, pp. 1498–1501.

[18] J. Lin et al., “Low-Complexity High-Throughput QR Decomposition Design for MIMO Systems,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 10, Oct. 2015, pp. 2342–2346.

[19] I. Park and S. Kang, “Scheduling Algorithm for Partially Parallel Architecture of LDPC Decoder by Matrix Permutation,” IEEE Int. Symp. Circuits Syst., Kobe, Japan, May 23–26, 2005, pp. 5778–5781.

[20] B. Yin et al., “A 3.8 Gb/s Large-Scale MIMO Detector for 3GPP LTE-Advanced,” 2014 IEEE Int. Conf. Acoustics, Speech Signal Process., Florence, Italy, May 4–9, 2014, pp. 3879–3883.

[21] M.S. Khairy et al., “Algorithms and Architectures of Energy-Efficient Error-Resilient MIMO Detectors for Memory-Dominated Wireless Communication Systems,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 61, no. 7, July 2014, pp. 2159–2171.

[22] C. Liao, T. Wang, and T. Chiueh, “A 74.8 mW Soft-Output Detector IC for 8 × 8 Spatial-Multiplexing MIMO Communications,” IEEE J. Solid-State Circuits, vol. 45, no. 2, Feb. 2010, pp. 411–421.

[23] D. Yue, Y. Zhang, and Y. Jia, “Large-Scale SIMO Systems Based on Specular Component in Correlated Rician Fading Environments,” Int. Conf. Instrum. Meas., Comput., Commun. Contr., Harbin, China, Sept. 18–20, 2014, pp. 840–844.

[24]

Synopsys PrimeTime User’s Manual. http://www.synopsys.com

[25]

T. Zijian et al., “Pilot-Assisted Time-Varying Channel Estimation for OFDM Systems,” IEEE Trans. Signal Process., vol. 55, no. 5, May 2007, pp. 2226–2238.

[26]

C. Wang et al., “Cellular Architecture and Key Technologies for 5G Wireless Communication Networks,” IEEE Commun. Mag., vol. 52, no. 2, Feb. 2014, pp. 122–130.


### Computational complexities of detection methods.

| Method | | | |
|---|---|---|---|
| Neumann [10] (3-terms) | 4U³ + (2B + 1)·U² + (4B − 1)U | 4U³ + 2BU² + 4BU | U |
| Neumann [10] (2-terms) | (2B + 1)U² + (4B − 1)U | 2BU² + 4BU | U |
| Richardson [11] | (4B + 4n)U² + 2BU | (4B + 4n − 2)U² + 2BU | 0 |
| SOR [13] | (4B + 4n − 2)U² + 2(Bn + 1)U | (4B + 4n − 3)U² + 2(B − 3n)U + 2n − 2 | 2nU |
| Gauss-Seidel [14] (SOR, ω = 1) | (4B + 4n − 2)U² + 2(B − 2n + 1)U | (4B + 4n − 4)U² + 2(B − 4n)U + 2n − 2 | 2nU |
| Jacobi [15] | (4B + 4n + 1)U² + 2BU | (4B + 4n − 2)U² + 2(Bn)U | U |

### Computational complexity of detection methods.

| Method | | |
|---|---|---|
| Richardson (type1) | (4B + 4n)U² + 2BU | (4B + 4n − 2)U² + 2BU |
| Richardson (type2) | (8BU + 2U)n | (8BU + U)n |
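To get a feel for the gap between the two formulations, the expressions listed for Richardson (type1) and Richardson (type2) can be evaluated numerically. The following is a minimal Python sketch, not part of the original article. It assumes the configuration used elsewhere on this page (B = 128, U = 16, n = 4), and it treats the two tabulated expressions simply as "first" and "second" operation counts, since the column labels are not available here (they are presumably multiplication and addition counts). The function names are illustrative.

```python
def richardson_type1(B, U, n):
    """Evaluate the two tabulated expressions for Richardson (type1)."""
    first = (4 * B + 4 * n) * U**2 + 2 * B * U
    second = (4 * B + 4 * n - 2) * U**2 + 2 * B * U
    return first, second


def richardson_type2(B, U, n):
    """Evaluate the two tabulated expressions for Richardson (type2)."""
    first = (8 * B * U + 2 * U) * n
    second = (8 * B * U + U) * n
    return first, second


# Assumed configuration: 128 BS antennas, 16 users, 4 iterations.
B, U, n = 128, 16, 4
type1_counts = richardson_type1(B, U, n)
type2_counts = richardson_type2(B, U, n)

print("type1:", type1_counts)  # (139264, 138752)
print("type2:", type2_counts)  # (65664, 65600)
print("type2 / type1:", type2_counts[0] / type1_counts[0])
```

For this configuration, the type2 counts are a little under half of the type1 counts. Because the type2 expressions grow linearly in n while the type1 expressions carry only a 4nU² term, the advantage narrows as the iteration count increases.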

### Comparison of latency-gate count product for various parallel orders.

| Parallel order | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| Detection latency (cycles, n = 4) | 528 | 272 | 144 | 80 | 48 |
| Estimated gate count (@415 MHz) | 404K | 738K | 1.32M | 2.59M | 5.0M |
| Latency-gate count product (M) | 213 | 201 | 191 | 207.2 | 240 |
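The product row can be re-derived from the latency and gate count rows. The sketch below is a minimal Python illustration, assuming gate counts are expressed in millions (for example, 404K = 0.404M); the variable names are illustrative and not taken from the article.

```python
# Values from the table; gate counts converted to millions of gates.
parallel_orders = [1, 2, 4, 8, 16]
latency_cycles = [528, 272, 144, 80, 48]
gate_count_m = [0.404, 0.738, 1.32, 2.59, 5.0]

# Latency-gate count product for each parallel order.
products = [lat * gates for lat, gates in zip(latency_cycles, gate_count_m)]

# The parallel order with the smallest product is the best trade-off
# under this metric.
best_order = parallel_orders[products.index(min(products))]

for order, product in zip(parallel_orders, products):
    print(f"parallel order {order:2d}: product = {product:.1f} M")
print("best parallel order:", best_order)
```

With the rounded gate counts listed, the product for parallel order 4 comes out near 190 rather than the tabulated 191, presumably because the table was computed from unrounded gate counts. Parallel order 4 is the minimum either way.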

### Hardware comparisons of massive MIMO detectors.

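| Architecture | [10] | [17] | [8] | Proposed work (type1) | Proposed work (type2) |
|---|---|---|---|---|---|
| FPGA/ASIC | FPGA | FPGA | ASIC | ASIC | ASIC |
| Process | N/A | N/A | 45 nm | 65 nm | 65 nm |
| Algorithm | Neumann series (2-term) | CGLS | Neumann series (3-term) | RC (type1) | RC (type2) |
| (B, U) | (128, 16) | (128, 8) | (128, 8) | (128, 16) | (128, 16) |
| Modulation (QAM) | 64 | 64 | 64 | 64 | 64 |
| Max operating frequency (MHz) | 222.41 | 412 | 1,000 | 467 | 415 |
| Latency (cycles) | 246 | 951 (n = 3) | N/A | 220 (n = 4) | 144 (n = 4) |
| Max throughput (Mbps) | 128.64 | 20 | 3,800 | 203.778 | 276.66 |
| Area (gate count) | N/A | N/A | 12.6M | 2.21M | 1.32M |
| Power @ max. oper. freq. (mW) / Scaled power† | N/A | N/A | 8,000 | 892.2 / 406.1 | 477.8 / 217.5 |
| Normalized power* (nJ/bit) | N/A | N/A | 2.105 | 1.38 | 0.544 |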

† Scaled to 45 nm process technology [21].

* Normalized considering supply voltage and adopted technology [22].
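The maximum throughput figures of the proposed detectors appear consistent with a simple relation: one detection of U users' symbols completes every "latency" clock cycles, and each 64-QAM symbol carries 6 bits. This relation is inferred from the tabulated numbers rather than stated in the table, so the sketch below should be read as a plausibility check under that assumption; the function name is illustrative.

```python
import math


def throughput_mbps(f_max_mhz, latency_cycles, users, qam_order):
    """Throughput in Mbps, assuming one U-user detection per latency period."""
    bits_per_symbol = math.log2(qam_order)
    return f_max_mhz / latency_cycles * users * bits_per_symbol


# Proposed work, (B, U) = (128, 16), 64-QAM, n = 4.
type1 = throughput_mbps(467, 220, 16, 64)   # roughly 203.78 Mbps
type2 = throughput_mbps(415, 144, 16, 64)   # roughly 276.67 Mbps

print(f"type1: {type1:.2f} Mbps")
print(f"type2: {type2:.2f} Mbps")
```

Both values agree with the tabulated 203.778 Mbps and 276.66 Mbps to within rounding, which suggests the shorter latency of type2 more than compensates for its lower clock frequency.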

### Numerical results assuming Rayleigh flat fading channel condition.

| Method | Richardson type1 | Richardson type2 |
|---|---|---|
| Channel | Rayleigh flat fading channel | Rayleigh flat fading channel |
| Power @ operating freq. (mW) | 237 | 477.8 |
| Energy per effective bit (nJ/bit) | 5.5 | 11.1 |

### Numerical results assuming high-mobility channel condition.

Channel: Rayleigh flat fading channel

| Richardson type | Power @ operating freq. (mW) | Energy per effective bit (nJ/bit) |
|---|---|---|
| Type2 | 477.8 | 11.1 |
| Type1 with a static channel | 237 | 5.5 |
| Type1 with 3 time update | 353.3 | 8.2 |
| Type1 with 6 time update | 758.4 | 17.62 |
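The power savings quoted in the abstract can be checked against the tabulated power figures. The following minimal Python sketch assumes that "savings" means the relative reduction in power with respect to the other detector; the helper name is illustrative.

```python
def power_saving_percent(power_mw, baseline_mw):
    """Relative power reduction of `power_mw` versus `baseline_mw`, in percent."""
    return (1 - power_mw / baseline_mw) * 100


# Flat fading (static channel): type1 compared with type2.
flat_saving = power_saving_percent(237, 477.8)

# High mobility (type1 with 6 time update): type2 compared with type1.
mobility_saving = power_saving_percent(477.8, 758.4)

print(f"type1 saving under flat fading:   {flat_saving:.1f}%")
print(f"type2 saving under high mobility: {mobility_saving:.1f}%")
```

The results, about 50.4% and 37.0%, match the "up to 50%" and "37%" figures stated in the abstract. The same numbers also show that type1 loses its advantage somewhere between 3 and 6 updates, since 353.3 mW is below and 758.4 mW is above the 477.8 mW of type2.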