I. Introduction
MIMO (multiple input multipleoutput) technology [1] has been widely adopted in many wireless communication standards including 3GPP LTE, LTEAdvanced, and IEEE 802.11n [2]–[4]. As conventional MIMO systems are approaching their throughput limits, multiuser (MU) MIMO and massive (or largescale) MIMO have become the most promising candidates for greater amounts of data transmission over wireless networks. In massive MIMO systems, the base stations (BSs) are equipped with hundreds of antennas serving tens of users or mobile stations (MSs) [5]. It has been theoretically proven that a massive MIMO system can increase the spectrum efficiency and energy efficiency of a wireless channel by approximately two orders and three orders of magnitude, respectively [6], [7]. The simplest linear detectors typically demonstrate nearoptimal performance [8]. However, owing to the large number of antennas at a BS, signal detection is one of the most complex and essential components in the uplink of massive MIMO systems [9], [10].
The hardware implementation of a massive MIMO detector is of particular interest [5]. Owing to a high computational complexity, optimal massive MIMO detectors such as the maximumlikelihood (ML) detector are considered virtually infeasible [5]. The previous massive MIMO detection approaches were based primarily on the approximate matrix inversion [10] or iterative methods for solving a system of linear equations [11]–[17]. The hardware architectures have typically adopted Neumann seriesbased approximate inversion [5], [10], where the computational complexity is significantly reduced with reduced detection performance compared to stationary iterative approaches [11]–[16]. Although many detection methods and architectures of massive MIMO detectors have been proposed in the literature, a comprehensive analysis on the detection performance and hardware cost has never been reported.
In this paper, we present a lowcomplexity massive MIMO detector architecture based on the Richardson iterative method. With a comprehensive literature survey, we compared different iterative methods in terms of detection performance and computational complexity. Based on this analysis, the Richardson iterative method was selected as the most hardwarefriendly approach with a reasonable BER performance. First, we present an efficient hardware architecture based on the conventional Richardson iteration (type1). In the conventional architecture, it is found that the complex matrixbymatrix multiplications in the Richardson iterative method can be reformulated to relatively simpler matrix vector multiplications with considerably reduced complexity. Based on reformulations, a lowcomplexity Richardson methodbased massive MIMO detector (type2) is proposed. Both types of massive MIMO detectors are implemented and compared in terms of detection performance and hardware cost (power/energy consumption) under the following two wireless channel conditions: highmobility channel (timevariant and frequency selective channel) and flat fading channel.
II. Background
1. Uplink System Model
A massive (or largescale) MU MIMO uplink system with B antennas at BS that communicates with U (≪ B) singleantenna users can be modeled as follows:
(1) 
Here, y corresponds to a Bdimensional complexvalued symbol vector received at the BS, H is a complexvalued B × U (tall and skinny) uplink channel matrix, s is a Udimensional complexvalued transmitted symbol vector that contains the transmitted symbols for all U users, and n represents additive noise at the BS.
For signal detection in the uplink system, the transmitted symbol vector s can be estimated by minimum mean square error (MMSE) detection as presented in (2).
(2) 
where ŝ is the estimate of the transmitted symbol vector, σ^{2} is the additive white Gaussian noise power, I is a U × U identity matrix, W = H^{H}H + σ^{2}I denotes the MMSE filtering matrix, and ŷ = H^{H} y is the matchedfilter output.
2. Approximate Matrix Inversion and Iterative Signal Detection Methods
Because the exact inverse of the W matrix in (2) incurs a significant number of computations with O(U^{3}), an approximate matrix inversion using Neumann series terms [8] is proposed to reduce the computational cost. Using only the first two terms of the Neumann series, the computational complexity is reduced from O(U^{3}) to O(U^{2}) [10]. However, the BER performance of the Neumann series approximation with two terms is considerably less than that of the exact matrix inversion [11], [13], [14]. For lowcomplexity and nearoptimal signal detection of massive MIMO systems, stationary iterative methods have been proposed [11]–[17]. In stationary iterative methods, the Udimensional transmitted symbol vector is approximated to a solution of the ith iteration. The iterative solution of the Richardson [11], [12], successive overrelation (SOR) [13], [14], and Jacobibased [15] iterative methods are presented in the following:
(3) 
(4) 
(5) 
where s^{(i)} is the estimate of the transmitted symbol vector at the ith iteration, ω is a constant scale factor, D is the main diagonal of the MMSE filtering matrix W = H^{H}H + σ^{2}I , L is the lower triangular matrix of W, and E is the offdiagonal of W (E = W − D).
First, to identify a hardwarefriendly and highly accurate detection method, approximate matrix inversion and iterative detection methods were compared in terms of both BER performance and computational complexity; the BER results are presented in Fig. 1. To compare the BER performances of various approaches under the same condition, a B × U = 128 × 16 massive MIMO uplink system with Rayleigh flat fading channel model and 64QAM modulation [8], [10], [17] were division operation, the Richardson method was selected as the assumed. Compared to the Neumann series approximation using 2terms, the Richardson, SOR, and GaussSeidel iterative methods demonstrated considerably superior BER performances when the number of iterations n = 5. Table 1 also indicates the computational complexity comparisons among the different massive MIMO detection methods. In terms of multiplication and addition, the Neumann (2term), Richardson, SOR, GaussSeidel, and Jacobi methods demonstrated similar complexities of O(U^{2}). Among the operations (multiplication, additions and division) indicated in Table 1, division was the most expensive operation. Complex division requires an approximately 9times greater area [18] than complex multiplication. Because the remainder of the operations (multiplication and addition) are comparable and the Richardson method is the only one that does not require a lowcomplexity approach in this paper.
Table 1.
Method  Multiplication  Addition  Division 

Neumann [10] (3terms)  4U^{3} + (2B + 1)·U^{2} + (4B − 1)U  4U^{3} + 2BU^{2} + 4BU  U 
Neumann [10] (2terms)  (2B + 1)U^{2} + (4B − 1)U  2BU^{2} + 4BU  U 
Richardson [11]  (4B + 4n)U^{2} + 2BU  (4B + 4n − 2)U^{2} + 2BU  0 
SOR [13]  (4B + 4n − 2)U^{2} + 2(B − n + 1)U  (4B + 4n − 3)U^{2} + 2(B − 3n)U + 2n − 2  2nU 
GaussSeidel [14] (SOR, ω = 1)  (4B + 4n − 2)U^{2} + 2(B − 2n + 1)U  (4B + 4n − 4)U^{2} + 2(B − 4n)U + 2n − 2  2nU 
Jacobi [15]  (4B + 4n + 1)U^{2} + 2BU  (4B + 4n − 2)U^{2} + 2(B − n)U  U 
III. Architectures of Massive MIMO Detector Based on Conventional and Proposed Richardson Iterative Methods
1. Hardware Architecture Based on the Conventional Richardson Iterative Method
Figure 2 presents the overall architecture of the proposed massive MIMO detector based on (3) (type1). Here, a B × U = 128 × 16 massive MIMO uplink system employing 64QAM modulation is assumed. As indicated in the figure, the architecture is divided primarily into a preprocessing block and iteration block.
A. Preprocessing Block
The preprocessing block is composed of the GRAM and MF (Matched Filter) modules as presented in Fig. 3. Each row (G_ROW01, G_ROW02, … , G_ROW15, G_ROW16) of the systolic array is used to compute each row of the lower triangular part of the 16 × 16 G (= H^{H}H) matrix. First, using the two types of processing elements (DPE and ODPE) presented in Figs. 4(a) and (b), a diagonal element (G_{ii}) and a lower diagonal element (G_{ij} where i > j) of the GRAM matrix are computed, respectively. As illustrated in Fig. 3, the GRAM matrix computation is performed during 159 (T159) clock cycles from T0 (H_{1,1}). The matched filter computation (ŷ = H^{H} y ) is performed using the MF module in Fig. 2. Here, the ith processing element (MF01 to MF16) of the MF module generates the ith element of the ŷ vector. The nth element of the mth column vector in the H matrix (H_col_m) and the nth element of the y vectors are delivered to each processing element of the MF module and those two elements ( $\overline{{h}_{m,n}}$ and y_{n}) are multiplied. Then, each multiplied term ( $\overline{{h}_{m,n}}$ ·y_{n} ) is accumulated during B (= 128) clock cycles.
B. Iteration Block
The computation process of the ith iteration (s^{(i)}) is illustrated in Fig. 5. First, each GS module (GS01 to GS16) computes G·S^{(i−1)} (= H^{H}H·S^{(i−1)} ) in (3). Then, 16 elements of ω{ŷ − (G + σ^{2}I)S^{(i−1)}} are simultaneously computed in one clock cycle. A constant ω is set to (0.5)^{7} to (0.5)^{10}, which provides the best convergence in our BER performance simulations. Finally, Adder Array (AA) performs 16 additions (S^{(i−1)} + ω{ŷ − (H^{H}H + σ^{2}I)S^{(i−1)}}) in one clock cycle. Each iteration of the iteration block requires 19 clock cycles (= 17 (GS module) + 1 (GS) + 1 (AA)) to complete. Because the outputs of the preprocessing module (G matrix and ŷ vector) can be reused from the second iteration, the total latency of n iterations is (19n + 144) clock cycles.
2. Proposed Reformulation and Iteration Skipping Approach Based on the Richardson Method
As presented in the previous section, the GRAM matrix multiplication (H^{H}H) is the most computationally intensive operation in the Richardson method. The GRAM matrix multiplication requires approximately 90.5% of the multiplications and additions in (3) when n = 5, B = 128, and U = 16. To reduce the computational complexity of the matrixbymatrix multiplication, the ith iteration of the Richardson method in (3) can be reformulated as follows:
(6) 
Although (6) is equivalent to (3), the expensive matrixbymatrix multiplication (H^{H}H) in (3) is replaced with matrixvector multiplication (H^{H} (y − Hs^{(i−1)}) in (6) by modifying the order of operations. The computational complexities of (3) and (6) are compared in Table 2. For example, when n = 5, B = 128, and U = 16, the computational complexity of the proposed reformulation (type2) can be reduced to approximately 58.5% of (3) (type1).
Table 2.
Method  Multiplication  Addition 

Richardson (type1)  (4B + 4n)U^{2} + 2BU  (4B + 4n − 2)U^{2} + 2BU 
Richardson (type2)  (8BU + 2U)n  (8BU + U)n 
For a lowcomplexity hardware implementation of the Richardson iterative detector, another important issue is to reduce the number of iterations. In the conventional Richardson method [11], the initial solution s^{(0)} is set to a Udimensional zero vector. To force the BER curve to converge more quickly, a zonebased initial solution is proposed [12]. In the zonebased initial solution approach, the real and imaginary parts in the Udimensional initial solution s^{(0)} are determined by detecting the sign of $\text{Re}[{\tilde{s}}_{k}]$ and $\text{Im}[{\tilde{s}}_{k}]$ in (7).
(7) 
where z is a constant integer and W_{k,k} is the kth diagonal element of the MMSE filtering matrix W. The proposed iteration skippingapproach is derived from the initial solution that is set to the zero vector. If we assume that s^{(0)} = 0 in (6), the solution of the 1st iteration always becomes a constant scaled matched filter output vector (s^{(0)} = ωH^{H}y). Because the solution of the first iteration is fixed, we can directly assume that s^{(0)} = ωH^{H}y to skip the first iteration. In Fig. 6, the BER simulation results of the proposed initial iteration skipping approach and conventional approaches are presented. In this figure, raw BER refers to “the BER before any detection algorithm is applied to restore the transmitted vector from the received signal vector y.” Owing to the skip of the first iteration in Fig. 6, the BER result of the proposed approach with the (n − 1)th iteration is the same as the nth iteration of the zero initial solution. Compared to the zonebased initial solution with the nth iteration, the proposed method with the nth iteration indicates minor BER performance degradation. However, in terms of computational cost, the zonebased initial solution approach requires U multiplications and 3U additions in (7), whereas the proposed approach requires only U additions for the constant multiplication of ω (= (0.5)^{7} to (0.5)^{10}) and the matched filter output (H^{H}y).
3. Proposed Type2 Richardson Detector Architecture
Applying the proposed reformulation and the initial iteration skipping approach, a lowcomplexity massive MIMO detector architecture based on the Richardson method (type2) is presented in Fig. 7(a). Here, a B × U = 128 × 16 massive MIMO uplink system employing 64QAM modulation is considered. To determine the optimal parallel order of the hardware architecture, a tradeoff between area and latency [19] is considered in Table 3. Because the parallel order of four has the smallest latency gate count product (M), the parallel order of four is employed in the proposed architecture. The timing diagram of the proposed architecture is presented in Fig. 7(b). (y − Hs^{(1−1)}) is first computed using HS and SUB (Subtractor) modules, then H^{H} (y − Hs^{(1−1)}) is computed in MAC (Multiplyaccumulate). Finally, the remaining s^{(1−1)} + ω{H^{H} (y − Hs^{(1−1)}) − σ^{2} · s^{(1−1)}} computation is performed in subtractor array (SA) and adder array (AA). Because the latency of each iteration is 36 clock cycles as presented in Fig. 7(b), n iterations require 36n clock cycles.
Table 3.
Parallel order  1  2  4  8  16 

Detection latency (cycles, n = 4)  528  272  144  80  48 
Estimated gate count (@415 Mhz)  404K  738K  1.32M  2.59M  5.0M 
Latency gate count product (M)  213  201  191  207.2  240 
A. Computations in HS and SUB modules
To compute (y − Hs^{(1−1)}), the HS and following SUB modules are used. As illustrated in Fig. 8(a), each HS module (HS1 to HS4) performs a vector multiplication using a row of the H matrix and s^{(1−1)} vector. In Fig. 8(a), the black multiplier and adder are a complex multiplier (four parallel real multipliers and two adders) and a complex adder (the number of inputs − 1) × 2 real adders for the real and imaginary parts computation). Because the H matrix has 128 rows (= 4 × 32), four parallel vector multiplications of the HS modules are performed during 32 clock cycles, which is presented in Fig. 7(b). During the 32 clock cycles, each input (H_row_n + 1) of HSn + 1 (where n = 0 to 3) module is set to the (32n + 1)th to (32n + 32)th row of the H matrix. Using four parallel rows of the Hs^{(1−1)} output and four elements of the y (y_1 to y_4) vector, four elements of the (y − Hs^{(1−1)}) vector are simultaneously computed in the SUB modules during the 32 clock cycles.
B. MultiplyAccumulate Module
Using the output of the SUB modules, which is (y − Hs^{(1−1)}), the MAC module performs the matrixvector multiplication H^{H} (y − Hs^{(1−1)}). The hardware architecture of the multiply unit (MU) and accumulator (ACC) inside the MAC are illustrated in Figs. 8(a) and 8(b), respectively. The timing diagram of the proposed type2 Richardson detector is also illustrated in Fig. 7(b). For each clock cycle (T01 to T32), the four parallel MU (MU1 to MU4) outputs are the 64 (= outputs of 4 MUs × 16 rows) product term numbers (h_{m,n}·Y_{m}) presented in Fig. 7(c). Here, each product term (h_{m,n}·Y_{m}) is the multiplication of the (m, n) element of the H matrix (h_{m,n}) and the mth element of the (y − Hs^{(1−1)}) vector. To compute H^{H} (y − Hs^{(1−1)}) , four product terms of each row (in Fig. 7(c)) must be accumulated using the ACC in Fig. 8(b). As presented in Fig. 7(c), the sum of the four product terms is repeatedly accumulated during the 32 clock cycles (T02 to T33). Specifically, four partial summation terms (h_{k,m}·Y_{k}, h_{32+k,m}·Y_{32+k}, h_{64+k,m}∙Y_{64+k}, h_{96+k,m}∙Y_{96+k}) and the accumulated summation term ACM(k − 1) in Fig. 8(b) are added at the kth clock cycle (k = 1 to 32).
C. Remaining Computations in the ith Iteration
Using the H^{H} (y − Hs^{(1−1)}) output of the MAC module, the SA modules, ωscaling modules, and AA modules perform the remaining computations of the ith iteration. First, SAs compute the subtraction of two 16dimensional vectors, which is implemented using 32 (= 16 rows × 2 (real and imaginary parts)) real multipliers (σ^{2} · s^{(1−1)}) and 32 subtractors (H^{H} (y − Hs^{(1−1)}) − σ^{2} · s^{(1−1)}). To remove the multiplication σ^{2} · s^{(1−1)} and following subtraction H^{H} (y − Hs^{(1−1)}) − σ^{2} · s^{(1−1)} from the critical path, σ^{2} · s^{(1−1)} is computed before the outputs of the MAC module come out. Then, to compute ω{H^{H} (y − Hs^{(1−1)}) − σ^{2} · s^{(1−1)}} , a ωscaling module is used. Because the relaxation parameter ω value is set to (0.5)^{7} to (0.5)^{10}, each ωscaling module can be implemented using only two adders (for real and imaginary parts) as presented in Fig. 7(c). Finally, AA performs the addition of the two 16dimensional vectors (s^{(1−1)} + ω{H^{H} (y − Hs^{(1−1)}) − σ^{2} · s^{(1−1)}}).
IV. Numerical Results
1. Hardware Comparisons of Massive MIMO Detectors
Two types of Richardson massive MIMO detectors (type1 and type2) were implemented using the 65nm CMOS standard library. In Table 4, the proposed Richardson detectors are compared with conventional detectors [8], [10], [17]. Those detectors are based on the same modulation order (64QAM) and the same or similar dimensions ((B, U) = (128, 16), (128, 8)). The conventional works [10], [17] are FPGAbased designs and the power result has not been reported. An ASICbased softoutput Neumann seriesbased detector architecture (3term) is proposed in [8], which demonstrates the highest throughput with the greatest hardware overheads in Table 4. To evaluate the tradeoff between power consumption and throughput, normalized power [22] is introduced. When we compare the proposed architectures with the Neumann seriesbased detector (3term) [8], the proposed type1 and type2 architectures demonstrate a 34% and 75% normalized power savings, respectively. Compared to the Neumann seriesbased detector (2term) [10] with the same (B, U) = (128, 16), the proposed works with four iterations (n = 4) demonstrate lower latency and higher throughput. Though the conjugate gradient least square (CGLS)based softoutput detector assumes a lower number of users U = 8, the proposed works outperform the CGLSbased architecture in terms of both latency and throughput. When we compare the two proposed architectures assuming that the H matrix is updated for every signal detection, the type2 architecture demonstrates superior latency/throughput and area consumption compared to the type1 architecture. However, in our observation, the better architectural choice between the two types of MIMO detectors can be changed depending on the channel conditions, which will be discussed in the following two subsections.
Table 4.
Architecture  [10]  [17]  [8]  Proposed work  

FPGA/ASIC  FPGA  FPGA  ASIC  ASIC  
Process  N/A  N/A  45 nm  65 nm  
Algorithm  Neumann series (2term)  CGLS  Neumann series (3term)  RC (type1)  RC (type2) 
(B, U)  (128, 16)  (128, 8)  (128, 8)  (128, 16)  
Modulation (QAM)  64  64  64  64  64 
Max operating frequency (Mhz)  222.41  412  1,000  467  415 
Latency (Cycles)  246  951 (n = 3)  N/A  220 (n = 4)  144 (n = 4) 
Max throughput (Mbps)  128.64  20  3,800  203.778  276.66 
Area (Gate count)  N/A  N/A  12.6M  2.21M  1.32M 
Power @ Max.Oper. Freq. (mW)/Scaled power†  N/A  N/A  8,000  892.2/406.1  477.8/217.5 
Normalized Power* (nJ/bit)  N/A  N/A  2.105  1.38  0.544 
2. Hardware Comparisons in the Case of Rayleigh Flat Fading Channel
Because the Rayleigh flat fading channel model has been widely used in massive MIMO research [23], the two proposed detectors were compared under a Rayleigh flat fading channel. In the Rayleigh flat fading channel, the estimated channel matrix H is generally assumed to be invariant during an Orthogonal frequencydivision multiplexing (OFDM) symbol duration [23]. In Fig. 9(a), the computation process of the proposed architecture is presented during an OFDM symbol duration. Here, T_{OFDM} denotes the length of the OFDM symbol duration, T_{H} is the required time to estimate the H matrix, and T_{D} is the process time to detect all data symbols inside the OFDM symbol. Because an estimated H matrix (during H estimation phase T_{H}) can be used to detect all the data symbols (D_{1}, D_{2}, … , D_{n−1}, D_{n}), the GRAM matrix G must be computed only once during T_{OFDM}. If we use the proposed type1 architecture in the Rayleigh flat fading channel, the outputs of the GRAM module (in Fig. 2) for a fixed H matrix can be reused during T_{D}. As presented in Fig. 9(b), the control logic of the GRAM module, which updates the channel information, is used to reduce the dynamic power in the proposed type1 detector. When the estimated H matrix is considered invariant in the channel estimator [23], the logical “1” GRAM_HOLD bit goes to the GRAM module of the type1 architecture. After the G matrix is computed, the GRAM_HOLD bit “1” directs the GRAM module to hold and reuse the GRAM matrix results during T_{D}. In the case of the type2 detector, s^{(i−1)} is updated every iteration, which means the type2 detector must perform all the computations in (6) regardless of the H matrix.
To determine what type of the Richardson methodbased detector is superior in terms of energy/power consumption, the simulation environments were set as follows:

1) The channel estimation was performed using MATLAB simulator. Each element of the H matrix was quantized to 15 bits.

2) In Fig. 9(a), T_{OFDM} was assumed to be 32 μs. The number of data subcarriers was 28 and the period of each subcarrier was 1 μs.

3) The simulation was performed for 40 consecutive OFDM symbol durations (for 1,120 received y vectors).
The power consumption results in Table 5 were obtained with the gatelevel netlist simulations using PrimetimePX [24]. In terms of energy/power consumption, the proposed type1 architecture indicates 61% lower energy consumption and 50% lower power consumption compared to the proposed type2 architecture, which is primarily due to the reuse of the GRAM matrix multiplication (H^{H}H) results in the type1 architecture.
3. Hardware Comparisons in the Case of HighMobility Channel
When users are highly mobile, the wireless channel becomes time variant and frequency selective even within one OFDM symbol because of the Doppler spread effect [23]. To realize reliable signal detection, channel estimation and interpolation must be considered within an OFDM symbol as presented in Fig. 10. Because the H matrix is time varying under a highmobility channel, the H matrix must be periodically updated by an H interpolation process [25].
Assuming highmobility channel conditions, BER performance and the hardware cost of the MIMO detectors can be considered. Fig. 11 displays the BER simulation results of the proposed type1 and type2 detectors under a hightransmission channel, where the center frequency f_{0} is 2.4 Ghz and the velocity of user v = 500 km/h [26]. Except for the Doppler spread effect, the BER simulation environment is basically similar to the one presented in Section IV.2. In Fig. 11, as the computation process of the type2 architecture remains the same regardless of the H matrix update, its detection performance continues to be reliable. However, because the GRAM module of the type1 architecture updates the G matrix only when the H matrix is updated, the BER performance of the type1 architecture is improved with an increasing number of the H matrix updates. In Fig. 11, type1 indicates a comparable BER with type2 for six H matrix updates during one OFDM symbol duration. As presented in Fig. 11, to maintain the BER performance under higher center frequency (for example, 5 Ghz, 10 Ghz) or higher velocity, the channel matrix H must be updated more frequently. In this situation, the type2 architecture is expected to demonstrate considerably improved BER performance and energy efficiency compared to the type1 architecture. Table 6 presents the power/energy consumptions of the proposed architectures under a highmobility channel. Under the same BER performance (type2 and type1 with six time updates), type2 indicates a 37% power savings compared to type1. Based on the numerical results presented in this work, we can conclude that the type1 massive MIMO detector demonstrates improved energy efficiency under the Rayleigh flat fading channel; however, the proposed type 2 detector displays superior energy efficiency under the highmobility channel condition.
V. Conclusion
This paper proposed a lowcomplexity massive MIMO detector based on the Richardson iterative method. First, we presented a massive MIMO detector architecture (type1) based on the conventional Richardson iteration, which demonstrated considerably reduced detection latency compared to the conventional Neumann seriesbased architecture. Then, an efficient reformulation of the Richardson iteration was proposed to reduce the computational complexity (type2) and its hardware architecture was presented. To compare both architectures in terms of energy/power consumption, two extreme channel conditions were considered. In the Rayleigh flat fading channel condition, the type1 detector achieved a 50% power reduction compared to the type2 detector. Under a highmobility channel condition, the type2 architecture indicated a 37% power saving compared to the type1 architecture.