Exploiting Thread-Level Parallelism in Lockstep Execution by Partially Duplicating a Single Pipeline
Jaegeun Oh, Seok Joong Hwang, Huong Giang Nguyen, Areum Kim, Seon Wook Kim, Chulwoo Kim, and Jong-Kook Kim
ETRI Journal, vol. 30, no. 4, Aug. 2008, pp. 576-586.
Keywords: ILP, TLP, SMT, CMP, MLEP.
  • Abstract

      In most parallel loops of embedded applications, every iteration executes the exact same sequence of instructions while manipulating different data. This fact motivates a new compiler-hardware orchestrated execution framework in which all parallel threads share one fetch unit and one decode unit but have their own execution, memory, and write-back units. This resource sharing enables parallel threads to execute in lockstep with minimal hardware extension and compiler support. Our proposed architecture, called the multithreaded lockstep execution processor (MLEP), is a compromise between the single-instruction multiple-data (SIMD) and simultaneous multithreading/chip multiprocessor (SMT/CMP) solutions. The proposed approach is more favorable than typical SIMD execution in terms of degree of parallelism, range of applicability, and code generation, and it can save more power and chip area than the SMT/CMP approach without significant performance degradation. To verify the architecture, we extend a commercial 32-bit embedded core, the AE32000C, and synthesize it on a Xilinx FPGA. Compared to the original architecture, our approach is 13.5% faster with a 2-way MLEP and 33.7% faster with a 4-way MLEP on EEMBC benchmarks that are automatically parallelized by the Intel compiler.
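
The lockstep idea in the abstract can be illustrated with a small software model. The sketch below is not the authors' hardware or compiler; it is a toy simulation under stated assumptions. The three-instruction set (`ld`, `mul`, `st`) and the names `Lane`, `decode`, `execute`, and `run_lockstep` are all hypothetical. Each instruction is fetched and decoded once, then executed in every lane, and each lane keeps its own registers, mirroring threads that share the front end but own their execution, memory, and write-back stages.

```python
from dataclasses import dataclass, field


@dataclass
class Lane:
    """Per-thread state (hypothetical model): each lane owns its registers."""
    regs: dict = field(default_factory=dict)


def decode(instr):
    """Shared decode stage: parse 'op a, b, c' once per instruction."""
    op, rest = instr.split(maxsplit=1)
    return op, [tok.strip() for tok in rest.split(",")]


def execute(op, args, lane, memory):
    """Per-lane execute/memory/write-back for the toy instruction set."""
    if op == "ld":    # ld dst, addr_reg, offset
        dst, addr, off = args
        lane.regs[dst] = memory[lane.regs[addr] + int(off)]
    elif op == "mul":  # mul dst, a, b
        dst, a, b = args
        lane.regs[dst] = lane.regs[a] * lane.regs[b]
    elif op == "st":   # st src, addr_reg, offset
        src, addr, off = args
        memory[lane.regs[addr] + int(off)] = lane.regs[src]
    else:
        raise ValueError(f"unknown op: {op}")


def run_lockstep(program, lanes, memory):
    """Fetch and decode each instruction once, then run it in every lane."""
    fetches = 0
    for instr in program:            # shared fetch
        op, args = decode(instr)     # shared decode
        fetches += 1
        for lane in lanes:           # duplicated back-end stages
            execute(op, args, lane, memory)
    return fetches


# Loop body for: out[i] = in[i] * in[i], with in at memory[0:4], out at memory[4:8].
PROGRAM = ["ld x, i, 0", "mul y, x, x", "st y, i, 4"]
memory = [1, 2, 3, 4, 0, 0, 0, 0]
lanes = [Lane(regs={"i": k}) for k in range(4)]  # one lane per loop iteration
fetches = run_lockstep(PROGRAM, lanes, memory)
```

In this model the four iterations complete with three fetches and three decodes in total, whereas running the iterations one after another would fetch and decode the same three instructions four times. The model says nothing about cycle counts, power, or area; it only shows which work is shared and which is duplicated.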
  • Authors

      Jaegeun Oh
      Korea University
      Seok Joong Hwang
      Korea University
      Huong Giang Nguyen
      Korea University
      Areum Kim
      Korea University
      Seon Wook Kim
      Korea University
      Chulwoo Kim
      Korea University
      Jong-Kook Kim
      Korea University