CSE591N: Reconfigurable Systems Group

CSE 503 - Mondays 1:30-2:30

This quarter we will be reading papers from four recent FPGA conferences and workshops (FPGA 2010, FPT 2010, FPL 2010, CARL 2010).

Schedule and Assignments

Date	Leader	Paper
January 3	Carl & Scott	Intro and Organization
January 10	Mike	Efficient FPGAs using Nanoelectromechanical Relays Chen Chen, Soogine Chong, Roozbeh Parsa, Nishant Patil, Kerem Akarvardar, J Provine, David Lewis, Jeff Watt, Roger T. Howe, H.-S. Philip Wong, Subhasish Mitra local copy
January 17	MLK Jr. Day
January 24	Robin	A Comprehensive Approach to Modeling, Characterizing and Optimizing for Metastability in FPGAs Doris Chen, Deshanand Singh, Jeffrey Chromczak, David Lewis, Ryan Fung, David Neto, Vaughn Betz. local copy
January 31	Stephen	High-Throughput Bayesian Computing Machine with Reconfigurable Hardware Mingjie Lin, Ilia Lebedev, John Wawrzynek. local copy
February 7	Corey	Rethinking FPGA Computing with a Many-Core Approach John Wawrzynek, Mingjie Lin, Ilia Lebedev, Shaoyi Cheng, Daniel Burke. local copy
February 14	Andrew	LEAP: A Virtual Platform Architecture for FPGAs Angshuman Parashar, Michael Adler, Kermin E. Fleming, Michael Pellauer and Joel S. Emer. local copy
February 21	President's Day
February 28	Maria	CoRAM: An In-Fabric Memory Abstraction for FPGA-Based Computing Eric S. Chung, James C. Hoe, and Ken Mai. local copy
March 7	Nate	Auto-Pipe: Streaming Applications on Architecturally Diverse Systems Chamberlain, R.D.; Franklin, M.A.; Tyson, E.J.; Buckley, J.H.; Buhler, J.; Galloway, G.; Gayen, S.; Hall, M.; Shands, E.F.B.; Singla, N. local copy

1. Mingjie Lin, Ilia Lebedev, John Wawrzynek (Berkeley). High-Throughput Bayesian Computing Machine with Reconfigurable Hardware, FPGA’10, 2010.

Abstract: We use reconfigurable hardware to construct a high throughput Bayesian computing machine (BCM) capable of evaluating probabilistic networks with arbitrary DAG (directed acyclic graph) topology. Our BCMachieves high throughput by exploiting the FPGA’s distributed memories and abundant hardware structures (such as long carry-chains and registers), which enables us to 1) develop an innovative memory allocation scheme based on a maximal matching algorithm that completely avoids memory stalls, 2) optimize and deeply pipeline the logic design of each processing node, and 3) schedule them optimally. The BCM architecture not only can be applied to many important algorithms in artificial intelligence, signal processing, and digital communications, but also has high reusability, i.e., a new application needs not change a BCM’s hardware design, only new task graph processing and code compilation are necessary. Moreover, the throughput of a BCM scales almost linearly with the size of the FPGA on which it is implemented. A Bayesian computing machine with 16 processing nodes was implemented with a Virtex-5 FPGA (XCV5LX155T-2) on a BEE3 (Berkeley Emulation Engine) platform. For a wide variety of sample Bayesian problems, comparing running the same network evaluation algorithm on a 2.4 GHz Core 2 Duo Intel processor and a GeForce 9400m using the CUDA software package, the BCM demonstrates 80x and 15x speedups respectively, with a peak throughput of 20.4 GFLOPS (Giga Floating-Point Operations per Second).

2. Yi-Hua Edward Yang and Viktor K. Prasanna (USC). High Throughput and Large Capacity Pipelined Dynamic Search Tree on FPGA, FPGA’10, 2010.

Abstract: We propose a pipelined Dynamic Search Tree (pDST) on FPGA which offers high throughput for lookup, insert and delete operations as well as the capability to perform inplace incremental updates. Based on the pipelined 2-3 tree data structure, our pDST supports one lookup per clock cycle and maintains tree balance under continual insert and delete operations. A novel buffered update scheme together with a bi-directional linear pipeline allows the pDST to perform one insert or delete operation per O (logN) cycles (N being the tree capacity) without stalling the lookup operations. Nodes at each pipeline stage are allocated and freed by a free-node chaining mechanism which greatly simplifies the memory management circuit. Our prototype implementation of a 15-level, 32-bit key dual-port pDST requires 192 blocks of 36 Kb BRAMs (64%) and 12.8k LUTs (6.3%) on a Virtex 5 LX330 FPGA. The circuit has a maximum capacity of 96k 32-bit keys and clock rate of 135 MHz, supporting 242 million lookups and concurrently 3.97 million inserts or deletes per second.

*3. Yi Shan, Bo Wang, Jing Yan, Yu Wang, Ningyi Xu, Huazhong Yang (Microsoft Research Asia). FPMR : MapReduce Framework on FPGA - a Case Study of RankBoost Acceleration, FPGA’10, 2010.

Abstract: Machine learning and data mining are gaining increasing attentions of the computing society. FPGA provides a highly parallel, low power, and flexible hardware platform for this domain, while the difficulty of programming FPGA greatly limits its prevalence. MapReduce is a parallel programming framework that could easily utilize inherent parallelism in algorithms. In this paper, we describe FPMR, a MapReduce framework on FPGA, which provides programming abstraction, hardware architecture, and basic building blocks to developers. An on-chip processor scheduler is implemented to maximize the utilization of computation resources and achieve better load balancing. An efficient data access scheme is carefully designed to maximize data reuse and throughput. Meanwhile, the FPMR framework hides the task control, synchronization, and communication away from designers so that more attention can be paid to the application itself. A case study of RankBoost acceleration based on FPMR demonstrates that FPMR efficiently helps with the development productivity; and the speedup is 31.8x versus CPU-based implementation. This performance is comparable to a fully manually designed version, which achieves 33.5x speedup. Two other applications: SVM, PageRank are also discussed to show the generalization of the framework.

4. Doris Chen, Deshanand Singh, Jeffrey Chromczak, David Lewis, Ryan Fung, David Neto, Vaughn Betz (Altera). A Comprehensive Approach to Modeling, Characterizing and Optimizing for Metastability in FPGAs, FPGA’10, 2010.

Abstract: Metastability is a phenomenon that can cause system failures in digital circuits. It may occur whenever signals are being transmitted across asynchronous or unrelated clock domains. The impact of metastability is increasing as process geometries shrink and supply voltages drop faster than transistor Vts. FPGA technologies are significantly detected since leading edge FPGAs are amongst the first devices to adopt the most recent process nodes. In this paper, we present a comprehensive suite of techniques for modeling, characterizing and optimizing metastability effects in FPGAs. We first discuss a theoretical model of metastability, and verify the predictions using both circuit level simulations and board measurements. Next we show how designers have traditionally dealt with metastability problems and contrast that with the automatic CAD algorithms described in this paper that both analyze and optimize metastabilityrelated issues. Through our detailed experimental results, we show that we can improve the metastability characteristics of a large suite of industrial benchmarks by an average of 268,000 times with our optimization techniques.

5. Chen Chen, Soogine Chong, Roozbeh Parsa, Nishant Patil, Kerem Akarvardar, J Provine, David Lewis, Jeff Watt, Roger T. Howe, H.-S. Philip Wong, Subhasish Mitra (Stanford/Altera). Efficient FPGAs using Nanoelectromechanical Relays, FPGA’10, 2010.

Abstract: Nanoelectromechanical (NEM) relays are promising candidates for programmable routing in Field-Programmable-Gate Arrays (FPGAs). This is due to their zero leakage and potentially low on-resistance. Moreover, NEM relays can be fabricated using a low-temperature process and, hence, may be monolithically integrated on top of CMOS circuits. Hysteresis characteristics of NEM relays can be utilized for designing programmable routing switches in FPGAs without requiring corresponding routing SRAM cells. Our simulation results demonstrate that the use of NEM relays for programmable routing in FPGAs can simultaneously provide 43.6% footprint area reduction, 37% leakage power reduction, and up to 28% critical path delay reduction compared to traditional SRAM-based CMOS FPGAs at the 22nm technology node.

6. H. Manteuffel, C. Bassoy, F. Mayer-Lindenberg. The TransC Process Model and Interprocess Communication, FPT'10, 2010.

Abstract:

7. Petr Matas, Eva Dokladalova, Mohamed Akil, Vjaceslav Georgiev, and Martin Poupa. Parallel Hardware Implementation of Connected Component Tree Computation, FPL'10, 2010.

Abstract: The paper proposes a new parallel hardware architecture for a connected component tree computation. It is an original implementation of the recently published parallel algorithm based on building of a 1D tree for each individual image line and their progressive merging. The image is divided into independent partitions which are processed concurrently. Nevertheless, merging of these partitions requires access to all partitions. A special interconnection switch is proposed to solve this problem. The implementation results obtained on an FPGA are also presented. The obtained performance on Virtex 5 FPGA is 145 Mpx/s using 11 928 slice LUTs, 5752 registers and 8064 Kib of block RAM.

8. Heiner Giefers and Marco Platzner. A Triple Hybrid Interconnect for Many-Cores: Reconfigurable Mesh, NoC and Barrier, FPL'10, 2010.

Abstract: Networks-on-chip (NoC) are very efficient for point-to-point communication but are also known to provide poor broadcast and multicast performance. In this paper, we propose a triple hybrid interconnect for many-cores, consisting of a reconfigurable mesh network and a wormhole routed NoC for data communication, and a barrier network for synchronization. On an FPGA many-core prototype comprising up to 30 Microblaze soft cores we show that the reconfigurable mesh network excels for multicast and broadcast operations, while the NoC performs better for larger messages and more dynamic workloads. Experiments with a parallel Jacobi algorithm demonstrate that the combined use of all three networks delivers the highest performance.

*9. Hagen Gädke-Lütjens, Benjamin Thielmann, and Andreas Koch. A Flexible Compute and Memory Infrastructure for High- Level Language to Hardware Compilation, FPL'10, 2010.

Abstract: We present a low-level infrastructure for use by highlevel language to hardware compiler back-ends. It consists of the highly parameterizable, technology-independent module library Modlib and the LMEM framework for localizing variables in fast on-chip memories. Modlib not only supports all high-level language operators (including memory accesses), but also provides a wide spectrum of usage modes: covering static and dynamic scheduling, speculative predicated execution, pipeline balancing, and explicit canceling of mis-speculated computations. We examine the performance of the infrastructure for a number of automatically compiled kernels, including an MD5 kernel that significantly profits from using LMEM.

10. Stefan Kirsch, Felix Rettig, Dirk Hutter, Jan de Cuveland, Venelin Angelov, and Volker Lindenstruth. An FPGA-based High-Speed, Low-Latency Processing System for High-Energy Physics, FPL'10, 2010.

Abstract: The Global Tracking Unit of the ALICE Transition Radiation Detector is a high-speed, low-latency trigger processor installed at the ALICE experiment at the Large Hadron Collider. Based on the analysis of up to 20,000 parametrized particle track segments per event, a trigger decision is formed within approx. 2 µs. Furthermore, the system is designed to significantly improve the overall detector performance by providing a complex and robust multi-event buffering scheme. Data from the detector arrives at an aggregate net bandwidth of 2.16 Tbit/s via 1080 optical links and is processed massively in parallel by 109 FPGA-based units organized in a 3-stage hierarchical structure. The embedded PowerPC cores are employed not only to build a monitoring and control system that can be interfaced by the experiment control. They are also used to realize a real-time hardware/software co-design, able to characterize the trigger performance, supervise the operation and intervene in cases of system errors.

11. Eric S. Chung, James C. Hoe, and Ken Mai (CMU). CoRAM: An In-Fabric Memory Abstraction for FPGA-Based Computing, CARL'10, 2010.

Abstract: FPGAs have been used in many applications to achieve orders-of-magnitude improvement in absolute performance and energy efficiency relative to conventional microprocessors. Despite their newfound potency in both processing performance and energy efficiency, FPGAs have not gained widespread acceptance as mainstream computing devices. A fundamental obstacle to FPGA-based computing can be traced to the FPGA’s lack of a common, scalable memory abstraction. When developing for FPGAs today, application writers are often directly responsible for crafting the application-specific infrastructure logic that transports data to and from the processing kernels. This infrastructure not only increases design time and effort but will often inflexibly lock a design to a particular FPGA product line, hindering scalability and portability. We propose a new FPGA memory abstraction called Connected RAM (CoRAM) to serve as a portable bridge between the distributed computation kernels and the edge memory interfaces. In addition to improving performance and efficiency, the CoRAM architecture provides a virtualized memory environment as seen by the hardware kernels to simplify application development and to improve an application’s portability and scalability. This CARL workshop research overview summarizes our published work in FPGA’11 [4].

12. Maysam Lavasani, Larry Dennison, and Derek Chiou (Texas). A Methodology for Leveraging Reconfigurability in Domain Specific Languages, CARL'10, 2010.

Abstract: Special-purpose hardware can dramatically accelerate an application. However, designing special-purpose hardware is often prohibitively expensive in terms of manpower and time. This paper describes a methodology that uses reconfigurability to enable the efficient compilation of a class of domain specific languages. We present the methodology, a prototype compiler, and a 40Gb/sec network processor designed to be implemented on an FPGA using that compiler.

13. Angshuman Parashar, Michael Adler, Kermin E. Fleming, Michael Pellauer and Joel S. Emer (MIT/Intel). LEAP: A Virtual Platform Architecture for FPGAs, CARL'10, 2010.

Abstract: FPGAs are known to be very effective at accelerating certain classes of algorithms. A variety of FPGA platforms are available today, but because of the absence of a standardized platform architecture, each platform comes in the form of a board with a diverse set of devices and communication endpoints. Therefore, FPGA programmers typically have to spend significant effort in building interfaces to devices and adapting their applications to work with the semantics of these devices. Further, an FPGA board by itself is in many cases incapable running full real-world applications – software support is required. Working out communication protocols between the FPGA and software is another unnecessary time sink for programmers who would rather focus on the high-level functionalities of their applications. Finally, there is little support for building and allocating flexible memory hierarchies on FPGA platforms. All of these problems are further exacerbated by the fact that switching FPGA platforms usually requires the programmer to re-do a significant portion of this work. These are all non-issues for software programmers who live in a world of block and character devices, hardware-managed memory hierarchies with rich memory management libraries, and a plethora of portable communication protocols. We attempt to bridge this gap between platform support for software and FPGA application development by proposing LEAP Virtual Platforms. LEAP (Logic-based Environment for Application Programming) provides an FPGA application with a consistent set of useful services and device abstractions, a memory hierarchy and a flexible communication protocol across a range of FPGA physical platforms. Tying these functionalities together is a modular development and build infrastructure. In this paper we describe the services provided by LEAP and explain how they are implemented using a multi-layered stack of abstractions.

14. John Wawrzynek, Mingjie Lin, Ilia Lebedev, Shaoyi Cheng, Daniel Burke (Berkeley). Rethinking FPGA Computing with a Many-Core Approach, CARL'10, 2010.

Abstract: While ASIC design and manufacturing costs are soaring with each new technology node, the computing power and logic capacity of modern FPGAs steadily advances. Therefore, high-performance computing with FPGA-based system becomes increasingly attractive and viable. Unfortunately, truly unleashing the computing potential of FPGAs often stipulates cumbersome HDL programming and laborious manual optimization. To circumvent such challenges, we propose a Many-core Approach to Reconfigurable Computing (MARC) that (i) allows programmers to easily express parallelism through a high-level programming language, (ii) supports coarse-grain multithreading and dataflowstyle fine-grain threading while permitting bit-level resource control, and (iii) greatly reduces the effort required to repurpose the hardware system for different algorithms or different applications. Leveraging a many-core architectural template, sophisticated logic synthesizing techniques, and state-of-art compiler optimization technology, a MARC system enables efficient highperformance computing for applications expressed with imperative programming languages such as C/C++ by exploiting abundant special FPGA resources such as distributed block memories and DSP blocks to implement complete single-chip high efficiency many-core microarchitectures. To quantitatively validate the proposed MARC system, we implemented a MARC prototype machine consisting of one control processing core and 32 arithmetic processing cores using a Virtex-5 (XCV5LX155T-2) FPGA. For a well-known general-purpose Bayesian computing problem, we compare the throughput and runtime of this MARC machine, with fully synthesized application-specific processing cores, against a manually optimized FPGA implementation—BCM (Bayesian Computing Machine) [1]. As the problem sizes range from 10^3 to 10^6, this MARC machine achieve 8.13 GFLOPS in throughput on average, which is 43% of that of BCM but with much less design/implementation effort and much greater portability and retargetability. More importantly, we developed a simple analytical performance model to explain the performance discrepancy between the MARC machine and the hand-optimized BCM FPGA implementation [1].

15. Lin, M.; Wawrzynek, J. Improving FPGA Placement With Dynamically Adaptive Stochastic Tunneling.

Abstract: This paper develops a dynamically adaptive stochastic tunneling (DAST) algorithm to avoid the "freezing" problem commonly found when using simulated annealing for circuit placement on field-programmable gate arrays (FPGAs). The main objective is to reduce the placement runtime and improve the quality of final placement. We achieve this by allowing the DAST placer to tunnel energetically inaccessible regions of the potential solution space, adjusting the stochastic tunneling schedule adaptively by performing detrended fluctuation analysis, and selecting move types dynamically by a multi-modal scheme based on Gibbs sampling. A prototype annealing-based placer, called DAST, was developed as part of this paper. It targets the same computer-aided design flow as the standard versatile placement and routing (VPR) but replaces its original annealer with the DAST algorithm. Our experimental results using the benchmark suite and FPGA architecture file which comes with the Toronto VPR5 software package have shown a 18.3% reduction in runtime and a 7.2% improvement in critical-path delay over that of conventional VPR.