Student projects at LAP

Accelerate FPGA Implementation Time

Supervisor: Andrea Guerrieri ([email protected])

FPGAs are powerful devices deployed everywhere from embedded applications to datacentres. However, developing with FPGAs requires hardware design skills and a long implementation time. High-level synthesis (HLS) bridges part of this gap by creating hardware designs from C/C++ code. However, the time required to generate a bitstream configuration to download to the FPGA remains very long. This project aims to accelerate the FPGA implementation flow from several minutes to a few seconds.

 

Post-quantum cryptography core using high-level synthesis

Supervisor: Andrea Guerrieri ([email protected])

In 2016, the American National Institute of Standards and Technology (NIST) started a contest to qualify a set of post-quantum cryptography standards. High-level synthesis (HLS) automatically produces HDL code for FPGAs from C/C++, bridging the gap from algorithm to hardware design. However, its quality of results (QoR) can be suboptimal compared to RTL-based (Register Transfer Level) designs. This project aims to explore and improve the QoR of post-quantum cryptography hardware generated with HLS.

 

Reconfigurable SoCs with Quantum Optics for Astronomical and Aerospace Applications

Supervisor: Andrea Guerrieri ([email protected])

Reconfigurable System-on-Chips (SoCs) are incredibly versatile and flexible electronic devices, offering the capability to fulfill real-time constraints across various domains such as astronomy and aerospace, among many others. This project focuses on harnessing the potential of reconfigurable SoCs to establish interfaces to and control of quantum sensors. To achieve this goal, advanced design techniques, including High-Level Synthesis (HLS) and the Chisel hardware construction language, will be applied, ensuring that cutting-edge, efficient methods are employed throughout the project.

 

Beyond Instruction-Level Parallelism in Dataflow Circuits

Supervisor: Ayatallah Elakhras ([email protected])

Pipelining is an important technique for exploiting parallelism by overlapping operations in spatial computing. A key characteristic of a pipeline is its initiation interval (II): the number of cycles that must elapse before it can accept a new set of inputs. The II is the inverse of the pipeline's throughput. A perfect pipeline has an II of 1 cycle; such a pipeline has maximum resource efficiency, as all pipeline stages are kept busy all the time.

High-level synthesis (HLS) is the process of automatically generating an RTL design from an input in a high-level language such as C/C++. An important transformation in HLS is loop pipelining, and existing HLS techniques pipeline loops with the ultimate target of achieving an II of 1. Thanks to its runtime mechanisms, dynamically scheduled HLS through dataflow circuit generation pipelines irregular loops with complicated control structures better than its statically scheduled counterpart. Yet, dataflow circuit generation techniques fail to achieve an II of 1 in the presence of a loop-carried dependency through a high-latency operation, which decreases both the resource efficiency and the throughput of the application. One straightforward solution to increase the throughput is to replicate resources and execute multiple sets of inputs in parallel; however, this does not improve resource efficiency. The goal of this project is to explore a better solution that increases both throughput and resource efficiency: whenever one set of inputs is blocked on a data dependency, a new set of inputs is accepted and passed to the dataflow circuit to execute on the idle resources. Doing so in dataflow circuits requires extending the existing handshake protocol to support token tagging. This achieves a coarse-grained pipeline with an II of 1 and increases the overall throughput of the application by executing multiple sets of inputs in parallel while sharing resources among them.
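As a small illustration (a hypothetical kernel, not taken from the project description), the following C loop carries a dependency through a floating-point accumulation: each iteration needs the previous iteration's value of `acc`, and the multiply takes several cycles in hardware, so no schedule, static or dynamic, can start a new iteration every cycle.

```c
#include <assert.h>

/* Illustrative kernel: each iteration reads the `acc` produced by the
 * previous one through a multi-cycle floating-point multiply, so a
 * pipelined implementation of this loop cannot reach an II of 1. */
float running_product(const float *a, int n) {
    float acc = 1.0f;
    for (int i = 0; i < n; ++i)
        acc = acc * a[i];   /* loop-carried dependency on a high-latency op */
    return acc;
}
```

While one invocation of this loop stalls on the multiply, most of the circuit sits idle; the tagging scheme described above would let a second, independent set of inputs use those idle resources.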

 

An Area-Efficient Memory Interface for Dataflow Circuits

Supervisor: Ayatallah Elakhras ([email protected])

In recent years, dynamically scheduled HLS through dataflow circuit generation has proven its effectiveness over its statically scheduled counterpart, thanks to its flexibility in dealing with the unpredictability of program control decisions and the variable latency of some operations. Applications with complicated control structures and irregular memory access patterns are good candidates for it. However, when the memory access pattern cannot be statically disambiguated, a runtime mechanism that orders memory accesses is required to prevent potential memory hazards and maintain the correctness of sequential program execution. Out-of-order processors typically rely on load-store queues (LSQs) for this purpose, and so do dataflow circuits. Although LSQs do a great job at responding to memory accesses as fast as possible while maintaining the correct semantic order, they are a main source of area and power inefficiency, especially in the context of dataflow circuits. The goal of this project is to abandon LSQs and replace them with local circuitry, inserted solely between memory-dependent operations, that can be as fast as LSQs but more area-efficient.
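A minimal C sketch of the disambiguation problem (illustrative, not from the project statement): when loads and stores go through data-dependent indices, the compiler cannot prove whether two accesses touch the same address, so their ordering must be enforced at runtime.

```c
#include <assert.h>

/* Histogram update: whether iteration j's load of hist[idx[j]] conflicts
 * with iteration i's store (i < j) depends on the runtime contents of
 * idx[], so the accesses cannot be statically disambiguated. An LSQ, or
 * the lighter local circuitry this project proposes, must preserve the
 * load/store order whenever the addresses collide. */
void histogram(int *hist, const int *idx, int n) {
    for (int i = 0; i < n; ++i)
        hist[idx[i]] = hist[idx[i]] + 1;  /* potential read-after-write hazard */
}
```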

 

Analysis of Memory Accesses of Task-Level Algorithms 

Supervisor: Canberk Sonmez ([email protected])

Task-level parallelism (TLP) is an abstraction that models program execution as multiple tasks that can execute in parallel. TLP has proven effective for a variety of workloads, most prominently graph processing algorithms, and we believe that TLP is a great candidate for FPGA acceleration. While prior work has explored the computation aspect of TLP on both CPUs and FPGAs, the memory aspect is yet to be investigated. In this project, the student is expected to perform an in-depth analysis of various graph processing algorithms to help design a specialized memory controller for our FPGA-based TLP acceleration framework. This analysis includes, but is not limited to, investigating spatial and temporal locality as well as data reuse patterns within and across tasks.
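For concreteness (an illustrative sketch, not part of the project statement), a typical graph kernel stored in compressed sparse row (CSR) form makes exactly the kind of indirect memory accesses whose locality this analysis would characterize:

```c
#include <assert.h>

/* Sum of neighbour values for one vertex of a graph in CSR form.
 * row_ptr/col_idx are the standard CSR arrays. The edge arrays are
 * streamed sequentially (good spatial locality), but the indirect read
 * val[col_idx[e]] jumps across memory (poor spatial locality), with
 * temporal reuse only when tasks share neighbours. */
int neighbour_sum(const int *row_ptr, const int *col_idx,
                  const int *val, int v) {
    int sum = 0;
    for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e)
        sum += val[col_idx[e]];
    return sum;
}
```

A memory controller specialized for TLP would need to serve the sequential `row_ptr`/`col_idx` streams and the random `val` accesses with very different strategies, which is what the locality analysis is meant to quantify.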

 

Toward Coarse Grained Field Programmable Gate Arrays

Supervisor: Louis Coulon ([email protected])

Reconfigurable computing systems are ideal candidates to fulfill the growing need for hardware specialization. They enable the mapping of applications to efficient dedicated digital circuits implemented on generic computing fabrics, thus improving both performance and energy efficiency while still supporting virtually any application. In practice, the only reconfigurable computing system available commercially today is the FPGA, recently adopted as a computing device in datacenters and targeted by all high-level synthesis compilers to ease its programming. However, in a datacenter setting, there is a striking mismatch between the FPGA programmable fabric and the computing needs of applications: applications generally perform word-level computations, while FPGAs provide bit-level reconfigurability and bit-level abstractions. In this project, we study a coarser-granularity version of FPGAs and evaluate its performance as a word-level reconfigurable computing fabric. We offer opportunities to work on new compiler optimizations targeting this new category of processors, new architectural ideas, and the physical implementation and modeling of such processors.

 

Open-Source Work on Dynamic High-level Synthesis Compiler Targeting FPGAs

Supervisor: Lucas Ramirez ([email protected])

Dynamic high-level synthesis (DHLS) is the process of turning high-level source code into synchronous dynamically scheduled circuits. We offer opportunities to work on an open-source DHLS compiler called Dynamatic that is based on the MLIR compiler ecosystem and which generates synthesizable RTL that targets FPGA architectures from C/C++ code. Projects can revolve around writing compiler passes that transform our intermediate representation (IR) at some level of our progressive lowering pipeline. These go from close-to-source high-level analysis steps (e.g., memory dependence analysis, loop transformations) down to dataflow circuit-level transformations (e.g., buffer placement, bitwidth optimization). We also have open lines of work in the infrastructure surrounding the core compiler, be it debugging, visualization, or benchmarking tools. Regardless of the exact line of work, as an open-source project we value good software design, solid development practices, and want to make any student contribution a permanent part of Dynamatic’s codebase. If you would like to discuss potential directions, please reach out with your resume/transcript and any specific interest you may have.

 

Processing Engines Development for Task Level Parallelism

Supervisor: Mohamed Shahawy ([email protected])

Task Level Parallelism (TLP) is an approach to parallelizing programs by dividing them into a number of independent tasks. TLP is supported in software by several compilers, such as OpenCilk, and libraries, such as Intel TBB. TLP is beneficial for workloads that exhibit control dependencies and irregular memory access patterns, such as graph and tree processing. Currently, TLP runs mainly on parallel CPUs; our lab aims to build TLP support for FPGA computing. In this project, the student is expected to implement and verify a set of Processing Elements (PEs) that use the TLP backbone infrastructure developed at LAP. The PEs would range from simple benchmark PEs to more complex ones that implement modern parallel graph mining algorithms.

 

Wishbone bus architecture generator

Supervisor: Theo Kluter ([email protected])

The Wishbone bus is an open-source hardware computer bus that can be used in embedded systems to connect different components in a memory-mapped way. The Wishbone standard allows for shared-bus, dataflow, and crossbar architectures. In this project, you are going to design a tool that generates the HDL (VHDL and/or Verilog) implementing a Wishbone architecture, given a specification of the master(s) and slave(s) connected to the bus.
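As a sketch of what one small piece of such a generator might emit (entirely illustrative: the function, signal names, and address-map format are assumptions, not taken from the Wishbone specification), a shared-bus configuration needs an address decoder that asserts the strobe of the slave whose region contains the master's address:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Emit one Verilog address-decode assignment per slave of a shared
 * Wishbone bus. Base addresses and masks would come from the user's
 * specification; the surrounding module boilerplate is omitted and all
 * names are illustrative. Returns the number of characters written. */
int emit_decoder(char *out, size_t cap, const unsigned *base,
                 const unsigned *mask, int nslaves) {
    int len = 0;
    for (int s = 0; s < nslaves; ++s)
        len += snprintf(out + len, cap - len,
                        "assign stb_s%d = cyc_i & stb_i & "
                        "((adr_i & 32'h%08X) == 32'h%08X);\n",
                        s, mask[s], base[s]);
    return len;
}
```

A real generator would additionally emit the data multiplexers, acknowledge routing, and, for the crossbar variant, per-master arbitration.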

 

RISC-V educational simulator

Supervisor: Theo Kluter ([email protected])

The open-source RISC-V instruction set architecture (ISA) is becoming more and more important in industry and research, which makes it a good fit for our educational program. In this project, you are going to implement a RISC-V ISA simulator framework that can be used to introduce students to computer architecture.
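The core of such a simulator is a fetch-decode-execute loop over the architectural state. A minimal sketch (hypothetical structure; a real framework would cover the full RV32I set, a memory model, and tracing hooks for teaching) decodes a 32-bit instruction word and executes it against a register file:

```c
#include <assert.h>
#include <stdint.h>

/* Execute one RV32I instruction on a 32-entry register file.
 * Only ADDI (opcode 0x13, funct3 0) and ADD (opcode 0x33, funct3 0)
 * are handled in this sketch; funct7 (e.g., to distinguish SUB) and
 * all other instructions are omitted. x0 stays hard-wired to zero,
 * as the ISA requires. */
void step(uint32_t regs[32], uint32_t instr) {
    uint32_t opcode = instr & 0x7F;
    uint32_t rd     = (instr >> 7)  & 0x1F;
    uint32_t funct3 = (instr >> 12) & 0x7;
    uint32_t rs1    = (instr >> 15) & 0x1F;
    uint32_t rs2    = (instr >> 20) & 0x1F;
    int32_t  imm    = (int32_t)instr >> 20;   /* sign-extended I-immediate */

    if (opcode == 0x13 && funct3 == 0)
        regs[rd] = regs[rs1] + (uint32_t)imm;  /* ADDI */
    else if (opcode == 0x33 && funct3 == 0)
        regs[rd] = regs[rs1] + regs[rs2];      /* ADD */
    regs[0] = 0;  /* x0 always reads as zero */
}
```

For classroom use, the same decode skeleton extends instruction by instruction, which lets students grow the simulator alongside the lecture material.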