IBERT LLM Implementation in Hardware

I implemented an LLM end-to-end in hardware using SystemVerilog and generated tokens natively on a PYNQ-Z1 FPGA. This project is part of ECE 327 - Digital Hardware Systems, taught by Dr. Nachiket Kapre, and has been one of my top courses so far. It introduces software patterns for digital design, RTL design constructs in SystemVerilog, and how to think about PPA choices, optimizations, and the tradeoffs to consider. Huge thanks to Prof. Kapre for all of his teaching & admin efforts to make this possible!

(figure: ibert-attention)

I started off my Verilog journey by speedrunning basic constructs + syntax from HDLBits (I started this on the side during my time at ATS-LS). I learned the intricate details surrounding blocking vs. non-blocking assignments (this paper from Sunburst Design was a great read on the topic), dataflow/structural modelling & procedural constructs, how delays work, the differences in what hardware each pattern generates, and synthesis-vs-simulation mismatches.

Project Implementation Phase

Phase 1: Building out the Arithmetic Blocks (~10 hours)

  • Acc: Sums a stream of input values (in_data) into a synchronous accumulator (result). On reset or initialize, result is cleared to 0; otherwise, on each positive clock edge, in_data is added to result. Latency - result is valid 1 cycle after the corresponding input.

  • MAC: Computes the dot product of two input streams (a and b) by accumulating their products (via an instantiated acc module) into result. On each clock edge, if not reset/initialized, adds a * b to result.

  • Max: Tracks the maximum value in a stream, updating only when a new maximum is seen.

  • Array: Instantiates N independent mac modules, each receiving its own input streams (a[k], b[k], initialize[k]). Each mac computes a dot product for its input lane, and the results are output as an array, enabling parallel multiply-accumulate operations across N data streams.
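The Acc/MAC pattern above can be sketched in a few lines of SystemVerilog. This is a minimal sketch, not the course's reference code: the port names (in_data, initialize, result) follow the descriptions above, while the widths and the exact reset scheme are my assumptions.

```systemverilog
// Hedged sketch of the Acc module: synchronous accumulator with clear.
module acc #(parameter W = 32) (
    input  logic                clk,
    input  logic                reset,
    input  logic                initialize,   // clears the running sum
    input  logic signed [W-1:0] in_data,
    output logic signed [W-1:0] result
);
    always_ff @(posedge clk) begin
        if (reset || initialize) result <= '0;
        else                     result <= result + in_data;
    end
endmodule

// MAC: multiply first, then feed the product into the same accumulator.
module mac #(parameter W = 32) (
    input  logic                clk,
    input  logic                reset,
    input  logic                initialize,
    input  logic signed [W-1:0] a,
    input  logic signed [W-1:0] b,
    output logic signed [W-1:0] result
);
    acc #(.W(W)) u_acc (
        .clk(clk), .reset(reset), .initialize(initialize),
        .in_data(W'(a * b)), .result(result)
    );
endmodule
```

The array module then becomes a simple generate loop over N such mac instances, one per lane.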

The FSM in Div ensures correct sequencing and resource sharing, while pipelining in Exp and GELU allows new inputs every cycle, maximizing throughput. The use of integer math and polynomial approximations is justified by the need for speed and resource efficiency on FPGAs - read more about I-BERT here

  • Div: Implements integer division using an FSM pattern: computes quotient = floor(dividend / divisor). The iterative loop is implemented with FSM-based control using two states: INIT and COMP.

    • Details:
      • The FSM operates with two primary states: INIT and COMP. In the INIT state, the module waits for the in_valid signal. When in_valid is asserted, the module loads the dividend and divisor into internal registers, resets the quotient, and transitions to the COMP state to begin computation. Within the COMP state, the division is carried out across multiple clock cycles. During each cycle, the LOPD (leading-one position detector) finds the position of the most significant '1' in both the remainder and the divisor, allowing the hardware to calculate how far to shift the divisor for the largest possible subtraction. The aligned divisor is then subtracted from the remainder, and the quotient is updated accordingly.
      • This process repeats until the remainder becomes less than the divisor. At this point, the module signals completion by setting the out_valid flag and returns to the INIT state, ready for the next division input. The LOPD is essential for this process, as it enables quick identification of the optimal shift amount, thereby improving division speed and hardware efficiency. (figure: div-fsm)
  • Exp: Second-order polynomial approximation, based on I-BERT's integer-only exponent algorithm. The exponent calculation is pipelined into multiple stages for full throughput.

    • Note: I observed an issue in the provided testbench - failing to register certain variables in a pipeline stage did not cause a failure, because the testbench held the quantized input and some input signals stable even after the corresponding valid signal was deasserted.
  • GELU: Similar design to the Exp module; a fully pipelined, full-throughput module. In I-BERT, the GELU activation is approximated for integer-only computation, enabling efficient large language model inference on hardware while preserving the smoothness and expressiveness it has over older activations like ReLU. (figure: gelu)
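The INIT/COMP loop described for Div can be sketched as follows. This is a hedged reconstruction from the description above, not the actual submission: the LOPD is modelled as a simple priority scan (a real LOPD would be a logarithmic tree), and widths and names are assumptions.

```systemverilog
// Sketch of the two-state shift-subtract divider: align the divisor
// under the remainder's leading one, subtract, set the quotient bit.
module div #(parameter W = 32) (
    input  logic         clk, reset,
    input  logic         in_valid,
    input  logic [W-1:0] dividend, divisor,
    output logic         out_valid,
    output logic [W-1:0] quotient
);
    typedef enum logic {INIT, COMP} state_t;
    state_t state;
    logic [W-1:0] rem, dsr;

    // leading-one position (priority scan stand-in for the LOPD module)
    function automatic int lopd(input logic [W-1:0] x);
        for (int i = W-1; i >= 0; i--) if (x[i]) return i;
        return 0;
    endfunction

    always_ff @(posedge clk) begin
        out_valid <= 1'b0;
        if (reset) state <= INIT;
        else case (state)
            INIT: if (in_valid) begin
                rem <= dividend; dsr <= divisor; quotient <= '0;
                state <= COMP;
            end
            COMP: begin
                if (rem < dsr) begin             // done: remainder < divisor
                    out_valid <= 1'b1;
                    state     <= INIT;
                end else begin
                    // shift divisor up to the remainder's leading one
                    automatic int sh = lopd(rem) - lopd(dsr);
                    if ((dsr << sh) > rem) sh--; // back off one place on overshoot
                    rem      <= rem - (dsr << sh);
                    quotient <= quotient | (W'(1) << sh);
                end
            end
        endcase
    end
endmodule
```

Each COMP iteration retires one quotient bit per leading-one alignment, which is why the latency depends on the operands rather than being a fixed W cycles.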

Phase 2: Designing the Systolic Array (~10 hours)

This blew my mind! Systolic arrays allow for elegant mechanisms to perform matrix multiplications in a much faster, scalable manner (highly suggest checking out this foundational paper). The main challenge here, though, is ensuring correct dataflow and synchronization, which was solved by careful design of the PE and array interconnects. Address generators are critical for efficient memory access, especially when matrices are larger than the array. Input/output matrices are partitioned across memory banks using cyclic partitioning with interleaved memory bank addressing, enabling parallel, contention-free data access and streaming (the staggered flow of the data was taken care of). (figures: systolic-4x4, systolic-array-mm-a)
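The heart of the array is the processing element. Below is a minimal sketch of one output-stationary PE under my assumptions about the interface (the actual course PE may differ): operands stream through and are forwarded to neighbours, while the partial sum stays local.

```systemverilog
// Sketch of one output-stationary PE: a flows right, b flows down,
// the partial sum accumulates in place. An N x N grid of these,
// with staggered input feeding, forms the systolic array.
module pe #(parameter W = 16) (
    input  logic                  clk, reset,
    input  logic                  init,        // clear before a new tile
    input  logic signed [W-1:0]   a_in, b_in,
    output logic signed [W-1:0]   a_out, b_out, // registered forwards
    output logic signed [2*W-1:0] sum
);
    always_ff @(posedge clk) begin
        if (reset || init) sum <= '0;
        else               sum <= sum + a_in * b_in;
        a_out <= a_in;   // pass a to the PE on the right
        b_out <= b_in;   // pass b to the PE below
    end
endmodule
```

Because each PE registers its forwarded operands, data naturally arrives one cycle later at each successive row/column, which is exactly the staggering the address generators have to set up at the array edges.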

Phase 3: Building Transformer Primitives (~25 hours)

Softmax + LayerNorm modules - I started to reuse and integrate a lot of the previously implemented modules from Phase 1 here. The main challenge, again, was ensuring proper dataflow and synchronization. I incorporated buffering logic using shift registers and FIFOs to handle latency differences between the deeply pipelined blocks (layernorm consists of 11 pipeline stages) and the multi-cycle FSM blocks (Div, Exp, GELU), ensuring synchronized data flow. (figures: softmax, layernorm)
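The shift-register side of that latency matching is a one-liner in RTL. A minimal sketch, assuming a fixed-depth pipeline with no backpressure (the FIFO case, where the downstream block can stall, needs a proper handshake instead):

```systemverilog
// Sketch: delay a valid flag by DEPTH cycles with a shift register so it
// arrives in step with the output of a DEPTH-stage pipeline
// (e.g. matching the 11-stage layernorm datapath).
module valid_delay #(parameter DEPTH = 11) (
    input  logic clk, reset,
    input  logic in_valid,
    output logic out_valid
);
    logic [DEPTH-1:0] shreg;
    always_ff @(posedge clk) begin
        if (reset) shreg <= '0;
        else       shreg <= {shreg[DEPTH-2:0], in_valid};
    end
    assign out_valid = shreg[DEPTH-1];
endmodule
```

The same structure works for any narrow sideband signal (e.g. a last flag) that must ride alongside a fixed-latency datapath.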

Note: Quantization in softmax is not a single event but a series of deliberate reductions in bit-width at critical points - after exponentiation and after normalization - to keep the computation accurate, efficient, and compatible with hardware limitations.
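Each of those reduction points boils down to a requantization step. A hedged sketch of the idea (the parameter names and the particular rounding scheme are my assumptions; the real module also saturates rather than wrapping):

```systemverilog
// Sketch of one requantization point: narrow a wide intermediate value
// back to OUT_W bits with a round-to-nearest arithmetic right shift.
module requant #(parameter IN_W = 32, OUT_W = 8, SHIFT = 12) (
    input  logic signed [IN_W-1:0]  x,
    output logic signed [OUT_W-1:0] y
);
    // add half an LSB of the output scale before shifting, so the
    // truncation rounds to nearest instead of always toward -inf
    logic signed [IN_W-1:0] rounded;
    assign rounded = x + (IN_W'(1) <<< (SHIFT - 1));
    assign y = OUT_W'(rounded >>> SHIFT);
endmodule
```

Placing one of these after the exponentiation stage and another after the normalization divide is what keeps the downstream datapaths narrow.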

Phase 4: Constructing IBERT Language Model (~5 hours)

Top-Level Assembly - this phase is essentially stitching everything together.

  • Attention Head Assembly (mm.sv): Self-Attention Head module that computes the Q (query), K (key), and V (value) matrices, chained with softmax to compute the self-attention result. As before, all blocks here are wrapped in AXI-Stream shims (tdata, tvalid, tready, tlast).
  • Final System Integration: Wrapper modules for the Systolic Array with Address Generators (from Phase 2), the Requantization units with LayerNorm & GELU, and the Attention Heads.
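The AXI-Stream shim around each block is the standard skid-free register slice. A minimal sketch (interface names follow the AXI-Stream convention; this is illustrative, not the actual wrapper code):

```systemverilog
// Sketch of a one-deep AXI-Stream register slice: accepts a new beat
// whenever its output register is empty or being drained downstream.
module axis_reg #(parameter W = 32) (
    input  logic         clk, reset,
    // slave (upstream) side
    input  logic [W-1:0] s_tdata,
    input  logic         s_tvalid, s_tlast,
    output logic         s_tready,
    // master (downstream) side
    output logic [W-1:0] m_tdata,
    output logic         m_tvalid, m_tlast,
    input  logic         m_tready
);
    assign s_tready = !m_tvalid || m_tready;  // empty, or draining this cycle
    always_ff @(posedge clk) begin
        if (reset) m_tvalid <= 1'b0;
        else if (s_tready) begin
            m_tvalid <= s_tvalid;
            m_tdata  <= s_tdata;
            m_tlast  <= s_tlast;
        end
    end
endmodule
```

Note this simple slice stalls upstream while downstream stalls; a two-entry skid buffer would decouple the two ready signals at the cost of extra registers.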

Takeaways:

  • Hardware design is super cool!! I realized that I enjoy thinking of circuits & systems implementation from a Verilog pov.
  • LUTs/Slice LUT: More combinational logic, wide muxes, big arithmetic units = more LUTs. Trading area vs. flexibility. Excessive LUT use signals inefficient implementation.
  • Slice Registers: Used for pipelining (improves timing), FSM state, etc. More registers often boost throughput, but wasteful use bloats area.
  • Storing data in registers/LUTs is wasteful; always use BRAM when feasible for large arrays/buffers.
  • DSPs: Implementing arithmetic in LUTs saves DSPs when they're scarce, but inflates LUT usage. Favor DSPs for wide or timing-critical math.
  • Timing constraints: Managing slack is critical during timing closure, where designers may adjust pipelining or optimize logic. Adding pipeline stages (increasing depth) breaks long combinational paths into shorter register-bounded segments, helping meet critical-path timing and improving the max clock frequency.
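As a concrete illustration of that last point, here is a hedged sketch (module name and widths are mine) of retiming a multiply-add so no single path carries both a multiplier and an adder:

```systemverilog
// Sketch: a * b + c * d with one extra register stage. The critical
// path shrinks from multiply-plus-add to just a multiply (or just an
// add), trading one cycle of latency for a higher max clock frequency.
module madd2 #(parameter W = 16) (
    input  logic                  clk,
    input  logic signed [W-1:0]   a, b, c, d,
    output logic signed [2*W:0]   y
);
    logic signed [2*W-1:0] p0, p1;   // stage 1: register the products
    always_ff @(posedge clk) begin
        p0 <= a * b;
        p1 <= c * d;
        y  <= p0 + p1;               // stage 2: register the sum
    end
endmodule
```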