12/10: Day 10 🎄
TPU on FPGA
Welcome back to Unwrapping TPUs!
Yesterday, we established roofline plots as the fundamental framework for understanding performance limits. Today, we look at implementing the TPU architecture on an FPGA and the logistics of running it as a coprocessor.
Why FPGA? TPU v1
As an Application-Specific Integrated Circuit (ASIC), TPU v1 had a fixed architecture, with a massive 256x256 systolic array optimized for 8-bit integer matrix multiplications. Great for its intended purpose (inference on quantized models), but inflexible. In true ASIC fashion, what Google designed is what everyone gets (and, as the news keeps revealing, what Google designed is very good and in very high demand)!
Custom FPGA implementations are the opposite. FPGAs provide a reconfigurable fabric where users control the exact datapath built for a given workload. For implementing a TPU from scratch in a hacking setting, this is perfect: need a smaller systolic array with custom precision? Build it. Code it. Ship it. The FPGA's flexibility allows optimization for the exact workload rather than hoping it maps well to fixed hardware.
The tradeoff: TPU v1 was a complete, validated design. FPGAs require implementing everything from scratch: memory controllers, systolic logic, instruction decoders, the whole stack. And what we give up is raw performance and scale.
Where our Design Lands on the Roofline
Here’s where roofline plots become immediately practical. Once the classifier is implemented on FPGA or TPU hardware, the roofline reveals what’s actually limiting performance.
Memory bound means the systolic array sits idle, waiting for data. No amount of additional compute logic helps. Fixes include improving on-chip memory management, better tiling strategies, increasing memory bandwidth, and optimizing buffer utilization. Compute bound means processing elements are saturated while memory bandwidth is sufficient. Fixes involve optimizing the compute pipeline itself, increasing parallelism, reducing operation count, and improving PE efficiency.
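To make that diagnosis concrete, here is a minimal sketch in Python; the peak compute and bandwidth figures in the example are placeholders, not real hardware specs.

```python
# Minimal sketch: which side of the roofline a kernel falls on, and therefore
# which family of fixes applies. Peak numbers below are placeholders.
def roofline_regime(ops, bytes_moved, peak_ops_per_s, peak_bytes_per_s):
    ai = ops / bytes_moved                        # arithmetic intensity, ops/byte
    ridge = peak_ops_per_s / peak_bytes_per_s     # ops/byte where the two roofs meet
    attainable = min(peak_ops_per_s, ai * peak_bytes_per_s)
    return ("memory bound" if ai < ridge else "compute bound"), attainable

# Example: a kernel doing 1 GOP while moving 100 MB, on 50 GOPS / 1.6 GB/s hardware
print(roofline_regime(1e9, 100e6, 50e9, 1.6e9))   # -> ('memory bound', 16000000000.0)
```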
For TPU v1, with limited external memory bandwidth, many kernels with large batch sizes or extensive input data end up memory bound. Custom FPGA designs depend entirely on how on-chip memory, external bandwidth, and compute resources are balanced.
Different parts of the classifier land in different regions. High-compute operations like large matrix multiplies tend toward compute bound. Operations involving extensive data lookup or I/O (loading weights, activation functions, layer normalization) skew memory bound. The roofline indicates where to focus optimization efforts and potentially shift the balance of our FPGA implementation (say for greater memory bandwidth via BRAM cell allocation).
Profiling the FPGA TPU Implementation
Once the TPU implementation is running on FPGA, profiling becomes essential to understand where it sits on the roofline. The original TPU v1 paper profiled six neural network applications: two MLPs, two LSTMs, and two CNNs, all production models serving inference in Google datacenters. Each of these workloads has vastly different arithmetic intensity.
For profiling an FPGA implementation, we need to measure two key metrics: actual operations per second (ops/sec or FLOPS) and memory bandwidth utilization (bytes/sec). On FPGA, this means instrumenting the design with performance counters. Adding registers that track systolic array busy cycles, memory read/write transactions, and total operations completed gives raw data. These counters get read back through the PCIe interface or captured in simulation.
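As a rough sketch of what that readback might look like, the snippet below polls a hypothetical counter block over the AXI-Lite "user" device the XDMA driver exposes. The register offsets and the 100 MHz clock are assumptions about our own design, not anything the XDMA IP defines.

```python
# Read back hypothetical performance counters over the XDMA user (AXI-Lite) device.
import os, struct

USER_DEV = "/dev/xdma0_user"
REG_BUSY_CYCLES = 0x00      # systolic-array busy cycles (assumed offset)
REG_MEM_BYTES   = 0x04      # bytes moved to/from DDR (assumed offset)
REG_TOTAL_OPS   = 0x08      # operations completed (assumed offset)
CLK_HZ = 100e6              # assumed fabric clock

def read_reg(fd, offset):
    # 32-bit little-endian register read at a fixed offset into the user BAR
    return struct.unpack("<I", os.pread(fd, 4, offset))[0]

fd = os.open(USER_DEV, os.O_RDONLY)
busy = read_reg(fd, REG_BUSY_CYCLES)
bytes_moved = read_reg(fd, REG_MEM_BYTES)
ops = read_reg(fd, REG_TOTAL_OPS)
os.close(fd)

elapsed = busy / CLK_HZ
print(f"{ops / elapsed / 1e9:.2f} GOPS, {bytes_moved / elapsed / 1e9:.2f} GB/s, "
      f"AI = {ops / bytes_moved:.1f} ops/byte")
```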
The arithmetic intensity calculation is straightforward: we take total operations divided by total bytes transferred. For a matrix multiply of dimensions MxK and KxN, the operation count is 2MNK (multiply-accumulate counts as two ops). The memory traffic is the bytes read for both input matrices plus bytes written for the output matrix. If most data fits in on-chip buffers and gets reused, arithmetic intensity increases. If every operation requires fetching from external DRAM, intensity collapses.
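A quick sanity check of that formula, assuming int8 operands and results and no on-chip reuse (every element travels to or from external memory exactly once):

```python
# Rough arithmetic-intensity estimate for an MxK @ KxN matmul with no reuse.
def matmul_ai(M, K, N, bytes_per_elem=1):        # int8 in and out (assumed)
    ops = 2 * M * N * K                          # each MAC = multiply + add
    traffic = (M * K + K * N + M * N) * bytes_per_elem
    return ops / traffic

print(matmul_ai(256, 256, 256))                  # ~170 ops/byte even without reuse
```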
(Jouppi et al.)
Different model architectures land very differently. MLPs with large fully-connected layers tend toward compute bound when batch sizes are large, since weights get reused across many inputs. CNNs depend heavily on image dimensions and filter sizes; small convolutions with frequent memory access hit the memory-bound region. LSTMs are notoriously memory bound because their sequential nature and constant state updates prevent effective batching.
Plotting these measured points against the theoretical roofline (determined by FPGA specs: peak MAC operations per cycle, memory bandwidth from DDR controller) immediately shows bottlenecks. If measured performance tracks the sloped bandwidth line, the implementation is memory-limited. If performance plateaus below the flat compute ceiling, either the systolic array isn’t fully utilized or there’s a pipeline stall issue in the control logic.
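A plotting sketch along those lines is below; the peak numbers are placeholders for whatever the FPGA build actually achieves, and the measured points are illustrative only, not real profiling data.

```python
# Plot measured kernels against a theoretical roofline.
import numpy as np
import matplotlib.pyplot as plt

PEAK_OPS = 51.2e9        # e.g. 16x16 array * 2 ops/MAC * 100 MHz (assumed)
PEAK_BW  = 1.6e9         # DDR controller bandwidth in bytes/s (LiteFury-class)

ai = np.logspace(-1, 3, 200)                     # arithmetic intensity, ops/byte
roof = np.minimum(PEAK_OPS, ai * PEAK_BW)        # attainable ops/s at each intensity

# (arithmetic intensity, measured ops/s) from the performance counters -- illustrative
measured = {"matmul 256x256": (170, 40e9), "layernorm": (2, 2.8e9)}

plt.loglog(ai, roof, label="roofline")
for name, (x, y) in measured.items():
    plt.loglog(x, y, "o", label=name)
plt.xlabel("arithmetic intensity (ops/byte)")
plt.ylabel("attainable ops/s")
plt.legend()
plt.show()
```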
PCIe Coprocessor Integration: M.2 Form Factor
The cleanest way to integrate an FPGA TPU implementation as a coprocessor is through PCIe in the M.2 form factor. Boards like the LiteFury and NiteFury from RHS Research pack a Xilinx Artix-7 FPGA (XC7A100T or XC7A200T) into an M.2 2280 Key M card—the same physical format as NVMe SSDs. This means the FPGA plugs directly into a laptop’s M.2 slot and communicates over PCIe Gen2 x4 (2GB/s bandwidth).
(RHS Research)
The hardware specs matter for understanding performance limits. LiteFury provides 512MB DDR3-800 RAM with 1.6GB/s bandwidth, while NiteFury doubles this to 1GB. Both include configuration flash and expose 4 LVDS pairs plus general-purpose I/O through a connector. The FPGA itself has roughly 100K-200K logic cells depending on the variant, enough for a 16x16 or 32x32 systolic array implementation with supporting logic.
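A back-of-envelope calculation ties these specs back to yesterday's roofline. The 100 MHz fabric clock and the array sizes below are assumptions for illustration; only the 1.6GB/s DDR3 figure comes from the board specs.

```python
# Where the ridge point lands for a LiteFury-class board at an assumed 100 MHz clock.
CLK_HZ = 100e6
DDR_BW = 1.6e9                                   # bytes/s

for dim in (16, 32):
    peak_ops = dim * dim * 2 * CLK_HZ            # MACs/cycle * 2 ops per MAC * clock
    ridge = peak_ops / DDR_BW                    # ops/byte where the roofs meet
    print(f"{dim}x{dim} array: {peak_ops / 1e9:.1f} GOPS peak, "
          f"ridge point ~ {ridge:.0f} ops/byte")
```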
The key architectural component is Xilinx’s DMA/Bridge Subsystem for PCI Express (XDMA). This IP block handles the PCIe protocol layer and provides AXI-Stream interfaces for streaming data to and from the FPGA. The systolic array connects to XDMA through AXI-Stream buses, one for host-to-card (H2C) data flow and one for card-to-host (C2H) results.
Data flow works as follows. The host CPU writes input matrices (activations and weights) to /dev/xdma0_h2c_0, which streams data through the PCIe interface into the FPGA’s unified buffer. The control logic loads weights into the weight FIFOs, then triggers the systolic array to begin computation. As the array produces output, results stream back through /dev/xdma0_c2h_0 where the host reads them. This streaming approach is critical—without it, the latency of multiple discrete transfers would destroy performance.
Implementation details matter here. The XDMA block operates at 64-byte (512-bit) bus width, so AXI-Stream Data Width Converters are needed if the systolic array uses different widths. The control interface can be either AXI4-Lite registers or simpler ap_ctrl signals—the latter often proves more reliable. Starting computation involves either writing to control registers or simply holding a start signal high.
Software Stack and Driver Integration
Getting the FPGA to actually talk to the host requires kernel drivers and userspace code. Xilinx provides the XDMA Linux kernel driver, though it has known bugs that require patching. The critical fix involves the config BAR detection logic—without this patch, the driver fails to recognize the FPGA entirely.
On the software side, communication happens through character devices. Opening /dev/xdma0_h2c_0 and /dev/xdma0_c2h_0 in O_WRONLY and O_RDONLY modes respectively gives file descriptors for streaming data. Writing to the H2C device sends data to the FPGA; reading from the C2H device retrieves results. The key detail: start listening for results before sending inputs, otherwise the output buffer may overflow or the read may hang waiting for data that was already transmitted.
For real implementations, async I/O using Python's asyncio or similar mechanisms handles this cleanly. One coroutine continuously writes input matrices while another simultaneously reads results. This overlaps communication with computation: while the FPGA processes one batch, the host is already sending the next.
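A sketch of that pattern with asyncio is below. The device names follow the XDMA driver; the result size and framing are assumptions about our own design, not something the driver defines.

```python
# Overlap H2C writes with C2H reads; the reader starts first so no output is dropped.
import asyncio

H2C = "/dev/xdma0_h2c_0"
C2H = "/dev/xdma0_c2h_0"
RESULT_BYTES = 256 * 256 * 4          # assumed int32 output tile per batch

async def send_inputs(batches):
    with open(H2C, "wb", buffering=0) as h2c:
        for buf in batches:
            # blocking write pushed to a worker thread so the event loop keeps running
            await asyncio.to_thread(h2c.write, buf)

async def receive_results(n_batches, results):
    with open(C2H, "rb", buffering=0) as c2h:
        for _ in range(n_batches):
            results.append(await asyncio.to_thread(c2h.read, RESULT_BYTES))

async def run(batches):
    results = []
    # start the reader before streaming inputs, per the note above
    reader = asyncio.create_task(receive_results(len(batches), results))
    await send_inputs(batches)
    await reader
    return results

# results = asyncio.run(run(list_of_input_buffers))
```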
Coprocessor Setup
Getting the accelerator actually usable means setting it up as a coprocessor alongside a host CPU. There are two main approaches. The first is cloud deployment on AWS: rent an FPGA instance and upload our bitstream, with CPU-side tasks running on the virtual machine and communicating with the FPGA through the cloud provider's interface. This is the most convenient approach... but it's not the coolest :).
The other option is to go physical: an FPGA dev board connected to a host CPU via PCIe, with the CPU orchestrating, sending data and commands to the FPGA. The FPGA then performs the computation we designed in Verilog (i.e. the systolic array implementation) and returns results to the CPU, moving data between host DRAM and on-board memory.
In subsequent posts we first focus on finishing our miniaturized TPU in Verilog, then look at porting it over to an actual hardware implementation. TPU on FPGA coming soon!
Works Cited
RHS Research LLC. NiteFury and LiteFury Public Repository, github.com/RHSResearchLLC/NiteFury-and-LiteFury
Jouppi et al. "In-Datacenter Performance Analysis of a Tensor Processing Unit."




