12/12: Day 12 🎄
Who's the grandfather of the TPU?
Welcome back to Unwrapping TPUs ✨!
Previously, we focused on how systolic arrays are implemented in the TPU v1 architecture. Today, we dig into the hardware history that made the TPU possible and the lessons learned in the 1980s that Google applied 30 years later. To understand the TPU v1, we want to look at the original systolic array machine: a computer from Carnegie Mellon called Warp.
The 1978 Origin: “Systolic”
In 1978, H.T. Kung and Charles Leiserson published “Systolic Arrays (for VLSI)”, proposing a way around the memory bottleneck. Instead of a CPU fetching data one piece at a time, they imagined a network of small processors that pumped data through themselves rhythmically, with the goal of performing multiple computations for each memory access by keeping data flowing regularly through the network. In this model, data flows from cell to cell and is reused many times before returning to memory, effectively balancing I/O bandwidth with computation speed. This is fundamentally the same idea the TPU v1 implements with its systolic array.
The First Attempt: The “Warp” Computer
In the 1980s, H.T. Kung put this theory into practice at CMU with a machine called Warp. Kung’s original paper outlined a linear systolic array: essentially a 1D row of processors. In this setup, data is pumped in one end, ripples through neighboring cells, and results exit the other side. The Warp machine implemented this linear topology with programmable cells. While Warp proved that systolic arrays worked for specific tasks like convolution, it also tried to be a general-purpose computer in which each cell ran its own program. This “smart cell” approach (MIMD) made the hardware complex and difficult to scale to the density required for modern AI.
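To make the rhythm concrete, here is a toy, cycle-level sketch of a linear systolic convolution array in Python. To be clear, this is our own illustration, not Warp’s actual cell design or Kung’s exact schematic: weights sit still, one per cell, while inputs and partial sums pulse rightward on each clock beat, with two registers on the input path per cell (the classic trick that keeps the two streams aligned at full rate).

```python
import numpy as np

# Toy cycle-level sketch of a 1D (linear) systolic convolution array.
# Cell k holds weight w[k] permanently. Inputs and partial sums both
# march rightward one stage per cycle; the x path has two registers per
# cell and the y path one, which keeps the two streams in step.
def systolic_conv1d(x, w):
    K = len(w)
    x_regs = [0.0] * (2 * K)          # x shift register: 2 stages per cell
    y_regs = [0.0] * K                # partial-sum register: 1 per cell
    out = []
    for t in range(len(x) + 2 * K):   # extra cycles drain the pipeline
        x_in = x[t] if t < len(x) else 0.0
        new_y = []
        for k in range(K):            # all cells fire on the same beat
            x_here = x_in if k == 0 else x_regs[2 * k - 1]
            y_prev = 0.0 if k == 0 else y_regs[k - 1]
            new_y.append(y_prev + w[k] * x_here)
        x_regs = [x_in] + x_regs[:-1] # advance the x pipeline one stage
        y_regs = new_y
        out.append(y_regs[-1])        # rightmost cell emits a result
    # y[n] exits the array K-1 cycles after x[n] enters
    return out[K - 1 : K - 1 + len(x) + K - 1]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [0.5, 1.0, -1.0]
assert np.allclose(systolic_conv1d(x, w), np.convolve(x, w))
```

Note the payoff: each input value is fetched from “memory” exactly once, then touched by all K cells as it marches through. That is the multiple-computations-per-memory-access win the 1978 paper was after.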
Fast Forward to TPU v1
When Google built the TPU v1, deployed in its datacenters in 2015, it returned to Kung’s original idea, but with a crucial twist learned from history. The most visible difference lies in the topology. While Warp and the initial examples in Kung’s paper focused on linear (1D) arrays for matrix-vector multiplication, the TPU v1 uses a 2D mesh array of 256 × 256 cells. Kung’s paper actually anticipated 2D structures, proposing a hexagonal mesh for dense matrix multiplication in which data flows in three directions. The TPU simplifies this into a square grid, matching the physical shape of matrix multiplication and letting data move in strict rows and columns, maximizing reuse without the complexity of hexagonal routing.
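Here is a similarly toy simulation of that weight-stationary 2D grid (tiny dimensions, names of our choosing, and none of the real chip’s pipelining, quantization, or accumulator hardware). Weight W[i][j] is pre-loaded into cell (i, j); activations stream in from the left edge; partial sums trickle down the columns and spill out the bottom. Note the staggered feed at the left edge, which is precisely the timing trick discussed next.

```python
import numpy as np

# Toy cycle-level sketch of a weight-stationary 2D systolic array in the
# spirit of TPU v1's matrix unit (illustrative, not Google's design).
# Cell (i, j) holds W[i, j]; activations move right, partial sums move down.
def systolic_matmul(A, W):
    M, K = A.shape                    # A: activations (M x K)
    K2, N = W.shape                   # W: weights (K x N), pre-loaded
    assert K == K2
    a_reg = np.zeros((K, N))          # activation register in each cell
    p_reg = np.zeros((K, N))          # partial-sum register in each cell
    C = np.zeros((M, N))
    for t in range(M + K + N):        # enough cycles to drain the grid
        new_a = np.zeros((K, N))
        new_p = np.zeros((K, N))
        for i in range(K):
            for j in range(N):
                if j == 0:            # left edge: row i is skewed i cycles
                    m = t - i
                    a_in = A[m, i] if 0 <= m < M else 0.0
                else:
                    a_in = a_reg[i, j - 1]
                p_in = 0.0 if i == 0 else p_reg[i - 1, j]
                new_p[i, j] = p_in + W[i, j] * a_in  # one MAC, nothing else
                new_a[i, j] = a_in                   # pass activation right
        a_reg, p_reg = new_a, new_p
        for j in range(N):            # bottom edge emits finished sums
            m = t - (K - 1) - j
            if 0 <= m < M:
                C[m, j] = p_reg[K - 1, j]
    return C

A = np.arange(12, dtype=float).reshape(3, 4)   # M=3, K=4
W = np.arange(8, dtype=float).reshape(4, 2)    # K=4, N=2
assert np.allclose(systolic_matmul(A, W), A @ W)
```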
(Kung and Leiserson, 1978)
Despite these topological differences, both architectures use staggered inputs to solve the timing challenge. If you dump all the data into the array at once, the pipeline jams. Kung described this back in 1978! In the TPU, it manifests as a wavefront: Row 1 is delayed by one cycle relative to Row 0, Row 2 by two cycles, and so on. This skewing ensures that an input activation arrives at a specific processing element at the exact moment its corresponding weight is stationary and ready to multiply.
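The skew is easy to see if you write out the feed schedule for the array’s left edge: each grid row’s input stream is just the corresponding activation column delayed by the row index, so valid data forms a staircase. A minimal sketch (the helper name is ours):

```python
import numpy as np

# Entry (t, i) is what grid row i receives at cycle t. Row i is delayed
# i cycles relative to Row 0, producing the diagonal wavefront.
def skew_schedule(A):
    M, K = A.shape
    sched = np.zeros((M + K - 1, K))
    for i in range(K):
        sched[i : i + M, i] = A[:, i]
    return sched

A = np.array([[1, 2, 3],
              [4, 5, 6]])
print(skew_schedule(A))
# [[1. 0. 0.]
#  [4. 2. 0.]
#  [0. 5. 3.]
#  [0. 0. 6.]]
```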
Finally, the TPU diverged from Warp by simplifying the processing element itself. Warp cells were complex and programmable; the TPU v1 cells are simple ALUs. They implement the inner product step processor concept described by Kung and Leiserson. By stripping away the program counters and instruction fetching from the individual cells, Google achieved the massive density that Kung and Leiserson predicted would be possible with VLSI technology.
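In code terms, the entire “program” of such a cell fits in two lines. A sketch of the inner product step as we read it (the function name is ours): latch the operands, do one multiply-accumulate, pass everything on. No instruction fetch, no program counter; the schedule lives entirely in the wiring and the clock.

```python
# The whole behavior of a cell: one MAC per beat, operands passed through.
def inner_product_step(a_in, b_in, c_in):
    c_out = c_in + a_in * b_in    # the only computation a cell ever does
    return a_in, b_in, c_out      # operands continue on to the neighbors
```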
Also, once finals are over .. it’s time for FPGAs .. hacking together neat RTL design tools .. and more. Stay tuned!!
Works Referenced
Kung, H. T., and Charles E. Leiserson. “Systolic arrays (for VLSI).” Sparse Matrix Proceedings 1978. Society for Industrial and Applied Mathematics, 1979.