12/14: Day 14.5 🎄
Where'd All the Time Go? (Dr. Dog 💿🔥)
Unwrapping TPUs Continued ✨
As an added bonus to make up for the time lost over finals season, we had a rare opportunity to catch up with Cliff Young and dive a bit deeper into the exact numerics happening on the TPU v1, beyond the Jouppi et al. paper. Thank you again, Cliff, for your generous time; on the chance that you’re reading this, we appreciate you!!
(Nano Banana)
Here are some of our key questions answered!
Cliff’s notes:
Fixed-point was the TPU v1’s implementation, as the time crunch demanded; it was pure-play inference from the get-go. Floating point was implemented later on, allowing for greater numeric flexibility. The INT8 format comes in both unsigned and signed flavors, mediated by an offline scale factor (a real number).
The scale factor can be thought of as the real value of a single 1-bit step, translating integer codes onto the real number line. As an example, for the ReLU6 activation with the range normalized so that 0 represents 0.0 and a full 256 steps would reach 1.0, you scale each code accordingly to find the representable values. Note that 1.0 itself is not included (255 lands just below it).
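To make that concrete, here’s a minimal Python sketch of quantizing/dequantizing against an offline scale factor. This is our own toy code and naming, not the v1 implementation; the normalization of the range onto [0.0, 1.0) follows the example above.

```python
# Minimal sketch (not TPU v1 code): an offline scale factor maps uint8
# codes onto the real number line; scale is the real value of 1 LSB.

def quantize(x, scale, num_levels=256):
    """Map a real value to an integer code, clamped to the representable range."""
    code = int(round(x / scale))
    return max(0, min(num_levels - 1, code))

def dequantize(code, scale):
    """Map an integer code back onto the real number line."""
    return code * scale

# ReLU6-style example from above, normalized so codes span [0.0, 1.0):
scale = 1.0 / 256                  # real value of one step
print(dequantize(0, scale))        # 0.0
print(dequantize(255, scale))      # 0.99609375 -- 1.0 itself is not representable
```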
ReLU-X translates to a rescale-and-clamp, as expected (min and max filters at 0 and X for the value in question). For tanh and sigmoid there’s still a rescale, but it’s there to get the input into the right range for the sigmoid approximation unit.
—> this is a really neat clarification; in our codebase right now ReLU6 is implemented statically … revisions incoming! (a quick sketch of the rescale-and-clamp view is below)
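For reference while we revise, here’s roughly what the rescale-and-clamp view of ReLU-X looks like on integer codes. This is illustrative only; the scale parameters and function name are our own, and the real hardware path (and our codebase) realize the rescale differently.

```python
def quantized_relu_x(codes, in_scale, out_scale, x_max):
    """ReLU-X on integer codes: rescale from the input scale to the output
    scale, then clamp to [0, X] expressed in output codes.
    Hypothetical sketch -- not the TPU v1 datapath."""
    upper = int(round(x_max / out_scale))            # X expressed as an output code
    out = []
    for c in codes:
        rescaled = int(round(c * in_scale / out_scale))
        out.append(max(0, min(upper, rescaled)))     # min/max filters at 0 and X
    return out

# e.g. ReLU6 with an assumed input step of 0.05 and an output step of 6/255:
print(quantized_relu_x([0, 50, 200], in_scale=0.05, out_scale=6 / 255, x_max=6.0))
# -> [0, 106, 255]  (the last value, 10.0 in real terms, clamps at 6.0)
```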
Getting 16-bit precision out of the INT8 hardware is a series of multiplies and adds over the constituent 8-bit blocks, as seen in our Accumulator breakdown (check out Day 3).
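As a back-of-the-envelope illustration of composing a wider multiply out of 8-bit pieces, here’s the schoolbook decomposition in Python. This shows the “multiply and add the blocks” idea only; it is not a claim about the exact v1 sequencing.

```python
def mul16_from_8bit(a, b):
    """Multiply two unsigned 16-bit values using only 8x8-bit products,
    combined with shifts and adds into a wide accumulator.
    Schoolbook decomposition; the actual TPU v1 datapath differs in detail."""
    a_hi, a_lo = a >> 8, a & 0xFF
    b_hi, b_lo = b >> 8, b & 0xFF
    # Four 8x8 partial products, each of which fits in 16 bits:
    acc  = (a_hi * b_hi) << 16
    acc += (a_hi * b_lo) << 8
    acc += (a_lo * b_hi) << 8
    acc += (a_lo * b_lo)
    return acc  # accumulated into a wide register, as in the accumulator breakdown

assert mul16_from_8bit(40000, 123) == 40000 * 123
```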
The mixed 16-bit and 8-bit numerics were a pain to get working, requiring 3 different state trackers… but it’s doable: the original Speech inference run used 16-bit for the LSTM calculations.
v2 is a new architecture, with floating point. There’s more to dive into on this point; as we launch v1 on the FPGA, we’ll start breaking down backprop and all the training goodness there is to explore!!
Works Referenced
Jouppi et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit,” ISCA 2017.



