A lot has happened in transformer quantization over the past few years: we have gone from barely being able to quantize a 7B model to INT8 without destroying accuracy to routinely fitting a 70B model in 4 bits on a single GPU. But existing guides on the topic are fragmented, focusing either on a specific technique or on how to use a particular library. I’ve been working on integer quantization for fixed-point hardware for a while now, and my goal with this series is to bridge that gap: building the core ideas carefully and tracing how the field has evolved, with each technique motivated by the problems of what came before. This first post covers the foundations: what quantization is, why it’s hard, and the math behind it.
What is Quantization & why should you care?
Quantization is the process of representing high-precision values using fewer bits. In practice, this means storing weights and (optionally) activations in lower precision (e.g., int8 instead of fp16), introducing a small approximation error.
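To make the "small approximation error" concrete, here is a minimal sketch of one simple scheme, symmetric absmax scaling to int8. The scheme, the helper names `quantize_int8` and `dequantize`, and the sample weights are assumptions chosen for illustration, not a method this post has introduced yet.

```python
def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, where any scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, picked only to show the round trip.
weights = [0.12, -0.53, 0.98, -1.27, 0.004]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print("codes:   ", codes)
print("restored:", restored)
print("max abs error:", max(errors), "(bounded by scale / 2 =", scale / 2, ")")
```

With round-to-nearest, each value lands at most half a quantization step (`scale / 2`) away from its original, which is the approximation error referred to above.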
The most immediate and easy-to-realize benefit of quantization is memory reduction. As a rule of thumb, a model with N billion parameters requires roughly 2 × N GB of memory when stored in 16-bit precision. Quantizing to 8-bit or 4-bit reduces this footprint by 2× and 4×, respectively.
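The rule of thumb is just parameters times bytes per parameter. A small sketch, assuming weights only (the helper name `model_memory_gb` is hypothetical, and the estimate ignores activations, the KV cache, and any per-group scales a quantization format stores alongside the weights):

```python
def model_memory_gb(num_params_billion, bits_per_param):
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # (N * 1e9 params) * bytes_per_param / (1e9 bytes per GB)
    return num_params_billion * bytes_per_param


for bits in (16, 8, 4):
    print(f"70B model at {bits}-bit: {model_memory_gb(70, bits):.0f} GB")
```

For a 70B model this gives roughly 140 GB at 16-bit, 70 GB at 8-bit, and 35 GB at 4-bit, which is why 4-bit is the point where such a model starts to fit on a single large GPU.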
There is also a hardware advantage. In 2014, Mark Horowitz of Stanford University published the paper Computing’s Energy Problem, which compared the energy cost of floating-point and integer operations:
Energy Costs for various operations on a 45nm CMOS node. Source: Computing’s Energy Problem
So, integer arithmetic consumes less energy: an int8 add uses about 30x less energy than an fp32 add, and an int8 multiply uses about 18x less energy than an fp32 multiply. Lower-precision integer hardware is also faster and takes up less silicon area than floating-point hardware.
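The ratios can be recomputed from the per-operation energies. The picojoule values below are the ones commonly quoted from that paper's 45nm table; treat them as an assumption and check them against the original figure. The names `ENERGY_PJ` and `energy_ratio` are hypothetical.

```python
# Energy per operation in picojoules on a 45nm process, as commonly
# quoted from Horowitz's table (assumed here; verify against the source).
ENERGY_PJ = {
    ("int8", "add"): 0.03,
    ("int32", "add"): 0.1,
    ("fp16", "add"): 0.4,
    ("fp32", "add"): 0.9,
    ("int8", "mul"): 0.2,
    ("int32", "mul"): 3.1,
    ("fp16", "mul"): 1.1,
    ("fp32", "mul"): 3.7,
}


def energy_ratio(op, baseline="fp32", target="int8"):
    """How many times more energy the baseline type spends on one op."""
    return ENERGY_PJ[(baseline, op)] / ENERGY_PJ[(target, op)]


for op in ("add", "mul"):
    print(f"fp32 {op} / int8 {op}: {energy_ratio(op):.1f}x")
    print(f"fp16 {op} / int8 {op}: {energy_ratio(op, baseline='fp16'):.1f}x")
```

Under these numbers the gap against fp16 is smaller than against fp32 but still several-fold, so the argument holds even when the starting point is already half precision.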
How do these benefits translate to real-world gains? It depends on the bottleneck:
Compute-bound workloads (e.g., CNNs, LLM prefill): Quantization can improve throughput since lower-precision arithmetic is faster and consumes less energy.
Memory-bandwidth-bound workloads (e.g., LLM decoding): Quantization reduces the amount of data moved, improving performance by lowering memory bandwidth pressure.
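For the memory-bandwidth-bound case, a back-of-the-envelope bound shows why fewer bits help: generating one token at batch size 1 has to read every weight once, so the time per token cannot be lower than weight bytes divided by memory bandwidth. The sketch below assumes a hypothetical accelerator with 2000 GB/s of bandwidth and ignores KV-cache reads, dequantization cost, and compute time; `decode_ms_per_token` is an illustrative name.

```python
def decode_ms_per_token(num_params_billion, bits_per_param, bandwidth_gb_per_s):
    """Lower bound on per-token decode latency when weight reads dominate."""
    weight_gb = num_params_billion * bits_per_param / 8
    return weight_gb / bandwidth_gb_per_s * 1000


BANDWIDTH_GB_PER_S = 2000  # hypothetical device, not a measured figure

for bits in (16, 8, 4):
    ms = decode_ms_per_token(70, bits, BANDWIDTH_GB_PER_S)
    print(f"70B at {bits}-bit: at least {ms:.1f} ms/token "
          f"(at most {1000 / ms:.0f} tokens/s)")
```

Halving the bits halves the bytes moved, so this bound improves linearly with precision. In the compute-bound case the gain instead depends on whether the hardware actually has faster low-precision math units.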