Quantization: Shrink the Weights
Same model, a quarter of the memory — if you protect the 1% of weights that actually matter. Quantization is what puts a 70B model on one card and a 4B model on your laptop, and in 2026 the defaults have settled.
§ 00 · WHY FP16 IS WASTEFULMemory is parameters times bytes
A language model is a pile of numbers. Serving it means holding those numbers in GPU memory, and the arithmetic is blunt: VRAM ≈ parameters × bytes-per-parameter. Train in FP16 — two bytes each — and a 7B model needs ~14GB of weights, a 70B model ~140GB. The 70B doesn’t fit on any single consumer card, or even most datacenter ones, before you’ve served a token.
But those weights were never that precise to begin with. A trained weight is a smallish number, and the difference between storing it in 16 bits and storing it in 4 is mostly storing noise. Quantization is the bet that you can throw away most of those bits and barely move the model’s outputs — if you do it carefully. Get it right and the same GPU holds a model 2–4× larger, and generates faster too, because decode is bound by how fast you can read the weightsMemory-bandwidth bound. For single-stream generation, the bottleneck is reading the weights from GPU memory each token, not doing the math — so halving the bytes per weight roughly speeds up decoding..
§ 01 · THE NUMBER-FORMAT LADDERFP16 → FP8 → INT8 → INT4
Each rung down the ladder halves (or quarters) the bytes per weight. FP16 and FP8 keep a floating-point shape (an exponent, so they span a wide range); INT8 and INT4 map weights onto a small grid of integers with a scale factor. Fewer bits means less memory and faster reads — and more rounding error. Drag the lab: pick a model size, pick a format, and watch whether it fits your card and what quality survives.
FP16 is the reference: full quality, but two bytes per parameter. A 70B model needs ~145GB just for weights — multiple GPUs before you've served a single token. VRAM, speed, and quality are rule-of-thumb estimates.
The pattern you’ll feel: FP8 and INT8 are nearly free — half the memory, quality basically intact. The 4× prize is INT4, but naive INT4 leaves quality on the table, especially on smaller models. That gap is what AWQ closes.
§ 02 · NOT ALL WEIGHTS ARE EQUALAWQ protects the salient 1%
Naive INT4 treats every weight the same. But a small fraction of weights — the ones multiplied by the largest activations — matter far more to the output than the rest. Round those carelessly and quality falls off a cliff; round the other 99% and almost nothing happens.
AWQ (activation-aware weight quantization) finds those salient weights by looking at activation magnitudes, and protects them — scaling them so the rounding grid lands kindly, rather than storing them at full width. The result: INT4 memory, near-FP16 quality. It won a MLSys 2024 best paper and, by 2026, became the default recipe for 4-bit deployment. The broader lesson generalizes: GPTQ and LLM.int8() make the same move in different ways — spend your precision budget where it matters, not uniformly.
§ 03 · THE FORMATSGPTQ · AWQ · GGUF · FP8 — which and when
The zoo of names is really two questions: what math (how the weights get rounded) and what container(the file format + the runtime that reads it). Four you’ll actually meet:
- AWQ — activation-aware INT4. The default when you want 4-bit on a GPU with quality intact. Read by vLLM, TGI, and friends.
- GPTQ — the other well-established INT4 method (error-compensating, layer-by-layer). Still common; AWQ usually edges it on quality-at-speed in 2026.
- GGUF — not a math, a container: the llama.cpp file format with a whole menu of quant levels (Q4_K_M, Q5_K_M, Q8_0…). The default for local inference — Ollama, LM Studio, llama.cpp all speak it, CPU or GPU.
- FP8 — 8-bit float. On Hopper/Blackwell GPUs it’s the datacenter serving default: half the memory of FP16, quality essentially intact, and native hardware support so it’s genuinely fast.
Rule of thumb for 2026: serving on datacenter GPUs → FP8; 4-bit on a GPU → AWQ; running locally → a GGUF Q4_K_M or Q5_K_M. The blueprint builds the GGUF and AWQ paths side by side so you can measure the tradeoff yourself.
§ 04 · QUANTIZE THE KV CACHE TOOWeights aren’t the only thing in memory
Quantizing weights shrinks the model. But at long context lengths, the KV cache — the running memory of every token attended to so far — can rival or exceed the weights. The same idea applies: store the cache in INT8 or INT4 instead of FP16. KVQuant and similar methods push KV-cache quantization far enough to fit million-token contexts that FP16 caches couldn’t hold.
The limits worth respecting: quantization sharpens a memory/quality tradeoff, it doesn’t escape it— push to 3 or 2 bits and quality really does fall, fast. Smaller models are more sensitive than large ones (each weight carries more). And a quant is only as good as its calibration data — a model quantized on the wrong distribution can quietly regress on yours. Measure quality after quantizing; don’t assume.
§ · FURTHER READINGReferences & deeper sources
- (2023). AWQ: Activation-aware Weight Quantization for LLM Compression · arXiv:2306.00978 (MLSys 2024)
- (2022). GPTQ: Accurate Post-Training Quantization for Generative Transformers · arXiv:2210.17323
- (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale · arXiv:2208.07339
- (2022). FP8 Formats for Deep Learning · arXiv:2209.05433
- (2024). KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization · arXiv:2401.18079
- (2024). llama.cpp / GGUF quantization formats · GitHub
- (2026). Blueprint: Quantize & Run · Brain Drip Blueprints
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.