Quantization: Shrink the Weights

Same model, a quarter of the memory — if you protect the 1% of weights that actually matter. Quantization is what puts a 70B model on one card and a 4B model on your laptop, and in 2026 the defaults have settled.

Brain Drip Editors · Quantization · FP16 → INT4 · July 2026

Quantization. Storing a model's weights (and sometimes activations) in fewer bits than the FP16 they were trained in (8-bit or 4-bit integers, or 8-bit floats) to cut memory and speed up inference.

The bottom line. A model’s weight memory is just parameters × bytes-per-parameter, so the fastest way to fit a bigger model on your GPU is to use fewer bytes per weight. FP16 is two bytes; INT4 is half a byte, a 4× cut. The catch is quality: round every weight the same way (naive INT4) and it degrades. The fix that won 2026 is AWQ: find the ~1% of salient weights, protect them from rounding error by rescaling them, and quantize everything to INT4, for near-FP16 quality at INT4 size. FP8 is the datacenter serving default; GGUF is the local-inference default; and you can quantize the KV cache too. The lab lets you find the biggest model that fits your card.

§ 00 · WHY FP16 IS WASTEFUL
Memory is parameters times bytes

A language model is a pile of numbers. Serving it means holding those numbers in GPU memory, and the arithmetic is blunt: VRAM ≈ parameters × bytes-per-parameter. Train in FP16 — two bytes each — and a 7B model needs ~14GB of weights, a 70B model ~140GB. The 70B doesn’t fit on any single consumer card, or even most datacenter ones, before you’ve served a token.
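The arithmetic above fits in a few lines. This is a minimal sketch that counts weights only: it ignores the KV cache, activations, runtime overhead, and the small per-group scale factors that real INT4 formats store. The function name and the format table are illustrative, not taken from any library.

```python
# Bytes per parameter for each format discussed in this article.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(params_billions: float, fmt: str) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes).

    params_billions * 1e9 params * bytes-per-param / 1e9 bytes-per-GB
    simplifies to params_billions * bytes-per-param.
    """
    return params_billions * BYTES_PER_PARAM[fmt]


if __name__ == "__main__":
    for size in (7, 70):
        row = {fmt: weight_memory_gb(size, fmt) for fmt in BYTES_PER_PARAM}
        print(f"{size}B:", row)
```

Treat the result as a floor: a model whose weights exactly fill the card still has no room left to serve a request.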

But those weights were never that precise to begin with. A trained weight is a smallish number, and the difference between storing it in 16 bits and storing it in 4 is mostly storing noise. Quantization is the bet that you can throw away most of those bits and barely move the model’s outputs, if you do it carefully. Get it right and the same GPU holds a model 2–4× larger, and generates faster too, because decode is bound by how fast you can read the weights.

Memory-bandwidth bound. For single-stream generation, the bottleneck is reading the weights from GPU memory for each token, not doing the math, so halving the bytes per weight roughly doubles decode speed.
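To see why fewer bytes means faster decode, here is a rough upper-bound estimate, assuming every generated token requires reading all the weights once and nothing else limits throughput. The 1000 GB/s bandwidth figure is a made-up round number for illustration, not the spec of any particular card.

```python
def decode_tokens_per_second(params_billions: float,
                             bytes_per_param: float,
                             bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling on single-stream decode speed.

    Assumes each token reads every weight exactly once from GPU memory,
    so tokens/s is at most (memory bandwidth) / (bytes of weights).
    """
    weight_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    BANDWIDTH_GB_S = 1000.0  # hypothetical card, illustration only
    for name, nbytes in (("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)):
        tps = decode_tokens_per_second(7, nbytes, BANDWIDTH_GB_S)
        print(f"7B at {name}: at most ~{tps:.0f} tokens/s")
```

Real kernels land below this ceiling because dequantization costs some compute, but the ratio between formats is the part worth remembering.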

§ 01 · THE NUMBER-FORMAT LADDER
FP16 → FP8 → INT8 → INT4

Moving down the ladder cuts the bytes per weight: FP8 and INT8 each use one byte (half of FP16), and INT4 uses half a byte (a quarter). FP16 and FP8 keep a floating-point shape (an exponent, so they span a wide range); INT8 and INT4 map weights onto a small grid of integers with a scale factor. Fewer bits means less memory and faster reads, and more rounding error. In the lab, pick a model size, pick a format, and watch whether it fits your card and what quality survives.
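The "grid of integers with a scale factor" is easier to see in code. The sketch below uses symmetric absmax quantization, one common scheme among several; real INT4 formats typically work on small groups of weights and may add a zero point. The weight values are made up.

```python
def quantize_absmax(weights, bits):
    """Map floats onto a symmetric integer grid; return (ints, scale)."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights) or 1.0   # avoid dividing by zero
    scale = max_abs / qmax
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer grid."""
    return [v * scale for v in q]


def max_rounding_error(weights, bits):
    """Largest absolute difference between a weight and its round trip."""
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))


if __name__ == "__main__":
    weights = [0.12, -0.53, 0.07, 0.91, -0.28]  # illustrative values
    for bits in (8, 4):
        q, _ = quantize_absmax(weights, bits)
        print(f"INT{bits}: grid={q} max error={max_rounding_error(weights, bits):.4f}")
```

With only 15 usable levels at INT4, two weights that differ noticeably (0.12 and 0.07 here) can land on the same grid point, which is the rounding error the rest of the article is about.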

Lab · shrink the weights
Pick a size and a format, and find the biggest model that fits your card without wrecking quality.

[Interactive lab. Default state shown: model size 8B params at FP16 on a 24GB GPU; VRAM needed 18.3GB (fits); decode speed 1.0× vs FP16 (bandwidth-bound); quality retained 100.0% vs FP16 baseline.]

FP16 is the reference: full quality, but two bytes per parameter. A 70B model needs ~140GB just for weights, which means multiple GPUs before you've served a single token. VRAM, speed, and quality are rule-of-thumb estimates.

The pattern you’ll feel: FP8 and INT8 are nearly free — half the memory, quality basically intact. The 4× prize is INT4, but naive INT4 leaves quality on the table, especially on smaller models. That gap is what AWQ closes.
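The lab's core question, how big a model fits at each format, can be approximated offline. This sketch reserves a fraction of the card for the KV cache and activations; the 20% headroom is an assumption chosen for illustration, and the function name is hypothetical.

```python
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}


def largest_model_that_fits(vram_gb: float, fmt: str,
                            headroom: float = 0.2) -> float:
    """Largest parameter count (in billions) whose weights fit on the card.

    `headroom` is the fraction of VRAM kept free for everything that is
    not weights. The default of 0.2 is a guess, not a measured value.
    """
    usable_gb = vram_gb * (1.0 - headroom)
    return usable_gb / BYTES_PER_PARAM[fmt]


if __name__ == "__main__":
    for fmt in BYTES_PER_PARAM:
        size = largest_model_that_fits(24, fmt)
        print(f"24GB card, {fmt}: up to ~{size:.1f}B params")
```

It answers only the "fits" half of the question. Whether quality survives at that format still has to be measured.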

§ 02 · NOT ALL WEIGHTS ARE EQUAL
AWQ protects the salient 1%

Naive INT4 treats every weight the same. But a small fraction of weights — the ones multiplied by the largest activations — matter far more to the output than the rest. Round those carelessly and quality falls off a cliff; round the other 99% and almost nothing happens.

Fig 1. AWQ (activation-aware weight quantization) protects the ~1% of salient weights by rescaling them before rounding, then quantizes to INT4: near-FP16 quality at a quarter of the memory.

AWQ (activation-aware weight quantization) finds those salient weights by looking at activation magnitudes, and protects them by scaling them so the rounding grid lands kindly, rather than by storing them at full width. The result: INT4 memory, near-FP16 quality. It won a best paper award at MLSys 2024 and, by 2026, became the default recipe for 4-bit deployment. The broader lesson generalizes: GPTQ and LLM.int8() make the same move in different ways. Spend your precision budget where it matters, not uniformly.
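A toy example shows why scaling helps. Multiply a salient weight by s before rounding and divide its activation by s afterward: the product is unchanged in exact arithmetic, but the weight's rounding error is shrunk by s as long as the scaling does not change the largest weight in the group. This is a sketch of the idea only, with hand-picked numbers and a hand-picked scale; the actual method chooses per-channel scales from activation statistics on calibration data.

```python
def fake_quant(weights, bits=4):
    """Round to a symmetric integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]


def dot(ws, xs):
    return sum(w * x for w, x in zip(ws, xs))


# One output neuron: four weights and the activations they multiply.
weights = [0.30, -0.80, 0.17, 0.55]   # illustrative values
acts = [20.0, 1.0, 0.5, 1.0]          # channel 0 sees the largest activation

exact = dot(weights, acts)

# Naive INT4: every weight is rounded the same way.
naive_err = abs(dot(fake_quant(weights), acts) - exact)

# Scaled: boost the salient weight, shrink its activation to compensate.
s = [2.0, 1.0, 1.0, 1.0]              # hand-picked, not searched
scaled_w = [w * si for w, si in zip(weights, s)]
scaled_x = [x / si for x, si in zip(acts, s)]
scaled_err = abs(dot(fake_quant(scaled_w), scaled_x) - exact)

if __name__ == "__main__":
    print(f"naive INT4 output error:  {naive_err:.3f}")
    print(f"scaled INT4 output error: {scaled_err:.3f}")
```

Every weight is still stored in 4 bits in the scaled version, which is the point: the protection comes from where the grid lands, not from extra precision.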

§ 03 · THE FORMATS
GPTQ · AWQ · GGUF · FP8: which and when

The zoo of names is really two questions: what math (how the weights get rounded) and what container (the file format + the runtime that reads it). Four you’ll actually meet:

AWQ — activation-aware INT4. The default when you want 4-bit on a GPU with quality intact. Read by vLLM, TGI, and friends.
GPTQ — the other well-established INT4 method (error-compensating, layer-by-layer). Still common; AWQ usually edges it on quality-at-speed in 2026.
GGUF — not a math, a container: the llama.cpp file format with a whole menu of quant levels (Q4_K_M, Q5_K_M, Q8_0…). The default for local inference — Ollama, LM Studio, llama.cpp all speak it, CPU or GPU.
FP8 — 8-bit float. On Hopper/Blackwell GPUs it’s the datacenter serving default: half the memory of FP16, quality essentially intact, and native hardware support so it’s genuinely fast.

Rule of thumb for 2026: serving on datacenter GPUs → FP8; 4-bit on a GPU → AWQ; running locally → a GGUF Q4_K_M or Q5_K_M. The blueprint builds the GGUF and AWQ paths side by side so you can measure the tradeoff yourself.

§ 04 · QUANTIZE THE KV CACHE TOO
Weights aren’t the only thing in memory

Quantizing weights shrinks the model. But at long context lengths, the KV cache — the running memory of every token attended to so far — can rival or exceed the weights. The same idea applies: store the cache in INT8 or INT4 instead of FP16. KVQuant and similar methods push KV-cache quantization far enough to fit million-token contexts that FP16 caches couldn’t hold.
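The cache size follows from the model's shape: one key vector and one value vector per layer, per KV head, per token. A minimal sketch, assuming a hypothetical configuration of 32 layers, 8 KV heads, and a head dimension of 128 (illustrative numbers in the range of an 8B-class model, not a specific checkpoint):

```python
def kv_cache_gb(seq_len: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_value: float) -> float:
    """KV cache size in GB (1 GB = 1e9 bytes) for a single sequence.

    The leading 2 counts the key and the value stored for every token.
    """
    total_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value
    return total_bytes / 1e9


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    shape = dict(layers=32, kv_heads=8, head_dim=128)
    for name, nbytes in (("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)):
        size = kv_cache_gb(seq_len=131072, bytes_per_value=nbytes, **shape)
        print(f"128k-token context, {name} cache: ~{size:.1f} GB")
```

Under these assumptions the FP16 cache for one 128k-token sequence is already larger than the 16GB of FP16 weights for an 8B model, and the cache grows linearly with context length and with the number of concurrent sequences.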

[Diagram: the practical pipeline. An FP16 checkpoint is quantized down one of two paths, then served by the matching runtime.]

The limits worth respecting: quantization moves you along a memory/quality tradeoff; it doesn’t escape it. Push to 3 or 2 bits and quality really does fall, fast. Smaller models are more sensitive than large ones (each weight carries more). And a quant is only as good as its calibration data: a model quantized on the wrong distribution can quietly regress on yours. Measure quality after quantizing; don’t assume.

CHECK · You need to run a 70B model on a single 48GB GPU for local experimentation. Which is the most sensible first choice?

§ · FURTHER READING
References & deeper sources

Ji Lin et al. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration · arXiv:2306.00978 (MLSys 2024)
Elias Frantar et al. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers · arXiv:2210.17323
Tim Dettmers et al. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale · arXiv:2208.07339
Paulius Micikevicius et al. (2022). FP8 Formats for Deep Learning · arXiv:2209.05433
Coleman Hooper et al. (2024). KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization · arXiv:2401.18079
Georgi Gerganov & contributors (2024). llama.cpp / GGUF quantization formats · GitHub
Brain Drip Editors (2026). Blueprint: Quantize & Run · Brain Drip Blueprints

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.