Step 4: Quantize to AWQ — Quantize & Run a Model

Why AWQ needs calibration data

Unlike the plain rounding used for GGUF, AWQ has to find the salient weights (the ones multiplied by the largest activations), and it does that by running a small sample of real text through the model and watching the activations. So the one input beyond the model is a small set of calibration samples. A few hundred lines of general text is plenty; use text close to your domain if you have it.
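If you do have domain text, one way to turn it into calibration samples is sketched below. The helper name `load_calibration_texts`, the file path, and the `max_samples` / `min_chars` thresholds are all hypothetical. The commented-out call assumes that `quantize` accepts a list of strings through a `calib_data` argument, which is worth checking against your installed AutoAWQ version.

```python
from pathlib import Path


def load_calibration_texts(path, max_samples=256, min_chars=32):
    """Read one calibration sample per non-trivial line of a UTF-8 text file."""
    samples = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        text = line.strip()
        if len(text) >= min_chars:      # skip blank lines and short fragments
            samples.append(text)
        if len(samples) == max_samples:
            break
    return samples


# Hypothetical usage inside quantize_awq.py (the path is an assumption):
# calib = load_calibration_texts("data/domain_sample.txt")
# model.quantize(tokenizer, quant_config=quant_config, calib_data=calib)
```

Very short lines carry little activation signal, so the sketch drops them rather than padding them out.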

The quantize script

# quantize_awq.py
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
 
SRC = "models/qwen3-4b-fp16"
OUT = "models/qwen3-4b-awq"
 
quant_config = {
    "zero_point": True,
    "q_group_size": 128,   # weights share a scale in groups of 128
    "w_bit": 4,            # INT4
    "version": "GEMM",     # fast GPU kernel
}
 
model = AutoAWQForCausalLM.from_pretrained(SRC, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(SRC)
 
# calibrates on a default general-text sample, finds + protects salient weights
model.quantize(tokenizer, quant_config=quant_config)
 
model.save_quantized(OUT)
tokenizer.save_pretrained(OUT)
print("saved AWQ INT4 →", OUT)

$ python quantize_awq.py
Quantizing... (calibrating on samples)
saved AWQ INT4 → models/qwen3-4b-awq
$ du -sh models/qwen3-4b-awq
2.5G  models/qwen3-4b-awq

Same ~3× shrink as GGUF, but this output is a directory built to be read by GPU serving runtimes (vLLM, TGI) at high throughput, not by llama.cpp.
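To see where a roughly 3× shrink comes from (rather than the 4× that 16 bits to 4 bits might suggest), here is a back-of-the-envelope estimate. It assumes each group of weights stores one 16-bit scale and one 4-bit zero point, and that some layers stay in FP16. The split between quantized and FP16 parameters below is illustrative, not measured from this model, and `awq_size_gb` is a hypothetical helper.

```python
def awq_size_gb(n_quantized, n_fp16, w_bit=4, group_size=128,
                scale_bits=16, zero_bits=4):
    """Estimate checkpoint size in GB (10^9 bytes) for group-wise quantization."""
    # each group of `group_size` weights stores one scale and one zero point
    bits_per_weight = w_bit + (scale_bits + zero_bits) / group_size
    quantized_bytes = n_quantized * bits_per_weight / 8
    fp16_bytes = n_fp16 * 2          # parameters left unquantized, 2 bytes each
    return (quantized_bytes + fp16_bytes) / 1e9


# Illustrative split for a ~4B-parameter model (assumed, not measured):
# 3.6B weights in quantized linear layers, 0.4B parameters left in FP16.
fp16_gb = awq_size_gb(0, 4.0e9)
awq_gb = awq_size_gb(3.6e9, 0.4e9)
print(f"FP16 {fp16_gb:.1f} GB, AWQ {awq_gb:.2f} GB, ratio {fp16_gb / awq_gb:.1f}x")
```

The per-group overhead is small (about 0.16 extra bits per weight at group size 128); under these assumptions most of the gap to 4× comes from the parameters that stay in FP16.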

What just happened

Under the hood AutoAWQ did exactly what the drip's figure shows: it identified the ~1% of weights that dominate the output, scaled them up so the INT4 grid lands kindly on them, and then quantized all the weights to 4 bits, the salient ones included. The q_group_size=128 means weights share a quantization scale (and zero point) in groups of 128. Smaller groups give better quality at the cost of a slightly larger file, because more scales have to be stored. 128 is the standard.
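The effect of group size can be seen with a small pure-Python sketch of group-wise zero-point quantization. This is not AutoAWQ's code: `quantize_group` and `mean_sq_error` are hypothetical helpers, the weights are synthetic Gaussian values rather than a real layer, and the activation-aware scaling step is left out so that only the grouping is shown.

```python
import random


def quantize_group(weights, w_bit=4):
    """Zero-point quantization of one group; returns the dequantized values."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** w_bit - 1                      # 15 steps for INT4
    scale = (hi - lo) / levels or 1.0            # avoid divide-by-zero on constant groups
    zero = min(max(round(-lo / scale), 0), levels)   # integer zero point on the grid
    out = []
    for w in weights:
        q = min(max(round(w / scale) + zero, 0), levels)   # clamp to the INT4 grid
        out.append((q - zero) * scale)
    return out


def mean_sq_error(weights, group_size):
    """Mean squared round-trip error when each group shares one scale."""
    total = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        deq = quantize_group(group)
        total += sum((w - d) ** 2 for w, d in zip(group, deq))
    return total / len(weights)


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(4096)]   # synthetic weights, not a real layer
for g in (32, 128, 512):
    print(g, mean_sq_error(row, g))
```

A smaller group spans a narrower min-to-max range, so its 16 levels sit closer together and the rounding error drops. The price is one extra scale and zero point per group.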

A note on the tradeoff

AWQ quantization itself takes a few minutes and a GPU (it's doing calibration forward-passes). That's a one-time cost — you quantize once and serve forever. Don't confuse the one-time quantization cost with inference cost; the whole point is that inference afterward is cheaper and faster than FP16.

Now you have the same model three ways — FP16, GGUF Q4_K_M, AWQ INT4. Time to make them prove themselves.

Reference: AutoAWQ examples · AWQ paper · Quantization — §02 Not all weights are equal (drip)