The Goal

The companion drip, Quantization, lays out the theory: memory is parameters × bytes-per-parameter, INT4 is a 4× cut, and AWQ keeps quality by protecting the salient ~1% of weights. This blueprint turns that into real files and real numbers.

By the end you'll have:

  • The same base model in three forms: FP16 (baseline), a GGUF Q4_K_M, and an AWQ INT4.
  • A benchmark table comparing on-disk size, VRAM, tokens/sec, and a quality proxy across all three — measured, not assumed.
  • The winner served: GGUF via Ollama (local), or AWQ via vLLM (GPU) — with an OpenAI-compatible endpoint either way.

Two paths, one model

The two paths aren't rivals — they're for different targets. GGUF is the local default (Ollama, LM Studio, llama.cpp; CPU or GPU). AWQ is the GPU-serving default (vLLM, TGI). Building both once, on the same model, is how you feel the tradeoff the drip describes.

The model

We'll use a small instruct model so this runs on modest hardware — Qwen/Qwen3-4B-Instruct (swap in any HF model; the commands are identical). At FP16 that's ~8GB of weights; you'll watch it drop to ~2.3GB in Q4_K_M and ~2.5GB in AWQ, with quality mostly intact.

Why measure, not assume

The drip's lab gives rule-of-thumb numbers. Real models vary: some quantize cleanly to INT4, some regress more than you'd expect, and the right quant level depends on your model and your task. The one non-negotiable habit from this build: quantize, then measure quality before you ship. Step 5 is the whole point.

The companion repo

Runnable version: github.com/maraja/quantize-and-run — scripts for each path plus the benchmark harness.

What's coming

Seven steps:

  1. What we're building (you're here)
  2. Setup — llama.cpp, AutoAWQ, pull the base model
  3. Quantize to GGUF — convert + quantize to Q4_K_M
  4. Quantize to AWQ — activation-aware INT4
  5. Benchmark — size, VRAM, tokens/sec, quality across all three
  6. Serve the winner — Ollama for GGUF, vLLM for AWQ
  7. What's next — KV-cache quant, smaller quants, calibration

Reference: Quantization (drip) · llama.cpp · AutoAWQ · vLLM AWQ