What you built
- The same model in three forms — FP16, GGUF Q4_K_M, AWQ INT4 — from one download.
- A benchmark proving the tradeoff on your hardware: ~3× less memory, faster decode, small quality cost.
- The winner served behind an OpenAI-compatible endpoint — Ollama locally, or vLLM on a GPU.
- The discipline that matters most: measure quality after quantizing, then decide.
Push it further
- Quantize the KV cache. At long contexts the KV cache can dwarf the weights. vLLM supports an FP8 KV cache (--kv-cache-dtype fp8); llama.cpp has --cache-type-k / --cache-type-v. This is what unlocks very long contexts on modest VRAM.
- Try smaller quants, carefully. Q3_K_M or 3-bit AWQ squeeze onto tiny hardware, but this is where quality really starts to fall. Re-run Step 5; if perplexity spikes, back off. Sub-3-bit is rarely worth it for models this size.
- Calibrate AWQ on your domain. The default calibration text is generic. If your traffic is code, legal text, or another language, calibrate on a sample of it: AWQ finds different salient weights for different distributions, so matching the calibration set to your workload measurably helps.
- Batch for throughput. If you're serving many users, the AWQ + vLLM path scales far past single-stream numbers thanks to continuous batching — benchmark under concurrency, not just one request at a time.
- Deploy it. The AWQ + vLLM combo drops straight into a container; pair it with the Deploy an Open Model on Cloud Run pattern for a serverless GPU endpoint.
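To see why the KV cache bullet above matters, here is a rough back-of-the-envelope sketch of cache size versus context length. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical example, not the model from this tutorial; substitute the values from your model's config. The estimate also ignores any per-block scale metadata a quantized cache may carry.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem, batch_size=1):
    """Approximate KV cache size: one K and one V vector per layer, KV head, and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * seq_len * batch_size


# Hypothetical model shape (an assumption for illustration only).
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)
GIB = 1024 ** 3

for ctx in (8_192, 32_768, 131_072):
    fp16 = kv_cache_bytes(seq_len=ctx, bytes_per_elem=2, **SHAPE)  # 16-bit cache
    fp8 = kv_cache_bytes(seq_len=ctx, bytes_per_elem=1, **SHAPE)   # 8-bit cache
    print(f"{ctx:>7} tokens: FP16 {fp16 / GIB:.1f} GiB, FP8 {fp8 / GIB:.1f} GiB")
```

For this assumed shape, a single 131,072-token sequence needs 16 GiB of cache in FP16 and 8 GiB in FP8, which can exceed the footprint of 4-bit weights for a model of this class. Multiply by the batch size when serving concurrent requests.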
The one thing to remember
Quantization is a memory/quality tradeoff you can measure, not a free lunch. FP8 and INT8 are nearly free; INT4-AWQ is the 4× sweet spot for most deployments; below that, quality starts to cost real points — and smaller models feel it first. So the habit that matters isn't picking the smallest format — it's quantizing, measuring, and only then shipping. Do that, and you'll fit models where they didn't fit and run them faster, without shipping a quietly-worse model to your users.
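One way to turn "quantize, measure, then ship" into a habit is a small gate that compares perplexity before and after quantizing. The sketch below is illustrative only: the function names and the 5% tolerance are assumptions, not part of this tutorial's benchmark, and the right threshold depends on your task.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def quality_gate(baseline_ppl, quantized_ppl, max_rel_increase=0.05):
    """Return (ok, relative_increase). The 5% default is an assumed tolerance."""
    rel = (quantized_ppl - baseline_ppl) / baseline_ppl
    return rel <= max_rel_increase, rel


# Placeholder values for illustration, not measurements from any model.
ok, rel = quality_gate(baseline_ppl=10.0, quantized_ppl=10.3)
print(f"perplexity change: {rel:+.1%} -> {'ship' if ok else 'back off'}")
```

Feed it the perplexities from your own evaluation run on the same text for both the baseline and the quantized model; a gate like this is only as good as the evaluation set behind it.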
Reference: Quantization (drip) · KV Cache (drip) · vLLM FP8 KV cache · Deploy Gemma on Cloud Run (blueprint)