What you built

  • The same model in three forms — FP16, GGUF Q4_K_M, AWQ INT4 — from one download.
  • A benchmark that measures the tradeoff on your hardware: roughly 3× less memory, faster decode, and a small quality cost.
  • The winner served behind an OpenAI-compatible endpoint — Ollama locally, or vLLM on a GPU.
  • The discipline that matters most: measure quality after quantizing, then decide.
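
To see where a figure like "~3× less memory" comes from, here is a minimal sketch of the weight-size arithmetic. The 8B parameter count and the effective bits-per-weight values are ballpark assumptions for illustration, not measurements from this tutorial; real files also differ because some layers are often left unquantized.

```python
# Rough weight-memory estimate for one model in three formats.
# Every number below is an illustrative assumption, not a measurement.

PARAMS = 8e9  # hypothetical 8B-parameter model

# Approximate effective bits per weight, including quantization metadata
# such as scales and zero points. Treat these as ballpark values.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "GGUF Q4_K_M": 4.85,
    "AWQ INT4 (group size 128)": 4.25,
}


def weight_gib(params: float, bits: float) -> float:
    """Return weight storage in GiB for `params` weights at `bits` bits each."""
    return params * bits / 8 / 2**30


if __name__ == "__main__":
    base = weight_gib(PARAMS, BITS_PER_WEIGHT["FP16"])
    for name, bits in BITS_PER_WEIGHT.items():
        gib = weight_gib(PARAMS, bits)
        print(f"{name:28s} {gib:6.2f} GiB  ({base / gib:.1f}x smaller than FP16)")
```

The ratio lands between 3× and 4× rather than at a clean 4× because 4-bit formats carry per-group scales on top of the 4-bit values. Your own benchmark numbers remain the ones to trust.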

Push it further

  • Quantize the KV cache. At long contexts the KV cache can dwarf the weights. vLLM supports FP8 KV cache (--kv-cache-dtype fp8); llama.cpp has --cache-type-k/--cache-type-v. This is what unlocks very long contexts on modest VRAM.
  • Try smaller quants, carefully. Q3_K_M or 3-bit AWQ (where your tooling supports it) squeeze onto tiny hardware, but this is where quality really starts to fall. Re-run Step 5; if perplexity spikes, back off. Sub-3-bit is rarely worth it for models this size.
  • Calibrate AWQ on your domain. The default calibration text is generic. If your traffic is code, or legal, or another language, calibrate on a sample of that. AWQ finds different salient weights for different distributions, so matching the calibration data to your workload can measurably help.
  • Batch for throughput. If you're serving many users, the AWQ + vLLM path scales far past single-stream numbers thanks to continuous batching — benchmark under concurrency, not just one request at a time.
  • Deploy it. The AWQ + vLLM combo drops straight into a container; pair it with the Deploy an Open Model on Cloud Run pattern for a serverless GPU endpoint.
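
The KV cache point above is easy to check with arithmetic. The sketch below estimates cache size from model shape and context length. The default shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical 8B-class configuration with grouped-query attention, not the model from this tutorial; substitute the values from your model's config.

```python
# Back-of-envelope KV cache sizing.
# The default model shape is a hypothetical 8B-class configuration,
# assumed here for illustration only.

def kv_cache_gib(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: float = 2.0,
                 batch: int = 1) -> float:
    """KV cache size in GiB: keys and values for every layer, head, and token."""
    # The leading 2 counts one key tensor and one value tensor per layer.
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch
    return elems * bytes_per_elem / 2**30


if __name__ == "__main__":
    for seq_len in (8_192, 32_768, 131_072):
        fp16 = kv_cache_gib(seq_len, bytes_per_elem=2.0)  # 16-bit cache
        fp8 = kv_cache_gib(seq_len, bytes_per_elem=1.0)   # 8-bit cache
        print(f"{seq_len:>7} tokens: FP16 KV {fp16:5.1f} GiB, FP8 KV {fp8:5.1f} GiB")
```

Under these assumptions a single 128K-token sequence needs 16 GiB of FP16 cache, several times the roughly 4 GiB that 4-bit weights for an 8B model occupy, and an 8-bit cache halves that. The cache also scales linearly with batch size, so concurrency multiplies the effect.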

The one thing to remember

Quantization is a memory/quality tradeoff you can measure, not a free lunch. FP8 and INT8 are nearly free; INT4 AWQ, at roughly a quarter of the FP16 weight size, is the sweet spot for most deployments; below that, quality starts to cost real points, and smaller models feel it first. So the habit that matters isn't picking the smallest format. It's quantizing, measuring, and only then shipping. Do that, and you'll fit models where they didn't fit and run them faster, without shipping a quietly worse model to your users.
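
One way to turn "quantize, measure, then ship" into a habit is a small gate in your release script. This is a minimal sketch, not part of the tutorial's code: `QuantResult`, `pick_variant`, and the 5% perplexity budget are all hypothetical names and defaults chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class QuantResult:
    name: str
    perplexity: float  # measured on your eval set; lower is better
    weight_gib: float  # size of the quantized weights


def pick_variant(baseline_ppl: float, candidates: list[QuantResult],
                 max_ppl_increase: float = 0.05) -> QuantResult | None:
    """Return the smallest candidate whose perplexity stays within budget.

    `max_ppl_increase` is a relative budget (0.05 means 5% above baseline).
    The default is an assumption; set it from your own quality requirements.
    Returns None when no candidate passes, meaning: do not ship a quant.
    """
    limit = baseline_ppl * (1 + max_ppl_increase)
    passing = [c for c in candidates if c.perplexity <= limit]
    return min(passing, key=lambda c: c.weight_gib, default=None)


if __name__ == "__main__":
    # Placeholder numbers only, NOT measurements. Use your Step 5 results.
    results = [
        QuantResult("awq-int4", perplexity=10.3, weight_gib=5.0),
        QuantResult("q3_k_m", perplexity=11.5, weight_gib=3.5),
    ]
    choice = pick_variant(baseline_ppl=10.0, candidates=results)
    print(choice.name if choice else "no quantized variant passed; keep FP16")
```

The gate picks on size only after quality has passed, which mirrors the order of operations above. Perplexity is one signal; a task-specific eval can be swapped in for the same comparison.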


References: Quantization (drip) · KV Cache (drip) · vLLM FP8 KV cache · Deploy Gemma on Cloud Run (blueprint)