Step 7: What's Next — Quantize & Run a Model

What you built

The same model in three forms — FP16, GGUF Q4_K_M, AWQ INT4 — from one download.
A benchmark that measures the tradeoff on your hardware: roughly 3× less memory than FP16, faster decode, and a small quality cost.
The winner served behind an OpenAI-compatible endpoint — Ollama locally, or vLLM on a GPU.
The discipline that matters most: measure quality after quantizing, then decide.

Push it further

Quantize the KV cache. At long contexts the KV cache can dwarf the weights. vLLM supports an FP8 KV cache (--kv-cache-dtype fp8); llama.cpp exposes --cache-type-k and --cache-type-v, which take a quantization type such as q8_0. This is what unlocks very long contexts on modest VRAM.
Try smaller quants — carefully. Q3_K_M or 3-bit AWQ squeeze onto tiny hardware, but this is where quality really starts to fall. Re-run Step 5; if perplexity spikes, back off. Sub-3-bit is rarely worth it for models this size.
Calibrate AWQ on your domain. The default calibration text is generic. If your traffic is code, legal text, or another language, calibrate on a sample of that traffic instead. AWQ identifies different salient weights for different input distributions, so matching the calibration data to your use case can measurably help.
Batch for throughput. If you're serving many users, the AWQ + vLLM path scales far past single-stream numbers thanks to continuous batching — benchmark under concurrency, not just one request at a time.
Deploy it. The AWQ + vLLM combo drops straight into a container; pair it with the Deploy an Open Model on Cloud Run pattern for a serverless GPU endpoint.
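
To make the KV cache point above concrete, here is a minimal sizing sketch. It uses the standard approximation of one K and one V tensor per layer. The model shape (32 layers, 8 KV heads, head dimension 128) and the 32,768-token context are hypothetical values chosen for illustration, not taken from this tutorial's model. It also ignores the small overhead of quantization scales and allocator padding, so treat the result as a lower bound.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int, batch_size: int = 1) -> int:
    """Approximate KV cache size: one K and one V tensor per layer."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM, CONTEXT = 32, 8, 128, 32_768
GIB = 2**30

fp16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, bytes_per_elem=2)
fp8 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, bytes_per_elem=1)

print(f"FP16 KV cache: {fp16 / GIB:.1f} GiB")  # FP16 KV cache: 4.0 GiB
print(f"FP8 KV cache:  {fp8 / GIB:.1f} GiB")   # FP8 KV cache:  2.0 GiB
```

Plug in your own model's layer count, KV head count, and head dimension (usually found in its config file) to see how much context a given VRAM budget allows. The cost scales linearly with both sequence length and batch size.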

The one thing to remember

Quantization is a memory/quality tradeoff you can measure, not a free lunch. FP8 and INT8 cost almost nothing in quality; INT4-AWQ, with weights nominally 4× smaller than FP16, is the sweet spot for most deployments; below that, quality starts to cost real points, and smaller models feel it first. So the habit that matters isn't picking the smallest format. It's quantizing, measuring, and only then shipping. Do that, and you'll fit models where they didn't fit and run them faster, without shipping a quietly worse model to your users.
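
One way to turn "quantize, measure, then ship" into a repeatable check is a small gate: keep only the formats whose perplexity stays within a tolerance of the FP16 baseline, then pick the smallest of those. The sketch below is illustrative only. The `Candidate` class, the `pick_format` helper, the 5% tolerance, and every number in the example are assumptions, not results from this tutorial; substitute your own measurements from Step 5.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    memory_gb: float    # measured memory footprint
    perplexity: float   # measured on your evaluation text


def pick_format(candidates: Sequence[Candidate],
                baseline_perplexity: float,
                max_relative_increase: float = 0.05) -> Optional[Candidate]:
    """Return the smallest candidate whose perplexity stays within tolerance.

    Returns None when no candidate passes the quality gate.
    """
    limit = baseline_perplexity * (1.0 + max_relative_increase)
    passing = [c for c in candidates if c.perplexity <= limit]
    return min(passing, key=lambda c: c.memory_gb) if passing else None


# Made-up measurements, for illustration only.
measured = [
    Candidate("fp16", memory_gb=16.0, perplexity=6.00),
    Candidate("int4", memory_gb=5.0, perplexity=6.15),
    Candidate("int3", memory_gb=4.0, perplexity=6.90),
]

choice = pick_format(measured, baseline_perplexity=6.00)
print(choice.name if choice else "nothing passed")  # int4
```

With these made-up numbers the 3-bit candidate is smaller but fails the gate, so the 4-bit one wins. Perplexity is only one signal; if you have task-level evaluations, the same gate works with any metric where lower is better.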

Reference: Quantization (drip) · KV Cache (drip) · vLLM FP8 KV cache · Deploy Gemma on Cloud Run (blueprint)