Step 2: Setup — Quantize & Run a Model

Project + Python

mkdir quantize-and-run && cd quantize-and-run
python -m venv .venv && source .venv/bin/activate
pip install "huggingface_hub[cli]" transformers torch

Get the base model (FP16)

Download once; both quantizers read from this local folder.

hf download Qwen/Qwen3-4B-Instruct-2507 --local-dir models/qwen3-4b-fp16

Build llama.cpp (GGUF path)

git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B llama.cpp/build -DGGML_CUDA=ON   # drop -DGGML_CUDA=ON for CPU-only
cmake --build llama.cpp/build --config Release -j
pip install -r llama.cpp/requirements.txt              # for the convert script

That gives you two things we need: the convert_hf_to_gguf.py script and the llama-quantize binary (in llama.cpp/build/bin).

Install AutoAWQ (GPU path)

pip install autoawq

AutoAWQ needs a CUDA GPU to quantize. If you don't have one, you can still do the entire GGUF path (Step 3), benchmark it against FP16, and serve it; just skip Step 4 and the AWQ rows in Step 5.
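
If you are unsure which case applies to your machine, a quick probe can settle it. The snippet below is a minimal sketch: it assumes that a working NVIDIA driver puts nvidia-smi on your PATH, and QUANT_PATHS is a variable name made up here for illustration.

```shell
# Sketch: work out which quantization paths this machine can run.
# Assumption: a usable NVIDIA driver exposes `nvidia-smi`; `nvidia-smi -L`
# lists the detected GPUs and exits non-zero when none can be reached.
if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1; then
  QUANT_PATHS="gguf awq"   # GPU found: both paths are open
else
  QUANT_PATHS="gguf"       # no GPU: GGUF path only
fi
echo "available paths: $QUANT_PATHS"
```

This only checks the driver. Whether PyTorch itself can see the GPU is a separate question, which python -c "import torch; print(torch.cuda.is_available())" answers from inside the virtual environment.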

Sanity check

# llama.cpp built?
./llama.cpp/build/bin/llama-quantize --help | head -3
# model present?
ls models/qwen3-4b-fp16/*.safetensors | head
# awq importable? (GPU only)
python -c "import awq; print('autoawq ok')"
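
The same checks can be folded into one script that reports every failure at once instead of stopping at the first. This is a sketch, not part of either toolchain: the check and has_weights helpers and the QUANTIZE_BIN and MODEL_DIR variables are names invented here, with defaults matching the paths used above. It tests that files exist rather than running the binary, so it is a little weaker than the --help check.

```shell
# Sketch: run every setup check and count the failures.
# QUANTIZE_BIN and MODEL_DIR are illustrative names; override them if your layout differs.
QUANTIZE_BIN="${QUANTIZE_BIN:-llama.cpp/build/bin/llama-quantize}"
MODEL_DIR="${MODEL_DIR:-models/qwen3-4b-fp16}"

failures=0

# check LABEL COMMAND...: run COMMAND quietly and print ok or FAIL for LABEL.
check() {
  label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "ok    $label"
  else
    echo "FAIL  $label"
    failures=$((failures + 1))
  fi
}

# Succeeds only if at least one weight shard is in MODEL_DIR.
has_weights() { ls "$MODEL_DIR"/*.safetensors; }

check "llama-quantize built"           test -x "$QUANTIZE_BIN"
check "convert script present"         test -f llama.cpp/convert_hf_to_gguf.py
check "model weights present"          has_weights
check "autoawq importable (GPU only)"  python -c "import awq"

echo "$failures check(s) failed"
```

On a CPU-only machine, expect the last line to fail; that is fine as long as the first three pass.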

With the base model on disk and both toolchains ready, the rest of the build works from local files, with no further model downloads from Hugging Face. Next: the GGUF path, which runs even without a GPU.

Reference: llama.cpp build guide · AutoAWQ install · huggingface-cli download