Project + Python
mkdir quantize-and-run && cd quantize-and-run
python -m venv .venv && source .venv/bin/activate
pip install "huggingface_hub[cli]" transformers torch
Get the base model (FP16)
Download once; both quantizers read from this local folder.
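For scripted setups, the same one-time download can be driven from Python. This is a minimal sketch, assuming the `snapshot_download` function from `huggingface_hub`; the helper name `fetch_base_model` and its skip-if-present check are illustrative and not part of the original steps.

```python
from pathlib import Path


def fetch_base_model(repo_id: str, local_dir: str) -> Path:
    """Download a model snapshot unless weights are already on disk."""
    target = Path(local_dir)
    # Skip the network entirely if a previous run already fetched the weights.
    if any(target.glob("*.safetensors")):
        return target
    # Imported lazily so the on-disk check works even without huggingface_hub.
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `fetch_base_model("Qwen/Qwen3-4B-Instruct", "models/qwen3-4b-fp16")` would mirror the CLI form of the download used in this walkthrough: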
hf download Qwen/Qwen3-4B-Instruct --local-dir models/qwen3-4b-fp16
Build llama.cpp (GGUF path)
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B llama.cpp/build -DGGML_CUDA=ON # drop -DGGML_CUDA=ON for CPU-only
cmake --build llama.cpp/build --config Release -j
pip install -r llama.cpp/requirements.txt # for the convert script
That gives you the two things we need: the convert_hf_to_gguf.py script and the llama-quantize binary (in llama.cpp/build/bin).
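The only difference between the GPU and CPU-only builds is the `-DGGML_CUDA=ON` flag on the configure command. A minimal sketch of choosing it automatically follows. The helper names are hypothetical, and treating "`nvcc` is on PATH" as "the CUDA toolkit is installed" is an assumption that will not hold on every machine.

```python
import shutil


def cmake_configure_args(use_cuda: bool) -> list[str]:
    """Build the configure command from this step, with or without CUDA."""
    args = ["cmake", "-S", "llama.cpp", "-B", "llama.cpp/build"]
    if use_cuda:
        args.append("-DGGML_CUDA=ON")
    return args


def cuda_toolkit_on_path() -> bool:
    """Heuristic (assumption): an nvcc on PATH means the CUDA toolkit is installed."""
    return shutil.which("nvcc") is not None


# Print the command rather than running it, so it can be reviewed first.
print(" ".join(cmake_configure_args(cuda_toolkit_on_path())))
```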
Install AutoAWQ (GPU path)
pip install autoawq
AutoAWQ needs a CUDA GPU to quantize. If you don't have one, you can still do the entire GGUF path (Step 3), benchmark it against FP16, and serve it; skip Step 4 and the AWQ rows in Step 5.
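To find out which path applies to your machine, you can ask PyTorch whether it sees a CUDA device. This is a minimal sketch, assuming the `torch` package installed earlier; the function name `cuda_gpu_available` is illustrative.

```python
import importlib.util


def cuda_gpu_available() -> bool:
    """Return True only if PyTorch is installed and can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


if cuda_gpu_available():
    print("CUDA GPU found: the AWQ path is available")
else:
    print("No CUDA GPU: follow the GGUF path only")
```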
Sanity check
# llama.cpp built?
./llama.cpp/build/bin/llama-quantize --help | head -3
# model present?
ls models/qwen3-4b-fp16/*.safetensors | head
# awq importable? (GPU only)
python -c "import awq; print('autoawq ok')"
With the base model on disk and both toolchains ready, you never touch Hugging Face again; the rest of the build is local. Next: the GGUF path, which runs even without a GPU.
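The three sanity checks can also be collected into one script. The sketch below is a lighter variant: it only tests that the files exist and that the `awq` package is installed, without running the binary or importing AutoAWQ. The function name `check_setup` is hypothetical, and the paths are the ones used in this walkthrough.

```python
import importlib.util
from pathlib import Path


def check_setup(root: str = ".") -> dict[str, bool]:
    """Existence checks for the binary, the weights, and the awq package."""
    base = Path(root)
    return {
        "llama-quantize built": (base / "llama.cpp/build/bin/llama-quantize").is_file(),
        "model present": any((base / "models/qwen3-4b-fp16").glob("*.safetensors")),
        # find_spec only looks the package up; it does not import it or touch the GPU.
        "autoawq installed": importlib.util.find_spec("awq") is not None,
    }


for name, ok in check_setup().items():
    print(f"{'ok     ' if ok else 'MISSING'}  {name}")
```

On a CPU-only machine, a MISSING result for autoawq is expected and does not block the GGUF path.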
Reference: llama.cpp build guide · AutoAWQ install · huggingface-cli download