The fastest way to a local LLM in 2026: Ollama. It downloads a quantized model, runs it on your laptop, and exposes an OpenAI-compatible HTTP API. Most of these models are also published on the Hugging Face Hub; Ollama pulls a pre-quantized build from its own library, so you don't handle the download or quantization yourself.
Install Ollama
# macOS
brew install ollama
# Linux / WSL
curl -fsSL https://ollama.com/install.sh | sh
Pull and run a model
ollama run qwen3:8b
That's it. The first time you run it, the command downloads the model (~5GB, 4-bit quantized). After that, you drop straight into a chat REPL: type a prompt, get a reply. Press Ctrl+D to exit.
Call it from another app
Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 whenever it's running. Any OpenAI client library works against your local model once you swap the base URL to that address:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "What is a transformer?"}]
  }'
Where to go from here
- Try a different model. ollama run llama3.1:8b, ollama run deepseek-r1:14b, ollama run mistral:7b. Each one fetches automatically the first time.
- List what you have. ollama list shows the models on disk. ollama rm <name> frees the space.
- Free up RAM. ollama ps shows what's loaded; the model auto-unloads after a few minutes of idle.
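Another next step is calling the same chat endpoint from a program instead of curl. Below is a minimal sketch using only the Python standard library, assuming Ollama is running on its default port and that qwen3:8b has already been pulled. The helper names `build_chat_request` and `chat` are made up for this example and are not part of Ollama or any client library.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
BASE_URL = "http://localhost:11434/v1"


def build_chat_request(prompt, model="qwen3:8b", base_url=BASE_URL):
    """Build (without sending) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, model="qwen3:8b", base_url=BASE_URL, timeout=120):
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(prompt, model, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-style responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server up, `chat("What is a transformer?")` should return the model's reply as a string. Building the request separately from sending it keeps the payload easy to inspect when something goes wrong.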
That's the whole loop. One install, one run, and a working HTTP API.