The fastest way to a local LLM in 2026: Ollama. It downloads a pre-quantized model, runs it on your laptop, and exposes an OpenAI-compatible HTTP API. By default the models come from Ollama's own model library at ollama.com/library, already quantized, so there is no conversion step on your machine.

Install Ollama

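# macOS (the Homebrew formula installs the CLI; start the server with `ollama serve` or `brew services start ollama`)
brew install ollama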
 
# Linux / WSL
curl -fsSL https://ollama.com/install.sh | sh

Pull and run a model

ollama run qwen3:8b

That's it. The first time you run it, the command downloads the model (~5GB, 4-bit quantized). After that, you drop straight into a chat REPL — type a prompt, get a reply. Press Ctrl+D to exit.
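Where does the ~5GB figure come from? A rough sketch of the arithmetic, assuming about 8 billion parameters and a nominal 4 bits per weight. The function name and the numbers are illustrative, not taken from Ollama:

```python
def approx_model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed: ~8e9 parameters at a nominal 4 bits each.
print(approx_model_size_gb(8e9, 4))   # 4.0
# The same parameter count at 16-bit precision, for comparison.
print(approx_model_size_gb(8e9, 16))  # 16.0
```

Real 4-bit formats typically keep some tensors at higher precision and store scaling metadata alongside the weights, which is why the download lands nearer 5GB than 4GB.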

Call it from another app

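Ollama serves an OpenAI-compatible API under http://localhost:11434/v1 whenever it's running. Most OpenAI client libraries work against your local model once you swap the base URL to that address; clients that insist on an API key accept any placeholder, since Ollama ignores it: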

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "What is a transformer?"}]
  }'
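The same call from application code might look like the sketch below. It uses only the Python standard library and assumes the default address and the qwen3:8b model from above; build_chat_request, extract_reply, and ask are hypothetical helper names, not part of any Ollama or OpenAI package:

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama's default port, per the text above


def build_chat_request(prompt: str, model: str = "qwen3:8b") -> urllib.request.Request:
    """Build the same POST the curl command sends (hypothetical helper)."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion."""
    return response_body["choices"][0]["message"]["content"]


def ask(prompt: str, model: str = "qwen3:8b") -> str:
    """Send the request; needs the Ollama server running locally."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as resp:
        return extract_reply(json.load(resp))
```

With the server running, print(ask("What is a transformer?")) should print the model's reply. The official openai Python package follows the same pattern: point its base_url at http://localhost:11434/v1 and pass any placeholder string as the API key.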

Where to go from here

  • Try a different model. ollama run llama3.1:8b, ollama run deepseek-r1:14b, ollama run mistral:7b. Each one fetches automatically the first time.
  • List what you have. ollama list shows the models on disk. ollama rm <name> frees the space.
  • Free up RAM. ollama ps shows what's loaded; a model unloads automatically after about five minutes idle by default, and ollama stop <name> unloads it immediately.
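The inventory from ollama list is also available over HTTP. A minimal sketch, assuming Ollama's native GET /api/tags endpoint and the models, name, and size fields in its response; the helper names and the sample data are made up for illustration:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address from the text above


def summarize_models(tags: dict) -> list[tuple[str, float]]:
    """Return (name, size in GB) pairs from an /api/tags-style response."""
    return [(m["name"], round(m["size"] / 1e9, 1)) for m in tags.get("models", [])]


def fetch_tags() -> dict:
    """GET /api/tags; needs the Ollama server running locally."""
    with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags") as resp:
        return json.load(resp)


# Made-up sample in the shape the endpoint returns, so this runs offline.
sample = {"models": [{"name": "qwen3:8b", "size": 5_200_000_000}]}
print(summarize_models(sample))  # [('qwen3:8b', 5.2)]
```

Against a live server, summarize_models(fetch_tags()) would report whatever is actually on disk.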

That's the whole loop. One install, one run, and a working HTTP API.