GUIDE · CPU-ONLY · JUNE 2026
CPU-only local AI works — for a narrow set of use cases.
No GPU, just the CPU + system RAM. This is the entry point for people who want to try local AI without spending money, or who need to deploy to servers without GPUs. It works. It’s also 5–10× slower than even a mid-tier GPU.
This piece tells you which workflows survive on CPU and which ones don’t. No hedging — honest split between “yes this is fine” and “you’re fighting physics.”
Where CPU-only works
- Text classification + short summarization. Qwen 3.5 4B at Q4 runs at 8–15 tok/s on a modern CPU with DDR5. That’s fast enough for “classify 10k records overnight” or “summarize a 2-page document.”
- Privacy-bound workflows on air-gapped machines. If the data legally cannot leave the box and the box doesn’t have a GPU, CPU inference is the answer. Quality ceiling is ~7B; accept it.
- Learning and experimentation. Zero-cost entry. Every concept transfers to GPU later. Don’t buy hardware until you know what you need.
- Server fleets without GPU budget. Llama 3.2 1B or Qwen 3.5 2B at 30–50 tok/s on a 32-core EPYC handles high-volume classification at zero GPU cost.
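To sanity-check the “classify 10k records overnight” claim above, here is a rough back-of-envelope sketch. It counts generated tokens only and ignores prompt processing, so real throughput will be lower. The 20 tokens per record and the 8-hour window are illustrative assumptions, not measurements.

```python
def records_per_window(tok_per_s: float, tokens_per_record: int, window_hours: float) -> int:
    """Records that fit in a time window, counting generated tokens only."""
    total_tokens = tok_per_s * window_hours * 3600
    return int(total_tokens // tokens_per_record)


# Assumed workload: ~20 generated tokens per record (a label plus a short
# reason) and an 8-hour overnight window. Both numbers are hypothetical.
for rate in (8, 15):
    n = records_per_window(rate, tokens_per_record=20, window_hours=8)
    print(f"{rate} tok/s -> {n} records")
# 8 tok/s -> 11520 records
# 15 tok/s -> 21600 records
```

Under those assumptions the low end of the 8–15 tok/s range just clears 10k records. Longer outputs or long prompts shrink that margin quickly.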
Where it doesn’t
- Interactive chat. 8 tok/s feels slow. Humans read at roughly 5 tok/s, but the psychological bar for “fluent” chat is closer to 20 tok/s. CPU-only misses it.
- Coding assistants. A Copilot replacement needs ~30 tok/s minimum to feel responsive. CPU-only on 14B+ sits at 3–5 tok/s. It technically works, but it’s not usable.
- Agents + tool-use loops. Every iteration is another round of inference. What a GPU does in 15 seconds, a CPU takes 2 minutes to do. Users quit.
- Image generation. SD 3.5 Medium on CPU: 3–5 minutes per image. Even SANA-0.6B takes 10–20 seconds. Use the cloud.
- 70B pretending to fit. Q4_K_M 70B loads on 64 GB RAM. It runs at 0.5–1 tok/s. It looks cool; don’t do this seriously.
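The thresholds in the list above can be turned into a quick check. This is a minimal sketch: the 20 and 30 tok/s floors come from the text, while the workload names, the zero floor for batch work, and the agent-loop shape (5 steps of 200 tokens) are assumptions made for illustration.

```python
# Minimum generation speed per workload. The chat and coding floors are the
# figures quoted above; the batch floor of 0 is an assumption (no one is waiting).
MIN_TOK_PER_S = {
    "batch classification": 0.0,
    "interactive chat": 20.0,
    "coding assistant": 30.0,
}


def usable_workloads(tok_per_s: float) -> list[str]:
    """Workloads whose speed floor is met at the given generation rate."""
    return [name for name, floor in MIN_TOK_PER_S.items() if tok_per_s >= floor]


def loop_seconds(steps: int, tokens_per_step: int, tok_per_s: float) -> float:
    """Generation time for an agent loop, ignoring tool latency and prompt processing."""
    return steps * tokens_per_step / tok_per_s


print(usable_workloads(8))        # ['batch classification']
print(loop_seconds(5, 200, 8))    # 125.0
```

A hypothetical 1,000-token loop at 8 tok/s lands at about two minutes, which matches the CPU figure quoted for agents.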
What actually affects CPU speed
Memory bandwidth dominates. A modern Ryzen 9 with DDR5-6000 beats a high-core-count Xeon with DDR4-3200 by 2–3× on the same model. Core count matters less than the marketing suggests.
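Why bandwidth dominates: generating each token reads roughly the whole set of weights from RAM, so peak bandwidth divided by model size gives a rough upper bound on tok/s. The sketch below works through that arithmetic. The dual-channel configurations and the 2.5 GB model size are assumptions; a server board with more populated memory channels has more aggregate bandwidth, and real speeds sit well below the ceiling.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s: int, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (a 64-bit channel moves 8 bytes per transfer)."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Rough upper bound on tok/s if every token reads all weights once."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 2.5  # assumed size of a ~4B model at Q4; not a measured figure

for name, mts in (("DDR5-6000", 6000), ("DDR4-3200", 3200)):
    bw = peak_bandwidth_gb_s(mts, channels=2)  # dual channel is an assumption
    ceiling = decode_ceiling_tok_s(bw, MODEL_GB)
    print(f"{name}: {bw:.1f} GB/s peak, ceiling about {ceiling:.1f} tok/s")
```

With these assumed inputs the ceilings come out near 38 and 20 tok/s. The point is the ratio, not the absolute numbers: halve the bandwidth and the ceiling halves with it, whatever the core count.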
AVX-512 matters, a lot. Chips without it (most Intel desktop parts, 12th gen and later included) are materially slower. AMD Zen 4+ has it; most modern server chips have it; the E-cores on Intel 12th gen and later explicitly do not.
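One way to check whether a machine has AVX-512, sketched for Linux on x86 only. It assumes `/proc/cpuinfo` exposes a `flags` line and looks for `avx512f`, the foundation subset. Other operating systems and architectures need a different approach.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def has_avx512(cpuinfo_text: str) -> bool:
    """True if the AVX-512 foundation flag (avx512f) is listed."""
    return "avx512f" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # Linux only
    if path.exists():
        print("AVX-512:", has_avx512(path.read_text()))
    else:
        print("No /proc/cpuinfo here; check the CPU spec sheet instead.")
```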
Quantization beats model size. Q4_K_M is the CPU default. Q2_K gives up more quality for more speed when Q4 is still too slow. Q8 is only worth it if you have bandwidth to spare, which on CPU you don’t.
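The size side of that trade-off is simple arithmetic: parameters times bits per weight. The sketch below uses approximate bits-per-weight figures (about 4.85 for Q4_K_M and 8.5 for Q8_0); treat them as assumptions, since real files vary by architecture and include some metadata.

```python
# Approximate bits per weight per quantization type. Assumed values;
# actual GGUF files differ slightly from model to model.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q8_0": 8.5,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimated weight size in GB, excluding KV cache and runtime overhead."""
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9


for params in (4, 70):
    for quant in BITS_PER_WEIGHT:
        print(f"{params}B at {quant}: about {weights_gb(params, quant):.1f} GB")
```

On this estimate a 70B model at Q4_K_M needs about 42 GB, which is why it loads in 64 GB of RAM. Q8_0 nearly doubles the bytes read per token, which is the bandwidth a CPU does not have.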
The practical setup
Runner: llama.cpp directly, or Ollama (which wraps llama.cpp). Both have strong CPU paths. Don’t bother with vLLM or TensorRT — those are GPU-first.
Models to actually try: Qwen 3.5 4B, Qwen 3.5 2B, Phi-4 Mini, Llama 3.2 1B / 3B. All under 8 GB RAM at Q4. All benchmark usefully on modern CPUs.
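To measure tok/s on your own machine, one option is Ollama's local HTTP API, which reports `eval_count` and `eval_duration` (in nanoseconds) with each non-streamed response. The sketch below assumes Ollama is running on its default port and that the model has already been pulled. The model tag and prompt in the usage comment are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return eval_count / (eval_duration_ns / 1e9)


def measure(model: str, prompt: str) -> float:
    """Send one non-streamed request and return the measured generation tok/s."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return tokens_per_second(body["eval_count"], body["eval_duration"])


# Usage (needs a running Ollama server; the model tag is an assumption):
# print(measure("llama3.2:1b", "Classify this ticket as billing or technical: ..."))
```

Run it a few times and discard the first result, since the first request also pays the cost of loading the model.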
Plan on upgrading. CPU-only is the right entry point; it’s rarely the right endpoint. Once you know what you actually want to run locally, a $500–$800 GPU unlocks a 5–10× quality-at-speed jump that no CPU tuning matches.