USE CASE · CHAT
Local AI for everyday conversation.
The 16GB sweet spot is now genuinely good. Gemma 4 26B A4B at 3.8B active runs sub-second-to-first-token on consumer GPUs at quality that beats GPT-3.5 of two years ago. The honest case for local chat is privacy, offline reliability, and zero per-token cost — not raw quality vs Opus.
Verdict — Genuinely good at 16GB+, frontier-class at 64GB+
Qwen 3.6-27B dense (April 22 2026) claims to beat the prior 397B MoE flagship while fitting 16GB Q4. Gemma 4 31B for editorial-prose quality. Qwen 3.6-35B-A3B for fast-daily-driver MoE.
What's the answer at each tier
Llama 3.3 70B dense, Qwen 3.5 122B-A10B, gpt-oss-120b — open-weight chat that competes with closed frontier for most conversational work. The Chinese-lab April 2026 wave (DeepSeek V4, Kimi K2.6, GLM-5.1, MiniMax M2.7) all hit at-or-above this tier in a 12-day window.
- Llama 3.3 70B Q4 dense — Proven daily-driver 70B dense — focused, reliable, well-supported across all runners. The community pick for "what should my $4K rig run for chat."
- Qwen 3.5 122B-A10B (4-bit MLX, multimodal) — Native multimodal — handles text + image + video input. 10B active, 60.6 tok/s calibrated on M5 Max 128 GB. Apache 2.0.
- Qwen 3.6-35B-A3B (the fast daily driver) — WillItRunAI rates 95/100 for Mac M3 Ultra 96 GB at 71 tok/s. Use this when you want speed; step up to 70B/122B for hard reasoning.
Qwen 3.6-27B dense (April 22 2026) claims to beat the prior 397B MoE flagship while fitting 16GB Q4. Gemma 4 31B for editorial-prose quality. Qwen 3.6-35B-A3B for fast-daily-driver MoE.
- Qwen 3.6-27B — April 22 2026 dense refresh; supersedes Qwen 3.5 27B and claims to beat the prior 397B MoE flagship while staying single-GPU at Q4 (~17 GB).
- Gemma 4 31B Dense — Google April 2 2026 release; Arena top 5, 256K context, vision+audio native; Apache 2.0.
- Qwen 3.6-35B-A3B — 3B-active MoE; strong function calling; 262K native context extensible to 1M.
Qwen 3.5 35B-A3B + Gemma 4 26B MoE + Ministral 3 14B Instruct. Diverse picks — pick by voice preference: Qwen for technical, Gemma for prose, Ministral for tool use.
- Qwen 3.5 35B-A3B (MoE, fits 24GB) — 3B active MoE — 30B quality at 3B inference speed.
- Gemma 4 26B MoE (3.8B active) — Open Arena top 10 at 3.8B active compute; calm and fast.
- Ministral 3 14B Instruct — Mistral 14B with 256K + vision + tool use; Apache 2.0. Prefer Instruct — community reports timeouts on the Reasoning variant.
Qwen 3.5 9B and Ministral 3 8B handle most chat. 9B has 262K context and multimodal; 8B is faster.
- Qwen 3.5 9B — 262K context with native multimodal; strong on GPQA, IFEval, LiveCodeBench at the 9B size.
- Ministral 3 8B Instruct — Outperforms Gemma 12B in most evals; Apache 2.0.
- gpt-oss-20b — OpenAI Apache 2.0 reasoning model; fits 16GB; strong general chat.
Qwen 3.5 4B + Ministral 3 3B + Phi-4 Mini. Fine for offline assistants, classification, simple Q&A. Doesn't replace cloud for hard reasoning.
- Qwen 3.5 4B — 4B dense with 262K context and native multimodal.
- Ministral 3 3B — Smallest Ministral; reasoning + tool use; Apache 2.0.
- Phi-4 Mini — Microsoft 3.8B; punches above its weight on STEM.
How to actually run it
Ollama + Open-WebUI is the standard local chat stack. LM Studio if you prefer a desktop app. Jan for AGPLv3 / fully-open. Mac users gained 1.6× prefill / 2× decode via Ollama's MLX backend (released March 2026, requires 32 GB unified minimum).
Watchouts
- Qwen 3.7 is mentioned in some places but no open-weight package or official Alibaba release note exists as of May 2026. Don't chase the rumor.
- Mac vs NVIDIA performance gap for chat is mostly noise — both work fine. Pick by overall workflow fit (Mac for portability + silence, NVIDIA for raw speed + budget).
- Frontier-tier reasoning (Opus 4.7, GPT-5.5, Gemini 3 Pro Thinking) still beats every open-weight model on hard reasoning. For pure chat that's rarely the bottleneck.
When cloud still wins
Frontier reasoning, current-events recall (local weights are a frozen snapshot), or one-off heavy tasks where the per-token cost doesn't add up. For consistent daily use of 50K+ tokens, local pays back in 4-12 months at the 24GB+ tier.
Hardware that fits this use case
Related guides
Next step
Try the planner with Chat preselected→The planner pulls all six dimensions together — your hardware, your VRAM/RAM, your GPU family, your context, and your priorities — and returns specific picks with fit badges.
Notes flagged for next refresh
DeepSeek V4-Flash (April 24 2026, 158B/13B-active, MIT, 1M context) is now the strongest open-weight chat at frontier tier — needs 96GB+ for local. Already in planner picks.