Local AI for everyday conversation.

The 16GB sweet spot is now genuinely good. Gemma 4 26B A4B at 3.8B active runs sub-second-to-first-token on consumer GPUs at quality that beats GPT-3.5 of two years ago. The honest case for local chat is privacy, offline reliability, and zero per-token cost — not raw quality vs Opus.

Verdict — Genuinely good at 16GB+, frontier-class at 64GB+

Qwen 3.6-27B dense (April 22 2026) claims to beat the prior 397B MoE flagship while fitting 16GB Q4. Gemma 4 31B for editorial-prose quality. Qwen 3.6-35B-A3B for fast-daily-driver MoE.

What's the answer at each tier

Frontier (64+ GB)

Llama 3.3 70B dense, Qwen 3.5 122B-A10B, gpt-oss-120b — open-weight chat that competes with closed frontier for most conversational work. The Chinese-lab April 2026 wave (DeepSeek V4, Kimi K2.6, GLM-5.1, MiniMax M2.7) all hit at-or-above this tier in a 12-day window.

Llama 3.3 70B Q4 dense — Proven daily-driver 70B dense — focused, reliable, well-supported across all runners. The community pick for "what should my $4K rig run for chat."
Qwen 3.5 122B-A10B (4-bit MLX, multimodal) — Native multimodal — handles text + image + video input. 10B active, 60.6 tok/s calibrated on M5 Max 128 GB. Apache 2.0.
Qwen 3.6-35B-A3B (the fast daily driver) — WillItRunAI rates 95/100 for Mac M3 Ultra 96 GB at 71 tok/s. Use this when you want speed; step up to 70B/122B for hard reasoning.

Top (32+ GB)

Qwen 3.6-27B dense (April 22 2026) claims to beat the prior 397B MoE flagship while fitting 16GB Q4. Gemma 4 31B for editorial-prose quality. Qwen 3.6-35B-A3B for fast-daily-driver MoE.

Qwen 3.6-27B — April 22 2026 dense refresh; supersedes Qwen 3.5 27B and claims to beat the prior 397B MoE flagship while staying single-GPU at Q4 (~17 GB).
Gemma 4 31B Dense — Google April 2 2026 release; Arena top 5, 256K context, vision+audio native; Apache 2.0.
Qwen 3.6-35B-A3B — 3B-active MoE; strong function calling; 262K native context extensible to 1M.

High (20–24 GB)

Qwen 3.5 35B-A3B + Gemma 4 26B MoE + Ministral 3 14B Instruct. Diverse picks — pick by voice preference: Qwen for technical, Gemma for prose, Ministral for tool use.

Qwen 3.5 35B-A3B (MoE, fits 24GB) — 3B active MoE — 30B quality at 3B inference speed.
Gemma 4 26B MoE (3.8B active) — Open Arena top 10 at 3.8B active compute; calm and fast.
Ministral 3 14B Instruct — Mistral 14B with 256K + vision + tool use; Apache 2.0. Prefer Instruct — community reports timeouts on the Reasoning variant.

Mid (12–16 GB)

Qwen 3.5 9B and Ministral 3 8B handle most chat. 9B has 262K context and multimodal; 8B is faster.

Qwen 3.5 9B — 262K context with native multimodal; strong on GPQA, IFEval, LiveCodeBench at the 9B size.
Ministral 3 8B Instruct — Outperforms Gemma 12B in most evals; Apache 2.0.
gpt-oss-20b — OpenAI Apache 2.0 reasoning model; fits 16GB; strong general chat.

Low (6–12 GB / CPU)

Qwen 3.5 4B + Ministral 3 3B + Phi-4 Mini. Fine for offline assistants, classification, simple Q&A. Doesn't replace cloud for hard reasoning.

Qwen 3.5 4B — 4B dense with 262K context and native multimodal.
Ministral 3 3B — Smallest Ministral; reasoning + tool use; Apache 2.0.
Phi-4 Mini — Microsoft 3.8B; punches above its weight on STEM.

How to actually run it

Ollama + Open-WebUI is the standard local chat stack. LM Studio if you prefer a desktop app. Jan for AGPLv3 / fully-open. Mac users gained 1.6× prefill / 2× decode via Ollama's MLX backend (released March 2026, requires 32 GB unified minimum).

Watchouts

Qwen 3.7 (Max + Plus) is real and official (May 2026) but proprietary / API-only — no open weights, so it is not a local pick. Qwen 3.6 (35B-A3B + 27B, Apache 2.0) remains the open-weight ceiling for the Qwen line as of July 2026.
Mac vs NVIDIA performance gap for chat is mostly noise — both work fine. Pick by overall workflow fit (Mac for portability + silence, NVIDIA for raw speed + budget).
Frontier-tier reasoning (Opus 4.8, GPT-5.5, Gemini 3 Pro Thinking) still beats every open-weight model on hard reasoning. For pure chat that's rarely the bottleneck.

When cloud still wins

Frontier reasoning, current-events recall (local weights are a frozen snapshot), or one-off heavy tasks where the per-token cost doesn't add up. For consistent daily use of 50K+ tokens, local pays back in 4-12 months at the 24GB+ tier.

Hardware that fits this use case

Related guides

Next step

Try the planner with Chat preselected→

The planner pulls all six dimensions together — your hardware, your VRAM/RAM, your GPU family, your context, and your priorities — and returns specific picks with fit badges.

Notes flagged for next refresh

DeepSeek V4-Flash (April 24 2026, 158B/13B-active, MIT, 1M context) is now the strongest open-weight chat at frontier tier — needs 96GB+ for local. Already in planner picks.