the AI bench
VERIFIED MAY 2026

← All use cases

USE CASE · CHAT

Local AI for everyday conversation.

The 16GB sweet spot is now genuinely good. Gemma 4 26B A4B at 3.8B active runs sub-second-to-first-token on consumer GPUs at quality that beats GPT-3.5 of two years ago. The honest case for local chat is privacy, offline reliability, and zero per-token cost — not raw quality vs Opus.


Verdict — Genuinely good at 16GB+, frontier-class at 64GB+

Qwen 3.6-27B dense (April 22 2026) claims to beat the prior 397B MoE flagship while fitting 16GB Q4. Gemma 4 31B for editorial-prose quality. Qwen 3.6-35B-A3B for fast-daily-driver MoE.


What's the answer at each tier

Frontier (64+ GB)

Llama 3.3 70B dense, Qwen 3.5 122B-A10B, gpt-oss-120b — open-weight chat that competes with closed frontier for most conversational work. The Chinese-lab April 2026 wave (DeepSeek V4, Kimi K2.6, GLM-5.1, MiniMax M2.7) all hit at-or-above this tier in a 12-day window.

  1. Llama 3.3 70B Q4 dense — Proven daily-driver 70B dense — focused, reliable, well-supported across all runners. The community pick for "what should my $4K rig run for chat."
  2. Qwen 3.5 122B-A10B (4-bit MLX, multimodal) — Native multimodal — handles text + image + video input. 10B active, 60.6 tok/s calibrated on M5 Max 128 GB. Apache 2.0.
  3. Qwen 3.6-35B-A3B (the fast daily driver) — WillItRunAI rates 95/100 for Mac M3 Ultra 96 GB at 71 tok/s. Use this when you want speed; step up to 70B/122B for hard reasoning.
Top (32+ GB)

Qwen 3.6-27B dense (April 22 2026) claims to beat the prior 397B MoE flagship while fitting 16GB Q4. Gemma 4 31B for editorial-prose quality. Qwen 3.6-35B-A3B for fast-daily-driver MoE.

  1. Qwen 3.6-27B — April 22 2026 dense refresh; supersedes Qwen 3.5 27B and claims to beat the prior 397B MoE flagship while staying single-GPU at Q4 (~17 GB).
  2. Gemma 4 31B Dense — Google April 2 2026 release; Arena top 5, 256K context, vision+audio native; Apache 2.0.
  3. Qwen 3.6-35B-A3B — 3B-active MoE; strong function calling; 262K native context extensible to 1M.
High (20–24 GB)

Qwen 3.5 35B-A3B + Gemma 4 26B MoE + Ministral 3 14B Instruct. Diverse picks — pick by voice preference: Qwen for technical, Gemma for prose, Ministral for tool use.

  1. Qwen 3.5 35B-A3B (MoE, fits 24GB) — 3B active MoE — 30B quality at 3B inference speed.
  2. Gemma 4 26B MoE (3.8B active) — Open Arena top 10 at 3.8B active compute; calm and fast.
  3. Ministral 3 14B Instruct — Mistral 14B with 256K + vision + tool use; Apache 2.0. Prefer Instruct — community reports timeouts on the Reasoning variant.
Mid (12–16 GB)

Qwen 3.5 9B and Ministral 3 8B handle most chat. 9B has 262K context and multimodal; 8B is faster.

  1. Qwen 3.5 9B — 262K context with native multimodal; strong on GPQA, IFEval, LiveCodeBench at the 9B size.
  2. Ministral 3 8B Instruct — Outperforms Gemma 12B in most evals; Apache 2.0.
  3. gpt-oss-20b — OpenAI Apache 2.0 reasoning model; fits 16GB; strong general chat.
Low (6–12 GB / CPU)

Qwen 3.5 4B + Ministral 3 3B + Phi-4 Mini. Fine for offline assistants, classification, simple Q&A. Doesn't replace cloud for hard reasoning.

  1. Qwen 3.5 4B — 4B dense with 262K context and native multimodal.
  2. Ministral 3 3B — Smallest Ministral; reasoning + tool use; Apache 2.0.
  3. Phi-4 Mini — Microsoft 3.8B; punches above its weight on STEM.

How to actually run it

Ollama + Open-WebUI is the standard local chat stack. LM Studio if you prefer a desktop app. Jan for AGPLv3 / fully-open. Mac users gained 1.6× prefill / 2× decode via Ollama's MLX backend (released March 2026, requires 32 GB unified minimum).


Watchouts

  • Qwen 3.7 is mentioned in some places but no open-weight package or official Alibaba release note exists as of May 2026. Don't chase the rumor.
  • Mac vs NVIDIA performance gap for chat is mostly noise — both work fine. Pick by overall workflow fit (Mac for portability + silence, NVIDIA for raw speed + budget).
  • Frontier-tier reasoning (Opus 4.7, GPT-5.5, Gemini 3 Pro Thinking) still beats every open-weight model on hard reasoning. For pure chat that's rarely the bottleneck.

When cloud still wins

Frontier reasoning, current-events recall (local weights are a frozen snapshot), or one-off heavy tasks where the per-token cost doesn't add up. For consistent daily use of 50K+ tokens, local pays back in 4-12 months at the 24GB+ tier.


Hardware that fits this use case


Related guides


Next step

Try the planner with Chat preselected

The planner pulls all six dimensions together — your hardware, your VRAM/RAM, your GPU family, your context, and your priorities — and returns specific picks with fit badges.


Notes flagged for next refresh

DeepSeek V4-Flash (April 24 2026, 158B/13B-active, MIT, 1M context) is now the strongest open-weight chat at frontier tier — needs 96GB+ for local. Already in planner picks.