USE CASE · CODING
Local AI that codes with you.
At 24 GB VRAM and above, local coding models are genuinely competitive with cloud for most real work. Below that, they handle autocomplete and simple refactors but stop being a Claude-replacement. Here's the honest tier-by-tier read.
Verdict — Comfortable on 24GB+, workable below
Qwen3-Coder-30B-A3B is the community daily-driver at 32GB: 3B-active speeds with 30B-class quality. Qwen 3.5 35B-A3B often wins on mixed real-world codebases. Both Apache 2.0, both fit 24GB Q4.
What's the answer at each tier
70B-class dense and 122B-A10B MoE — local Claude territory. You're running Aider / Cline / OpenHands loops without round-trip latency, on your own hardware, with full repo context. The frontier band is where local coding meaningfully challenges cloud.
- Llama 3.3 70B Q4 dense — Community-standard 70B dense — fits 96 GB Mac unified cleanly (no tweak), 22 tok/s on M5 Max 128 GB, full BF16 on DGX Spark. Mature across llama.cpp / vLLM / TensorRT-LLM / MLX.
- Qwen 3.5 122B-A10B (4-bit MLX, multimodal) — 60.6 tok/s calibrated on M5 Max 128 GB; native multimodal; Apache 2.0. Mac 96 GB needs sysctl wired-memory tweak; M5 Max 128 GB and DGX Spark run it without.
- gpt-oss-120b (Apache 2.0, MXFP4 ~63 GB) — Near o4-mini reasoning at 5.1B active. MXFP4-native (no separate quant). Community reports 200+ tok/s on consumer hardware. Single 80 GB GPU or 128 GB unified.
Qwen3-Coder-30B-A3B is the community daily-driver at 32GB: 3B-active speeds with 30B-class quality. Qwen 3.5 35B-A3B often wins on mixed real-world codebases. Both Apache 2.0, both fit 24GB Q4.
- Qwen3-Coder-30B-A3B (MoE, fits 24GB) — Community daily driver for local coding; 3B-active MoE delivers 30B quality at 3B-dense speed.
- Qwen 3.5 35B-A3B (generalist MoE) — Often beats the Coder variant on mixed real-world codebases per community testing; Apache 2.0.
- Kimi K2.6 (1T MoE, frontier hosted) — April 20 2026 release; tops SWE-Bench Pro at 58.6 (vs GPT-5.4 xhigh 57.7) and #4 on Artificial Analysis Index. Modified MIT — no commercial restrictions below 100M MAU. Hosted-only realistic at 1T params.
The 24GB tier still runs the 30B-A3B MoEs cleanly. Best return-per-dollar in the entire local-AI stack — a $1,800 used 4090 or 3090 puts you in serious-work territory.
- Qwen3-Coder-30B-A3B (MoE, fits 24GB) — 3B-active MoE — benchmark champion for local coding at this tier.
- Qwen 3.5 35B-A3B (generalist MoE) — Often wins real mixed-codebase work over the Coder variant; Apache 2.0.
- gpt-oss-20b — OpenAI Apache 2.0; 21B MoE with 3.6B active; near o4-mini on reasoning; fits 16GB.
Qwen3-14B and Qwen 3.5 9B handle in-editor completion and small refactors. Reasoning over a large repo is harder; pair with tight RAG or call cloud for the harder questions.
- Qwen3-14B — Sticky 14B workhorse; 128K context; Apache 2.0; broad runner support.
- Qwen 3.5 9B — 262K context; strong on LiveCodeBench, IFEval, MMLU-Pro for its size.
- gpt-oss-20b — MXFP4-native Apache 2.0; fits 16GB cleanly; reasoning + tool use at this tier.
Qwen 3.5 4B / 2B and Phi-4 Mini work for autocomplete and small isolated edits. Don't expect them to read your whole repo and reason — that's not what they're for.
- Qwen 3.5 4B — 4B dense with 262K context; surprisingly coherent for its size.
- Qwen 3.5 2B — Smallest coherent option; 262K context; fine for autocomplete loops.
- Phi-4 Mini — Microsoft 3.8B; strongest small-model STEM and reasoning performance.
How to actually run it
Ollama remains the easiest entry; LM Studio for GUI users; llama.cpp directly for max throughput. The agentic frameworks (Cline, Continue, Aider, OpenHands) all support Ollama-compatible endpoints — same model, different surfaces. Devstral Small 2 (Mistral, Feb 2026) was built specifically for agentic loops if you're running Cline / OpenHands; Qwen3-Coder-30B-A3B is better for in-editor completion (Cursor / Continue).
Watchouts
- Aider leaderboard top 25 is closed-frontier dominated — top open-weight is DeepSeek-V3.2-Exp (Reasoner) at 74.2% (rank 12 as of May 2026). Our local picks won't appear in Aider's top 10. The honest framing: best for local-first coding, not top-of-leaderboards.
- MoE picks (30B-A3B, 35B-A3B) had HIP kernel issues on AMD ROCm through early 2026. Use llama.cpp Vulkan backend on AMD; Vulkan often outperforms ROCm on MoE workloads.
- Long-horizon agent loops (50+ steps) are where local breaks down first. Frontier reasoning models in cloud handle these better today.
When cloud still wins
Long-horizon autonomous agent loops (50+ steps), genuinely frontier reasoning required (complex multi-file refactors with subtle constraints), or you need the absolute top of the Aider leaderboard. GPT-5, o3-pro, Opus 4.7 still lead at the very top. For everything else under 8 hours of work per day, local at 24GB+ pays back in 6-18 months vs cloud subscriptions.
Hardware that fits this use case
Related guides
Next step
Try the planner with Coding preselected→The planner pulls all six dimensions together — your hardware, your VRAM/RAM, your GPU family, your context, and your priorities — and returns specific picks with fit badges.
Notes flagged for next refresh
Devstral Small 2 (Mistral, Feb 25 2026, 24B Apache 2.0, 256K context, 68.0% SWE-bench Verified) is a strong agentic-loops alternative not yet wired into the planner picks — flagged for next quarterly refresh.