Local AI for agentic loops and tool use.

Agents are where local AI struggles most against cloud. Frontier reasoning + reliable tool-use + long-horizon coherence is genuinely hard at open-weight scale below 96GB. Above that, you can run real agent loops locally — but the comparison is honest: cloud frontier models still lead.

Verdict — Workable at 24GB+; competitive with cloud only at frontier

gpt-oss-20b + Ministral 3 14B Instruct + Qwen3-Coder-30B-A3B (which the Qwen team specifically trained for agentic coding workflows). Devstral Small 2 (Mistral, Feb 2026) was built explicitly for Cline / OpenHands loops at 24B Apache 2.0.

What's the answer at each tier

Frontier (64+ GB)

Qwen 3.5 122B-A10B (multimodal, 262K) + gpt-oss-120b (Apache 2.0 reasoning + tool use, near o4-mini) + Llama 3.3 70B Q4 (proven production agent backbone) + GLM-5.1 (Z.ai's long-horizon specialist — stays productive across hundreds of rounds and thousands of tool calls per vendor claims).

Command A+ (218B-A25B, Apache 2.0) — May 20 2026 release — Cohere frontier MoE built explicitly for agentic + RAG. Native tool use, 128K context, 48-language coverage, native vision input. The first Apache 2.0 218B-class MoE — direct DeepSeek V4-Pro competitor with a permissive license neither V4-Pro nor Kimi K2.6 carry.
Qwen 3.5 122B-A10B (multimodal agent, 262K) — Multimodal + 10B active for tool-use latency. 262K context for long agentic loops. Apache 2.0.
gpt-oss-120b (Apache 2.0) — OpenAI Apache 2.0 reasoning + tool use; near o4-mini. MXFP4 native ~63 GB. The open-weight agentic frontier on 80 GB+ hardware.

Top (32+ GB)

Qwen 3.6-35B-A3B at fast-daily-driver MoE speeds. MiniMax M2.5/M2.7 (229-230B hosted-only frontier). GLM-5.1 if you can host a 754B MoE remotely.

Qwen 3.6-35B-A3B — Latest Qwen MoE; strong function calling; realistic on 24GB+ VRAM or Mac 48GB+ — the local agentic top pick.
Qwen3-Coder-30B-A3B (MoE, fits 24GB) — Purpose-trained for agentic coding + browser-use. 3.3B active so tool-call latency stays low. The community pick when the agent loop is mostly code.
Qwen 3.5 35B-A3B (generalist MoE) — Apache 2.0 generalist with native tool use; 262K context for long agentic loops; 10B-active speed at 35B-class quality.

High (20–24 GB)

Qwen 3.5 35B-A3B (MoE, fits 24GB) — MoE with native tool use; fits 24GB at Q4; Apache 2.0.
gpt-oss-20b — OpenAI Apache 2.0 reasoning + tool use; 21B MoE fits 16GB.
Ministral 3 14B Instruct — Dense with strong tool use + planning. Prefer Instruct — community reports timeouts on Reasoning variant.

Mid (12–16 GB)

Qwen 3.5 9B + Ministral 3 8B + gpt-oss-20b. Short bounded tool-use loops only. Don't plan multi-hour autonomous runs at this tier.

Qwen 3.5 9B — Strong tool-use performance for 9B; supports thinking mode and 201-language coverage.
Ministral 3 8B Instruct — Tool use + reasoning; Apache 2.0.
gpt-oss-20b — Apache 2.0 reasoning model; strong structured outputs for agents.

Low (6–12 GB / CPU)

Gemma 3 4B + Ministral 3 3B + Qwen 3.5 4B. Single-tool function-call style work. Not for autonomous agent loops.

Ministral 3 3B — Smallest Ministral with reasoning + tool use.
Phi-4 Mini — Microsoft 3.8B; works for small tool-calling agents.
Gemma 3 4B — Compact Gemma for bounded tool loops; keep iterations tight.

How to actually run it

Cline + Ollama is the easiest local agent stack. OpenHands for production-grade autonomous agents (Devstral Small 2 was tuned for this). n8n + local LLM endpoint for workflow automation. Open-WebUI for tool-use chat experimentation.

Watchouts

Long-horizon stability is the open-weight Achilles heel. Models that look great in 5-step demos often degrade past 50 steps. GLM-5.1 is the standout exception per vendor claims (verify against your workload).
Tool-use formatting drift between models is real — a system prompt tuned for Qwen3-Coder won't directly transfer to gpt-oss. Plan for per-model prompt iteration.
Aider / Cline / OpenHands all support Ollama-compatible endpoints — easy to swap models. But also: GLM-5.1's `glm-5.1:cloud` Ollama tag is hosted inference, not local. Check tag semantics.

When cloud still wins

Long-horizon autonomous agents that need 100+ steps of reliable tool-use, frontier reasoning + tool-use combined, or anything where a single wrong step is expensive (production deployment loops). Local at frontier tier is competitive; below 96GB, cloud is usually the right call for serious agent work.

Hardware that fits this use case

Related guides

Next step

Try the planner with AI agents · tool use preselected→

The planner pulls all six dimensions together — your hardware, your VRAM/RAM, your GPU family, your context, and your priorities — and returns specific picks with fit badges.

Notes flagged for next refresh

Devstral Small 2 (Mistral, Feb 25 2026, 24B Apache 2.0, 256K context, built specifically for agentic loops) is flagged for next quarterly refresh as a planner pick addition.