USE CASE · AGENTS · TOOL USE
Local AI for agentic loops and tool use.
Agents are where local AI struggles most against cloud. Frontier reasoning + reliable tool-use + long-horizon coherence is genuinely hard at open-weight scale below 96GB. Above that, you can run real agent loops locally — but the comparison is honest: cloud frontier models still lead.
Verdict — Workable at 24GB+; competitive with cloud only at frontier
gpt-oss-20b + Ministral 3 14B Instruct + Qwen3-Coder-30B-A3B (which the Qwen team specifically trained for agentic coding workflows). Devstral Small 2 (Mistral, Feb 2026) was built explicitly for Cline / OpenHands loops at 24B Apache 2.0.
What's the answer at each tier
Qwen 3.5 122B-A10B (multimodal, 262K) + gpt-oss-120b (Apache 2.0 reasoning + tool use, near o4-mini) + Llama 3.3 70B Q4 (proven production agent backbone) + GLM-5.1 (Z.ai's long-horizon specialist — stays productive across hundreds of rounds and thousands of tool calls per vendor claims).
- Qwen 3.5 122B-A10B (multimodal agent, 262K) — Multimodal + 10B active for tool-use latency. 262K context for long agentic loops. Apache 2.0.
- gpt-oss-120b (Apache 2.0) — OpenAI Apache 2.0 reasoning + tool use; near o4-mini. MXFP4 native ~63 GB. The open-weight agentic frontier on 80 GB+ hardware.
- Llama 3.3 70B Q4 dense — The reliable 70B for production agent loops — battle-tested, broad framework support (LangGraph / CrewAI / AutoGen / Qwen Code).
Qwen 3.6-35B-A3B at fast-daily-driver MoE speeds. MiniMax M2.5/M2.7 (229-230B hosted-only frontier). GLM-5.1 if you can host a 754B MoE remotely.
- Qwen 3.6-35B-A3B — Latest Qwen MoE; strong function calling; realistic on 24GB+ VRAM or Mac 48GB+ — the local agentic top pick.
- Qwen3-Coder-30B-A3B (MoE, fits 24GB) — Purpose-trained for agentic coding + browser-use. 3.3B active so tool-call latency stays low. The community pick when the agent loop is mostly code.
- Qwen 3.5 35B-A3B (generalist MoE) — Apache 2.0 generalist with native tool use; 262K context for long agentic loops; 10B-active speed at 35B-class quality.
gpt-oss-20b + Ministral 3 14B Instruct + Qwen3-Coder-30B-A3B (which the Qwen team specifically trained for agentic coding workflows). Devstral Small 2 (Mistral, Feb 2026) was built explicitly for Cline / OpenHands loops at 24B Apache 2.0.
- Qwen 3.5 35B-A3B (MoE, fits 24GB) — MoE with native tool use; fits 24GB at Q4; Apache 2.0.
- gpt-oss-20b — OpenAI Apache 2.0 reasoning + tool use; 21B MoE fits 16GB.
- Ministral 3 14B Instruct — Dense with strong tool use + planning. Prefer Instruct — community reports timeouts on Reasoning variant.
Qwen 3.5 9B + Ministral 3 8B + gpt-oss-20b. Short bounded tool-use loops only. Don't plan multi-hour autonomous runs at this tier.
- Qwen 3.5 9B — Strong tool-use performance for 9B; supports thinking mode and 201-language coverage.
- Ministral 3 8B Instruct — Tool use + reasoning; Apache 2.0.
- gpt-oss-20b — Apache 2.0 reasoning model; strong structured outputs for agents.
Gemma 3 4B + Ministral 3 3B + Qwen 3.5 4B. Single-tool function-call style work. Not for autonomous agent loops.
- Ministral 3 3B — Smallest Ministral with reasoning + tool use.
- Phi-4 Mini — Microsoft 3.8B; works for small tool-calling agents.
- Gemma 3 4B — Compact Gemma for bounded tool loops; keep iterations tight.
How to actually run it
Cline + Ollama is the easiest local agent stack. OpenHands for production-grade autonomous agents (Devstral Small 2 was tuned for this). n8n + local LLM endpoint for workflow automation. Open-WebUI for tool-use chat experimentation.
Watchouts
- Long-horizon stability is the open-weight Achilles heel. Models that look great in 5-step demos often degrade past 50 steps. GLM-5.1 is the standout exception per vendor claims (verify against your workload).
- Tool-use formatting drift between models is real — a system prompt tuned for Qwen3-Coder won't directly transfer to gpt-oss. Plan for per-model prompt iteration.
- Aider / Cline / OpenHands all support Ollama-compatible endpoints — easy to swap models. But also: GLM-5.1's `glm-5.1:cloud` Ollama tag is hosted inference, not local. Check tag semantics.
When cloud still wins
Long-horizon autonomous agents that need 100+ steps of reliable tool-use, frontier reasoning + tool-use combined, or anything where a single wrong step is expensive (production deployment loops). Local at frontier tier is competitive; below 96GB, cloud is usually the right call for serious agent work.
Hardware that fits this use case
Related guides
Next step
Try the planner with AI agents · tool use preselected→The planner pulls all six dimensions together — your hardware, your VRAM/RAM, your GPU family, your context, and your priorities — and returns specific picks with fit badges.
Notes flagged for next refresh
Devstral Small 2 (Mistral, Feb 25 2026, 24B Apache 2.0, 256K context, built specifically for agentic loops) is flagged for next quarterly refresh as a planner pick addition.