the AI bench
VERIFIED MAY 2026

← All use cases

USE CASE · AGENTS · TOOL USE

Local AI for agentic loops and tool use.

Agents are where local AI struggles most against cloud. Frontier reasoning + reliable tool-use + long-horizon coherence is genuinely hard at open-weight scale below 96GB. Above that, you can run real agent loops locally — but the comparison is honest: cloud frontier models still lead.


Verdict — Workable at 24GB+; competitive with cloud only at frontier

gpt-oss-20b + Ministral 3 14B Instruct + Qwen3-Coder-30B-A3B (which the Qwen team specifically trained for agentic coding workflows). Devstral Small 2 (Mistral, Feb 2026) was built explicitly for Cline / OpenHands loops at 24B Apache 2.0.


What's the answer at each tier

Frontier (64+ GB)

Qwen 3.5 122B-A10B (multimodal, 262K) + gpt-oss-120b (Apache 2.0 reasoning + tool use, near o4-mini) + Llama 3.3 70B Q4 (proven production agent backbone) + GLM-5.1 (Z.ai's long-horizon specialist — stays productive across hundreds of rounds and thousands of tool calls per vendor claims).

  1. Qwen 3.5 122B-A10B (multimodal agent, 262K) — Multimodal + 10B active for tool-use latency. 262K context for long agentic loops. Apache 2.0.
  2. gpt-oss-120b (Apache 2.0) — OpenAI Apache 2.0 reasoning + tool use; near o4-mini. MXFP4 native ~63 GB. The open-weight agentic frontier on 80 GB+ hardware.
  3. Llama 3.3 70B Q4 dense — The reliable 70B for production agent loops — battle-tested, broad framework support (LangGraph / CrewAI / AutoGen / Qwen Code).
Top (32+ GB)

Qwen 3.6-35B-A3B at fast-daily-driver MoE speeds. MiniMax M2.5/M2.7 (229-230B hosted-only frontier). GLM-5.1 if you can host a 754B MoE remotely.

  1. Qwen 3.6-35B-A3B — Latest Qwen MoE; strong function calling; realistic on 24GB+ VRAM or Mac 48GB+ — the local agentic top pick.
  2. Qwen3-Coder-30B-A3B (MoE, fits 24GB) — Purpose-trained for agentic coding + browser-use. 3.3B active so tool-call latency stays low. The community pick when the agent loop is mostly code.
  3. Qwen 3.5 35B-A3B (generalist MoE) — Apache 2.0 generalist with native tool use; 262K context for long agentic loops; 10B-active speed at 35B-class quality.
High (20–24 GB)

gpt-oss-20b + Ministral 3 14B Instruct + Qwen3-Coder-30B-A3B (which the Qwen team specifically trained for agentic coding workflows). Devstral Small 2 (Mistral, Feb 2026) was built explicitly for Cline / OpenHands loops at 24B Apache 2.0.

  1. Qwen 3.5 35B-A3B (MoE, fits 24GB) — MoE with native tool use; fits 24GB at Q4; Apache 2.0.
  2. gpt-oss-20b — OpenAI Apache 2.0 reasoning + tool use; 21B MoE fits 16GB.
  3. Ministral 3 14B Instruct — Dense with strong tool use + planning. Prefer Instruct — community reports timeouts on Reasoning variant.
Mid (12–16 GB)

Qwen 3.5 9B + Ministral 3 8B + gpt-oss-20b. Short bounded tool-use loops only. Don't plan multi-hour autonomous runs at this tier.

  1. Qwen 3.5 9B — Strong tool-use performance for 9B; supports thinking mode and 201-language coverage.
  2. Ministral 3 8B Instruct — Tool use + reasoning; Apache 2.0.
  3. gpt-oss-20b — Apache 2.0 reasoning model; strong structured outputs for agents.
Low (6–12 GB / CPU)

Gemma 3 4B + Ministral 3 3B + Qwen 3.5 4B. Single-tool function-call style work. Not for autonomous agent loops.

  1. Ministral 3 3B — Smallest Ministral with reasoning + tool use.
  2. Phi-4 Mini — Microsoft 3.8B; works for small tool-calling agents.
  3. Gemma 3 4B — Compact Gemma for bounded tool loops; keep iterations tight.

How to actually run it

Cline + Ollama is the easiest local agent stack. OpenHands for production-grade autonomous agents (Devstral Small 2 was tuned for this). n8n + local LLM endpoint for workflow automation. Open-WebUI for tool-use chat experimentation.


Watchouts

  • Long-horizon stability is the open-weight Achilles heel. Models that look great in 5-step demos often degrade past 50 steps. GLM-5.1 is the standout exception per vendor claims (verify against your workload).
  • Tool-use formatting drift between models is real — a system prompt tuned for Qwen3-Coder won't directly transfer to gpt-oss. Plan for per-model prompt iteration.
  • Aider / Cline / OpenHands all support Ollama-compatible endpoints — easy to swap models. But also: GLM-5.1's `glm-5.1:cloud` Ollama tag is hosted inference, not local. Check tag semantics.

When cloud still wins

Long-horizon autonomous agents that need 100+ steps of reliable tool-use, frontier reasoning + tool-use combined, or anything where a single wrong step is expensive (production deployment loops). Local at frontier tier is competitive; below 96GB, cloud is usually the right call for serious agent work.


Hardware that fits this use case


Related guides


Next step

Try the planner with AI agents · tool use preselected

The planner pulls all six dimensions together — your hardware, your VRAM/RAM, your GPU family, your context, and your priorities — and returns specific picks with fit badges.


Notes flagged for next refresh

Devstral Small 2 (Mistral, Feb 25 2026, 24B Apache 2.0, 256K context, built specifically for agentic loops) is flagged for next quarterly refresh as a planner pick addition.