the AI bench
VERIFIED JUNE 2026

METHODOLOGY · HOW WE PICK · JUNE 2026

How the planner makes its calls.

Editorial judgment, dated inputs, deterministic scoring — and a short list of things we deliberately don’t do.

The point of an evidence layer is to make the recommendations auditable. Every pick on this site has a date, a license, a runner, a quant, and a stated tier band. None of it is scraped live. None of it is predicted. The numbers on /methodology/calibration are cross-verified against community benchmark sources (LocalScore, Hardware Corner, llama.cpp benchmark threads, vendor docs) and dated independently — see the References section at the bottom of that page.


1 · How a pick gets on the list

For each use case (coding, chat, docs, image, agents, voice) we shortlist the top 3 models per tier band (top / high / mid / low) against four inputs:

  • Currency — released or substantively updated within the last 6 months. Our edge over chatbots is being more current than their training data, so a 12-month-old pick loses by definition.
  • Community signal — adoption on Hugging Face (download counts, trending), the Ollama library (tag availability), r/LocalLLaMA, and llama.cpp GitHub issue volume on the runner side. We weight lived experience over leaderboard wins; benchmark champions that are painful to actually run get demoted.
  • Hardware fit — the model has to actually run on the tier band’s VRAM budget at a sane quant. A 70B dense model on a 24 GB card is not a pick, no matter how good the model is.
  • License clarity — Apache 2.0, MIT, and other clean OSS licenses get a thumb on the scale over custom or non-commercial weights. We flag NC/research-only weights loudly so commercial users don’t deploy them by accident.
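
The hardware-fit input comes down to simple arithmetic: weights take roughly parameters × bits per weight ÷ 8 bytes, before any KV cache or runtime buffers. A rough Python sketch of that check follows. It is not the site's code; the function names, the 4.5 bits-per-weight figure (about what a basic 4-bit quant costs once scales are included), and the 2 GB overhead allowance are illustrative assumptions.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, vram_gb: float,
         overhead_gb: float = 2.0) -> bool:
    """True if the weights plus a fixed allowance for KV cache and runtime
    buffers fit in VRAM. overhead_gb is a placeholder, not a measured value."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


print(weights_gb(70, 4.5))   # 39.375 (GB of weights alone)
print(fits(70, 4.5, 24))     # False
print(fits(14, 4.5, 24))     # True
```

Under these assumptions a 70B dense model needs about 39 GB for weights alone, which is why it cannot be a pick for a 24 GB card.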

2 · How a setup gets a verdict

The planner converts your inputs (platform, VRAM, RAM, GPU family, use case, priority) into a numeric tier between 0 and 7 via deterministic scoring tables in site/src/planner.js. The same function powers the web UI and the public /api/v1/plan endpoint, so the two cannot drift apart.

  • ≥ 5.5 — “Strong” (sage). Frontier local picks are comfortable.
  • 3.5 – 5.5 — “Comfortable” (slate-blue). The 24 GB-VRAM-or-equivalent tier where MoE picks unlock dense-class quality.
  • 1.5 – 3.5 — “Workable” (amber). 7B–14B dense picks land here; quality is honest but not frontier.
  • < 1.5 — “Cloud-leaning” (stone). Local works for narrow tasks; cloud wins on quality per dollar.
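
The band lookup can be sketched as a short chain of threshold checks. This is a Python illustration only; the real logic lives in site/src/planner.js and may differ. The function name is hypothetical, and treating each lower bound as inclusive (so exactly 3.5 reads as "Comfortable") is an assumption drawn from the "≥ 5.5" and "< 1.5" endpoints.

```python
def verdict_for(tier: float) -> str:
    """Map a numeric tier (0 to 7) to a verdict label.

    Assumption: lower bounds are inclusive, upper bounds exclusive.
    """
    if not 0 <= tier <= 7:
        raise ValueError("tier must be between 0 and 7")
    if tier >= 5.5:
        return "Strong"
    if tier >= 3.5:
        return "Comfortable"
    if tier >= 1.5:
        return "Workable"
    return "Cloud-leaning"


print(verdict_for(5.5))  # Strong
print(verdict_for(3.4))  # Workable
```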

Penalties (Mac prefill on long prompts, AMD ROCm tooling friction, laptop thermals, CPU-only inference) are applied inside the same scoring function and cited in the per-result watchouts.
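
One way to picture penalties living inside the scoring function is the Python sketch below. Everything in it is hypothetical: the flag names, the penalty sizes, and the choice to subtract them from a base tier are placeholders for illustration, not values taken from site/src/planner.js.

```python
# Placeholder penalty sizes; the real values live in site/src/planner.js.
PENALTIES = {
    "mac_long_prompt_prefill": 0.5,
    "amd_rocm_tooling": 0.5,
    "laptop_thermals": 0.5,
    "cpu_only": 2.0,
}


def apply_penalties(base_tier: float, flags: list[str]) -> tuple[float, list[str]]:
    """Subtract each applicable penalty, clamp to the 0-7 range, and return
    the adjusted tier together with the watchouts that were applied."""
    watchouts = [f for f in flags if f in PENALTIES]
    adjusted = base_tier - sum(PENALTIES[f] for f in watchouts)
    return max(0.0, min(7.0, adjusted)), watchouts


tier, watchouts = apply_penalties(4.0, ["amd_rocm_tooling", "laptop_thermals"])
print(tier, watchouts)  # 3.0 ['amd_rocm_tooling', 'laptop_thermals']
```

Returning the applied flags alongside the score is what would let each result cite its own watchouts.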

3 · Refresh cadence

We refresh as the field moves, not on a calendar. Model picks, hardware prices, cloud plans, and OLLAMA_TAGS get re-verified against primary sources whenever something material shifts — and any drift gets a dated entry on /changes/.

Fast takes on major model drops (anything top-5 Arena, anything new from Alibaba / Meta / Google / Mistral / Anthropic / OpenAI) get a published verdict within 24 hours at /changes/drops/YYYY-MM-DD-[model].

4 · Calibration benchmark

Editorial verdicts are checked against measured numbers (the community's, not ours) and republished as a single dated audit log. The current calibration covers 12 model × hardware combinations cross-verified between 2026-04-09 and 2026-04-14 against published benchmark data from sources like LocalScore, Hardware Corner, llama.cpp Discussions, NVIDIA Developer Forums, and vendor docs. Each row records the date we cross-checked the published figures, along with the exact runner, quant, Flash Attention setting, context length, prompt-processing tok/s, generation tok/s, and peak memory.

Read the calibration table at /methodology/calibration.
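
For readers who want to mirror the table locally, one row could be modeled roughly as follows. This is a Python sketch, not the site's schema: the class name, field names, and the GB unit for peak memory are assumptions, and the example values are made-up placeholders, not real measurements.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CalibrationRow:
    model: str
    hardware: str
    runner: str
    quant: str
    flash_attention: bool
    context_length: int
    pp_tok_s: float        # prompt-processing tokens per second
    tg_tok_s: float        # generation tokens per second
    peak_memory_gb: float  # unit is an assumption
    verified_on: date      # when the published figure was cross-checked
    source: str            # e.g. "LocalScore"


# Placeholder values only; these are NOT real benchmark numbers.
row = CalibrationRow(
    model="example-model", hardware="example-gpu", runner="llama.cpp",
    quant="Q4_K_M", flash_attention=True, context_length=8192,
    pp_tok_s=1000.0, tg_tok_s=50.0, peak_memory_gb=10.0,
    verified_on=date(2026, 4, 9), source="LocalScore",
)
print(row.verified_on.isoformat())  # 2026-04-09
```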

5 · What we don’t do

  • No live scraping. No real-time price feeds, no real-time benchmark dashboards. Dated snapshots only. A live page that is intermittently wrong is worse than a snapshot that is right as of a clear “verified June 2026” stamp.
  • No predicted exact tok/s. The planner returns speed bands (e.g. “~150–200 tok/s on this tier”), not a precise number for your specific rig. Speed varies by ~30% with quant, runner, Flash Attention setting, prompt length, and OS noise; pretending otherwise is dishonest.
  • No paid rankings, no sponsored picks. No affiliate buttons anywhere on the site. No vendor-funded audits. This is a structural choice rather than a stance: the editorial voice is the moat, and that kind of revenue would corrupt it.
  • No generic “best model in the world” leaderboard. LMArena, Artificial Analysis, and OpenRouter already own that question. The only ranked surface here is editorial top picks by use case, tied to the refresh cadence.
  • No 500+ model directory. A curated set of ~70 picks is the moat; breadth isn’t.
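
To illustrate the difference between a band and a point estimate, here is a minimal Python sketch that widens a central figure into a rounded range. The function name, the reading of "~30%" as a total spread of plus or minus 15% around the center, and the rounding to the nearest 10 tok/s are all assumptions; the planner's actual band logic is not described here.

```python
def speed_band(center_tok_s: float, spread: float = 0.30, step: int = 10) -> str:
    """Turn a central estimate into a rounded band.

    spread is the total width as a fraction of the center (0.30 means
    plus or minus 15%); step is the rounding granularity in tok/s.
    """
    half = center_tok_s * spread / 2
    low = round((center_tok_s - half) / step) * step
    high = round((center_tok_s + half) / step) * step
    return f"~{low}–{high} tok/s"


print(speed_band(175))  # ~150–200 tok/s
```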

6 · How to push back

If a pick reads wrong on your hardware, the fastest way to change it is to send a reproducible benchmark: model, quant, runner, hardware, prompt, and measured prompt-processing (PP) and generation (TG) tok/s. We’ll cross-verify against the published community sources listed on the calibration page and either update the row or add a new one. The table is a public audit log; corrections are welcome.
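
Before sending a report, it may help to check that nothing on that list is missing. The Python sketch below does only that; the field names and the dictionary shape are hypothetical, since no submission format is specified here, and the example values are placeholders rather than real measurements.

```python
# Hypothetical field names for the items a reproducible report should carry.
REQUIRED_FIELDS = ("model", "quant", "runner", "hardware", "prompt",
                   "pp_tok_s", "tg_tok_s")


def missing_fields(report: dict) -> list[str]:
    """Return the required fields that are absent or empty in a report."""
    return [f for f in REQUIRED_FIELDS if report.get(f) in (None, "")]


# Placeholder values only; the prompt has been left out on purpose.
report = {"model": "example-model", "quant": "Q4_K_M", "runner": "llama.cpp",
          "hardware": "example-gpu", "pp_tok_s": 1000.0, "tg_tok_s": 50.0}
print(missing_fields(report))  # ['prompt']
```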

For API consumers: the same scoring function is at /api/v1/plan with a full OpenAPI 3.1 schema. See /for-agents for copy-paste examples.