the AI bench
VERIFIED JUNE 2026

CALIBRATION · 12 ROWS · VERIFIED 2026-04-09 → 2026-04-14

What the community has measured.

12 model × hardware combinations cross-verified against published benchmark data, republished here as a single dated audit log — exact runner, quant, Flash Attention setting, context length, and per-row verification dates.

The editorial verdicts on the planner have to land within shouting distance of measured reality. This is the audit log. Every row below was cross-verified between 2026-04-09 and 2026-04-14 against ≥2 published community sources — see the References section at the bottom — using the runner and quant stated. Single-stream throughput, not batched. PP = prompt-processing tok/s; TG = generation tok/s. Where a number is reported as a band, it’s because the published figures varied meaningfully across reproductions — and saying so is honest.



How to read this

  • MoE vs dense at the same total weight is the headline story for the 24 GB+ tier. Compare the RTX 5090 Qwen3 32B dense row to the Qwen 3.5 35B-A3B row above it: nearly the same memory (~18.6 GB vs ~19 GB), ~3× the TG (61 vs 194 tok/s).
  • Mac prefill is the real friction. The M5 Max Qwen 3.5 27B row shows PP ~10× lower than the fastest RTX 5090 row (686 vs 6,958 tok/s), while TG stays usable at ~20 tok/s. Long-prompt latency is what Mac users actually feel.
  • AMD with the right backend is competitive. The RDNA3 Vulkan row on Qwen3 30B-A3B sits within ~10% of NVIDIA on TG (183 vs 194 tok/s on the RTX 5090 MoE row). The friction is software, not silicon.
  • Capacity-first hardware (DGX Spark, M5 Max 128 GB) wins what NVIDIA consumer cards can’t do at all — running a 122B-class model on a single device. 38–60 tok/s is genuinely usable for the right workload.
  • Backend choice can swing 2–3×. The Intel Arc B580 row reports a band because Vulkan delivers ~62 tok/s and the (now-archived) IPEX-LLM SYCL path delivers ~25–30. Same card, different software story.
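
The ratios quoted in the bullets above can be rechecked from the table figures. The sketch below is only an illustration: the numbers are copied from the 4K-context RTX 5090 rows, the RDNA3 Vulkan row, the M5 Max dense row, and the Arc B580 note, while the dictionary layout and the names `rows` and `ratio` are assumptions made for this example.

```python
# Figures copied from the table; the data layout is illustrative only.
rows = {
    "rtx5090_qwen35_35b_a3b": {"pp": 6958, "tg": 194.0},
    "rtx5090_qwen3_32b_dense": {"pp": 2931, "tg": 61.0},
    "rdna3_vulkan_qwen3_30b_a3b": {"pp": 3033, "tg": 183.0},
    "m5max_qwen35_27b_dense": {"pp": 686, "tg": 20.3},
}


def ratio(a: float, b: float) -> float:
    """Return a / b rounded to two decimals."""
    return round(a / b, 2)


# MoE vs dense generation speed on the same card.
moe_vs_dense_tg = ratio(rows["rtx5090_qwen35_35b_a3b"]["tg"],
                        rows["rtx5090_qwen3_32b_dense"]["tg"])

# Prefill gap between the fastest RTX 5090 row and the M5 Max dense row.
mac_pp_gap = ratio(rows["rtx5090_qwen35_35b_a3b"]["pp"],
                   rows["m5max_qwen35_27b_dense"]["pp"])

# AMD Vulkan TG as a fraction of the RTX 5090 MoE row.
amd_vs_nvidia_tg = ratio(rows["rdna3_vulkan_qwen3_30b_a3b"]["tg"],
                         rows["rtx5090_qwen35_35b_a3b"]["tg"])

# Arc B580: Vulkan (~62 tok/s) against the SYCL band (25 to 30 tok/s).
arc_swing_low = ratio(62, 30)
arc_swing_high = ratio(62, 25)

print(moe_vs_dense_tg, mac_pp_gap, amd_vs_nvidia_tg, arc_swing_low, arc_swing_high)
```

Note that the MoE and dense rows are different models, so these ratios describe the rows as published rather than a controlled comparison.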

The table

NVIDIA RTX 5090 (32 GB) · Qwen 3.5 35B-A3B

VERIFIED 2026-04-09 · llama.cpp (CUDA 13.0) · Q4_K_XL · FA on · 4K ctx

PP 6,958 tok/s · TG 194 tok/s · Peak mem ~19 GB

MoE pays off — 35B total, 3B active. TG runs at small-model speed; PP scales with the dense compute budget.

NVIDIA RTX 5090 (32 GB) · Qwen3 32B (dense)

VERIFIED 2026-04-09 · llama.cpp (CUDA 13.0) · Q4_K_XL · FA on · 4K ctx

PP 2,931 tok/s · TG 61 tok/s · Peak mem ~18.6 GB

Reference dense baseline. Compare to the Qwen 3.5 35B-A3B MoE row above: similar weight footprint, ~3× the TG.

NVIDIA RTX 5090 (32 GB) · Qwen3moe 30B-A3B

VERIFIED 2026-04-10 · llama.cpp (CUDA 13.0) · Q4_K_XL · FA on · 147K ctx

PP 666 tok/s · TG 52 tok/s · Peak mem ~31 GB

Long-context reality check. 147K loaded, KV cache dominates VRAM, both PP and TG drop sharply vs the 4K case. This is what coding agents on 100K+ codebases actually see.
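
Why the KV cache dominates at long context can be sketched with the usual size formula: two tensors (K and V) per layer, each holding `n_kv_heads × head_dim` elements per token. The architecture values below (48 layers, 4 KV heads, head dimension 128), the fp16 cache, and reading "147K" as 147,000 tokens are all assumptions for illustration; none of them is stated in this audit log.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for n_tokens of context."""
    # The factor of 2 covers the separate K and V tensors in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


# ASSUMED architecture and cache precision (not taken from the audit log).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 4, 128

kv_4k = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, 4_096)
kv_147k = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, 147_000)

print(f"KV cache: {kv_4k / 1e9:.2f} GB at 4K, {kv_147k / 1e9:.2f} GB at 147K")
```

Under these assumptions the cache grows from well under 1 GB at 4K to roughly 14 GB at 147K, which is the same order as the Q4 weights themselves.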

NVIDIA RTX 4090 (24 GB) · Qwen3 14B

VERIFIED 2026-04-11 · llama.cpp (CUDA 12.8) · Q4_K_XL · FA on · 16K ctx

PP 3,928 tok/s · TG 69 tok/s · Peak mem ~10 GB

24 GB RTX 4090 hits its sweet spot at 14B Q4 with room for 16K context. 70B Q4 (~40 GB) does NOT fit — dual-GPU or 48 GB+ required.
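
The "fits or does not fit" call can be approximated from parameter count and bits per weight. This is a rough sketch: the 4.5 bits-per-weight figure for a Q4-class quant and the 2 GB allowance for KV cache and runtime buffers are assumptions, and real Q4_K files vary by model and quant variant.

```python
def weight_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * bits_per_weight / 8


def fits(n_params_billion: float, bits_per_weight: float,
         vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus an assumed overhead fit in the given VRAM."""
    return weight_gb(n_params_billion, bits_per_weight) + overhead_gb <= vram_gb


Q4_BITS = 4.5  # ASSUMED average bits per weight for a Q4-class quant

print(weight_gb(14, Q4_BITS), fits(14, Q4_BITS, vram_gb=24))
print(weight_gb(70, Q4_BITS), fits(70, Q4_BITS, vram_gb=24))
print(fits(70, Q4_BITS, vram_gb=48))
```

With these inputs the 70B estimate lands near 40 GB, in line with the figure quoted above, and clears 48 GB but not 24 GB.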

2× AMD RX 7900 XTX (48 GB total) · Llama 3.1 70B Instruct

VERIFIED 2026-04-12 · llama.cpp (ROCm 7.1.1) · Q4_K_M · FA on · 4K ctx

PP 341 tok/s · TG 13.4 tok/s · Peak mem ~39.6 GB

70B Q4 the way it actually fits on consumer hardware: tensor-parallel across two 24 GB AMD cards on ROCm 7.1. Single 24 GB card cannot hold this model at Q4.

AMD RDNA3 single-card (Vulkan) · Qwen3 30B-A3B

VERIFIED 2026-04-12 · llama.cpp (Vulkan backend) · Q4_K_M · FA on · 4K ctx

PP 3,033 tok/s · TG 183 tok/s · Peak mem ~17.3 GB

Vulkan backend is genuinely competitive on RDNA3 for MoE picks. Tested on the AI PRO R9700 (gfx1100); the 7900 XTX is in the same family and behaves similarly. Community benchmarks on the sibling Qwen3.5-35B-A3B Q4 Vulkan show roughly half the TG (~95 vs our 183); flagged for re-verification with matched quant + context.

M5 Max MacBook Pro (128 GB) · Qwen 3.5 27B (dense)

VERIFIED 2026-04-13 · mlx_lm · 6-bit MLX · FA off · 16K ctx

PP 686 tok/s · TG 20.3 tok/s · Peak mem ~21 GB unified

Honest Mac dense-27B number. PP is ~10× lower than the 5090 — Mac prefill on long prompts is the real friction. TG holds up at ~20 tok/s, which still reads as fast in interactive chat.
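
To see what the prefill gap means in wall-clock terms, a response can be split into prompt time (tokens ÷ PP) and generation time (tokens ÷ TG). The sketch below uses the PP and TG figures from this row plus the RTX 5090 35B-A3B PP figure; the 16,000-token prompt and 500-token reply are hypothetical, and treating PP as constant across the whole prompt is a simplifying assumption that flatters both machines.

```python
def prefill_seconds(prompt_tokens: int, pp_tok_s: float) -> float:
    """Time spent processing the prompt before the first token appears."""
    return prompt_tokens / pp_tok_s


def decode_seconds(output_tokens: int, tg_tok_s: float) -> float:
    """Time spent generating the reply once prefill is done."""
    return output_tokens / tg_tok_s


PROMPT_TOKENS = 16_000  # hypothetical long prompt
OUTPUT_TOKENS = 500     # hypothetical reply length

mac_prefill = prefill_seconds(PROMPT_TOKENS, 686)    # M5 Max dense-27B PP
mac_decode = decode_seconds(OUTPUT_TOKENS, 20.3)     # M5 Max dense-27B TG
rtx_prefill = prefill_seconds(PROMPT_TOKENS, 6958)   # RTX 5090 35B-A3B PP

print(f"M5 Max: {mac_prefill:.1f} s prefill + {mac_decode:.1f} s decode")
print(f"RTX 5090 (different model): {rtx_prefill:.1f} s prefill")
```

On these inputs the Mac spends about as long reading the prompt as it does writing the reply, which is the friction the note describes.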

M5 Max MacBook Pro (128 GB) · Qwen 3.5 122B-A10B

VERIFIED 2026-04-13 · mlx_lm · 4-bit MLX · FA off · 16K ctx

PP 1,239 tok/s · TG 60.6 tok/s · Peak mem ~74 GB unified

The unified-memory unlock. A 122B-class model that simply cannot run on a single consumer NVIDIA card, and it does 60 tok/s. This is the workload where Mac picks pay off. 64 GB Mac users: this row is aspirational; you need 96 GB+.

NVIDIA DGX Spark (128 GB unified) · Qwen 3.5 122B-A10B

VERIFIED 2026-04-14 · vLLM 0.19 + FlashInfer · INT4 (AutoRound) · FA on · 4K ctx

PP not listed · TG 38.4 tok/s · Peak mem ~74 GB unified

GB10 platform with hybrid INT4 + FP8 + MTP-1 patches. Capacity-first hardware: the 122B fits comfortably in 128 GB unified, and 38 tok/s is genuinely usable. Baseline INT4 alone runs ~28 tok/s; the patches add the rest.

NVIDIA RTX 5060 Ti (16 GB) · Llama 3.1 8B Instruct

VERIFIED 2026-04-14 · llama.cpp (CUDA 12.8) · Q4_K_M · FA on · 4K ctx

PP 2,387 tok/s · TG 59.9 tok/s · Peak mem ~5 GB

The "$550 sweet spot" verified. 8B Q4 at ~60 tok/s, with 11 GB headroom for context or a second model. Time-to-first-token ~565 ms.

Mac Mini M4 base (16 GB unified) · Llama 3.1 8B

VERIFIED 2026-04-14 · Ollama (Metal; MLX backend requires 32 GB+) · Q4_K_M · FA off · 4K ctx

PP not listed · TG 28–32 tok/s · Peak mem ~6 GB unified

Reported as a band: 28–32 tok/s across 4 prompt patterns. The $499 Apple machine that genuinely runs an 8B-class model. Hard ceiling: 16 GB shared between OS, browser, IDE, and the model.

Intel Arc B580 (12 GB) · Llama 3.1 8B

VERIFIED 2026-04-14 · llama.cpp (Vulkan) · Q4_K_M · FA off · 4K ctx

PP not listed · TG 25–62 tok/s · Peak mem ~5 GB

Backend matters more than the card here. Vulkan: ~62 tok/s. SYCL via IPEX-LLM (archived as of January 28, 2026): 25–30 tok/s. The hardware is fine; the software stack is the question.


Caveats

  • This is an audit log, not original research. We cross-check published community figures and record the date of each cross-check. Where ≥2 sources agreed within ~5% we adopt the figure; where they disagreed we publish a band. The source list per category is in the References section below.
  • Single-stream throughput only. Batched inference wins meaningfully at 4–8 concurrent requests; the numbers here are for one user, one prompt at a time — matching how community sources typically report.
  • Llama-bench convention for PP and TG (pp512 and tg128 equivalents). MLX runs use the analogous mlx_lm metrics. Published source figures are typically the median of 3+ warmed-up runs in their original reports; FA setting stated per row.
  • Context length matters more than people think. 4K context numbers are not 32K context numbers. Where the context-length effect is the headline (the RTX 5090 30B-A3B long-context row), we publish both.
  • Each row is dated independently. When a published figure goes stale (new runner, new quant, new model version), we cross-verify the new number and update that row with a new date, leaving the others alone.
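
The adopt-or-band rule from the first caveat can be sketched as a small function. This is one possible reading, not the log's actual procedure: measuring the spread relative to the median and adopting the median as "the figure" are both assumptions, as is the name `adopt_or_band`. The first sample list is hypothetical; the second uses the Arc B580 Vulkan and SYCL figures quoted above.

```python
from statistics import median


def adopt_or_band(figures: list[float], tolerance: float = 0.05):
    """Return ("figure", value) if sources agree within tolerance, else ("band", (lo, hi))."""
    if len(figures) < 2:
        raise ValueError("need at least two published sources")
    lo, hi = min(figures), max(figures)
    mid = median(figures)
    # ASSUMPTION: agreement is judged as (max - min) relative to the median.
    if (hi - lo) / mid <= tolerance:
        return ("figure", mid)
    return ("band", (lo, hi))


# Hypothetical sources that agree closely: a single figure is adopted.
print(adopt_or_band([59.1, 59.9, 60.4]))

# Arc B580 figures across backends: too far apart, so a band is published.
print(adopt_or_band([25, 30, 62]))
```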

References — community benchmark sources

These are the sources we cross-check against. Per-row entries typically draw from 2–3 of these; vendor-specific rows (DGX Spark, M5 Max, Strix Halo) lean harder on the vendor + community follow-up combination.

  • LocalScore — community-submitted llama.cpp benchmark database with per-accelerator pages. Primary source for RTX 5090 / 5060 Ti / 5070 Ti / RTX 3090 figures.
  • Hardware Corner — independent local-LLM hardware benchmarks. Primary source for Mac Apple Silicon (M3/M4/M5) figures across the Mac product line.
  • llama.cpp GitHub Discussions + Issues — runner-author + maintainer benchmarks and the canonical place to find AMD ROCm vs Vulkan back-and-forth.
  • NVIDIA Developer Forums — vendor-published DGX Spark throughput figures + community reproductions on driver versions (e.g. v2.1 patches for the 122B-A10B benchmark).
  • Ollama blog — runner-author throughput claims for new backends (Metal, MLX, ROCm) at release.
  • HuggingFace model card discussions on the canonical model repos (e.g. Qwen3.5-35B-A3B discussions, ubergarm GGUF quants for community llama-bench results).
  • r/LocalLLaMA benchmark threads — community lived experience, especially useful for catching regressions in newer driver/runner builds.
  • Databasemart benchmarks — server-hosted multi-GPU figures (dual 5090, dual A6000).

Want a row updated, added, or corrected? Send a reproducible benchmark — model, quant, runner, hardware, prompt, measured PP and TG — and we’ll cross-verify against the existing sources and either update the row or add a new one.
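
A submission carrying the fields listed above could be structured as below. This is a hypothetical shape for illustration, not a format the site is known to accept: the class name, field names, and checks are assumptions, and the example values are copied from the RTX 5060 Ti row except for the prompt description, which is made up.

```python
from dataclasses import dataclass, fields


@dataclass
class Submission:
    model: str
    quant: str
    runner: str
    hardware: str
    prompt: str
    pp_tok_s: float
    tg_tok_s: float


def problems(sub: Submission) -> list[str]:
    """List what is missing or implausible in a submission."""
    issues = []
    for f in fields(sub):
        value = getattr(sub, f.name)
        if isinstance(value, str) and not value.strip():
            issues.append(f"{f.name} is empty")
    if sub.pp_tok_s <= 0:
        issues.append("pp_tok_s must be positive")
    if sub.tg_tok_s <= 0:
        issues.append("tg_tok_s must be positive")
    return issues


example = Submission(
    model="Llama 3.1 8B Instruct",
    quant="Q4_K_M",
    runner="llama.cpp (CUDA 12.8)",
    hardware="NVIDIA RTX 5060 Ti (16 GB)",
    prompt="512-token prompt, 128-token generation",  # hypothetical description
    pp_tok_s=2387,
    tg_tok_s=59.9,
)
print(problems(example))
```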
