- PP: 6,958 tok/s
- TG: 194 tok/s
- Peak mem: ~19 GB
MoE pays off — 35B total, 3B active. TG runs at small-model speed; PP scales with the dense compute budget.
CALIBRATION · 12 ROWS · VERIFIED 2026-04-09 → 2026-04-14
12 model × hardware combinations cross-verified against published benchmark data, republished here as a single dated audit log — exact runner, quant, Flash Attention setting, context length, and per-row verification dates.
The editorial verdicts on the planner have to land within shouting distance of measured reality. This is the audit log. Every row below was cross-verified between 2026-04-09 and 2026-04-14 against ≥2 published community sources — see the References section at the bottom — using the runner and quant stated. Single-stream throughput, not batched. PP = prompt-processing tok/s; TG = generation tok/s. Where a number is reported as a band, it’s because the published figures varied meaningfully across reproductions — and saying so is honest.
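To make the PP and TG definitions concrete, here is a minimal sketch of how the two figures combine into wall-clock time for one single-stream request. The function name and the prompt and output sizes are hypothetical; the throughput values are the PP and TG figures quoted at the top of this page. It treats both rates as constants, which is a simplification, since throughput can fall as context grows.

```python
def estimate_seconds(prompt_tokens, output_tokens, pp_tok_s, tg_tok_s):
    """Split a single-stream request into prefill and generation time."""
    if pp_tok_s <= 0 or tg_tok_s <= 0:
        raise ValueError("throughput must be positive")
    prefill_s = prompt_tokens / pp_tok_s  # time spent processing the prompt
    decode_s = output_tokens / tg_tok_s   # time spent generating the reply
    return prefill_s, decode_s


# Hypothetical request: 4,000 prompt tokens in, 500 tokens out.
prefill, decode = estimate_seconds(4_000, 500, pp_tok_s=6_958, tg_tok_s=194)
print(f"prefill {prefill:.2f} s, decode {decode:.2f} s")
```

Prefill time is roughly the time to first token, so a low PP figure hurts most on long prompts even when TG looks healthy.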
How to read this
The table
NVIDIA RTX 5090 (32 GB) · Qwen3 32B (dense)
Reference dense baseline. Compare to the 30B-A3B MoE row above: at roughly the same weight footprint, the MoE delivers ~3× the TG.
NVIDIA RTX 5090 (32 GB) · Qwen3moe 30B-A3B
Long-context reality check. With 147K tokens of context loaded, the KV cache dominates VRAM and both PP and TG drop sharply vs the 4K case. This is what coding agents on 100K+ token codebases actually see.
NVIDIA RTX 4090 (24 GB) · Qwen3 14B
24 GB RTX 4090 hits its sweet spot at 14B Q4 with room for 16K context. 70B Q4 (~40 GB) does NOT fit — dual-GPU or 48 GB+ required.
2× AMD RX 7900 XTX (48 GB total) · Llama 3.1 70B Instruct
70B Q4 the way it actually fits on consumer hardware: tensor-parallel across two 24 GB AMD cards on ROCm 7.1. A single 24 GB card cannot hold this model at Q4.
AMD RDNA3 single-card (Vulkan) · Qwen3 30B-A3B
The Vulkan backend is genuinely competitive on RDNA3 for MoE picks. Tested on the AI PRO R9700 (gfx1100); the 7900 XTX, from the same family, behaves the same. Community benchmarks on the sibling Qwen3.5-35B-A3B Q4 Vulkan show roughly half our TG (~95 vs our 183, about 48% lower). Flagged for re-verification with matched quant + context.
Honest Mac dense-27B number. PP is ~10× lower than the 5090 — Mac prefill on long prompts is the real friction. TG holds up at ~20 tok/s, which still reads as fast in interactive chat.
M5 Max MacBook Pro (128 GB) · Qwen 3.5 122B-A10B
The unified-memory unlock. A 122B-class model that simply cannot run on consumer NVIDIA, yet here it does 60 tok/s. This is the row where Mac picks pay off. 64 GB Mac users: this row is aspirational, you need 96 GB+.
NVIDIA DGX Spark (128 GB unified) · Qwen 3.5 122B-A10B
GB10 platform with hybrid INT4 + FP8 + MTP-1 patches. Capacity-first hardware: the 122B fits comfortably in 128 GB unified, and 38 tok/s is genuinely usable. Baseline INT4 alone runs ~28 tok/s; the patches add the rest.
NVIDIA RTX 5060 Ti (16 GB) · Llama 3.1 8B Instruct
The "$550 sweet spot" verified. 8B Q4 at ~60 tok/s, with 11 GB headroom for context or a second model. Time-to-first-token ~565 ms.
Mac Mini M4 base (16 GB unified) · Llama 3.1 8B
Reported as a band: 28–32 tok/s across 4 prompt patterns. The $499 Apple machine that genuinely runs an 8B-class model. Hard ceiling: 16 GB shared between OS, browser, IDE, and the model.
Intel Arc B580 (12 GB) · Llama 3.1 8B
Backend matters more than the card here. Vulkan: ~62 tok/s. SYCL via IPEX-LLM (archived as of January 28, 2026): 25–30 tok/s. The hardware is fine; the software stack is the question.
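The fit and headroom claims in the rows above come down to simple arithmetic, sketched below under stated assumptions. The ~4.5 bits per weight for a Q4-style quant, the flat 1.5 GB runtime overhead, the FP16 KV cache, and the architecture values (48 layers, 4 KV heads, head dimension 128) are illustrative assumptions, not figures from this page or from any specific model card.

```python
def weight_gb(params_billion, bits_per_weight=4.5):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    # Assumption: ~4.5 bits/weight as a rough average for a Q4-style quant.
    return params_billion * bits_per_weight / 8


def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size in GB for a grouped-query attention model."""
    # The factor 2 covers keys and values: one vector per layer, KV head, token.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9


def fits(params_billion, vram_gb, kv_gb=0.0, overhead_gb=1.5):
    """True if weights + KV cache + a flat runtime overhead fit in VRAM."""
    return weight_gb(params_billion) + kv_gb + overhead_gb <= vram_gb


# Hypothetical architecture values, chosen only to illustrate the arithmetic.
ARCH = dict(layers=48, kv_heads=4, head_dim=128)

print(f"70B Q4 weights: {weight_gb(70):.1f} GB")  # 39.4 GB, near the ~40 GB quoted
print(f"70B fits 24 GB: {fits(70, 24)}, fits 48 GB: {fits(70, 48)}")
print(f"KV cache @ 4K:   {kv_cache_gb(4_096, **ARCH):.2f} GB")
print(f"KV cache @ 147K: {kv_cache_gb(147_000, **ARCH):.2f} GB")
```

Under these assumptions the KV cache grows from well under 1 GB at 4K to roughly 14 GB at 147K, which is the same order as the weights of a 30B-class Q4 model. Real runners add their own buffers and may quantize the cache, so treat the output as a first-pass estimate only.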
Caveats
References — community benchmark sources
These are the sources we cross-check against. Per-row entries typically draw from 2–3 of these; vendor-specific rows (DGX Spark, M5 Max, Strix Halo) lean harder on the vendor + community follow-up combination.
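As an illustration of how 2–3 published figures for one row might collapse into either a single number or a band, here is a minimal sketch. The function name and the 10% relative-spread threshold are assumptions made for this example; they are not the rule actually used to build the table.

```python
def summarize(readings, rel_spread_threshold=0.10):
    """Return a single tok/s figure or a low–high band for one table row."""
    if len(readings) < 2:
        raise ValueError("need at least two independent readings")
    lo, hi = min(readings), max(readings)
    mid = (lo + hi) / 2
    # Relative spread: width of the range as a fraction of its midpoint.
    spread = (hi - lo) / mid
    if spread <= rel_spread_threshold:
        mean = sum(readings) / len(readings)
        return f"~{round(mean)} tok/s"
    return f"{round(lo)}–{round(hi)} tok/s"


print(summarize([60, 62]))      # tight agreement: single figure
print(summarize([28, 30, 32]))  # wider disagreement: reported as a band
```

Requiring at least two readings mirrors the ≥2-source rule stated in the introduction; the threshold itself is a judgment call and would need tuning against real reproductions.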
Want a row updated, added, or corrected? Send a reproducible benchmark — model, quant, runner, hardware, prompt, measured PP and TG — and we’ll cross-verify against the existing sources and either update the row or add a new one.
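A submission with the fields listed above could be captured in a small record like the following. This is a hypothetical structure for illustration only: the class name, field names, and checks are assumptions, not a format this site requires, and the example values are placeholders rather than measurements.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BenchmarkSubmission:
    model: str
    quant: str
    runner: str
    hardware: str
    prompt: str
    pp_tok_s: float  # measured prompt-processing throughput
    tg_tok_s: float  # measured generation throughput

    def problems(self):
        """Return a list of human-readable issues; empty means complete."""
        issues = []
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, str) and not value.strip():
                issues.append(f"{f.name} is empty")
        for name in ("pp_tok_s", "tg_tok_s"):
            if getattr(self, name) <= 0:
                issues.append(f"{name} must be positive")
        return issues


# Placeholder values only; none of these are measured results.
draft = BenchmarkSubmission(
    model="example-model-8b", quant="Q4", runner="", hardware="example GPU",
    prompt="fixed 512-token prompt", pp_tok_s=1000.0, tg_tok_s=50.0,
)
print(draft.problems())
```

Checking for completeness up front makes it easier to compare a new figure against existing rows, since runner, quant, and context are the usual sources of mismatch.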