VERIFIED JULY 2026

CALIBRATION · 13 ROWS · VERIFIED 2026-07-13 → 2026-07-17

What the community has measured.

13 model × hardware combinations cross-verified against published benchmark data, republished here as a single dated audit log — exact runner, quant, Flash Attention setting, context length, and per-row verification dates.

The editorial verdicts on the planner have to land within shouting distance of measured reality. This is the audit log. Every row below was cross-verified between 2026-07-13 and 2026-07-17 against its primary source — linked under every row, so you can check us — using the runner and quant stated. Single-stream throughput, not batched. PP = prompt-processing tok/s; TG = generation tok/s. Where a number is reported as a band, it’s because the published figures varied meaningfully across reproductions — and saying so is honest.

Rebuilt July 2026. The previous table benchmarked Llama 3.1 and Qwen3 — models the planner had stopped recommending. It measured things we don’t pick and skipped things we do, which makes it decoration rather than evidence. Every row is now a current pick, and every row carries its own source link and the date that source ran the test. Two dates matter and we show both: when the benchmark was measured, and when we last verified it against the source.

What is deliberately not here matters too. There are no rows for the Mac Studio M3 Ultra 96 GB, the Mac mini M4, or the Intel Arc B580, because no credible current-model benchmark exists for them — and we would rather show a gap than an estimate. Several sites will happily give you a number for those combinations; at least one generates them from a formula and says so in its own footer. A missing row beats an invented one.

Popular companion pages: the Mac Studio M3 Ultra 96 GB workstation, the AMD ROCm guide, and the find-by-model hardware lookup.

How to read this

MoE vs dense is the whole ballgame at 24 GB+. On one used RTX 3090, Gemma 4’s 26B MoE generates 64 tok/s while holding 256K of context; Gemma 4’s 31B dense manages 31 tok/s at 32K. Same family, same vendor, same quant, same card — the architecture is the entire difference. This is why our picks lean MoE.
A MacBook out-decodes a $4,699 DGX Spark. Identical model, identical quant (gpt-oss-20b MXFP4): M4 Max 118 tok/s, DGX Spark 86. The Spark wins prompt processing and wins capacity outright — it runs gpt-oss-120b at 42 tok/s, which no consumer card in this table can load at all. But on generation, Apple’s memory bandwidth simply wins. Buy the Spark for what fits, not for what’s fast.
Context is the tax nobody quotes you. The 3090 runs gpt-oss-20b at 148 tok/s at 4K and 62 tok/s at 128K. Same card, same model, less than half the speed. Any benchmark quoted without a context depth is quoting you its best case.
A runner build can beat a hardware upgrade. The DGX Spark row is the cautionary one: prompt processing went from 2,009 to 3,798 tok/s on the same box and model, purely by moving llama.cpp from build 6771 to 7067. That is an 89% gain from a software bump. It is also why every row here pins a runner — a throughput number without a build is a rumour.
The gaps are part of the data. There is no Mac Studio M3 Ultra row, no Mac mini row, no Arc B580 row, and no AMD row, because we could not verify a current-model benchmark for them to the standard above. Those numbers are easy to find elsewhere and mostly should not be trusted — at least one popular source generates them from a formula and admits it in its footer.
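The ratios quoted in the notes above can be rechecked with a few lines of arithmetic. This is a minimal sketch: the figures are copied from the rows in this log, and the helper name `pct_change` is illustrative rather than part of any tool used here.

```python
def pct_change(before: float, after: float) -> float:
    """Percentage change going from `before` to `after`."""
    return (after - before) / before * 100.0

# DGX Spark, gpt-oss-20b prompt processing: llama.cpp build 6771 -> 7067
build_gain = pct_change(2009, 3798)

# RTX 3090, gpt-oss-20b generation: 4K context -> 128K context
context_tax = pct_change(148, 62)

# RTX 3090, Gemma 4 generation: 26B MoE (256K ctx) over 31B dense (32K ctx)
moe_over_dense = 64 / 31

print(f"build gain:   {build_gain:+.0f}%")      # +89%
print(f"context tax:  {context_tax:+.0f}%")     # -58%
print(f"MoE vs dense: {moe_over_dense:.1f}x")   # 2.1x
```

Rerunning the same arithmetic on any third-party number is a quick way to see whether a quoted percentage matches the figures it claims to summarise.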

The table

NVIDIA RTX 5090 (32 GB) · Qwen 3.5 35B-A3B

VERIFIED 2026-07-17 · llama.cpp (CUDA) · MXFP4 · FA on · 256K ctx · MEASURED March 2026

PP: 2,004 tok/s
TG: 97 tok/s
Peak mem: ~20 GB

The reason this card is the pick: a 35B MoE holding a quarter-million tokens of context and still generating at ~97 tok/s. At 4K the same model prompt-processes at 6,605 tok/s.

Source: Hardware Corner — RTX 5090

NVIDIA RTX 5090 (32 GB) · Gemma 4 31B (dense)

VERIFIED 2026-07-17 · llama.cpp (CUDA) · Q4_K · FA on · 128K ctx · MEASURED March 2026

PP: 900 tok/s
TG: 43 tok/s
Peak mem: ~19 GB

Dense costs you speed: 43 tok/s here vs 97 for the 35B-A3B MoE above on the same card, even though this row runs at half the context (128K vs 256K). This row is why our picks lean MoE.

Source: Hardware Corner — RTX 5090

NVIDIA RTX 3090 (24 GB, used) · gpt-oss-20b

VERIFIED 2026-07-17 · llama.cpp (CUDA) · MXFP4 · FA on · 4K ctx · MEASURED March 2026

PP: 4,400 tok/s
TG: 148 tok/s
Peak mem: ~11.3 GB

A card that launched in 2020 still turning in 148 tok/s. At 128K context the same model holds 62 tok/s, and the used 3090 remains the best throughput-per-dollar in this table.

Source: Hardware Corner — RTX 3090

NVIDIA RTX 3090 (24 GB, used) · Qwen 3.5 35B-A3B

VERIFIED 2026-07-17 · llama.cpp (CUDA) · MXFP4 · FA on · 128K ctx · MEASURED March 2026

PP: 1,289 tok/s
TG: 79 tok/s
Peak mem: ~20 GB

35B-A3B at 128K on a used 24 GB card at 79 tok/s. This is the single row that best explains why we tell people to buy a second-hand 3090 before a new mid-range card.

Source: Hardware Corner — RTX 3090

NVIDIA RTX 3090 (24 GB, used) · Gemma 4 26B-A4B (MoE)

VERIFIED 2026-07-17 · llama.cpp (CUDA) · Q4_K · FA on · 256K ctx · MEASURED March 2026

PP: 671 tok/s
TG: 64 tok/s
Peak mem: ~16 GB

256K context on a 24 GB card, still at 64 tok/s. At 4K the same model runs 119 tok/s — so you pay roughly half your speed for 64× the context.

Source: Hardware Corner — RTX 3090

NVIDIA RTX 3090 (24 GB, used) · Gemma 4 31B (dense)

VERIFIED 2026-07-17 · llama.cpp (CUDA) · Q4_K · FA on · 32K ctx · MEASURED March 2026

PP: 724 tok/s
TG: 31 tok/s
Peak mem: ~19 GB

The dense/MoE contrast on one card: 31 tok/s dense at 32K vs 64 tok/s for the 26B MoE at 256K. Same family, same vendor, same quant — the architecture is the whole difference.

Source: Hardware Corner — RTX 3090

NVIDIA RTX 5060 Ti (16 GB) · gpt-oss-20b

VERIFIED 2026-07-17 · llama.cpp (CUDA) · MXFP4 · FA on · 128K ctx · MEASURED March 2026

PP: 685 tok/s
TG: 44 tok/s
Peak mem: ~11.3 GB

The cheapest card here running a 20B at full 128K context, comfortably interactive. At 4K it prompt-processes at 3,585 tok/s.

Source: Hardware Corner — RTX 5060 Ti 16GB

NVIDIA RTX 5060 Ti (16 GB) · Qwen3-14B

VERIFIED 2026-07-17 · llama.cpp (CUDA) · Q4_K · FA on · 32K ctx · MEASURED March 2026

PP: 621 tok/s
TG: 26 tok/s
Peak mem: ~9 GB

26 tok/s is usable but no longer snappy — this is roughly where the 16 GB tier starts to feel its limits on dense models.

Source: Hardware Corner — RTX 5060 Ti 16GB

NVIDIA DGX Spark (128 GB) · Qwen 3.5 35B-A3B

VERIFIED 2026-07-17 · llama.cpp CUDA (llama-bench, pp2048/tg32) · Q4_K_M · FA on · 2K (pp2048) ctx · MEASURED Oct 2025 – 2026

PP: 2,789 tok/s
TG: 60 tok/s
Peak mem: 20.09 GiB

Note the shape: strong prompt processing, unremarkable generation. The Spark buys capacity, not speed — 273 GB/s of bandwidth is the ceiling on decode.

Source: llama.cpp — DGX Spark performance thread

NVIDIA DGX Spark (128 GB) · Qwen3-Coder-30B-A3B

VERIFIED 2026-07-17 · llama.cpp CUDA (llama-bench, pp2048/tg32) · Q8_0 · FA on · 2K (pp2048) ctx · MEASURED Oct 2025 – 2026

PP: 1,654 tok/s
TG: 44 tok/s
Peak mem: 30.25 GiB

Our coding pick at Q8 — no quantization compromise — because 128 GB means you never have to make one. 44 tok/s is the price of that luxury.

Source: llama.cpp — DGX Spark performance thread

NVIDIA DGX Spark (128 GB) · gpt-oss-120b

VERIFIED 2026-07-17 · llama.cpp CUDA (llama-bench, pp2048/tg32) · MXFP4 · FA on · 2K (pp2048) ctx · MEASURED Oct 2025 – 2026

PP: 967 tok/s
TG: 42 tok/s
Peak mem: 59.02 GiB

A 120B model at 42 tok/s in a box on your desk. No consumer GPU in this table can load it at all — that is the entire argument for this machine.

Source: llama.cpp — DGX Spark performance thread

NVIDIA DGX Spark (128 GB) · gpt-oss-20b

VERIFIED 2026-07-17 · llama.cpp CUDA (build 6771, then 7067) · MXFP4 · FA on · 2K (pp2048) ctx · MEASURED Oct 2025 – 2026

PP: 3,798 tok/s
TG: 86 tok/s
Peak mem: 11.27 GiB

Read this row as a warning about every other row. Same box, same model: prompt processing went 2,009 → 3,798 tok/s purely by moving llama.cpp build 6771 → 7067. Any benchmark without a pinned build is a rumour.

Source: llama.cpp — DGX Spark performance thread

MacBook Pro M4 Max · gpt-oss-20b

VERIFIED 2026-07-17 · llama.cpp Metal (llama-bench, tg128) · MXFP4 · FA on · 2K (pp2048) ctx · MEASURED Oct 2025 – 2026

PP: 1,850 tok/s
TG: 118 tok/s
Peak mem: 11.27 GiB

The most useful comparison in this table: a MacBook generates at 118 tok/s where the $4,699 DGX Spark manages 86 on the identical model and quant. The Spark wins on prompt processing and on capacity; on decode, Apple's memory bandwidth simply wins.

Source: llama.cpp — DGX Spark performance thread (Mac comparison runs)

Caveats

This is an audit log, not original research. We cross-check published community figures and date when we cross-checked. Where ≥2 sources agreed within ~5% we adopt the figure; where they disagreed we publish a band. The source list per category is in the References section below.
Single-stream throughput only. Batched inference wins meaningfully at 4–8 concurrent requests; the numbers here are for one user, one prompt at a time — matching how community sources typically report.
PP and TG follow the llama-bench convention (pp512 and tg128 equivalents, except where a row states its own test sizes, as the DGX Spark rows do with pp2048/tg32). MLX runs use the analogous mlx_lm metrics. Published source figures are typically the median of 3+ warmed-up runs in their original reports; the FA setting is stated per row.
Context length matters more than people think. 4K context numbers are not 32K context numbers. Where the context-length effect is the headline (the RTX 5090 35B-A3B long-context row), we publish both.
Each row is dated independently. When a published figure goes stale (new runner, new quant, new model version), we cross-verify the new number and update that row with a new date, leaving the others alone.
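The agree-or-band rule from the first caveat can be sketched as below. This is an illustration under stated assumptions, not the procedure actually used for this log: the log does not say how "within ~5%" is measured or which figure is adopted, so the sketch assumes the spread is (max - min) / min and that the adopted figure is the median. The function name and the sample numbers are hypothetical.

```python
from statistics import median

def reconcile(figures: list[float], tolerance: float = 0.05):
    """Return one figure if the sources agree, else a (low, high) band.

    Assumptions (not stated in this log): agreement is measured as
    (max - min) / min, and the adopted figure is the median.
    """
    if len(figures) < 2:
        raise ValueError("need at least two sources to cross-verify")
    low, high = min(figures), max(figures)
    if (high - low) / low <= tolerance:
        return median(figures)   # sources agree: adopt a single figure
    return (low, high)           # sources disagree: publish the band

# Hypothetical tok/s figures from independent reproductions.
print(reconcile([100.0, 102.0, 104.0]))  # 102.0 (spread is 4%)
print(reconcile([100.0, 120.0]))         # (100.0, 120.0) (spread is 20%)
```

Measuring the spread against the minimum is the stricter of the obvious choices; measuring against the mean or the maximum would admit slightly wider disagreement before falling back to a band.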

References — community benchmark sources

Every row above links its own primary source directly, so you can check any single number without trusting this list. These are the sources we consider credible enough to draw from in the first place — they publish a methodology, state their runner and build, and report results that survive arithmetic.

What we refuse to cite, and why you should care. A large share of the pages ranking for “[model] on [GPU] tok/s” are not measurements. Some are calculators — WillItRunAI states in its own footer that “all estimates are approximations based on mathematical models and public specifications”, and it returns the same tok/s for a 96 GB and a 256 GB machine because a formula cannot tell them apart. Others are simply broken: we found sites reporting prompt processing an order of magnitude below token generation, which is not physically possible on these runners, and one claiming an RTX 3090 beats an RTX 4090 by 15× on the same workload. A third group reprints Hardware Corner’s figures verbatim and reads like independent corroboration when it is nothing of the sort. We cited an estimate engine ourselves in planner copy until July 2026 — this list exists so that does not happen twice.
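One of the failure modes above, prompt processing an order of magnitude below generation, is easy to screen for mechanically. A minimal sketch, assuming a hypothetical helper name and a default factor of 10 taken from the "order of magnitude" wording; it flags a figure for a closer look and does not prove the figure wrong.

```python
def pp_far_below_tg(pp_tok_s: float, tg_tok_s: float, factor: float = 10.0) -> bool:
    """True when prompt processing is at least `factor` times slower than
    generation, the pattern described above as not physically possible."""
    return pp_tok_s * factor <= tg_tok_s

# RTX 3090 gpt-oss-20b row from this log: PP 4,400 tok/s, TG 148 tok/s.
print(pp_far_below_tg(4400, 148))  # False

# Hypothetical broken listing: PP 12 tok/s against TG 148 tok/s.
print(pp_far_below_tg(12, 148))    # True
```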

LocalScore — community-submitted llama.cpp benchmark database with per-accelerator pages. Primary source for RTX 5090 / 5060 Ti / 5070 Ti / RTX 3090 figures.
Hardware Corner: independent local-LLM hardware benchmarks. Cited directly by the RTX 5090, RTX 3090, and RTX 5060 Ti rows above, and the primary source for Apple Silicon (M3/M4/M5) figures across the Mac product line.
llama.cpp GitHub Discussions + Issues — runner-author + maintainer benchmarks and the canonical place to find AMD ROCm vs Vulkan back-and-forth.
NVIDIA Developer Forums — vendor-published DGX Spark throughput figures + community reproductions on driver versions (e.g. v2.1 patches for the 122B-A10B benchmark).
Ollama blog — runner-author throughput claims for new backends (Metal, MLX, ROCm) at release.
HuggingFace model card discussions on the canonical model repos (e.g. Qwen3.5-35B-A3B discussions, ubergarm GGUF quants for community llama-bench results).
r/LocalLLaMA benchmark threads — community lived experience, especially useful for catching regressions in newer driver/runner builds.
Databasemart benchmarks — server-hosted multi-GPU figures (dual 5090, dual A6000).

Want a row updated, added, or corrected? Send a reproducible benchmark — model, quant, runner, hardware, prompt, measured PP and TG — and we’ll cross-verify against the existing sources and either update the row or add a new one.
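As a rough illustration of what a complete submission carries, the fields listed above can be written down as a small record with a completeness check. The class and field names are hypothetical; there is no submission API implied here. The example values come from the RTX 3090 gpt-oss-20b row, which does not state the prompt used, so that field is left empty on purpose.

```python
from dataclasses import dataclass, fields

@dataclass
class Submission:
    """Hypothetical record mirroring the fields requested above."""
    model: str
    quant: str
    runner: str
    hardware: str
    prompt: str
    pp_tok_s: float
    tg_tok_s: float

    def missing(self) -> list[str]:
        """Names of fields that are empty or not a positive number."""
        out = []
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, str):
                if not value.strip():
                    out.append(f.name)
            elif value is None or value <= 0:
                out.append(f.name)
        return out

s = Submission(
    model="gpt-oss-20b",
    quant="MXFP4",
    runner="llama.cpp (CUDA, FA on)",
    hardware="NVIDIA RTX 3090 (24 GB)",
    prompt="",          # not stated in the row, so it shows up as missing
    pp_tok_s=4400,
    tg_tok_s=148,
)
print(s.missing())  # ['prompt']
```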
