MODEL · META · 109B TOTAL / 17B ACTIVE
Llama 4 Scout
The long-context unlock at the frontier tier. Trained at 256K, length-generalized to 10M. Honest framing: 10M works for retrieval (find a clause in 10,000 pages); ~1-2M is the realistic synthesis ceiling. Quantization compounds the limit — 4-bit pushes effective context closer to 5M than 10M. M5 Max 128 GB community measurement: ~30 tok/s thanks to the 17B active footprint.
License: Llama 4 Community License (custom — commercial OK below 700M MAU) · Context: 256K trained, length-generalized to 10M (retrieval-only past ~1-2M) · Released: April 5, 2025
The decision in five lines
- The call
- Skip for local — for docs
- Best for
- docs
- Runs on
- 6 hardware picks fit (cheapest: Framework Desktop (Ryzen AI Max+ 395) · $1,999)
- Watch out
- Also: synthesis tasks across the full 10M (the model degrades materially past ~1-2M for cross-context reasoning).
- Evidence
- Estimated
- 109B total
- PARAMETERS
- MOE
- TYPE
- 256K
- CONTEXT
- ~55 GB (Int4) / ~110 GB (BF16)
- VRAM AT Q4
Where we recommend this
Every tier slot in the planner where this model is a top or alternate pick. Pulled live from planner.js — when the planner refreshes, this table stays current.
The call
The long-context unlock at the frontier tier. Trained at 256K, length-generalized to 10M. Honest framing: 10M works for retrieval (find a clause in 10,000 pages); ~1-2M is the realistic synthesis ceiling. Quantization compounds the limit — 4-bit pushes effective context closer to 5M than 10M. M5 Max 128 GB community measurement: ~30 tok/s thanks to the 17B active footprint.
When not to use: Anything under ~64 GB effective. Also: synthesis tasks across the full 10M (the model degrades materially past ~1-2M for cross-context reasoning). Phase 25 demoted Scout from `docs.top` because the "10M" framing was misleading at 32 GB; at 128 GB the criticism softens but doesn't disappear.
Runner notes
Hosted via OpenRouter and Meta direct API. Local: `meta-llama/Llama-4-Scout-17B-16E-Instruct` on HF, NVIDIA-optimized int4 build for single H100. Ollama tag pending. NVIDIA TensorRT-LLM has the most mature inference path; vLLM works but consumes more memory.
Hardware that fits
Every hardware pick whose memory fits this model at the quant we recommend. Sorted cheapest-first — the top row is your best-value fit. Click through for the full buyer’s guide.
- Framework Desktop (Ryzen AI Max+ 395)Perfect · 1.5× 128 GB unified · $1,999–$2,851
- Mac Studio M4 Max 64 GBRequires tweak · 1.0× 64 GB unified · $3,799
- NVIDIA DGX SparkPerfect · 1.5× 128 GB unified · $4,699
- M5 Max MacBook Pro 64 GBRequires tweak · 1.0× 64 GB unified · ~$5,199 (est.; June 25 2026 increase)
- Mac Studio M3 Ultra 96 GBGood · 1.1× 96 GB unified · $5,299
- Dual RTX 5090Good · 1.1× 64 GB (2×32) · $8,500–$10,500
Next step
Find-by-model — see what hardware runs this→