Llama 4 Scout

The long-context unlock at the frontier tier. Trained at 256K, length-generalized to 10M. Honest framing: 10M works for retrieval (find a clause in 10,000 pages); ~1-2M is the realistic synthesis ceiling. Quantization compounds the limit — 4-bit pushes effective context closer to 5M than 10M. M5 Max 128 GB community measurement: ~30 tok/s thanks to the 17B active footprint.

License: Llama 4 Community License (custom — commercial OK below 700M MAU) · Context: 256K trained, length-generalized to 10M (retrieval-only past ~1-2M) · Released: April 5, 2025

The decision in five lines

The call: Skip for local — for docs
Best for: docs
Runs on: 6 hardware picks fit (cheapest: Framework Desktop (Ryzen AI Max+ 395) · $1,999)
Watch out: Also: synthesis tasks across the full 10M (the model degrades materially past ~1-2M for cross-context reasoning).
Evidence: Estimated · last verified July 2026

109B total: PARAMETERS
MOE: TYPE
256K: CONTEXT
~55 GB (Int4) / ~110 GB (BF16): VRAM AT Q4

Where we recommend this

Every tier slot in the planner where this model is a top or alternate pick. Pulled live from planner.js — when the planner refreshes, this table stays current.

DOCS ·

Llama 4 Scout (109B/17B, 10M for retrieval / ~1-2M synthesis)Honest framing: 10M context works for retrieval (find a clause in 10,000 pages); synthesis degrades past ~1-2M; quantization compounds the limit. Use when you have 64+ GB of room for KV and the task is genuinely long-context retrieval.

The call

The long-context unlock at the frontier tier. Trained at 256K, length-generalized to 10M. Honest framing: 10M works for retrieval (find a clause in 10,000 pages); ~1-2M is the realistic synthesis ceiling. Quantization compounds the limit — 4-bit pushes effective context closer to 5M than 10M. M5 Max 128 GB community measurement: ~30 tok/s thanks to the 17B active footprint.
When not to use: Anything under ~64 GB effective. Also: synthesis tasks across the full 10M (the model degrades materially past ~1-2M for cross-context reasoning). Phase 25 demoted Scout from `docs.top` because the "10M" framing was misleading at 32 GB; at 128 GB the criticism softens but doesn't disappear.

Runner notes

Hosted via OpenRouter and Meta direct API. Local: `meta-llama/Llama-4-Scout-17B-16E-Instruct` on HF, NVIDIA-optimized int4 build for single H100. Ollama tag pending. NVIDIA TensorRT-LLM has the most mature inference path; vLLM works but consumes more memory.

License: Llama 4 Community License (custom — commercial OK below 700M MAU)
Released: April 5, 2025
Maker: Meta
Model card: huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct →

Hardware that fits

Every hardware pick whose memory fits this model at the quant we recommend. Sorted cheapest-first — the top row is your best-value fit. Click through for the full buyer’s guide.

Next step

Find-by-model — see what hardware runs this→