the AI bench
VERIFIED MAY 2026

USE CASE · DOCS · LONG-CONTEXT · RAG

Local AI for documents, retrieval, and long-context work.

In May 2026, the context window you can actually rely on is still 32–64K tokens for most local rigs. Models advertising 200K, 262K, or 1M degrade once inputs exceed the length they were trained on. On hardware below 96GB, plan for RAG. Above it, DeepSeek V4-Flash (April 2026) is the first model that makes 1M genuinely useful.


Verdict — Workable with RAG; 1M-context era is frontier-only

Gemma 4 31B (256K) for dense long-context; Qwen 3.5 35B-A3B + RAG for chunked retrieval; Ministral 3 14B Instruct (256K) for compact dense alternative.


What's the answer at each tier

Frontier (64+ GB)

DeepSeek V4-Flash (April 24 2026; 284B total, 13B active; 1M context with Compressed Sparse Attention) is the first locally-runnable model where 1M context is genuinely usable: at 1M tokens its FLOPs are roughly 10% of V3.2's. It needs 2× H100 or DGX Spark-class hardware. Below that, use Qwen 3.5 122B-A10B (262K, extensible) for synthesis, or Llama 3.3 70B + RAG for a proven, reliable 128K path.

  1. Qwen 3.5 122B-A10B (262K native, extensible to 1M via YaRN) — Biggest reliable-context model for long-doc synthesis at this tier. 10B active keeps 262K responsive. Multimodal for diagrams + tables. Apache 2.0.
  2. Llama 3.3 70B Q4 + RAG (128K reliable) — Community consensus as of April 2026: no locally-runnable model has truly reliable 1M context. Llama 3.3 70B at 128K context with proper RAG is the stable, proven path. Even at 128K, plan for only about 32-64K tokens of input to receive reliable attention.
  3. Llama 4 Scout (109B/17B, 10M for retrieval / ~1-2M synthesis) — Honest framing: 10M context works for retrieval (find a clause in 10,000 pages); synthesis degrades past ~1-2M; quantization compounds the limit. Use when you have 64+ GB of room for KV and the task is genuinely long-context retrieval.
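A rough sizing sketch for the "room for KV" point above. It uses the standard estimate for a grouped-query-attention transformer: 2 (keys and values) × layers × KV heads × head dimension × bytes per element × tokens. The layer and head counts below are hypothetical placeholders, not the published configuration of any model named here; read the real values from the model's config before relying on the result.

```python
def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_elem: float = 2.0) -> float:
    """Estimate the KV-cache size in GiB for a single sequence.

    bytes_per_elem defaults to 2.0 (FP16/BF16); use 1.0 for an 8-bit cache.
    """
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens
    return total_bytes / 2**30


# Hypothetical dimensions, chosen only to illustrate the arithmetic.
EXAMPLE_DIMS = dict(layers=48, kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for ctx in (32_768, 131_072, 1_048_576):
        size = kv_cache_gib(ctx, **EXAMPLE_DIMS)
        print(f"{ctx:>9} tokens: {size:6.1f} GiB at FP16")
```

With these placeholder dimensions the cache grows linearly from 6 GiB at 32K tokens to 192 GiB at 1M, which is why long windows are gated by memory as much as by model quality. Architectures with sparse or compressed attention will need less than this estimate.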

Top (32+ GB)

Qwen 3.6-27B (262K native, extensible to 1M via YaRN) is the new dense long-context top pick at 32GB. Pair with Gemma 4 31B for calmer long-context behaviour.

  1. Qwen 3.6-27B — April 22 2026 dense refresh — 262K native context extensible to 1M, multimodal, single-GPU at Q4. Now the dense long-context top pick.
  2. Qwen 3.5 35B-A3B + RAG — MoE plus proper RAG beats brute-force long context for most real docs work.
  3. Gemma 4 31B (256K context) — 256K context with vision+audio; calmer long-context behaviour than the 35B-A3B MoE on dense retrieval prompts.
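The "extensible to 1M via YaRN" note on the Qwen 3.6-27B pick comes down to a RoPE scaling factor: target length divided by native length. A minimal sketch of that arithmetic follows. It assumes "262K" means 262,144 tokens and "1M" means 1,048,576, and it uses the `rope_scaling` key names found in Hugging Face Transformers configs for earlier Qwen releases. Whether Qwen 3.6-27B accepts the same keys is an assumption, so check the model card first.

```python
def yarn_rope_scaling(native_ctx: int, target_ctx: int) -> dict:
    """Build a YaRN rope_scaling dict for stretching native_ctx to target_ctx."""
    if target_ctx <= native_ctx:
        raise ValueError("target_ctx must exceed native_ctx; no scaling needed")
    return {
        "rope_type": "yarn",
        "factor": target_ctx / native_ctx,
        "original_max_position_embeddings": native_ctx,
    }


if __name__ == "__main__":
    print(yarn_rope_scaling(262_144, 1_048_576))
```

Qwen's documentation for earlier releases notes that common runtimes apply this scaling statically, regardless of input length, which can hurt quality on short inputs. Enabling it only when a job genuinely needs the longer window is the safer default.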

High (20–24 GB)

Gemma 4 31B (256K) for dense long-context; Qwen 3.5 35B-A3B + RAG for chunked retrieval; Ministral 3 14B Instruct (256K) for compact dense alternative.

  1. Gemma 4 31B (256K context) — 31B dense with 256K context; Gemma commercial-permissive terms; Arena top 5.
  2. Qwen 3.5 35B-A3B + RAG — MoE + RAG combo fits 24GB and handles chunked retrieval well.
  3. Ministral 3 14B Instruct (256K context) — 256K context in a compact dense model; use Instruct — Reasoning variant has community-reported timeouts.

Mid (12–16 GB)

Qwen 3.5 9B + RAG is the practical workhorse. Chunk aggressively, retrieve well. BGE-M3 is the community-standard retriever; Qwen3-Embedding-8B (June 2025, MTEB Multilingual #1 at 70.58) is the SOTA upgrade if you can spare 16GB for retrieval.

  1. Qwen 3.5 9B + RAG — Chunk aggressively, retrieve well; 262K native context handles big retrieval windows comfortably.
  2. BGE-M3 (retrieval) — Community-standard dense + sparse + multi-vector embeddings; multilingual; pairs with any generator.
  3. Ministral 3 8B Instruct + RAG — Solid 8B + RAG; Mistral vocab handles docs cleanly.
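To make "chunk aggressively" concrete, here is a minimal sketch of fixed-size chunking with overlap. It splits on whitespace as a stand-in for real tokenization, and the default sizes (200 words, 40 overlap) are illustrative assumptions, not tuned recommendations.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of up to `size` words, each sharing
    `overlap` words with the previous chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(str(i) for i in range(10))
    for chunk in chunk_words(sample, size=4, overlap=1):
        print(chunk)
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk. Smaller chunks improve retrieval precision at the cost of more vectors to store and search.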

Low (6–12 GB / CPU)

Qwen 3.5 4B + tight RAG + nomic-embed-text-v1.5 retriever. Keep context windows small; rely on retrieval quality.

  1. Qwen 3.5 4B + tight RAG — 4B plus tight chunking; keep context windows small.
  2. nomic-embed-text-v1.5 (retrieval) — Lightweight English embedding; fast on CPU; pairs with any small generator.
  3. Phi-4 Mini — 3.8B STEM specialist for focused document work.
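One practical detail when wiring up nomic-embed-text-v1.5: its model card calls for a task prefix on every input, `search_document: ` for indexed chunks and `search_query: ` for questions. The helper below, with hypothetical function names, keeps the two from being mixed up. The embedding call itself is left out because it depends on your runtime.

```python
DOC_PREFIX = "search_document: "
QUERY_PREFIX = "search_query: "


def prepare_documents(chunks: list[str]) -> list[str]:
    """Prefix chunks for indexing, dropping empty or whitespace-only ones."""
    return [DOC_PREFIX + c.strip() for c in chunks if c.strip()]


def prepare_query(question: str) -> str:
    """Prefix a user question for retrieval."""
    return QUERY_PREFIX + question.strip()


if __name__ == "__main__":
    print(prepare_documents(["Rent is due monthly. ", "   "]))
    print(prepare_query(" When is rent due?"))
```

Embedding documents and queries with mismatched or missing prefixes tends to lower retrieval quality without raising any error, so it is worth handling in one place.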

How to actually run it

A local RAG setup has three components: an embedding model, a vector store, and a generator. The default open-source stack is BGE-M3 (or Qwen3-Embedding-8B if accuracy comes first), plus Qdrant or Chroma, plus your generator of choice. Open-WebUI handles the orchestration if you don't want to wire it yourself.
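The three-component shape can be sketched with the standard library alone. Everything here is a stand-in: the word-count `embed` function takes the place of a real embedding model such as BGE-M3, `InMemoryStore` takes the place of Qdrant or Chroma, and `build_prompt` only prepares the text you would hand to your generator. The class and function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real model goes here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryStore:
    """Stand-in for a vector store: keeps (text, vector) pairs in a list."""

    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def add(self, chunks: list[str]) -> None:
        self.items.extend((c, embed(c)) for c in chunks)

    def search(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the text that would be sent to the generator."""
    joined = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return f"Answer using only the context below.\n\n{joined}\n\nQuestion: {question}"


if __name__ == "__main__":
    store = InMemoryStore()
    store.add([
        "The lease terminates on 31 December unless renewed in writing.",
        "Rent is due on the first business day of each month.",
        "The tenant may keep one cat with written permission.",
    ])
    question = "When is rent due?"
    print(build_prompt(question, store.search(question, k=1)))
```

Swapping in real components changes the bodies of `embed` and `InMemoryStore` but not the flow: index once, then embed the query, retrieve the top chunks, and build the prompt for every question.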


Watchouts

  • Llama 4 Scout's 10M context is real for retrieval (find a clause in 10,000 pages) but degrades to ~1-2M for synthesis. Quantization compounds the limit. Plan around the synthesis ceiling, not the retrieval ceiling.
  • On most local rigs, input attention is practically reliable for only about 32-64K tokens of meaningful content, according to RULER and Chroma's context-rot studies. Beyond that, retrieval beats brute-force long context.
  • BGE-M3 is excellent but no longer SOTA: Qwen3-Embedding-8B sits at #1 on MTEB Multilingual (70.58) as of May 2026.
  • Jina embeddings v4 supports multimodal docs (charts, tables, illustrations) but is released under the Qwen Research License (non-commercial). Read the terms before any commercial deployment.
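The 32-64K ceiling noted above suggests capping how much retrieved text reaches the generator. A minimal sketch of a budget packer: it walks the chunks in ranked order and keeps each one that still fits. The 4-characters-per-token estimate is a crude assumption for English text; swap in your model's tokenizer for real counts.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about 4 characters per token."""
    return max(1, len(text) // 4)


def pack_context(ranked_chunks: list[str], budget_tokens: int = 32_000) -> list[str]:
    """Keep chunks, best first, until the token budget is used up."""
    kept: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # too big for the remaining budget; a smaller chunk may fit
        kept.append(chunk)
        used += cost
    return kept


if __name__ == "__main__":
    chunks = ["a" * 40, "b" * 400, "c" * 40]
    print([len(c) for c in pack_context(chunks, budget_tokens=25)])
```

Leave headroom in the budget for the system prompt, the question, and the answer itself, since all of them share the same window.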

When cloud still wins

Cloud still wins when you need genuinely massive context windows (Gemini 1.5 Pro's 2M, Claude 4.7's 200K with frontier-class synthesis), or when you don't want to manage a vector store. For most document workflows below 100K tokens of meaningful content, local + RAG matches cloud quality at a fraction of the cost.


Notes flagged for next refresh

Qwen3-Embedding-8B (Apache 2.0, released June 5 2025, #1 on MTEB Multilingual at 70.58) and Jina v4 (Qwen Research License, multimodal) are flagged as embedding picks for the next quarterly refresh. BGE-M3 stays as the predictable budget default.