Local AI for documents, retrieval, and long-context work.

The context window you can actually rely on in July 2026 is still 32–64K tokens on most local rigs. Models advertised at 200K, 262K, or 1M degrade once context runs past the length they were trained on. On hardware below 96GB, plan for RAG. Above it, DeepSeek V4-Flash (April 2026) is the first model that makes 1M genuinely useful.

Verdict — Workable with RAG; 1M-context era is frontier-only

Gemma 4 31B (256K) for dense long-context; Qwen 3.5 35B-A3B + RAG for chunked retrieval; Ministral 3 14B Instruct (256K) for compact dense alternative.

What's the answer at each tier

Frontier (64+ GB)

DeepSeek V4-Flash (April 24 2026; 284B total, 13B active; 1M context with Compressed Sparse Attention) is the first locally runnable model where 1M context is genuine: at 1M tokens it uses roughly 10% of V3.2's FLOPs. It needs 2× H100 or DGX Spark-class hardware. Below that, use Qwen 3.5 122B-A10B (262K, extensible) for synthesis, or Llama 3.3 70B + RAG as the proven path inside a 128K window.

Command A+ (218B-A25B, Apache 2.0): released May 20 2026. A frontier MoE with hybrid sliding-window + global attention (cheaper KV at long context than dense 70B-class models) and native vision for diagrams and tables. 128K input / 64K generation; 25B active. The first frontier Apache-licensed MoE you can self-host on 2× H100 or a Mac M3 Ultra without a hosted-only caveat.
Qwen 3.5 122B-A10B (262K native, extensible to 1M via YaRN): the largest reliable-context model for long-document synthesis at this tier. 10B active parameters keep 262K responsive. Multimodal for diagrams and tables. Apache 2.0.
Llama 3.3 70B Q4 + RAG (128K window): below V4-Flash-class hardware, community consensus is that no locally runnable model has truly reliable 1M context as of July 2026. Llama 3.3 70B with a 128K window and proper RAG is the stable, proven path. Keep the input the model must actually attend to within the practical 32-64K limit.
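
The KV-cache cost behind these picks can be estimated with simple arithmetic. A rough sketch, assuming a standard grouped-query-attention transformer with an fp16 cache and full attention on every layer; kv_cache_bytes is an illustrative helper, and the shape used (80 layers, 8 KV heads, head dimension 128) is the Llama 3 70B-class configuration. It ignores weights, activations, cache quantization, and any sliding-window or sparse-attention savings.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size for a full-attention transformer.

    Each token stores one key and one value vector per layer per KV head,
    hence the leading factor of 2. bytes_per_value=2 assumes an fp16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


GIB = 1024 ** 3

# Llama 3 70B-class shape: 80 layers, 8 KV heads (GQA), head_dim 128.
full_window = kv_cache_bytes(80, 8, 128, n_tokens=128 * 1024)
practical = kv_cache_bytes(80, 8, 128, n_tokens=32 * 1024)

print(f"128K tokens: {full_window / GIB:.1f} GiB")  # 40.0 GiB
print(f" 32K tokens: {practical / GIB:.1f} GiB")    # 10.0 GiB
```

Under these assumptions the cache alone costs about 320 KiB per token, on top of the quantized weights. That is one reason a dense 70B-class model is paired with RAG rather than run with a full window.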

Top (32+ GB)

Qwen 3.6-27B (262K native, extensible to 1M via YaRN) is the new dense long-context top pick at 32GB. Pair with Gemma 4 31B for calmer long-context behaviour.

Qwen 3.6-27B — April 22 2026 dense refresh — 262K native context extensible to 1M, multimodal, single-GPU at Q4. Now the dense long-context top pick.
Qwen 3.5 35B-A3B + RAG — MoE plus proper RAG beats brute-force long context for most real docs work.
Gemma 4 31B (256K context) — 256K context with vision+audio; calmer long-context behaviour than the 35B-A3B MoE on dense retrieval prompts.

High (20–24 GB)

Gemma 4 31B (256K) for dense long-context; Qwen 3.5 35B-A3B + RAG for chunked retrieval; Ministral 3 14B Instruct (256K) for compact dense alternative.

Gemma 4 31B (256K context) — 31B dense with 256K context; Gemma commercial-permissive terms; Arena top 5.
Qwen 3.5 35B-A3B + RAG — MoE + RAG combo fits 24GB and handles chunked retrieval well.
Ministral 3 14B Instruct (256K context) — 256K context in a compact dense model; use Instruct — Reasoning variant has community-reported timeouts.

Mid (12–16 GB)

Qwen 3.5 9B + RAG is the practical workhorse. Chunk aggressively, retrieve well. BGE-M3 is the community-standard retriever; Qwen3-Embedding-8B (June 2025, MTEB Multilingual #1 at 70.58) is the SOTA upgrade if you can spare 16GB for retrieval.

Qwen 3.5 9B + RAG — Chunk aggressively, retrieve well; 262K native context handles big retrieval windows comfortably.
Qwen3-Embedding-8B (Apache 2.0): #1 on MTEB Multilingual (70.58) as of July 2026 and the current best-quality open retrieval pick. Use BGE-M3 (568M) instead when you want cheap, broad multilingual coverage or CPU-only inference.
Ministral 3 8B Instruct + RAG — Solid 8B + RAG; Mistral vocab handles docs cleanly.
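
"Chunk aggressively" in practice means small, overlapping windows so that a sentence cut at one boundary still appears whole in the neighbouring chunk. A minimal sketch, assuming whitespace-split words as a stand-in for real tokenizer tokens; chunk_words and the sizes used are illustrative, not tuned recommendations.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping windows of whitespace-separated words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# 500 synthetic words, windows of 200 with 40 words shared between neighbours.
doc = " ".join(f"w{i}" for i in range(500))
chunks = chunk_words(doc, chunk_size=200, overlap=40)
print(len(chunks))  # 3
```

With a real tokenizer the same loop applies; only the unit being counted changes.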

Low (6–12 GB / CPU)

Qwen 3.5 4B + tight RAG + nomic-embed-text-v1.5 retriever. Keep context windows small; rely on retrieval quality.

Qwen 3.5 4B + tight RAG — 4B plus tight chunking; keep context windows small.
nomic-embed-text-v1.5 (retrieval) — Lightweight English embedding; fast on CPU; pairs with any small generator.
Phi-4 Mini — 3.8B STEM specialist for focused document work.
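
Keeping the context window small comes down to a token budget: take the retriever's ranking and stop adding chunks once the budget is spent. A minimal sketch, assuming chunks arrive sorted best-first and using a rough 4-characters-per-token estimate for English text; pack_context and both defaults are illustrative assumptions, and a real setup would count tokens with the generator's tokenizer.

```python
def pack_context(ranked_chunks: list[str], budget_tokens: int,
                 chars_per_token: float = 4.0) -> list[str]:
    """Greedily keep the highest-ranked chunks that fit within the budget.

    ranked_chunks must already be sorted best-first by the retriever.
    Token cost is estimated from character length (an approximation).
    """
    packed, used = [], 0
    for chunk in ranked_chunks:
        cost = max(1, round(len(chunk) / chars_per_token))
        if used + cost > budget_tokens:
            continue  # too big for the remaining budget; a shorter chunk may still fit
        packed.append(chunk)
        used += cost
    return packed


# Three ranked chunks of roughly 100, 200, and 50 tokens against a 160-token budget.
ranked = ["a" * 400, "b" * 800, "c" * 200]
kept = pack_context(ranked, budget_tokens=160)
print([len(c) for c in kept])  # [400, 200]
```

The second chunk is skipped because it would overrun the budget, while the shorter third chunk still fits.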

How to actually run it

A local RAG setup has three components: an embedding model, a vector store, and a generator. Default open-source stack: BGE-M3 (or Qwen3-Embedding-8B if accuracy-first) + Qdrant/Chroma + your generator of choice. Open WebUI handles the orchestration if you don't want to wire it yourself.
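
How the three components hand off to each other can be sketched end to end in plain Python. This is a toy under loud assumptions: embed is a normalised word-count vector standing in for a real embedding model such as BGE-M3, InMemoryStore stands in for Qdrant or Chroma, and build_prompt only assembles the text you would send to a local generator. None of these names come from a real library.

```python
import math
from collections import Counter


def embed(text: str) -> dict[str, float]:
    """Toy embedding: L2-normalised word counts (stand-in for a real embedder)."""
    words = [w.strip(".,;:!?()\"'") for w in text.lower().split()]
    counts = Counter(w for w in words if w)
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {w: c / norm for w, c in counts.items()}


class InMemoryStore:
    """Toy vector store: keeps (vector, text) rows and ranks by cosine similarity."""

    def __init__(self) -> None:
        self._rows: list[tuple[dict[str, float], str]] = []

    def add(self, text: str) -> None:
        self._rows.append((embed(text), text))

    def search(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        scored = [
            (sum(weight * vec.get(word, 0.0) for word, weight in q.items()), text)
            for vec, text in self._rows
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble the prompt that would be sent to the generator."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only the context.\n\nContext:\n{context}\n\nQuestion: {question}"


store = InMemoryStore()
for passage in [
    "The lease term is five years starting in March.",
    "Either party may terminate with ninety days written notice.",
    "Rent is payable monthly in advance.",
]:
    store.add(passage)

question = "How many days of notice are needed to terminate?"
top = store.search(question, k=1)
print(build_prompt(question, top))
```

Swapping in a real stack changes the internals of embed and InMemoryStore, not the shape of the flow: index once, retrieve per question, then send a short prompt to the generator.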

Watchouts

Llama 4 Scout's 10M context is real for retrieval (find a clause in 10,000 pages) but degrades to ~1-2M for synthesis. Quantization compounds the limit. Plan around the synthesis ceiling, not the retrieval ceiling.
Practical reliable input attention is ~32-64K tokens of meaningful content on most local rigs, per RULER and Chroma's context-rot studies. Beyond that, retrieval beats brute-force long context.
BGE-M3 is excellent but not SOTA anymore — Qwen3-Embedding-8B sits at #1 on MTEB Multilingual (70.58) as of July 2026.
Jina embeddings v4 supports multimodal docs (charts, tables, illustrations) but is released under the Qwen Research License (non-commercial); read the terms before commercial deployment.
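
The planning rule in these watchouts reduces to one comparison: does the material fit inside the window you trust, after leaving room for the answer? A minimal sketch, assuming the 32K lower bound from above as the default reliable window and a hypothetical 4,000-token reserve for output; choose_strategy is an illustrative helper, not part of any tool.

```python
def choose_strategy(doc_tokens: int, reliable_window: int = 32_000,
                    reserve_for_output: int = 4_000) -> str:
    """Return "full-context" if the document fits the reliable window, else "rag".

    reliable_window is the input size you trust for synthesis, not the
    advertised maximum; reserve_for_output is an assumed allowance for the answer.
    """
    usable = reliable_window - reserve_for_output
    return "full-context" if doc_tokens <= usable else "rag"


print(choose_strategy(20_000))                          # full-context
print(choose_strategy(90_000))                          # rag
print(choose_strategy(50_000, reliable_window=64_000))  # full-context
```

Set reliable_window from what you have measured on your own rig rather than from the model card.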

When cloud still wins

Cloud still wins when you need genuinely massive context windows (Gemini 1.5 Pro's 2M, Claude 4.7's 200K with frontier-class synthesis), or when you don't want to manage a vector store. For most document workflows below 100K tokens of meaningful content, local + RAG matches cloud quality at a fraction of the cost.

Notes flagged for next refresh

Qwen3-Embedding-8B (Apache 2.0, June 5 2025, #1 on MTEB Multilingual at 70.58) and Jina v4 (Qwen Research License, multimodal) are flagged for the next quarterly refresh as embedding picks. BGE-M3 stays as the predictable budget default.