the AI bench
VERIFIED JUNE 2026

MODELS · 56 CURATED OPEN-WEIGHT PICKS

Every model we recommend.

Dated, opinionated, license-audited. Machine-readable by design.

Canonical detail page per base model. Which tier slots it fills, license gotchas, runner friction, and which hardware actually runs it — pulled live from the planner so this index stays current when picks shift.


Alibaba Qwen — the most-used open family in 2026

ALIBABA · 35B total / 3B active

Qwen 3.6-35B-A3B

Alibaba's post-3.5 refresh specifically targeting agentic coding — claims to beat dense Qwen 3.5 27B and Gemma 4 31B on coding + reasoning at the same active-param budget. Fully multimodal.

chat · agents · Estimated
ALIBABA · 27B dense

Qwen 3.6-27B

The April 2026 dense refresh that supersedes Qwen 3.5 27B — claims to beat the prior 397B MoE flagship on coding benchmarks while staying single-GPU deployable at Q4. The current dense top pick for 24 GB rigs and Mac 32+ GB.

chat · docs · Estimated
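
The "single-GPU deployable at Q4" claim above comes down to simple arithmetic on weight size. A rough sketch, assuming about 4.5 bits per weight for a Q4-style quant (an assumption; real quant formats vary) and ignoring KV cache and runtime overhead. The function name is illustrative, not part of any planner API.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (10^9 bytes).

    Ignores KV cache, activations, and runtime overhead.
    """
    return params_billion * bits_per_weight / 8


# Assumed: ~4.5 bits per weight for a Q4-style quantization.
ASSUMED_Q4_BITS = 4.5

dense_27b_gb = weight_memory_gb(27, ASSUMED_Q4_BITS)
headroom_gb = 24 - dense_27b_gb  # what is left on a 24 GB card for KV cache etc.

print(f"27B dense at Q4: {dense_27b_gb:.1f} GB weights, {headroom_gb:.1f} GB headroom")
```

The same arithmetic applies to MoE picks using the total parameter count: all experts have to sit in memory even though only a few run per token.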
ALIBABA · 35B total / 3B active

Qwen 3.5 35B-A3B

The 24 GB-VRAM unlock — dense-27B quality at 3B-active speed, and the community workhorse for mixed coding / chat / docs where breadth matters. 256 experts with 8 routed + 1 shared per token.

coding · chat · docs · agents · Measured
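
The "8 routed + 1 shared per token" figure above can be illustrated with a toy top-k router. This is a minimal sketch, not Qwen's actual implementation: it assumes the shared expert is counted separately from the 256 routed ones and that gate weights are a softmax over the selected logits only. The router logits are random stand-ins.

```python
import math
import random

NUM_ROUTED_EXPERTS = 256  # from the entry above
TOP_K = 8                 # routed experts chosen per token


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts by logit and softmax-normalize their gate weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]  # stand-in router output
routed = route_token(logits)

# The shared expert runs for every token in addition to the routed ones.
experts_run = len(routed) + 1
print(f"{experts_run} experts run for this token")
```

Only the selected experts do work for a given token, which is why per-token compute tracks the active parameter count while memory tracks the total.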
ALIBABA · 27B dense

Qwen 3.5 27B

A dense, natively multimodal (text + image + video input) mid-large generalist — the biggest non-MoE in the Qwen 3.5 medium line. Best realistic pick for long-context docs on 24 GB VRAM or Mac 48 GB+.

Reference entry · Measured
ALIBABA · 9B dense

Qwen 3.5 9B

Strongest "runs on a mid-tier GPU" model in the Qwen 3.5 small line — supports thinking mode and 201-language coverage. Fits 8 GB VRAM with headroom.

coding · chat · docs · agents · Estimated
ALIBABA · 4B dense

Qwen 3.5 4B

The low-tier sweet spot — fits 6 GB VRAM at Q4, strong multimodal and agentic tool-use for its size, supports both thinking and non-thinking modes.

coding · chat · docs · Estimated
ALIBABA · 2B dense

Qwen 3.5 2B

Phone-class multimodal model built on the same Qwen 3.5 foundation as the medium tier. Non-thinking by default. Runs anywhere, including CPU-only setups.

coding · Estimated
ALIBABA · 30.5B total / 3.3B active

Qwen3-Coder-30B-A3B

The community daily-driver coding MoE for 24 GB-class hardware — purpose-trained for agentic coding + browser-use. Delivers 30B-dense quality at 3B-dense throughput.

coding · agents · Estimated
ALIBABA · 480B total / 35B active

Qwen3-Coder-480B-A35B

Frontier open-weight agentic coding model — claimed on par with Claude Sonnet on agentic benchmarks. Alibaba's most powerful coder.

Reference entry · Estimated
ALIBABA · 30B total / 3B active

Qwen3-Omni-30B-A3B-Instruct

The only locally-runnable open-weight model that does real-time streaming speech-out natively. 119 input languages, 10 speech-output languages (two voices: Chelsie, Ethan).

voice · Estimated
ALIBABA · 14B dense

Qwen3-14B

Last-generation Qwen3 14B dense with thinking mode enabled by default and strong tool-calling. Still a solid 16 GB-VRAM pick when you want dense behaviour over MoE. Qwen 3.5 skipped the 14B slot.

coding · Estimated
ALIBABA · 20B MMDiT (original Qwen-Image lineage)

Qwen-Image-2512 (20B) + Edit-2511

Strongest open-weight image model for text rendering — Arabic, Chinese, English all sharp. Qwen-Image-2512 (Dec 31 2025) is the latest released generation model and claims #1 open-source on AI Arena; pair with Qwen-Image-Edit-2511 (Dec 23 2025) for editing workflows. A unified 7B "Qwen-Image 2.0" was announced Feb 10 2026 (arxiv 2605.10730 tech report) but the weights are not yet open-sourced — Qwen-Image-2512 remains the runnable flagship as of June 2026.

image · Estimated

Other frontier and mid-tier text

COHERE · 218B total / 25B active (128 experts, 8 active + 1 shared per token)

Command A+ (218B-A25B)

Cohere's frontier-class MoE: 218B params with 25B active per token, hybrid sliding-window + global attention, native vision + 48-language coverage. The first Apache-2.0 frontier MoE you can actually serve on 2× H100 — same hardware class as DeepSeek V4-Pro and Kimi K2.6 but with a permissive license neither of those carries.

docs · agents · Estimated
Z.AI (FORMERLY ZHIPU AI) · 744B total / 40B active

GLM-5.1

Current #1 open-weight on SWE-Bench Pro (58.4) — a long-horizon agentic coding flagship that narrowly beats GPT-5.4 and Claude Opus 4.6 on that benchmark. MIT license means no commercial restrictions, unlike many frontier opens.

agents · Estimated
MOONSHOT AI · 1T total / 32B active (384 experts; 8 routed + 1 shared per token)

Kimi K2.6

Moonshot's 1T MoE agentic coder — keeps the K2 architecture (61 layers, 64 attention heads, MLA) and extends context to 256K. Tops SWE-Bench Pro at 58.6 (vs GPT-5.4 xhigh 57.7, Opus 4.6 max 53.4) and lands #4 on the Artificial Analysis Intelligence Index. The real differentiator is Agent Swarm — 300 sub-agents over 4,000 coordinated steps, a different shape of capability than single-step quality.

coding · agents · Estimated
DEEPSEEK · 1.6T total / 49B active (MoE)

DeepSeek V4-Pro

DeepSeek's frontier-class V4 flagship — 1.6T MoE that matches GPT-5.4 and Sonnet 4.6 on most benchmarks at meaningfully lower hosted price. The 1M-context default uses ~27% of V3.2's single-token FLOPs and ~10% of its KV cache thanks to architecture changes. MIT-licensed, but not realistically a local pick at this size.

Reference entry · Estimated
DEEPSEEK · 284B total / 13B active (MoE)

DeepSeek V4-Flash

The smaller half of the V4 family: a 284B MoE with 13B active per token. Same 1M context, same MIT license, same architectural KV-cache improvements as V4-Pro. The honest local pick of the V4 line: still frontier-class on most benchmarks, but realistically deployable only on an M3 Ultra with 192 GB of unified memory or on dual 80 GB server cards.

coding · Estimated
GOOGLE · 31B dense / 26B total + 3.8B active (MoE)

Gemma 4 (31B dense + 26B A4B MoE)

Google's April 2026 refresh — Arena top 5 in its first week, 256K context native, vision + audio multimodal. Big news: Gemma 4 moved to Apache 2.0 from the custom Gemma Terms. The current Apache-2.0 "best dense under 70B" pick.

chat · docs · Estimated
GOOGLE · 4B / 12B / 27B dense

Gemma 3 (4B / 12B / 27B)

Previous-generation Gemma line. The 4B is still a useful ultra-compact agent model with native vision. Larger sizes are superseded by Gemma 4.

agents · Estimated
MISTRAL AI · 128B dense (folds Magistral reasoning + Devstral 2 coding into one weight set)

Mistral Medium 3.5 128B

Mistral's flagship 128B dense model. It replaces Medium 3.1 and folds the dedicated Magistral (reasoning) and Devstral 2 (coding) specialists into one weight set with a per-request `reasoning_effort` toggle. 77.6% on SWE-Bench Verified, native multimodal vision encoder trained from scratch, 256K context. The first serious Mistral release since Ministral 3 (Dec 2025).

Reference entry · Estimated
MISTRAL AI · 3B / 8B / 14B dense (all with image understanding)

Ministral 3 family (3B / 8B / 14B)

Mistral's clean Apache-2.0 edge family with Base / Instruct / Reasoning splits per size. The "no-license-drama" alternative to Qwen or Gemma when lawyers are involved.

chat · docs · agents · Estimated
OPENAI · 21B total / 3.6B active

gpt-oss-20b

OpenAI's open-weights MoE. Matches o3-mini on common benchmarks, post-trained with MXFP4 quantization so it lands in 16 GB VRAM — a near-frontier reasoner you can actually run on a 5060 Ti.

coding · chat · agents · Estimated
IBM RESEARCH · 3B / 8B / 30B dense (instruct + base each)

IBM Granite 4.1

IBM's refreshed open-weights enterprise family — three dense decoder-only sizes, Apache 2.0, trained on ~15T tokens with progressive annealing toward technical/scientific/mathematical data plus instruction-following. The 8B instruct claims to match the prior Granite 4.0 32B-A9B MoE flagship on IBM's own benchmarks; cross-vendor comparison (vs Qwen/Gemma/Mistral) is unverified at time of publication.

Reference entry · Estimated
IBM GRANITE · 8B base + 12 embedded LoRA adapters (~10B total)

Granite-Switch 4.1 8B Preview (12 task LoRAs)

IBM Granite 4.1 8B with 12 task-specialized LoRA adapters embedded in a single checkpoint, activated per-token via control tokens in the chat template. Three libraries: **Core** (3 adapters — requirement check, context attribution, uncertainty), **RAG** (5 — query rewrite, query clarification, answerability, hallucination detection, citation generation), **Guardian** (4 — safety detection, factuality detection + correction, policy guardrails). A lightweight switch layer detects control tokens and produces per-position adapter indices applied across all decoder layers; KV-cache normalization keeps adapters independent. Novel deployment pattern for production RAG / agent stacks — one checkpoint, multiple specialized behaviors. 12 languages: EN, DE, ES, FR, JA, PT, AR, CS, IT, KO, NL, ZH.

Reference entry · Estimated
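
The switch-layer idea described above (control tokens in, per-position adapter indices out) can be sketched in a few lines. This is an illustration only: the control-token names are hypothetical, the real ones live in the model's chat template, and treating an adapter as active from its control token until the next one is an assumption about the behavior.

```python
# Hypothetical control tokens mapped to adapter indices; 0 means "base model, no adapter".
CONTROL_TOKENS = {
    "<adapter:rag_citation>": 1,
    "<adapter:guardian_safety>": 2,
    "<adapter:off>": 0,
}


def adapter_indices(tokens, control_tokens=CONTROL_TOKENS):
    """Return one adapter index per token position.

    Assumed behavior: a control token switches the active adapter for its own
    position and every later position, until another control token appears.
    """
    current = 0
    indices = []
    for tok in tokens:
        if tok in control_tokens:
            current = control_tokens[tok]
        indices.append(current)
    return indices


tokens = ["Summarize", "this", "<adapter:rag_citation>", "Cite", "sources",
          "<adapter:off>", "Done"]
indices = adapter_indices(tokens)
print(indices)
```

In a real model, each decoder layer would use the index at a position to choose which LoRA delta to apply there, which is how one checkpoint can serve several behaviors within a single sequence.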
MICROSOFT · 3.8B dense

Phi-4 Mini

Microsoft's dense 3.8B instruct model built on synthetic + filtered web data. Punches above its weight on reasoning-heavy prompts in the <5B bracket. MIT license is unusually clean for commercial redistribution.

coding · chat · docs · agents · Estimated
MINIMAX · ~229B total / ~10B active (MoE, interleaved thinking)

MiniMax M2.5 / M2.7

MiniMax's open-weights agentic-workflow family — strong on coding + tool-use Arena. M2.7 is the first model that "participates in its own evolution" via self-iterated RL. Frontier-class, not a local workhorse.

Reference entry · Estimated
OPENBMB · 0.5B / 1B / 3B / 8B (all natively trained in 1.58-bit ternary; not post-hoc quantized)

BitCPM4-CANN family (0.5B / 1B / 3B / 8B, native 1.58-bit)

First publicly reported end-to-end 1.58-bit (ternary {-1, 0, 1}) training stack at 8B scale. Trained natively at 1.58-bit via Quantization-Aware Training + Straight-Through Estimator on Huawei Ascend NPU — not a post-hoc PTQ pass over a BF16 model. The 8B model retains 95.7% of full-precision MiniCPM4 performance at ~6× memory reduction; 0.5B retains 90.1%. The new low-VRAM tier ceiling.

Reference entry · Estimated
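
For readers new to ternary weights, here is a minimal sketch of the forward quantizer. It uses the absmean scaling popularized by BitNet b1.58; whether BitCPM4 uses exactly this scale is an assumption. It also shows where "1.58-bit" comes from: three states carry log2(3) bits. The training side (the straight-through estimator, which lets gradients pass through the rounding step) is not shown.

```python
import math


def ternary_quantize(weights):
    """Absmean ternary quantizer: scale by mean |w|, round, clip to {-1, 0, 1}."""
    scale = sum(abs(w) for w in weights) / len(weights)
    scale = max(scale, 1e-8)  # guard against an all-zero tensor
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


weights = [0.42, -0.07, 0.9, -0.55, 0.01, -1.3]  # toy values, not real model weights
codes, scale = ternary_quantize(weights)
dequantized = [c * scale for c in codes]

bits_per_weight = math.log2(3)  # information content of one ternary value
print(codes, f"scale={scale:.4f}", f"bits={bits_per_weight:.3f}")
```

In quantization-aware training, this quantizer runs in the forward pass on every step, so the network learns weights that survive the rounding instead of being rounded once after the fact.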
OPENBMB · 1.08B total / 679M non-embedding (LlamaForCausalLM)

MiniCPM5-1B (Apache 2.0, OPD-trained)

OpenBMB's claimed 1B-class open-source SOTA, but the training-method story matters more than the size. The post-training pipeline runs SFT → RL → On-Policy Distillation (OPD): RL teachers are trained per domain (math, code, closed-book QA, writing) and then distilled back into one release model. RL + OPD lifts the SFT-only checkpoint by 16 points on average across math / code / instruction-following and cuts max-token-truncated responses by 29 percentage points. Hybrid `<think>` reasoning toggle (switch via `enable_thinking`) and native XML-style tool calling. English + Chinese.

Reference entry · Estimated

Image generation

BLACK FOREST LABS · 32B

FLUX.2 [dev]

BFL's frontier open-weights T2I, with best-in-class prompt adherence and text rendering among license-flexible open models as of April 2026.

image · Estimated
BLACK FOREST LABS · 4B (step-distilled, ~4 inference steps) · 9B (parent flow model)

FLUX.2 [klein] (4B + 9B)

FLUX distilled for fast inference. The 4B variant is Apache 2.0 — the first FLUX-quality image model you can actually ship in a commercial product.

image · Estimated
HIDREAM · 8B dense (pixel-space; no VAE, no disjoint text encoder)

HiDream-O1-Image (8B)

HiDream's next-generation image foundation model. The architectural story: pixel-space generation without an external VAE or disjoint text encoder — one Pixel-level Unified Transformer handles text-to-image, image editing, subject-driven personalization, and storyboarding in a single weight set. Debuted at #8 on Artificial Analysis T2I Arena at launch. Supersedes HiDream-I1 for the MIT-license open-weight slot.

image · Estimated
MICROSOFT · 3.8B (MMDiT 48-block + FLUX.2 semantic VAE + multi-layer GPT-OSS text features)

Microsoft Lens (3.8B, MIT)

Microsoft's first foundational text-to-image model. Three-step ladder: `Lens-Base` (50-step supervised), `Lens` (20-step RL-tuned default), `Lens-Turbo` (4-step distilled). Architecture is novel: an MMDiT trunk paired with FLUX.2's semantic VAE and multi-layer features from a frozen GPT-OSS text model — Microsoft's public framing claims competitive quality at "substantially less training compute than larger T2I models." Cleanest MIT-licensed T2I at this param count.

image · Estimated
HIDREAM · 17B (same across all three; Dev and Fast are step-distilled, not pruned)

HiDream-I1 (Full / Dev / Fast)

17B MMDiT open-weights image foundation, MIT-licensed, SOTA-at-release. Full/Dev/Fast ladder distilled down the step count. The MIT license is a big unlock for production commercial pipelines vs FLUX dev.

image · Estimated
TONGYI-MAI (ALIBABA) · 6B (single-stream DiT, Decoupled-DMD distillation)

Z-Image-Turbo

Alibaba Tongyi's distilled 6B T2I that matches FLUX.2-dev quality in 8 steps. Bilingual English + Chinese text rendering. The community daily driver for Apache-2.0 image gen in 2026.

image · Estimated
NVLABS + MIT HAN LAB · 0.6B / 1.6B (linear-attention DiT)

SANA (0.6B / 1.6B)

NVLabs + MIT Han Lab's linear-attention diffusion transformer. Fastest image generation at any given quality tier in its class — 23–39× faster than FLUX-dev on the same hardware.

image · Estimated
STABILITY AI · 2.5B (MMDiT-X)

Stable Diffusion 3.5 Medium

Stability's consumer-friendly MMDiT-X text-to-image. Designed to run on consumer GPUs, mature ecosystem with thousands of LoRAs and ControlNets. The community "safe SD default."

image · Estimated

Voice — TTS, STT, and multimodal

HEXGRAD · 82M (StyleTTS2-based)

Kokoro-82M

An 82M TTS model that punches absurdly above its weight — ranked #1 on TTS Arena against 7B+ models, runs fine on CPU. The honest default for narration, read-aloud, voice-over.

voice · Estimated
OPENMOSS / MOSI.AI · 100M (0.1B)

MOSS-TTS-Nano (100M)

A 100M streaming-TTS that closes the multilingual gap Kokoro doesn't cover — 20 languages including English, Japanese, Korean, Spanish, French, Arabic, Mandarin, plus voice cloning from a short audio reference. 48 kHz stereo output, neural-audio-tokenizer + autoregressive LLM pipeline, runs real-time on 4 CPU cores. The ONNX build drops PyTorch entirely and gets ~2× the inference efficiency of the original.

Reference entry · Estimated
RESEMBLE AI · Turbo: 350M · Base: 0.5B-class

Chatterbox (Turbo + Multilingual)

SOTA open-source voice cloning — 5-second reference audio, paralinguistic tags `[laugh]` `[sigh]` `[cough]` native in Turbo, <150 ms latency, ethical PerTh watermarking baked in.

voice · Estimated
OPENBMB · 2B (diffusion-autoregressive, tokenizer-free; MiniCPM-4 backbone)

VoxCPM2 (2B)

Apache 2.0 TTS with 48 kHz output, short-clip zero-shot voice cloning, and natural-language "voice design" (describe a voice, get one — no reference audio required) across 30 languages.

voice · Estimated
STEPFUN · 8B (LALM)

Step-Audio 2 mini

StepFun's 8B speech-to-speech LALM trained on 8M+ hours of audio. Competitive with GPT-4o-audio on speech recognition + S2S translation benchmarks, fully open-source weights.

voice · Estimated
CANOPY LABS · 3B (Llama-backbone)

Orpheus-TTS 3B

Llama-backbone TTS tuned for naturalness and emotion. Multilingual FTs (Spanish / Italian / French / Hindi) released as research artifacts.

voice · Estimated
NVIDIA · 600M

Parakeet-TDT 0.6B v3

NVIDIA's high-throughput multilingual ASR — 25 European languages with auto language detection, handles 24-minute audio at full attention (3 h with local attention). Built for production batch transcription.

voice · Estimated
NVIDIA NEMO + M-BAIN (PIPELINE) · 2.5B (Canary) + 1.5B (WhisperX / Whisper-large-v3)

Canary-Qwen 2.5B + WhisperX

Canary-Qwen is an English-only ASR that doubles as a 2.5B LLM over its own transcripts — transcribe, then summarize/Q&A. WhisperX adds word-level timestamps + diarization. The near-frontier English-first pipeline.

voice · Estimated
M-BAIN (WHISPERX) · PYANNOTE (PIPELINE) · ~1.5B (Whisper large-v3)

WhisperX + pyannote 3.1

The canonical open-source pipeline for diarized transcription — wraps faster-whisper for ASR, wav2vec2 for alignment, pyannote for speaker segmentation.

voice · Estimated
SYSTRAN · ~809M (Whisper large-v3-turbo distilled)

faster-whisper large-v3-turbo

CTranslate2 reimplementation of OpenAI Whisper — 4× faster with int8 quantization, matches reference accuracy. The practical STT default.

voice · Estimated
OPENBMB · 8B (SigLip-400M + Whisper-medium + ChatTTS-200M + Qwen2.5-7B)

MiniCPM-o 2.6

GPT-4o-class omnimodal 8B — vision, speech input, speech output, voice cloning in one model. End-to-end with full-duplex live streaming.

voice · Estimated
OPENMOSS / MOSI.AI · ~9B total (8B LLM + audio encoder)

MOSS-Music-8B (Instruct + Thinking)

The first open-weight music-understanding LLM worth flagging — does lyrics ASR with time-aligned transcription, musical captioning, key/tempo/chord reasoning, structural analysis (intro/verse/chorus/bridge/outro), instrument + voice recognition, and music QA. Audio encoder runs at 12.5 Hz temporal resolution. 80.38% avg accuracy across 8 music-QA benchmarks; 15.88% avg WER/CER on lyrics; 4.36/5.0 MusicCaps captioning. Thinking variant adds chain-of-thought reasoning over audio.

Reference entry · Estimated
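
The WER/CER figure quoted above is an edit-distance ratio. A minimal sketch of word error rate follows, using whitespace tokenization; real evaluations usually normalize case and punctuation first, which is omitted here. The example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Made-up lyric line with one dropped word.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER = {wer:.3f}")
```

Character error rate is the same computation over characters instead of words, which is the usual choice for languages without whitespace word boundaries.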

Vision-language and video understanding

Embeddings — retrieval + RAG

Popular model checks

Start with the practical model names people actually need to run: Qwen3-Coder-30B-A3B for local coding, Microsoft Lens-Turbo for MIT-licensed image generation, and Qwen 3.5 35B-A3B for the 24 GB MoE tier.

Reverse lookup

Know the model, want to see which hardware runs it? Use Find-by-Model. Enter any pick and get the hardware options that naturally fit.