Nemotron-3 Ultra — NVIDIA ships the best US open-weight model, fast, and fully open

NVIDIA released Nemotron-3 Ultra 550B-A55B on June 4 (announced at Jensen Huang's Computex keynote on June 1) — a 550B-total / 55B-active hybrid Mamba-Transformer MoE under the new OpenMDW 1.1 license, with 1M context and native NVFP4. Artificial Analysis scores it 48 on its Intelligence Index: the strongest US/Western open-weight model to date — ahead of Gemma 4 31B (39) and gpt-oss-120b (33), but behind the Chinese-led frontier (Kimi K2.6 at 54). The headline is speed-for-intelligence: 300+ tok/s, several times faster than DeepSeek/Kimi peers.

Verdict: The strongest Western open-weight model — genuinely open (weights + data + recipes), but 550B big-iron

The take

The facts, verified against the Hugging Face model card and Artificial Analysis: Nemotron-3 Ultra has ~550B total parameters with 55B active per token (≈90% sparse). It is built on NVIDIA's LatentMoE architecture, a hybrid of Mamba-2, MoE, and attention layers with Multi-Token Prediction, and was pretrained natively in NVFP4. The context window is up to 1M tokens, and the long-context scores show the window is usable rather than nominal (RULER-1M 76.8, RULER-512K 84.5). It ships as Base-BF16, instruct BF16, NVFP4, and GenRM variants under the OpenMDW License v1.1, NVIDIA's "Open Model, Weights & Data" license, which opens weights, training data, and recipes (more than the weights-only releases that dominate the open shelf). It was released on June 4, 2026; Unsloth dynamic GGUFs were up within a day.
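To make the sparsity figure concrete, here is a minimal Python sketch of the arithmetic behind "550B total / 55B active". The parameter counts come from the article; the compute estimate uses the common rule of thumb of about 2 FLOPs per active parameter per token, which is an assumption here and not a number from the model card (hybrid Mamba and attention layers will deviate from it).

```python
# Figures quoted in the article.
TOTAL_PARAMS = 550e9
ACTIVE_PARAMS = 55e9

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS  # share of weights used per token
sparsity = 1 - active_fraction                  # share of weights idle per token

# Assumption (rule of thumb, not from the model card): a forward pass costs
# roughly 2 FLOPs per active parameter per generated token.
approx_flops_per_token = 2 * ACTIVE_PARAMS

print(f"active: {active_fraction:.0%}, sparse: {sparsity:.0%}")
print(f"~{approx_flops_per_token / 1e9:.0f} GFLOPs per generated token")
```

Under that assumption, per-token compute tracks the 55B active parameters, while memory still has to hold all 550B.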

Why it matters: 48 is the highest Artificial Analysis Intelligence Index score yet for a US open-weight model, clear of Gemma 4 31B (39), NVIDIA's own Nemotron 3 Super (36), and gpt-oss-120b (33). It is still behind the Chinese-led open frontier (Kimi K2.6 at 54), so the honest framing is "best Western open model, #2 to the Chinese frontier." The genuinely new thing is speed: the hybrid Mamba-Transformer design, NVFP4, and 90% sparsity let pre-release endpoints serve it at 300+ tok/s, where 550B-class peers from DeepSeek and Moonshot are typically served at 50–100 tok/s. That is fast frontier-ish intelligence, not just a big model.
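The "several times faster" claim follows directly from the quoted serving rates. The sketch below works through it in Python, treating "300+" as a floor; the 10,000-token reasoning trace is a hypothetical length chosen only for illustration.

```python
# Throughput figures quoted in the article (tokens per second).
ULTRA_TOK_S = 300             # "300+" treated as a floor
PEER_TOK_S_RANGE = (50, 100)  # typical serving range for 550B-class peers


def generation_seconds(num_tokens: int, tok_per_s: float) -> float:
    """Wall-clock seconds to stream num_tokens at a steady decode rate."""
    return num_tokens / tok_per_s


speedup_low = ULTRA_TOK_S / PEER_TOK_S_RANGE[1]   # against the fastest peer rate
speedup_high = ULTRA_TOK_S / PEER_TOK_S_RANGE[0]  # against the slowest peer rate

TRACE_TOKENS = 10_000  # hypothetical reasoning-trace length
ultra_s = generation_seconds(TRACE_TOKENS, ULTRA_TOK_S)
peer_s = tuple(generation_seconds(TRACE_TOKENS, r) for r in PEER_TOK_S_RANGE)

print(f"speedup: {speedup_low:.0f}x to {speedup_high:.0f}x")
print(f"{TRACE_TOKENS} tokens: {ultra_s:.0f}s vs {peer_s[1]:.0f}s to {peer_s[0]:.0f}s")
```

On these numbers the gap is 3x to 6x, which for a long reasoning trace is the difference between roughly half a minute and several minutes of waiting.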

The local reality, stated plainly: this is datacenter iron. NVIDIA's own minimum at BF16 is 8× B200, 16× H100, or 8× H200. Unsloth's dynamic GGUFs go down to ~1–2 bit (the 1-bit build is ~189 GB on disk), so a heavily quantized run is technically possible at the very top of the local ladder, on a multi-GPU server or a 256 GB+ unified-memory box. But even that smallest 1-bit build overflows a single 128 GB rig, and aggressive quantization on a reasoning model costs real quality. For the planner's hardware tiers (up to ~128 GB DGX Spark / dual-5090), it does not land as a usable local pick.
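The sizing argument can be checked with a few lines of Python. This is a rough weight-only sketch under stated assumptions: decimal gigabytes, the article's 550B parameter count and ~189 GB file size, and no allowance for KV cache, activations, or runtime overhead, so real requirements are higher than what it reports.

```python
TOTAL_PARAMS = 550e9  # total parameter count quoted in the article
ONE_BIT_BUILD_GB = 189  # on-disk size of the smallest dynamic GGUF, per the article


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes at a given average bit width."""
    return params * bits_per_weight / 8 / 1e9


def effective_bits(size_gb: float, params: float) -> float:
    """Average bits per weight implied by an on-disk size."""
    return size_gb * 1e9 * 8 / params


bf16_gb = weight_gb(TOTAL_PARAMS, 16)
one_bit_build_bits = effective_bits(ONE_BIT_BUILD_GB, TOTAL_PARAMS)

# Memory tiers mentioned in the article, taken as upper bounds in GB.
tiers = {"128 GB rig": 128, "256 GB unified box": 256}
fits = {name: ONE_BIT_BUILD_GB <= cap for name, cap in tiers.items()}

print(f"BF16 weights: {bf16_gb:.0f} GB")
print(f"'1-bit' build averages {one_bit_build_bits:.2f} bits per weight")
print(fits)
```

The implied average of about 2.75 bits per weight is consistent with a dynamic quant keeping some tensors at higher precision than the nominal bit width, and it shows why the file cannot fit in 128 GB even before any runtime memory is counted.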

Our call: no planner-pick change. It is out of reach for the site's hardware tiers and sits behind Kimi K2.6 on raw intelligence. But it is the most significant open-weight drop of the window and a real milestone for open AI: a US lab posting the best Western open-weight score, under a license that opens weights, data, and recipes, at genuinely high throughput. If you run datacenter-class hardware or want the fastest open frontier-reasoning model available, it is the one to watch. For everyone running on a single consumer rig, your local coding/chat picks (Qwen 3.6-35B-A3B, Qwen3-Coder-30B-A3B, GLM-5.1) are unchanged.

Where this fits

Models: Kimi K2.6 · Gemma 4 (31B dense + 26B A4B MoE + 12B multimodal) · Command A+ (218B-A25B) · DeepSeek V4-Flash

Hardware: NVIDIA DGX Spark · Dual RTX 5090 · Mac Studio M3 Ultra 96 GB
