Nemotron 3.5 ASR — real-time streaming speech-to-text, 40 locales, in 600M params

NVIDIA shipped Nemotron 3.5 ASR Streaming 0.6B on June 5–6 — a 600M-parameter cache-aware streaming speech-to-text model covering 40 language-locales from a single checkpoint, with configurable latency (80 ms to 1,120 ms chunks) and built-in punctuation + capitalization. It is small, fast, and runs comfortably on modest hardware. The one catch for us: it ships under the OpenMDW 1.1 license, which isn't OSI-permissive — so it earns a watch, not a planner-pick slot.

Verdict: A genuinely useful 600M real-time streaming STT model that runs on modest hardware — held off the planner picks only by its non-OSI license

The take

The facts, verified against the Hugging Face model card and NVIDIA's fine-tuning blog: `nvidia/nemotron-3.5-asr-streaming-0.6b` is a 600M-parameter streaming ASR model supporting 40 language-locales (19 transcription-ready, 13 broad-coverage, 8 adaptation-ready) from one checkpoint, with five selectable latency modes (chunk sizes of 1/2/4/7/14 frames at 80 ms each → 80–1,120 ms) and integrated punctuation/capitalization. The card and the "Fine-Tune Nemotron 3.5 ASR" blog are dated 2026-06-05/06-06. License is OpenMDW 1.1 — NVIDIA's open-weights-and-data license, more open than weights-only releases but not classic Apache/MIT.

Why it matters for the voice use case: the differentiator is cache-aware streaming with tunable latency. Most strong open STT today is batch-oriented — Parakeet-TDT (NVIDIA, CC-BY-4.0) and Whisper derivatives transcribe a clip; this is built for live, low-latency transcription where you trade a little accuracy for a smaller chunk size. At 600M it runs on essentially anything in the planner's hardware range, including modest laptops and mini-PCs, and the family's usual CoreML/ONNX paths make Apple-Silicon and edge deployment realistic.

Our call: no planner-pick change. Our voice STT picks stay the OSI-clean set — Parakeet-TDT (CC-BY-4.0), faster-whisper, WhisperX + diarization, Canary-Qwen — because we hold a permissive-license bar for picks, and OpenMDW 1.1 doesn't clear it. But if you specifically need real-time streaming transcription (live captions, voice agents, meeting transcription as it happens) and the OpenMDW terms work for your use, Nemotron 3.5 ASR is the most interesting small streaming-STT option to land this quarter. Worth a try; not (yet) a default recommendation.

Where this fits

Models: Parakeet-TDT 0.6B v3 · faster-whisper large-v3-turbo · WhisperX + pyannote 3.1 · Canary-Qwen 2.5B + WhisperX

Hardware: RTX 5060 Ti 16 GB · Mac Mini M4 16 GB · Minisforum UM890 Pro

Sources

Next step

Try this in the planner→