Qwen3-ASR — a tiny Apache speech-recognition family that actually fits anywhere

Qwen shipped Qwen3-ASR on June 26 — a dedicated open-weight speech-recognition family (1.7B and 0.6B), Apache 2.0, built on the Qwen3-Omni audio stack. It does language identification plus ASR across 52 languages and dialects (30 languages + 22 Chinese dialects), and Qwen claims the 1.7B is state-of-the-art among open-source ASR and competitive with the strongest proprietary commercial APIs. Unlike most of what lands in this feed, this one is genuinely local for everyone: ~4 GB at fp16 for the 1.7B, ~1.5–2 GB for the 0.6B.

Verdict: Qwen's first dedicated open-weight ASR family (Apache 2.0, 52 languages and dialects), small enough to run on a laptop or even a CPU.

The take

The facts, verified against the Hugging Face model cards (`Qwen/Qwen3-ASR-1.7B-hf` and `Qwen/Qwen3-ASR-0.6B-hf`, both created 2026-06-26, Apache 2.0): two sizes, transformers-native (also vLLM/SGLang), built on the Qwen3-Omni audio foundation. The card lists 52 languages and dialects with multi-accent English coverage, and frames the 1.7B as state of the art among open ASR models and competitive with the best commercial APIs. Treat the "competitive with commercial APIs" line as a vendor claim until third-party WER numbers land, but the shape is unusual: a permissively licensed, multilingual, sub-2B recognizer from a frontier lab.

Why it matters: the open STT shelf has been a patchwork — faster-whisper as the practical default, Canary-Qwen for English-plus-reasoning, Parakeet for multilingual, WhisperX bolted on for diarization. A clean Apache family from Qwen, small enough for CPU or any consumer GPU, with day-one transformers support, is a real addition rather than a re-quant. The 0.6B in particular is interesting for edge/on-device transcription where Whisper-large is too heavy.
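A rough sanity check on the size figures quoted at the top. This is weights-only arithmetic, assuming 2 bytes per parameter at fp16 and taking the nominal parameter counts (1.7B and 0.6B) at face value; the exact counts are not stated here, and nothing below is a measured footprint.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    total_bytes = params_billions * 1e9 * bytes_per_param
    return total_bytes / 1e9


# Nominal sizes taken from the model names (assumption: real counts may differ).
NOMINAL_SIZES = {"Qwen3-ASR-1.7B": 1.7, "Qwen3-ASR-0.6B": 0.6}

for name, params in NOMINAL_SIZES.items():
    fp16_gb = weight_memory_gb(params, bytes_per_param=2)  # fp16 = 2 bytes
    print(f"{name}: about {fp16_gb:.1f} GB of weights at fp16")
```

That works out to roughly 3.4 GB and 1.2 GB for the weights alone. The gap to the ~4 GB and ~1.5–2 GB figures is plausibly runtime overhead (activations, audio buffers, framework), though that split is an assumption rather than something the model cards state.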

Our call: added as a model entry on the voice shelf alongside faster-whisper and Canary-Qwen. We are not swapping a planner STT pick this sweep: the existing recommendations are well-proven and we have no independent WER comparison yet, so the honest framing is "verify on your own audio before replacing a production pipeline." Both sizes carry the same Apache 2.0 license, so the choice is footprint versus accuracy: the 0.6B if you want the smallest credible recognizer, the 1.7B if accuracy matters more. It is still recognition-only: pair it with WhisperX/pyannote for diarization and word-level timestamps.
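For the "verify on your own audio" step, a minimal sketch of a word error rate (WER) comparison in plain Python. It assumes you already have reference transcripts and each system's output as strings; the sample sentences and the `current_outputs` / `candidate_outputs` names are hypothetical placeholders, and the normalization (lowercase, whitespace split) is deliberately crude.

```python
def word_edits(reference: list[str], hypothesis: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hypothesis) + 1))  # distances for an empty reference
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]  # deleting the first i reference words
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def normalize(text: str) -> list[str]:
    """Crude normalization: lowercase and split on whitespace."""
    return text.lower().split()


def corpus_wer(pairs: list[tuple[str, str]]) -> float:
    """Total word edits divided by total reference words over (ref, hyp) pairs."""
    edits = sum(word_edits(normalize(r), normalize(h)) for r, h in pairs)
    ref_words = sum(len(normalize(r)) for r, _ in pairs)
    if ref_words == 0:
        raise ValueError("reference transcripts contain no words")
    return edits / ref_words


# Hypothetical data: replace with transcripts of your own audio.
references = ["turn the kitchen lights off", "schedule a call for nine thirty"]
current_outputs = ["turn the kitchen light off", "schedule a call for nine thirty"]
candidate_outputs = ["turn the kitchen lights off", "schedule call for nine thirty"]

print("current WER:  ", corpus_wer(list(zip(references, current_outputs))))
print("candidate WER:", corpus_wer(list(zip(references, candidate_outputs))))
```

A real evaluation would want more careful text normalization (punctuation, numerals, casing conventions per language) and enough audio per language or accent for the difference to mean something; a handful of clips will not separate two strong recognizers.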

Where this fits

Models: Qwen3-ASR (1.7B / 0.6B) · Canary-Qwen 2.5B + WhisperX · Parakeet-TDT 0.6B v3 · faster-whisper large-v3-turbo

Hardware: Mac Mini M4 16 GB · NVIDIA RTX 3060 12 GB · Intel Arc B580 12 GB
