Local voice — narration, cloning, transcription.
TTS Arena went multi-polar in March 2026: Fish Audio S2 Pro now leads at Elo 1128 but is licensed for non-commercial use only, while Kokoro-82M (Apache 2.0) dropped from #1 to mid-pack on quality yet remains the practical choice for English narration on CPU. Voice also runs differently from text. Ollama has no native TTS or STT, so you'll route through Open-WebUI plus dedicated servers.
Verdict — Strong at every tier — choose by license + use case, not just quality
Chatterbox-Turbo (Resemble AI, MIT, Dec 15 2025) for voice cloning. Sesame CSM-1B (Apache 2.0, Llama+Mimi backbone) for realtime conversational. Canary-Qwen + WhisperX for STT pipelines.
What's the answer at each tier
Qwen3-Omni-30B-A3B FP16 (unified vision + voice + text) for omnimodal workloads. Fish Audio S2 Pro (5B, non-commercial) if quality outranks license cleanliness. Production-grade STT: Canary-Qwen 2.5B (#1 on Open ASR Leaderboard at 5.63% WER) + WhisperX for diarization.
- Qwen3-Omni-30B-A3B FP16 (~60 GB of weights at 2 bytes per parameter, no quant) — Unified frontier multimodal voice: audio, video, image, and text in, speech out, at full FP16 precision. Apache 2.0. The single-model frontier pick.
- Step-Audio 2 mini FP16 (~16 GB) — StepFun's 8B end-to-end speech-to-speech LALM. Competitive with GPT-4o-audio on several benchmarks. Apache 2.0; full FP16 retains audio nuance.
- VibeVoice-Realtime 0.5B (top open-source TTS, ~300ms latency) — Rated highest quality among open-source on-device TTS in 2026 in community comparisons. Tiny, so frontier hardware can keep it loaded concurrently with everything else.
Chatterbox-Turbo (Resemble AI, MIT, Dec 15 2025) for voice cloning. Sesame CSM-1B (Apache 2.0, Llama+Mimi backbone) for realtime conversational. Canary-Qwen + WhisperX for STT pipelines.
- Qwen3-Omni-30B-A3B-Instruct — Apache 2.0 MoE; audio+video+image+text in, speech+text out; 17GB at Q4. Frontier unified voice.
- Chatterbox-Turbo (Resemble AI) — MIT; paralinguistic tags + fastest high-quality tier; voice clone in under 200ms.
- Canary-Qwen 2.5B + WhisperX — Canary tops HF Open ASR (5.63% WER English) paired with WhisperX for word-level timestamps + diarization.
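The 5.63% WER figure quoted for Canary-Qwen is a word error rate: substitutions, deletions, and insertions divided by the number of words in the reference transcript. The sketch below shows one way to compute it with a word-level edit distance. It is illustrative only: leaderboards typically normalize punctuation and number formatting before scoring, whereas this version only lowercases and splits on whitespace, so its numbers will not match published scores exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of five reference words.
    print(word_error_rate("the quick brown fox jumps",
                          "the quick brown box jumps"))
```

Running the same function over your own recordings and reference transcripts is a quick way to check whether a leaderboard ranking holds for your audio.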
VoxCPM2 (2B Apache 2.0, "voice design" from text alone — generate voices from natural-language descriptions, no reference audio required). Step-Audio 2 mini (8B Apache 2.0) for end-to-end speech-to-speech. WhisperX + pyannote 3.1 for diarized transcription.
- VoxCPM2 (2B, Apache 2.0) — 30 languages, 48 kHz, tokenizer-free diffusion AR; voice design from text. April 2026 release.
- Step-Audio 2 mini (8B, Apache 2.0) — Unified speech-to-speech; competitive with GPT-4o-audio on several benchmarks; ~16GB FP16.
- WhisperX + pyannote 3.1 — Whisper large-v3 alignment + speaker diarization in one pipeline; the only pick for multi-speaker transcripts.
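A diarized transcript combines two outputs: word-level timestamps from the aligned ASR pass and speaker turns from the diarizer. The sketch below illustrates the merge step by giving each word the speaker whose turn overlaps it the most. It is a simplified stand-in, not WhisperX's or pyannote's actual code; the `Word` and `Turn` types and the `SPEAKER_00`-style labels are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float


def overlap(word: Word, turn: Turn) -> float:
    """Seconds during which the word and the speaker turn coincide."""
    return max(0.0, min(word.end, turn.end) - max(word.start, turn.start))


def assign_speakers(words: List[Word],
                    turns: List[Turn]) -> List[Tuple[Optional[str], str]]:
    """Label each word with the speaker whose turn overlaps it the most."""
    labeled = []
    for word in words:
        best = max(turns, key=lambda t: overlap(word, t), default=None)
        # Words that fall outside every turn stay unlabeled (None).
        has_overlap = best is not None and overlap(word, best) > 0
        labeled.append((best.speaker if has_overlap else None, word.text))
    return labeled


if __name__ == "__main__":
    words = [Word("hello", 0.0, 0.4), Word("there", 0.5, 0.9), Word("hi", 0.9, 1.4)]
    turns = [Turn("SPEAKER_00", 0.0, 1.0), Turn("SPEAKER_01", 1.0, 2.0)]
    for speaker, text in assign_speakers(words, turns):
        print(speaker, text)
```

Words that straddle a turn boundary go to whichever speaker covers more of the word, which is why accurate word timestamps matter for multi-speaker transcripts.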
Chatterbox Multilingual (MIT) for cloning + Parakeet-TDT 0.6B v3 (NVIDIA, multilingual European, CC-BY-4.0) + Orpheus-TTS 3B (Apache 2.0) for narration.
- Chatterbox Multilingual (Resemble AI) — MIT; 23 languages; voice cloning + emotion dial; pip 0.1.7 (March 2026) shows active development.
- Parakeet-TDT 0.6B v3 (NVIDIA) — CC-BY-4.0; 10× faster than Whisper turbo on English + 25 European langs; no CJK/Arabic/Hindi coverage.
- Orpheus-TTS 3B — Apache 2.0; Llama-3 backbone; ~200ms streaming; 8 English voices + zero-shot clone.
Kokoro-82M (Apache 2.0) is the right Apache-clean default for English narration: the community daily driver, with 54 voices and real-time CPU inference at 82M params. MOSS-TTS-Nano (100M, Apache 2.0, April 2026) closes the multilingual and cloning gap on 4 CPU cores. Use faster-whisper for STT.
- Kokoro-82M (Apache 2.0) — Community daily driver for English TTS; CPU-real-time at 82M params; v1.0 with 8 languages and 54 voices. No voice cloning.
- MOSS-TTS-Nano (100M, Apache 2.0) — April 13 2026 release fills the multilingual + voice-clone gap Kokoro doesn't cover — 20+ languages, 48 kHz, real-time on 4 CPU cores. ONNX build (April 17) drops PyTorch entirely.
- faster-whisper large-v3-turbo (int8) — MIT; 99 languages; 4× faster than vanilla Whisper; the STT default at this tier.
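The size figures quoted across the tiers follow from simple arithmetic: parameter count times bits per parameter, divided by 8 to get bytes. The sketch below reproduces two of them. The 4.5 bits-per-parameter value is an assumed average for a Q4-style quant, not a figure from any model card, and the result covers weights only, excluding activations, caches, and any separate audio codec or encoder.

```python
def weight_footprint_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    # params_billion * 1e9 parameters * (bits / 8) bytes each, divided by 1e9.
    return params_billion * bits_per_param / 8


if __name__ == "__main__":
    # Step-Audio 2 mini: 8B parameters at FP16 (16 bits each).
    print(weight_footprint_gb(8, 16))
    # Qwen3-Omni-30B-A3B at an assumed ~4.5 bits per parameter for Q4.
    print(weight_footprint_gb(30, 4.5))
    # Kokoro-82M at FP16: small enough for CPU-only machines.
    print(weight_footprint_gb(0.082, 16))
```

Leave headroom on top of these numbers when you plan to keep a TTS model, an STT model, and a text LLM loaded at the same time.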
How to actually run it
Open-WebUI is the orchestration layer, fronting a separate TTS server (Kokoro / Piper / Chatterbox) and a separate STT server (faster-whisper / Parakeet / WhisperX). Ollama does NOT handle voice natively. Most picks install through Python or a CLI rather than as GGUF files, so expect a different friction profile from text LLMs.
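Before wiring a TTS server into Open-WebUI, it helps to confirm the server answers on its own. The sketch below assumes the server exposes an OpenAI-style `/v1/audio/speech` route, which several community wrappers do, but check your server's documentation. The base URL, port, model name, and voice id are placeholders, not values confirmed by this guide.

```python
import json
import urllib.request


def build_speech_request(text: str,
                         base_url: str = "http://localhost:8880/v1",  # assumed address
                         model: str = "kokoro",                       # placeholder name
                         voice: str = "af_heart") -> urllib.request.Request:  # placeholder voice
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "mp3",
    }
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str, **kwargs) -> str:
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return out_path


if __name__ == "__main__":
    # Requires a TTS server listening at the assumed address.
    print(synthesize("Local narration test.", "test.mp3"))
```

If the standalone call returns playable audio, the same base URL can be entered in Open-WebUI's audio settings. The STT server can be checked the same way against its own endpoint.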
Watchouts
- TTS Arena ranking ≠ best choice. Fish Audio S2 Pro leads quality (Elo 1128) but is research/non-commercial; Step-Audio EditX (~#2) is also restricted. Apache-2.0 picks (Kokoro, Sesame CSM, MOSS-TTS-Nano, VoxCPM2) sit lower on quality but are commercial-clean.
- Voice cloning ethics: most cloning models require explicit consent and watermarking. Chatterbox has PerTh watermarking baked in. Use responsibly — every voice-cloning section on this site assumes that.
- VoxCPM2 (April 2026) is currently the only "voice design from text alone" open-weight TTS — write "gravelly mid-50s warm storyteller" and get a usable distinct voice. Categorical novelty as of May 2026.
- MiniCPM-V-4.6 (May 15 2026) is the latest in the V-line (vision-only) — distinct from MiniCPM-o which is the omni line (vision + voice). Pick V for image-language; o for full multimodal.
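The first watchout, license before leaderboard rank, can be applied mechanically: filter candidates by license, then compare quality among what remains. The sketch below uses license labels as stated in this guide. The `Model` type, the task tags, and the `COMMERCIAL_OK` set are illustrative assumptions; verify each model's actual license text before shipping.

```python
from dataclasses import dataclass
from typing import List

# Licenses treated here as permitting commercial use (an assumption for this
# sketch; CC-BY-4.0 additionally requires attribution).
COMMERCIAL_OK = {"Apache-2.0", "MIT", "CC-BY-4.0"}


@dataclass(frozen=True)
class Model:
    name: str
    license: str
    task: str  # illustrative tag: "tts", "stt", "clone", "conversational"


CATALOG = [
    Model("Fish Audio S2 Pro", "non-commercial", "tts"),
    Model("Kokoro-82M", "Apache-2.0", "tts"),
    Model("VoxCPM2", "Apache-2.0", "tts"),
    Model("Chatterbox-Turbo", "MIT", "clone"),
    Model("Sesame CSM-1B", "Apache-2.0", "conversational"),
    Model("Parakeet-TDT 0.6B v3", "CC-BY-4.0", "stt"),
]


def commercial_picks(catalog: List[Model], task: str) -> List[str]:
    """Names of models for a task whose license allows commercial use."""
    return [m.name for m in catalog
            if m.task == task and m.license in COMMERCIAL_OK]


if __name__ == "__main__":
    print(commercial_picks(CATALOG, "tts"))
```

Filtering first keeps a restricted model out of the comparison, so its arena ranking never enters a commercial shortlist.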
When cloud still wins
Cloud still wins when you need studio-level quality in the class of ElevenLabs, OpenAI Voice, or Google Cloud Speech, or when you don't want to manage a 3-server stack. For most narration and transcription workflows, the open-weight options at the low and mid tiers match cloud quality at zero per-token cost.
Next step
Try the planner with Voice · TTS · STT preselected →
The planner pulls all six dimensions together (your hardware, your VRAM/RAM, your GPU family, your context, and your priorities) and returns specific picks with fit badges.
Notes flagged for next refresh
Flagged for next quarterly refresh: Fish Audio S2 Pro (5B non-commercial, #1 TTS Arena Mar 2026), Sesame CSM-1B (Apache 2.0 realtime conversational, March 2025), Step-Audio EditX (4B, Feb 2026), NVIDIA Magpie-Multilingual 357M (Feb 2026), Parakeet-Realtime-EOU 120M-v1 (80-160ms streaming). Plus the new MOSS-Audio-8B family (Apr 14 2026 — audio understanding) and MOSS-Music-8B (May 1 2026 — music understanding). Music generation locally remains a gap.