GUIDE · VOICE · JUNE 2026
The local voice stack actually works in 2026.
For two years, “run voice locally” meant putting up with robotic narration and slow Whisper transcription, and accepting that anything multimodal had to call OpenAI. That’s no longer true. As of April 2026, an open-weights stack on consumer hardware (mostly Apache 2.0, with MIT and CC-BY-4.0 components) does English narration, multilingual TTS with cloning, and production-grade transcription with diarization: all local, all open weights, all under 24 GB VRAM.
Here’s the verdict-first stack we’d wire up today, picked by tier, and the watchouts that aren’t obvious until you’ve burned three afternoons on them.
The verdict
Run Open-WebUI as the integration layer. Not Ollama — Ollama doesn’t do voice natively. Open-WebUI handles the TTS / STT routing and lets you mix and match the components below.
For English narration: Kokoro-82M runs in real time on CPU and tops TTS Arena. v1.0 covers 8 languages and 54 voices. Apache 2.0. No voice cloning, so accept that and use it for what it’s for.
For multilingual TTS or voice cloning on weak hardware: MOSS-TTS-Nano (100M) shipped April 13, 2026 and fills the gap Kokoro leaves: 20+ languages, 48 kHz output, voice cloning from short reference audio, and real-time synthesis on 4 CPU cores via the ONNX build.
For high-fidelity voice cloning on a real GPU: Chatterbox-Turbo (Resemble AI, MIT) needs only 5 seconds of reference audio, supports paralinguistic tags such as [laugh] and [sigh] natively, and runs at <150 ms latency. It is under active development; the pip release was 0.1.7 as of March 2026.
For transcription: Parakeet-TDT 0.6B v3 (NVIDIA, CC-BY-4.0) on European languages — 4× faster than Whisper turbo, 25 European languages with auto detection. For Mandarin / Arabic / Hindi or anything outside that set, fall back to faster-whisper large-v3-turbo (int8).
For meeting transcripts with speakers: WhisperX + pyannote 4.0. Whisper for words, pyannote for “who said what.” The only honest pick for multi-speaker pipelines.
For unified speech-to-speech on 24 GB+: Qwen3-Omni-30B-A3B takes text, audio, image, and video in and emits text and speech out. It needs ~16 GB VRAM at AWQ-4bit and is the only locally runnable model that does real-time streaming speech-out natively.
Why local voice is genuinely viable now
- Apache-2.0 frontier TTS exists. Kokoro is Apache 2.0. MOSS-TTS-Nano is Apache 2.0. VoxCPM2 is Apache 2.0. The license-clean commercial-redistribution path is no longer Stability-flavored or vendor-locked.
- CPU inference for narration is real. Kokoro does real-time TTS at 82M params on a laptop CPU. MOSS-TTS-Nano does the same multilingually on 4 cores. You don’t need a GPU for narration anymore.
- Voice cloning quality crossed the threshold. In private testing, Chatterbox-Turbo’s 5-second clones were good enough for most paid-narration products. Ethical PerTh watermarking is baked in.
- Multilingual STT got fast. Parakeet-TDT 0.6B v3 transcribes 24-minute audio in a single pass at full attention, auto-detects the language across 25 European languages, and runs 4× faster than Whisper turbo at the same quality.
The four workflows that actually work locally
1. Long-form narration. Audiobook / podcast / read-aloud. Kokoro on CPU is the right answer. MOSS-TTS-Nano if multilingual. Don’t reach for anything bigger; you don’t need it.
2. Voice cloning for production audio. Chatterbox-Turbo on GPU (~4–6 GB at Q4) for English with paralinguistic tags; Chatterbox Multilingual for 23 non-English languages; VoxCPM2 (2B) if you need 30 languages and tokenizer-free voice design from text alone.
3. Transcription with diarization. WhisperX + pyannote 4.0 is the production standard. Use faster-whisper int8 if you don’t need word-level timestamps, and Parakeet for European-language batch jobs where throughput matters more than accuracy on long-tail inputs.
4. Real-time voice agents. Qwen3-Omni on a 24 GB+ card is the only locally-realistic path today. MiniCPM-o 2.6 (8B int4, ~7 GB) for laptop GPUs at lower fidelity. Below 16 GB VRAM, route audio out via Open-WebUI to a separate TTS server — the unified models don’t fit.
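To make workflow 3 concrete, here is a minimal sketch of the merge step: the transcriber yields timed words, the diarizer yields timed speaker turns, and each word takes the speaker whose turn overlaps it most. This is an illustration of the idea in plain Python, not WhisperX's or pyannote's actual code; the `Word` and `Turn` types and the sample timings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the most."""
    labeled = []
    for w in words:
        best = max(
            turns,
            key=lambda t: overlap(w.start, w.end, t.start, t.end),
            default=None,
        )
        if best is None or overlap(w.start, w.end, best.start, best.end) == 0.0:
            labeled.append((None, w.text))  # no turn covers this word
        else:
            labeled.append((best.speaker, w.text))
    return labeled


def to_transcript(labeled):
    """Collapse consecutive words from the same speaker into one line."""
    lines = []
    for speaker, text in labeled:
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(text)
        else:
            lines.append((speaker, [text]))
    return [f"{spk or 'UNKNOWN'}: {' '.join(ws)}" for spk, ws in lines]


# Hypothetical sample data standing in for real transcriber/diarizer output.
words = [Word("hello", 0.0, 0.4), Word("there", 0.4, 0.8), Word("hi", 1.0, 1.2)]
turns = [Turn("SPEAKER_00", 0.0, 0.9), Turn("SPEAKER_01", 0.9, 1.5)]
transcript = to_transcript(assign_speakers(words, turns))
```

With the sample data, `transcript` is `["SPEAKER_00: hello there", "SPEAKER_01: hi"]`. Word-level timestamps are what make this merge possible, which is why the faster-whisper-only path gives up speaker labels.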
The runner — Open-WebUI, not Ollama
Ollama is the right runner for text-only LLMs but ships no native voice path. Open-WebUI is what you actually want for voice: it speaks the OpenAI Realtime API on the outside, routes TTS and STT to whichever engines you wire up (Piper, Kokoro, or Chatterbox for speech; faster-whisper for transcription), and keeps the LLM tier unchanged.
Most TTS / STT models ship as either pip packages (chatterbox-tts, kokoro, faster-whisper) or HuggingFace ONNX/Diffusers workflows. Plan on running each as a small REST service behind Open-WebUI rather than expecting a single binary to handle everything.
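As a rough sketch of the "small REST service" pattern, the standard-library server below exposes one OpenAI-style `POST /v1/audio/speech` route and hands the text to a pluggable `synthesize` function. Everything here is an assumption for illustration: the placeholder engine only emits silence, the function and class names are hypothetical, and you should confirm which request fields your Open-WebUI version actually sends before relying on it.

```python
import io
import json
import threading
import wave
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def synthesize(text: str, voice: str) -> bytes:
    """Placeholder engine: silent 24 kHz mono WAV, 50 ms per character.

    Swap this body for a real call into Kokoro, Chatterbox, or similar.
    """
    sample_rate = 24000
    n_frames = sample_rate // 20 * max(len(text), 1)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()


class SpeechHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/audio/speech":
            self.send_error(404, "unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            body = json.loads(self.rfile.read(length) or b"{}")
            text = body["input"]
            voice = body.get("voice", "default")
        except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
            self.send_error(400, "expected a JSON object with an 'input' field")
            return
        audio = synthesize(text, voice)
        self.send_response(200)
        self.send_header("Content-Type", "audio/wav")
        self.send_header("Content-Length", str(len(audio)))
        self.end_headers()
        self.wfile.write(audio)

    def log_message(self, *args):
        pass  # keep the sketch quiet; use real logging in production


def start_server(host="127.0.0.1", port=0):
    """Start the service on a background thread; port=0 picks a free port."""
    server = ThreadingHTTPServer((host, port), SpeechHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_server(port=8880)` (the port is arbitrary) gives you a base URL to point a TTS client at. Keeping each engine behind its own small process like this means one crashing model does not take the rest of the voice stack down with it.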
Watchouts
- Voice cloning has consent constraints. Chatterbox bakes in PerTh watermarking, but laws and platform rules vary. If you’re cloning anyone’s voice for production, get explicit consent in writing. This is non-optional.
- MiniCPM-o registration. Commercial use requires filling out a registration questionnaire; it is not a click-through Apache license. Document the path before you commit.
- Apple Silicon path is uneven. Kokoro and Whisper run well via CoreML/ANE; Chatterbox and Qwen3-Omni need PyTorch’s Metal (MPS) backend, which is slower than the CUDA path on comparable NVIDIA hardware. Test before you commit a Mac to a voice pipeline.
- Latency under 200 ms is hard. Most local stacks land 300–500 ms first-byte for streaming TTS. If you need true sub-200 ms voice agents, GPT-4o Realtime or hosted Inworld is still the better pick.
- Multilingual coverage is uneven. Parakeet covers European languages well but skips CJK and Arabic. Whisper covers everything but is slower. MOSS-TTS-Nano covers 20+ languages but only does TTS. Pick the engine that matches your language matrix instead of pretending one covers everything.
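One way to act on the language-matrix point is a small router that sends each job to Parakeet when the language is in its supported set and to faster-whisper otherwise. This is a sketch under stated assumptions: `PARAKEET_LANGS` is a deliberately partial, illustrative subset (check the model card for the full list of 25), and the engine labels are plain strings, not real API handles.

```python
# Partial, illustrative subset of Parakeet-TDT v3's European languages.
# Assumption: consult the model card for the authoritative list of 25.
PARAKEET_LANGS = frozenset({"en", "de", "fr", "es", "it", "nl", "pl", "pt"})

PARAKEET = "parakeet-tdt-0.6b-v3"
WHISPER = "faster-whisper large-v3-turbo (int8)"


def normalize(language):
    """Reduce tags like 'zh-CN' or 'pt_BR' to a lowercase base code."""
    return language.replace("_", "-").split("-")[0].strip().lower()


def pick_stt_engine(language=None):
    """Choose an STT engine for one job.

    Unknown languages go to Whisper, since it has the widest coverage.
    """
    if language is None:
        return WHISPER
    return PARAKEET if normalize(language) in PARAKEET_LANGS else WHISPER


def plan(languages):
    """Map every language in a project's matrix to its engine."""
    return {lang: pick_stt_engine(lang) for lang in languages}
```

For example, `plan(["de", "zh-CN", "ar"])` routes German to Parakeet and sends Mandarin and Arabic to the Whisper fallback, which matches the split described in the verdict.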
Hardware that runs this stack well
Tier 1 (under $700): CPU-only or RTX 5060 Ti 16 GB. Kokoro + Parakeet + MOSS-TTS-Nano all run here. Chatterbox-Turbo at Q4 fits 16 GB with room.
Tier 2 (24 GB): RTX 4090, used 3090, 7900 XTX, or RX 9070 XT (caveat: that card has 16 GB, not 24). Adds unified voice via Qwen3-Omni AWQ-4bit. The honest sweet spot for serious voice work.
Tier 3 (32 GB+): RTX 5090, M5 Max 64 GB+, Mac Studio M3 Ultra. Adds Step-Audio 2 mini at FP16 and the headroom to run multiple voice services concurrently behind Open-WebUI.
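The tiers above can be read as a cumulative lookup: each VRAM threshold adds components on top of everything below it. The sketch below encodes that reading. The thresholds are assumptions drawn from this guide's own numbers (6 GB is the upper end of the Chatterbox-Turbo Q4 estimate), not measured requirements, and the function name is hypothetical.

```python
# (minimum VRAM in GB, components unlocked at that level), per this guide.
# Assumption: thresholds are rough planning figures, not benchmarks.
UNLOCKS = [
    (0, ["Kokoro", "Parakeet", "MOSS-TTS-Nano"]),  # Tier 1, CPU-only works
    (6, ["Chatterbox-Turbo (Q4)"]),                # ~4-6 GB at Q4
    (24, ["Qwen3-Omni (AWQ-4bit)"]),               # Tier 2
    (32, ["Step-Audio 2 mini (FP16)"]),            # Tier 3
]


def components_for(vram_gb):
    """Return every component this guide expects to run at the given VRAM."""
    available = []
    for minimum, names in UNLOCKS:
        if vram_gb >= minimum:
            available.extend(names)
    return available
```

`components_for(0)` returns only the CPU-friendly trio, while `components_for(24)` adds Chatterbox-Turbo and Qwen3-Omni. Treat the output as a starting shortlist, since running several services concurrently needs headroom beyond any single model's footprint.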