FAST TAKE · 2026-04-14 · VOXCPM2 (OPENBMB)
VoxCPM2 — the first open-weight TTS that designs voices from text alone
OpenBMB shipped a 2B Apache-2.0 TTS that does what no other open-weight model does — generate a voice from a natural-language description, no reference audio required. Plus 30 languages, 48 kHz output, tokenizer-free diffusion AR.
Verdict: Apache 2.0 TTS with text-only voice design, a new capability category
The take
OpenBMB released VoxCPM2 in mid-April 2026 — a 2B Apache 2.0 diffusion-autoregressive TTS with three capabilities most open-weight models can't combine: 48 kHz studio-quality output, 30-language coverage, and 'voice design' — generating a fresh voice from a natural-language description, with no reference audio.
The voice-design capability is the categorical novelty. Existing open-weight TTS picks (Kokoro, Chatterbox, Step-Audio) either ship fixed voice packs or require a reference audio sample for cloning. VoxCPM2 lets you write 'gravelly mid-50s warm storyteller voice' and get a usable, distinct voice out; each description yields its own consistent voice. For applications that need many distinct voices without curating reference samples, this is a real unlock.
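To make the "many voices, no reference samples" idea concrete, here is a minimal sketch of a cast defined purely as text descriptions. It does not use the VoxCPM API: `Synthesize`, `render_lines`, and `fake_synthesize` are hypothetical names for illustration, and the second description is made up. Swap `fake_synthesize` for whatever call your TTS stack actually exposes.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical backend signature: (voice_description, text) -> audio bytes.
# This is NOT the VoxCPM API; it only marks where a real call would go.
Synthesize = Callable[[str, str], bytes]

# One text description per character, in place of reference audio files.
CAST: Dict[str, str] = {
    "narrator": "gravelly mid-50s warm storyteller voice",  # from the article
    "guide": "bright, calm voice with a measured pace",     # illustrative only
}


def render_lines(
    script: Iterable[Tuple[str, str]],
    cast: Dict[str, str],
    synthesize: Synthesize,
) -> List[bytes]:
    """Synthesize each (character, line) pair with that character's description."""
    clips = []
    for character, line in script:
        description = cast[character]  # KeyError if the script names an unknown character
        clips.append(synthesize(description, line))
    return clips


def fake_synthesize(description: str, text: str) -> bytes:
    """Stand-in backend so the sketch runs without a model or GPU."""
    return f"{description}|{text}".encode("utf-8")


if __name__ == "__main__":
    script = [("narrator", "Once upon a time."), ("guide", "This way, please.")]
    for clip in render_lines(script, CAST, fake_synthesize):
        print(len(clip), "bytes")
```

The point of the design is that adding a character means adding one line of text rather than sourcing and clearing a recording. Whether the same description reproduces the same voice across separate calls is worth verifying against the model before relying on it.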
Practical: expect roughly 6–8 GB of VRAM at Q4. The backbone is MiniCPM-4. There is no Ollama route (architectural mismatch: this is a TTS model, not an LM). A single GitHub repo, `OpenBMB/VoxCPM` (18.9k stars), covers all three family members. Short-clip zero-shot voice cloning works alongside voice design for cases where you do want to mimic a specific voice. Step-down options: `VoxCPM1.5` (0.6B, 44.1 kHz, Jan 2026) for mid-VRAM; `VoxCPM-0.5B` (0.5B, 16 kHz, EN/ZH only, Sep 2025) for low-VRAM.
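The step-down choice can be sketched as a small lookup. This is an illustration only: the figures are the ones quoted above, `Variant` and `smallest_variant` are hypothetical names, and the language count for `VoxCPM1.5` is left as `None` because the text does not state it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    name: str
    params_b: float            # parameter count in billions, as quoted above
    sample_rate_hz: int        # output sample rate, as quoted above
    languages: Optional[int]   # None where the article gives no count


FAMILY = [
    Variant("VoxCPM2", 2.0, 48_000, 30),
    Variant("VoxCPM1.5", 0.6, 44_100, None),
    Variant("VoxCPM-0.5B", 0.5, 16_000, 2),  # EN/ZH only
]


def smallest_variant(min_sample_rate_hz: int, max_params_b: float) -> Optional[Variant]:
    """Return the smallest family member that meets the sample-rate floor
    and fits the parameter budget, or None if nothing qualifies."""
    candidates = [
        v for v in FAMILY
        if v.sample_rate_hz >= min_sample_rate_hz and v.params_b <= max_params_b
    ]
    return min(candidates, key=lambda v: v.params_b, default=None)


if __name__ == "__main__":
    pick = smallest_variant(min_sample_rate_hz=44_100, max_params_b=1.0)
    print(pick.name if pick else "no variant fits")
```

Parameter count stands in for VRAM here because the text gives a memory figure only for VoxCPM2; measure actual usage on your own hardware before committing to a variant.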
Where it sits: `voice.high` in the planner (paired with Step-Audio 2 mini and WhisperX + pyannote). For pure narration without cloning, Kokoro-82M is still smaller and runs CPU-only. For multilingual cloning on weaker hardware, MOSS-TTS-Nano (April 10) is the right pick. VoxCPM2 is for production audio where voice variety and Apache 2.0 commercial redistribution matter.
Where this fits
Models: VoxCPM2 (2B) · MOSS-TTS-Nano (100M) · Chatterbox (Turbo + Multilingual) · Kokoro-82M
Hardware: RTX 5060 Ti 16 GB · NVIDIA RTX 4090 · Mac Studio M4 Max 64 GB