the AI bench
VERIFIED MAY 2026

FAST TAKE · 2026-05-01 · MOSS-MUSIC-8B (OPENMOSS)

MOSS-Music-8B — open-weight music understanding lands at usable accuracy

OpenMOSS released MOSS-Music-8B (Instruct + Thinking) on May 1 — an Apache 2.0 audio-text-to-text model that does lyrics ASR with time-aligned transcription, music captioning, key/tempo/chord reasoning, structural analysis, instrument recognition, and music QA at production accuracy. 80.38% average on music-QA benchmarks; 4.36/5.0 on MusicCaps captioning. No open-weight model previously covered this category usably.

Verdict: First open-weight music-understanding LLM worth flagging — lyrics ASR, structure, key/tempo reasoning, 80% on music QA


The take

The category is the editorial moment, not the score. Music UNDERSTANDING (audio in, structured text out) has been a hosted-only capability up to this drop. Suno and Udio do generation; no open-source model has done analysis (key, tempo, structure, lyric transcription, instrument recognition) at a level you'd actually deploy. MOSS-Music-8B clears that bar with one weight set, Apache 2.0, ~6–8 GB at Q4.

What it does specifically: lyrics ASR with time alignment (15.88% WER/CER average, competitive with hosted ASR for sung audio, which is materially harder than spoken), music captioning (4.36/5.0 on MusicCaps), key + tempo + chord identification, intro/verse/chorus/bridge/outro structural segmentation, instrument and voice recognition, long-form QA over a full track. The audio encoder runs at 12.5 Hz temporal resolution. The Thinking variant adds chain-of-thought reasoning over the audio for harder analytical queries.
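For readers weighing the 15.88% figure: WER is word-level edit distance divided by reference length. A minimal sketch of the standard computation (illustrative only, not OpenMOSS's evaluation code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution
    return dist[len(ref)][len(hyp)] / len(ref)

# One substituted word against a 5-word reference: WER = 0.2
print(word_error_rate("the rain in spain stays", "the rain in spain falls"))
```

A 15.88% WER on sung vocals means roughly one word in six is wrong, which is why "competitive with hosted ASR for sung audio" is the right framing rather than "solved."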

What it doesn't do: generate music. This is analysis-only — audio file in, text out. For text-to-music you still need hosted Suno / Udio (closed) or Stable Audio (open-weight, separate model). And the lyrics ASR is English-leaning — non-Latin-script vocals likely degrade, though OpenMOSS hasn't published a language matrix yet.

Practical: recommended runtime is SGLang Serving (`sglang serve --model-path ./weights/MOSS-Music-8B-Instruct --trust-remote-code`). PyTorch 2.9+cu128, FlashAttention 2 optional, FFmpeg 7, Python 3.12. Gradio app available for local UI. No GGUF or Ollama path yet — the audio-LLM architecture doesn't map cleanly to either runtime. Sits alongside MOSS-TTS-Nano in our voice category at /models/moss-music-8b/ — but its real audience is audio engineers and music tooling builders who want to add analysis to their pipelines without a hosted API call per track.
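Once the SGLang server is up it exposes an OpenAI-compatible `/v1/chat/completions` endpoint (port 30000 by default). One plausible way to send a track for analysis is a base64 `input_audio` content part in the OpenAI audio-input style; whether MOSS-Music-8B's endpoint accepts exactly this schema is an assumption, so check the model card before wiring it into a pipeline:

```python
import base64


def build_analysis_request(audio_path: str, question: str) -> dict:
    """Build an OpenAI-style chat-completions payload with inline audio.

    The `input_audio` part follows the OpenAI audio-input convention;
    whether MOSS-Music-8B's SGLang endpoint accepts this exact shape is
    an ASSUMPTION, not confirmed by the model card.
    """
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": "MOSS-Music-8B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
                {"type": "text", "text": question},
            ],
        }],
    }
```

POST the dict as JSON to `http://localhost:30000/v1/chat/completions` with any HTTP client; questions like "What key and tempo is this track?" or "Segment this into intro/verse/chorus with timestamps" map onto the capabilities listed above.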

Where this fits

Models: MOSS-Music-8B (Instruct + Thinking) · MOSS-TTS-Nano (100M) · Qwen3-Omni-30B-A3B-Instruct

Hardware: RTX 5060 Ti 16 GB · NVIDIA RTX 4090 · Mac Studio M4 Max 64 GB
