BitCPM4-CANN — OpenBMB ships the first native ternary 8B LLM family

OpenBMB released the BitCPM4-CANN family (0.5B / 1B / 3B / 8B) in mid-May — the first publicly reported end-to-end 1.58-bit (ternary {-1, 0, 1}) training stack at 8B scale, trained natively on Huawei Ascend NPU. Apache 2.0. The 8B model retains 95.7% of full-precision MiniCPM4 performance at ~6× memory reduction; the 0.5B variant retains 90.1% of its full-precision baseline at ~100 MB on-disk. Not the strongest model at its size — but the smallest credible model at this quality level.

Verdict: First end-to-end native 1.58-bit training stack at 8B scale — the new floor for low-VRAM open weights

The take

The training story is the editorial moment, not the benchmark scores. Outside the BitNet research line, the "1.58-bit" models published before this have been post-hoc quantizations of models trained at BF16: a quality compromise dressed up as a footprint win. BitCPM4-CANN trains natively in ternary, applying Quantization-Aware Training with a Straight-Through Estimator inside the actual training loop, at a reported throughput overhead of only ~4.5%. That's a real infrastructure milestone: the first time anyone outside the BitNet research team has demonstrated this end-to-end at 8B scale on production hardware.
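
For readers who have not met the technique, here is a minimal sketch of ternary Quantization-Aware Training with a Straight-Through Estimator on a single linear unit, in plain Python. It is an illustration under assumptions, not OpenBMB's training code: the absmean scaling rule is borrowed from the BitNet b1.58 recipe, and the function names, learning rate, and toy data are all hypothetical.

```python
import random

def ternarize(weights):
    """Absmean ternary quantization (assumed scheme): codes in {-1, 0, 1} plus a scale."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def forward(latent, x):
    """The forward pass only ever sees the ternary weights, not the latent floats."""
    codes, scale = ternarize(latent)
    return scale * sum(c * xi for c, xi in zip(codes, x))

def ste_step(latent, x, target, lr=0.01):
    """One SGD step on squared error using a straight-through estimator.

    The gradient is taken with respect to the quantized weights and applied
    unchanged to the latent weights, i.e. the rounding step is treated as
    the identity in the backward pass.
    """
    error = forward(latent, x) - target
    grads = [2.0 * error * xi for xi in x]
    return [w - lr * g for w, g in zip(latent, grads)]

# Toy run: fit a unit whose ideal weights are already ternary.
random.seed(0)
true_w = [1.0, -1.0, 0.0, 1.0]
latent = [random.uniform(-0.1, 0.1) for _ in true_w]
for _ in range(2000):
    x = [random.uniform(-1.0, 1.0) for _ in true_w]
    y = sum(t * xi for t, xi in zip(true_w, x))
    latent = ste_step(latent, x, y)

codes, scale = ternarize(latent)
print("learned codes:", codes, "scale:", round(scale, 3))
```

The point of the sketch is the split between latent weights, which receive the gradient updates, and ternary weights, which are the only ones used in the forward pass. That split is what separates native low-bit training from quantizing a finished BF16 checkpoint.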

What it actually does: standard chat and completion, with most of the full-precision benchmark quality retained. The 8B ternary model averages 77.84 across 11 evals vs 81.31 for full-precision MiniCPM4 8B (95.7% retention). Per benchmark: MMLU 70.65 vs 75.83, GSM8K 85.75 vs 91.51, ARC-c 86.10 vs 87.46. Not best-in-class: Llama 3.1 8B, Ministral 3 8B, and Qwen 3.5 9B all beat it on standard benchmarks. But those models can't ship in 1.6 GB.
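
The retention figure is simply the ternary score divided by the full-precision score. A short Python check using only the numbers quoted above (the function name is ours) shows how the per-benchmark picture differs from the average:

```python
def retention(ternary_score, full_precision_score):
    """Percentage of the full-precision score kept by the ternary model."""
    return 100.0 * ternary_score / full_precision_score

# (ternary 8B, full-precision MiniCPM4 8B), as quoted in the text.
scores = {
    "11-eval average": (77.84, 81.31),
    "MMLU": (70.65, 75.83),
    "GSM8K": (85.75, 91.51),
    "ARC-c": (86.10, 87.46),
}

for name, (ternary, full) in scores.items():
    kept = retention(ternary, full)
    print(f"{name}: {kept:.1f}% retained, gap {full - ternary:.2f} points")
```

The average hides some spread: retention is roughly 93% on MMLU and GSM8K and roughly 98% on ARC-c, so knowledge-heavy and math workloads give up more than the headline 95.7% suggests.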

Where it fits in our taxonomy: niche but real. We've added /models/bitcpm4-cann/ as a research-milestone reference; it does not displace any planner pick. The use cases are embedded targets (the 0.5B at ~100 MB fits a Raspberry Pi-class accelerator), high-volume agent fan-out (every active model instance multiplies cost), and any deployment where weight memory matters more than the last 5% of capability. Native ternary sets the new floor for the lowest-VRAM tier and, importantly, avoids post-hoc quantization quality loss, because the model was trained in ternary from the start rather than converted afterwards.
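
The footprint numbers follow from simple arithmetic: a ternary weight carries log2(3), about 1.58, bits of information. The Python sketch below is a back-of-envelope estimate only. It assumes the nominal parameter counts in the model names, treats every weight as ternary, and ignores embeddings, activations, and KV cache, so it need not match the ~6× reduction quoted earlier, which the source does not break down.

```python
import math

BITS_PER_TERNARY_WEIGHT = math.log2(3)  # about 1.585, the "1.58-bit" figure

def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to store num_params weights at bits_per_weight each."""
    return num_params * bits_per_weight / 8

# Nominal sizes taken from the model names (an assumption, not exact counts).
for label, params in [("0.5B", 0.5e9), ("1B", 1e9), ("3B", 3e9), ("8B", 8e9)]:
    ternary_gb = weight_bytes(params, BITS_PER_TERNARY_WEIGHT) / 1e9
    bf16_gb = weight_bytes(params, 16) / 1e9
    print(f"{label}: ternary ~{ternary_gb:.2f} GB, BF16 ~{bf16_gb:.1f} GB")
```

Under these assumptions the 0.5B model lands at about 0.10 GB and the 8B at about 1.58 GB, in line with the ~100 MB and 1.6 GB figures in the text.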

What runs it: pseudo-quantized format means standard PyTorch / Transformers handle inference with no special kernels. Primary serving target is Huawei Ascend 910B/910C (where the training-side advantage is biggest); GGUF builds exist for llama.cpp on consumer hardware. No Ollama path yet. For most readers this is a research-curiosity entry — but the architecture is going to matter, because the cost of running 100M-active-parameter MoEs in 2027 will turn on whether native low-bit training scales to those checkpoints.
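
One reading of "pseudo-quantized" (our assumption, not something the source spells out) is that the weights hold only ternary values but are stored in an ordinary floating-point tensor, which is why stock kernels work. Realizing the storage savings then requires a packed encoding. The Python sketch below shows one generic scheme, base-3 packing at five weights per byte (1.6 bits per weight); it is an illustration only and is not the BitCPM4-CANN or GGUF on-disk layout.

```python
def pack_trits(codes):
    """Pack ternary codes (-1, 0, 1) into bytes, five codes per byte."""
    out = bytearray()
    for i in range(0, len(codes), 5):
        value = 0
        # Build a base-3 number; the first code ends up as the lowest digit.
        for c in reversed(codes[i:i + 5]):
            value = value * 3 + (c + 1)  # map -1/0/1 to digits 0/1/2
        out.append(value)  # at most 3**5 - 1 = 242, so it fits in one byte
    return bytes(out)

def unpack_trits(data, count):
    """Recover the first `count` ternary codes from packed bytes."""
    codes = []
    for byte in data:
        for _ in range(5):
            codes.append(byte % 3 - 1)
            byte //= 3
    return codes[:count]

weights = [1, -1, 0, 1, 0, -1, 1]
packed = pack_trits(weights)
assert unpack_trits(packed, len(weights)) == weights
print(len(weights), "weights stored in", len(packed), "bytes")
```

Five ternary digits fit in a byte because 3^5 = 243 is just under 256, which is about as close to the 1.58-bit ideal as a byte-aligned format gets.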

Where this fits

Models: BitCPM4-CANN family (0.5B / 1B / 3B / 8B, native 1.58-bit) · MiniCPM-V-4.6 (1B vision-language) · MiniCPM-o 4.5 · Phi-4 Mini

Hardware: NVIDIA RTX 3060 12 GB · Mac Mini M4 16 GB · Minisforum UM890 Pro
