VERIFIED JULY 2026

MODEL · OPENBMB · 0.5B / 1B / 3B / 8B (ALL NATIVELY TRAINED IN 1.58-BIT TERNARY; NOT POST-HOC QUANTIZED)

BitCPM4-CANN family (0.5B / 1B / 3B / 8B, native 1.58-bit)

First publicly reported end-to-end 1.58-bit (ternary {-1, 0, 1}) training stack at 8B scale. The weights are trained natively at 1.58-bit with Quantization-Aware Training (QAT) and a Straight-Through Estimator (STE) on Huawei Ascend NPUs; this is not a post-training quantization (PTQ) pass over a BF16 model. The 8B model retains 95.7% of full-precision MiniCPM4 performance at ~6× memory reduction; the 0.5B retains 90.1%. This sets the new ceiling for the low-VRAM tier.
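To make "QAT + Straight-Through Estimator" concrete, here is a minimal pure-Python sketch on a single linear neuron. It is an illustration under stated assumptions, not OpenBMB's training recipe: the absmean scaling rule, the learning rate, and the helper names `ternarize` and `train_step` are all hypothetical choices made for this example.

```python
def ternarize(weights, eps=1e-8):
    """Map latent float weights to codes in {-1, 0, 1} plus one shared scale.

    Assumption: the scale is the mean absolute weight (absmean). The actual
    BitCPM4 scaling rule may differ.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def train_step(latent, x, target, lr=0.05):
    """One SGD step on a single linear neuron with ternary forward weights."""
    codes, scale = ternarize(latent)
    # Forward pass: only the quantized weights (scale * code) are used.
    pred = sum(scale * c * xi for c, xi in zip(codes, x))
    err = pred - target
    # Backward pass: d(loss)/d(quantized weight_i) = 2 * err * x_i.
    # Straight-through estimator: round/clip is treated as the identity, so
    # that same gradient is applied directly to the latent float weight.
    # (The dependence of the scale on the weights is ignored here.)
    new_latent = [w - lr * 2.0 * err * xi for w, xi in zip(latent, x)]
    return new_latent, err * err


# Hypothetical toy data, chosen only to exercise the code.
latent = [0.9, -0.05, -0.7, 0.3]
x = [1.0, 2.0, -1.0, 0.5]
for _ in range(20):
    latent, loss = train_step(latent, x, target=1.0)

codes, scale = ternarize(latent)
print("ternary codes:", codes, "scale:", round(scale, 4))
```

The point of the sketch is the split between the latent float weights, which receive gradient updates, and the ternary codes, which are the only values the forward pass sees. After training, only the codes and the scale need to be stored.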

License: Apache 2.0 · Context: Inherits MiniCPM4 base (8K-32K depending on variant) · Released: May 2026

The decision in five lines

The call: Consider — runnable locally, family reference
Best for: Local evaluation and family reference
Runs on: 23 hardware picks fit (cheapest: Intel Arc B580 12 GB · $249)
Watch out: Not built for frontier reasoning. These are research-milestone weights, not the strongest models at their size.
Evidence: Estimated · last verified July 2026

PARAMETERS: 0.5B
TYPE: NATIVE TERNARY LLM
CONTEXT: Inherits
VRAM AT Q4: ~0.1 GB (0.5B) / ~0.2 GB (1B) / ~0.6 GB (3B) / ~1.6 GB (8B) (these ARE the runtime storage, not a quant of something bigger)
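The per-size storage figures follow from simple arithmetic: parameters × 1.58 bits ÷ 8 bits per byte. The sketch below reproduces them, assuming the nominal parameter counts in the size names (real counts differ slightly) and counting weights only, with no activations, KV cache, or tensors kept at higher precision. The function name `weight_gb` is illustrative.

```python
BITS_PER_WEIGHT = 1.58  # ternary weights, about log2(3) bits each


def weight_gb(params, bits=BITS_PER_WEIGHT):
    """Weight storage in GB (10^9 bytes) for `params` weights at `bits` each."""
    return params * bits / 8 / 1e9


# Assumption: nominal parameter counts taken from the size names.
SIZES = {"0.5B": 0.5e9, "1B": 1e9, "3B": 3e9, "8B": 8e9}

for name, n_params in SIZES.items():
    print(f"{name}: ~{weight_gb(n_params):.1f} GB")
```

Rounded to one decimal place, this gives 0.1, 0.2, 0.6 and 1.6 GB, matching the figures listed above.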

Where we recommend this

This model isn’t currently in an active planner slot. See the runner notes below if you’re running it anyway.

The call

When not to use: Frontier reasoning. These are research-milestone weights, not the strongest models at their size. For best-quality 8B work, Llama 3.1 8B or Ministral 3 8B still win on benchmarks. BitCPM4-CANN wins on cost per byte of weight memory, not on absolute capability.

Runner notes

The models load as pseudo-quantized checkpoints through standard PyTorch / Transformers, so inference needs no special kernels. The primary serving target is Huawei Ascend 910B/910C. GGUF builds are available for llama.cpp. The 0.5B, at ~100 MB on disk, is the smallest credible chat model in the open-weight landscape; it is useful for embedded targets and for high-volume agent fan-out, where every active model multiplies cost.
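A minimal loading sketch with the standard Transformers API is shown below. Several details are assumptions rather than facts from the model card: the default repository id is copied from the model card link on this page and has not been verified, `trust_remote_code=True` is set on the guess that the checkpoint ships custom modeling code, and `generate_once` is a hypothetical helper name.

```python
# Assumption: repository id copied from this page's model card link (unverified).
DEFAULT_MODEL_ID = "openbmb/BitCPM-CANN-8B"


def generate_once(prompt, model_id=DEFAULT_MODEL_ID, max_new_tokens=64):
    """Load the checkpoint with stock Transformers and generate one completion."""
    # Imported lazily so the module can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the repo may require custom code, hence trust_remote_code=True.
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    print(generate_once("Explain ternary weights in one sentence."))
```

Because the weights load as ordinary tensors, this path uses the same memory as the stored dtype of the checkpoint; the storage savings quoted on this page apply to packed ternary formats such as the GGUF builds.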

License: Apache 2.0
Released: May 2026
Maker: OpenBMB
Model card: huggingface.co/openbmb/BitCPM-CANN-8B

Hardware that fits

Every hardware pick whose memory fits this model at the quant we recommend. Sorted cheapest-first — the top row is your best-value fit. Click through for the full buyer’s guide.
