the AI bench
VERIFIED MAY 2026

MODEL · OPENBMB · 0.5B / 1B / 3B / 8B (ALL NATIVELY TRAINED IN 1.58-BIT TERNARY; NOT POST-HOC QUANTIZED)

BitCPM4-CANN family (0.5B / 1B / 3B / 8B, native 1.58-bit)

First publicly reported end-to-end 1.58-bit (ternary {-1, 0, +1}) training stack at 8B scale. The weights are trained natively at 1.58-bit using Quantization-Aware Training (QAT) with a Straight-Through Estimator (STE) on Huawei Ascend NPUs; this is not a post-training quantization (PTQ) pass over a BF16 model. The 8B model retains 95.7% of full-precision MiniCPM4 performance at ~6× memory reduction; the 0.5B retains 90.1%. This is the new quality ceiling for the low-VRAM tier.

License: Apache 2.0 · Context: Inherits MiniCPM4 base (8K-32K depending on variant) · Released: May 2026

The decision in five lines

The call
Consider — runnable locally, family reference
Best for
Local evaluation and family reference
Runs on
23 hardware picks fit (cheapest: Intel Arc B580 12 GB · $249)
Watch out
Frontier reasoning — these are research-milestone weights, not the strongest model at their size.
Evidence
Estimated · last verified May 2026

0.5B / 1B / 3B / 8B
PARAMETERS
NATIVE TERNARY LLM
TYPE
Inherits MiniCPM4 base (8K-32K depending on variant)
CONTEXT
~0.1 GB (0.5B) / ~0.2 GB (1B) / ~0.6 GB (3B) / ~1.6 GB (8B). These figures are the native runtime weight storage, not a quantized copy of a larger checkpoint.
VRAM AT Q4
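
The weight-memory figures above can be reproduced with back-of-envelope arithmetic. The sketch below is an idealized estimate, not OpenBMB's accounting: it assumes every parameter is stored as one ternary value at log2(3) ≈ 1.585 bits, and it ignores activations, KV cache, packing overhead, and any tensors kept at higher precision. The names `FAMILY_PARAMS` and `weight_memory_gb` are illustrative, not part of any library.

```python
import math

# Parameter counts for the four family sizes listed on this page.
FAMILY_PARAMS = {"0.5B": 0.5e9, "1B": 1e9, "3B": 3e9, "8B": 8e9}

# A ternary weight in {-1, 0, +1} carries log2(3) bits of information.
BITS_PER_TERNARY_WEIGHT = math.log2(3)  # ~1.585


def weight_memory_gb(n_params, bits_per_weight=BITS_PER_TERNARY_WEIGHT):
    """Idealized weight storage in GB (10^9 bytes).

    Assumption: every parameter uses exactly `bits_per_weight` bits.
    """
    return n_params * bits_per_weight / 8 / 1e9


for name, n_params in FAMILY_PARAMS.items():
    ternary_gb = weight_memory_gb(n_params)
    bf16_gb = weight_memory_gb(n_params, bits_per_weight=16)
    print(f"{name}: ~{ternary_gb:.1f} GB ternary vs ~{bf16_gb:.1f} GB BF16")
```

Under these assumptions the ternary-to-BF16 ratio comes out near 10×, higher than the ~6× cited on this page, so treat the sketch as a lower bound on real memory use rather than a prediction.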

Where we recommend this

This model isn’t currently in an active planner slot. See the runner notes below if you’re running it anyway.

The call


When not to use: frontier reasoning. These are research-milestone weights, not the strongest models at their size. For best-quality 8B work, Llama 3.1 8B or Ministral 3 8B still win on benchmarks. BitCPM4-CANN wins on capability per byte of weight memory (and therefore per dollar of hardware), not on absolute capability.
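
For readers unfamiliar with the Quantization-Aware Training + Straight-Through Estimator recipe named on this page, here is a minimal pure-Python sketch of the general idea on a toy linear model. It is not OpenBMB's training code. The absmean scaling rule is an assumption borrowed from published ternary-LLM work, and `ternarize`, `train_step`, and the toy data are all hypothetical names and values chosen for illustration.

```python
def ternarize(weights):
    """Map latent float weights to codes in {-1, 0, +1} plus one shared scale.

    Assumption: absmean scaling (scale = mean absolute weight).
    """
    scale = sum(abs(w) for w in weights) / len(weights) or 1e-8
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def train_step(latent, xs, ys, lr=0.05):
    """One gradient step on mean squared error for y = w . x."""
    # Forward pass uses the ternary weights, as inference would.
    codes, scale = ternarize(latent)
    w_q = [c * scale for c in codes]
    grad = [0.0] * len(latent)
    for x, y in zip(xs, ys):
        err = sum(wq * xi for wq, xi in zip(w_q, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2 * err * xi / len(xs)
    # Straight-through estimator: rounding has zero gradient almost
    # everywhere, so pretend d(w_q)/d(latent) = 1 and update the
    # full-precision latent weights with the quantized-forward gradient.
    return [w - lr * g for w, g in zip(latent, grad)]


# Toy data generated by the ternary target weights [1, 0, -1].
xs = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
ys = [1, 0, -1, 0]

latent = [0.1, 0.1, 0.1]
for _ in range(200):
    latent = train_step(latent, xs, ys)

final_codes, final_scale = ternarize(latent)
print("ternary codes:", final_codes, "scale:", round(final_scale, 3))
```

The point of the sketch is the split between the two copies of the weights: the full-precision latent weights exist only during training, while the ternary codes and the scale are what ship. That is the difference from a post-training pass, which would round an already-trained BF16 model once and never let the optimizer see the rounding error.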

Runner notes

Models load as pseudo-quantized weights via standard PyTorch / Transformers, so no special kernels are needed for inference. The primary serving target is Huawei Ascend 910B/910C. GGUF builds are available for llama.cpp. At ~100 MB on disk, the 0.5B is the smallest credible chat model in the open-weight landscape; it is useful for embedded targets and for high-volume agent fan-out, where every active model instance multiplies cost.

License
Apache 2.0
Released
May 2026
Maker
OpenBMB
