FAST TAKE · 2026-05-15 · MINICPM-V-4.6 (OPENBMB)
MiniCPM-V-4.6 — vision-language at 1B that prices like a sub-billion text model
OpenBMB shipped MiniCPM-V-4.6 on May 15 — a 1B-param vision-language model built on SigLIP2-400M + Qwen3.5-0.8B that scores higher than its own LLM backbone on the Artificial Analysis Intelligence Index (13 vs 10) at ~19× lower token cost. Apache 2.0. Day-one GGUF, BNB, AWQ, and GPTQ quants, plus a Thinking variant. The newest entry in the V (vision-only) branch, parallel to the MiniCPM-o omnimodal line.
Verdict: A 1B vision-language model that beats its own 0.8B text-LLM backbone at ~19× lower token cost
The take
The price-performance claim is the editorial moment. The Artificial Analysis Intelligence Index puts MiniCPM-V-4.6 at 13 — three points above the raw Qwen3.5-0.8B (10) it's built on — while charging roughly 1/19th the token cost when served via hosted inference. For a vision-language workload, that's an unusually clean Pareto improvement. The mixed 4×/16× visual-token compression cuts visual-encoding FLOPs by over 50% versus prior MiniCPM-V revs, which is where the cost lives.
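The arithmetic behind the claim is worth making explicit. A minimal sketch — only the Index scores and the ~19× cost ratio come from the post; the 1,024-patch-token baseline is an assumed illustrative figure, not a model spec:

```python
# Back-of-the-envelope on the price-performance claim.
# Index scores and the ~19x cost ratio are from the post;
# the patch-token baseline below is assumed for illustration.

index_v46 = 13       # MiniCPM-V-4.6, Artificial Analysis Intelligence Index
index_backbone = 10  # raw Qwen3.5-0.8B backbone
cost_ratio = 19      # V-4.6 is ~19x cheaper per token

# Quality and cost advantages compound multiplicatively:
pareto_gain = (index_v46 / index_backbone) * cost_ratio
print(f"quality x cost advantage: ~{pareto_gain:.0f}x")  # ~25x

# Visual-token compression: a hypothetical image that encodes to 1,024
# patch tokens shrinks before it ever reaches the LLM.
base_tokens = 1024  # assumed SigLIP patch-token count for one image
for factor in (4, 16):
    print(f"{factor}x compression -> {base_tokens // factor} visual tokens")
```

Fewer visual tokens entering the LLM is what drives the FLOPs (and therefore cost) reduction: the language model's attention cost scales with sequence length.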
Capabilities: single-image, multi-image, and video understanding from a 1B-class checkpoint. Native function / tool calling. OCR + scene understanding + visual reasoning at the level you'd expect from a much heavier model. Inherits the Qwen3.5-0.8B context window (128K) — long enough for multi-page document workflows. Flash Attention 2 recommended for multi-image and video runs.
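The 128K window is the enabler for the multi-page document claim. A rough budget sketch — the context size is from the post, but every per-page number here is an assumption for illustration, not a measured figure:

```python
# Rough multi-page-document budget under the inherited 128K context window.
# CONTEXT is from the post; all per-page numbers are assumptions.

CONTEXT = 128_000             # tokens, inherited from Qwen3.5-0.8B
TEXT_TOKENS_PER_PAGE = 500    # assumed: typical extracted text per page
VISUAL_TOKENS_PER_PAGE = 64   # assumed: one page image at 16x compression
RESERVED_FOR_OUTPUT = 2_000   # assumed: budget kept free for the answer

per_page = TEXT_TOKENS_PER_PAGE + VISUAL_TOKENS_PER_PAGE
pages = (CONTEXT - RESERVED_FOR_OUTPUT) // per_page
print(f"~{pages} pages fit alongside a {RESERVED_FOR_OUTPUT}-token answer")
```

Under these assumptions the window holds a couple of hundred pages at once, which is why "multi-page document workflows" is a realistic pitch for a 1B-class checkpoint.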
Where it fits in our taxonomy: MiniCPM-V-4.6 is V-line (vision-only). MiniCPM-o 2.6 (which we already track at voice.low) is the omni line (vision + speech in + speech out). They are parallel, both Apache 2.0, both still maintained. Pick V if you need vision and nothing else; pick o if you need voice I/O too. Day-one quants land at ~1.5–2 GB int4 / ~3–4 GB FP16 — small enough for a Raspberry Pi-class accelerator, let alone any GPU we recommend.
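For intuition on those file sizes, here is a weights-only lower bound computed from the component names (SigLIP2-400M + Qwen3.5-0.8B). Real GGUF/AWQ files run larger than this floor because quantizers typically keep embeddings and some layers at higher precision, and inference adds KV cache on top:

```python
# Weights-only lower bound on checkpoint size at a given precision.
# Real quantized files are larger: mixed-precision layers, metadata,
# and runtime KV cache all add to the on-disk / in-memory footprint.

PARAMS = 0.4e9 + 0.8e9  # SigLIP2-400M vision encoder + Qwen3.5-0.8B LLM

def weights_gb(bits_per_weight: float) -> float:
    """Size of the raw weights alone, in GB, at the given precision."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("int4", 4), ("int8", 8), ("fp16", 16)]:
    print(f"{name}: >= {weights_gb(bits):.1f} GB weights-only")
```

The gap between this floor (~0.6 GB int4, ~2.4 GB fp16) and the quoted ~1.5–2 GB / ~3–4 GB figures is the usual mixed-precision and runtime overhead; either way, the footprint stays in single-board-computer territory.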
Practical pick: this is the new floor for editorial-quality vision-language at home. We've added /models/minicpm-v-4-6/ this week. No planner-pick slot yet — our 6 use cases (coding / chat / docs / image / agents / voice) don't have a pure vision-language band — but the detail page covers when to reach for it vs the omni MiniCPM-o.
Where this fits
Models: MiniCPM-V-4.6 (1B vision-language) · MiniCPM-o 2.6 · Qwen3-Omni-30B-A3B-Instruct
Hardware: RTX 5060 Ti 16 GB · Mac Mini M4 16 GB · NVIDIA RTX 3060 12 GB