the AI bench
VERIFIED JULY 2026

MODEL · WEIBO AI · 3B DENSE

VibeThinker-3B

A 3B reasoning model that performs well above its size on verifiable reasoning: math, competitive coding, and STEM. Weibo AI reports that it reaches the range of much larger models (DeepSeek V3.2, GLM-5, Kimi K2.5) on IMO-AnswerBench, scoring 76.4, or 80.6 with a test-time verification strategy, despite having only 3B parameters. The company attributes this to its Spectrum-to-Signal post-training. The thesis: verifiable reasoning is a parameter-dense, compressible capability on which small models can reach near-frontier performance.

License: MIT · Context: Long — 60K–100K recommended for hard math · Released: June 12, 2026

The decision in five lines

The call: Consider (runnable locally, family reference)
Best for: Local evaluation and family reference
Runs on: 23 hardware picks fit (cheapest: Intel Arc B580 12 GB · $249)
Watch out: Tool-calling, agent orchestration, or autonomous coding agents. The authors explicitly say it was not trained for those.
Evidence: Estimated · last verified July 2026

Parameters: 3B dense
Type: Small reasoning model
Context: Long
VRAM at Q4: ~2–3 GB
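
As a rough sanity check on the Q4 VRAM figure, the weights-only footprint can be estimated from the parameter count. This is a minimal sketch, not the site's method: the 4 to 5 bits-per-weight range is an assumption about typical 4-bit quantization formats (which store scales alongside the weights), and the function name is illustrative.

```python
def quantized_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 3e9  # "3B dense" from the spec block

# Assumption: Q4-style formats land at roughly 4 to 5 bits per weight
# once per-block scales are included.
low = quantized_weight_gb(N_PARAMS, 4.0)
high = quantized_weight_gb(N_PARAMS, 5.0)

print(f"weights only: {low:.2f} to {high:.2f} GB")
```

The gap between this weights-only estimate and the ~2–3 GB figure would be taken up by the KV cache and runtime buffers, which grow with context length, so very long reasoning traces may need more headroom than the headline number suggests.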

Where we recommend this

This model isn’t currently in an active planner slot. See the runner notes below if you’re running it anyway.

The call

When not to use: Tool-calling, agent orchestration, or autonomous coding agents; the authors explicitly say it was not trained for those. It is also weak on open-domain knowledge and general chat, since small models cover facts poorly. Use it for math, contest-coding, and STEM reasoning, not as a generalist.

Runner notes

Runs anywhere (3B) via Ollama, llama.cpp, or transformers. Set a high max-token budget (60K–100K) for hard math, because it produces long reasoning traces. For agentic or tool work, use Qwen3-Coder-30B-A3B instead.
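
The token-budget advice can be sketched as a request payload for Ollama's generate endpoint. Treat this as an illustration under stated assumptions: the model tag `vibethinker-3b` is hypothetical (check what your local install actually calls it), the 8,192-token budget for easier prompts is an arbitrary choice, and the 4,096 tokens of extra context for the prompt is a guess. `num_predict` and `num_ctx` are standard Ollama options.

```python
import json


def build_ollama_request(prompt: str, hard_math: bool = True) -> dict:
    """Build a JSON body for Ollama's /api/generate with a large output budget."""
    # 65,536 sits inside the 60K-100K range recommended for hard math.
    # The smaller budget for easy prompts is an assumption, not a recommendation
    # from the model's authors.
    token_budget = 65536 if hard_math else 8192
    return {
        "model": "vibethinker-3b",  # hypothetical tag; substitute your own
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_predict": token_budget,      # max tokens to generate
            "num_ctx": token_budget + 4096,   # context window: output plus prompt room
        },
    }


payload = build_ollama_request(
    "Find all integers n such that n^2 + 1 divides n^3 + 3."
)
print(json.dumps(payload, indent=2))
```

The body would be sent as a POST to `http://localhost:11434/api/generate` on a default Ollama install. Setting `num_ctx` above `num_predict` matters because the context window has to hold the prompt and the generated reasoning together.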

License: MIT · Released: June 12, 2026 · Maker: Weibo AI

Hardware that fits

Every hardware pick whose memory fits this model at the quant we recommend, sorted cheapest-first. The top row is the best-value fit.
