DeepSeek V4 — the architecture is the story, not the size

A 1.6T Pro and a 284B Flash sibling, both MIT-licensed, both 1M context, released the same day. Skip the size headlines: the real news is the architectural change that cuts single-token inference FLOPs by ~73% and KV cache by ~90% relative to V3.2.

Verdict: Frontier-class when hosted; realistic locally only as V4-Flash on multi-GPU

The take

DeepSeek shipped V4-Pro (1.6T total / 49B active) and V4-Flash (284B total / 13B active) as a paired preview on April 24. Both MIT-licensed. Both 1M-context default with 384K max output. Both available as open weights on HuggingFace and through DeepSeek's hosted API at materially lower per-token prices than GPT-5.5.

The architecture is the most interesting part. By DeepSeek's own benchmarks, V4-Pro uses 27% of V3.2's single-token inference FLOPs and 10% of its KV cache at 1M context. That is the kind of efficiency gain that makes a 1M-context default viable rather than aspirational: most earlier-generation 1M-context models had unusable prefill latency at full context. V4 starts to fix that.

The local-deployment math is less forgiving: V4-Pro at Q4 is ~800 GB on disk. That's hosted-API or multi-node cluster territory, not single-card local; even an 8× H100 80 GB node totals 640 GB, short of the Q4 weights alone. V4-Flash at ~140 GB Q4 is the realistic local pick for the V4 line. An 8× consumer-card rig (192 GB with 24 GB cards) holds it at Q4; a DGX Spark (128 GB unified memory) or dual A6000 (96 GB) needs a tighter quant or partial offload. Single 5090? No. Mac Studio M3 Ultra 96 GB? Tight, possible at heavy quantization, not where the model wants to live.
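
The footprint figures above follow from simple weights-only arithmetic, sketched here in Python. This is a rough estimate rather than a measurement: it assumes exactly 4 bits per weight and ignores quantization metadata, KV cache, and activations. The rig memory totals (80 GB per H100, 24 GB per consumer card, 128 GB for DGX Spark) are assumptions added for illustration, and the function and dictionary names are hypothetical.

```python
# Weights-only footprint sketch. Assumption: size = params * bits / 8,
# with no allowance for quantization metadata, KV cache, or activations.


def weight_footprint_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * bits_per_weight / 8


def fits_in_memory(footprint_gb: float, memory_gb: float) -> bool:
    """True if the weights alone fit; real deployments need extra headroom."""
    return footprint_gb <= memory_gb


# Total parameter counts in billions, as quoted in the article.
MODELS = {"V4-Pro": 1600, "V4-Flash": 284}

# Aggregate memory per rig in GB. Illustrative assumptions, not verified specs.
RIGS = {
    "8x H100 80 GB": 8 * 80,
    "8x 24 GB consumer cards": 8 * 24,
    "DGX Spark": 128,
    "dual RTX A6000": 2 * 48,
}

if __name__ == "__main__":
    for model, params_b in MODELS.items():
        size = weight_footprint_gb(params_b, bits_per_weight=4)
        print(f"{model}: ~{size:.0f} GB at 4 bits per weight")
        for rig, mem in RIGS.items():
            verdict = "fits" if fits_in_memory(size, mem) else "does not fit"
            print(f"  {rig} ({mem} GB): weights {verdict}")
```

Under these assumptions the arithmetic gives 800 GB for V4-Pro and 142 GB for V4-Flash, in line with the ~800 GB and ~140 GB figures. Real Q4 formats usually carry slightly more than 4 bits per weight, and CPU or disk offload changes what fits, so treat the result as a first-pass filter only.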

The honest read: use V4-Pro via API for outright frontier work, treat V4-Flash as a serious-local pick when you have multi-GPU, and don't pretend V4 is a single-card model. Qwen3-Coder-30B-A3B remains the right local coding pick on a 24 GB card; Qwen 3.6-27B is the dense default for everything else single-card.

Where this fits

Models: DeepSeek V4-Pro · DeepSeek V4-Flash · Qwen3-Coder-30B-A3B

Hardware: NVIDIA DGX Spark · NVIDIA RTX A6000 (48 GB, used) · Mac Studio M3 Ultra 96 GB
