NVIDIA DGX Spark

The only sub-$5k box that fits 100B+ MoE models on your desk.

128 GB of unified memory via NVIDIA's GB10 Grace Blackwell Superchip, 4× the 32 GB of the largest consumer GPU. The catch: 273 GB/s of memory bandwidth is ~27% of an RTX 4090's, so you trade raw speed for fit. A capacity-first machine, not a speed-first one.
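To make "fit" concrete, here is a rough sketch of the weight-footprint arithmetic (parameters × bits per weight ÷ 8). The helper names are illustrative, and the estimate is an assumption-heavy lower bound: it ignores KV cache, activations, and runtime overhead, and real Q4 formats usually store somewhat more than 4 bits per weight.

```python
# Rough weight footprint: parameters x bits per weight / 8.
# Assumption: weights only. KV cache, activations, and runtime overhead
# are ignored, so real memory use will be higher than this estimate.

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, memory_gb: float) -> bool:
    """True if the weights alone fit in the given amount of memory."""
    return weight_gb(params_billion, bits_per_weight) <= memory_gb


if __name__ == "__main__":
    # (label, parameters in billions, assumed bits per weight)
    models = [
        ("27B dense, 4-bit", 27, 4),
        ("70B dense, 8-bit", 70, 8),
        ("122B MoE, 4-bit", 122, 4),
    ]
    for label, params, bits in models:
        size = weight_gb(params, bits)
        print(
            f"{label}: ~{size:.1f} GB of weights | "
            f"fits 32 GB: {fits(params, bits, 32)} | "
            f"fits 128 GB: {fits(params, bits, 128)}"
        )
```

Under these assumptions a 122B model at 4 bits needs about 61 GB for weights alone, which is out of reach for a 24–32 GB card but leaves headroom in 128 GB.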

The decision in five lines

The call: Buy — The only sub-$5k box that fits 100B+ MoE models on your desk.
Best for: AI workstation
Runs well: Qwen3-Coder-30B-A3B (MoE, fits 24GB) · Qwen 3.6-27B · Z-Image-Turbo (Apache 2.0)
Watch out: Bandwidth ceiling: 273 GB/s means prefill is slow and dense models over 30B decode painfully. MoE with small active params is the native fit — dense 70B at Q4 will feel sluggish.
Evidence: Measured · last verified July 2026

128 GB LPDDR5X unified memory
273 GB/s bandwidth
240 W PSU (140 W SoC)
$4,699 MSRP (Feb 2026)

What fits at this tier

Runs what no consumer GPU can: Qwen 3.5 122B-A10B at 38–51 tok/s (v2.1 patches) and 70B dense at Q4 at ~5–7 tok/s. 70B at Q8 loads but decodes at only ~3 tok/s (the bandwidth wall). Prefill is slow and decode on dense 30B+ models is painful, so this box is for fit-over-speed MoE workloads.
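The bandwidth wall can be sanity-checked with a back-of-envelope sketch. It assumes decode is purely memory-bound and that every active weight is read once per generated token, which gives a ceiling rather than a prediction. The function name and the bits-per-weight values are illustrative assumptions, not measurements.

```python
# Back-of-envelope decode ceiling for a memory-bound model:
#   tok/s <= bandwidth / bytes of active weights read per token
# Assumption: ignores KV cache reads, compute limits, and software overhead,
# so measured numbers should land below this ceiling.

BANDWIDTH_GBPS = 273.0  # memory bandwidth quoted for this machine


def decode_ceiling_tok_s(
    active_params_billion: float,
    bits_per_weight: float,
    bandwidth_gbps: float = BANDWIDTH_GBPS,
) -> float:
    """Upper bound on decode speed when each active weight is read once per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gbps / gb_read_per_token


if __name__ == "__main__":
    # (label, active parameters in billions, assumed bits per weight)
    cases = [
        ("70B dense, 4-bit", 70, 4),
        ("70B dense, 8-bit", 70, 8),
        ("MoE with 10B active, 4-bit", 10, 4),
    ]
    for label, active, bits in cases:
        print(f"{label}: ceiling ~{decode_ceiling_tok_s(active, bits):.1f} tok/s")
```

The ceilings come out at about 7.8, 3.9, and 54.6 tok/s, which sit just above the ~5–7, ~3, and 38–51 tok/s figures quoted above. That is what you would expect if decode on this machine is bandwidth-limited, and it shows why a small active parameter count matters more than total model size.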

CODING

Qwen3-Coder-30B-A3B (MoE, fits 24GB): Community daily driver for local coding; 3B-active MoE delivers 30B quality at 3B-dense speed.

CHAT / GENERAL

Qwen 3.6-27B: Dense refresh released April 22, 2026; supersedes Qwen 3.5 27B and claims to beat the prior 397B MoE flagship while staying single-GPU at Q4 (~17 GB).

DOCS & RETRIEVAL

Qwen 3.6-27B: Dense refresh released April 22, 2026; 262K native context extensible to 1M, multimodal, single-GPU at Q4. Now the dense long-context top pick.

IMAGE

Z-Image-Turbo (Apache 2.0): Community daily driver for realism; 6B parameters, 8-step inference, and a license that permits commercial use.

AGENTS

Qwen 3.6-35B-A3B: Latest Qwen MoE; strong function calling; realistic on 24 GB+ of VRAM or a 48 GB+ Mac. The local agentic top pick.

VOICE

Qwen3-Omni-30B-A3B-Instruct: Apache 2.0 MoE; audio, video, image, and text in, speech and text out; 17 GB at Q4. Frontier unified voice.

The call

Buy it if you need to run frontier open-weight MoE models (122B-A10B at 38–51 tok/s) that no consumer GPU can fit, and your workloads are MoE with small active parameter counts, so bandwidth isn't your bottleneck.
Skip it if you want maximum interactive speed on 8B–30B models. An RTX 4090 or 5090 will beat it 3–4× on anything that fits in 24–32 GB.

Watchouts

Bandwidth ceiling: 273 GB/s means prefill is slow and dense models over 30B decode painfully. MoE with small active params is the native fit — dense 70B at Q4 will feel sluggish.
ARM Linux ecosystem: CUDA works, but some tooling is x86-first. Expect occasional "ARM wheel not found" friction with newer llama.cpp builds or Python deps.
Price: already jumped $700 in Feb 2026 (from $3,999 at announcement to $4,699) on memory supply. Further increases are possible.
Not a general workstation. No discrete GPU slot, limited I/O, thermals tuned for sustained AI. This is an AI appliance, not a PC.

Local vs cloud at this tier

● LOCAL WINS

Runs models no consumer GPU can touch — full 122B-A10B MoE at 38–51 tok/s, frontier open-weight experimentation. Fully offline, fully private.

● CLOUD WINS

Speed per token on anything ≤30B — an RTX 4090 at cloud-rental pricing gives you 3–4× throughput. Always-latest frontier models. No $4,700 upfront.

Break-even vs $100/mo ChatGPT Pro is ~47 months at flat pricing. Much faster (12–18 months) if you'd otherwise be paying Claude Opus 4.8 or GPT-5.4 API rates at heavy usage.
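The break-even figures follow from simple division, sketched below. This assumes flat pricing and ignores electricity, resale value, and any cloud spend you would keep anyway. The function names are illustrative, and the implied monthly API spend is derived from the 12–18 month range rather than taken from any price list.

```python
# Break-even arithmetic: months = hardware cost / monthly cloud spend.
# Assumption: flat pricing, no electricity cost, no resale value.

HARDWARE_COST_USD = 4699.0  # MSRP quoted above


def break_even_months(monthly_cloud_spend: float, hardware_cost: float = HARDWARE_COST_USD) -> float:
    """Months of avoided cloud spend needed to equal the hardware cost."""
    return hardware_cost / monthly_cloud_spend


def implied_monthly_spend(months: float, hardware_cost: float = HARDWARE_COST_USD) -> float:
    """Monthly cloud spend that would pay off the hardware in the given number of months."""
    return hardware_cost / months


if __name__ == "__main__":
    print(f"vs $100/mo subscription: ~{break_even_months(100):.0f} months")
    for months in (12, 18):
        print(f"{months}-month payback implies ~${implied_monthly_spend(months):.0f}/mo of cloud spend")
```

In other words, the 12–18 month range only applies if your API bill would otherwise run roughly $260–$390 per month.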
