HARDWARE · WORKSTATION · 128 GB UNIFIED
NVIDIA DGX Spark
The only sub-$5k box that fits 100B+ MoE models on your desk.
128 GB of unified memory via NVIDIA's GB10 Grace Blackwell Superchip — 4× what any consumer GPU gives you. The catch: 273 GB/s bandwidth is ~27% of an RTX 4090, so you trade raw speed for fit. A capacity-first machine, not a speed-first machine.
The decision in five lines
- The call
- Buy — The only sub-$5k box that fits 100B+ MoE models on your desk.
- Best for
- AI workstation
- Runs well
- Qwen3-Coder-30B-A3B (MoE, fits 24GB) · Qwen 3.6-27B · Z-Image-Turbo (Apache 2.0)
- Watch out
- Bandwidth ceiling: 273 GB/s means prefill is slow and dense models over 30B decode painfully. MoE with small active params is the native fit — dense 70B at Q4 will feel sluggish.
- Evidence
- Measured
- 128
- GB LPDDR5X UNIFIED
- 273
- GB/S BANDWIDTH
- 240
- W PSU (140W SOC)
- $4,699
- MSRP (FEB 2026)
What fits at this tier
Runs what no consumer GPU can: Qwen 3.5 122B-A10B at 38–51 tok/s (v2.1 patches), 70B dense Q4 at ~5–7 tok/s, 70B Q8 loads but decodes at ~3 tok/s (bandwidth wall). Prefill is slow and decode on dense 30B+ models is painful — this is for fit-over-speed MoE workloads.
The call
Buy it if you need to run frontier open-weight MoE (122B-A10B at 38–51 tok/s) that no consumer GPU fits, and if bandwidth isn't your bottleneck because you're running MoE where active parameters stay small.
Skip it if you want maximum interactive speed on 8B–30B models. An RTX 4090 or 5090 will beat it 3–4× on anything that fits in 24–32 GB.
Watchouts
- Bandwidth ceiling: 273 GB/s means prefill is slow and dense models over 30B decode painfully. MoE with small active params is the native fit — dense 70B at Q4 will feel sluggish.
- ARM Linux ecosystem: CUDA works, but some tooling is x86-first. Expect occasional "ARM wheel not found" friction with newer llama.cpp builds or Python deps.
- Price already jumped $700 in Feb 2026 (originally $3,999 at announce) on memory supply. Further increases aren't impossible.
- Not a general workstation. No discrete GPU slot, limited I/O, thermals tuned for sustained AI. This is an AI appliance, not a PC.
Local vs cloud at this tier
● LOCAL WINS
Runs models no consumer GPU can touch — full 122B-A10B MoE at 38–51 tok/s, frontier open-weight experimentation. Fully offline, fully private.
● CLOUD WINS
Speed per token on anything ≤30B — an RTX 4090 at cloud-rental pricing gives you 3–4× throughput. Always-latest frontier models. No $4,700 upfront.
Break-even vs $100/mo ChatGPT Pro is ~47 months at flat pricing. Much faster (12–18 months) if you'd otherwise be paying Claude Opus 4.8 or GPT-5.4 API rates at heavy usage.
Next step
Load this setup into the planner→