NVIDIA RTX A6000 (48 GB, used)

Single-card 70B Q4 at 14.5 tok/s without the dual-GPU space heater.

The only consumer-reachable single-card path to 48 GB VRAM under $5,000. Ampere-generation workstation silicon with ECC memory, 768 GB/s bandwidth, and a dual-slot blower that tolerates sustained load. Used-market prices make it an Ada-tax-avoidance play.

The decision in five lines

The call: Buy. Single-card 70B Q4 at 14.5 tok/s without the dual-GPU space heater.
Best for: Prosumer single-card
Runs well: Qwen3-Coder-30B-A3B (MoE, fits 24GB) · Qwen 3.6-27B · Z-Image-Turbo (Apache 2.0)
Watch out: Blower fan ramps loud under sustained inference — qualitatively "quieter than 3090" but still noticeable. Server-aesthetic, not bedroom-aesthetic.
Evidence: Estimated · last verified July 2026

48 GB GDDR6 ECC
768 GB/s bandwidth
300 W TDP
~$3,500 used (eBay)

What fits at this tier

48 GB fits Llama 3.3 70B Q4 (~40 GB) with about 8 GB left for context, at ~14.5 tok/s token generation via Ollama (Databasemart). Qwen 3.5 27B dense Q4 is estimated at ~50–65 tok/s and the 35B-A3B MoE at ~75–95 tok/s. 122B-A10B at 4-bit does not fit on a single card: weights plus KV cache total ~70 GB, well over 48 GB. ECC memory is a real win for ML science workloads.
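The fit and speed figures above can be sanity-checked with two rules of thumb. The sketch below is illustrative only: it assumes the footprints quoted in this section, and it assumes token generation on a dense model is memory-bandwidth-bound, with every generated token reading all the weights once. The function names are hypothetical, not part of any tool.

```python
# Rule-of-thumb sizing for a 48 GB card. Footprints are this page's own
# estimates; the bandwidth-bound model is an assumption, not a measurement.

VRAM_GB = 48.0          # RTX A6000 memory
BANDWIDTH_GBPS = 768.0  # RTX A6000 memory bandwidth in GB/s


def fits(footprint_gb: float, vram_gb: float = VRAM_GB) -> bool:
    """True if the total footprint (weights + KV cache) fits in VRAM."""
    return footprint_gb <= vram_gb


def decode_ceiling_tok_s(weights_gb: float,
                         bandwidth_gbps: float = BANDWIDTH_GBPS) -> float:
    """Upper bound on tokens/s if each token reads all weights once."""
    return bandwidth_gbps / weights_gb


footprints_gb = {
    "Llama 3.3 70B Q4": 40.0,   # ~40 GB, as quoted above
    "Qwen 3.6-27B Q4": 17.0,    # ~17 GB, as quoted on this page
    "122B-A10B 4-bit": 70.0,    # ~70 GB weights + KV, as quoted above
}

for name, gb in footprints_gb.items():
    status = "fits" if fits(gb) else "does not fit"
    print(f"{name}: {gb:.0f} GB of {VRAM_GB:.0f} GB -> {status}")

ceiling = decode_ceiling_tok_s(40.0)
print(f"70B Q4 ceiling: {ceiling:.1f} tok/s; "
      f"14.5 tok/s measured is {14.5 / ceiling:.0%} of that")
```

Under these assumptions the 70B Q4 ceiling is 768 / 40 = 19.2 tok/s, so the quoted 14.5 tok/s sits at roughly three quarters of the theoretical limit. For MoE models the relevant figure is the active-parameter footprint rather than the total, which is why the A3B models run far faster than their size suggests.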

CODING

Qwen3-Coder-30B-A3B (MoE, fits 24GB) Community daily driver for local coding; 3B-active MoE delivers 30B quality at 3B-dense speed.

CHAT / GENERAL

Qwen 3.6-27B April 22 2026 dense refresh; supersedes Qwen 3.5 27B and claims to beat the prior 397B MoE flagship while staying single-GPU at Q4 (~17 GB).

DOCS & RETRIEVAL

Qwen 3.6-27B April 22 2026 dense refresh — 262K native context extensible to 1M, multimodal, single-GPU at Q4. Now the dense long-context top pick.

IMAGE

Z-Image-Turbo (Apache 2.0) Community daily driver for realism; 6B, 8-step inference, Apache 2.0 — commercial OK.

AGENTS

Qwen 3.6-35B-A3B Latest Qwen MoE; strong function calling; realistic on 24GB+ VRAM or Mac 48GB+ — the local agentic top pick.

VOICE

Qwen3-Omni-30B-A3B-Instruct Apache 2.0 MoE; audio+video+image+text in, speech+text out; 17GB at Q4. Frontier unified voice.

The call

Buy it if you want 70B Q4 on one card in a dual-slot workstation form factor (blower fan, PCIe 4.0 x16, server-ready) and you can't stomach dual-3090 thermals and the PCIe lane split. It is also the default pick when ECC matters, as in some scientific workflows.
Skip it if you don't need 48 GB specifically: a single RTX 5090 32 GB is newer, faster for everything that fits in 32 GB, and has FP8/FP4 support that Ampere lacks. Also skip it if you want a manufacturer warranty; used A6000s are sold as-is or with short reseller coverage.

Watchouts

Blower fan ramps loud under sustained inference — qualitatively "quieter than 3090" but still noticeable. Server-aesthetic, not bedroom-aesthetic.
Ampere is nearly six years old in July 2026. It has no native FP8/FP4 (Blackwell does) and gets no FlashAttention-3 speedups. Against a 5090 it offers 1.5× the VRAM (48 GB vs 32 GB) at roughly 43% of the bandwidth (768 GB/s vs about 1,792 GB/s).
Used market is opaque. PCSP + ITCreations offer 90-day warranties; eBay is as-is. Budget for a memory stress test on arrival; cards sold by ex-mining operations need thorough vetting.
CUDA 13+ features bias toward Ada/Blackwell. Some llama.cpp performance optimizations (FA-3) won't light up on Ampere. RTX 6000 Ada ($7,000+ new) is the forward-looking successor if budget allows.
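For the arrival check on a used card, one option is to read the ECC error counters after whatever stress run you choose. The sketch below is a hedged illustration, not a vetting procedure: it assumes `nvidia-smi` is on the PATH, that the query field names match what `nvidia-smi --help-query-gpu` lists on your driver, and that ECC is enabled on the card (if the mode is not reported as enabled, the counters say nothing either way). The `verdict` thresholds are a hypothetical policy.

```python
import subprocess

# Field names assumed to match `nvidia-smi --help-query-gpu` on your driver.
QUERY = ("ecc.mode.current,"
         "ecc.errors.corrected.volatile.total,"
         "ecc.errors.uncorrected.volatile.total")


def parse_ecc_line(line: str) -> dict:
    """Parse one CSV row: ECC mode, corrected count, uncorrected count."""
    mode, corrected, uncorrected = [field.strip() for field in line.split(",")]

    def to_int(value: str):
        # Non-numeric values such as "[N/A]" mean the counter is unavailable.
        return int(value) if value.isdigit() else None

    return {"mode": mode,
            "corrected": to_int(corrected),
            "uncorrected": to_int(uncorrected)}


def verdict(report: dict) -> str:
    """Hypothetical policy: any uncorrected error is grounds to return the card."""
    if report["mode"] != "Enabled" or report["uncorrected"] is None:
        return "unknown"  # ECC off or counters unavailable: no evidence
    if report["uncorrected"] > 0:
        return "reject"
    return "suspect" if report["corrected"] else "ok"


def read_ecc(gpu_index: int = 0) -> dict:
    """Run nvidia-smi for one GPU and parse the first row of output."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ecc_line(out.strip().splitlines()[0])
```

Calling `verdict(read_ecc())` after a sustained load gives a quick pass/fail signal to pair with the seller's return window. The volatile counters reset on driver reload, so they reflect only errors since the last reset.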

Local vs cloud at this tier

● LOCAL WINS

Single-card 70B Q4 at 14.5 tok/s, the quietest single-card 70B path in the lineup. ECC memory for scientific ML. Workstation-grade drivers for CUDA / ML pipelines.

● CLOUD WINS

Cloud wins on frontier reasoning and zero hardware risk. An A6000 is a multi-year bet on Ampere remaining driver-supported; NVIDIA's Production Branch 570.xx still covers it, but that will not last forever.

The right pick for a quiet, single-GPU 70B workstation in July 2026 — if ECC matters, and if the used-market price holds at $3,500–$4,500. Above $5,000 the RTX 6000 Ada new-card path starts to make sense.
