Figure out your AI setup with more clarity

A focused planner for people trying to understand what their hardware can realistically do, what kind of setup makes sense, and when local AI is actually worth the effort.

Planner

Start with your situation.

Try a preset

1. What are you doing?

2. Platform

3. GPU memory (VRAM)

4. System RAM

5. GPU family

6. Main use case

7. What matters most

8. While the model runs, what else?

9. Typical context window

Recommendation

Your result: Comfortable

Comfortable for midsize local models; strong for daily local use, coding, and experimentation.

Best-fit setup

MID TIER
  1. Qwen3-14B (GOOD · 1.4×): Sticky 14B workhorse; 128K context; Apache 2.0; broad runner support. Needs ≈ 11.5 GB at 16K context.
  2. Qwen 3.5 9B (PERFECT · 2.3×): 262K context; strong on LiveCodeBench, IFEval, MMLU-Pro for its size. Needs ≈ 7.0 GB at 16K context.
  3. gpt-oss-20b (TOO BIG · 0.9×): MXFP4-native, Apache 2.0; fits 16 GB cleanly; reasoning + tool use at this tier. Needs ≈ 18.5 GB at 16K context.
Runner
Ollama or LM Studio. LM Studio for UI, Ollama for CLI + API. Jan is a good privacy-first alternative.
Quantization
Stick to Q4_K_M for most picks.
Expected speed
30–50 tok/s on 7–8B. Long prompts pull it lower.
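
The "Needs ≈ X GB" figures above come from two pieces: the quantized weights and the KV cache for the context window, plus a little runtime overhead. A rough back-of-envelope sketch in Python; the constants here (≈0.6 bytes per weight for Q4_K_M, ≈160 KB of KV cache per token for a 14B-class model) are approximations we've assumed, not values reported by any runner:

def estimate_vram_gb(params_b, bytes_per_weight=0.6, context_tokens=16_000,
                     kv_bytes_per_token=160_000, overhead_gb=0.8):
    """Rough VRAM need: quantized weights + KV cache + runtime overhead.

    bytes_per_weight   ~0.6 for Q4_K_M (assumed; varies by quant and tensor mix)
    kv_bytes_per_token depends on layer count, KV heads, and head dim, so it
                       differs per model; 160 KB is a 14B-class ballpark.
    """
    weights_gb = params_b * bytes_per_weight            # billions of params × bytes per weight
    kv_gb = context_tokens * kv_bytes_per_token / 1e9   # KV cache for the context window
    return weights_gb + kv_gb + overhead_gb

print(f"{estimate_vram_gb(14):.1f} GB")  # ≈11.8 GB, roughly the ≈11.5 GB quoted above

Double the context window and the KV term roughly doubles, which is why long context pushes a GOOD fit toward TOO BIG.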

Next steps · Windows

  1. Install Ollama
    winget install Ollama.Ollama
  2. Pull the model
    ollama pull qwen3:14b
  3. Run it
    ollama run qwen3:14b
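
Ollama also exposes a local HTTP API on port 11434 once it's running, which is what editor integrations and scripts talk to. A minimal sketch in Python against the model pulled above (requests is the only dependency):

import requests

# Single-turn completion against the local Ollama server.
# /api/chat is the multi-turn equivalent if you need message history.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3:14b",   # the model pulled in step 2
        "prompt": "Summarize the tradeoffs of Q4_K_M quantization in two sentences.",
        "stream": False,        # one JSON response instead of a token stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])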

Local vs cloud at this tier

Local works for chat, lightweight coding, and short-document Q&A. For hard reasoning, long context (>64K), or production-quality image generation, cloud wins on quality per dollar. Think of local as the privacy-preserving fallback and the zero-marginal-cost workhorse for small tasks, not the daily driver for heavy output.

Workflow notes

  • Wire into your editor — Continue.dev, Cline, or Aider.
  • Keep a small fast model for autocomplete, a bigger one for review (see the routing sketch after this list).
  • Cloud fallback for the heaviest reasoning tasks.
  • Take the faster path per task — local or cloud.
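
A minimal sketch of that small-model/large-model split, assuming both models are pulled into Ollama; the task labels and the small-model tag are illustrative, not a fixed convention:

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

# Illustrative routing table: a small model for latency-sensitive autocomplete,
# the mid-tier pick for slower, higher-quality review passes.
MODEL_FOR_TASK = {
    "autocomplete": "qwen2.5-coder:3b",  # assumed small pick; swap in whatever you run
    "review": "qwen3:14b",
}

def run(task: str, prompt: str) -> str:
    model = MODEL_FOR_TASK.get(task, MODEL_FOR_TASK["review"])
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(run("review", "Review this function for off-by-one errors:\n..."))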

Watchouts

  • Very large models still push this setup — quantize aggressively.

Feel the quality

What you actually get at each tier.

Same prompt, top pick from each band. This is the quality jump you're paying for when you move up a tier — no chatbot hand-waving.

Prompt

Write a Python function that deduplicates a list while preserving insertion order.

Low tier · Qwen 3.5 4B

def dedupe(lst):
    return list(set(lst))
# order lost :(

Mid tier · Qwen 3.5 9B

def dedupe(lst):
    seen = set()
    return [x for x in lst
            if not (x in seen
                    or seen.add(x))]

High tier · Qwen 3.5 27B

# O(n), preserves order
def dedupe(lst):
    seen = set()
    out = []
    for x in lst:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

Top tier · GLM-5.1

from typing import Iterable, TypeVar

T = TypeVar("T")

def dedupe(items: Iterable[T]) -> list[T]:
    """Remove duplicates, preserve order, O(n)."""
    seen: set[T] = set()
    return [x for x in items
            if not (x in seen or seen.add(x))]

Static snapshot — dated April 2026. Your own run on these models may differ.

Is it worth it?

Local vs cloud over 36 months.

~5M tokens / mo · Daily coding + chat

Flat $20/mo; ~50–100 messages per 5h.

Estimated from your planner inputs.

US average ≈ $0.13/kWh. Adjust for your utility.

Cumulative cost, 36 months

[Chart: cumulative cloud vs local cost, $0 to $1.6k over 0–36 months]

Headline

Pays back in 76 months (beyond 3 years).

3-year total: $1,504 local vs $720 cloud. Cloud saves $784 over 36 months at this usage.

Pricing as of April 2026. Assumes ~250 W active inference at ~400 tok/s. Your real numbers will vary.
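
The headline falls out of simple arithmetic: local cost is hardware plus electricity for the active inference hours, cloud is the flat subscription. A sketch using the assumptions on this page; the hardware price is backed out of the $1,504 three-year local total rather than quoted directly:

import math

hardware_usd     = 1500        # implied by the $1,504 three-year local total (assumption)
cloud_per_month  = 20          # flat subscription
tokens_per_month = 5_000_000
tok_per_s        = 400         # assumed active-inference throughput
watts_active     = 250
usd_per_kwh      = 0.13        # US average; adjust for your utility

hours_active    = tokens_per_month / tok_per_s / 3600                # ≈3.5 h/month of inference
power_per_month = hours_active * watts_active / 1000 * usd_per_kwh   # ≈$0.11/month

local_36mo = hardware_usd + 36 * power_per_month   # ≈$1,504
cloud_36mo = 36 * cloud_per_month                  # $720
payback    = math.ceil(hardware_usd / (cloud_per_month - power_per_month))  # ≈76 months

print(f"local ${local_36mo:,.0f} vs cloud ${cloud_36mo:,.0f}, payback {payback} months")

At this usage the electricity term is close to negligible, so the payback period is essentially the hardware price divided by the monthly subscription.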

About

The AI Bench is a practical publication for local AI decisions. No hype, no newsletter, no directory bloat — just fewer, sharper tools that help you decide what to run, what to buy, and when local AI is actually worth it.

Runs locally via Ollama, LM Studio, or ComfyUI for image models.

Latest snapshot · April 2026

Gemma 4, GLM-5.1, and the MoE moment.

Qwen 3.5 27B is now the dense top pick for 24 GB. Gemma 4 31B took #3 on Arena. Z-Image-Turbo runs on 6 GB VRAM. The big story is mixture-of-experts: 30B-A3B MoE models now hit 3B-class inference speeds on a 4090 at 30B-class quality.

Read the changes feed. Quarterly snapshots + fast takes on major drops. RSS. No newsletter.

Hardware we'd actually buy

Top tier · 32 GB VRAM

RTX 5090

VRAM
32 GB GDDR7
Bandwidth
1,792 GB/s
TDP
575 W
Street
~$2,500–$3,900

Runs everything up to 32B dense and 70B MoE comfortably. Launch MSRP $1,999 — memory-crisis supply through mid-2026 keeps it well above that. Best Buy / Newegg restocks sell out in minutes.

Smart money · 48 GB total

Dual RTX 3090 (used)

VRAM
2× 24 GB GDDR6X
Bandwidth
936 GB/s each
TDP
350 W each
Street
~$1,600–$2,400 all-in

The community sweet spot. Used 3090s run $670–$1,000 each on eBay; add $200–$400 for a beefier PSU. llama.cpp + Ollama auto-split across PCIe — no NVLink needed. Runs 70B Q4 without quant pain. Loudest of the picks.

Team red · 24 GB VRAM

AMD Radeon RX 7900 XTX

VRAM
24 GB GDDR6
Bandwidth
960 GB/s
TDP
355 W
Street
~$750–$1,100 new

The AMD pick. 24 GB at ~85–90% of 4090 throughput on ROCm. vLLM + llama.cpp (HIP) are the reliable runners — Ollama on AMD is still patchy. Budget 5–10h for first-time ROCm driver setup. Used ones go ~$750–$850 on eBay.

All-rounder · Mac

M5 Max MacBook Pro 64 GB

Unified memory
64 GB
Bandwidth
614 GB/s
TDP
~40 W sustained
Price
~$4,599

Silent, portable. Prefill on long prompts is slow vs NVIDIA; throughput on short prompts is fine. Custom-config from 16" M5 Max base (2 TB SSD minimum).

Budget · 16 GB VRAM

RTX 5060 Ti 16 GB

VRAM
16 GB GDDR7
Bandwidth
448 GB/s
TDP
180 W
Street
~$550

Cheapest way into 14B Q4 and MoE 30B-A3B. Launch MSRP $429 (April 2025); current street runs higher.

No affiliate links yet. These are what we'd buy today. We'll only add paid links for picks that are still the honest best answer.