MacBook Air M5 24 GB

Fanless, silent, and barely enough bandwidth to feel interactive.

The M5 Air at 24 GB is the first Apple laptop where 8B dense Q4 inference feels responsive without a fan ramping up — because there is no fan. 153 GB/s bandwidth is the honest limiting factor; this is not a 14B-comfortable machine.

The decision in five lines

The call: Consider — Fanless, silent, and barely enough bandwidth to feel interactive.
Best for: Fanless portable
Runs well: Qwen3-14B · Qwen 3.5 9B · Qwen 3.5 9B + RAG
Watch out: Fanless means sustained inference throttles. Expect ~15–20% speed loss after 5–10 minutes of continuous token generation.
Evidence: Estimated · last verified July 2026

24 GB unified memory
153 GB/s memory bandwidth
Fanless: sustained throttle
$1,499: 13" 24 GB (Jun 25)

What fits at this tier

Fits 8B dense Q4 at ~25–35 tok/s and 4B-class picks at 60–75 tok/s. 14B dense Q4 also fits (about 8 GB of weights in 24 GB unified memory), but the 153 GB/s bandwidth caps token generation at ~12–18 tok/s: workable, not comfortable. 30B-A3B MoE Q4 is a REQUIRES TWEAK fit at the memory margin, and the lower bandwidth makes it slower here than on the Mac mini M4 Pro.
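The 14B figure follows from a simple bandwidth argument: during decoding, a dense model reads roughly all of its weights once per generated token, so tokens per second cannot exceed bandwidth divided by weight size. The sketch below is a back-of-envelope estimate, not a benchmark. The function name and the efficiency factors (0.65 and 0.9) are illustrative assumptions; the 153 GB/s, 273 GB/s, and 8 GB figures come from this page.

```python
# Back-of-envelope decode ceiling for a bandwidth-bound dense model.
# Assumption: every generated token streams all weights from memory once.
# KV-cache reads and runtime overhead are lumped into one efficiency factor.

def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float,
                         efficiency: float = 1.0) -> float:
    """Estimated tokens/second upper bound for token generation."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s * efficiency / weights_gb


M5_AIR_BW = 153.0        # GB/s, from this page
MINI_M4_PRO_BW = 273.0   # GB/s, from this page
WEIGHTS_14B_Q4 = 8.0     # GB, from this page

if __name__ == "__main__":
    ideal = decode_ceiling_tok_s(M5_AIR_BW, WEIGHTS_14B_Q4)
    # Efficiency values are assumed, chosen only to illustrate the range.
    low = decode_ceiling_tok_s(M5_AIR_BW, WEIGHTS_14B_Q4, efficiency=0.65)
    high = decode_ceiling_tok_s(M5_AIR_BW, WEIGHTS_14B_Q4, efficiency=0.9)
    mini = decode_ceiling_tok_s(MINI_M4_PRO_BW, WEIGHTS_14B_Q4)
    print(f"M5 Air 14B Q4 ideal ceiling: {ideal:.1f} tok/s")
    print(f"M5 Air 14B Q4 with assumed overhead: {low:.1f} to {high:.1f} tok/s")
    print(f"Mac mini M4 Pro 14B Q4 ideal ceiling: {mini:.1f} tok/s")
```

The ideal ceiling for 14B Q4 on the Air comes out at about 19 tok/s, which is why the ~12–18 tok/s estimate is plausible once real-world overhead is included. The same arithmetic explains why the same model is faster on the Mac mini M4 Pro.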

CODING

Qwen3-14B Sticky 14B workhorse; 128K context; Apache 2.0; broad runner support.

CHAT / GENERAL

Qwen 3.5 9B 262K context with native multimodal; strong on GPQA, IFEval, LiveCodeBench at the 9B size.

DOCS & RETRIEVAL

Qwen 3.5 9B + RAG Chunk aggressively, retrieve well; 262K native context handles big retrieval windows comfortably.

IMAGE

FLUX.2 klein 4B (Apache 2.0) BFL's first fully Apache-2.0 model; 4B distilled for fast inference on mid-tier GPUs; commercial OK.

AGENTS

Qwen 3.5 9B Strong tool-use performance for 9B; supports thinking mode and 201-language coverage.

VOICE

Chatterbox Multilingual (Resemble AI) MIT; 23 languages; voice cloning + emotion dial; pip 0.1.7 (March 2026) shows active development.

The call

Buy it if portability and silent operation matter more than raw throughput: you want local 8B on a plane, in a library, or next to a sleeping kid. The M5 Neural Accelerators do speed up prefill compared with the M4 generation.
Skip it if you'll plug it in and use it as a desk machine. A Mac mini M4 Pro at $1,599 gives 273 GB/s (about 78% more bandwidth) in a smaller footprint that stays on your desk.

Watchouts

Fanless means sustained inference throttles. Expect ~15–20% speed loss after 5–10 minutes of continuous token generation.
Ollama’s MLX backend requires more than 32 GB of unified memory (Ollama’s own wording). 24 GB MacBook Air users stay on the Metal path.
153 GB/s is the hard bandwidth ceiling. No sysctl tweak helps — memory bandwidth is not memory capacity.
The macOS 33% memory reservation caps the LLM budget at ~16 GB. With a sysctl tweak, that rises to ~21 GB.
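The memory and throttle numbers above are plain arithmetic, sketched here so you can plug in your own model. This is an illustrative estimate under stated assumptions: the helper names and the 4 GB KV-cache allowance are hypothetical, while the 33% reservation, the ~21 GB tweaked budget, the 8 GB weight size, and the 15–20% throttle come from this page. The sysctl key is commonly reported as iogpu.wired_limit_mb on recent macOS versions; treat that as an assumption and verify it on your own system before changing it.

```python
# Memory budget and sustained-throttle arithmetic for a 24 GB fanless Mac.
# All helper names are illustrative; none of this queries the OS.

def llm_budget_gb(unified_gb: float, reserved_fraction: float = 0.33) -> float:
    """Memory available to the model after the default macOS reservation."""
    return unified_gb * (1.0 - reserved_fraction)


def fits(weights_gb: float, kv_cache_gb: float, budget_gb: float) -> bool:
    """True if weights plus KV cache stay within the budget."""
    return weights_gb + kv_cache_gb <= budget_gb


def throttled_tok_s(tok_s: float, loss_fraction: float) -> float:
    """Speed after sustained-load throttling on a fanless chassis."""
    return tok_s * (1.0 - loss_fraction)


if __name__ == "__main__":
    default_budget = llm_budget_gb(24.0)   # about 16 GB
    tweaked_budget = 21.0                  # GB, figure from this page
    kv_allowance = 4.0                     # GB, assumed for illustration

    print(f"Default budget: {default_budget:.1f} GB")
    print("14B Q4 fits by default:", fits(8.0, kv_allowance, default_budget))
    print("14B Q4 fits with tweak:", fits(8.0, kv_allowance, tweaked_budget))

    # Apply the 15-20% sustained throttle to a 15 tok/s starting speed.
    for loss in (0.15, 0.20):
        print(f"{loss:.0%} throttle: {throttled_tok_s(15.0, loss):.2f} tok/s")
```

Raising the budget only changes what fits. It does nothing for speed, since the 153 GB/s ceiling is unaffected.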

Local vs cloud at this tier

● LOCAL WINS

Airplane-mode local inference. Privacy-first work from a cafe. Local chat + embeddings + summarization that doesn't need a network.

● CLOUD WINS

Anything interactive at 14B+. A portable laptop on Wi-Fi running Claude Pro beats this on quality for chat work, at $20/mo.

A sensible secondary inference device, not a primary one. The value is "local AI that travels," not raw throughput.
