Granite-Switch 4.1 — IBM's 12-adapter-in-one-checkpoint pattern is the deployment story

IBM uploaded the Granite-Switch 4.1 family (3B / 8B / 30B previews) to Hugging Face on May 25. Each checkpoint is the base Granite 4.1 dense model with 12 task-specialized LoRA adapters embedded, activated per token via control tokens in the chat template. The adapters are grouped into three libraries: Core (requirement check, context attribution, uncertainty), RAG (query rewrite, query clarification, answerability, hallucination detection, citation generation), and Guardian (safety detection, factuality detection and correction, policy guardrails). All checkpoints are Apache 2.0, with 128K context and support for 12 languages.

Verdict: IBM ships an agent toolkit as 12 task LoRAs in one checkpoint — Apache 2.0, preview

The take

The deployment pattern is the story here, not the base capability. Until now, the production-RAG and agent-orchestration playbook has been to serve a generalist base model and maintain a fleet of specialist sidecars (a reranker, a hallucination detector, a query rewriter, a safety filter). Each sidecar is its own deployment, its own inference cost, its own monitoring story. Granite-Switch 4.1 collapses that fleet into one checkpoint by embedding 12 task LoRAs and switching them per position via control tokens. A lightweight switch layer reads the control tokens in the chat template and produces per-position adapter indices that are applied across all decoder layers. KV-cache normalization keeps the adapters independent, so a single conversation can move through Query-Rewrite → Answerability → Generation → Hallucination-Detection without unloading or reloading models.
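To make the per-position switching concrete, here is a minimal sketch in plain Python. It is not IBM's implementation: the control-token strings, the index numbering, and the rule that a control token takes effect at its own position are all assumptions for illustration, since the actual chat-template tokens are not listed here.

```python
from typing import Dict, List

# Hypothetical control tokens and adapter indices (assumed names, not from
# IBM's chat template). Index 0 stands for "no adapter, plain base model".
BASE = 0
CONTROL_TOKENS: Dict[str, int] = {
    "<|base|>": BASE,
    "<|query_rewrite|>": 1,
    "<|answerability|>": 2,
    "<|hallucination_detection|>": 3,
}


def adapter_indices(tokens: List[str]) -> List[int]:
    """Return one adapter index per token position.

    Assumption: a control token switches the active adapter starting at its
    own position, and the choice persists until the next control token.
    """
    current = BASE
    indices: List[int] = []
    for tok in tokens:
        if tok in CONTROL_TOKENS:
            current = CONTROL_TOKENS[tok]
        indices.append(current)
    return indices


conversation = [
    "<|query_rewrite|>", "best", "gpu",
    "<|answerability|>", "yes",
    "<|base|>", "The", "answer",
    "<|hallucination_detection|>", "ok",
]
print(adapter_indices(conversation))
```

In a real model the resulting index vector would select which LoRA weights are applied at each position in every decoder layer; the sketch only shows how one sequence can carry several adapters without a model swap.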

The deployment math: one model in VRAM (8B base + ~150 MB per active adapter), one inference endpoint, one scaling story. The base is a `LlamaForCausalLM`-class architecture, so vLLM and Transformers handle it natively, with no exotic kernels and no special runtime. The 12 adapters split as Core (3), RAG (5), Guardian (4). The 12 languages are EN, DE, ES, FR, JA, PT, AR, CS, IT, KO, NL, ZH.
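A back-of-envelope version of that math, as a rough sketch. The ~150 MB per adapter figure is the one quoted above; the bytes-per-parameter values (2.0 for bf16, 0.5 for 4-bit quantization) and the use of decimal gigabytes are our assumptions, and the estimate covers weights only, ignoring KV cache and activations.

```python
def weights_vram_gb(
    params_billion: float,
    bytes_per_param: float,
    active_adapters: int,
    adapter_mb: float = 150.0,  # per-adapter figure quoted in the article
) -> float:
    """Estimate weight memory in decimal GB (weights only, no KV cache)."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    base_gb = params_billion * bytes_per_param
    adapters_gb = active_adapters * adapter_mb / 1000.0
    return base_gb + adapters_gb


# 8B base in bf16 (assumed 2 bytes/param) with all 12 adapters resident
print(round(weights_vram_gb(8, 2.0, 12), 1))  # 17.8

# Same model at an assumed 4-bit quantization (0.5 bytes/param)
print(round(weights_vram_gb(8, 0.5, 12), 1))  # 5.8
```

On these assumptions the adapters add under 2 GB in total, which is small next to the cost of hosting each sidecar as its own model.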

What it is NOT: a step up in base capability. IBM's Granite 4.1 8B is a respectable but not standout 8B model: Qwen 3.5 9B, Ministral 3 8B Instruct, and Llama 3.1 8B all beat it on standard chat / coding benchmarks. The value is the adapter toolkit, not the base. And the "Preview" label is real: IBM's card explicitly says adapters should be tested per use case before production, and the Guardian library's safety adapters are not a substitute for application-level safety testing.

Where it fits in our taxonomy: we've added /models/granite-switch-4-1/ as an editorial reference for the deployment-pattern story. It gets no planner pick slot, because the preview status and niche use case keep it out of the modelPicks arrays until full release. For readers building RAG or agent stacks at production scale, this is the pattern to study: even if you ultimately stick with sidecars on a different base model, the embedded-adapter shape is where this category is heading. For readers running local chat on consumer hardware, it doesn't change your picks: Qwen 3.5 35B-A3B / Qwen3-Coder-30B-A3B / Ministral 3 14B Instruct still own the 16–24 GB tier.

Where this fits

Models: Granite-Switch 4.1 8B Preview (12 task LoRAs) · IBM Granite 4.1 · Qwen 3.5 35B-A3B · Ministral 3 family (3B / 8B / 14B)

Hardware: NVIDIA RTX 4090 · RTX 5060 Ti 16 GB · Mac Studio M4 Max 64 GB
