AI Model Catalog — llama-cpp Preset Bank¶
What each model in the cluster is, what it's for, and when to use it. Companion to
3090-llm-optimization.md(topology, KV math, engine choice). Source of truth for the presets themselves:my-apps/ai/llama-cpp/presets.ini.Last updated: 2026-05-30.
How the bank works (read this first)¶
All local LLMs are served by one llama-server (GPU0) using its built-in
router: --models-preset /config/presets.ini. Each [section] is a preset;
the text before the - is the OpenAI-API model ID clients send in their
model field. Clients hit http://llama-cpp-service.llama-cpp.svc.cluster.local:8080/v1.
Critical operational fact — the router runs one child process per preset,
not per GGUF file, and --models-max 1 keeps only ONE resident. Switching
presets unloads one child and loads the next — a ~22GB --no-mmap read off NFS
(~16s warm, 30–90s cold). Even presets that share the same GGUF and --ctx-size
(e.g. qwen3.6 think vs qwen3.6-nothink) are separate instances and force
a full reload when you alternate. If a load is still in-flight when a different
preset is requested, the LRU evictor force-kills it → the caller gets
500 "model … failed to load". That reload window is the "stall."
Consequence: keep day-to-day traffic on ONE shared preset. That preset is the
daily driver, qwen3.6-nothink (aliased default / daily). Reserve other
presets for deliberate, occasional, serialized use. This is why Perplexica and
OpenWebUI were repointed off longctx/think onto nothink (PR #1312).
Quick reference¶
| Preset (model ID) | GGUF | Params | Ctx | Use it for |
|---|---|---|---|---|
qwen3.6-nothink ⭐ |
Qwen3.6-35B-A3B UD-Q4_K_XL | 35B MoE / 3B active | 64K | DAILY DRIVER — tools, agents, JSON, RAG, chat |
qwen3.6 |
(same GGUF) | 35B / 3B | 64K | reasoning chat (think mode) |
qwen3.6-longctx |
(same GGUF) | 35B / 3B | 256K | deep research — dual-card only |
coder |
Qwen3-Coder-30B-A3B UD-Q4_K_XL | 30B MoE / 3.3B active | 64K | local coding agent (OpenCode/Cline/Aider) |
tool-fast |
Qwen3-4B-Instruct-2507 UD-Q4_K_XL | 4B dense | 32K | high-volume n8n / triage tool calls |
whiterabbitneo |
WhiteRabbitNeo-2.5-Qwen2.5-Coder-7B Q8_0 (→ swap to V3-7B) | 7B dense | 64K | authorized security learning (isolated) |
gemma4 / gemma4-nothink |
gemma-4-26B-A4B UD-Q4_K_XL | 26B MoE / 3.8B active | 128K | multimodal fallback / bulk vision captioning |
gemma4-31b / -nothink |
gemma-4-31B UD-Q4_K_XL | 31B dense | 64K | top-quality second opinion (slow) |
uncensored / nothink-uncensored |
Qwen3.5-35B-A3B-Uncensored-HauhauCS Q4_K_M | 35B MoE / 3B | 32K | creative / RP toy only |
⭐ = the model everything should default to. think/nothink variants of a model
share one GGUF and switch sampling + enable_thinking only — but are still
separate router instances, so don't split daily traffic across them.
Qwen 3.6-35B-A3B — the primary model (3 presets)¶
Unsloth Dynamic Q4_K_XL (~22GB), MoE 35B total / ~3B active per token. Native
256K context (1M via YaRN). Multimodal via mmproj-BF16.gguf. The best
self-hostable research/agentic/tool-calling model for a 24GB card as of May 2026
(Qwen 3.7 is API-only; see optimization doc). One set of weights covers text,
coding, tool calls, and vision.
qwen3.6-nothink ⭐ — DAILY DRIVER (default / daily)¶
- For: everything day-to-day — n8n, Karakeep tagger, K8sGPT, Perplexica, OpenWebUI chat + tasks, NOMAD, agent loops, RAG, JSON/tool calls.
- Why nothink:
<think>tokens break strict JSON tool parsing and inflate context. Sampling tuned for clean structured output:temp 0.7, top_p 0.8, top_k 20, presence-penalty 1.5(the penalty stops agent loops re-issuing the same tool). 64K ctx. - Point any new tool at model
defaultso it shares this resident instance. To re-point the whole cluster to a different daily driver later, move thedaily, defaultaliases to another preset — one edit, everything follows.
qwen3.6 — reasoning chat (think mode)¶
- For: interactive chat where you want visible reasoning. Same GGUF/ctx as
nothink; differs only in
temp 1.0+enable_thinking:true. - Caution: separate router instance from nothink → using it alongside the daily driver causes a reload. Fine for deliberate "I want thinking now" use; don't wire automated/utility traffic to it.
qwen3.6-longctx — deep research, DUAL-CARD ONLY¶
- For: the occasional very long research dig or full-repo context, when both 3090s are pooled (layer-split, 48GB → 256K KV resident).
- On a single 3090 it's a trap: 256K KV (~14GB) spills ~7GB to host RAM → CPU-driven attention on the Broadwell CPU = slow, and it's a distinct instance that thrashes against the daily driver. Perplexica's embedding filter keeps prompts <64K anyway, so it does not default here. Select manually only when you've pooled both cards (scale ComfyUI→0). See optimization doc.
Qwen3-Coder-30B-A3B — coder¶
- What: "Coder-Flash", MoE 30B total / 3.3B active, 256K native, purpose-built for agentic / repo-scale coding. Unsloth UD-Q4_K_XL (~17.7GB).
- For: a local coding agent — point OpenCode / Cline / Aider / Continue at the
llama-cpp
/v1endpoint and selectcoder. Free, unlimited volume; per-request window ismin(model, VRAM)(~64K single-card; full 256K only if both cards pooled). Sampling: Qwen coding profiletemp 0.7, top_p 0.8, top_k 20. - Use it for the "free for small stuff" tier — boilerplate, refactors, quick questions. Hard architecture work still goes to paid APIs.
Qwen3-4B-Instruct-2507 — tool-fast¶
- What: small 4B dense, UD-Q4_K_XL (~2.5GB). Punches far above its size on function-calling evals.
- For: high-volume, low-complexity n8n triggers / triage where swap latency on
the 35B would dominate. Tiering:
tool-fastfor simple triggers,default(35B) for complex agent chains. - Note: small enough to potentially co-reside with the daily driver if
--models-maxis raised to 2 after verifying the VRAM budget (a 35B + 4B fit; two 35B presets do not). Currently--models-max 1.
WhiteRabbitNeo — whiterabbitneo (isolated security model)¶
- What: WhiteRabbitNeo, built on Qwen2.5-Coder-7B, trained on offensive + defensive security data. Domain-tuned, chosen over an abliterated general model because abliteration degrades accuracy and a security-trained model is both more accurate AND more willing on this domain.
- ⚠️ Version — use V3, not 2.5: the currently-deployed file is
WhiteRabbitNeo-2.5-Qwen-2.5-Coder-7B-Q8_0.gguf, but the latest open release is WhiteRabbitNeo-V3-7B (GGUF: bartowski). TODO: swap v2.5 → V3-7B (download to NFS, repoint themodel =path). - WhiteRabbitNeo's open line is Qwen2.5-Coder-class (v2.5 confirmed Qwen2.5-Coder-7B; V3 is the newest open 7B release in the same lineage). There is no Qwen3-based WhiteRabbitNeo — so "it's Qwen2.5" is the model family, not a sign you're running something stale within WRN. 7B is the only current open size.
- Larger 13B/33B exist only on older v1/v1.5 bases — stale, skip them.
- The recommended self-hosted target is V3-7B; v2.5 still works fine in the meantime (it's isolated/occasional-use), so this is a low-priority swap.
- The 30B-MoE "Deep Hat V2" successor is proprietary (Kindo), not self-hostable.
- For: authorized security learning on systems you own or are authorized to test — attack-chain explanation, PoC/CTF code, red/blue concepts.
- Isolation by design: security-specific aliases, and it is NOT in Perplexica's model list — its tuning must never back research/RAG. Swap to it deliberately in the UI; swap back when done.
Gemma 4 — multimodal fallback (4 presets)¶
gemma4/gemma4-nothink(26B-A4B MoE, ~3.8B active, ~17GB + mmproj, 128K ctx): kept as a multimodal fallback in case Qwen 3.6's vision regresses on something, and for bulk vision captioning (e.g. ComfyUI's caption node). Google sampling:temp 1.0, top_p 0.95, top_k 64.gemma4-31b/-nothink(31B dense, ~18.8GB + mmproj, 64K ctx): strongest Gemma, but all 31B are active per token → markedly slower than every MoE preset. Use only when quality matters more than latency, never for agent loops.- Qwen 3.6 beats Gemma 4 on coding/agent/tool-calling; Gemma wins on some multimodal — hence "fallback / second opinion," not default.
Qwen 3.5 Uncensored (HauhauCS) — uncensored / nothink-uncensored¶
- What: unfiltered fine-tune of the older Qwen 3.5 base, Q4_K_M, 32K ctx.
Creative/RP sampling (
temp 0.9, min_p 0.05, top_k 40, presence-penalty 0.6). - For: creative / RP / jailbreak experimentation only.
- Keep OUT of accuracy work: abliteration/uncensoring degrades benchmarks
(esp. reasoning/JSON). It is deliberately not the default anywhere and must
not back Perplexica/RAG/tool-calling. For security learning, use
whiterabbitneo(domain-tuned, accurate) instead of this.
Which model should I use?¶
| I want to… | Use |
|---|---|
| Anything automated / tools / RAG / default chat | default (= qwen3.6-nothink) |
| Chat with visible reasoning | qwen3.6 (think) |
| Code with a local agent | coder |
| Fire lots of tiny n8n tool calls fast | tool-fast |
| Learn security on my own gear | whiterabbitneo |
| Caption images in bulk / vision fallback | gemma4-nothink |
| Squeeze max single-shot quality (slow) | gemma4-31b |
| Deep 256K research (both cards pooled) | qwen3.6-longctx |
| Creative/RP play | uncensored |
Who points at what (cluster apps)¶
| App | Model | Notes |
|---|---|---|
| Perplexica / Vane | qwen3.6-nothink (default) |
longctx demoted to "dual-card only"; active model lives in browser localStorage["chatModelKey"] |
| OpenWebUI | DEFAULT_MODELS + TASK_MODEL = qwen3.6-nothink |
⚠️ VISION_MODELS still qwen3.6 (think) → image input reloads a separate instance; repoint to qwen3.6-nothink for zero thrash |
| Project NOMAD | AI_BENCHMARK_MODEL = qwen3.6 (think) |
⚠️ residual think reference; low impact (benchmark feature) |
| n8n / Karakeep / K8sGPT | nothink / default |
tool/JSON workloads |
Residual thrash cleanup (optional): the two ⚠️ rows above still point at the
think instance and will force a reload when exercised. Repoint both to
qwen3.6-nothink for a single resident instance across all traffic.
Operating rules (don't break these)¶
- KV cache stays symmetric
q8_0/q8_0— never mixq8/q4(llama.cpp #20866 → KV falls to CPU, 44× slower). - Don't split daily traffic across presets — every alternation is a ~16–90s
reload. Consolidate on
default. --models-max 1is intentional; raising it only helps co-residing a small model with the daily driver, and only after checking VRAM (two 22GB presets won't fit one 24GB card).- Adding a model: drop the GGUF on NFS (
192.168.10.133:/mnt/ai-pool/llama-cpp), add a[id - alias]section inpresets.iniwith a UNIQUE alias set (llama-server rejects the whole file on any duplicate), confirm themodel =path, commit. The hash-suffixed configMap auto-rolls llama-cpp.
Further reading — club-3090 (and what we learned from it)¶
noonghunna/club-3090 is a community
recipe repo for serving LLMs on RTX 3090/4090/5090 (multi-engine: vLLM,
llama.cpp, ik_llama; model-agnostic — ships Qwen3.6-27B/35B and Gemma 4 26B/31B
configs). Our hardware (2× RTX 3090) matches it directly, so its findings drove a
lot of the decisions here and in 3090-llm-optimization.md.
The docs worth reading there:
- docs/CLIFFS.md — the three memory "cliffs" (FA2 softmax_lse padding; DeltaNet GDN OOM >50–60K; recurrent-state prefix-cache blowup). All three are vLLM-specific; llama.cpp is immune — a big reason we stay on llama.cpp for RAG/agentic.
- docs/KV_MATH.md — KV-cache memory math. Rule of thumb: each halving of bytes/element ~doubles affordable context. q4_0 KV ≈ 0.28× fp16; TQ3/TurboQuant ≈ 0.21×.
- docs/SINGLE_CARD.md / DUAL_CARD.md — single-3090 flag recipes (-ub, -ngl, -ckv) and the TP=2 path.
- docs/QUANTIZATION.md, COMPARISONS.md, HARDWARE.md, BENCHMARKS.md.
- Model recipes for our exact models: models/qwen3.6-35b-a3b/ (ik-llama + vllm), gemma-4-26b-a4b/.
Interesting findings (with sources) that shaped our setup:
- Single-card: llama.cpp ≫ vLLM for this MoE. Same-hardware benchmark
(tfriedel/qwen3.6-rtx3090-lab):
Qwen3.6-35B-A3B on one 3090 → llama.cpp/IQ4_XS 115–133 TPS vs vLLM/AWQ-INT4
18.6 TPS (~7×; AWQ weights starve the card → --enforce-eager). vLLM tp1 is
"an anti-pattern." vLLM only wins at TP=2 (149 TPS @ 200K; 264 code TPS w/ MTP-3).
- MTP/speculative decoding gives NO net speedup on Ampere + A3B under llama.cpp
(thc1006 bench /
writeup) — the MoE is already
memory-bound. It only helps under vLLM TP=2. (We skipped the MTP GGUF for this reason.)
- ik_llama.cpp is the fastest single-card path (~18–20% faster decode, leaner
VRAM via IQK quants + Hadamard KV) — the relevant upgrade to try before vLLM.
- TurboQuant (turbo3 KV, ≈5× smaller) is the coming KV win
(llama.cpp #20969) —
not in mainline yet (PR #21089). club-3090 calls it the "TQ3/Genesis" path.
- Asymmetric KV trap (llama.cpp #20866):
q8-K + q4-V forces the KV cache to CPU, 44× slower. Keep KV symmetric.
- Docker tag scheme moved cuda12 → cuda13 (CUDA 13;
#21429) — why the renovate
regex was frozen and we bumped to server-cuda13-b9354.
Full topology / KV / engine analysis lives in
docs/domains/ai-gpu/3090-llm-optimization.md.