Single-GPU Llama.cpp + ComfyUI Vision Integration
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Split dual 3090s so llama-cpp uses 1 GPU with Qwen3.5 Q4_K_XL (only model) and ComfyUI uses the other, with ComfyUI able to call llama-cpp for vision/captioning.
Architecture: llama-cpp drops from 2 GPUs to 1, switches from dual-model (Coder + Qwen3.5 Q6) to single-model (Qwen3.5 Q4_K_XL with mmproj for vision). Context reduced to 16K for captioning use case. ComfyUI already has comfyui-llamacpp-client node installed and requests 1 GPU — no changes needed to ComfyUI deployment. Both pods schedule on the same GPU node, each claiming 1 of 2 GPUs.
Tech Stack: llama.cpp server (CUDA), Qwen3.5-35B-A3B multimodal, ComfyUI, comfyui-llamacpp-client node
Task 1: Update llama-cpp ConfigMap — single Qwen3.5 Q4_K_XL preset
Section titled “Task 1: Update llama-cpp ConfigMap — single Qwen3.5 Q4_K_XL preset”Files:
- Modify:
my-apps/ai/llama-cpp/configmap.yaml
Step 1: Replace configmap with single-model preset
Replace entire data.presets.ini content with:
# ==========================================================# QWEN3.5-35B-A3B [MULTIMODAL] — Single GPU (RTX 3090 24GB)# ==========================================================[qwen3.5]# 35B total / 3B active (MoE) - Gated DeltaNet + Gated Attention# Natively multimodal (vision + language)# Q4_K_XL (20.6GB) + mmproj (858MB) fits in single 24GB 3090# Feb 27 2026: Updated Unsloth Dynamic 2.0 quant (MXFP4 retired from attention)# Qwen official "precise" thinking paramsmodel = /models/Qwen3.5-35B-A3B-UD-Q4_K_XL.ggufmmproj = /models/mmproj-F16.ggufalias = qwen3.5, qwen 3.5, general, vision, image, multimodal, coder, codectx-size = 16384n-gpu-layers = 99temp = 0.6top-p = 0.95top-k = 20min-p = 0.0presence-penalty = 0.0chat-template-kwargs = {"enable_thinking": true}jinja = 1Key changes:
- Removed Qwen3-Coder-Next preset entirely
- Switched Q6_K_XL → Q4_K_XL
- Removed
tensor-split = 1,1(single GPU) - Context 131072 → 16384 (captioning/prompting use case)
- Added
coder, codealiases so existing API consumers still resolve
Task 2: Update llama-cpp Deployment — single GPU, reduced resources
Section titled “Task 2: Update llama-cpp Deployment — single GPU, reduced resources”Files:
- Modify:
my-apps/ai/llama-cpp/deployment.yaml
Step 1: Update server args
Change global -c from 131072 to 16384.
Remove --models-max 8 (only 1 model now — remove or set to 1).
Remove -b 4096 and -ub 1024 (single GPU with smaller context doesn’t need oversized batches). Replace with -b 2048 and -ub 512.
Keep: --models-preset, -ngl 99, -fa on, --jinja, --fit on, --no-mmap, --cache-type-k q8_0, --cache-type-v q8_0, --parallel 1, --host, --port.
Step 2: Update env vars for single GPU
env: - name: NVIDIA_VISIBLE_DEVICES value: "all" - name: CUDA_VISIBLE_DEVICES value: "0" - name: NVIDIA_DRIVER_CAPABILITIES value: "compute,utility" - name: GGML_CUDA_ENABLE_UNIFIED_MEMORY value: "1"Remove:
GGML_CUDA_PEER_MAX_BATCH_SIZE(multi-GPU peer transfer, not needed)CUDA_SCALE_LAUNCH_QUEUES(multi-GPU launch queue optimization, not needed)
Step 3: Update resource requests/limits
resources: limits: cpu: "32" memory: 64Gi # Q4_K_XL (20.6GB) + KV cache + overhead, RAM for expert paging nvidia.com/gpu: "1" # Was 2 ephemeral-storage: "50Gi" requests: cpu: "8" memory: 32Gi nvidia.com/gpu: "1" # Was 2 ephemeral-storage: "10Gi"Step 4: Reduce /dev/shm
Change sizeLimit: 32Gi → sizeLimit: 8Gi (single GPU, smaller context).
Step 5: Update comments
terminationGracePeriodSeconds: 300comment → update from “400GB memory unmapping” to “model unload time”GGML_CUDA_ENABLE_UNIFIED_MEMORYcomment → update to reference single 3090
Task 3: Create vision captioning workflow for ComfyUI
Section titled “Task 3: Create vision captioning workflow for ComfyUI”Files:
- Create:
my-apps/ai/comfyui/workflows/qwen35-vision-caption.json
This workflow: Load Image → LlamaCpp Client (vision) → Show Text
The comfyui-llamacpp-client node needs the llama-cpp service URL:
http://llama-cpp-service.llama-cpp.svc.cluster.local:8080
Note: The exact class_type and parameter names depend on the installed version of comfyui-llamacpp-client. The workflow should be created in the ComfyUI UI and exported, or verified against the node’s actual parameter schema. Create a minimal reference workflow:
{ "1": { "class_type": "LoadImage", "inputs": { "image": "input.png" } }, "2": { "class_type": "LlamaCppClient", "inputs": { "server_url": "http://llama-cpp-service.llama-cpp.svc.cluster.local:8080", "endpoint": "/v1/chat/completions", "prompt": "Describe this image in detail for use as a Stable Diffusion prompt. Focus on composition, lighting, colors, style, and subject matter.", "image": ["1", 0], "temperature": 0.6, "top_p": 0.95, "top_k": 20, "max_tokens": 512 } }, "3": { "class_type": "ShowText|pysssss", "inputs": { "text": ["2", 0] } }}Important: This workflow JSON is a reference template. The actual node class_type and input names must be verified from the installed comfyui-llamacpp-client node in the ComfyUI UI. The user may need to recreate it visually in ComfyUI to match the actual node interface.
Task 4: Update ComfyUI pre-start to copy vision workflow
Section titled “Task 4: Update ComfyUI pre-start to copy vision workflow”Files:
- Modify:
my-apps/ai/comfyui/configmap.yaml(thecomfyui-pre-startConfigMap)
Step 1: Add workflow copy to pre-start.sh
After the WanVideoWrapper workflow copy section, add:
# ── LlamaCpp Vision Workflows ────────────────────────────# Copy from ConfigMap-mounted workflows (if available)LLAMA_WF="/opt/workflows/qwen35-vision-caption.json"if [ -f "$LLAMA_WF" ]; then cp -f "$LLAMA_WF" "$DEST/" && \ echo "[INFO] Copied Qwen3.5 vision captioning workflow" || truefiStep 2: Mount workflow as ConfigMap in ComfyUI deployment
Create a new ConfigMap from the workflow JSON and mount it, OR simply document that the workflow should be loaded manually in the ComfyUI UI.
Given that workflows are typically created/edited in the UI and the JSON structure needs verification against the actual node, the simpler approach is: skip auto-deployment and have the user create the workflow in ComfyUI UI using these parameters:
- Server URL:
http://llama-cpp-service.llama-cpp.svc.cluster.local:8080 - Endpoint:
/v1/chat/completions - Model alias:
qwen3.5(or any alias from the preset)
This avoids fragile JSON that might not match the node’s actual schema.
Decision: Skip Task 3 and Task 4. The workflow JSON depends on the exact node interface which is better created in the UI. Document the connection URL instead.
Task 5: Commit and verify
Section titled “Task 5: Commit and verify”Step 1: Commit changes
git add my-apps/ai/llama-cpp/configmap.yaml my-apps/ai/llama-cpp/deployment.yamlgit commit -m "feat(llama-cpp): single GPU Qwen3.5 Q4_K_XL, free GPU for ComfyUI
- Drop from 2 GPUs to 1 (frees RTX 3090 for ComfyUI)- Remove Qwen3-Coder-Next model, use only Qwen3.5-35B-A3B- Switch Q6_K_XL → Q4_K_XL (20.6GB fits in single 24GB 3090)- Reduce context 131K → 16K (captioning/prompting use case)- Remove multi-GPU env vars and tensor-split- Reduce memory/CPU requests for single-model single-GPU"Step 2: Verify after ArgoCD sync
# Check both pods are running (each on 1 GPU)kubectl get pods -n llama-cppkubectl get pods -n comfyui
# Verify llama-cpp loaded modelkubectl logs -n llama-cpp -l app=llama-cpp-server --tail=50
# Verify GPU allocation (should show 1 GPU each)kubectl describe node <gpu-node> | grep -A5 "Allocated resources"
# Test vision APIkubectl run -it --rm curl --image=curlimages/curl --restart=Never -- \ curl http://llama-cpp-service.llama-cpp.svc.cluster.local:8080/healthStep 3: Configure ComfyUI llamacpp-client node
In ComfyUI UI:
- Add “LlamaCpp Client” node from AI/LlamaCpp category
- Set server URL:
http://llama-cpp-service.llama-cpp.svc.cluster.local:8080 - Connect a LoadImage node to its image input
- Set prompt: “Describe this image in detail for use as a Stable Diffusion prompt”
- Connect output to text display or directly to a prompt input