Log RTX PRO 6000 GGUF attempts

cderinbogaz · cderinbogaz · commit eb58a699ff5c · 2026-04-25T19:24:23.000+02:00
diff --git a/experiments/002-kimi-k26-gguf-q2/2026-04-25-runpod-rtxpro6000x4-q2-attempts.md b/experiments/002-kimi-k26-gguf-q2/2026-04-25-runpod-rtxpro6000x4-q2-attempts.md
@@ -0,0 +1,82 @@
+# 2026-04-25 RunPod 4x RTX PRO 6000 Q2 Attempts
+
+## Goal
+
+Try the `unsloth/Kimi-K2.6-GGUF:UD-Q2_K_XL` path on a non-AMD node that still meets the smallest practical aggregate VRAM target.
+
+## Why This Topology
+
+- Each `RTX PRO 6000 Blackwell Server Edition` has `96 GB` VRAM.
+- `4x` gives `384 GB` aggregate VRAM, which exceeds the `340 GB` GGUF artifact size by `44 GB` and keeps the topology in the requested even-GPU pattern.
+- llama.cpp supports multi-GPU model sharding through `--split-mode` and `--tensor-split`, so GGUF is a valid multi-GPU path.
+
+Sources:
+
+- https://github.com/ggml-org/llama.cpp
+- https://github.com/ggml-org/llama.cpp/discussions/6046
+- https://github.com/ggml-org/llama.cpp/discussions/11784
+
+## Launch Configuration
+
+| Field | Value |
+| --- | --- |
+| Cloud | `COMMUNITY` |
+| Image | `runpod/pytorch:1.0.3-cu1281-torch291-ubuntu2404` |
+| Model | `unsloth/Kimi-K2.6-GGUF:UD-Q2_K_XL` |
+| Context length | `2048` |
+| GPU layers | `999` |
+| Split mode | `layer` |
+| Tensor split | `1,1,1,1` |
+| Volume | `500 GB` |
+| Port | `30000/http` |
+
+The launcher embedded `scripts/serve/start_llamacpp_kimi_k26_gguf_cuda.sh` directly into the pod startup command.
+
+## Attempt 1
+
+| Field | Value |
+| --- | --- |
+| Pod name | `kimi-k26-gguf-q2-rtxpro6000x4-20260425-1` |
+| Pod ID | `eleak5xoojla2a` |
+| Cost | `$6.76/hr` |
+| Machine ID | `9ti6j8484pn1` |
+
+Observed behavior:
+
+- Pod allocated
+- `desiredStatus: RUNNING`
+- `uptimeSeconds: 0`
+- `publicIp: null`
+- SSH kept reporting `pod not ready`
+
+## Attempt 2
+
+| Field | Value |
+| --- | --- |
+| Pod name | `kimi-k26-gguf-q2-rtxpro6000x4-20260425-2` |
+| Pod ID | `qm93vevzo0cz1j` |
+| Cost | `$6.76/hr` |
+| Machine ID | `9ti6j8484pn1` |
+
+Observed behavior:
+
+- Pod allocated again onto the same machine ID as attempt 1
+- RunPod later exposed SSH metadata: `107.150.186.62:13340`
+- Direct SSH attempts still returned `Connection refused`
+- `uptimeSeconds` remained `0`
+- No benchmark was possible because the runtime never became reachable
+
+## Interpretation
+
+This is another provider readiness failure. The second attempt is especially useful because it shows that even after RunPod exposed SSH metadata, the host was still not accepting TCP connections.
+
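+`Connection refused` is a TCP-level rejection that happens before any SSH handshake, and `uptimeSeconds` staying at `0` shows the container never started. That means: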
+
+- the issue is not the GGUF format
+- the issue is not the multi-GPU split configuration
+- the issue is not SSH key auth
+- the issue is the allocated RunPod host failing to transition into a live runtime
+
+## Cleanup
+
+Both pods were deleted after the failed readiness windows.
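
The interpretation in the log above rests on two observations per pod: the reported `uptimeSeconds` and whether the advertised SSH endpoint accepted a TCP connection. A minimal sketch of that decision logic follows. The function name `classify_pod` and its three labels are hypothetical and are not part of this repository or of any RunPod tooling; the caller is assumed to have gathered both observations already.

```shell
# Hypothetical helper: classify a pod from two observations.
#   $1: uptimeSeconds as reported by the provider
#   $2: "open" if a TCP connect to the advertised SSH endpoint succeeded,
#       anything else (for example "refused") if it did not
classify_pod() {
  uptime_seconds="$1"
  tcp_state="$2"
  if [ "$tcp_state" = "open" ]; then
    # Something is listening, so auth and model loading can be tested next.
    echo "reachable"
  elif [ "$uptime_seconds" -eq 0 ]; then
    # The container never started, so neither sshd nor llama-server ran.
    echo "host-not-ready"
  else
    # The runtime is up but the SSH endpoint is not accepting connections.
    echo "runtime-up-ssh-down"
  fi
}

# Both attempts in this log match this case.
classify_pod 0 refused
```

Under these assumptions, only the `reachable` case would justify looking at SSH keys, GGUF loading, or the split configuration.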
diff --git a/experiments/002-kimi-k26-gguf-q2/README.md b/experiments/002-kimi-k26-gguf-q2/README.md
@@ -60,6 +60,36 @@ Outcome:
 
 This failure mode points to RunPod host readiness, not model fit. The deployment never reached the point where `llama-server` could start downloading or loading the GGUF.
 
+## 2026-04-25 4x RTX PRO 6000 Attempts
+
+Two `4x RTX PRO 6000 Blackwell Server Edition` attempts were made on 2026-04-25 for the same `UD-Q2_K_XL` GGUF using llama.cpp CUDA with explicit multi-GPU sharding enabled.
+
+Configuration:
+
+| Field | Value |
+| --- | --- |
+| Cloud | Community |
+| GPU | `NVIDIA RTX PRO 6000 Blackwell Server Edition` |
+| GPU count | `4` |
+| Aggregate VRAM | `384 GB` |
+| Image | `runpod/pytorch:1.0.3-cu1281-torch291-ubuntu2404` |
+| Model | `unsloth/Kimi-K2.6-GGUF:UD-Q2_K_XL` |
+| Context | `2048` |
+| Split mode | `layer` |
+| Tensor split | `1,1,1,1` |
+| Port | `30000/http` |
+
+Outcome:
+
+- First pod: `eleak5xoojla2a`
+- Second pod: `qm93vevzo0cz1j`
+- Both landed on the same machine ID: `9ti6j8484pn1`
+- First attempt never exposed reachable SSH
+- Second attempt later exposed SSH metadata (`107.150.186.62:13340`) but direct SSH still returned `Connection refused`
+- In both cases `uptimeSeconds` remained `0`, so the runtime never transitioned into a usable state
+
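+Both attempts ran on the same machine (`9ti6j8484pn1`), so they show that this host was not ready and suggest, without proving it for other hosts, that the current RunPod `4x RTX PRO 6000` allocator is also returning non-ready hosts for this workflow.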
+
 ## Launch Runbook
 
 Capacity probe with a disposable pod volume:
diff --git a/scripts/serve/start_llamacpp_kimi_k26_gguf_cuda.sh b/scripts/serve/start_llamacpp_kimi_k26_gguf_cuda.sh
@@ -7,6 +7,8 @@ PORT="${PORT:-30000}"
 CONTEXT_LENGTH="${CONTEXT_LENGTH:-4096}"
 GPU_LAYERS="${GPU_LAYERS:-999}"
 PARALLEL="${PARALLEL:-1}"
+SPLIT_MODE="${SPLIT_MODE:-layer}"
+TENSOR_SPLIT="${TENSOR_SPLIT:-1,1,1,1}"
 LLAMA_CPP_DIR="${LLAMA_CPP_DIR:-/workspace/src/llama.cpp}"
 LLAMA_CACHE="${LLAMA_CACHE:-/workspace/llama-cache}"
 CUDA_ARCHITECTURES="${CUDA_ARCHITECTURES:-80;86;90;100;120}"
@@ -46,4 +48,6 @@ exec ./build/bin/llama-server \
   -c "$CONTEXT_LENGTH" \
   -ngl "$GPU_LAYERS" \
   --jinja \
-  --parallel "$PARALLEL"
+  --parallel "$PARALLEL" \
+  --split-mode "$SPLIT_MODE" \
+  --tensor-split "$TENSOR_SPLIT"
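
The new `TENSOR_SPLIT` default of `1,1,1,1` assumes a four-GPU node. As a sketch only, an even split could instead be derived from the GPU count so the same script serves other topologies. The `even_tensor_split` function and the `GPU_COUNT` variable are hypothetical and do not appear in this commit; the default of `4` simply mirrors the topology tested here.

```shell
# Hypothetical helper: print an even --tensor-split string for N GPUs,
# e.g. 4 -> 1,1,1,1
even_tensor_split() {
  n="$1"
  out=""
  i=0
  while [ "$i" -lt "$n" ]; do
    if [ -z "$out" ]; then
      out="1"
    else
      out="$out,1"
    fi
    i=$((i + 1))
  done
  printf '%s\n' "$out"
}

# GPU_COUNT is an assumed variable; it defaults to the 4x node used here.
GPU_COUNT="${GPU_COUNT:-4}"
# An explicitly exported TENSOR_SPLIT still takes precedence.
TENSOR_SPLIT="${TENSOR_SPLIT:-$(even_tensor_split "$GPU_COUNT")}"
echo "$TENSOR_SPLIT"
```

On a live node, `GPU_COUNT` could be set from `nvidia-smi -L | wc -l`, which lists one GPU per line. That step is left out of the block so the sketch runs without a GPU present.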