Commit eb58a69

Log RTX PRO 6000 GGUF attempts
1 parent 1726e96 commit eb58a69

3 files changed

Lines changed: 117 additions & 1 deletion

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+# 2026-04-25 RunPod 4x RTX PRO 6000 Q2 Attempts
+
+## Goal
+
+Try the `unsloth/Kimi-K2.6-GGUF:UD-Q2_K_XL` path on a non-AMD node that still meets the smallest practical aggregate VRAM target.
+
+## Why This Topology
+
+- Each `RTX PRO 6000 Blackwell Server Edition` has `96 GB` VRAM.
+- `4x` gives `384 GB` aggregate VRAM, which is above the `340 GB` GGUF artifact size and keeps the topology in the requested even-GPU pattern.
+- llama.cpp supports multi-GPU model sharding through `--split-mode` and `--tensor-split`, so GGUF is a valid multi-GPU path.
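
The sizing argument in the list above can be checked with a few lines of shell arithmetic. This is only a sketch: the function names are hypothetical, the figures are the nominal `96 GB`, `4x`, and `340 GB` values from the log, and an even per-GPU share is an approximation because `--split-mode layer` assigns whole layers.

```shell
# Hypothetical helper: aggregate VRAM minus model artifact size, in whole GB.
# Arguments: GPU count, per-GPU VRAM (GB), model artifact size (GB).
vram_headroom_gb() {
  local gpus="$1" per_gpu_gb="$2" model_gb="$3"
  echo $(( gpus * per_gpu_gb - model_gb ))
}

# Hypothetical helper: approximate weights per GPU under an even split.
# Arguments: GPU count, model artifact size (GB).
per_gpu_weights_gb() {
  local gpus="$1" model_gb="$2"
  echo $(( model_gb / gpus ))
}

vram_headroom_gb 4 96 340     # prints 44 (GB left for KV cache and buffers)
per_gpu_weights_gb 4 340      # prints 85 (GB of weights per 96 GB card)
```

With these nominal numbers the margin is roughly `11 GB` per card, which is consistent with the short `2048` context used for the attempts.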
+
+Sources:
+
+- https://github.com/ggml-org/llama.cpp
+- https://github.com/ggml-org/llama.cpp/discussions/6046
+- https://github.com/ggml-org/llama.cpp/discussions/11784
+
+## Launch Configuration
+
+| Field | Value |
+| --- | --- |
+| Cloud | `COMMUNITY` |
+| Image | `runpod/pytorch:1.0.3-cu1281-torch291-ubuntu2404` |
+| Model | `unsloth/Kimi-K2.6-GGUF:UD-Q2_K_XL` |
+| Context length | `2048` |
+| GPU layers | `999` |
+| Split mode | `layer` |
+| Tensor split | `1,1,1,1` |
+| Volume | `500 GB` |
+| Port | `30000/http` |
+
+The launcher embedded `scripts/serve/start_llamacpp_kimi_k26_gguf_cuda.sh` directly into the pod startup command.
+
+## Attempt 1
+
+| Field | Value |
+| --- | --- |
+| Pod name | `kimi-k26-gguf-q2-rtxpro6000x4-20260425-1` |
+| Pod ID | `eleak5xoojla2a` |
+| Cost | `$6.76/hr` |
+| Machine ID | `9ti6j8484pn1` |
+
+Observed behavior:
+
+- Pod allocated
+- `desiredStatus: RUNNING`
+- `uptimeSeconds: 0`
+- `publicIp: null`
+- SSH remained `pod not ready`
+
+## Attempt 2
+
+| Field | Value |
+| --- | --- |
+| Pod name | `kimi-k26-gguf-q2-rtxpro6000x4-20260425-2` |
+| Pod ID | `qm93vevzo0cz1j` |
+| Cost | `$6.76/hr` |
+| Machine ID | `9ti6j8484pn1` |
+
+Observed behavior:
+
+- Pod allocated again onto the same machine ID as attempt 1
+- RunPod later exposed SSH metadata: `107.150.186.62:13340`
+- Direct SSH attempts still returned `Connection refused`
+- `uptimeSeconds` remained `0`
+- No benchmark was possible because the runtime never became reachable
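
The readiness signals recorded for both attempts can be folded into one small check. The sketch below is illustrative only: `pod_state` is a hypothetical helper, it takes the `desiredStatus`, `uptimeSeconds`, and `publicIp` values as plain arguments (the log does not say which API call or CLI produced them), and treating the attempt 2 address as the public IP is an assumption.

```shell
# Hypothetical classifier for the pod metadata fields quoted in this log.
# Arguments: desiredStatus, uptimeSeconds, publicIp ("null" or empty if unset).
pod_state() {
  local desired="$1" uptime="$2" public_ip="$3"
  if [ "$desired" != "RUNNING" ]; then
    echo "not-requested"
  elif [ "$uptime" -gt 0 ] && [ -n "$public_ip" ] && [ "$public_ip" != "null" ]; then
    echo "ready"
  else
    # Scheduled onto a machine, but the runtime has not started counting uptime.
    echo "allocated-not-started"
  fi
}

pod_state RUNNING 0 null             # attempt 1: allocated-not-started
pod_state RUNNING 0 107.150.186.62   # attempt 2: allocated-not-started
```

Requiring a non-zero uptime as well as an address is the point of the check: attempt 2 had an address but still never started.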
+
+## Interpretation
+
+This is another provider readiness failure. The second attempt is especially useful because it shows that even after RunPod exposed SSH metadata, the host still was not accepting TCP connections. `Connection refused` is returned at the TCP level, before any SSH authentication takes place.
+
+Because `uptimeSeconds` stayed at `0` in both attempts, the runtime never came up and the embedded launch script had no chance to run. That means:
+
+- the issue is not the GGUF format
+- the issue is not the multi-GPU split configuration
+- the issue is not SSH key auth
+- the issue is the allocated RunPod host failing to transition into a live runtime
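
One way to separate a TCP-level refusal from an SSH authentication problem is to probe the port without involving SSH at all. The sketch below assumes `bash` (with `/dev/tcp` support) and coreutils `timeout` are available on the client; `tcp_port_open` is a hypothetical helper, not something from this repository.

```shell
# Hypothetical probe: succeeds only if a TCP connection to host:port is accepted.
# Arguments: host, port, optional timeout in seconds (default 5).
# Returns non-zero when the connection is refused or the timeout expires.
tcp_port_open() {
  local host="$1" port="$2" timeout_s="${3:-5}"
  # bash's /dev/tcp pseudo-device opens a plain TCP socket; no SSH is involved.
  timeout "$timeout_s" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Example usage (commented out so the sketch makes no network calls by itself):
# if tcp_port_open 107.150.186.62 13340 5; then
#   echo "port accepts connections; any remaining failure is above TCP"
# else
#   echo "refused or timed out; the host is not listening yet"
# fi
```

If the probe fails, key-based auth cannot be the cause, because the SSH handshake never begins.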
+
+## Cleanup
+
+Both pods were deleted after the failed readiness windows.

experiments/002-kimi-k26-gguf-q2/README.md

Lines changed: 30 additions & 0 deletions
@@ -60,6 +60,36 @@ Outcome:
 
 This failure mode points to RunPod host readiness, not model fit. The deployment never reached the point where `llama-server` could start downloading or loading the GGUF.

+## 2026-04-25 4x RTX PRO 6000 Attempts
+
+Two `4x RTX PRO 6000 Blackwell Server Edition` attempts were made on 2026-04-25 for the same `UD-Q2_K_XL` GGUF using llama.cpp CUDA with explicit multi-GPU sharding enabled.
+
+Configuration:
+
+| Field | Value |
+| --- | --- |
+| Cloud | Community |
+| GPU | `NVIDIA RTX PRO 6000 Blackwell Server Edition` |
+| GPU count | `4` |
+| Aggregate VRAM | `384 GB` |
+| Image | `runpod/pytorch:1.0.3-cu1281-torch291-ubuntu2404` |
+| Model | `unsloth/Kimi-K2.6-GGUF:UD-Q2_K_XL` |
+| Context | `2048` |
+| Split mode | `layer` |
+| Tensor split | `1,1,1,1` |
+| Port | `30000/http` |
+
+Outcome:
+
+- First pod: `eleak5xoojla2a`
+- Second pod: `qm93vevzo0cz1j`
+- Both landed on the same machine ID: `9ti6j8484pn1`
+- First attempt never exposed reachable SSH
+- Second attempt later exposed SSH metadata (`107.150.186.62:13340`) but direct SSH still returned `Connection refused`
+- In both cases `uptimeSeconds` remained `0`, so the runtime never transitioned into a usable state
+
+These attempts show that RunPod's `4x RTX PRO 6000` allocation also returned a non-ready host for this workflow, both times on the same machine ID.
+
 ## Launch Runbook

 Capacity probe with a disposable pod volume:

scripts/serve/start_llamacpp_kimi_k26_gguf_cuda.sh

Lines changed: 5 additions & 1 deletion
@@ -7,6 +7,8 @@ PORT="${PORT:-30000}"
 CONTEXT_LENGTH="${CONTEXT_LENGTH:-4096}"
 GPU_LAYERS="${GPU_LAYERS:-999}"
 PARALLEL="${PARALLEL:-1}"
+SPLIT_MODE="${SPLIT_MODE:-layer}"
+TENSOR_SPLIT="${TENSOR_SPLIT:-1,1,1,1}"
 LLAMA_CPP_DIR="${LLAMA_CPP_DIR:-/workspace/src/llama.cpp}"
 LLAMA_CACHE="${LLAMA_CACHE:-/workspace/llama-cache}"
 CUDA_ARCHITECTURES="${CUDA_ARCHITECTURES:-80;86;90;100;120}"
@@ -46,4 +48,6 @@ exec ./build/bin/llama-server \
   -c "$CONTEXT_LENGTH" \
   -ngl "$GPU_LAYERS" \
   --jinja \
-  --parallel "$PARALLEL"
+  --parallel "$PARALLEL" \
+  --split-mode "$SPLIT_MODE" \
+  --tensor-split "$TENSOR_SPLIT"
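
The new `TENSOR_SPLIT` default of `1,1,1,1` matches the 4-GPU topology in this commit. If the script were reused on a node with a different GPU count, the string could be derived instead of hard-coded. The following is a sketch under that assumption and not part of the commit; `even_tensor_split` and `GPU_COUNT` are hypothetical names.

```shell
# Hypothetical helper: build an even split string ("1,1,...") for N GPUs.
# Argument: GPU count (positive integer).
even_tensor_split() {
  local n="$1" out="" i=0
  while [ "$i" -lt "$n" ]; do
    # Append ",1" after the first entry, or just "1" for the first entry.
    out="${out:+$out,}1"
    i=$(( i + 1 ))
  done
  echo "$out"
}

# Example wiring (commented out; assumes nvidia-smi is on PATH in the pod):
# GPU_COUNT="$(nvidia-smi --list-gpus | wc -l)"
# TENSOR_SPLIT="${TENSOR_SPLIT:-$(even_tensor_split "$GPU_COUNT")}"

even_tensor_split 4   # prints 1,1,1,1
```

An explicit `TENSOR_SPLIT` in the environment would still take precedence, which preserves the override behavior the script already uses for its other settings.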
