Commit 88dab56

[Klaud Cold] Add KLAUD_DEBUG.md operational playbook + AGENTS.md reference (#1464)
* Add KLAUD_DEBUG.md operational playbook + link from AGENTS.md

Captures recurring failure modes and fixes accumulated across many Klaud-Cold image-bump PR debugging sessions:

1. perf-changelog deletion-not-allowed (stale rebase)
2. bench-client LlamaTokenizer crash on sglang v0.5.12 images
3. vLLM v0.20.x / v0.21.x CUDA-graph profiler OOM
4. DSV4 custom-image -> generic v0.5.12 weight footprint OOM
5. sglang v0.5.12 B300 regressions (DeepGemm TMA, trtllm bs=128, flash_attn sm_120)
6. Cluster infra (drained nodes, docker-perm nodes, /nvme_home disk-full, port collision, drain watchdog pattern)
7. Docker tag gotchas (no clean release for mi355x sglang-rocm)
8. CI rerun mechanics (gh run rerun --failed only on completed)
9. gh CLI gotchas (silent gh pr edit failures, rollup truncation, CR/LF in --jq output)
10. PR conventions ([Klaud Cold] prefix, full-sweep-enabled label)
11. Useful slash commands

AGENTS.md now points new agents to KLAUD_DEBUG.md near the top so they don't re-learn the playbook from logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Drop tokenizer-crash section from KLAUD_DEBUG.md

Section 2 was overly specific to a single transient transformers/vllm mismatch and won't recur on the same path; the rest of the playbook covers patterns that are still actively useful. Renumber remaining sections and update the AGENTS.md pointer accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 3542221 commit 88dab56

2 files changed

Lines changed: 211 additions & 0 deletions

AGENTS.md

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,8 @@

Guidance for AI agents working with InferenceX.

> **Before debugging a failing Klaud-Cold / claude/* image-bump PR, read [`KLAUD_DEBUG.md`](KLAUD_DEBUG.md).** It captures recurring failure modes (vLLM CUDA-graph OOM, B300 sglang regressions, cluster docker/perms/disk issues), the exact workarounds, and gh-CLI gotchas — most cron-PR failures are already cataloged there.

## Project Overview

InferenceX is an open-source automated benchmarking system that tracks LLM inference performance across hardware (NVIDIA B200/H100/H200/GB200, AMD MI300X/MI325X/MI355X) and software stacks (vLLM, SGLang, TensorRT-LLM, ATOM). Results published to https://inferencex.com/.

KLAUD_DEBUG.md

Lines changed: 209 additions & 0 deletions
@@ -0,0 +1,209 @@
# KLAUD_DEBUG.md — Operational Knowledge for Recipe-Bump PRs

A running playbook of failures the Klaud-Cold image-bump cron has hit, the diagnoses, and the fixes/workarounds applied. **Read this first** before debugging a new failing `claude/*` PR — most failure modes here recur.

When you fix something not yet listed, add it here so the next session doesn't re-learn it.

---

## 1. PR setup-stage failures

### 1.1 `perf-changelog.yaml`: deletion-not-allowed
**Symptom:** the `setup` job fails before any sweep runs with
```
ValueError: Deletions are not allowed in /home/runner/work/InferenceX/InferenceX/perf-changelog.yaml.
Only additions to the changelog are permitted. Found deleted line: ...
```
**Root cause:** Cron-PR branches go stale; when main merges new changelog entries, the PR's local snapshot of `perf-changelog.yaml` no longer covers them, so the validator sees the missing lines as deletions. A naive rebase can also strip trailing whitespace from unrelated entries, with the same effect (e.g. `pr-link: ...1311 ` becomes `pr-link: ...1311`).

**Fix (canonical):**
```bash
# In the PR's worktree, after `git merge origin/main` conflicts on perf-changelog.yaml:
git checkout origin/main -- perf-changelog.yaml   # take main's bytes verbatim
cat >> perf-changelog.yaml <<EOF                  # then append THIS PR's entry at tail

- config-keys:
    - <recipe-key>
  description:
    - "<one-line summary>"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/<N>
EOF
python3 -c "import yaml; yaml.safe_load(open('perf-changelog.yaml'))"
```

Do **not** try a 3-way merge of `perf-changelog.yaml` — whitespace edits will silently re-trigger the deletion check.
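
The append-only rule can be approximated locally before pushing. The sketch below is an illustration, not the repository's validator: `first_deleted_line` is a hypothetical helper, and it assumes a stricter rule than the real check may apply, namely that main's file must be a byte-identical line prefix of the PR's file.

```python
from typing import Optional, Tuple


def first_deleted_line(base: str, head: str) -> Optional[Tuple[int, str]]:
    """Return (1-based line number, text) of the first line of `base` that
    `head` does not reproduce verbatim at the same position, or None.

    Hypothetical helper: assumes `base` (main) must be an exact line prefix
    of `head` (the PR), which may be stricter than the real validator.
    """
    base_lines = base.splitlines()
    head_lines = head.splitlines()
    for index, line in enumerate(base_lines):
        # Exact comparison: a stripped trailing space counts as a deletion.
        if index >= len(head_lines) or head_lines[index] != line:
            return index + 1, line
    return None


if __name__ == "__main__":
    main_text = "pr-link: https://example.invalid/pull/1 \n"  # note trailing space
    pr_text = "pr-link: https://example.invalid/pull/1\nnew entry\n"
    print(first_deleted_line(main_text, pr_text))
```

One way to feed it is with the output of `git show origin/main:perf-changelog.yaml` as `base` and the working-tree file as `head`.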

---

## 2. vLLM v0.21.x / v0.20.x: GPU OOM at model-load

**Symptom:** vLLM workers OOM during weight loading or right after warmup:
- `HSA_STATUS_ERROR_OUT_OF_RESOURCES: Available Free mem : 0 MB` (AMD)
- `torch.OutOfMemoryError: CUDA out of memory. ... GPU N has X GiB of which Y MiB is free` (NVIDIA)
- vLLM may also log `_check_enough_kv_cache_memory` failing with **negative** available bytes (e.g. `-25.24 GiB`).

**Root cause:** v0.21.0 (and v0.20.2+) enabled an aggressive CUDA-graph memory profiler that pre-reserves a large chunk of VRAM up front (~30% on B200), shrinking effective `--gpu-memory-utilization` well below what the flag says. Old SHA-pinned custom images had a smaller footprint, so the recipe's existing `0.95` setting now starves the KV cache.

**Fix:** in `benchmarks/single_node/<recipe>.sh`, either:
1. **Lower `--gpu-memory-utilization`** (`0.95 → 0.90`, sometimes 0.85). Matches the H100/H200/B200 NVIDIA pattern. Smallest blast radius.
2. **Disable the profiler entirely** for cases where lowering isn't enough: `export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0` before `vllm serve`. Matches `benchmarks/single_node/agentic/kimik2.5_fp4_b200.sh:65`.

Seen on: #1395 (kimik2.5-fp4-b200-vllm — needed env var), #1403 (gptoss-fp4-mi300x-vllm — needed 0.90), #1461 (dsv4-fp8-h200-vllm — needed 0.90).
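
Option 1 is a mechanical edit, so it can be scripted. The following is a rough sketch under one assumption: the recipe passes the flag literally as `--gpu-memory-utilization 0.95` or `--gpu-memory-utilization=0.95`. Recipes that route the value through a shell variable would need a different approach. `lower_gpu_memory_utilization` is a hypothetical name, not an existing repo utility.

```python
import re

# Assumption: the value follows the flag directly, separated by a space or "=".
_FLAG = re.compile(r"(--gpu-memory-utilization[ =])(\d*\.\d+)")


def lower_gpu_memory_utilization(script_text: str, new_value: str = "0.90") -> str:
    """Return script_text with any utilization above new_value lowered to it."""

    def _swap(match):
        current = float(match.group(2))
        if current <= float(new_value):
            return match.group(0)  # already at or below the target; keep as-is
        return match.group(1) + new_value

    return _FLAG.sub(_swap, script_text)


if __name__ == "__main__":
    before = "vllm serve model --gpu-memory-utilization 0.95"
    print(lower_gpu_memory_utilization(before))
```

Values already at or below the target are left alone, so re-running the helper on a recipe that was previously lowered to 0.85 does not raise it back to 0.90.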

---

## 3. Custom DSV4 image → generic v0.5.12 OOMs

**Symptom:** DSV4 recipes work on their SHA-pinned `lmsysorg/sglang:deepseek-v4-hopper@sha256:...` (or `deepseek-v4-b300`, `deepseek-v4-blackwell`) custom builds, but OOM on weights load when bumped to the generic `v0.5.12-cu130` release tag. Example: DSV4-Pro FP8+MTP weights consume ~125.43 GB / 141 GB per H200, leaving `-4.05 GB` for KV cache.

**Root cause:** The custom DSV4 images use a different weight layout / EAGLE draft handling that fits in less memory than the generic release. The release tag isn't a drop-in replacement.

**Fix:** keep DSV4 recipes pinned to their custom SHA-pinned image until upstream sglang gains the same DSV4-specific weight handling. Bumping to the generic tag is currently NOT viable.

Seen on: #1460 (dsv4-fp8-h200-sglang+mtp).

---

## 4. Upstream sglang v0.5.12 B300 regressions

Two distinct upstream regressions on NVIDIA B300 (Blackwell, `sm_120`) shipped in `lmsysorg/sglang:v0.5.12-cu130`:

### 4a. DeepGemm TMA-descriptor crash (GLM-5-FP8)
**Symptom:** CUDA graph capture aborts with `CUDA_ERROR_ILLEGAL_ADDRESS (700)` at `/deepgemm/csrc/.../runtime_utils.hpp:143` on the **first batch size** for **every TP rank**. Server never serves a prompt.

**Workarounds (any one):**
1. `--fp8-gemm-runner-backend cutlass` to bypass DeepGemm via CUTLASS.
2. `export SGL_ENABLE_JIT_DEEPGEMM=0` before `python -m sglang.launch_server` to skip JIT DeepGemm.
3. Pin recipe to `lmsysorg/sglang:v0.5.11-cu130`.

Filed upstream: sgl-project/sglang#25551. Seen on #1421.

### 4b. trtllm GEMM bug at bs=128 + MTP / EAGLE (GLM-5-NVFP4)
**Symptom:** EAGLE draft CUDA graph capture crashes immediately at the largest batch size with `RuntimeError ... trtllm_batched_gemm_runner.cu:276 ... numBatches=256, GemmMNK 128x1024x6144`. The target model captures fine; only the draft model crashes.

**Workarounds:**
1. Cap `--cuda-graph-max-bs` and `--max-running-requests` to 64 in the launch script to avoid the bs=128 trigger.
2. Comment out the MTP/EAGLE scenarios on B300 in the recipe.
3. Pin to v0.5.11-cu130.

Seen on #1420.

### 4c. flash_attn SM-arch assertion (qwen3.5-bf16)
**Symptom:** All 4 TP workers raise `AssertionError` on the first forward pass:
```
File "/opt/venv/.../sglang/srt/layers/attention/flashattention_backend.py:..."
assert sm_100 <= arch <= sm_110f
```
B300 is `sm_120`, outside the asserted range. Server never becomes healthy; warmup times out at 600s.

**Fix:** Needs sglang image with flash_attn supporting `sm_120` — no local workaround. Pin to v0.5.11-cu130 in the meantime.

Seen on #1422.
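
The three B300 symptoms have distinctive log strings, so a failed job log can be triaged by substring match. This is a minimal sketch, not an existing repo tool: `classify_b300_failure` and `B300_SIGNATURES` are hypothetical names, and the needles are copied from the symptom lines in 4a to 4c, so they only match logs that contain those exact fragments.

```python
from typing import Optional

# (section label, substrings that must ALL appear in the log).
# Needles are taken from the symptom text above; treat them as assumptions
# about what the raw logs contain.
B300_SIGNATURES = [
    ("4a", ("CUDA_ERROR_ILLEGAL_ADDRESS", "runtime_utils.hpp")),
    ("4b", ("trtllm_batched_gemm_runner.cu",)),
    ("4c", ("assert sm_100 <= arch",)),
]


def classify_b300_failure(log_text: str) -> Optional[str]:
    """Return the subsection label of the first matching signature, else None."""
    for label, needles in B300_SIGNATURES:
        if all(needle in log_text for needle in needles):
            return label
    return None


if __name__ == "__main__":
    sample = "RuntimeError ... trtllm_batched_gemm_runner.cu:276 ..."
    print(classify_b300_failure(sample))
```

A `None` result means the failure is not one of these three regressions and needs a manual read of the log.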

---

## 5. Cluster infrastructure (AMD MI355X / MI300X / MI325X)

### 5.1 `mia1-p01-g09 / g19 / g37` (amd-tw-mi355) — persistently drained
- **g09**: `pyxis is broken`
- **g19**: `Kill task failed (JobId=N StepId=N)`
- **g37**: `permission issues with GHA runner workflows : Not responding` (down since Mar 2026)

If a sweep job lands on any of these, it'll never start. Nothing to do at the recipe level — these stay drained until ops fixes them.

### 5.2 `mia1-p01-g11 / g12 / g31` — docker socket perms
**Symptom:** mi355x jobs fail with `permission denied while trying to connect to the docker API at unix:///var/run/docker.sock` during the `docker stop $(docker ps -a -q)` cleanup step, cascading into SLURM job expiration.
**Fix:** ops needs to fix docker group / socket perms on these nodes. Recipe-level workaround: none.

### 5.3 `chi-mi300x-049`: `/nvme_home` disk-full
**Symptom:** pyxis container extraction fails with `No space left on device` writing to `/nvme_home/gharunner/.local/share/enroot/pyxis_*/opt/rocm-*/...`. The `/nvme_home` partition is hosted under `/` on this node and has been chronically near-full.

**Fix already landed:** `runners/launch_mi300x-amds.sh` now pins salloc to only known-good mi300x nodes (`chi-mi300x-[034-036,054,057-058]`) — see PR #1462. `chi-mi300x-049` is held in `State=DOWN` by a watchdog on the controller (`/home/gharunner/_audit/drain_049_watchdog.sh`) that re-applies the drain every 10s if SLURM auto-clears it (which it does on dynamic-norm nodes).

### 5.4 `chi-mi325x-pod1-017` — orphaned port-8888 process
**Symptom:** sglang server bind fails with `[Errno 98] Address already in use` on port 8888. Held by an MLPerf accuracy run started outside SLURM.
**Fix:** SSH to controller, find the holder via `ss -tlnp | grep :8888`, `kill` the PID. If recurring, file with the team running MLPerf experiments.
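
A launch script could also detect the collision before starting the server rather than failing at bind time. The sketch below is one possible pre-flight check using only the standard library; `port_in_use` is a hypothetical helper, and probing `127.0.0.1` is an assumption about where the conflicting listener is bound.

```python
import errno
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding (host, port) fails with EADDRINUSE.

    Hypothetical pre-flight probe: the probe socket is closed again
    immediately, so it does not hold the port itself.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # e.g. permission errors are a different problem
    return False


if __name__ == "__main__":
    print("port 8888 busy:", port_in_use(8888))
```

This only reports that the port is taken; identifying the owning PID still needs `ss -tlnp` as described in the fix.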

### 5.5 Cluster controller layout
- **amd-vultr-mi300**: SLURM controller for 7 mi300x nodes (3 down, see 5.3).
- **amd-vultr-mi325**: SLURM controller for 6 mi325x nodes.
- **amd-tw-mi355**: jumpbox → ssh further to compute nodes (`mia1-p01-gNN`). 12 nodes (3 drained, see 5.1).
- `/home` is NFS-mounted across clusters from `chi-mi325x-pod1-001:/nfs/homes`, **root-writable**.
- `/tmp` and `/nvme_home` are per-node local; HF cache lives at node-local `/raid/hf-hub-cache/` (2.7T per mi300x node).
- Use `srun -w <FQDN>` (with the **full FQDN**, not the short hostname) from the controller to run admin commands on a compute node.

### 5.6 Drain watchdog pattern
SLURM auto-clears `State=DRAIN` on `DYNAMIC_NORM` nodes when they re-register. To keep a node out of the pool sticky-style, use `State=DOWN` AND start a watchdog:
```bash
# on the controller, as root
nohup bash -c '
  while true; do
    s=$(scontrol show node <FQDN> 2>/dev/null | grep -oE "State=[A-Z+_]+")
    if ! echo "$s" | grep -qE "DOWN|DRAIN"; then
      scontrol update NodeName=<FQDN> State=DOWN Reason="watchdog" >/dev/null 2>&1
    fi
    sleep 10
  done
' > /home/gharunner/_audit/drain_<node>_watchdog.log 2>&1 &
```
Doesn't survive controller reboots — for permanent removal a SLURM admin should edit `slurm.conf`.
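
The decision the watchdog makes on each iteration can be isolated and checked without touching SLURM. This is a sketch of that one step only, mirroring the two `grep` calls in the loop; `needs_redrain` is a hypothetical name and the sample strings are illustrative, not captured `scontrol` output.

```python
import re

# Same pattern the shell loop extracts with `grep -oE`.
_STATE = re.compile(r"State=[A-Z+_]+")


def needs_redrain(scontrol_output: str) -> bool:
    """Return True when the node state contains neither DOWN nor DRAIN.

    Like the shell version, empty or unparseable output counts as
    "not drained", so the watchdog would re-apply State=DOWN.
    """
    match = _STATE.search(scontrol_output)
    state = match.group(0) if match else ""
    return re.search(r"DOWN|DRAIN", state) is None


if __name__ == "__main__":
    print(needs_redrain("NodeName=example State=IDLE+DYNAMIC_NORM"))
```

Keeping the check separate from the `scontrol update` call makes it easy to confirm which state strings would trigger a re-drain before leaving the loop running unattended.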

---

## 6. Docker image tag gotchas

**Don't invent a "release" tag pattern from a date-suffixed nightly.** `lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x` does **not** exist — only the dated `v0.5.12-rocm720-mi35x-20260517` does. All MI355X `sglang-rocm:rocm720` tags follow the dated-nightly pattern.

Before bumping an image, verify the target tag exists:
```bash
curl -sI "https://hub.docker.com/v2/repositories/lmsysorg/sglang-rocm/tags/v0.5.12-rocm720-mi35x"
# 200 → exists; 404 → doesn't
```

Or check whether any other recipe on main uses the proposed tag — if zero uses, suspect.
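
The same existence check can be scripted for use inside a bump tool. The sketch below uses only the standard library and the Docker Hub URL shape from the `curl` example; `tag_url` and `tag_exists` are hypothetical helpers, and the 200/404 behaviour is assumed to match what the `curl` comment describes.

```python
import urllib.error
import urllib.request

# URL shape copied from the curl example above.
HUB_TAG_URL = "https://hub.docker.com/v2/repositories/{repo}/tags/{tag}"


def tag_url(repo: str, tag: str) -> str:
    """Build the Docker Hub tag-lookup URL for repo (e.g. 'org/name') and tag."""
    return HUB_TAG_URL.format(repo=repo, tag=tag)


def tag_exists(repo: str, tag: str, timeout: float = 10.0) -> bool:
    """Return True on HTTP 200, False on 404; re-raise anything else."""
    request = urllib.request.Request(tag_url(repo, tag), method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as exc:
        if exc.code == 404:
            return False
        raise  # rate limits or outages must not be read as "tag missing"


if __name__ == "__main__":
    print(tag_url("lmsysorg/sglang-rocm", "v0.5.12-rocm720-mi35x"))
```

Re-raising on anything other than 404 is deliberate: a throttled or failing registry should stop the bump rather than be mistaken for a missing tag.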

---

## 7. CI: rerun mechanics

- `gh run rerun <id> --failed` only works when the workflow run is **completed** with `conclusion=failure`. If the run is still `queued`/`in_progress`, the call returns "cannot be rerun".
- To abandon an in-flight run and start fresh, push an **empty commit** to the PR branch:
  ```bash
  git commit --allow-empty -m "Re-trigger sweep"
  git push
  ```
  The old run will be auto-cancelled by `workflow/cancel-sweep-on-merge` (provided the head SHA changed).
- For a `cancelled` run (not `failure`), use `gh run rerun <id>` without `--failed` to re-run everything.
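
The three rules above amount to a small decision table. As a sketch, assuming the run's `status` and `conclusion` have already been fetched and lowercased, a hypothetical `rerun_command` helper could pick the right action:

```python
def rerun_command(run_id: str, status: str, conclusion: str) -> str:
    """Map a workflow run's state to the command this section recommends.

    Hypothetical helper; raises ValueError for states the section does not
    cover (for example a run that completed successfully).
    """
    if status != "completed":
        # queued / in_progress: rerun is rejected, so push an empty commit.
        return 'git commit --allow-empty -m "Re-trigger sweep" && git push'
    if conclusion == "failure":
        return f"gh run rerun {run_id} --failed"
    if conclusion == "cancelled":
        return f"gh run rerun {run_id}"
    raise ValueError(f"no rerun rule for conclusion={conclusion!r}")


if __name__ == "__main__":
    print(rerun_command("123", "completed", "failure"))
```

The helper only returns the command string; running it is left to the caller so the choice can be reviewed first.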

---

## 8. gh CLI gotchas

- **`gh pr edit` silently aborts** on a Projects-classic deprecation GraphQL error. Title/body updates won't apply. Use `gh api -X PATCH "repos/<org>/<repo>/pulls/<N>" -f title="..." -F body=@file.md` instead.
- Same issue for adding labels — use `gh api -X POST "repos/<org>/<repo>/issues/<N>/labels" -f "labels[]=<name>"`.
- `gh pr view ... --jq .headRefName` output can have a trailing `\r`. Strip it: `gh pr view <N> --json headRefName --jq .headRefName | tr -d '\r\n'`. Otherwise shell concatenation produces `branchunners/launch_mi300x-amds.sh`-style corruption.
- `gh pr list --json statusCheckRollup` **truncates** each PR's rollup — never trust it for per-check filters. Re-query each PR individually with `gh pr view <N> --json statusCheckRollup`.
- In `gh` and GitHub Actions API output, `conclusion` is `""` (empty string, not `null`) for in-flight checks, so `jq`'s `// .status` fallback doesn't trigger. Use:
  ```jq
  def state: if (.conclusion // "") != "" then .conclusion else .status end;
  ```
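
When the rollup JSON is post-processed in Python instead of `jq`, the last two gotchas translate directly. A minimal sketch, with `check_state` and `clean_ref` as hypothetical helper names and the dictionary keys taken from the `jq` filter above:

```python
def check_state(check: dict) -> str:
    """Python counterpart of the jq `state` helper.

    Falls back to `status` when `conclusion` is missing, None, or the empty
    string. Returns "" if both are absent (jq would yield null there).
    """
    return check.get("conclusion") or check.get("status") or ""


def clean_ref(raw: str) -> str:
    """Drop every CR and LF, like `tr -d '\\r\\n'`."""
    return raw.replace("\r", "").replace("\n", "")


if __name__ == "__main__":
    print(check_state({"conclusion": "", "status": "IN_PROGRESS"}))
    print(clean_ref("claude/example-branch\r\n"))
```

Python's `or` treats `""` and `None` alike, which is exactly the behaviour the plain `jq` `//` operator lacks.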

---

## 9. PR conventions for this repo

- Image-bump / new-recipe PRs I open on behalf of the user (or that the user creates) get the **`[Klaud Cold]`** title prefix.
- Add the `full-sweep-enabled` label so a full sweep actually runs (`gh api -X POST ... labels[]=full-sweep-enabled`). Without it, the sweep is mostly SKIPPED.
- After any code change that shifts a PR's scope (drops a recipe, changes an image tag), **update the PR title AND body in the same step** and **verify** with `gh pr view <N> --json title,body`, because `gh pr edit` can fail silently (see §8).
- `utils/merge_with_reuse.sh <N>` is the merge entrypoint; it handles the `perf-changelog.yaml` auto-append.

---

## 10. Useful slash commands (defined in `.claude/commands/`)

- `/find-mergeable-claude-prs` — lists `claude/*` PRs whose full sweep finished all-green.
- `/list-claude-pr-status` — lists READY/RUNNING (and optionally FAILED) state per `claude/*` PR.
- `/fix-klaud-cron-prs` — diagnoses failing `claude/*` PRs by reading their failed job logs.
- `/merge-prs <N> [<N>...]` — sequential merge via `utils/merge_with_reuse.sh`.

Each command file is self-contained; read them to understand the exact jq filters they use.
