Merged
2 changes: 2 additions & 0 deletions AGENTS.md
@@ -2,6 +2,8 @@

Guidance for AI agents working with InferenceX.

> **Before debugging a failing Klaud-Cold / claude/* image-bump PR, read [`KLAUD_DEBUG.md`](KLAUD_DEBUG.md).** It captures recurring failure modes (tokenizer crash, vLLM CUDA-graph OOM, B300 sglang regressions, cluster docker/perms/disk issues), the exact workarounds, and gh-CLI gotchas; most cron-PR failures are already cataloged there.

## Project Overview

InferenceX is an open-source automated benchmarking system that tracks LLM inference performance across hardware (NVIDIA B200/H100/H200/GB200, AMD MI300X/MI325X/MI355X) and software stacks (vLLM, SGLang, TensorRT-LLM, ATOM). Results published to https://inferencex.com/.
230 changes: 230 additions & 0 deletions KLAUD_DEBUG.md
@@ -0,0 +1,230 @@
# KLAUD_DEBUG.md: Operational Knowledge for Recipe-Bump PRs

A running playbook of failures the Klaud-Cold image-bump cron has hit, the diagnoses, and the fixes/workarounds applied. **Read this first** before debugging a new failing claude/* PR; most failure modes here recur.

When you fix something not yet listed, add it here so the next session doesn't re-learn it.

---

## 1. PR setup-stage failures

### 1.1 `perf-changelog.yaml`: deletion-not-allowed
**Symptom:** the `setup` job fails before any sweep runs with
```
ValueError: Deletions are not allowed in /home/runner/work/InferenceX/InferenceX/perf-changelog.yaml.
Only additions to the changelog are permitted. Found deleted line: ...
```
**Root cause:** Cron-PR branches go stale; when main merges new changelog entries, the PR's local snapshot of `perf-changelog.yaml` no longer contains them, so the validator sees the missing lines as deletions. A naive rebase can also strip trailing whitespace from unrelated entries, which has the same effect (e.g. `pr-link: ...1311 ` → `pr-link: ...1311`).

**Fix (canonical):**
```bash
# In the PR's worktree, after `git merge origin/main` conflicts on perf-changelog.yaml:
git checkout origin/main -- perf-changelog.yaml # take main's bytes verbatim
cat >> perf-changelog.yaml <<EOF # then append THIS PR's entry at tail

- config-keys:
    - <recipe-key>
  description:
    - "<one-line summary>"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/<N>
EOF
python3 -c "import yaml; yaml.safe_load(open('perf-changelog.yaml'))"
```

Do **not** try a 3-way merge of `perf-changelog.yaml`: whitespace edits will silently re-trigger the deletion check.
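
To sanity-check a branch locally before pushing, the additions-only rule can be approximated in a few lines of Python. This is a minimal sketch, not the repo's validator: the function name `deleted_lines` is hypothetical, and it assumes the real check treats any line of main's file that no longer appears in place (including a line that differs only in trailing whitespace) as a deletion.

```python
import difflib


def deleted_lines(base_text: str, head_text: str) -> list[str]:
    """Return lines of base_text that are missing or altered in head_text.

    Hypothetical helper: it approximates an additions-only check, it is
    not the validator that runs in the setup job.
    """
    base = base_text.splitlines()
    head = head_text.splitlines()
    matcher = difflib.SequenceMatcher(a=base, b=head, autojunk=False)
    removed: list[str] = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" drops base lines; "replace" rewrites them. Both count,
        # so a line that only lost trailing whitespace is still reported.
        if tag in ("delete", "replace"):
            removed.extend(base[i1:i2])
    return removed
```

Feeding it the contents of `origin/main`'s `perf-changelog.yaml` and the branch's copy should return an empty list when the branch only appends entries.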

---

## 2. Bench-client tokenizer crash (sglang v0.5.12 images)

**Symptom:** the sglang server loads the model fine; the bench client crashes before sending any request with
```
AttributeError: LlamaTokenizer has no attribute all_special_tokens_extended
File "/opt/venv/.../vllm/transformers_utils/tokenizer.py:101" in get_cached_tokenizer
```

**Root cause:** `utils/bench_serving/benchmark_serving.py:48-51` prefers vLLM's `get_tokenizer`, which calls `get_cached_tokenizer`, which in turn probes `.all_special_tokens_extended`. The newer `transformers` library bundled inside sglang v0.5.12 images no longer exposes that attribute on `LlamaTokenizer`.

**Fix (in-PR workaround):** swap import preference so the local `backend_request_func.get_tokenizer` (HF AutoTokenizer-based, no `get_cached_tokenizer` probe) wins:
```python
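try:
    # Prefer the local HF AutoTokenizer-based helper.
    from backend_request_func import get_tokenizer
except ImportError:
    # Fall back to vLLM's helper only when the local one is unavailable.
    from vllm.transformers_utils.tokenizer import get_tokenizer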
```
Affects PRs touching sglang v0.5.12 images on any AMD platform. Apply per-PR (each PR's branch needs its own swap commit) until a global fix lands.

---

## 3. vLLM v0.21.x / v0.20.x: GPU OOM at model-load

**Symptom:** vLLM workers OOM during weight loading or right after warmup:
- `HSA_STATUS_ERROR_OUT_OF_RESOURCES: Available Free mem : 0 MB` (AMD)
- `torch.OutOfMemoryError: CUDA out of memory. ... GPU N has X GiB of which Y MiB is free` (NVIDIA)
- vLLM may also log `_check_enough_kv_cache_memory` failing with **negative** available bytes (e.g. `-25.24 GiB`).

**Root cause:** v0.21.0 (and v0.20.2+) enabled an aggressive CUDA-graph memory profiler that pre-reserves a large chunk of VRAM up front (~30% on B200), shrinking effective `--gpu-memory-utilization` well below what the flag says. Old SHA-pinned custom images had a smaller footprint, so the recipe's existing `0.95` setting now starves the KV cache.

**Fix:** in `benchmarks/single_node/<recipe>.sh`, either:
1. **Lower `--gpu-memory-utilization`** (`0.95 → 0.90`, sometimes 0.85). Matches the H100/H200/B200 NVIDIA pattern. Smallest blast radius.
2. **Disable the profiler entirely** for cases where lowering isn't enough: `export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0` before `vllm serve`. Matches `benchmarks/single_node/agentic/kimik2.5_fp4_b200.sh:65`.

Seen on: #1395 (kimik2.5-fp4-b200-vllm, needed env var), #1403 (gptoss-fp4-mi300x-vllm, needed 0.90), #1461 (dsv4-fp8-h200-vllm, needed 0.90).

---

## 4. Custom DSV4 image → generic v0.5.12 OOMs

**Symptom:** DSV4 recipes work on their SHA-pinned `lmsysorg/sglang:deepseek-v4-hopper@sha256:...` (or `deepseek-v4-b300`, `deepseek-v4-blackwell`) custom builds, but OOM on weights load when bumped to the generic `v0.5.12-cu130` release tag. Example: DSV4-Pro FP8+MTP weights consume ~125.43 GB / 141 GB per H200, leaving `-4.05 GB` for KV cache.

**Root cause:** The custom DSV4 images use a different weight layout / EAGLE draft handling that fits in less memory than the generic release. The release tag isn't a drop-in replacement.

**Fix:** keep DSV4 recipes pinned to their custom SHA-pinned image until upstream sglang gains the same DSV4-specific weight handling. Bumping to the generic tag is currently NOT viable.

Seen on: #1460 (dsv4-fp8-h200-sglang+mtp).

---

## 5. Upstream sglang v0.5.12 B300 regressions

Three distinct upstream regressions on NVIDIA B300 (Blackwell, `sm_120`) shipped in `lmsysorg/sglang:v0.5.12-cu130`:

### 5a. DeepGemm TMA-descriptor crash (GLM-5-FP8)
**Symptom:** CUDA graph capture aborts with `CUDA_ERROR_ILLEGAL_ADDRESS (700)` at `/deepgemm/csrc/.../runtime_utils.hpp:143` on the **first batch size** for **every TP rank**. Server never serves a prompt.

**Workarounds (any one):**
1. `--fp8-gemm-runner-backend cutlass` to bypass DeepGemm via CUTLASS.
2. `export SGL_ENABLE_JIT_DEEPGEMM=0` before `python -m sglang.launch_server` to skip JIT DeepGemm.
3. Pin recipe to `lmsysorg/sglang:v0.5.11-cu130`.

Filed upstream: sgl-project/sglang#25551. Seen on #1421.

### 5b. trtllm GEMM bug at bs=128 + MTP / EAGLE (GLM-5-NVFP4)
**Symptom:** EAGLE draft CUDA graph capture crashes immediately at the largest batch size with `RuntimeError ... trtllm_batched_gemm_runner.cu:276 ... numBatches=256, GemmMNK 128x1024x6144`. The target model captures fine; only the draft model crashes.

**Workarounds:**
1. Cap `--cuda-graph-max-bs` and `--max-running-requests` to 64 in the launch script to avoid the bs=128 trigger.
2. Comment out the MTP/EAGLE scenarios on B300 in the recipe.
3. Pin to v0.5.11-cu130.

Seen on #1420.

### 5c. flash_attn SM-arch assertion (qwen3.5-bf16)
**Symptom:** All 4 TP workers raise an `AssertionError` on the first forward pass:
```
File "/opt/venv/.../sglang/srt/layers/attention/flashattention_backend.py:..."
assert sm_100 <= arch <= sm_110f
```
B300 is `sm_120`, outside the asserted range. Server never becomes healthy; warmup times out at 600s.

**Fix:** Needs an sglang image whose flash_attn supports `sm_120`; there is no local workaround. Pin to v0.5.11-cu130 in the meantime.

Seen on #1422.

---

## 6. Cluster infrastructure (AMD MI355X / MI300X / MI325X)

### 6.1 `mia1-p01-g09 / g19 / g37` (amd-tw-mi355): persistently drained
- **g09**: `pyxis is broken`
- **g19**: `Kill task failed (JobId=N StepId=N)`
- **g37**: `permission issues with GHA runner workflows : Not responding` (down since Mar 2026)

If a sweep job lands on any of these, it'll never start. Nothing to do at the recipe level; these stay drained until ops fixes them.

### 6.2 `mia1-p01-g11 / g12 / g31`: docker socket perms
**Symptom:** mi355x jobs fail with `permission denied while trying to connect to the docker API at unix:///var/run/docker.sock` during the `docker stop $(docker ps -a -q)` cleanup step, cascading into SLURM job expiration.
**Fix:** ops needs to fix docker group / socket perms on these nodes. Recipe-level workaround: none.

### 6.3 `chi-mi300x-049`: `/nvme_home` disk-full
**Symptom:** pyxis container extraction fails with `No space left on device` writing to `/nvme_home/gharunner/.local/share/enroot/pyxis_*/opt/rocm-*/...`. The `/nvme_home` partition is hosted under `/` on this node and has been chronically near-full.

**Fix already landed:** `runners/launch_mi300x-amds.sh` now pins salloc to only known-good mi300x nodes (`chi-mi300x-[034-036,054,057-058]`); see PR #1462. `chi-mi300x-049` is held in `State=DOWN` by a watchdog on the controller (`/home/gharunner/_audit/drain_049_watchdog.sh`) that checks every 10s and re-applies the drain if SLURM has auto-cleared it (which it does on dynamic-norm nodes).

### 6.4 `chi-mi325x-pod1-017`: orphaned port-8888 process
**Symptom:** sglang server bind fails with `[Errno 98] Address already in use` on port 8888. Held by an MLPerf accuracy run started outside SLURM.
**Fix:** SSH to controller, find the holder via `ss -tlnp | grep :8888`, `kill` the PID. If recurring, file with the team running MLPerf experiments.

### 6.5 Cluster controller layout
- **amd-vultr-mi300**: SLURM controller for 7 mi300x nodes (3 down, see 6.3).
- **amd-vultr-mi325**: SLURM controller for 6 mi325x nodes.
- **amd-tw-mi355**: jumpbox → ssh further to compute nodes (`mia1-p01-gNN`). 12 nodes (3 drained, see 6.1).
- `/home` is NFS-mounted across clusters from `chi-mi325x-pod1-001:/nfs/homes`, **root-writable**.
- `/tmp` and `/nvme_home` are per-node local; HF cache lives at node-local `/raid/hf-hub-cache/` (2.7T per mi300x node).
- Use `srun -w <FQDN>` (with the **full FQDN**, not the short hostname) from the controller to run admin commands on a compute node.

### 6.6 Drain watchdog pattern
SLURM auto-clears `State=DRAIN` on `DYNAMIC_NORM` nodes when they re-register. To keep a node out of the pool sticky-style, use `State=DOWN` AND start a watchdog:
```bash
# on the controller, as root
nohup bash -c '
  while true; do
    # Read the current node state; empty if scontrol fails.
    s=$(scontrol show node <FQDN> 2>/dev/null | grep -oE "State=[A-Z+_]+")
    # Re-apply DOWN whenever the node is neither DOWN nor DRAIN.
    if ! echo "$s" | grep -qE "DOWN|DRAIN"; then
      scontrol update NodeName=<FQDN> State=DOWN Reason="watchdog" >/dev/null 2>&1
    fi
    sleep 10
  done
' > /home/gharunner/_audit/drain_<node>_watchdog.log 2>&1 &
```
The watchdog doesn't survive controller reboots; for permanent removal a SLURM admin should edit `slurm.conf`.
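
The decision the watchdog loop makes on each tick can be isolated and tested without a cluster. The sketch below mirrors the shell logic in Python; `needs_down` is a hypothetical name, and the sample strings in the usage note are illustrative rather than captured `scontrol` output.

```python
import re

# Same pattern the shell watchdog greps for in `scontrol show node` output.
STATE_RE = re.compile(r"State=([A-Z+_]+)")


def needs_down(scontrol_output: str) -> bool:
    """Return True when the watchdog should re-apply State=DOWN."""
    match = STATE_RE.search(scontrol_output)
    if match is None:
        # No State= field (scontrol failed or printed nothing): the shell
        # loop's grep finds no DOWN/DRAIN either, so it also re-applies.
        return True
    state = match.group(1)
    return "DOWN" not in state and "DRAIN" not in state
```

For example, `needs_down("NodeName=n1 State=IDLE")` is true, while a compound state such as `State=IDLE+DRAIN` leaves the node alone.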

---

## 7. Docker image tag gotchas

**Don't invent a "release" tag pattern from a date-suffixed nightly.** `lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x` does **not** exist; only the dated `v0.5.12-rocm720-mi35x-20260517` does. All MI355X `sglang-rocm:rocm720` tags follow the dated-nightly pattern.

Before bumping an image, verify the target tag exists:
```bash
curl -sI "https://hub.docker.com/v2/repositories/lmsysorg/sglang-rocm/tags/v0.5.12-rocm720-mi35x"
# 200 → exists; 404 → doesn't
```

Or check whether any other recipe on main uses the proposed tag; if none does, treat the tag as suspect.
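
The same existence check can be scripted when several tags need verifying. This is a minimal sketch that reuses the Docker Hub URL from the `curl` example; `http_status` and `tag_exists` are hypothetical helper names, it issues a GET rather than the HEAD request `curl -I` sends, and it assumes, as the note above does, that a 200 response means the tag exists.

```python
import urllib.error
import urllib.request
from typing import Callable

# URL shape taken from the curl example above.
HUB_TAG_URL = "https://hub.docker.com/v2/repositories/{repo}/tags/{tag}"


def http_status(url: str) -> int:
    """Return the HTTP status code for a GET on url (network call)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the code is still what we want.
        return err.code


def tag_exists(repo: str, tag: str,
               fetch: Callable[[str], int] = http_status) -> bool:
    """True when the tag endpoint answers 200. `fetch` is injectable for tests."""
    return fetch(HUB_TAG_URL.format(repo=repo, tag=tag)) == 200
```

A bump script could call `tag_exists("lmsysorg/sglang-rocm", "<tag>")` and refuse to edit the recipe when it returns false.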

---

## 8. CI: rerun mechanics

- `gh run rerun <id> --failed` only works when the workflow run is **completed** with `conclusion=failure`. If the run is still `queued`/`in_progress`, the call returns "cannot be rerun".
- To abandon an in-flight run and start fresh, push an **empty commit** to the PR branch:
```bash
git commit --allow-empty -m "Re-trigger sweep"
git push
```
The old run will be auto-cancelled by `workflow/cancel-sweep-on-merge` (provided the head SHA changed).
- For a `cancelled` run (not `failure`), use `gh run rerun <id>` without `--failed` to re-run everything.
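
The three cases above reduce to a small decision table. The sketch below encodes it in Python; `rerun_command` is a hypothetical helper, and `status` / `conclusion` are assumed to be the lowercase values that `gh run view <id> --json status,conclusion` reports.

```python
from typing import Optional


def rerun_command(run_id: int, status: str, conclusion: str) -> Optional[str]:
    """Pick the re-trigger command for a workflow run, per the notes above."""
    if status != "completed":
        # Still queued / in_progress: `gh run rerun` would refuse, so
        # abandon the run by pushing an empty commit instead.
        return 'git commit --allow-empty -m "Re-trigger sweep" && git push'
    if conclusion == "failure":
        return f"gh run rerun {run_id} --failed"
    if conclusion == "cancelled":
        return f"gh run rerun {run_id}"
    # Other conclusions (e.g. success) are not covered by this playbook.
    return None
```

Returning the command as a string keeps the helper side-effect free, so a caller can print it for review before executing anything.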

---

## 9. gh CLI gotchas

- **`gh pr edit` silently aborts** on a Projects-classic deprecation GraphQL error. Title/body updates won't apply. Use `gh api -X PATCH "repos/<org>/<repo>/pulls/<N>" -f title="..." -F body=@file.md` instead.
- Same issue for adding labels β€” use `gh api -X POST "repos/<org>/<repo>/issues/<N>/labels" -f "labels[]=<name>"`.
- `gh pr view ... --jq .headRefName` output can have a trailing `\r`. Strip it: `gh pr view <N> --json headRefName --jq .headRefName | tr -d '\r\n'`. Otherwise shell concatenation produces `branchunners/launch_mi300x-amds.sh`-style corruption.
- `gh pr list --json statusCheckRollup` **truncates** each PR's rollup, so never trust it for per-check filters. Re-query each PR individually with `gh pr view <N> --json statusCheckRollup`.
- For in-flight checks, `gh` and the GitHub Actions API report `conclusion` as `""` (empty string, not `null`), so `jq`'s `// .status` fallback doesn't trigger. Use:
```jq
def state: if (.conclusion // "") != "" then .conclusion else .status end;
```
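
When the rollup JSON is post-processed in Python rather than jq, the same two pitfalls apply. The sketch below ports the `state` helper and the `tr -d '\r\n'` cleanup; `check_state` and `clean_ref` are hypothetical names, and the dictionaries in the usage note are illustrative, not real API output.

```python
from typing import Optional


def check_state(check: dict) -> Optional[str]:
    """Python port of the jq `state` helper.

    Falls back to `status` when `conclusion` is missing, null, or the
    empty string that in-flight checks report.
    """
    conclusion = check.get("conclusion") or ""
    return conclusion if conclusion != "" else check.get("status")


def clean_ref(raw: str) -> str:
    """Drop every CR and LF, like `tr -d '\\r\\n'` on gh output."""
    return raw.replace("\r", "").replace("\n", "")
```

With these, `check_state({"conclusion": "", "status": "IN_PROGRESS"})` yields the status, and `clean_ref` makes a branch name safe to concatenate into a path.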

---

## 10. PR conventions for this repo

- Image-bump / new-recipe PRs I open on behalf of the user (or that the user creates) get the **`[Klaud Cold]`** title prefix.
- Add the `full-sweep-enabled` label so a full sweep actually runs (`gh api -X POST ... labels[]=full-sweep-enabled`). Without it, the sweep is mostly SKIPPED.
- After any code change that shifts a PR's scope (drops a recipe, changes an image tag), **update the PR title AND body in the same step** and **verify** with `gh pr view <N> --json title,body`, because `gh pr edit` silently fails (see §9).
- `utils/merge_with_reuse.sh <N>` is the merge entrypoint; it handles the `perf-changelog.yaml` auto-append.

---

## 11. Useful slash commands (defined in `.claude/commands/`)

- `/find-mergeable-claude-prs`: lists `claude/*` PRs whose full sweep finished all-green.
- `/list-claude-pr-status`: lists READY/RUNNING (and optionally FAILED) state per `claude/*` PR.
- `/fix-klaud-cron-prs`: diagnoses failing `claude/*` PRs by reading their failed job logs.
- `/merge-prs <N> [<N>...]`: sequential merge via `utils/merge_with_reuse.sh`.

Each command file is self-contained; read them to understand the exact jq filters they use.