Commit 9748d74

Merge branch 'main' into claude/issue-1154-gptoss-fp4-b200-vllm
2 parents 3ddbf47 + 40fa217 commit 9748d74

12 files changed

Lines changed: 541 additions & 883 deletions

.github/configs/amd-master.yaml

Lines changed: 6 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -368,7 +368,7 @@ qwen3.5-fp8-mi300x-sglang:
368368
- { tp: 8, conc-start: 4, conc-end: 64 }
369369

370370
glm5-fp8-mi355x-sglang:
371-
image: lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260413
371+
image: lmsysorg/sglang-rocm:v0.5.11-rocm720-mi35x-20260513
372372
model: zai-org/GLM-5-FP8
373373
model-prefix: glm5
374374
runner: mi355x
@@ -380,11 +380,11 @@ glm5-fp8-mi355x-sglang:
380380
- isl: 1024
381381
osl: 1024
382382
search-space:
383-
- { tp: 8, conc-start: 4, conc-end: 64 }
383+
- { tp: 4, conc-start: 4, conc-end: 256 }
384384
- isl: 8192
385385
osl: 1024
386386
search-space:
387-
- { tp: 8, conc-start: 4, conc-end: 64 }
387+
- { tp: 4, conc-start: 4, conc-end: 256 }
388388

389389
glm5-fp8-mi355x-sglang-mtp:
390390
image: lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260415
@@ -1644,7 +1644,7 @@ dsv4-fp8-mi355x-vllm:
16441644
# the standard atom0.1.2.post MI355X base (matching qwen3.5-fp8-mi355x-atom);
16451645
# the DSv4 PR is overlaid at runtime by dsv4_fp4_mi355x_atom.sh at a pinned SHA.
16461646
dsv4-fp4-mi355x-atom:
1647-
image: rocm/atom:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post
1647+
image: rocm/atom-dev:nightly_202605130853
16481648
model: deepseek-ai/DeepSeek-V4-Pro
16491649
model-prefix: dsv4
16501650
runner: mi355x
@@ -1656,8 +1656,8 @@ dsv4-fp4-mi355x-atom:
16561656
- isl: 1024
16571657
osl: 1024
16581658
search-space:
1659-
- { tp: 8, ep: 1, conc-start: 1, conc-end: 1 }
1659+
- { tp: 8, ep: 1, conc-start: 1, conc-end: 512 }
16601660
- isl: 8192
16611661
osl: 1024
16621662
search-space:
1663-
- { tp: 8, ep: 1, conc-start: 1, conc-end: 1 }
1663+
- { tp: 8, ep: 1, conc-start: 1, conc-end: 512 }

.github/configs/nvidia-master.yaml

Lines changed: 3 additions & 3 deletions
@@ -2228,7 +2228,7 @@ glm5-fp4-b200-sglang:
22282228
- { tp: 4, ep: 1, conc-start: 4, conc-end: 256 }
22292229

22302230
glm5-fp4-b200-sglang-mtp:
2231-
image: lmsysorg/sglang:v0.5.10.post1-cu130
2231+
image: lmsysorg/sglang:v0.5.11-cu130
22322232
model: nvidia/GLM-5-NVFP4
22332233
model-prefix: glm5
22342234
runner: b200
@@ -2438,7 +2438,7 @@ qwen3.5-bf16-b300-sglang-mtp:
24382438
- { tp: 4, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }
24392439

24402440
kimik2.5-int4-b200-vllm:
2441-
image: vllm/vllm-openai:v0.15.1
2441+
image: vllm/vllm-openai:v0.20.2
24422442
model: moonshotai/Kimi-K2.5
24432443
model-prefix: kimik2.5
24442444
runner: b200
@@ -2481,7 +2481,7 @@ kimik2.5-int4-b300-vllm:
24812481
- { tp: 4, ep: 1, conc-start: 4, conc-end: 64 }
24822482

24832483
kimik2.5-int4-h200-vllm:
2484-
image: vllm/vllm-openai:v0.16.0
2484+
image: vllm/vllm-openai:v0.20.2
24852485
model: moonshotai/Kimi-K2.5
24862486
model-prefix: kimik2.5
24872487
runner: h200
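
Note on the tag bumps in this file: the SGLang change keeps the `cu130` suffix (`v0.5.10.post1-cu130` to `v0.5.11-cu130`), which is consistent with the variant rule stated in the tag-monitor prompt later in this commit (keep the CUDA/ROCm suffix the config already uses). A minimal sketch of that rule, assuming tags of the form `<version>-<variant>`; the helper names and the `v0.5.11-cu128` candidate are hypothetical and not part of the repository:

```bash
# tag_suffix TAG: text after the last hyphen, e.g. v0.5.11-cu130 -> cu130.
# Assumption: the variant is the final hyphen-separated segment of the tag.
tag_suffix() {
  printf '%s\n' "${1##*-}"
}

# pick_variant CURRENT_TAG CANDIDATE...
# Prints the first candidate whose suffix equals the current tag's suffix.
# Returns 1 when no candidate matches.
pick_variant() {
  want=$(tag_suffix "$1")
  shift
  for cand in "$@"; do
    if [ "$(tag_suffix "$cand")" = "$want" ]; then
      printf '%s\n' "$cand"
      return 0
    fi
  done
  return 1
}

# Hypothetical candidate list for illustration only.
pick_variant v0.5.10.post1-cu130 v0.5.11-cu128 v0.5.11-cu130
```

This sketch does not cover tags without a variant suffix (such as `v0.20.2`) or tags whose last segment is a date (such as the ROCm images in `amd-master.yaml`); those would need a different matching rule.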

.github/workflows/claude.yml

Lines changed: 7 additions & 27 deletions
@@ -24,7 +24,7 @@ jobs:
2424
uses: actions/checkout@v6.0.2
2525
with:
2626
fetch-depth: 0
27-
token: ${{ secrets.PAT_WITH_WORKFLOW_SCOPE }}
27+
token: ${{ secrets.CLAUDE_PAT }}
2828

2929
- name: Setup MCP Server
3030
run: |
@@ -35,16 +35,17 @@ jobs:
3535
id: claude
3636
uses: anthropics/claude-code-action@v1
3737
env:
38-
GH_TOKEN: ${{ secrets.PAT_WITH_WORKFLOW_SCOPE }}
38+
GH_TOKEN: ${{ secrets.CLAUDE_PAT }}
39+
GITHUB_TOKEN: ${{ secrets.CLAUDE_PAT }}
3940
INFERENCEMAX_ROOT: ${{ github.workspace }}
4041
BASH_DEFAULT_TIMEOUT_MS: "1800000"
4142
BASH_MAX_TIMEOUT_MS: "3600000"
4243
with:
4344
anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
44-
github_token: ${{ secrets.PAT_WITH_WORKFLOW_SCOPE }}
45+
github_token: ${{ secrets.CLAUDE_PAT }}
4546
trigger_phrase: "${{ contains(github.event.comment.body || github.event.issue.body || github.event.issue.title || '', '@Klaud-Cold') && '@Klaud-Cold' || '@claude' }}"
4647
track_progress: true
47-
allowed_bots: 'Klaud-Cold'
48+
allowed_bots: '*'
4849
additional_permissions: |
4950
actions: read
5051
settings: |
@@ -220,27 +221,7 @@ jobs:
220221
221222
## Updating perf-changelog.yaml
222223
223-
When making changes to benchmark scripts or master config files that affect image tags, environment variables, or configuration parameters, you MUST add an entry to `perf-changelog.yaml`.
224-
225-
**When to update perf-changelog.yaml:**
226-
- Updating image tags in `.github/configs/*-master.yaml` or `benchmarks/*.sh` scripts
227-
- Adding or modifying environment variables in benchmark configurations
228-
- Changing configuration parameters that affect performance
229-
230-
**Entry format:**
231-
```yaml
232-
- config-keys:
233-
- dsr1-fp8-*-vllm # Use wildcards to match multiple configs
234-
description:
235-
- "Update vLLM image from v0.11.2 to v0.13.0"
236-
- "Add VLLM_MXFP4_USE_MARLIN=1 environment variable"
237-
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
238-
```
239-
240-
**Guidelines:**
241-
- Use wildcards (`*`) in config-keys to match multiple related configurations
242-
- Each description item should be a concise change summary
243-
- The pr-link should reference the PR number (use XXX as placeholder until PR is created)
224+
See `AGENTS.md` → "Updating Docker Images" for entry format and rules. Required whenever you change image tags, env vars, or perf-affecting params in `.github/configs/*-master.yaml` or `benchmarks/*.sh`. Use `XXX` as the PR-link placeholder until the PR exists.
244225
245226
## Spawning Additional Workers:
246227
You CAN spawn additional Claude workers by commenting "@claude" with a specific task.
@@ -271,8 +252,7 @@ jobs:
271252

272253
### Additional Knowledge
273254
- MI355 is gfx950 not gfx1201
274-
- **STP (Single Token Prediction)**: Standard autoregressive decoding — one token per forward pass. No speculative decoding or MTP. Benchmarks labeled "STP only" use vanilla decoding.
275-
- **MTP (Multi-Token Prediction)**: Predicts multiple tokens per forward pass using speculative decoding (e.g., EAGLE, NEXTN).
255+
- STP/MTP terminology: see `AGENTS.md` → "Terminology"
276256

277257
### Expert Parallelism in Benchmark Scripts
278258
vLLM and SGLang handle expert parallelism differently. When writing or reviewing benchmark scripts for MoE models:
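
Note on the `trigger_phrase` expression in this file: it uses the `cond && a || b` idiom as a ternary. The `||` chain picks the first non-empty value among the comment body, issue body, and issue title, and `contains(...)` then selects `@Klaud-Cold` when that text mentions it, falling back to `@claude`. A rough shell equivalent, where `pick_trigger_phrase` is a hypothetical name used only for illustration:

```bash
# pick_trigger_phrase TEXT
# TEXT stands for the first non-empty of: comment body, issue body, issue title.
# The text is lowercased first because GitHub's contains() ignores case
# when comparing strings.
pick_trigger_phrase() {
  lowered=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$lowered" in
    *"@klaud-cold"*) printf '%s\n' '@Klaud-Cold' ;;
    *)               printf '%s\n' '@claude' ;;
  esac
}

pick_trigger_phrase 'hey @Klaud-Cold please bump the image'
pick_trigger_phrase 'no mention here'
```

The idiom is only safe because the middle operand (`'@Klaud-Cold'`) is a non-empty string; with a falsy middle value the expression would always fall through to the last operand.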

.github/workflows/docker-tag-monitor.yml

Lines changed: 93 additions & 3 deletions
@@ -1,5 +1,15 @@
11
name: Docker Tag Monitor
22

3+
# Downstream merge note (human-only — intentionally NOT in the @claude prompt):
4+
# Once the per-config-key PRs this workflow asks Claude to open have a
5+
# green run-sweep.yml run and the `full-sweep-enabled` label, merge them
6+
# with `utils/merge_with_reuse.sh <pr-number>` instead of the GitHub UI.
7+
# That script posts /reuse-sweep-run, auto-resolves perf-changelog.yaml
8+
# conflicts, cancels the merge-triggered sweep, and squash-merges with
9+
# --admin so the post-merge run-sweep run reuses the PR's prior sweep.
10+
# Claude doesn't have admin merge rights and shouldn't be told about this
11+
# path — it's a maintainer-only finalization step.
12+
313
on:
414
schedule:
515
- cron: '0 7 * * 6'
@@ -137,6 +147,38 @@ jobs:
137147
138148
echo "has_updates=true" >> "$GITHUB_OUTPUT"
139149
150+
# Snapshot self-hosted runner availability from the CI tracker.
151+
# Used to constrain which SKUs Claude is allowed to open PRs against.
152+
RUNNERS_TABLE=$(mktemp)
153+
ALLOWED_SKUS_FILE=$(mktemp)
154+
if curl -sf --max-time 10 "https://inferencex-ci-tracker.vercel.app/api/ci" -o /tmp/ci.json; then
155+
jq -r '
156+
[
157+
"| SKU | Total | Busy | Idle | Offline | Pressure | Available |",
158+
"|-----|-------|------|------|---------|----------|-----------|"
159+
] + (
160+
.skus
161+
| sort_by(.label)
162+
| map(
163+
.label as $l
164+
| .summary as $s
165+
| (($s.idleRunners > 0) and ($s.offlineRunners < $s.totalRunners)) as $ok
166+
| "| `\($l)` | \($s.totalRunners) | \($s.busyRunners) | \($s.idleRunners) | \($s.offlineRunners) | \($s.pressureLevel) | \(if $ok then "yes" else "no" end) |"
167+
)
168+
)
169+
| .[]
170+
' /tmp/ci.json > "$RUNNERS_TABLE"
171+
jq -r '
172+
.skus[]
173+
| select((.summary.pressureLevel == "clear") and (.summary.idleRunners > 0) and (.summary.offlineRunners < .summary.totalRunners))
174+
| .label
175+
' /tmp/ci.json | paste -sd, - > "$ALLOWED_SKUS_FILE"
176+
else
177+
echo "_Could not reach https://inferencex-ci-tracker.vercel.app/api/ci — proceed cautiously._" > "$RUNNERS_TABLE"
178+
: > "$ALLOWED_SKUS_FILE"
179+
fi
180+
ALLOWED_SKUS=$(cat "$ALLOWED_SKUS_FILE")
181+
140182
# Build issue body and write to file
141183
BODY_FILE=$(mktemp)
142184
{
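
Note on the hunk above: the two `jq` filters do not use the same predicate. The "Available" column in the table only checks `idleRunners > 0` and `offlineRunners < totalRunners`, while the allow-list additionally requires `pressureLevel == "clear"`. A SKU can therefore show "yes" in the table and still be missing from `ALLOWED_SKUS`. The stricter allow-list predicate can be sketched in plain shell as follows; the function name and the sample rows are made up for illustration:

```bash
# sku_allowed PRESSURE IDLE OFFLINE TOTAL
# Succeeds only when all three allow-list conditions hold.
sku_allowed() {
  [ "$1" = "clear" ] && [ "$2" -gt 0 ] && [ "$3" -lt "$4" ]
}

# Made-up sample rows: label pressure idle offline total
while read -r label pressure idle offline total; do
  if sku_allowed "$pressure" "$idle" "$offline" "$total"; then
    echo "$label allowed"
  else
    echo "$label skipped"
  fi
done <<'EOF'
b200 clear 2 0 4
h200 high 1 0 4
mi355x clear 0 4 4
EOF
```

In the sample, `h200` has idle runners but is skipped because its pressure level is not `clear`, which is exactly the case where the table and the allow-list disagree.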
@@ -153,19 +195,67 @@ jobs:
153195
echo "| \`$repo\` | \`$tag\` | $tag_ver | $cur_ver | $pub_date |"
154196
done <<< "$UPDATES"
155197
echo ""
198+
echo "### Self-Hosted Runner Snapshot"
199+
echo ""
200+
echo "_Source: https://inferencex-ci-tracker.vercel.app/api/ci at $(date -u +%Y-%m-%dT%H:%M:%SZ)_"
201+
echo ""
202+
cat "$RUNNERS_TABLE"
203+
echo ""
204+
if [[ -n "$ALLOWED_SKUS" ]]; then
205+
echo "**Allowed SKUs for this run:** \`$ALLOWED_SKUS\`"
206+
else
207+
echo "**Allowed SKUs for this run:** _none — skip PR creation and post a comment explaining the runner shortage._"
208+
fi
209+
echo ""
156210
echo "---"
157211
echo ""
158212
echo "@claude Please update the configurations:"
159213
echo ""
160214
echo "1. Update image tags in \`.github/configs/nvidia-master.yaml\` and/or \`.github/configs/amd-master.yaml\`"
161215
echo "2. Add entries to \`perf-changelog.yaml\` documenting the version changes"
162-
echo "3. Create separate PRs grouped by framework and image family, and link each PR."
216+
echo "3. For each eligible config-key, push a branch and actually open a PR — do not stop at the \"Create a pull request for ...\" remote hint that \`git push\` prints. Run \`gh pr create\` (or the equivalent MCP tool) and verify the returned PR URL. Link every PR back to this issue in a comment."
217+
echo ""
218+
echo "**Required PR label:** Every PR you open from this issue MUST carry the \`full-sweep-enabled\` label. Apply it at creation time via \`gh pr create --label full-sweep-enabled\` (or add it immediately after with \`gh pr edit <num> --add-label full-sweep-enabled\`). Do not skip this — downstream automation keys off the label."
219+
echo ""
220+
echo "**PR title / commit message formatting:** Multi-line titles and bodies MUST use a heredoc, not \`\\n\` escapes and not \`\$'...'\` ANSI-C quoting. A prior run produced commits literally starting with \`\$\` and containing \`\\n\\n\` as text because of mis-quoted ANSI-C strings. Use this pattern instead:"
163221
echo ""
164-
echo "Group by framework plus CUDA/ROCm image family (for example sglang-cuda, sglang-rocm, vllm-cuda, vllm-rocm, atom, trt), not by individual GPU. Split by GPU only when an update is genuinely hardware-specific."
222+
echo "\`\`\`bash"
223+
echo "git commit -m \"\$(cat <<'EOF'"
224+
echo "Update qwen3.5-bf16-b300-sglang-mtp SGLang image to v0.5.11-cu130"
165225
echo ""
166-
echo "Focus on updating single-node configurations. For each framework/image family, check if there are multiple CUDA/ROCm versions available and choose appropriately based on current usage patterns in the configs."
226+
echo "Ref #<this issue's number>"
227+
echo "EOF"
228+
echo ")\""
229+
echo ""
230+
echo "gh pr create --title \"Update qwen3.5-bf16-b300-sglang-mtp SGLang image to v0.5.11-cu130\" \\"
231+
echo " --label full-sweep-enabled \\"
232+
echo " --body \"\$(cat <<'EOF'"
233+
echo "Updates the SGLang image tag for \`qwen3.5-bf16-b300-sglang-mtp\` to v0.5.11-cu130."
234+
echo ""
235+
echo "Ref #<this issue's number>"
236+
echo "EOF"
237+
echo ")\""
238+
echo "\`\`\`"
239+
echo ""
240+
echo "PR titles must be a single line (no newlines). Bodies should contain real newlines (use a heredoc), not the literal characters \`\\n\`. Never put \`\$\` in front of a quoted message string."
241+
echo ""
242+
echo "**Runner gating:** Only open PRs for config-keys whose runner SKU is in the allowed list (\`$ALLOWED_SKUS\`). The runner SKU is the hardware segment in the config-key (e.g. \`dsr1-fp4-b200-sglang\` → \`b200\`). For any config-key whose SKU is not in the allowed list, skip it and list the skipped keys plus the reason (not clear / all-offline / no idle capacity) in a single comment on this issue."
243+
echo ""
244+
echo "**Single-node only:** Skip any config-key whose master-config entry has \`multinode: true\` or otherwise targets a multinode runner. Only update single-node configurations."
245+
echo ""
246+
echo "**Per-SKU cap (max 5):** For each allowed SKU, work on at most 5 config-keys. If a SKU has more than 5 eligible config-keys, pick 5 in alphabetical order and list the deferred remainder in your wrap-up comment on this issue so a future run can pick them up."
247+
echo ""
248+
echo "**Sequential execution per SKU:** Within a single SKU, process config-keys one at a time. For each config-key: open the PR, dispatch its e2e test (\`mcp__github__run_workflow\` against \`e2e-tests.yml\` on the PR branch), poll with exponential backoff until the run reaches a terminal state (success/failure/cancelled), then move on to the next config-key for that SKU. Do not dispatch the next e2e run for the same SKU until the previous one has finished. Different SKUs may be processed in parallel since they target disjoint hardware, but each SKU's queue must stay serial."
249+
echo ""
250+
echo "One PR per config-key — do not bundle multiple config-keys into one PR even when they share a framework or image family."
251+
echo ""
252+
echo "**Exception — MTP pairs:** When a config-key and its \`-mtp\` sibling exist for the same model/precision/runner/framework (e.g. \`qwen3.5-fp4-b300-sglang\` and \`qwen3.5-fp4-b300-sglang-mtp\`), bundle both into one PR. Treat the pair as a single unit for the per-SKU cap (counts as 1, not 2) and the sequential e2e queue. If only one side of the pair is present in the updates, open a PR for just that one."
253+
echo ""
254+
echo "If Docker Hub lists multiple variants for the same base version (e.g. \`cu128\` vs \`cu130\`, \`rocm70\` vs \`rocm72\`), pick the variant whose suffix matches what the config-key's current image entry already uses — don't switch CUDA/ROCm minor versions in this update."
167255
} > "$BODY_FILE"
168256
257+
rm -f "$RUNNERS_TABLE" "$ALLOWED_SKUS_FILE" /tmp/ci.json
258+
169259
echo "body_file=$BODY_FILE" >> "$GITHUB_OUTPUT"
170260
171261
echo "=== Issue body ==="
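
Note on the runner-gating rule in the prompt above: it treats the hardware segment of a config-key as the runner SKU (`dsr1-fp4-b200-sglang` → `b200`) and checks it against the comma-separated `ALLOWED_SKUS` list. A minimal sketch of that check, assuming the SKU is always the third hyphen-separated field, which holds for the keys shown in this commit but is not guaranteed by the source; both helper names are hypothetical:

```bash
# sku_of KEY: third hyphen-separated field of a config-key.
# Assumption: keys follow <model>-<precision>-<sku>-<framework>[-mtp].
sku_of() {
  printf '%s\n' "$1" | cut -d- -f3
}

# in_allowed SKU LIST: LIST is comma-separated, like $ALLOWED_SKUS.
# Wrapping both sides in commas avoids partial matches (b20 vs b200).
in_allowed() {
  case ",$2," in
    *",$1,"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Example allow-list for illustration only.
allowed="b200,h200"
for key in dsr1-fp4-b200-sglang glm5-fp8-mi355x-sglang; do
  if in_allowed "$(sku_of "$key")" "$allowed"; then
    echo "$key: open PR"
  else
    echo "$key: skip"
  fi
done
```

An empty list makes every key fall into the skip branch, matching the "none" case where the prompt asks for a comment instead of PRs.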

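Note on the heredoc requirement in the prompt above: inside double quotes the shell leaves `\n` as two literal characters, which is the failure the prompt describes. A quoted heredoc (`<<'EOF'`) keeps real newlines and disables expansion. The following sketch contrasts the two; the message text is a placeholder, not a real commit message from this repository:

```bash
# Quoted heredoc: the text is taken literally and the newlines are real.
msg_heredoc=$(cat <<'EOF'
Update example-config image tag

Ref #<issue number>
EOF
)

# Failure mode: in double quotes, \n stays a backslash followed by 'n'.
msg_escaped="Update example-config image tag\n\nRef #<issue number>"

# Count lines in each message (printf %s does not interpret escapes).
printf '%s\n' "$msg_heredoc" | wc -l
printf '%s\n' "$msg_escaped" | wc -l
```

The first count covers three lines (title, blank line, reference) and the second covers one, since the escaped variant never contains an actual newline.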