Commit 370a162

cquil11 and claude authored
Migrate agentic-coding benchmarks to aiperf v0.2 (#1391)
* chore: agentic benchmark infrastructure (v0.1) Adds end-to-end agentic-coding benchmark infrastructure on top of the existing fixed-seq-len harness. New components: Trace replayer - New utils/trace-replay submodule (kv-cache-tester @ agentx-minimized) driving multi-turn HF-dataset traces against any OpenAI-compatible endpoint at fixed concurrency. - --debug-trace captures full per-request prompt/response, every streamed chunk via chunk.model_dump(), and integer token IDs (apply_chat_template prompt + logprobs.content completion) into debug_trace.jsonl. - Per-model delta-field abstraction (gpt-oss → delta.reasoning, default → delta.reasoning_content) so reasoning-heavy responses are counted and appended to conversation history correctly. - Input-token metric reads server's usage.prompt_tokens (authoritative) rather than the local apply_chat_template estimate which breaks for gpt-oss harmony's chat template. - Per-user 8-token salt prefix on conversation[0] so two in-flight users replaying the same trace_id don't accidentally share KV-cache blocks. - Period summary: counts up elapsed instead of down remaining; replaces the dispatch-jitter "Wait time" with the trace's true "Inter-turn time" sourced from RequestMetrics.delay_expected. - 5s quiesce between warmup completion and metrics-collector start so warmup-tail prefill doesn't bleed into period 1. Workflow plumbing - e2e-tests.yml: workflow_dispatch + workflow_call inputs for debug-trace (boolean) and duration-override (string seconds), forwarded to test-sweep-agentic and test-sweep-multi-node-agentic jobs. - benchmark-tmpl.yml + benchmark-multinode-tmpl.yml: debug-trace input mapped to DEBUG_TRACE env var; duration override threads through to matrix.config.duration. - benchmark_lib.sh: build_replay_cmd / resolve_trace_source / install_agentic_deps / write_agentic_result_json helpers; consumes DEBUG_TRACE → --debug-trace. - runners/launch_*.sh: shared agentic mode dispatch + scenario routing. 
- runners/launch_b200-dgxc-slurm.sh → launch_b200-dgxc.sh rename to match the actual runner.name observed by the workflow. Result aggregation - utils/agentic-benchmark/{bench,analysis,scripts}: metrics collector (vllm/sglang Prometheus parsers), pareto plotter, per-config distribution analyzer, sweep aggregator. - utils/process_agentic_result.py: per-job results.json builder. - utils/matrix_logic: agentic-coding scenario plumbing in generate_sweep_configs.py + validation.py. Examples (one per vendor) - benchmarks/single_node/agentic/dsr1_fp4_b200.sh — NVIDIA. - benchmarks/single_node/agentic/dsr1_fp4_mi355x.sh — AMD. - Matching agentic-coding sections in nvidia-master.yaml (dsr1-fp4-b200-sglang) and amd-master.yaml (dsr1-fp4-mi355x-sglang). All other model-specific launchers and matrix entries are deliberately left out of this PR; downstream PRs add them on a per-model basis. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cleanup * agentic: rename USERS/users → CONC/conc throughout Same value, two names — collapse to one. Workflow templates already exposed both CONC and USERS env vars (USERS was a mirror of inputs.conc), and the agentic matrix entries carried both `users: int` and `conc: [users]`. 
Drop the duplicates and standardize on conc/CONC: - benchmark-tmpl.yml / benchmark-multinode-tmpl.yml: drop redundant USERS env var (CONC remains) - e2e-tests.yml / run-sweep.yml: pass `conc: ${{ matrix.config.conc }}` to template; build agentic conc-list as `'[${{ matrix.config.conc }}]'` since matrix.config.conc is now a scalar - generate_sweep_configs.py: agentic entries emit Fields.CONC.value (int) only; loop variable renamed from `users` to `conc`; exp-name template now uses `_conc{N}` instead of `_users{N}` - validation.py: drop Fields.USERS; agentic Pydantic models use `conc: int` - process_agentic_result.py: read CONC env var, emit single `"conc"` key - collect_sweep_results.py: regex updated to match `_conc{N}_offload` - benchmark_lib.sh / agentic launcher scripts: $USERS → $CONC The trace-replayer's --start-users / --max-users CLI flags are upstream's API and are left unchanged; benchmark_lib.sh just passes $CONC into them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * bump trace-replay: kimi tokenizer + reasoning support Pick up these submodule commits (callanjfox/kv-cache-tester): - 7b7f883 silence kimi: target the actual loaded-tokenizer module logger - 5b87e43 silence kimi: replace static logger lookup with content filter - 3394450 silence Kimi tokenization_kimi.py per-call encode warning - 7ad6a9e delta-field map: add 'kimi' substring (uses delta.reasoning like gpt-oss) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * agentic: add gptoss + kimik2.5 single-node launchers 5 new agentic-coding launcher scripts brought over from chore/agentx-integration, with USERS → CONC normalization: - benchmarks/single_node/agentic/gptoss_fp4_h100.sh - benchmarks/single_node/agentic/gptoss_fp4_h200.sh - benchmarks/single_node/agentic/gptoss_fp4_mi300x.sh - benchmarks/single_node/agentic/gptoss_fp4_mi325x.sh - benchmarks/single_node/agentic/kimik2.5_fp4_b200.sh Co-Authored-By: Claude Opus 4.7 (1M context) 
<noreply@anthropic.com> * agentic: add pareto-plot analysis tooling + extra Python deps Brings utils/agentic-benchmark/analysis/ (plot_pareto.py — sweep visualizer for cross-config performance comparison) and updates requirements.txt with transformers/xlsxwriter/tqdm/datasets/tiktoken needed by the analyzer + by trace-replay's tokenizer paths. The bench/ directory is intentionally NOT added: bench/metrics_collector.py duplicated utils/trace-replay/server_metrics.py and was already removed on this branch; bench/run_metrics_collector.py depends on it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * configs: add agentic-coding sections for kimik2.5 + gptoss Adds agentic-coding scenario blocks to the master configs for the five models whose launchers were just brought over: - kimik2.5-fp4-b200-vllm (image bumped to v0.19.1) - gptoss-fp4-h100-vllm - gptoss-fp4-h200-vllm - gptoss-fp4-mi300x-vllm - gptoss-fp4-mi325x-vllm Each scenario sweeps tp 4/8 (and 1/2 on AMD/H200) at offloading=none for low/mid concurrency and offloading=cpu for high concurrency, with a crossover at conc=64. Other agentic-coding sections present on chore/agentx-integration (trtllm/srt-slurm based) are left for follow-up since several of the underlying model entries were restructured by main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * runners: thread SCENARIO_SUBDIR through B200/B300 dispatch The agentic-coding scenario type uses benchmarks/single_node/agentic/ launchers, gated by SCENARIO_SUBDIR='agentic/' from benchmark-tmpl.yml. b200-cw, b200-dgxc, b200-nb, and b300-nv all built BENCH_BASE without honoring SCENARIO_SUBDIR, so dispatch always landed in single_node/ even for agentic runs. Other runners (h100-*, h200-*, mi*) already had this plumbing. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * agentic: add launchers + master configs for 4 model families on B200/H200 - minimaxm2.5-fp8-b200-vllm - qwen3.5-bf16-b200-sglang - glm5-fp8-b200-sglang - dsv4-fp8-h200-vllm Each launcher mirrors its fixed-seq-len sibling but: uses CONC env for max-num-seqs / cuda-graph-max-bs, sources benchmark_lib.sh, calls the trace replayer via build_replay_cmd, and emits the agentic result JSON. Master config gets an agentic-coding scenario block sweeping conc 1..32 at offloading=none; b200-dsv4 entries left untouched since that runner type isn't registered in runners.yaml. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * agentic: add mi355x launchers for minimaxm2.5/qwen3.5/glm5.1/kimik2.5 - minimaxm2.5-fp8-mi355x-vllm - qwen3.5-fp8-mi355x-sglang - glm5.1-fp4-mi355x-sglang - kimik2.5-fp4-mi355x-vllm Each mirrors its fixed-seq-len sibling with ROCm-specific tweaks (VLLM_ROCM_USE_AITER, ROCM_QUICK_REDUCE_QUANTIZATION, etc.) and feeds CONC into max-num-seqs / cuda-graph-max-bs. Master configs gain matching agentic-coding scenarios sweeping conc 1..32 at offloading=none. dsv4-fp8-mi355x is intentionally skipped since the existing fixed-seq launcher requires a bespoke vLLM PR rebuild that adds risk to trace-replayer testing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * agentic: add b200 launchers for gptoss-fp4, kimik2.5-int4, minimaxm2.5-fp4 Phase-2 coverage extension across precision (int4 vs fp4 for kimi, fp4 vs fp8 for minimax) and runner (b200 vs h100/h200 for gptoss). - gptoss-fp4-b200-vllm - kimik2.5-int4-b200-vllm - minimaxm2.5-fp4-b200-vllm Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * agentic: add qwen3.5-fp8-b200-sglang variant (bf16 image is buggy) The bf16 image lmsysorg/sglang:nightly-dev-20260216-d3bae71e fails on B200 with PyTorch/CuDNN compatibility errors at server start. 
Add an fp8 variant using lmsysorg/sglang:v0.5.9-cu130-amd64 to provide a working qwen3.5 trace-replayer test on NVIDIA. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: add agentic trace replayer test coverage map Documents the launcher matrix at benchmarks/single_node/agentic/, how to dispatch debug runs via gh workflow run, and what fields in the result JSON to inspect for verification (num_requests_successful, total_generation_tokens, median_ttft, median_tpot, total_tput_tps, etc.). Notes the two known-failing configs (qwen3.5 sglang on B200 — pytorch/ pytorch#168167; dsv4-fp4-b200-sglang — runner b200-dsv4 not in runners.yaml) so future testers don't repeat them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: add agentic trace replayer coverage test results 15 debug runs across 7 model families × NVIDIA/AMD HW. 10 PASS / 5 FAIL (1 still in flight); failures are all image- or vLLM-parser-level, not replayer bugs. Replayer's per-model delta-field routing + long-prefill agentic flow verified end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: finalize agentic trace replayer test results All 16 dispatched runs are now complete. Final tally: 10 PASS, 6 FAIL. The 6 failures are all infrastructure or vLLM-side issues (PyTorch/CuDNN image incompatibility, vLLM deepseek_v4 reasoning parser bug, sglang-rocm qwen3.5 streaming, SLURM time limit) — none indicate a bug in the trace replayer itself. All 7 active model families have at least one PASS. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(agentic): collect_sweep_results regex matches actual offload values The exp-name template emits offload{none|cpu|ssd} (per the matrix generator's f"{model_code}_tp{tp}_conc{conc}_offload{offloading}"), but the regex was looking for offload(on|off) — so every artifact directory failed to parse, the aggregator wrote nothing to aggregated/, and collect-agentic-results uploaded no files ("No files were found with the provided path: aggregated/"). Verified the fix matches real artifact names from this branch's runs (b200/h100, none/cpu). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * agentic: expand sweep configs for the 10 verified models For the 5 vllm models (kimik2.5-fp4/int4-b200, minimaxm2.5-fp8-b200, gptoss-fp4-b200, kimik2.5-fp4-mi355x, minimaxm2.5-fp8-mi355x): add offloading=cpu at high concurrency (typically conc 64+) where KV cache pressure exceeds GPU HBM. Overlap at conc=64 between none and cpu so the crossover region is sampled by both. cpu-offload sweep tail uses larger conc points (96, 128, 192, 256) since the only reason to enable cpu offload is when concurrency stresses HBM. For glm5-fp8-b200-sglang and glm5.1-fp4-mi355x-sglang (sglang launchers without the OFFLOADING=cpu plumbing): expand the conc range on offloading=none. sglang manages its own KV eviction via the radix cache, so concurrency above HBM capacity is handled internally rather than via vLLM's --kv_offloading_backend. dsr1-fp4-{b200,mi355x}-sglang sweeps already cover conc 1..256 (b200 also has tp=4 ep=4 / tp=8 ep=8 split and tp=8 going to conc=512), so left as-is. 
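The exp-name regex fix described earlier in this log (`offload{none|cpu|ssd}` instead of `offload(on|off)`) can be illustrated with a small sketch. The pattern and the `parse_exp_name` helper below are hypothetical; the real expression lives in collect_sweep_results.py and may differ. The example directory name `dsr1_tp8_conc64_offloadcpu` is also made up, following the `{model_code}_tp{tp}_conc{conc}_offload{offloading}` template quoted in the commit.

```python
import re

# Hypothetical pattern mirroring the exp-name template
# f"{model_code}_tp{tp}_conc{conc}_offload{offloading}".
EXP_NAME_RE = re.compile(
    r"(?P<model>.+)_tp(?P<tp>\d+)_conc(?P<conc>\d+)"
    r"_offload(?P<offload>none|cpu|ssd)"
)

# Rough stand-in for the pre-fix behaviour, which expected on/off.
OLD_OFFLOAD_RE = re.compile(r"offload(on|off)")


def parse_exp_name(name):
    """Return the parsed fields of an artifact directory name, or None."""
    m = EXP_NAME_RE.fullmatch(name)
    if m is None:
        return None
    return {
        "model": m["model"],
        "tp": int(m["tp"]),
        "conc": int(m["conc"]),
        "offload": m["offload"],
    }
```

With the old on/off alternation, a name ending in `offloadnone` or `offloadcpu` never matches, which is consistent with the empty `aggregated/` directory the commit reports.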
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * runners(b200-dgxc): SLURM-exclude gpu-10/gpu-15 (stuck CUDA + full fs) Both nodes are currently dropping every job that lands on them: - NCCL barrier dies during sglang Scheduler.init_model_worker with RuntimeError: NCCL error: unhandled cuda error (stale CUDA contexts from a previous job that didn't tear down cleanly) - HuggingFace CAS download for moonshotai/Kimi-K2.5 fails with RuntimeError: Data processing error: CAS service error : IO Error: No space left on device (os error 28) Adding --exclude=gpu-10,gpu-15 to salloc keeps SLURM from allocating to them. Drop this once sa-shared admins clean up the nodes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * agentic: --disable-hybrid-kv-cache-manager when OFFLOADING=cpu vLLM's OffloadingConnector (--kv_offloading_backend native) is incompatible with the hybrid-KV-cache-manager (HMA) for models with mixed attention layouts. When HMA is enabled, the OffloadingConnector init fails with: RuntimeError: Worker failed with error 'Connector OffloadingConnector does not support HMA but HMA is enabled. Please set --disable-hybrid-kv-cache-manager'. This bit kimik2.5-fp4-mi355x's full sweep: every offload=cpu sub-job failed with the above error while every offload=none sub-job passed (see run 25117841192). Kimi-K2.5 uses hybrid attention so HMA kicks in. MiniMax-M2.5 doesn't, which is why its prior cpu-offload sweeps passed even with the broken flag. Switching all 11 cpu-offload launchers to --disable-hybrid-kv-cache-manager is correctness-safe across the board: HMA is a pure optimization, and disabling it is required for OffloadingConnector regardless of model. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* agentic-coding: bump vllm-openai images to v0.19.1 for cpu-offload configs

KV offloading via OffloadingConnector hits multiple upstream bugs on older vllm tags:
- v0.15.1 (gpt-oss-fp4-b200, kimi-int4-b200): flashinfer kv_cache_permute assertion in TRTLLM-attention path
- v0.18.0-rocm (kimi-fp4-mi355x): HMA + OffloadingConnector incompat
- v0.19.0 (minimaxm2.5-fp8 b200/mi355x): not yet verified clean

Bumping to v0.19.1 (or v0.19.1-rocm), proven-good on kimi-fp4-b200 (23/23 sweep PASS) and gptoss-fp4 h100/h200/mi300x/mi325x.

* agentic: minimax-fp8 sweep across all 6 SKUs

Add agentic-coding sections + launchers for MiniMax-M2.5 FP8 across H100, H200, B200, B300, MI300X, MI355X (excluding MI325X). Conc ranges sized from per-SKU GPU KV cache capacity:

KV per token (fp8, 62 layers × 8 KV heads × 128 dim × 2): ~124 KB

Per-SKU GPU cache cap with tp=4 + 0.90 mem-util:
  H100     58 GB -> 0.46M tok (saturate ~conc 6)
  H200    277 GB -> 2.19M tok (saturate ~conc 29)
  B200    461 GB -> 3.63M tok (saturate ~conc 48)
  B300    807 GB -> 6.35M tok (saturate ~conc 85)
  MI300X  500 GB -> 3.93M tok (saturate ~conc 52)
  MI355X  864 GB -> 6.81M tok (saturate ~conc 91)

NVIDIA configs include offload=cpu starting at the saturation point (simple cpu offload via OffloadingConnector requires vllm ≥ 0.19.1). AMD configs do not enable cpu offload: vllm simple offloading isn't supported on the rocm build for these models. AMD pushes offload=none to a higher conc to demonstrate where GPU cache saturates.

Image bumps: h100/h200/mi300x v0.18.0/v0.16.0 -> v0.19.1; b300 v0.19.0-cu130 -> v0.19.1.

* agentic minimax-fp8: drop tp=8, follow fixed-seq-len TPs

vllm v0.19.1 fp8 quantization rejects tp=8 for MiniMax-M2.5: gate/up weight output_size 1536 / tp=8 = 192, not divisible by block_n=128. Same constraint at vllm/model_executor/layers/quantization/fp8.py:638.
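The tp=8 rejection above comes down to simple divisibility arithmetic. The sketch below only reproduces that arithmetic with the numbers from the commit message (output_size 1536, block_n 128); `shard_is_block_aligned` is a hypothetical helper, not vLLM's actual check.

```python
def shard_is_block_aligned(output_size, tp, block_n=128):
    """True if each tensor-parallel shard is a whole number of fp8 blocks.

    Illustrative only: mirrors the arithmetic in the commit message,
    not the code in vLLM's fp8 quantization layer.
    """
    if output_size % tp != 0:
        return False
    per_rank = output_size // tp
    return per_rank % block_n == 0


# TP sizes that keep the 1536-wide gate/up projection block-aligned.
valid_tps = [tp for tp in (1, 2, 4, 8) if shard_is_block_aligned(1536, tp)]
```

At tp=8 the per-rank width is 192, which leaves a remainder of 64 against block_n=128, so only tp=1, 2, and 4 pass this check.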
Per fixed-seq-len reference TPs:
  H100    tp=4 ep=4 (tp=8 ep=8 commented out in fixed-seq-len for fp8)
  H200    fixed-seq-len has only tp=8 (broken on v0.19.1 fp8); winging tp=4
  B200    tp=4 (fixed-seq-len has tp=2,4; tp=2 too tight for agentic ISL)
  B300    tp=4 (primary; fixed-seq-len has tp=1,2,4 with various ep)
  MI300X  tp=4 (fixed-seq-len has tp=2,4)
  MI355X  tp=4 ep=4 (fixed-seq-len has tp=2 ep=2, tp=4 ep=4, tp=8 ep=8)

Concurrency expanded across the saturation cliff for each SKU; cpu offload range extended to 384/512 on NVIDIA where applicable.

* agentic minimax-fp8: trim conc to creep up to per-SKU compute ceiling

Per empirical compute ceilings observed in prior runs (mean in-flight reqs mid-test on each platform):
  H100    tp=4 ep=4  ceiling ~10  (KV cliff ~6 -> cpu zone 6-10)
  H200    tp=4       ceiling ~35  (KV cliff ~29 -> cpu zone 29-35)
  B200    tp=4       ceiling ~50  (KV cliff ~48 -> very narrow)
  B300    tp=4       ceiling ~60  (KV cliff ~85 -> compute saturates first)
  MI300X  tp=4       ceiling ~20  (estimated)
  MI355X  tp=4 ep=4  ceiling ~60

Previous conc lists (1..256, even up to 512) wasted 30-min slots on sub-jobs that just queue 200+ requests waiting on a server only running 4-50 in flight, leading to client-side 600s timeout cascades. New lists "creep up" to 2-3x the ceiling, then stop.

NVIDIA cpu offload range narrowed to the zone between KV cliff and compute ceiling, where offloading can actually relieve KV pressure without compute already being the bottleneck. AMD (mi300x, mi355x) keeps offload=none only.

* agentic minimax-fp8: cliff-dense conc ladders (v4)

Per user feedback: past the compute ceiling, throughput plateaus and extra conc just adds queue depth and client timeouts -- wasted slots. Reallocate sampling budget to densify around the cliff(s) for each SKU.
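The analytical KV cliff used in these ladders rests on the per-token KV footprint quoted in the earlier sweep commit (62 layers × 8 KV heads × 128 dim × 2, fp8). A minimal sketch of that arithmetic follows. It assumes the factor of 2 covers K and V, that fp8 means 1 byte per element, and that "GB" means 10^9 bytes; none of these is stated explicitly in the commit, and the function names are hypothetical.

```python
def kv_bytes_per_token(layers=62, kv_heads=8, head_dim=128, bytes_per_elem=1):
    """Per-token KV footprint; the factor 2 is assumed to be K plus V."""
    return layers * kv_heads * head_dim * 2 * bytes_per_elem


def cache_capacity_tokens(cache_gb, per_token_bytes=None):
    """Tokens that fit in a KV cache of cache_gb (assumed 1 GB = 1e9 bytes)."""
    if per_token_bytes is None:
        per_token_bytes = kv_bytes_per_token()
    return cache_gb * 1e9 / per_token_bytes


per_token = kv_bytes_per_token()          # 126976 bytes, i.e. 124 KiB
h100_mtok = cache_capacity_tokens(58) / 1e6
b200_mtok = cache_capacity_tokens(461) / 1e6
```

Under these assumptions the H100 and B200 rows come out at roughly 0.46M and 3.63M tokens. Some other rows land within about 0.01M of the quoted figures, presumably because the commit used unrounded cache sizes.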
Per-SKU strategy (compute ceiling empirical, KV cliff analytical):
  H100    tp=4 ep=4  ceil 10  KV 6   -> dense 4-12  (sweet spot for cpu demo)
  H200    tp=4       ceil 35  KV 29  -> dense 24-40 (narrow cpu window)
  B200    tp=4       ceil 50  KV 48  -> dense 32-56 (cliffs colocated)
  B300    tp=4       ceil 60  KV 85  -> dense 48-72 (compute first; cpu won't help)
  MI300X  tp=4       ceil 25  KV 52  -> dense 16-32 (compute first; AMD no cpu)
  MI355X  tp=4 ep=4  ceil 60  KV 91  -> dense 48-72 (compute first; AMD no cpu)

Dense step (every 4-8 conc) around the cliffs to resolve the inflection; sparse step (doubling) below the cliffs for baseline; one point ~1.3-1.5x ceiling to confirm plateau.

NVIDIA cpu offload range overlaps with none from KV cliff to ~ceiling for direct same-conc comparison; doesn't extend past 1.3x ceiling.

* agentic minimax: AMD native cpu offload + b300-p1 runner

- AMD launchers (mi300x, mi355x) drop VLLM_USE_SIMPLE_KV_OFFLOAD env var. SimpleCPUOffloadConnector isn't supported on rocm; native OffloadingConnector works (still passes --kv_offloading_backend native flag).
- Add cpu offload entries to AMD master configs (mi300x, mi355x).
- Add b300-p1 runner group (subset of b300 nodes 13-17 with the b300-p1 label) and target it from the b300 minimax config.

* agentic: drop --no-enable-prefix-caching from all launchers

The agentic-coding benchmark IS a prefix-cache benchmark: the whole point is measuring KV reuse across multi-turn conversations and across users (with the per-user salt enabling deterministic prefix overlap). Disabling prefix caching defeats the entire purpose.

Removed from 7 launchers that had it:
  dsv4_fp8_h200.sh
  gptoss_fp4_b200.sh (was in config.yaml)
  kimik2.5_fp4_mi355x.sh
  kimik2.5_int4_b200.sh
  minimaxm2.5_fp4_b200.sh
  minimaxm2.5_fp8_mi300x.sh
  minimaxm2.5_fp8_mi355x.sh

vLLM defaults to prefix caching ON when no flag is passed.

* agentic minimax mi300x/mi355x: switch attention backend to UNIFIED_ATTN

ROCM_AITER_FA was the suspect for both:
1.
Worker dies on cpu offload (gpt-oss using UNIFIED_ATTN works fine on the same launcher pattern + image) 2. Prefix-cache Prometheus counters never increment (observability gap on FA backend, while UNIFIED_ATTN reports correctly on mi300x) Swap to ROCM_AITER_UNIFIED_ATTN to test both fixes in one shot. * agentic minimax b200/b300: extend none past KV cliff for fall-off demo The cpu range needs full overlap with none past the KV cliff so the no-offload throughput collapse is visible at the same conc points where cpu offload sustains throughput. B200 tp=4 (KV cliff conc=48): none: [1,2,4,8,16,32,48,56,64,96,128] (was capped at 64) cpu: [48,56,64,96,128] (was capped at 64) B300 tp=4 (KV cliff conc=85): none: [1,2,4,8,16,32,48,64,96,128,192] (was capped at 96) cpu: [48,64,96,128,192] (was capped at 96) Past the cliff, the no-offload curve should collapse (recompute storm, client-side timeouts), while cpu-offload sustains the compute ceiling. * agentic minimax-fp8-b300: revert to standard b300 runner tag * agentic minimax-fp8-b300: bump cpu DRAM offload to 2.2 TB (B300 has plenty) * agentic minimax-fp8-b300: dense conc 100-124 to resolve cpu offload dropoff * agentic minimax-fp8-b200: bump cpu DRAM offload to 1.5 TB, target b200-dgxc - Add b200-dgxc runner pool (subset of b200 excluding b200-cw / b200-nb). - Switch minimax-fp8-b200-vllm runner from b200 to b200-dgxc. - Hardcode TOTAL_CPU_DRAM_GB=1500 in cpu branch of b200 launcher (1.95x HBM total at tp=4, comfortably above the 1.5x threshold so the offload tier doesn't hit a secondary cliff). * fix(matrix): drop duplicate agentic-coding loop from merge The merge with origin/main pulled in main's agentic-coding loop in generate_test_config_sweep alongside our pre-existing one — both blocks were byte-identical so every sub-job got emitted twice (e.g., b300 generated 60 entries instead of 30). Drop the duplicate block, restore the function's return statement that was lost in the dedup. 
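A doubled matrix like the one in the fix(matrix) commit above (60 entries instead of 30) is cheap to catch before dispatch. The following is a minimal sketch of such a guard; `find_duplicate_entries` is a hypothetical helper and is not part of generate_sweep_configs.py, and the entry keys are illustrative.

```python
import json
from collections import Counter


def find_duplicate_entries(entries):
    """Return each matrix entry (a JSON-serializable dict) emitted more than once."""
    # Serialize with sorted keys so key order does not hide duplicates.
    counts = Counter(json.dumps(entry, sort_keys=True) for entry in entries)
    return [json.loads(key) for key, n in counts.items() if n > 1]


# Illustrative entries only; real matrix entries carry more fields.
matrix = [
    {"tp": 4, "conc": 8, "offloading": "none"},
    {"tp": 4, "conc": 16, "offloading": "none"},
    {"conc": 8, "tp": 4, "offloading": "none"},  # same job, different key order
]
duplicates = find_duplicate_entries(matrix)
```

A generator or validation step could assert that this list is empty, which would have flagged the byte-identical loop introduced by the merge.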
* agentic: dsv4-fp4 B200/B300 initial sweep + restore SCENARIO_SUBDIR on b300-nv Adds agentic trace replay configs and launchers for DeepSeek-V4-Pro fp4 on B200 and B300 via vLLM, mirroring the fixed-seq-len recipe (tp=8 ep=1, no DP-attn) at the low-conc range. Initial conc list [1..64] for none and [16,32,64] for cpu offload; cpu DRAM defaults to 1.5 TB on B200 and 2.2 TB on B300 in the launcher (overrides the workflow 600 GB default). Switches dsv4-fp4-b200-vllm runner from b200-dsv4 (not in our runners.yaml) to b200-dgxc to match the established minimax B200 pattern. Also restores ${SCENARIO_SUBDIR} in launch_b300-nv.sh BENCH_BASE: the post-revert main state landed without it after the v0.1 squash merge, so agentic dispatch on B300 was resolving to benchmarks/single_node/ instead of benchmarks/single_node/agentic/. The b200-dgxc launcher already had this prefix; b300-nv was the asymmetry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * agentic dsv4-fp4: switch B200/B300 to official blog recipe layout (DP=8 EP=8) The first attempt OOM'd at vLLM startup on every conc=64 cpu-offload job (and would have on conc=32 cpu) because I used TP=8 EP=1 with FULL_AND_PIECEWISE + max-num-batched-tokens=2048 + max-cudagraph-capture-size=2048 (copied from the fixed-seq-len recipe). At TP=8 every layer's attention output goes through an NCCL all-reduce; cudagraph capture pre-allocated activation/all-reduce workspace proportional to max-batched-tokens × hidden_dim × layers, consuming ~134 GiB per rank on top of the ~134 GiB DSv4-Pro fp4 weight footprint (1.6T-total / 49B-active model, 800 GiB checkpoint). KV cache profiling then had nothing left to allocate. The official vLLM blog recipe for 8xB200/8xB300 (https://vllm.ai/blog/deepseek-v4) uses DP=8 + EP=8 instead — each rank does its own attention on its own sequences (no per-layer TP all-reduce) and the MoE all-to-all is the only collective. 
Smaller activation workspace at capture time → cudagraph + KV cache both fit. Switching to that layout: - both launchers: drop the TP/DP-attn branching, always --data-parallel-size $TP --enable-expert-parallel; drop the max-cudagraph-capture-size and max-num-batched-tokens overrides (recipe doesn't set them, defaults are fine for DP-only collectives); keep FULL_AND_PIECEWISE + custom_ops=["all"] per recipe; max-model-len pinned at 1M (full DSv4 context — recipe suggests 800K but user wants 1M tested). - nvidia-master.yaml: agentic-coding entries become tp=8 ep=8 dp-attn=true for both B200 and B300; image at the config-block level switches from v0.20.0-cu130 to deepseekv4-cu130 (the DSv4-tuned tag from the recipe). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * agentic dsv4-fp4: keep image at v0.20.0-cu130 (deepseekv4-cu130 not pinned) Per user direction, stay on vllm/vllm-openai:v0.20.0-cu130 instead of the DSv4-tuned deepseekv4-cu130 tag from the blog recipe — that tag isn't currently pinned in this pipeline. Parallelism layout (DP=8 + EP=8) is unchanged from the prior commit since the OOM fix is what actually mattered. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * agentic dsv4-fp4: drop cpu-offload sweep entries (HMA conflict at 1M) cpu-offload jobs hit a clean ValueError at vLLM startup on B300: 442.99 GiB KV cache is needed [for max_model_len=1M], which is larger than the available KV cache memory (104.74 GiB). [...] estimated maximum model length is 236288. The cause is in the warning right above: SimpleCPUOffloadConnector forces --disable-hybrid-kv-cache-manager, which switches off DSv4's per-layer KV compaction (the "drop KV outside the local sliding window" optimization that gives DSv4 its "10% of V3.2's KV per token at 1M" claim). Without HMA, every layer stores full per-token KV and the per-rank budget blows up well below 1M context. 
HMA is DSv4's intended long-context mechanism; leave KV management to it and skip cpu offload until upstream supports HMA + KV connector together. Re-introduce a cpu-offload sweep at lower max-model-len in a follow-up if a meaningful KV cliff appears in the offload=none data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* rm disable hma connector

* agentic dsv4-fp4: enable simple-offload + HMA, restore cpu-offload sweep

Re-enables the cpu-offload path for DSv4-Pro on B200/B300 now that we understand SimpleCPUOffloadConnector (selected via VLLM_USE_SIMPLE_KV_OFFLOAD=1) already inherits SupportsHMA in v0.20.0 (PR #37160 by njhill, merged 2026-04-01). The earlier failure was caused by --disable-hybrid-kv-cache-manager in OFFLOAD_ARGS, which forced HMA off and made vLLM size the KV pool for full per-layer storage (442 GiB needed for 1M context vs 104 GiB available per rank).

Changes:
- Both launchers: drop --disable-hybrid-kv-cache-manager from cpu OFFLOAD_ARGS; add explicit --enable-prefix-caching and --no-disable-hybrid-kv-cache-manager to the vllm serve command (matches PR #37160's documented example).
- nvidia-master.yaml: restore the offloading=cpu search-space entries on both dsv4-fp4-b200-vllm and dsv4-fp4-b300-vllm with conc-list [16, 32, 64], and rewrite the comment to reflect the actual mechanism rather than the prior (incorrect) "wait for upstream HMA + connector support" framing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* runners(b200-dgxc): switch SLURM partition gpu -> gpu-2 (cluster re-partitioned)

The b200-dgxc cluster was re-partitioned: the old "gpu" partition no longer exists. salloc now rejects with "invalid partition specified: gpu", breaking every B200 single-node agentic dispatch.
Current sinfo: cpu cpu-[0-2] all* cpu-[0-2] + gpu-1-* + gpu-2-* (default, mixed) gpu-1 gpu-1-[0-3,5-7,9] (8 idle, gpu-1-4 / gpu-1-8 drained) gpu-2 gpu-2-[0-9] (10 idle, none drained) Land on gpu-2 since it's a clean GPU-only pool with no drained nodes. Drop the --exclude=gpu-10,gpu-15 list — those node names were from the pre-repartition layout (now gpu-1-* / gpu-2-*) and no longer match anything on the cluster. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * agentic dsv4-fp4: pre-divide kv_offloading_size by TP; cpu-only sweep Pre-divides TOTAL_CPU_DRAM_GB by $TP (= DP size, since the launcher passes --data-parallel-size $TP) so each DP engine ends up with its fair share. Without this, each of the 8 DP engines independently torch.zeros + pin_tensor its own ~1500/2200 GB region, blowing past the SLURM memory cgroup limit (direct dmesg evidence on gpu-2-6: 7 separate VLLM::Worker_DP processes OOM-killed in sequence by the cgroup OOM-killer at growing anon_rss values). Root cause is in vllm v0.20.0: - vllm/config/parallel.py defines world_size := TPxPP, with a separate world_size_across_dp := TPxPPxDP property - vllm/distributed/.../simple_cpu_offload_connector.py uses parallel_config .world_size for the divide, picking up TPxPP only - LMCacheConnector explicitly divides by num_kv_ranks (incl DP); Simple's path does not — see vllm/config/vllm.py So with DP=8 EP=8 TP=1, world_size=1 inside each engine, no DP-aware adjustment, and each DP engine commits the full --kv_offloading_size value to physical pinned host RAM. Also temporarily removes the offloading=none agentic-coding search-space entries on both dsv4-fp4-{b200,b300}-vllm — we already have that data from Friday's runs (25234821661, 25234822495). The next dispatch will be cpu-only to validate the host-budget fix without re-running the none cases. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * agentic dsv4-fp4: align parallelism with fixed-seq-len; conditional offload sizing Mirrors the fixed-seq-len recipe's parallelism options for the agentic sweep — pure TP for low-conc / interactivity, DEP (DP-attn + EP-MoE) for high-conc / throughput per the vLLM blog recipe — and adapts the cpu offload sizing logic to the connector's actual divide-by-world_size behavior: - DP-attn=true (DEP modes): each DP engine has parallel_config.world_size=1 (TP×PP only — see vllm/config/parallel.py docstring), so the connector's internal divide is a no-op and each DP engine independently torch.zeros + pin_tensor allocates the full --kv_offloading_size value. Pre-divide TOTAL_CPU_DRAM_GB by $TP (the DP size in this layout) so 8 DP engines × (TOTAL/8) keeps aggregate host commit ≈ TOTAL. - DP-attn=false (pure TP, TP+EP): single engine with world_size=TP. Pass the full TOTAL — the connector's internal divide gives TOTAL/TP per rank and PR #37206's TP-shared mmap keeps the aggregate at TOTAL. Restored conditional PARALLEL_ARGS / EP_ARGS in both launchers (we had removed them when simplifying to DEP-only). Now handles all three modes (pure TP, TP+EP, DEP) cleanly via the matrix's tp / ep / dp-attn fields. Sweep coverage: - B200 (16 jobs): TP=8 + DEP=8, each with both offloading modes - B300 (32 jobs): TP=4, TP=8, DEP=4, DEP=8, each with both offloading modes Conc lists are agentic-scaled (smaller than fixed-seq-len): pure-TP modes sweep [1..32], DEP modes sweep [16..128] (none) and [64..256] / [128..512] (cpu offload, where the larger CPU pool extends the working-set ceiling). 
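The conditional offload sizing described above can be sketched as follows. This is a Python illustration of the arithmetic only; the real logic lives in the shell launchers, and both function names are hypothetical. It assumes, as the commit states, that in DP-attn mode $TP doubles as the DP size and each DP engine pins the full flag value, while in pure TP or TP+EP mode a single engine keeps the aggregate at the flag value.

```python
def kv_offloading_size_gb(total_cpu_dram_gb, tp, dp_attn):
    """Value to pass as --kv_offloading_size for one launcher invocation."""
    if dp_attn:
        # DEP: pre-divide so the DP engines together commit about TOTAL.
        return total_cpu_dram_gb / tp
    # Pure TP / TP+EP: pass TOTAL and let the connector divide per rank.
    return total_cpu_dram_gb


def aggregate_host_commit_gb(flag_value_gb, tp, dp_attn):
    """Approximate pinned host RAM across all engines for a given flag value."""
    engines = tp if dp_attn else 1
    return flag_value_gb * engines


# B200 numbers from the log: 1500 GB total, DEP=8.
without_fix = aggregate_host_commit_gb(1500, tp=8, dp_attn=True)
with_fix = aggregate_host_commit_gb(
    kv_offloading_size_gb(1500, tp=8, dp_attn=True), tp=8, dp_attn=True
)
```

Without the pre-divide, eight engines would each try to pin 1500 GB (12000 GB in aggregate), which matches the cgroup OOM kills reported in the earlier commit. A shell launcher would likely use integer division rather than the float division shown here.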
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * agentic dsv4-fp4: enable lazy_offload to mitigate popleft_n assertion Server logs from the prior multi-parallelism run showed the cpu-offload failure mode is an AssertionError in vllm/v1/core/kv_cache_utils.py:269 (popleft_n: curr_block is not None) — the FreeKVCacheBlockQueue's linked list and num_free_blocks counter get out of sync under DSv4 + 1M max_model_len + cpu offload + sustained eviction pressure. The eager offload path (default) does the store bookkeeping inline with each step, which races with the scheduler's free-block accounting. Switch from --kv_offloading_size convenience flag to explicit --kv-transfer-config JSON so we can pass lazy_offload=true (PR #37160's documented option) alongside cpu_bytes_to_use. Lazy mode defers the store path and avoids the race that triggers the assertion. Also temporarily drop the offloading=none search-space entries — they already validated cleanly in run 25332045030 (B200 TP=8 + DEP=8 all 100%) so this iteration focuses solely on cpu offload paths to confirm the mitigation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * agentic dsv4-fp4: bump image to v0.20.1, revert to eager offload lazy_offload (PR #37160 option) was a partial fix for the popleft_n assertion: across last run's 18 cpu jobs: - low/mid conc cases that were 0% in eager went to 80-100% - but high-conc DEP=8 cases regressed (256 went 992/992 -> 212/477, new failure mode: cuMemcpyBatchAsync err=719 cudaErrorIllegalAddress in the deferred-batch copy path of the simple connector's worker) So eager has a scheduler/eviction race (popleft_n at low conc, OK at very high conc), and lazy has a CUDA-async race (OK at low conc, illegal-address at very high conc). Different bugs in different code paths of the same connector. v0.20.1 was published today (2026-05-04) and includes all 13 parts of the [kv_offload+HMA][N/N] series cleanly merged. 
Try the upstream's own latest release with eager (default) to see if either bug is fixed. v0.20.1 only ships cu129 (no cu130 variant yet); cu129 supports Blackwell and should run on B200/B300.

Revert OFFLOAD_ARGS to the --kv_offloading_size convenience flag (eager default; lazy_offload was the only reason we needed the JSON form).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* agentic dsv4-fp4: revert to v0.20.0-cu130 + lazy_offload, scale max-num-seqs per-engine

v0.20.1 (cu129) iteration was strictly worse:
- Same popleft_n AssertionError still fires
- Model load 12x slower on Blackwell (588s vs 46s on v0.20.0-cu130)
- All 6 B200 cpu jobs got 0/9 trace-replay success

Revert image to v0.20.0-cu130 and re-enable lazy_offload (the best run we had -- B200 mixed 35-100%, B300 mostly 80-100%, with regressions only at very high conc DEP=8 cases).

Add a per-engine --max-num-seqs scaling for DP-attn modes: the trace replay tool's CONC concurrent users load-balance across DP ranks, so each engine actually sees CONC/$TP sequences in steady state. Setting the per-engine cap to that (instead of the global CONC) avoids the scheduler reserving block-pool capacity for sequences that won't materialize on this engine -- which may amplify the eviction race that hurt high-conc DEP cases in the prior lazy_offload run. Pure TP modes are a single engine and keep --max-num-seqs = $CONC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* add aiperf submodule (cquil11/aiperf @ ajc/inferencex-agentx-mvp)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* agentic: migrate from kv-cache-tester to aiperf (live-assistant default)

All 26 agentic launchers now drive aiperf via the inferencex-agentx-mvp scenario instead of trace_replay_tester.py.
Live-assistant mode is on by default (AIPERF_DATASET_WEKA_LIVE_ASSISTANT_RESPONSES=1) so the server's just-generated KV blocks survive turn boundaries and the measured cache-hit rate reflects what a real agentic user would experience.

Changes:
- utils/aiperf submodule pointer bumped to cjq/weka-live-assistant-responses (29418ea6) and .gitmodules branch tracking updated.
- benchmarks/benchmark_lib.sh: build_replay_cmd, install_agentic_deps, resolve_trace_source rewritten. The 26 single-node + 1 multinode launcher scripts inherit the change via sourced helpers -- none of them need per-script edits. Helper signatures (REPLAY_CMD, TRACE_SOURCE_FLAG) preserved.
- utils/process_agentic_result.py: full rewrite, consumes aiperf's profile_export.jsonl + profile_export_aiperf.json + server_metrics_export.json. Output JSON key schema preserved so utils/summarize.py and other aggregators keep working without edits. Theoretical cache-hit rate and output_tokens_expected are computed from trace metadata in the local HF cache (independent of which mode aiperf runs in).
- utils/test_process_agentic_result.py: new fixture-driven unit test suite (6 tests) covering schema parity with summarize.py, ms→s unit conversion, throughput-per-GPU derivation, missing-server-metrics graceful path, response_cache_hit_rate from cached_tokens, and the per-run subdir layout that --num-profile-runs > 1 produces.

The legacy utils/trace-replay submodule is left on disk for fallback; no scripts reference it anymore.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* agentic: --ignore-installed for aiperf editable install

vLLM container has apt-managed `blinker` (and likely other distutils packages) that pip refuses to uninstall when one of aiperf's transitive deps tries to upgrade them, killing `pip install -e ./utils/aiperf` mid-install. `--ignore-installed` lets pip install our newer copy fresh into site-packages without touching the apt-managed version.
Safe in this context -- we own the container, system blinker isn't load-bearing for the benchmark, and pip's import order picks up the newer copy first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* agentic: pass --num-dataset-entries 739 to aiperf

Default is 100 -- without the flag the loader silently caps the weka corpus to the first 100 traces of 739, limiting diversity and making recycled-trace cache-hit math weird. The inferencex-agentx-mvp scenario doesn't lock this setting (only locks 7 things; this isn't one).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* bump aiperf: terminate trajectory on context-overflow

Pulls in aiperf b9f44eac which makes AgenticReplayStrategy recycle a trajectory on the first context-length error instead of continuing to dispatch turns whose prompts will all also overflow. Matches kv-cache-tester's "user truncated" semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* agentic: pass --cache-bust system_prefix to aiperf

The inferencex-agentx-mvp scenario validator requires cache_bust.target=system_prefix but doesn't auto-default it -- it just checks that user-supplied config matches. Without the flag, scenario validation rejects the config at startup with a value error. The tutorial example also passes it explicitly; I dropped it earlier thinking the scenario auto-supplied the value, which it doesn't.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* bump aiperf: fix realtime-stats AttributeError crash loop

Pulls in aiperf dc943e7e which fixes the every-30s AttributeError in _render_realtime_block (CreditPhaseStats vs PhaseRecordsStats type mismatch). Crash was non-fatal -- benchmarks were running fine -- but flooded logs with tracebacks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* agentic: auto --unsafe-override when DURATION < 900s

The inferencex-agentx-mvp scenario enforces a 900s minimum benchmark duration. For smoke tests / iteration / debugging at shorter durations, auto-opt into --unsafe-override so the run starts. The result will have submission_valid=false with reason "unsafe_override" -- that's the expected and documented behavior for non-canonical runs.

Also support AIPERF_UNSAFE_OVERRIDE=true as an explicit toggle for durations >= 900s when the user wants to override other locked settings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* agentic CI: upload aiperf artifacts (not kv-cache-tester paths)

The agentic-coding raw-results upload step was still listing kv-cache-tester filenames (detailed_results.csv, metrics_*.csv, etc.) which the new aiperf-driven pipeline doesn't produce. Replace with the aiperf artifact set under results/trace_replay/:

- profile_export.jsonl -- per-record metrics stream
- profile_export_aiperf.json/csv -- aggregate stats with metadata
- profile_export_aiperf_timeslices -- windowed stats
- profile_export_aiperf_aggregate -- multi-run aggregate (when N>1)
- profile_export_aiperf_collated -- per-run collated payloads
- profile_export_raw.jsonl -- raw request/response bodies
- server_metrics_export.{json,csv,jsonl,parquet} -- Prometheus scrape
- gpu_telemetry_export.jsonl -- GPU telemetry stream
- inputs.json -- pre-formatted request bodies

if-no-files-found: ignore is preserved so the step is robust when a specific output type isn't enabled in a given run.
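The duration gate from the `auto --unsafe-override` commit above can be sketched like this. The real helper is shell code in benchmark_lib.sh; this Python version is only an illustration, and the names `needs_unsafe_override` and `override_flags` are hypothetical. The 900s threshold, the `--unsafe-override` flag, and the `AIPERF_UNSAFE_OVERRIDE` variable come from the commit message.

```python
from typing import Dict, List, Optional

# Minimum duration the scenario enforces, per the commit message.
MIN_SCENARIO_DURATION_S = 900


def needs_unsafe_override(duration_s: int, explicit_toggle: Optional[str] = None) -> bool:
    """True when the run must opt out of the scenario's locked settings."""
    if (explicit_toggle or "").strip().lower() == "true":
        # Explicit opt-in, meant for durations >= 900s with other overrides.
        return True
    # Short smoke/debug runs opt in automatically.
    return duration_s < MIN_SCENARIO_DURATION_S


def override_flags(duration_s: int, env: Dict[str, str]) -> List[str]:
    """Extra CLI flags to append to the replay command (hypothetical helper)."""
    if needs_unsafe_override(duration_s, env.get("AIPERF_UNSAFE_OVERRIDE")):
        return ["--unsafe-override"]
    return []
```

A run started this way is expected to report submission_valid=false, so the sketch is only appropriate for non-canonical runs.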
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* agentic CI: drop multi-GB debug artifacts from upload

inputs.json (multi-GB pre-formatted request bodies) and profile_export_raw.jsonl (full HTTP request/response capture) were inflating per-run artifact size to ~7 GB on the weka corpus. Neither is consumed by the post-processor or any downstream tool -- they're offline forensics artifacts that can be rebuilt from --public-dataset + --random-seed when needed.

Drops upload size to ~50-100 MB / run. Post-processing schema unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* process_agentic_result: fix server_metrics_export.json schema parsing

The post-processor crashed with `AttributeError: 'str' object has no attribute 'get'` at the end of every run because _index_server_metrics iterated the top-level "metrics" value as if it were a list of metric dicts, when it's actually a dict keyed by metric name.

Real aiperf v0.8 schema (per docs/server-metrics/server-metrics-json-schema.md):

    {"metrics": {<name>: {"type": ..., "series": [{"stats": {...}}]}}}

Rewrite:
- _index_server_metrics returns the metrics dict as-is
- _final_value walks series[i].stats[stats_key] (counters use "total", gauges fall back to "max"/"avg") and aggregates across series so multi-endpoint deployments correctly sum counters

Two new regression tests use the real schema shape (single-series and multi-series) so future schema drift fails fast in CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* process_agentic_result: support traces.jsonl + handle vLLM no-cached-tokens

Two fixes informed by the first successful smoke run (25404442007):

1. The published HF dataset ships a single traces.jsonl (one trace per line), not per-trace *.json files. _hf_traces_dir was filtering for *.json only so theoretical_cache_hit_rate stayed None even though the corpus was downloaded. Add a JSONL path and accept either layout.
2. vLLM v0.19.1 doesn't populate cached_tokens in the OpenAI usage response field, so usage_prompt_cache_read_tokens isn't in any per-record metric. response_cache_hit_rate stays None -- the server-side Prometheus scrape (vllm:prefix_cache_{hits,queries}) is the actual source of truth for this benchmark, and that path now works (89.8% measured on iter1).

Adds two unit tests covering the JSONL trace layout end-to-end (load, walk hash_ids per turn, derive hits / output_tokens_expected).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* agentic: generate metrics_plots.png from aiperf artifacts post-run

kv-cache-tester's metrics_plots.png was a 6x2 grid showing TTFT, E2E, ITL, ISL/OSL distributions, and server-side cache + KV usage time series. Replicate the same visual at ~150 KB per run from aiperf's profile_export.jsonl + server_metrics_export.json.

Panels:
1. TTFT vs request time (scatter + rolling avg)
2. E2E latency vs request time (scatter + rolling avg)
3. Inter-token latency vs request time (scatter + rolling avg)
4. ISL / OSL token-count distributions (overlaid histograms)
5. Server prefix-cache hit rate over time (timeslices when present, else flat-line aggregate fallback)
6. vLLM KV cache usage % over time (same fallback)

Best-effort: matplotlib is loaded with the Agg backend (headless), and write_agentic_result_json swallows non-zero exits so missing matplotlib in stripped-down container images doesn't break the launcher's success gate.

Tested locally against iter1's artifacts -- produces a 149 KB PNG in <1s. The PNG is added to the agentic-coding artifact-upload list so the GH Actions run-page surfaces it directly.
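Both the post-processor and the plotter depend on the shape of server_metrics_export.json quoted in the schema-parsing fix above. A minimal sketch of walking that shape, assuming the `type` field takes the value "counter" for counters and that gauges are reduced with a max across series (both are assumptions, as are the function names; only the summing of counters is stated in the commit). The metric names follow the `vllm:prefix_cache_{hits,queries}` spelling used in this log.

```python
from typing import Any, Dict, Optional


def final_value(metrics: Dict[str, Any], name: str) -> Optional[float]:
    """Reduce one metric's per-series stats to a single number."""
    entry = metrics.get(name)
    if not isinstance(entry, dict):
        return None
    is_counter = entry.get("type") == "counter"  # assumed type label
    keys = ("total",) if is_counter else ("max", "avg")
    values = []
    for series in entry.get("series", []):
        stats = series.get("stats", {})
        for key in keys:
            if stats.get(key) is not None:
                values.append(float(stats[key]))
                break
    if not values:
        return None
    # Counters sum across endpoints; max for gauges is an assumption.
    return sum(values) if is_counter else max(values)


def prefix_cache_hit_rate(export: Dict[str, Any]) -> Optional[float]:
    """Server-side hit rate from the scraped hit and query counters."""
    metrics = export.get("metrics", {})
    hits = final_value(metrics, "vllm:prefix_cache_hits")
    queries = final_value(metrics, "vllm:prefix_cache_queries")
    if hits is None or not queries:
        return None
    return hits / queries
```

Returning None when a metric is absent mirrors the "missing-server-metrics graceful path" the test suite is described as covering.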
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* bump aiperf: enrich realtime stats line (input throughput + ISL/OSL)

Pulls in aiperf 26d1e3ad which adds tput_in to the headline and an ISL/OSL average row to the realtime stats block -- closer match to what kv-cache-tester showed at assessment periods.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* process_agentic_result: accept ``out`` alias in trace metadata

The on-disk weka trace JSON uses the keys ``in`` / ``out`` -- these are the Pydantic loader aliases for input_length / output_length. The metadata loader was only looking for the underscored field name, so ``output_length`` ended up zeroed out for every turn and mean_output_tokens_expected reported 0 in iter2's agg JSON.

Fall back to ``out`` when ``output_length`` is missing. Test fixture updated to mix both spellings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* bump aiperf: realtime ITL row uses scalar inter_token_latency

Pulls in aiperf 64a8b0a8. The renderer was looking up the per-record list metric ``inter_chunk_latency`` (which doesn't aggregate into realtime percentiles), so the ``itl`` row was always dashes mid-run. Switch to the scalar ``inter_token_latency`` so live ITL p50/p95/p99 populate the same way TTFT and E2E do.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* agentic: 6x2 metrics_plots.png matching kv-cache-tester layout

Rewrites the post-run plotter to produce the same 12-panel figure kv-cache-tester emitted, panel-for-panel:

    Row 0: KV Cache Utilization             | Request Queue Depth
    Row 1: Prefix Cache Hit Rate            | Throughput (Total + Decode)
    Row 2: KV Offload Transfer Rate         | Cumulative Prefill Source Breakdown
    Row 3: KV Offload GPU→CPU (cum.)        | KV Offload CPU→GPU (cum.)
    Row 4: TTFT vs Time                     | Latency vs Time
    Row 5: Interactivity (1/TPOT) vs Time   | Preemptions Over Time

Time-series data comes from aiperf's server_metrics_export.json ``timeslices`` (per-series per-window stats). To populate them we now pass ``--slice-duration 1.0`` on every run -- matches kv-cache-tester's poll_interval=1.0 cadence so the visualizations are directly comparable.

Per-record TTFT/Latency/Interactivity panels read from profile_export.jsonl ``time_to_first_token`` / ``request_latency`` / ``inter_token_latency`` metrics.

Offload-related panels (rows 2-3) render empty when the connector isn't enabled -- vLLM doesn't expose vllm:kv_offload_bytes_* in that case. Same behavior as kv-cache-tester.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* bump aiperf: realtime tput row shows total_token_throughput

Pulls in aiperf 5af870fb. Replaces the dashed ``tput_in=-`` field with ``tput_total=N/s`` (input + output tokens / wall-clock) since aiperf doesn't expose input throughput as its own aggregate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* bump aiperf: add input_token_throughput metric

Pulls in aiperf 7e209913. The new system-level input_token_throughput metric (= total input tokens / benchmark duration) lets the realtime block show prefill TPS -- was rendering as ``tput_in=-`` because the aggregate metric simply didn't exist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* bump aiperf: live server-metrics row in realtime block

Pulls in aiperf b6ebc19f which adds a "srv" row to the realtime stats block showing live cumulative prefix-cache hit rate, KV cache usage, and preemption counts from the /metrics scrape -- same as kv-cache-tester showed at assessment intervals.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* agentic: bump aiperf setup timeout + cap reconstruction workers

Root cause of mass cancellations on the parallel H100 sweep: 16 launchers each spawning ~16 weka-reconstruction subprocess workers (auto-picked based on CPU count) thrashed the shared HF cache, pushing the 300s default ``AIPERF_DATASET_CONFIGURATION_TIMEOUT`` past its limit. Symptom was identical TimeoutError -> AIPerfMultiError ("Failed to perform operation 'Configure Profiling'") on most jobs ~15 min in.

Two env-var fixes in build_replay_cmd:

- AIPERF_DATASET_CONFIGURATION_TIMEOUT=900 -- give setup 15 min headroom. The 5-min smoke runs measured ~5 min for setup at low contention; at high parallel contention we expect 8-12 min. 900s is a safe ceiling.
- AIPERF_DATASET_WEKA_PARALLEL_WORKERS=4 -- cap per-job reconstruction worker count so 16 parallel jobs only ever spawn 64 reconstruction workers (vs the default 256). Smoke-run measurements showed reconstruction time was bottlenecked on shared filesystem I/O, not CPU -- capping workers to 4 was within noise of the auto-picked 16.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* agentic: also bump AIPERF_SERVICE_PROFILE_CONFIGURE_TIMEOUT in lockstep

aiperf validates at startup that SERVICE_PROFILE_CONFIGURE_TIMEOUT (default 300s) must be >= DATASET_CONFIGURATION_TIMEOUT. The previous commit bumped only DATASET to 900, so the validator rejected the config:

    AIPERF_SERVICE_PROFILE_CONFIGURE_TIMEOUT: 300.0 must be greater than or equal to AIPERF_DATASET_CONFIGURATION_TIMEOUT: 900.0

Bumps both to 900 so they're consistent. Same justification: parallel-job dataset reconstruction needs >5 min headroom on contended runners.
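The lockstep requirement described in this commit can be illustrated with a small pre-flight check. This is a sketch of the invariant only, not aiperf's validator; the helper names `lockstep_timeouts` and `check_timeouts` are hypothetical, while the two environment variable names and the 300s defaults are taken from the messages above.

```python
from typing import Dict

DATASET_KEY = "AIPERF_DATASET_CONFIGURATION_TIMEOUT"
SERVICE_KEY = "AIPERF_SERVICE_PROFILE_CONFIGURE_TIMEOUT"
DEFAULT_TIMEOUT_S = 300.0  # default for both, per the commit messages


def lockstep_timeouts(setup_timeout_s: float = 900.0) -> Dict[str, str]:
    """Env vars that raise both timeouts together (hypothetical helper)."""
    value = str(setup_timeout_s)
    return {DATASET_KEY: value, SERVICE_KEY: value}


def check_timeouts(env: Dict[str, str]) -> None:
    """Raise if the service timeout is shorter than the dataset timeout."""
    dataset = float(env.get(DATASET_KEY, DEFAULT_TIMEOUT_S))
    service = float(env.get(SERVICE_KEY, DEFAULT_TIMEOUT_S))
    if service < dataset:
        raise ValueError(
            f"{SERVICE_KEY}: {service} must be greater than or equal to "
            f"{DATASET_KEY}: {dataset}"
        )
```

Setting only the dataset timeout reproduces the rejected configuration; setting both through one helper keeps them from drifting apart.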
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* bump aiperf: input_token_throughput shows in realtime (console_group fix)

Pulls in aiperf 56e82092. The new input_token_throughput metric was being computed but ``filter_display_metrics`` dropped it because ``console_group = MetricConsoleGroup.NONE`` was set on the class. Removed the override so it matches OutputTokenThroughputMetric and flows through the realtime display path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* bump aiperf: SIGUSR1 stack-dump diagnostic + NameError fix

Pulls in aiperf 95d25ccf:
- SIGUSR1 -> faulthandler.dump_traceback_all on every aiperf process (system controller + bootstrap_and_run_service subprocesses) for diagnosing the high-conc warmup hang from outside the process.
- Fix latent NameError in records_manager realtime_snapshot exception handler.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* agentic: drop --apply-chat-template; expand log uploads

The chat-template tokenization path in inference_result_parser.py serializes apply_chat_template through asyncio.to_thread on the default thread pool. On minimax-m2.5 (large vocab + complex template) at conc>=16 this wedges the records pipeline -- only one record squeezed through warmup before silence in the 2026-05-06 conc16 repro. Server usage already covers ISL/OSL so the flag is lossless to drop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* bump aiperf: skip DAG spawn during WARMUP (fixes warmup hang)

aiperf ddc9711d: orchestrator's intercept() now early-returns for WARMUP credits.
Without this, the DAG-spawn path leaked _descendant_counts (warmup never advances child turns to is_final_turn) and wedged all_credits_returned_event indefinitely, manifesting as the "sent=16, completed=0, in_flight=16, then silence" hang reproduced 100% at conc=16 on H100 and b200-nb.

Also add scripts/debug_*.sh to .gitignore so local debug scripts that may contain secrets can't be staged by accident.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* agentic: use server-reported token counts for ISL/OSL

Adds --use-server-token-count to build_replay_cmd. Replaces the previous --apply-chat-template path (which was already removed) with the cleaner fix: bypass client-side tokenizer.encode entirely and trust the server's usage fields. Auto-enables stream_options.include_usage on chat endpoints. Eliminates the per-record tokenization CPU pin on large-vocab models like minimax-m2.5 and avoids any future ISL/OSL divergence between client tokenizer and server.

ignore_eos is already enforced upstream by the inferencex-agentx-mvp scenario (require_ignore_eos=True auto-injects extra_inputs.ignore_eos=true).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* bump aiperf: realtime block adds p75 + per-user tput percentiles + cumulative totals

aiperf 8157cb06: latency rows now include p75; new per-user throughput rows (tin/tout) show prefill/decode-per-user spread at p50/p75/p95/p99; new tot row shows cumulative total_isl / total_osl since phase start.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* bump aiperf: realtime block uses raw metrics so tin row shows values

aiperf 77e4d2f2: realtime block reads raw_metrics instead of the dashboard-filtered set, so prefill_throughput_per_user (which has console_group=NONE) is no longer dropped before render -- the tin row now populates with p50/p75/p95/p99 instead of dashes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* amd: bump minimax mi355x/mi300x/mi325x to vllm rocm nightly with SimpleCPUOffload

Pins vllm/vllm-openai-rocm:nightly-51f22dcfd068fe8f1e3192da2a1e825b930223cf, which includes vllm-project/vllm@20cac26b ("[Bug fix][KV Connector] add cpu_offload_blocks > 0 check before maybe_run_layer_kv_offload"). Without this commit, ROCm was forced onto a different KV-offload code path than NVIDIA, so cpu-offload sweep points weren't apples-to-apples across vendors.

mi325x previously was on v0.18.0 -- also bumps it forward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* amd: use SimpleCPUOffloadConnector path for minimax mi300x/mi325x/mi355x

Sets VLLM_USE_SIMPLE_KV_OFFLOAD=1 in the cpu-offload branch of all three AMD agentic launchers so they exercise the same code path as the NVIDIA launchers. This requires vllm/vllm-openai-rocm:nightly-51f22dcfd0... (vllm-project/vllm@20cac26b) which is now pinned in amd-master.yaml.

Also adds the missing minimaxm2.5_fp8_mi325x.sh that runners/launch_mi325x-amds.sh expects (cloned from mi300x).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* amd: add agentic-coding scenario to mi325x minimax config

The MI325X minimax config had only fixed-seq-len scenarios, so the e2e dispatch generated zero agentic configs (matrix expanded empty, run completed in 10s with nothing exercised). Mirror the MI300X agentic-coding search space (tp=4, conc 1..48 none + 16..32 cpu) so cross-vendor comparison stays consistent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* bump aiperf: realtime row adds cpu_kv_usage, queue depth, sglang preempt

Picks up:

    894062b6 realtime: surface cpu_kv_usage, queue depth, sglang retractions

Realtime srv row now renders cpu_kv_usage / queue / preemptions parts conditionally (each token only when its backing /metrics gauge or counter was actually scraped).
SGLang servers get preemption coverage via sglang:num_retracted_reqs fallback.

Companion doc updates land in docs/server-metrics/* describing the new tokens and the underlying vllm:cpu_cache_usage_perc + vllm:external_prefix_cache_* metrics.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* mi355x agentic: bump cpu-offload TOTAL_CPU_DRAM_GB to 2000

MI355X nodes have plenty of host DRAM; the workflow default (600 GB) clips the offload tier well below physical capacity. Bump to 2 TB to match the b300 path so we actually push the cpu-offload sweep into spillover regimes instead of capping out on the connector size.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* mi300x agentic: bump cpu-offload TOTAL_CPU_DRAM_GB to 1000

MI300X nodes have ample host DRAM; the workflow default (600 GB) clips the offload tier below physical capacity. Bump to 1 TB to match the mi355x (2 TB) / b300 (2.2 TB) pattern so the cpu-offload sweep can actually push into spillover regimes.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* bump aiperf: pick up AJC's gap-closer stack on inferencex-agentx-mvp

FF to ai-dynamo/aiperf:ajc/inferencex-agentx-mvp tip (70fecb2e). Notable fixes layered on top of our cjq/weka-live-assistant-responses base (894062b65):

- fix(composer): clamp max_tokens >= 1 so tool-only weka turns (out:0) don't get rejected by the server (matches kv-cache-tester semantics).
- feat(input): --max-context-length pre-filters oversized conversations at dataset-load time, complementing the existing mid-turn overflow recycle.
- agentic_replay: recycle queue now spans the full dataset with an active-trace skip set; recycled lanes draw from full diversity without ever spawning a duplicate concurrent session for the same trace.
- fix(scenario): cache-bust target -> FIRST_TURN_PREFIX preserves the stable system+tools KV prefix while still differentiating lanes that land on the same trace_id; closes most of the measured-vs-theoretical cache-hit gap.
- refactor(accumulator): hoist realtime_snapshot's inner closures to staticmethods to stay under the ergonomics cap (behavior unchanged).

Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* kimi agentic: 2 TB cpu offload, rocm nightly image for MI355X

- kimik2.5_fp4_mi355x.sh, kimik2.5_int4_b200.sh: hardcode TOTAL_CPU_DRAM_GB=2000 in the cpu offload branch (workflow default 600 GB clipped the offload tier well below physical capacity on both SKUs).
- amd-master.yaml kimik2.5-fp4-mi355x-vllm: bump image to vllm/vllm-openai-rocm:nightly-51f22dcfd068fe8f1e3192da2a1e825b930223cf so SimpleCPUOffloadConnector actually works on ROCm -- matches the minimax MI355X pattern.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* kimi b200 vllm: pin runner to b200-dgxc

The generic `b200` label pulls whatever B200 host has capacity at dispatch time; the agentic-coding sweep needs the DGXC node (where the minimax B200 and dsv4 B200 configs already land) so the cpu-offload sweep points have predictable host DRAM.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* kimi agentic: right-size cpu offload to host DRAM envelope

- B200: 1800 GB (was 2000 GB; host has ~1.8 TiB available for offload).
- MI355X: 2700 GB (was 2000 GB; host has ~2.7 TiB available for offload).

Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* agentic build_replay_cmd: drop hardcoded --cache-bust system_prefix

The aiperf inferencex-agentx-mvp scenario was updated by AJC's stack to auto-inject locked scenario fields (commit 91651257) and to require cache_bust.target=first_turn_prefix (commit b55d8846). The line in build_replay_cmd was still passing --cache-bust system_prefix explicitly, which the new validator rejects with:

    Value error, Scenario invariants violated (1 conflict):
    - --cache-bust: got 'system_prefix', required 'first_turn_prefix'

Universal failure mode: every agentic-coding sweep point on every SKU tripped this before vLLM even started.
Drop the explicit flag and let the scenario plugin auto-inject the correct target.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* agentic build_replay_cmd: pass --tokenizer-trust-remote-code

aiperf's dataset manager loads the model tokenizer for trace-prompt tokenization independent of --use-server-token-count (which only disables the inference-side parser tokenizer). Models that ship a custom tokenizer in their HF repo -- kimi (amd/Kimi-K2.5-MXFP4, moonshotai/Kimi-K2.5) is the immediate case -- fail to load without trust_remote_code=True:

    TokenizerError: Failed to load tokenizer 'amd/Kimi-K2.5-MXFP4'
    ValueError: The repository ... contains custom code which must be executed to correctly load the model. Please pass trust_remote_code=True to allow custom code to be run.

Set the flag unconditionally -- it's benign for models that don't ship custom tokenizer code, and aiperf's own error panel suggests exactly this remedy.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* kimi b200 int4: bump vllm to v0.20.2 to fix flashinfer MoE INT4 bug

v0.19.1 trips a bug in flashinfer_trtllm_mxint4_moe during warmup profile_run on the agentic-coding path (max_model_len=131072 + prefix caching enabled):

    File "vllm/model_executor/layers/quantization/utils/flashinfer_mxint4_moe.py", line 264
      ).to(x.dtype)
    AttributeError: 'list' object has no attribute 'to'

100% of agentic jobs failed at startup across all TP/conc/offload combinations on v0.19.1. v0.20.x is reported to carry the flashinfer return-shape fix. Trying v0.20.2 before falling back to disabling VLLM_USE_FLASHINFER_MOE_INT4 (which would cost throughput).

Scope: only nvidia b200 kimi int4. MI355X is on the ROCm nightly image, unaffected.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>

* aiperf: bump to cc-traces-weka-no-subagents-051226 dataset

Submodule pointer bump only.
Points the agentic-coding scenario at the new 949-trace no-subagents corpus uploaded to semianalysisai/cc-traces-weka-no-subagents-051226 in place of the prior 739-trace cc-traces-weka-042026 full-subagent dataset.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* kimi-fp4-b200: wire up agentic-coding scenario + bump dataset to 949

Three threads in one commit since they all gate the FP4 B200 Kimi run:

- nvidia-master.yaml: add agentic-coding scenario to kimik2.5-fp4-b200-vllm mirroring its fixed-seq-len TP layouts (tp=8 conc=4, tp=4 conc 4..64) plus offloading variants for the much-larger agentic ISLs. Bump image v0.17.0 -> v0.20.2 to match the INT4 sibling (flashinfer fix) and runner b200 -> b200-dgxc.
- kimik2.5_fp4_b200.sh: hardcode TOTAL_CPU_DRAM_GB=1800 in the cpu offload branch so the workflow input default (600) is overridden to the B200 DGXC node's actual capacity. Mirrors INT4 launcher.
- benchmark_lib.sh: bump --num-dataset-entries 739 -> 949 and update comments / log lines to reference the new no-subagents corpus (semianalysisai/cc-traces-weka-no-subagents-051226). The aiperf loader registration handles the actual HF repo swap; this is the ceiling cap on traces loaded.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* kimi-fp4-b200: drop tp=4 from agentic-coding (OOM at 131k context)

Empirical: all 7 tp=4 jobs OOM'd in v1 dispatch (run 25759838764) with 'No available memory for the cache blocks'. At tp=4 the FP4 weights take ~62 GB / GPU on B200's 180 GB, leaving ~100 GB headroom. With --max-cudagraph-capture-size=2048 and max-model-len=131072 the graph buffer reservation exhausts that headroom before any KV blocks can be allocated.

Drop tp=4 from agentic-coding and align the tp=8 conc-list with the INT4 B200 sibling (which has the same physical layout and works) so the FP4/INT4 sweeps are directly comparable. The fixed-seq-len entries keep tp=4 -- they run at ISL=1024 where the KV footprint is tiny.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* agentic: remove AIPERF_DATASET_WEKA_PARALLEL_WORKERS=4 cap

The cap was added defensively to avoid 16 jobs * 16 workers = 256 reconstruction processes thrashing a shared HF cache on busy slurm nodes. In practice each agentic job lands on its own --exclusive allocation and owns the node, so the contention scenario doesn't materialize.

Letting aiperf fall back to its default auto-pick (min(cpu_count-1, 16, num_traces)) restores ~4x parallelism on the upfront tokenize-and-reconstruct phase, which is the dominant first-fill cost on the 949-trace corpus. The mmap cache (when enabled) makes this moot from the second job onward, but speeds up the first-fill path that every fresh runner still pays.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* aiperf: bump to flock-serialized mmap-cache populates

Submodule pointer bump only. Picks up the cross-process populate lock so concurrent agentic jobs sharing a cache directory (e.g. all 10 jobs in a B200 FP4 sweep pointed at /lustre/fsw/aiperf_mmap_cache) serialize their tokenize+populate cycle instead of each repeating it. After the first job populates, the other 9 wake on the lock, observe the cached entry under the lock, and read from cache.

Co-Authored-…
1 parent 3e4d6dd commit 370a162

47 files changed

Lines changed: 3752 additions & 513 deletions

.github/configs/amd-master.yaml

Lines changed: 70 additions & 63 deletions
@@ -239,6 +239,10 @@ qwen3.5-fp8-mi355x-sglang:
  search-space:
  - { tp: 2, ep: 2, conc-start: 4, conc-end: 32 }
  - { tp: 4, ep: 1, conc-start: 32, conc-end: 256 }
+ agentic-coding:
+ - duration: 1800
+ search-space:
+ - { tp: 8, ep: 1, offloading: none, conc-list: [1, 2, 4, 8, 16, 32] }

  qwen3.5-fp8-mi355x-sglang-mtp:
  image: lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260414
@@ -327,27 +331,6 @@ qwen3.5-fp4-mi355x-sglang:
  - { tp: 2, conc-start: 4, conc-end: 256 }
  - { tp: 4, conc-start: 4, conc-end: 16 }

- qwen3.5-fp4-mi355x-atom:
- image: rocm/atom:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post
- model: amd/Qwen3.5-397B-A17B-MXFP4
- model-prefix: qwen3.5
- runner: mi355x
- precision: fp4
- framework: atom
- multinode: false
- scenarios:
- fixed-seq-len:
- - isl: 1024
- osl: 1024
- search-space:
- - { tp: 2, conc-start: 4, conc-end: 256 }
- - { tp: 4, conc-start: 4, conc-end: 16 }
- - isl: 8192
- osl: 1024
- search-space:
- - { tp: 2, conc-start: 4, conc-end: 256 }
- - { tp: 4, conc-start: 4, conc-end: 16 }
-
  qwen3.5-fp8-mi300x-sglang:
  image: lmsysorg/sglang:v0.5.10-rocm720-mi30x
  model: Qwen/Qwen3.5-397B-A17B-FP8
@@ -399,13 +382,11 @@ glm5-fp8-mi355x-sglang-mtp:
  - isl: 1024
  osl: 1024
  search-space:
- - { tp: 4, conc-start: 4, conc-end: 128, spec-decoding: mtp }
- - { tp: 8, conc-start: 4, conc-end: 8, spec-decoding: mtp }
+ - { tp: 8, conc-start: 4, conc-end: 64, spec-decoding: mtp }
  - isl: 8192
  osl: 1024
  search-space:
- - { tp: 4, conc-start: 4, conc-end: 128, spec-decoding: mtp }
- - { tp: 8, conc-start: 4, conc-end: 8, spec-decoding: mtp }
+ - { tp: 8, conc-start: 4, conc-end: 64, spec-decoding: mtp }

  glm5-fp8-mi355x-atom:
  image: rocm/atom:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post
@@ -420,12 +401,10 @@ glm5-fp8-mi355x-atom:
  - isl: 1024
  osl: 1024
  search-space:
- - { tp: 4, conc-start: 4, conc-end: 256 }
  - { tp: 8, conc-start: 4, conc-end: 256 }
  - isl: 8192
  osl: 1024
  search-space:
- - { tp: 4, conc-start: 4, conc-end: 256 }
  - { tp: 8, conc-start: 4, conc-end: 256 }

  glm5.1-fp4-mi355x-sglang:
@@ -448,6 +427,11 @@ glm5.1-fp4-mi355x-sglang:
  search-space:
  - { tp: 2, conc-start: 4, conc-end: 256 }
  - { tp: 4, conc-start: 4, conc-end: 16 }
+ agentic-coding:
+ - duration: 1800
+ search-space:
+ # sglang manages KV eviction; mi355x glm5.1 caps at tp=4 conc=16 in fixed-seq, so cap conservatively
+ - { tp: 4, offloading: none, conc-list: [1, 2, 4, 8, 16, 32] }

  glm5.1-fp4-mi355x-atom:
  image: rocm/atom:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post
@@ -526,7 +510,11 @@ kimik2.5-int4-mi300x-vllm:
  - { tp: 8, conc-start: 4, conc-end: 64 }

  kimik2.5-fp4-mi355x-vllm:
- image: vllm/vllm-openai-rocm:v0.18.0
+ # v0.21.0 (released 2026-05-14) supersedes the prior nightly pin
+ # (51f22dcf...) which was carrying the SimpleCPUOffloadConnector ROCm
+ # cpu_offload_blocks > 0 fix. v0.21.0 is much newer than that fix and
+ # includes all subsequent ROCm offload work.
+ image: vllm/vllm-openai-rocm:v0.21.0
  model: amd/Kimi-K2.5-MXFP4
  model-prefix: kimik2.5
  runner: mi355x
@@ -545,6 +533,18 @@ kimik2.5-fp4-mi355x-vllm:
  search-space:
  - { tp: 8, conc-start: 4, conc-end: 64 }
  - { tp: 4, conc-start: 4, conc-end: 64 }
+ # MI355X has 288 GB HBM per GPU (vs MI300X/MI325X smaller, comparable to
+ # B300). Extend the conc sweep upward to probe where the KV cliff sits
+ # with the larger HBM envelope. Restrict to tp=8 for this sweep to halve
+ # job count while still covering the main parallelism config.
+ agentic-coding:
+ - duration: 1800
+ search-space:
+ - { tp: 8, offloading: none, conc-list: [1, 2, 4, 8, 16, 24, 32, 40, 48] }
+ # CPU offload only above the KV cliff. Lower concurrencies fit
+ # entirely on-GPU, so paying the offload-path overhead there would
+ # just slow them down without measuring anything new.
+ - { tp: 8, offloading: cpu, conc-list: [32, 40, 48, 56] }

  kimik2.5-fp4-mi355x-atom:
  image: rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2
@@ -568,7 +568,12 @@ kimik2.5-fp4-mi355x-atom:
  - { tp: 4, conc-start: 4, conc-end: 128 }

  minimaxm2.5-fp8-mi355x-vllm:
- image: vllm/vllm-openai-rocm:v0.19.0
+ # Nightly carrying vllm-project/vllm@20cac26b ("[Bug fix][KV Connector]
+ # add cpu_offload_blocks > 0 check before maybe_run_layer_kv_offload"),
+ # which enables SimpleCPUOffloadConnector on ROCm. Required for the
+ # cpu-offload sweep points to use the same offload path as the NVIDIA
+ # agentic-coding configs.
+ image: vllm/vllm-openai-rocm:nightly-51f22dcfd068fe8f1e3192da2a1e825b930223cf
  model: MiniMaxAI/MiniMax-M2.5
  model-prefix: minimaxm2.5
  runner: mi355x
@@ -589,6 +594,14 @@ minimaxm2.5-fp8-mi355x-vllm:
  - { tp: 2, ep: 2, conc-start: 2, conc-end: 256 }
  - { tp: 4, ep: 4, conc-start: 4, conc-end: 512 }
  - { tp: 8, ep: 8, conc-start: 2, conc-end: 2 }
+ agentic-coding:
+ # MI355X tp=4 ep=4: compute ceiling ~60 (empirical), KV cliff ~91 (analytical).
+ # Compute saturates first; cpu offload likely won't help, but worth confirming.
+ # AMD uses native OffloadingConnector (NOT SimpleCPUOffloadConnector).
+ - duration: 1800
+ search-space:
+ - { tp: 4, ep: 4, offloading: none, conc-list: [1, 2, 4, 8, 16, 32, 48, 56, 64, 72, 96] }
+ - { tp: 4, ep: 4, offloading: cpu, conc-list: [48, 56, 64, 72, 96] }

  minimaxm2.5-fp8-mi355x-atom:
  image: rocm/atom:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post
@@ -611,31 +624,6 @@ minimaxm2.5-fp8-mi355x-atom:
  - { tp: 2, conc-start: 4, conc-end: 256 }
  - { tp: 4, conc-start: 4, conc-end: 256 }

- minimaxm2.5-fp4-mi355x-atom:
- image: rocm/atom:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post
- model: amd/MiniMax-M2.5-MXFP4
- model-prefix: minimaxm2.5
- runner: mi355x
- precision: fp4
- framework: atom
- multinode: false
- scenarios:
- fixed-seq-len:
- - isl: 1024
- osl: 1024
- search-space:
- - { tp: 1, conc-start: 4, conc-end: 1024 }
- - { tp: 2, conc-start: 4, conc-end: 1024 }
- - { tp: 4, conc-start: 4, conc-end: 128 }
- - { tp: 8, conc-start: 4, conc-end: 16 }
- - isl: 8192
- osl: 1024
- search-space:
- - { tp: 1, conc-start: 4, conc-end: 1024 }
- - { tp: 2, conc-start: 4, conc-end: 1024 }
- - { tp: 4, conc-start: 4, conc-end: 128 }
- - { tp: 8, conc-start: 4, conc-end: 16 }
-
  minimaxm2.5-fp4-mi355x-vllm:
  image: vllm/vllm-openai-rocm:v0.19.1
  model: amd/MiniMax-M2.5-MXFP4
@@ -660,7 +648,8 @@ minimaxm2.5-fp4-mi355x-vllm:
  - { tp: 4, conc-start: 4, conc-end: 64 }

  minimaxm2.5-fp8-mi300x-vllm:
- image: vllm/vllm-openai-rocm:v0.16.0
+ # Nightly carrying vllm-project/vllm@20cac26b; see mi355x config above.
+ image: vllm/vllm-openai-rocm:nightly-51f22dcfd068fe8f1e3192da2a1e825b930223cf
  model: MiniMaxAI/MiniMax-M2.5
  model-prefix: minimaxm2.5
  runner: mi300x
@@ -679,9 +668,18 @@ minimaxm2.5-fp8-mi300x-vllm:
  search-space:
  - { tp: 2, conc-start: 4, conc-end: 64 }
  - { tp: 4, conc-start: 4, conc-end: 64 }
+ agentic-coding:
+ # MI300X tp=4: compute ceiling ~25 (estimated, between H100 and H200);
+ # KV cliff ~52. Compute saturates first.
+ # AMD uses native OffloadingConnector (NOT SimpleCPUOffloadConnector).
+ - duration: 1800
+ search-space:
+ - { tp: 4, offloading: none, conc-list: [1, 2, 4, 8, 16, 20, 24, 28, 32, 40, 48] }
+ - { tp: 4, offloading: cpu, conc-list: [16, 20, 24, 28, 32] }

  minimaxm2.5-fp8-mi325x-vllm:
- image: vllm/vllm-openai-rocm:v0.18.0
+ # Nightly carrying vllm-project/vllm@20cac26b; see mi355x config above.
+ image: vllm/vllm-openai-rocm:nightly-51f22dcfd068fe8f1e3192da2a1e825b930223cf
  model: MiniMaxAI/MiniMax-M2.5
  model-prefix: minimaxm2.5
  runner: mi325x
@@ -700,6 +698,15 @@ minimaxm2.5-fp8-mi325x-vllm:
  search-space:
  - { tp: 2, conc-start: 4, conc-end: 64 }
  - { tp: 8, ep: 8, conc-start: 4, conc-end: 256 }
+ agentic-coding:
+ # MI325X tp=4: cloned from MI300X recipe (slightly faster compute,
+ # similar HBM profile). Compute saturates first; cpu-offload window
+ # exercises the SimpleCPUOffloadConnector path enabled by the rocm
+ # nightly. Mirror MI300X conc grid for cross-vendor comparability.
+ - duration: 1800
+ search-space:
+ - { tp: 4, offloading: none, conc-list: [1, 2, 4, 8, 16, 20, 24, 28, 32, 40, 48] }
+ - { tp: 4, offloading: cpu, conc-list: [16, 20, 24, 28, 32] }

  gptoss-fp4-mi300x-vllm:
  image: vllm/vllm-openai-rocm:v0.17.0
@@ -1636,13 +1643,13 @@ dsv4-fp8-mi355x-vllm:
  search-space:
  - { tp: 8, conc-start: 1, conc-end: 1 }

- # Day-0 single-sequence marker for DeepSeek-V4 on ATOM (ROCm/ATOM#650).
- # PR1 of the ATOM DSv4 series still uses torch sparse-attention fallbacks
- # that OOM once warmup/prefill batches multiple requests; keep CONC=1 until
- # the AITER sparse-attention kernel / multi-request path lands upstream.
- # --enforce-eager and ATOM_USE_TRITON_MOE=1 are required on gfx950. Image is
- # the standard atom0.1.2.post MI355X base (matching qwen3.5-fp8-mi355x-atom);
- # the DSv4 PR is overlaid at runtime by dsv4_fp4_mi355x_atom.sh at a pinned SHA.
+ # Day-0 single-sequence marker for DeepSeek-V4 on ATOM (ROCm/ATOM#650).
+ # PR1 of the ATOM DSv4 series: single-sequence only (kv_cache[:1,...]
+ # hardcode), --enforce-eager required, ATOM_USE_TRITON_MOE=1 required on
+ # gfx950. Image is the standard atom0.1.2.post MI355X base (matching
+ # qwen3.5-fp8-mi355x-atom); the DSv4 PR is overlaid at runtime by
+ # benchmarks/single_node/dsv4_fp4_mi355x_atom.sh at a pinned SHA. Sweep
+ # will expand once ATOM PR3 (multi-request) and PR4 (CUDAGraph) land.
  dsv4-fp4-mi355x-atom:
  image: rocm/atom-dev:nightly_202605130853
  model: deepseek-ai/DeepSeek-V4-Pro
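
The agentic-coding entries added in this diff list their concurrencies explicitly through conc-list, with a separate entry per offloading mode. As a rough illustration of how such entries could fan out into individual benchmark jobs, here is a minimal sketch. It assumes one job per listed concurrency; `expand_jobs` and the job-dict shape are hypothetical and are not taken from the repository's sweep generator. The two entries are copied from the kimik2.5-fp4-mi355x-vllm scenario above.

```python
# Search-space entries copied from the kimik2.5-fp4-mi355x-vllm agentic-coding scenario.
SEARCH_SPACE = [
    {"tp": 8, "offloading": "none", "conc-list": [1, 2, 4, 8, 16, 24, 32, 40, 48]},
    {"tp": 8, "offloading": "cpu", "conc-list": [32, 40, 48, 56]},
]


def expand_jobs(search_space, duration):
    """Yield one job dict per (entry, concurrency) pair (assumed expansion rule)."""
    for entry in search_space:
        fixed = {k: v for k, v in entry.items() if k != "conc-list"}
        for conc in entry["conc-list"]:
            yield {**fixed, "conc": conc, "duration": duration}


jobs = list(expand_jobs(SEARCH_SPACE, duration=1800))

# Concurrencies measured both on-GPU and with CPU offload, i.e. the points
# where the two modes can be compared directly.
none_concs = {j["conc"] for j in jobs if j["offloading"] == "none"}
cpu_concs = {j["conc"] for j in jobs if j["offloading"] == "cpu"}
overlap = sorted(none_concs & cpu_concs)

print(len(jobs))  # 13
print(overlap)    # [32, 40, 48]
```

Under this assumed rule the scenario produces 13 jobs of 1800 seconds each, and only concurrencies 32, 40, and 48 are run in both offloading modes, which is consistent with the config comment about enabling CPU offload only above the KV cliff.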
