Commit 370a162
Migrate agentic-coding benchmarks to aiperf v0.2 (#1391)
* chore: agentic benchmark infrastructure (v0.1)
Adds end-to-end agentic-coding benchmark infrastructure on top of the
existing fixed-seq-len harness. New components:
Trace replayer
- New utils/trace-replay submodule (kv-cache-tester @ agentx-minimized)
driving multi-turn HF-dataset traces against any OpenAI-compatible
endpoint at fixed concurrency.
- --debug-trace captures full per-request prompt/response, every
streamed chunk via chunk.model_dump(), and integer token IDs
(apply_chat_template prompt + logprobs.content completion) into
debug_trace.jsonl.
- Per-model delta-field abstraction (gpt-oss → delta.reasoning, default
→ delta.reasoning_content) so reasoning-heavy responses are counted
and appended to conversation history correctly.
- Input-token metric reads server's usage.prompt_tokens (authoritative)
rather than the local apply_chat_template estimate which breaks for
gpt-oss harmony's chat template.
- Per-user 8-token salt prefix on conversation[0] so two in-flight
users replaying the same trace_id don't accidentally share KV-cache
blocks.
- Period summary: counts up elapsed instead of down remaining; replaces
the dispatch-jitter "Wait time" with the trace's true "Inter-turn
time" sourced from RequestMetrics.delay_expected.
- 5s quiesce between warmup completion and metrics-collector start so
warmup-tail prefill doesn't bleed into period 1.
Workflow plumbing
- e2e-tests.yml: workflow_dispatch + workflow_call inputs for
debug-trace (boolean) and duration-override (string seconds), forwarded
to test-sweep-agentic and test-sweep-multi-node-agentic jobs.
- benchmark-tmpl.yml + benchmark-multinode-tmpl.yml: debug-trace input
mapped to DEBUG_TRACE env var; duration override threads through to
matrix.config.duration.
- benchmark_lib.sh: build_replay_cmd / resolve_trace_source /
install_agentic_deps / write_agentic_result_json helpers; consumes
DEBUG_TRACE → --debug-trace.
- runners/launch_*.sh: shared agentic mode dispatch + scenario routing.
- runners/launch_b200-dgxc-slurm.sh → launch_b200-dgxc.sh rename to
match the actual runner.name observed by the workflow.
Result aggregation
- utils/agentic-benchmark/{bench,analysis,scripts}: metrics collector
(vllm/sglang Prometheus parsers), pareto plotter, per-config
distribution analyzer, sweep aggregator.
- utils/process_agentic_result.py: per-job results.json builder.
- utils/matrix_logic: agentic-coding scenario plumbing in
generate_sweep_configs.py + validation.py.
Examples (one per vendor)
- benchmarks/single_node/agentic/dsr1_fp4_b200.sh — NVIDIA.
- benchmarks/single_node/agentic/dsr1_fp4_mi355x.sh — AMD.
- Matching agentic-coding sections in nvidia-master.yaml
(dsr1-fp4-b200-sglang) and amd-master.yaml (dsr1-fp4-mi355x-sglang).
All other model-specific launchers and matrix entries are deliberately
left out of this PR; downstream PRs add them on a per-model basis.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* cleanup
* agentic: rename USERS/users → CONC/conc throughout
Same value, two names — collapse to one. Workflow templates already
exposed both CONC and USERS env vars (USERS was a mirror of inputs.conc),
and the agentic matrix entries carried both `users: int` and
`conc: [users]`. Drop the duplicates and standardize on conc/CONC:
- benchmark-tmpl.yml / benchmark-multinode-tmpl.yml: drop redundant
USERS env var (CONC remains)
- e2e-tests.yml / run-sweep.yml: pass `conc: ${{ matrix.config.conc }}`
to template; build agentic conc-list as `'[${{ matrix.config.conc }}]'`
since matrix.config.conc is now a scalar
- generate_sweep_configs.py: agentic entries emit Fields.CONC.value (int)
only; loop variable renamed from `users` to `conc`; exp-name template
now uses `_conc{N}` instead of `_users{N}`
- validation.py: drop Fields.USERS; agentic Pydantic models use `conc: int`
- process_agentic_result.py: read CONC env var, emit single `"conc"` key
- collect_sweep_results.py: regex updated to match `_conc{N}_offload`
- benchmark_lib.sh / agentic launcher scripts: $USERS → $CONC
The trace-replayer's --start-users / --max-users CLI flags are upstream's
API and are left unchanged; benchmark_lib.sh just passes $CONC into them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* bump trace-replay: kimi tokenizer + reasoning support
Pick up these submodule commits (callanjfox/kv-cache-tester):
- 7b7f883 silence kimi: target the actual loaded-tokenizer module logger
- 5b87e43 silence kimi: replace static logger lookup with content filter
- 3394450 silence Kimi tokenization_kimi.py per-call encode warning
- 7ad6a9e delta-field map: add 'kimi' substring (uses delta.reasoning like gpt-oss)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* agentic: add gptoss + kimik2.5 single-node launchers
5 new agentic-coding launcher scripts brought over from
chore/agentx-integration, with USERS → CONC normalization:
- benchmarks/single_node/agentic/gptoss_fp4_h100.sh
- benchmarks/single_node/agentic/gptoss_fp4_h200.sh
- benchmarks/single_node/agentic/gptoss_fp4_mi300x.sh
- benchmarks/single_node/agentic/gptoss_fp4_mi325x.sh
- benchmarks/single_node/agentic/kimik2.5_fp4_b200.sh
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* agentic: add pareto-plot analysis tooling + extra Python deps
Brings utils/agentic-benchmark/analysis/ (plot_pareto.py — sweep
visualizer for cross-config performance comparison) and updates
requirements.txt with transformers/xlsxwriter/tqdm/datasets/tiktoken
needed by the analyzer + by trace-replay's tokenizer paths.
The bench/ directory is intentionally NOT added: bench/metrics_collector.py
duplicated utils/trace-replay/server_metrics.py and was already removed
on this branch; bench/run_metrics_collector.py depends on it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* configs: add agentic-coding sections for kimik2.5 + gptoss
Adds agentic-coding scenario blocks to the master configs for the
five models whose launchers were just brought over:
- kimik2.5-fp4-b200-vllm (image bumped to v0.19.1)
- gptoss-fp4-h100-vllm
- gptoss-fp4-h200-vllm
- gptoss-fp4-mi300x-vllm
- gptoss-fp4-mi325x-vllm
Each scenario sweeps tp 4/8 (and 1/2 on AMD/H200) at offloading=none for
low/mid concurrency and offloading=cpu for high concurrency, with a
crossover at conc=64. Other agentic-coding sections present on
chore/agentx-integration (trtllm/srt-slurm based) are left for follow-up
since several of the underlying model entries were restructured by main.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* runners: thread SCENARIO_SUBDIR through B200/B300 dispatch
The agentic-coding scenario type uses benchmarks/single_node/agentic/
launchers, gated by SCENARIO_SUBDIR='agentic/' from benchmark-tmpl.yml.
b200-cw, b200-dgxc, b200-nb, and b300-nv all built BENCH_BASE without
honoring SCENARIO_SUBDIR, so dispatch always landed in single_node/
even for agentic runs. Other runners (h100-*, h200-*, mi*) already had
this plumbing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* agentic: add launchers + master configs for 4 model families on B200/H200
- minimaxm2.5-fp8-b200-vllm
- qwen3.5-bf16-b200-sglang
- glm5-fp8-b200-sglang
- dsv4-fp8-h200-vllm
Each launcher mirrors its fixed-seq-len sibling but: uses CONC env for
max-num-seqs / cuda-graph-max-bs, sources benchmark_lib.sh, calls the
trace replayer via build_replay_cmd, and emits the agentic result JSON.
Master config gets an agentic-coding scenario block sweeping conc 1..32
at offloading=none; b200-dsv4 entries left untouched since that runner
type isn't registered in runners.yaml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* agentic: add mi355x launchers for minimaxm2.5/qwen3.5/glm5.1/kimik2.5
- minimaxm2.5-fp8-mi355x-vllm
- qwen3.5-fp8-mi355x-sglang
- glm5.1-fp4-mi355x-sglang
- kimik2.5-fp4-mi355x-vllm
Each mirrors its fixed-seq-len sibling with ROCm-specific tweaks
(VLLM_ROCM_USE_AITER, ROCM_QUICK_REDUCE_QUANTIZATION, etc.) and feeds
CONC into max-num-seqs / cuda-graph-max-bs. Master configs gain matching
agentic-coding scenarios sweeping conc 1..32 at offloading=none.
dsv4-fp8-mi355x is intentionally skipped since the existing fixed-seq
launcher requires a bespoke vLLM PR rebuild that adds risk to
trace-replayer testing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* agentic: add b200 launchers for gptoss-fp4, kimik2.5-int4, minimaxm2.5-fp4
Phase-2 coverage extension across precision (int4 vs fp4 for kimi,
fp4 vs fp8 for minimax) and runner (b200 vs h100/h200 for gptoss).
- gptoss-fp4-b200-vllm
- kimik2.5-int4-b200-vllm
- minimaxm2.5-fp4-b200-vllm
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* agentic: add qwen3.5-fp8-b200-sglang variant (bf16 image is buggy)
The bf16 image lmsysorg/sglang:nightly-dev-20260216-d3bae71e fails on
B200 with PyTorch/CuDNN compatibility errors at server start. Add an
fp8 variant using lmsysorg/sglang:v0.5.9-cu130-amd64 to provide a
working qwen3.5 trace-replayer test on NVIDIA.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: add agentic trace replayer test coverage map
Documents the launcher matrix at benchmarks/single_node/agentic/, how to
dispatch debug runs via gh workflow run, and what fields in the result
JSON to inspect for verification (num_requests_successful,
total_generation_tokens, median_ttft, median_tpot, total_tput_tps, etc.).
Notes the two known-failing configs (qwen3.5 sglang on B200 — pytorch/
pytorch#168167; dsv4-fp4-b200-sglang — runner b200-dsv4 not in
runners.yaml) so future testers don't repeat them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: add agentic trace replayer coverage test results
15 debug runs across 7 model families × NVIDIA/AMD HW. 10 PASS / 5 FAIL
(1 still in flight); failures are all image- or vLLM-parser-level, not
replayer bugs. Replayer's per-model delta-field routing + long-prefill
agentic flow verified end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: finalize agentic trace replayer test results
All 16 dispatched runs are now complete. Final tally: 10 PASS, 6 FAIL.
The 6 failures are all infrastructure or vLLM-side issues (PyTorch/CuDNN
image incompatibility, vLLM deepseek_v4 reasoning parser bug, sglang-rocm
qwen3.5 streaming, SLURM time limit) — none indicate a bug in the trace
replayer itself. All 7 active model families have at least one PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(agentic): collect_sweep_results regex matches actual offload values
The exp-name template emits offload{none|cpu|ssd} (per the matrix
generator's f"{model_code}_tp{tp}_conc{conc}_offload{offloading}"),
but the regex was looking for offload(on|off) — so every artifact
directory failed to parse, the aggregator wrote nothing to aggregated/,
and collect-agentic-results uploaded no files ("No files were found
with the provided path: aggregated/").
Verified the fix matches real artifact names from this branch's runs
(b200/h100, none/cpu).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* agentic: expand sweep configs for the 10 verified models
For the 5 vllm models (kimik2.5-fp4/int4-b200, minimaxm2.5-fp8-b200,
gptoss-fp4-b200, kimik2.5-fp4-mi355x, minimaxm2.5-fp8-mi355x): add
offloading=cpu at high concurrency (typically conc 64+) where KV cache
pressure exceeds GPU HBM. Overlap at conc=64 between none and cpu so
the crossover region is sampled by both. cpu-offload sweep tail uses
larger conc points (96, 128, 192, 256) since the only reason to enable
cpu offload is when concurrency stresses HBM.
For glm5-fp8-b200-sglang and glm5.1-fp4-mi355x-sglang (sglang launchers
without the OFFLOADING=cpu plumbing): expand the conc range on
offloading=none. sglang manages its own KV eviction via the radix
cache, so concurrency above HBM capacity is handled internally rather
than via vLLM's --kv_offloading_backend.
dsr1-fp4-{b200,mi355x}-sglang sweeps already cover conc 1..256 (b200
also has tp=4 ep=4 / tp=8 ep=8 split and tp=8 going to conc=512), so
left as-is.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* runners(b200-dgxc): SLURM-exclude gpu-10/gpu-15 (stuck CUDA + full fs)
Both nodes are currently dropping every job that lands on them:
- NCCL barrier dies during sglang Scheduler.init_model_worker with
RuntimeError: NCCL error: unhandled cuda error (stale CUDA contexts
from a previous job that didn't tear down cleanly)
- HuggingFace CAS download for moonshotai/Kimi-K2.5 fails with
RuntimeError: Data processing error: CAS service error : IO Error:
No space left on device (os error 28)
Adding --exclude=gpu-10,gpu-15 to salloc keeps SLURM from allocating to
them. Drop this once sa-shared admins clean up the nodes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* agentic: --disable-hybrid-kv-cache-manager when OFFLOADING=cpu
vLLM's OffloadingConnector (--kv_offloading_backend native) is incompatible
with the hybrid-KV-cache-manager (HMA) for models with mixed attention
layouts. When HMA is enabled, the OffloadingConnector init fails with:
RuntimeError: Worker failed with error 'Connector OffloadingConnector
does not support HMA but HMA is enabled.
Please set --disable-hybrid-kv-cache-manager'.
This bit kimik2.5-fp4-mi355x's full sweep: every offload=cpu sub-job
failed with the above error while every offload=none sub-job passed
(see run 25117841192). Kimi-K2.5 uses hybrid attention so HMA kicks in.
MiniMax-M2.5 doesn't, which is why its prior cpu-offload sweeps passed
even with the broken flag.
Switching all 11 cpu-offload launchers to --disable-hybrid-kv-cache-manager
is correctness-safe across the board: HMA is a pure optimization, and
disabling it is required for OffloadingConnector regardless of model.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* agentic-coding: bump vllm-openai images to v0.19.1 for cpu-offload configs
KV offloading via OffloadingConnector hits multiple upstream bugs on
older vllm tags:
- v0.15.1 (gpt-oss-fp4-b200, kimi-int4-b200): flashinfer kv_cache_permute
assertion in TRTLLM-attention path
- v0.18.0-rocm (kimi-fp4-mi355x): HMA + OffloadingConnector incompat
- v0.19.0 (minimaxm2.5-fp8 b200/mi355x): not yet verified clean
Bumping to v0.19.1 (or v0.19.1-rocm) — proven-good on kimi-fp4-b200
(23/23 sweep PASS) and gptoss-fp4 h100/h200/mi300x/mi325x.
* agentic: minimax-fp8 sweep across all 6 SKUs
Add agentic-coding sections + launchers for MiniMax-M2.5 FP8 across
H100, H200, B200, B300, MI300X, MI355X (excluding MI325X). Conc ranges
sized from per-SKU GPU KV cache capacity:
KV per token (fp8, 62 layers × 8 KV heads × 128 dim × 2): ~124 KB
Per-SKU GPU cache cap with tp=4 + 0.90 mem-util:
H100 58 GB -> 0.46M tok (saturate ~conc 6)
H200 277 GB -> 2.19M tok (saturate ~conc 29)
B200 461 GB -> 3.63M tok (saturate ~conc 48)
B300 807 GB -> 6.35M tok (saturate ~conc 85)
MI300X 500 GB -> 3.93M tok (saturate ~conc 52)
MI355X 864 GB -> 6.81M tok (saturate ~conc 91)
NVIDIA configs include offload=cpu starting at the saturation point
(simple cpu offload via OffloadingConnector requires vllm ≥ 0.19.1).
AMD configs do not enable cpu offload — vllm simple offloading isn't
supported on the rocm build for these models. AMD pushes offload=none
to a higher conc to demonstrate where GPU cache saturates.
Image bumps: h100/h200/mi300x v0.18.0/v0.16.0 -> v0.19.1; b300
v0.19.0-cu130 -> v0.19.1.
* agentic minimax-fp8: drop tp=8, follow fixed-seq-len TPs
vllm v0.19.1 fp8 quantization rejects tp=8 for MiniMax-M2.5: gate/up
weight output_size 1536 / tp=8 = 192, not divisible by block_n=128.
Same constraint at vllm/model_executor/layers/quantization/fp8.py:638.
Per fixed-seq-len reference TPs:
H100 tp=4 ep=4 (tp=8 ep=8 commented out in fixed-seq-len for fp8)
H200 fixed-seq-len has only tp=8 (broken on v0.19.1 fp8); winging tp=4
B200 tp=4 (fixed-seq-len has tp=2,4; tp=2 too tight for agentic ISL)
B300 tp=4 (primary; fixed-seq-len has tp=1,2,4 with various ep)
MI300X tp=4 (fixed-seq-len has tp=2,4)
MI355X tp=4 ep=4 (fixed-seq-len has tp=2 ep=2, tp=4 ep=4, tp=8 ep=8)
Concurrency expanded across the saturation cliff for each SKU; cpu
offload range extended to 384/512 on NVIDIA where applicable.
* agentic minimax-fp8: trim conc to creep up to per-SKU compute ceiling
Per empirical compute ceilings observed in prior runs (mean in-flight reqs
mid-test on each platform):
H100 tp=4 ep=4 ceiling ~10 (KV cliff ~6 -> cpu zone 6-10)
H200 tp=4 ceiling ~35 (KV cliff ~29 -> cpu zone 29-35)
B200 tp=4 ceiling ~50 (KV cliff ~48 -> very narrow)
B300 tp=4 ceiling ~60 (KV cliff ~85 -> compute saturates first)
MI300X tp=4 ceiling ~20 (estimated)
MI355X tp=4 ep=4 ceiling ~60
Previous conc lists (1..256, even up to 512) wasted 30-min slots on
sub-jobs that just queue 200+ requests waiting on a server only running
4-50 in flight, leading to client-side 600s timeout cascades. New lists
"creep up" to 2-3x the ceiling, then stop.
NVIDIA cpu offload range narrowed to the zone between KV cliff and
compute ceiling, where offloading can actually relieve KV pressure
without compute already being the bottleneck.
AMD (mi300x, mi355x) keeps offload=none only.
* agentic minimax-fp8: cliff-dense conc ladders (v4)
Per user feedback: past the compute ceiling, throughput plateaus and
extra conc just adds queue depth and client timeouts -- wasted slots.
Reallocate sampling budget to densify around the cliff(s) for each SKU.
Per-SKU strategy (compute ceiling empirical, KV cliff analytical):
H100 tp=4 ep=4 ceil 10 KV 6 -> dense 4-12 (sweet spot for cpu demo)
H200 tp=4 ceil 35 KV 29 -> dense 24-40 (narrow cpu window)
B200 tp=4 ceil 50 KV 48 -> dense 32-56 (cliffs colocated)
B300 tp=4 ceil 60 KV 85 -> dense 48-72 (compute first; cpu won't help)
MI300X tp=4 ceil 25 KV 52 -> dense 16-32 (compute first; AMD no cpu)
MI355X tp=4 ep=4 ceil 60 KV 91 -> dense 48-72 (compute first; AMD no cpu)
Dense step (every 4-8 conc) around the cliffs to resolve the inflection;
sparse step (doubling) below the cliffs for baseline; one point ~1.3-1.5x
ceiling to confirm plateau.
NVIDIA cpu offload range overlaps with none from KV cliff to ~ceiling
for direct same-conc comparison; doesn't extend past 1.3x ceiling.
* agentic minimax: AMD native cpu offload + b300-p1 runner
- AMD launchers (mi300x, mi355x) drop VLLM_USE_SIMPLE_KV_OFFLOAD env
var. SimpleCPUOffloadConnector isn't supported on rocm; native
OffloadingConnector works (still passes --kv_offloading_backend
native flag).
- Add cpu offload entries to AMD master configs (mi300x, mi355x).
- Add b300-p1 runner group (subset of b300 nodes 13-17 with the
b300-p1 label) and target it from the b300 minimax config.
* agentic: drop --no-enable-prefix-caching from all launchers
The agentic-coding benchmark IS a prefix-cache benchmark — the whole
point is measuring KV reuse across multi-turn conversations and
across users (with the per-user salt enabling deterministic prefix
overlap). Disabling prefix caching defeats the entire purpose.
Removed from 7 launchers that had it:
dsv4_fp8_h200.sh
gptoss_fp4_b200.sh (was in config.yaml)
kimik2.5_fp4_mi355x.sh
kimik2.5_int4_b200.sh
minimaxm2.5_fp4_b200.sh
minimaxm2.5_fp8_mi300x.sh
minimaxm2.5_fp8_mi355x.sh
vLLM defaults to prefix caching ON when no flag is passed.
* agentic minimax mi300x/mi355x: switch attention backend to UNIFIED_ATTN
ROCM_AITER_FA was the suspect for both:
1. Worker dies on cpu offload (gpt-oss using UNIFIED_ATTN works fine
on the same launcher pattern + image)
2. Prefix-cache Prometheus counters never increment (observability gap
on FA backend, while UNIFIED_ATTN reports correctly on mi300x)
Swap to ROCM_AITER_UNIFIED_ATTN to test both fixes in one shot.
* agentic minimax b200/b300: extend none past KV cliff for fall-off demo
The cpu range needs full overlap with none past the KV cliff so the
no-offload throughput collapse is visible at the same conc points
where cpu offload sustains throughput.
B200 tp=4 (KV cliff conc=48):
none: [1,2,4,8,16,32,48,56,64,96,128] (was capped at 64)
cpu: [48,56,64,96,128] (was capped at 64)
B300 tp=4 (KV cliff conc=85):
none: [1,2,4,8,16,32,48,64,96,128,192] (was capped at 96)
cpu: [48,64,96,128,192] (was capped at 96)
Past the cliff, the no-offload curve should collapse (recompute storm,
client-side timeouts), while cpu-offload sustains the compute ceiling.
* agentic minimax-fp8-b300: revert to standard b300 runner tag
* agentic minimax-fp8-b300: bump cpu DRAM offload to 2.2 TB (B300 has plenty)
* agentic minimax-fp8-b300: dense conc 100-124 to resolve cpu offload dropoff
* agentic minimax-fp8-b200: bump cpu DRAM offload to 1.5 TB, target b200-dgxc
- Add b200-dgxc runner pool (subset of b200 excluding b200-cw / b200-nb).
- Switch minimax-fp8-b200-vllm runner from b200 to b200-dgxc.
- Hardcode TOTAL_CPU_DRAM_GB=1500 in cpu branch of b200 launcher
(1.95x HBM total at tp=4, comfortably above the 1.5x threshold so
the offload tier doesn't hit a secondary cliff).
* fix(matrix): drop duplicate agentic-coding loop from merge
The merge with origin/main pulled in main's agentic-coding loop in
generate_test_config_sweep alongside our pre-existing one — both blocks
were byte-identical so every sub-job got emitted twice (e.g., b300
generated 60 entries instead of 30).
Drop the duplicate block, restore the function's return statement that
was lost in the dedup.
* agentic: dsv4-fp4 B200/B300 initial sweep + restore SCENARIO_SUBDIR on b300-nv
Adds agentic trace replay configs and launchers for DeepSeek-V4-Pro fp4 on
B200 and B300 via vLLM, mirroring the fixed-seq-len recipe (tp=8 ep=1, no
DP-attn) at the low-conc range. Initial conc list [1..64] for none and
[16,32,64] for cpu offload; cpu DRAM defaults to 1.5 TB on B200 and 2.2 TB
on B300 in the launcher (overrides the workflow 600 GB default).
Switches dsv4-fp4-b200-vllm runner from b200-dsv4 (not in our runners.yaml)
to b200-dgxc to match the established minimax B200 pattern.
Also restores ${SCENARIO_SUBDIR} in launch_b300-nv.sh BENCH_BASE: the
post-revert main state landed without it after the v0.1 squash merge, so
agentic dispatch on B300 was resolving to benchmarks/single_node/ instead
of benchmarks/single_node/agentic/. The b200-dgxc launcher already had
this prefix; b300-nv was the asymmetry.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* agentic dsv4-fp4: switch B200/B300 to official blog recipe layout (DP=8 EP=8)
The first attempt OOM'd at vLLM startup on every conc=64 cpu-offload job
(and would have on conc=32 cpu) because I used TP=8 EP=1 with FULL_AND_PIECEWISE
+ max-num-batched-tokens=2048 + max-cudagraph-capture-size=2048 (copied from the
fixed-seq-len recipe). At TP=8 every layer's attention output goes through an
NCCL all-reduce; cudagraph capture pre-allocated activation/all-reduce workspace
proportional to max-batched-tokens × hidden_dim × layers, consuming ~134 GiB
per rank on top of the ~134 GiB DSv4-Pro fp4 weight footprint (1.6T-total /
49B-active model, 800 GiB checkpoint). KV cache profiling then had nothing
left to allocate.
The official vLLM blog recipe for 8xB200/8xB300
(https://vllm.ai/blog/deepseek-v4) uses DP=8 + EP=8 instead — each rank does
its own attention on its own sequences (no per-layer TP all-reduce) and the
MoE all-to-all is the only collective. Smaller activation workspace at capture
time → cudagraph + KV cache both fit. Switching to that layout:
- both launchers: drop the TP/DP-attn branching, always
--data-parallel-size $TP --enable-expert-parallel; drop the
max-cudagraph-capture-size and max-num-batched-tokens overrides (recipe
doesn't set them, defaults are fine for DP-only collectives); keep
FULL_AND_PIECEWISE + custom_ops=["all"] per recipe; max-model-len pinned
at 1M (full DSv4 context — recipe suggests 800K but user wants 1M tested).
- nvidia-master.yaml: agentic-coding entries become tp=8 ep=8 dp-attn=true
for both B200 and B300; image at the config-block level switches from
v0.20.0-cu130 to deepseekv4-cu130 (the DSv4-tuned tag from the recipe).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* agentic dsv4-fp4: keep image at v0.20.0-cu130 (deepseekv4-cu130 not pinned)
Per user direction, stay on vllm/vllm-openai:v0.20.0-cu130 instead of the
DSv4-tuned deepseekv4-cu130 tag from the blog recipe — that tag isn't
currently pinned in this pipeline. Parallelism layout (DP=8 + EP=8) is
unchanged from the prior commit since the OOM fix is what actually mattered.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* agentic dsv4-fp4: drop cpu-offload sweep entries (HMA conflict at 1M)
cpu-offload jobs hit a clean ValueError at vLLM startup on B300:
442.99 GiB KV cache is needed [for max_model_len=1M], which is larger
than the available KV cache memory (104.74 GiB). [...] estimated
maximum model length is 236288.
The cause is in the warning right above: SimpleCPUOffloadConnector forces
--disable-hybrid-kv-cache-manager, which switches off DSv4's per-layer KV
compaction (the "drop KV outside the local sliding window" optimization
that gives DSv4 its "10% of V3.2's KV per token at 1M" claim). Without
HMA, every layer stores full per-token KV and the per-rank budget blows
up well below 1M context.
HMA is DSv4's intended long-context mechanism — leave KV management to
it and skip cpu offload until upstream supports HMA + KV connector
together. Re-introduce a cpu-offload sweep at lower max-model-len in a
follow-up if a meaningful KV cliff appears in the offload=none data.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* rm diable hma connector
* agentic dsv4-fp4: enable simple-offload + HMA, restore cpu-offload sweep
Re-enables the cpu-offload path for DSv4-Pro on B200/B300 now that we
understand SimpleCPUOffloadConnector (selected via VLLM_USE_SIMPLE_KV_OFFLOAD=1)
already inherits SupportsHMA in v0.20.0 (PR #37160 by njhill, merged
2026-04-01). The earlier failure was caused by --disable-hybrid-kv-cache-manager
in OFFLOAD_ARGS, which forced HMA off and made vLLM size the KV pool for full
per-layer storage (442 GiB needed for 1M context vs 104 GiB available per rank).
Changes:
- Both launchers: drop --disable-hybrid-kv-cache-manager from cpu OFFLOAD_ARGS;
add explicit --enable-prefix-caching and --no-disable-hybrid-kv-cache-manager
to the vllm serve command (matches PR #37160's documented example).
- nvidia-master.yaml: restore the offloading=cpu search-space entries on both
dsv4-fp4-b200-vllm and dsv4-fp4-b300-vllm with conc-list [16, 32, 64], and
rewrite the comment to reflect the actual mechanism rather than the prior
(incorrect) "wait for upstream HMA + connector support" framing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* runners(b200-dgxc): switch SLURM partition gpu -> gpu-2 (cluster re-partitioned)
The b200-dgxc cluster was re-partitioned: the old "gpu" partition no longer
exists. salloc now rejects with "invalid partition specified: gpu",
breaking every B200 single-node agentic dispatch. Current sinfo:
cpu cpu-[0-2]
all* cpu-[0-2] + gpu-1-* + gpu-2-* (default, mixed)
gpu-1 gpu-1-[0-3,5-7,9] (8 idle, gpu-1-4 / gpu-1-8 drained)
gpu-2 gpu-2-[0-9] (10 idle, none drained)
Land on gpu-2 since it's a clean GPU-only pool with no drained nodes.
Drop the --exclude=gpu-10,gpu-15 list — those node names were from the
pre-repartition layout (now gpu-1-* / gpu-2-*) and no longer match
anything on the cluster.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* agentic dsv4-fp4: pre-divide kv_offloading_size by TP; cpu-only sweep
Pre-divides TOTAL_CPU_DRAM_GB by $TP (= DP size, since the launcher passes
--data-parallel-size $TP) so each DP engine ends up with its fair share.
Without this, each of the 8 DP engines independently torch.zeros + pin_tensor
its own ~1500/2200 GB region, blowing past the SLURM memory cgroup limit
(direct dmesg evidence on gpu-2-6: 7 separate VLLM::Worker_DP processes
OOM-killed in sequence by the cgroup OOM-killer at growing anon_rss values).
Root cause is in vllm v0.20.0:
- vllm/config/parallel.py defines world_size := TPxPP, with a separate
world_size_across_dp := TPxPPxDP property
- vllm/distributed/.../simple_cpu_offload_connector.py uses parallel_config
.world_size for the divide, picking up TPxPP only
- LMCacheConnector explicitly divides by num_kv_ranks (incl DP); Simple's
path does not — see vllm/config/vllm.py
So with DP=8 EP=8 TP=1, world_size=1 inside each engine, no DP-aware
adjustment, and each DP engine commits the full --kv_offloading_size value
to physical pinned host RAM.
Also temporarily removes the offloading=none agentic-coding search-space
entries on both dsv4-fp4-{b200,b300}-vllm — we already have that data from
Friday's runs (25234821661, 25234822495). The next dispatch will be
cpu-only to validate the host-budget fix without re-running the none cases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* agentic dsv4-fp4: align parallelism with fixed-seq-len; conditional offload sizing
Mirrors the fixed-seq-len recipe's parallelism options for the agentic
sweep — pure TP for low-conc / interactivity, DEP (DP-attn + EP-MoE) for
high-conc / throughput per the vLLM blog recipe — and adapts the cpu
offload sizing logic to the connector's actual divide-by-world_size
behavior:
- DP-attn=true (DEP modes): each DP engine has parallel_config.world_size=1
(TP×PP only — see vllm/config/parallel.py docstring), so the connector's
internal divide is a no-op and each DP engine independently torch.zeros +
pin_tensor allocates the full --kv_offloading_size value. Pre-divide
TOTAL_CPU_DRAM_GB by $TP (the DP size in this layout) so 8 DP engines ×
(TOTAL/8) keeps aggregate host commit ≈ TOTAL.
- DP-attn=false (pure TP, TP+EP): single engine with world_size=TP. Pass
the full TOTAL — the connector's internal divide gives TOTAL/TP per rank
and PR #37206's TP-shared mmap keeps the aggregate at TOTAL.
Restored conditional PARALLEL_ARGS / EP_ARGS in both launchers (we had
removed them when simplifying to DEP-only). Now handles all three modes
(pure TP, TP+EP, DEP) cleanly via the matrix's tp / ep / dp-attn fields.
Sweep coverage:
- B200 (16 jobs): TP=8 + DEP=8, each with both offloading modes
- B300 (32 jobs): TP=4, TP=8, DEP=4, DEP=8, each with both offloading modes
Conc lists are agentic-scaled (smaller than fixed-seq-len): pure-TP modes
sweep [1..32], DEP modes sweep [16..128] (none) and [64..256] / [128..512]
(cpu offload, where the larger CPU pool extends the working-set ceiling).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* agentic dsv4-fp4: enable lazy_offload to mitigate popleft_n assertion
Server logs from the prior multi-parallelism run showed the cpu-offload
failure mode is an AssertionError in vllm/v1/core/kv_cache_utils.py:269
(popleft_n: curr_block is not None) — the FreeKVCacheBlockQueue's linked
list and num_free_blocks counter get out of sync under DSv4 + 1M
max_model_len + cpu offload + sustained eviction pressure. The eager
offload path (default) does the store bookkeeping inline with each step,
which races with the scheduler's free-block accounting.
Switch from --kv_offloading_size convenience flag to explicit
--kv-transfer-config JSON so we can pass lazy_offload=true (PR #37160's
documented option) alongside cpu_bytes_to_use. Lazy mode defers the
store path and avoids the race that triggers the assertion.
Also temporarily drop the offloading=none search-space entries — they
already validated cleanly in run 25332045030 (B200 TP=8 + DEP=8 all 100%)
so this iteration focuses solely on cpu offload paths to confirm the
mitigation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* agentic dsv4-fp4: bump image to v0.20.1, revert to eager offload
lazy_offload (PR #37160 option) was a partial fix for the popleft_n
assertion: across last run's 18 cpu jobs:
- low/mid conc cases that were 0% in eager went to 80-100%
- but high-conc DEP=8 cases regressed (256 went 992/992 -> 212/477,
new failure mode: cuMemcpyBatchAsync err=719 cudaErrorIllegalAddress
in the deferred-batch copy path of the simple connector's worker)
So eager has a scheduler/eviction race (popleft_n at low conc, OK at
very high conc), and lazy has a CUDA-async race (OK at low conc,
illegal-address at very high conc). Different bugs in different code
paths of the same connector.
v0.20.1 was published today (2026-05-04) and includes all 13 parts of
the [kv_offload+HMA][N/N] series cleanly merged. Try the upstream's
own latest release with eager (default) to see if either bug is fixed.
v0.20.1 only ships cu129 (no cu130 variant yet); cu129 supports
Blackwell and should run on B200/B300.
Revert OFFLOAD_ARGS to the --kv_offloading_size convenience flag
(eager default; lazy_offload was the only reason we needed the JSON
form).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* agentic dsv4-fp4: revert to v0.20.0-cu130 + lazy_offload, scale max-num-seqs per-engine
v0.20.1 (cu129) iteration was strictly worse:
- Same popleft_n AssertionError still fires
- Model load 12x slower on Blackwell (588s vs 46s on v0.20.0-cu130)
- All 6 B200 cpu jobs got 0/9 trace-replay success
Revert image to v0.20.0-cu130 and re-enable lazy_offload (the best run
we had — B200 mixed 35-100%, B300 mostly 80-100%, with regressions only
at very high conc DEP=8 cases).
Add a per-engine --max-num-seqs scaling for DP-attn modes: the trace
replay tool's CONC concurrent users load-balance across DP ranks, so
each engine actually sees CONC/$TP sequences in steady state. Setting
the per-engine cap to that (instead of the global CONC) avoids the
scheduler reserving block-pool capacity for sequences that won't
materialize on this engine — which may amplify the eviction race that
hurt high-conc DEP cases in the prior lazy_offload run.
Pure TP modes are a single engine and keep --max-num-seqs = $CONC.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* add aiperf submodule (cquil11/aiperf @ ajc/inferencex-agentx-mvp)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* agentic: migrate from kv-cache-tester to aiperf (live-assistant default)
All 26 agentic launchers now drive aiperf via the inferencex-agentx-mvp
scenario instead of trace_replay_tester.py. Live-assistant mode is on by
default (AIPERF_DATASET_WEKA_LIVE_ASSISTANT_RESPONSES=1) so the server's
just-generated KV blocks survive turn boundaries and the measured cache-
hit rate reflects what a real agentic user would experience.
Changes:
- utils/aiperf submodule pointer bumped to cjq/weka-live-assistant-responses
(29418ea6) and .gitmodules branch tracking updated.
- benchmarks/benchmark_lib.sh: build_replay_cmd, install_agentic_deps,
resolve_trace_source rewritten. The 26 single-node + 1 multinode
launcher scripts inherit the change via sourced helpers — none of them
need per-script edits. Helper signatures (REPLAY_CMD, TRACE_SOURCE_FLAG)
preserved.
- utils/process_agentic_result.py: full rewrite, consumes aiperf's
profile_export.jsonl + profile_export_aiperf.json + server_metrics_export.json.
Output JSON key schema preserved so utils/summarize.py and other
aggregators keep working without edits. Theoretical cache-hit rate and
output_tokens_expected are computed from trace metadata in the local HF
cache (independent of which mode aiperf runs in).
- utils/test_process_agentic_result.py: new fixture-driven unit test
suite (6 tests) covering schema parity with summarize.py, ms→s unit
conversion, throughput-per-GPU derivation, missing-server-metrics
graceful path, response_cache_hit_rate from cached_tokens, and the
per-run subdir layout that --num-profile-runs > 1 produces.
The legacy utils/trace-replay submodule is left on disk for fallback;
no scripts reference it anymore.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* agentic: --ignore-installed for aiperf editable install
vLLM container has apt-managed `blinker` (and likely other distutils
packages) that pip refuses to uninstall when one of aiperf's transitive
deps tries to upgrade them, killing `pip install -e ./utils/aiperf`
mid-install. `--ignore-installed` lets pip install our newer copy fresh
into site-packages without touching the apt-managed version. Safe in
this context — we own the container, system blinker isn't load-bearing
for the benchmark, and pip's import order picks up the newer copy first.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* agentic: pass --num-dataset-entries 739 to aiperf
Default is 100 — without the flag the loader silently caps the weka
corpus to the first 100 traces of 739, limiting diversity and making
recycled-trace cache-hit math weird. The inferencex-agentx-mvp scenario
doesn't lock this setting (only locks 7 things; this isn't one).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* bump aiperf: terminate trajectory on context-overflow
Pulls in aiperf b9f44eac which makes AgenticReplayStrategy recycle a
trajectory on the first context-length error instead of continuing to
dispatch turns whose prompts will all also overflow. Matches
kv-cache-tester's "user truncated" semantics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* agentic: pass --cache-bust system_prefix to aiperf
The inferencex-agentx-mvp scenario validator requires
cache_bust.target=system_prefix but doesn't auto-default it — it just
checks that user-supplied config matches. Without the flag, scenario
validation rejects the config at startup with a value error.
The tutorial example also passes it explicitly; I dropped it earlier
thinking the scenario auto-supplied the value, which it doesn't.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* bump aiperf: fix realtime-stats AttributeError crash loop
Pulls in aiperf dc943e7e which fixes the every-30s AttributeError in
_render_realtime_block (CreditPhaseStats vs PhaseRecordsStats type
mismatch). Crash was non-fatal — benchmarks were running fine — but
flooded logs with tracebacks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* agentic: auto --unsafe-override when DURATION < 900s
The inferencex-agentx-mvp scenario enforces a 900s minimum benchmark
duration. For smoke tests / iteration / debugging at shorter durations,
auto-opt into --unsafe-override so the run starts. The result will have
submission_valid=false with reason "unsafe_override" — that's the
expected and documented behavior for non-canonical runs.
Also support AIPERF_UNSAFE_OVERRIDE=true as an explicit toggle for
durations >= 900s when the user wants to override other locked
settings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* agentic CI: upload aiperf artifacts (not kv-cache-tester paths)
The agentic-coding raw-results upload step was still listing
kv-cache-tester filenames (detailed_results.csv, metrics_*.csv, etc.)
which the new aiperf-driven pipeline doesn't produce. Replace with the
aiperf artifact set under results/trace_replay/:
- profile_export.jsonl -- per-record metrics stream
- profile_export_aiperf.json/csv -- aggregate stats with metadata
- profile_export_aiperf_timeslices -- windowed stats
- profile_export_aiperf_aggregate -- multi-run aggregate (when N>1)
- profile_export_aiperf_collated -- per-run collated payloads
- profile_export_raw.jsonl -- raw request/response bodies
- server_metrics_export.{json,csv,jsonl,parquet} -- Prometheus scrape
- gpu_telemetry_export.jsonl -- GPU telemetry stream
- inputs.json -- pre-formatted request bodies
if-no-files-found: ignore is preserved so the step is robust when a
specific output type isn't enabled in a given run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* agentic CI: drop multi-GB debug artifacts from upload
inputs.json (multi-GB pre-formatted request bodies) and
profile_export_raw.jsonl (full HTTP request/response capture) were
inflating per-run artifact size to ~7 GB on the weka corpus. Neither is
consumed by the post-processor or any downstream tool — they're offline
forensics artifacts that can be rebuilt from --public-dataset +
--random-seed when needed.
Drops upload size to ~50-100 MB / run. Post-processing schema unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* process_agentic_result: fix server_metrics_export.json schema parsing
The post-processor crashed with `AttributeError: 'str' object has no
attribute 'get'` at the end of every run because _index_server_metrics
iterated the top-level "metrics" value as if it were a list of metric
dicts, when it's actually a dict keyed by metric name.
Real aiperf v0.8 schema (per
docs/server-metrics/server-metrics-json-schema.md):
{"metrics": {<name>: {"type": ..., "series": [{"stats": {...}}]}}}
Rewrite:
- _index_server_metrics returns the metrics dict as-is
- _final_value walks series[i].stats[stats_key] (counters use "total",
gauges fall back to "max"/"avg") and aggregates across series so
multi-endpoint deployments correctly sum counters
Two new regression tests use the real schema shape (single-series and
multi-series) so future schema drift fails fast in CI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* process_agentic_result: support traces.jsonl + handle vLLM no-cached-tokens
Two fixes informed by the first successful smoke run (25404442007):
1. The published HF dataset ships a single traces.jsonl (one trace per
line), not per-trace *.json files. _hf_traces_dir was filtering for
*.json only so theoretical_cache_hit_rate stayed None even though
the corpus was downloaded. Add a JSONL path and accept either layout.
2. vLLM v0.19.1 doesn't populate cached_tokens in the OpenAI usage
response field, so usage_prompt_cache_read_tokens isn't in any
per-record metric. response_cache_hit_rate stays None — the
server-side Prometheus scrape (vllm:prefix_cache_{hits,queries})
is the actual source of truth for this benchmark, and that path
now works (89.8% measured on iter1).
Adds two unit tests covering the JSONL trace layout end-to-end (load,
walk hash_ids per turn, derive hits / output_tokens_expected).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* agentic: generate metrics_plots.png from aiperf artifacts post-run
kv-cache-tester's metrics_plots.png was a 6x2 grid showing TTFT, E2E,
ITL, ISL/OSL distributions, and server-side cache + KV usage time
series. Replicate the same visual at ~150 KB per run from aiperf's
profile_export.jsonl + server_metrics_export.json.
Panels:
1. TTFT vs request time (scatter + rolling avg)
2. E2E latency vs request time (scatter + rolling avg)
3. Inter-token latency vs request time (scatter + rolling avg)
4. ISL / OSL token-count distributions (overlaid histograms)
5. Server prefix-cache hit rate over time (timeslices when present,
else flat-line aggregate fallback)
6. vLLM KV cache usage % over time (same fallback)
Best-effort: matplotlib is loaded with the Agg backend (headless), and
write_agentic_result_json swallows non-zero exits so missing
matplotlib in stripped-down container images doesn't break the
launcher's success gate. Tested locally against iter1's artifacts —
produces a 149 KB PNG in <1s.
The PNG is added to the agentic-coding artifact-upload list so the
GH Actions run-page surfaces it directly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* bump aiperf: enrich realtime stats line (input throughput + ISL/OSL)
Pulls in aiperf 26d1e3ad which adds tput_in to the headline and an ISL/OSL
average row to the realtime stats block — closer match to what
kv-cache-tester showed at assessment periods.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* process_agentic_result: accept ``out`` alias in trace metadata
The on-disk weka trace JSON uses the keys ``in`` / ``out`` — these are
the Pydantic loader aliases for input_length / output_length. The
metadata loader was only looking for the underscored field name, so
``output_length`` ended up zeroed out for every turn and
mean_output_tokens_expected reported 0 in iter2's agg JSON.
Fall back to ``out`` when ``output_length`` is missing. Test fixture
updated to mix both spellings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* bump aiperf: realtime ITL row uses scalar inter_token_latency
Pulls in aiperf 64a8b0a8. The renderer was looking up the per-record
list metric ``inter_chunk_latency`` (which doesn't aggregate into
realtime percentiles), so the ``itl`` row was always dashes mid-run.
Switch to the scalar ``inter_token_latency`` so live ITL p50/p95/p99
populate the same way TTFT and E2E do.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* agentic: 6x2 metrics_plots.png matching kv-cache-tester layout
Rewrites the post-run plotter to produce the same 12-panel figure
kv-cache-tester emitted, panel-for-panel:
Row 0: KV Cache Utilization | Request Queue Depth
Row 1: Prefix Cache Hit Rate | Throughput (Total + Decode)
Row 2: KV Offload Transfer Rate | Cumulative Prefill Source Breakdown
Row 3: KV Offload GPU→CPU (cum.) | KV Offload CPU→GPU (cum.)
Row 4: TTFT vs Time | Latency vs Time
Row 5: Interactivity (1/TPOT) vs Time | Preemptions Over Time
Time-series data comes from aiperf's server_metrics_export.json
``timeslices`` (per-series per-window stats). To populate them we now
pass ``--slice-duration 1.0`` on every run — matches
kv-cache-tester's poll_interval=1.0 cadence so the visualizations are
directly comparable.
Per-record TTFT/Latency/Interactivity panels read from
profile_export.jsonl ``time_to_first_token`` / ``request_latency`` /
``inter_token_latency`` metrics.
Offload-related panels (rows 2-3) render empty when the connector
isn't enabled — vLLM doesn't expose vllm:kv_offload_bytes_* in that
case. Same behavior as kv-cache-tester.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* bump aiperf: realtime tput row shows total_token_throughput
Pulls in aiperf 5af870fb. Replaces the dashed ``tput_in=-`` field with
``tput_total=N/s`` (input + output tokens / wall-clock) since aiperf
doesn't expose input throughput as its own aggregate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* bump aiperf: add input_token_throughput metric
Pulls in aiperf 7e209913. The new system-level input_token_throughput
metric (= total input tokens / benchmark duration) lets the realtime
block show prefill TPS — was rendering as ``tput_in=-`` because the
aggregate metric simply didn't exist.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* bump aiperf: live server-metrics row in realtime block
Pulls in aiperf b6ebc19f which adds a "srv" row to the realtime stats
block showing live cumulative prefix-cache hit rate, KV cache usage,
and preemption counts from the /metrics scrape — same as
kv-cache-tester showed at assessment intervals.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* agentic: bump aiperf setup timeout + cap reconstruction workers
Root cause of mass cancellations on the parallel H100 sweep: 16 launchers
each spawning ~16 weka-reconstruction subprocess workers (auto-picked
based on CPU count) thrashed the shared HF cache, pushing the 300s
default ``AIPERF_DATASET_CONFIGURATION_TIMEOUT`` past its limit.
Symptom was identical TimeoutError -> AIPerfMultiError ("Failed to
perform operation 'Configure Profiling'") on most jobs ~15 min in.
Two env-var fixes in build_replay_cmd:
- AIPERF_DATASET_CONFIGURATION_TIMEOUT=900 — give setup 15 min headroom.
The 5-min smoke runs measured ~5 min for setup at low contention; at
high parallel contention we expect 8-12 min. 900s is a safe ceiling.
- AIPERF_DATASET_WEKA_PARALLEL_WORKERS=4 — cap per-job reconstruction
worker count so 16 parallel jobs only ever spawn 64 reconstruction
workers (vs the default 256). Smoke-run measurements showed
reconstruction time was bottlenecked on shared filesystem I/O, not
CPU — capping workers to 4 was within noise of the auto-picked 16.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* agentic: also bump AIPERF_SERVICE_PROFILE_CONFIGURE_TIMEOUT in lockstep
aiperf validates at startup that SERVICE_PROFILE_CONFIGURE_TIMEOUT (default
300s) must be >= DATASET_CONFIGURATION_TIMEOUT. The previous commit
bumped only DATASET to 900, so the validator rejected the config:
AIPERF_SERVICE_PROFILE_CONFIGURE_TIMEOUT: 300.0 must be greater than or
equal to AIPERF_DATASET_CONFIGURATION_TIMEOUT: 900.0
Bumps both to 900 so they're consistent. Same justification: parallel-
job dataset reconstruction needs >5 min headroom on contended runners.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* bump aiperf: input_token_throughput shows in realtime (console_group fix)
Pulls in aiperf 56e82092. The new input_token_throughput metric was
being computed but ``filter_display_metrics`` dropped it because
``console_group = MetricConsoleGroup.NONE`` was set on the class. Removed
the override so it matches OutputTokenThroughputMetric and flows through
the realtime display path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* bump aiperf: SIGUSR1 stack-dump diagnostic + NameError fix
Pulls in aiperf 95d25ccf:
- SIGUSR1 -> faulthandler.dump_traceback_all on every aiperf process
(system controller + bootstrap_and_run_service subprocesses) for
diagnosing the high-conc warmup hang from outside the process.
- Fix latent NameError in records_manager realtime_snapshot exception
handler.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* agentic: drop --apply-chat-template; expand log uploads
The chat-template tokenization path in inference_result_parser.py
serializes apply_chat_template through asyncio.to_thread on the default
thread pool. On minimax-m2.5 (large vocab + complex template) at
conc>=16 this wedges the records pipeline — only one record squeezed
through warmup before silence in the 2026-05-06 conc16 repro. Server
usage already covers ISL/OSL so the flag is lossless to drop.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* bump aiperf: skip DAG spawn during WARMUP (fixes warmup hang)
aiperf ddc9711d: orchestrator's intercept() now early-returns for
WARMUP credits. Without this, the DAG-spawn path leaked
_descendant_counts (warmup never advances child turns to is_final_turn)
and wedged all_credits_returned_event indefinitely, manifesting as the
"sent=16, completed=0, in_flight=16, then silence" hang reproduced 100%
at conc=16 on H100 and b200-nb.
Also add scripts/debug_*.sh to .gitignore so local debug scripts that
may contain secrets can't be staged by accident.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* agentic: use server-reported token counts for ISL/OSL
Adds --use-server-token-count to build_replay_cmd. Replaces the previous
--apply-chat-template path (which was already removed) with the cleaner
fix: bypass client-side tokenizer.encode entirely and trust the server's
usage fields. Auto-enables stream_options.include_usage on chat
endpoints. Eliminates the per-record tokenization CPU pin on
large-vocab models like minimax-m2.5 and avoids any future ISL/OSL
divergence between client tokenizer and server.
ignore_eos is already enforced upstream by the inferencex-agentx-mvp
scenario (require_ignore_eos=True auto-injects extra_inputs.ignore_eos=true).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* bump aiperf: realtime block adds p75 + per-user tput percentiles + cumulative totals
aiperf 8157cb06: latency rows now include p75; new per-user throughput
rows (tin/tout) show prefill/decode-per-user spread at p50/p75/p95/p99;
new tot row shows cumulative total_isl / total_osl since phase start.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* bump aiperf: realtime block uses raw metrics so tin row shows values
aiperf 77e4d2f2: realtime block reads raw_metrics instead of the
dashboard-filtered set, so prefill_throughput_per_user (which has
console_group=NONE) is no longer dropped before render — the tin row
now populates with p50/p75/p95/p99 instead of dashes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* amd: bump minimax mi355x/mi300x/mi325x to vllm rocm nightly with SimpleCPUOffload
Pins vllm/vllm-openai-rocm:nightly-51f22dcfd068fe8f1e3192da2a1e825b930223cf,
which includes vllm-project/vllm@20cac26b ("[Bug fix][KV Connector] add
cpu_offload_blocks > 0 check before maybe_run_layer_kv_offload"). Without
this commit, ROCm was forced onto a different KV-offload code path than
NVIDIA, so cpu-offload sweep points weren't apples-to-apples across
vendors. mi325x previously was on v0.18.0 — also bumps it forward.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* amd: use SimpleCPUOffloadConnector path for minimax mi300x/mi325x/mi355x
Sets VLLM_USE_SIMPLE_KV_OFFLOAD=1 in the cpu-offload branch of all three
AMD agentic launchers so they exercise the same code path as the NVIDIA
launchers. This requires vllm/vllm-openai-rocm:nightly-51f22dcfd0...
(vllm-project/vllm@20cac26b) which is now pinned in amd-master.yaml.
Also adds the missing minimaxm2.5_fp8_mi325x.sh that
runners/launch_mi325x-amds.sh expects (cloned from mi300x).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* amd: add agentic-coding scenario to mi325x minimax config
The MI325X minimax config had only fixed-seq-len scenarios, so the
e2e dispatch generated zero agentic configs (matrix expanded empty,
run completed in 10s with nothing exercised). Mirror the MI300X
agentic-coding search space (tp=4, conc 1..48 none + 16..32 cpu) so
cross-vendor comparison stays consistent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* bump aiperf: realtime row adds cpu_kv_usage, queue depth, sglang preempt
Picks up:
894062b6 realtime: surface cpu_kv_usage, queue depth, sglang retractions
Realtime srv row now renders cpu_kv_usage / queue / preemptions parts
conditionally (each token only when its backing /metrics gauge or
counter was actually scraped). SGLang servers get preemption coverage
via sglang:num_retracted_reqs fallback. Companion doc updates land in
docs/server-metrics/* describing the new tokens and the underlying
vllm:cpu_cache_usage_perc + vllm:external_prefix_cache_* metrics.
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* mi355x agentic: bump cpu-offload TOTAL_CPU_DRAM_GB to 2000
MI355X nodes have plenty of host DRAM; the workflow default (600 GB)
clips the offload tier well below physical capacity. Bump to 2 TB to
match the b300 path so we actually push the cpu-offload sweep into
spillover regimes instead of capping out on the connector size.
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* mi300x agentic: bump cpu-offload TOTAL_CPU_DRAM_GB to 1000
MI300X nodes have ample host DRAM; the workflow default (600 GB) clips
the offload tier below physical capacity. Bump to 1 TB to match the
mi355x (2 TB) / b300 (2.2 TB) pattern so the cpu-offload sweep can
actually push into spillover regimes.
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* bump aiperf: pick up AJC's gap-closer stack on inferencex-agentx-mvp
FF to ai-dynamo/aiperf:ajc/inferencex-agentx-mvp tip (70fecb2e). Notable
fixes layered on top of our cjq/weka-live-assistant-responses base
(894062b65):
- fix(composer): clamp max_tokens >= 1 so tool-only weka turns (out:0)
don't get rejected by the server (matches kv-cache-tester semantics).
- feat(input): --max-context-length pre-filters oversized conversations
at dataset-load time, complementing the existing mid-turn overflow
recycle.
- agentic_replay: recycle queue now spans the full dataset with an
active-trace skip set; recycled lanes draw from full diversity without
ever spawning a duplicate concurrent session for the same trace.
- fix(scenario): cache-bust target -> FIRST_TURN_PREFIX preserves the
stable system+tools KV prefix while still differentiating lanes that
land on the same trace_id; closes most of the measured-vs-theoretical
cache-hit gap.
- refactor(accumulator): hoist realtime_snapshot's inner closures to
staticmethods to stay under the ergonomics cap (behavior unchanged).
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* kimi agentic: 2 TB cpu offload, rocm nightly image for MI355X
- kimik2.5_fp4_mi355x.sh, kimik2.5_int4_b200.sh: hardcode
TOTAL_CPU_DRAM_GB=2000 in the cpu offload branch (workflow default
600 GB clipped the offload tier well below physical capacity on both
SKUs).
- amd-master.yaml kimik2.5-fp4-mi355x-vllm: bump image to
vllm/vllm-openai-rocm:nightly-51f22dcfd068fe8f1e3192da2a1e825b930223cf
so SimpleCPUOffloadConnector actually works on ROCm — matches the
minimax MI355X pattern.
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* kimi b200 vllm: pin runner to b200-dgxc
The generic `b200` label pulls whatever B200 host has capacity at
dispatch time; the agentic-coding sweep needs the DGXC node (where the
minimax B200 and dsv4 B200 configs already land) so the cpu-offload
sweep points have predictable host DRAM.
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* kimi agentic: right-size cpu offload to host DRAM envelope
- B200: 1800 GB (was 2000 GB; host has ~1.8 TiB available for offload).
- MI355X: 2700 GB (was 2000 GB; host has ~2.7 TiB available for offload).
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* agentic build_replay_cmd: drop hardcoded --cache-bust system_prefix
The aiperf inferencex-agentx-mvp scenario was updated by AJC's stack to
auto-inject locked scenario fields (commit 91651257) and to require
cache_bust.target=first_turn_prefix (commit b55d8846). The line in
build_replay_cmd was still passing --cache-bust system_prefix
explicitly, which the new validator rejects with:
Value error, Scenario invariants violated (1 conflict):
- --cache-bust: got 'system_prefix', required 'first_turn_prefix'
Universal failure mode: every agentic-coding sweep point on every SKU
tripped this before vLLM even started. Drop the explicit flag and let
the scenario plugin auto-inject the correct target.
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* agentic build_replay_cmd: pass --tokenizer-trust-remote-code
aiperf's dataset manager loads the model tokenizer for trace-prompt
tokenization independent of --use-server-token-count (which only
disables the inference-side parser tokenizer). Models that ship a
custom tokenizer in their HF repo — kimi (amd/Kimi-K2.5-MXFP4,
moonshotai/Kimi-K2.5) is the immediate case — fail to load without
trust_remote_code=True:
TokenizerError: Failed to load tokenizer 'amd/Kimi-K2.5-MXFP4'
ValueError: The repository ... contains custom code which must be
executed to correctly load the model. Please pass
trust_remote_code=True to allow custom code to be run.
Set the flag unconditionally — it's benign for models that don't ship
custom tokenizer code, and aiperf's own error panel suggests exactly
this remedy.
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* kimi b200 int4: bump vllm to v0.20.2 to fix flashinfer MoE INT4 bug
v0.19.1 trips a bug in flashinfer_trtllm_mxint4_moe during warmup
profile_run on the agentic-coding path (max_model_len=131072 + prefix
caching enabled):
File "vllm/model_executor/layers/quantization/utils/flashinfer_mxint4_moe.py", line 264
).to(x.dtype)
AttributeError: 'list' object has no attribute 'to'
100% of agentic jobs failed at startup across all TP/conc/offload
combinations on v0.19.1. v0.20.x is reported to carry the flashinfer
return-shape fix. Trying v0.20.2 before falling back to disabling
VLLM_USE_FLASHINFER_MOE_INT4 (which would cost throughput).
Scope: only nvidia b200 kimi int4. MI355X is on the ROCm nightly image,
unaffected.
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
* aiperf: bump to cc-traces-weka-no-subagents-051226 dataset
Submodule pointer bump only. Points the agentic-coding scenario at the
new 949-trace no-subagents corpus uploaded to
semianalysisai/cc-traces-weka-no-subagents-051226 in place of the prior
739-trace cc-traces-weka-042026 full-subagent dataset.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* kimi-fp4-b200: wire up agentic-coding scenario + bump dataset to 949
Three threads in one commit since they all gate the FP4 B200 Kimi run:
- nvidia-master.yaml: add agentic-coding scenario to
kimik2.5-fp4-b200-vllm mirroring its fixed-seq-len TP layouts
(tp=8 conc=4, tp=4 conc 4..64) plus offloading variants for the
much-larger agentic ISLs. Bump image v0.17.0 -> v0.20.2 to match
the INT4 sibling (flashinfer fix) and runner b200 -> b200-dgxc.
- kimik2.5_fp4_b200.sh: hardcode TOTAL_CPU_DRAM_GB=1800 in the cpu
offload branch so the workflow input default (600) is overridden
to the B200 DGXC node's actual capacity. Mirrors INT4 launcher.
- benchmark_lib.sh: bump --num-dataset-entries 739 -> 949 and update
comments / log lines to reference the new no-subagents corpus
(semianalysisai/cc-traces-weka-no-subagents-051226). The aiperf
loader registration handles the actual HF repo swap; this is the
ceiling cap on traces loaded.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* kimi-fp4-b200: drop tp=4 from agentic-coding (OOM at 131k context)
Empirical: all 7 tp=4 jobs OOM'd in v1 dispatch (run 25759838764) with
'No available memory for the cache blocks'. At tp=4 the FP4 weights
take ~62 GB / GPU on B200's 180 GB, leaving ~100 GB headroom. With
--max-cudagraph-capture-size=2048 and max-model-len=131072 the graph
buffer reservation exhausts that headroom before any KV blocks can be
allocated.
Drop tp=4 from agentic-coding and align the tp=8 conc-list with the
INT4 B200 sibling (which has the same physical layout and works) so
the FP4/INT4 sweeps are directly comparable. The fixed-seq-len entries
keep tp=4 — they run at ISL=1024 where the KV footprint is tiny.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* agentic: remove AIPERF_DATASET_WEKA_PARALLEL_WORKERS=4 cap
The cap was added defensively to avoid 16 jobs * 16 workers = 256
reconstruction processes thrashing a shared HF cache on busy slurm
nodes. In practice each agentic job lands on its own --exclusive
allocation and owns the node, so the contention scenario doesn't
materialize. Letting aiperf fall back to its default auto-pick
(min(cpu_count-1, 16, num_traces)) restores ~4x parallelism on the
upfront tokenize-and-reconstruct phase, which is the dominant
first-fill cost on the 949-trace corpus.
The mmap cache (when enabled) makes this moot from the second job
onward, but speeds up the first-fill path that every fresh runner
still pays.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* aiperf: bump to flock-serialized mmap-cache populates
Submodule pointer bump only. Picks up the cross-process populate lock
so concurrent agentic jobs sharing a cache directory (e.g. all 10 jobs
in a B200 FP4 sweep pointed at /lustre/fsw/aiperf_mmap_cache) serialize
their tokenize+populate cycle instead of each repeating it. After the
first job populates, the other 9 wake on the lock, observe the cached
entry under the lock, and read from cache.
Co-Authored-…1 parent 3e4d6dd commit 370a162
47 files changed
Lines changed: 3752 additions & 513 deletions
File tree
- .github
- configs
- workflows
- benchmarks
- single_node/agentic
- runners
- utils
- agentic-benchmark
- analysis
- scripts
Some content is hidden
Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
239 | 239 | | |
240 | 240 | | |
241 | 241 | | |
| 242 | + | |
| 243 | + | |
| 244 | + | |
| 245 | + | |
242 | 246 | | |
243 | 247 | | |
244 | 248 | | |
| |||
327 | 331 | | |
328 | 332 | | |
329 | 333 | | |
330 | | - | |
331 | | - | |
332 | | - | |
333 | | - | |
334 | | - | |
335 | | - | |
336 | | - | |
337 | | - | |
338 | | - | |
339 | | - | |
340 | | - | |
341 | | - | |
342 | | - | |
343 | | - | |
344 | | - | |
345 | | - | |
346 | | - | |
347 | | - | |
348 | | - | |
349 | | - | |
350 | | - | |
351 | 334 | | |
352 | 335 | | |
353 | 336 | | |
| |||
399 | 382 | | |
400 | 383 | | |
401 | 384 | | |
402 | | - | |
403 | | - | |
| 385 | + | |
404 | 386 | | |
405 | 387 | | |
406 | 388 | | |
407 | | - | |
408 | | - | |
| 389 | + | |
409 | 390 | | |
410 | 391 | | |
411 | 392 | | |
| |||
420 | 401 | | |
421 | 402 | | |
422 | 403 | | |
423 | | - | |
424 | 404 | | |
425 | 405 | | |
426 | 406 | | |
427 | 407 | | |
428 | | - | |
429 | 408 | | |
430 | 409 | | |
431 | 410 | | |
| |||
448 | 427 | | |
449 | 428 | | |
450 | 429 | | |
| 430 | + | |
| 431 | + | |
| 432 | + | |
| 433 | + | |
| 434 | + | |
451 | 435 | | |
452 | 436 | | |
453 | 437 | | |
| |||
526 | 510 | | |
527 | 511 | | |
528 | 512 | | |
529 | | - | |
| 513 | + | |
| 514 | + | |
| 515 | + | |
| 516 | + | |
| 517 | + | |
530 | 518 | | |
531 | 519 | | |
532 | 520 | | |
| |||
545 | 533 | | |
546 | 534 | | |
547 | 535 | | |
| 536 | + | |
| 537 | + | |
| 538 | + | |
| 539 | + | |
| 540 | + | |
| 541 | + | |
| 542 | + | |
| 543 | + | |
| 544 | + | |
| 545 | + | |
| 546 | + | |
| 547 | + | |
548 | 548 | | |
549 | 549 | | |
550 | 550 | | |
| |||
568 | 568 | | |
569 | 569 | | |
570 | 570 | | |
571 | | - | |
| 571 | + | |
| 572 | + | |
| 573 | + | |
| 574 | + | |
| 575 | + | |
| 576 | + | |
572 | 577 | | |
573 | 578 | | |
574 | 579 | | |
| |||
589 | 594 | | |
590 | 595 | | |
591 | 596 | | |
| 597 | + | |
| 598 | + | |
| 599 | + | |
| 600 | + | |
| 601 | + | |
| 602 | + | |
| 603 | + | |
| 604 | + | |
592 | 605 | | |
593 | 606 | | |
594 | 607 | | |
| |||
611 | 624 | | |
612 | 625 | | |
613 | 626 | | |
614 | | - | |
615 | | - | |
616 | | - | |
617 | | - | |
618 | | - | |
619 | | - | |
620 | | - | |
621 | | - | |
622 | | - | |
623 | | - | |
624 | | - | |
625 | | - | |
626 | | - | |
627 | | - | |
628 | | - | |
629 | | - | |
630 | | - | |
631 | | - | |
632 | | - | |
633 | | - | |
634 | | - | |
635 | | - | |
636 | | - | |
637 | | - | |
638 | | - | |
639 | 627 | | |
640 | 628 | | |
641 | 629 | | |
| |||
660 | 648 | | |
661 | 649 | | |
662 | 650 | | |
663 | | - | |
| 651 | + | |
| 652 | + | |
664 | 653 | | |
665 | 654 | | |
666 | 655 | | |
| |||
679 | 668 | | |
680 | 669 | | |
681 | 670 | | |
| 671 | + | |
| 672 | + | |
| 673 | + | |
| 674 | + | |
| 675 | + | |
| 676 | + | |
| 677 | + | |
| 678 | + | |
682 | 679 | | |
683 | 680 | | |
684 | | - | |
| 681 | + | |
| 682 | + | |
685 | 683 | | |
686 | 684 | | |
687 | 685 | | |
| |||
700 | 698 | | |
701 | 699 | | |
702 | 700 | | |
| 701 | + | |
| 702 | + | |
| 703 | + | |
| 704 | + | |
| 705 | + | |
| 706 | + | |
| 707 | + | |
| 708 | + | |
| 709 | + | |
703 | 710 | | |
704 | 711 | | |
705 | 712 | | |
| |||
1636 | 1643 | | |
1637 | 1644 | | |
1638 | 1645 | | |
1639 | | - | |
1640 | | - | |
1641 | | - | |
1642 | | - | |
1643 | | - | |
1644 | | - | |
1645 | | - | |
| 1646 | + | |
| 1647 | + | |
| 1648 | + | |
| 1649 | + | |
| 1650 | + | |
| 1651 | + | |
| 1652 | + | |
1646 | 1653 | | |
1647 | 1654 | | |
1648 | 1655 | | |
| |||
0 commit comments