Commit 72cf856

cquil11 and claude committed
feat(agentic): add qwen3.5-fp8-h100-sglang-agentic recipe
New agentic-coding recipe targeting H100 (runner: h100-dgxc) running
Qwen3.5-397B-A17B FP8 via SGLang v0.5.12-cu130. Mirrors the b300 SGLang
agentic shape with H100-appropriate kernel flags:

- attention-backend: flashinfer (sm_90; trtllm_mha is Blackwell-only).
- mem-fraction-static 0.75 (vs 0.80 on B300) and chunked-prefill-size
  8192 (vs 16384) to fit Qwen-397B FP8 weights + KV in H100's 80 GB
  HBM3 at TP=8.
- conc-list capped at 16 across both arms; agentic ISLs hit ~80k-200k
  on the 256k corpus and Qwen at conc=32 OOM'd in the fixed_seq_len
  sweep at lower ISL too.

Recipe wires WEKA_LOADER_OVERRIDE=semianalysis_cc_traces_weka_with_subagents_256k
so the 256k-capped variant (470 traces, max in+out <= 256k) is used
instead of the unfiltered 052726 corpus (which has up to ~1M-token
requests the H100 max_model_len=131k server would reject).

Two sweep arms:
- none: --disable-radix-cache, conc-list [1, 2, 4, 8, 16]
- hicache: --enable-hierarchical-cache + sized from TOTAL_CPU_DRAM_GB,
  conc-list [4, 8, 16] (capped where hicache stabilizes)

Yaml key is qwen3.5-fp8-h100-sglang-agentic; script filename is the bare
`qwen3.5_fp8_h100.sh` under benchmarks/single_node/agentic/, because the
h100 launchers don't support framework-tagged script names. This matches
the precedent set by qwen3.5_fp8_b200.sh (which is the sglang-agentic
recipe under the same bare name).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
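The interaction between the corpus cap and the server's context length can be sketched as below. This is only an illustration: it assumes a request is accepted when its input plus output tokens fit within the context length, the helper `fits_context` is hypothetical, and the token counts are made up rather than drawn from the trace corpus.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption for illustration: a request is accepted only when its input
# plus output token count fits within the server's context length.
fits_context() {
    local in_tokens=$1 out_tokens=$2 context_len=$3
    (( in_tokens + out_tokens <= context_len ))
}

MAX_MODEL_LEN=131072   # default context length used by the recipe

# Hypothetical (input, output) token counts, not taken from the corpus.
for pair in "80000 4000" "200000 8000" "1000000 2000"; do
    read -r in_tok out_tok <<< "$pair"
    if fits_context "$in_tok" "$out_tok" "$MAX_MODEL_LEN"; then
        echo "in=$in_tok out=$out_tok: fits"
    else
        echo "in=$in_tok out=$out_tok: rejected"
    fi
done
```

Under that assumption, a request near the 256k cap is still rejected at a 131k context, so the capped corpus lowers the rejection rate without eliminating it.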
1 parent 88a1153 commit 72cf856

2 files changed

Lines changed: 153 additions & 0 deletions

.github/configs/nvidia-master.yaml

Lines changed: 25 additions & 0 deletions
@@ -9399,6 +9399,31 @@ qwen3.5-fp8-h100-sglang:
       search-space:
       - { tp: 8, ep: 8, conc-start: 4, conc-end: 32 }

+# Diverged from qwen3.5-fp8-h100-sglang (agentic-coding sibling). Reasons below;
+# the original qwen3.5-fp8-h100-sglang entry stays byte-identical to origin/main
+# so its fixed-seq-len sweep is unaffected.
+# - scenarios: replaced fixed-seq-len with agentic-coding.
+# - runner: 'h100' -> 'h100-dgxc' (agentic runs need the dgxc-slurm cluster).
+# Image is identical to the base entry (lmsysorg/sglang:v0.5.12-cu130).
+# CONC range conservative for H100's 80 GB HBM3 under the long-ISL with-
+# subagents corpus. hicache arm capped at conc 16 since high-conc + hicache
+# tends to flake on first runs and conc 16 covers the cliff. The bench script
+# sets WEKA_LOADER_OVERRIDE to the 256k-capped corpus variant.
+qwen3.5-fp8-h100-sglang-agentic:
+  image: lmsysorg/sglang:v0.5.12-cu130
+  model: Qwen/Qwen3.5-397B-A17B-FP8
+  model-prefix: qwen3.5
+  runner: h100-dgxc
+  precision: fp8
+  framework: sglang
+  multinode: false
+  scenarios:
+    agentic-coding:
+    - duration: 1800
+      search-space:
+      - { tp: 8, ep: 8, offloading: none, conc-list: [1, 2, 4, 8, 16] }
+      - { tp: 8, ep: 8, offloading: hicache, conc-list: [4, 8, 16] }
+
 qwen3.5-fp8-h100-sglang-mtp:
   image: lmsysorg/sglang:v0.5.12-cu130
   model: Qwen/Qwen3.5-397B-A17B-FP8
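
The two search-space arms of the new entry can be read as a list of (offloading, concurrency) runs. The sketch below enumerates them; how the workflow actually expands `search-space` is not shown in this commit, so the one-run-per-pair expansion and the names `ARMS` and `expand_arms` are assumptions for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Each arm is "offloading:space-separated conc-list", copied from the YAML entry.
ARMS=(
    "none:1 2 4 8 16"
    "hicache:4 8 16"
)

# Print one "offloading conc" pair per benchmark run (assumed expansion).
expand_arms() {
    local arm offloading conc
    for arm in "${ARMS[@]}"; do
        offloading=${arm%%:*}
        # Intentionally unquoted so the conc-list splits on spaces.
        for conc in ${arm#*:}; do
            echo "$offloading $conc"
        done
    done
}

expand_arms
```

Under this reading the entry yields eight runs, five for the `none` arm and three for `hicache`, all at tp=8 and ep=8.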

benchmarks/single_node/agentic/qwen3.5_fp8_h100.sh

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
+#!/usr/bin/env bash
+set -euo pipefail
+set -x
+
+# Agentic trace replay benchmark for Qwen3.5 FP8 on H100 using SGLang.
+#
+# H100 has 80 GB HBM3 (vs B300's 192 GB), so weights + KV fit tighter.
+# Mem-fraction-static lowered to 0.75 and chunked-prefill-size halved to
+# 8192 (mirrors fixed_seq_len/qwen3.5_fp8_h100.sh). Attention backend is
+# flashinfer (sm_90); the trtllm_mha path is Blackwell-only.
+#
+# Required env vars:
+#   MODEL, TP, CONC, OFFLOADING, TOTAL_CPU_DRAM_GB, RESULT_DIR, DURATION, EP_SIZE
+#
+# OFFLOADING values:
+#   none    - SGLang GPU KV only with radix cache disabled.
+#   hicache - SGLang HiCache with local CPU hierarchical cache.
+
+source "$(dirname "$0")/../../benchmark_lib.sh"
+
+check_env_vars MODEL TP CONC OFFLOADING TOTAL_CPU_DRAM_GB RESULT_DIR DURATION EP_SIZE
+
+SCHEDULER_RECV_INTERVAL=${SCHEDULER_RECV_INTERVAL:-10}
+if [ -z "${MAX_MODEL_LEN:-}" ] || [ "$MAX_MODEL_LEN" = "0" ]; then
+    MAX_MODEL_LEN=131072
+fi
+
+if [[ -n "${SLURM_JOB_ID:-}" ]]; then
+    echo "JOB $SLURM_JOB_ID running on ${SLURMD_NODENAME:-unknown}"
+fi
+
+if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi
+nvidia-smi
+
+# ---- Resolve traces and install deps ----------------------------------------
+# H100 max_model_len caps at 131k (HBM-bound). The unfiltered with-subagents
+# corpus has requests up to ~1M proxy tokens that the server would reject.
+# Switch to the 256k-capped variant (470 traces, max in+out <= 256k); even
+# at 131k context, the rejection rate is much lower than against the
+# unfiltered corpus.
+export WEKA_LOADER_OVERRIDE=semianalysis_cc_traces_weka_with_subagents_256k
+
+resolve_trace_source
+install_agentic_deps
+
+# ---- Server config ----------------------------------------------------------
+SERVER_LOG="$RESULT_DIR/server.log"
+mkdir -p "$RESULT_DIR"
+
+CACHE_ARGS=()
+case "$OFFLOADING" in
+    none)
+        CACHE_ARGS=(--disable-radix-cache)
+        ;;
+    hicache)
+        # HiCache extends RadixAttention, so do not pass --disable-radix-cache.
+        # Qwen3.5's hybrid GDN/Mamba path allocates two HiCache host pools per TP
+        # rank (one KV, one Mamba), so the per-rank-per-pool split is done here.
+        # The workflow-provided TOTAL_CPU_DRAM_GB is overridden below: H100 nodes
+        # typically expose ~1.5-2 TB usable CPU DRAM, so the default is 1500 GB.
+        TOTAL_CPU_DRAM_GB="${HICACHE_TOTAL_CPU_DRAM_GB:-1500}"
+        HICACHE_HOST_POOL_COUNT="${HICACHE_HOST_POOL_COUNT:-2}"
+        HICACHE_WRITE_POLICY="${HICACHE_WRITE_POLICY:-write_through_selective}"
+        HICACHE_SIZE_GB="${HICACHE_SIZE_GB:-$((TOTAL_CPU_DRAM_GB / TP / HICACHE_HOST_POOL_COUNT))}"
+        if [ "$HICACHE_SIZE_GB" -lt 1 ]; then
+            echo "Error: computed HICACHE_SIZE_GB=$HICACHE_SIZE_GB from TOTAL_CPU_DRAM_GB=$TOTAL_CPU_DRAM_GB, TP=$TP, HICACHE_HOST_POOL_COUNT=$HICACHE_HOST_POOL_COUNT" >&2
+            exit 1
+        fi
+        echo "HiCache CPU pool: ${HICACHE_SIZE_GB} GB per rank per host pool across TP=${TP}, host_pool_count=${HICACHE_HOST_POOL_COUNT}"
+        CACHE_ARGS=(
+            --page-size 64
+            --enable-hierarchical-cache
+            --hicache-size "$HICACHE_SIZE_GB"
+            --hicache-io-backend kernel
+            --hicache-mem-layout page_first
+            --hicache-write-policy "$HICACHE_WRITE_POLICY"
+        )
+        ;;
+    *)
+        echo "Error: unsupported OFFLOADING value '$OFFLOADING' (expected one of: none, hicache)" >&2
+        exit 1
+        ;;
+esac
+
+echo "Starting SGLang server..."
+export PYTHONNOUSERSITE=1
+
+{ set +x; } 2>/dev/null
+SGLANG_CMD=(
+    python3 -m sglang.launch_server
+    --model-path="$MODEL"
+    --host=0.0.0.0
+    --port="$PORT"
+    --served-model-name "Qwen/Qwen3.5-397B-A17B-FP8"
+    --trust-remote-code
+    --tensor-parallel-size="$TP"
+    --data-parallel-size=1
+    --expert-parallel-size="$EP_SIZE"
+    --quantization fp8
+    --kv-cache-dtype fp8_e4m3
+    --mamba-ssm-dtype bfloat16
+    --attention-backend flashinfer
+    --enable-flashinfer-allreduce-fusion
+    --cuda-graph-max-bs "$CONC"
+    --max-running-requests "$CONC"
+    --max-prefill-tokens 8192
+    --chunked-prefill-size 8192
+    --mem-fraction-static 0.75
+    --stream-interval 50
+    --scheduler-recv-interval "$SCHEDULER_RECV_INTERVAL"
+    --tokenizer-worker-num 6
+    --tokenizer-path "$MODEL"
+    --context-length "$MAX_MODEL_LEN"
+    --enable-metrics
+    "${CACHE_ARGS[@]}"
+)
+printf '%q ' "${SGLANG_CMD[@]}" | tee "$RESULT_DIR/sglang_command.txt"
+printf '\n' | tee -a "$RESULT_DIR/sglang_command.txt"
+"${SGLANG_CMD[@]}" > "$SERVER_LOG" 2>&1 &
+SERVER_PID=$!
+echo "Server PID: $SERVER_PID"
+
+wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"
+
+# ---- Run benchmark ----------------------------------------------------------
+build_replay_cmd "$RESULT_DIR"
+
+run_agentic_replay_and_write_outputs "$RESULT_DIR"
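
The HiCache sizing in the `hicache` branch is plain integer division, which is easy to misread. A minimal sketch of that arithmetic follows; the helper name `hicache_size_gb` is hypothetical, and the inputs are the script's defaults plus one deliberately undersized case.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Per-rank, per-pool HiCache size in GB, using the same integer division
# as the recipe: total / TP / pool count.
hicache_size_gb() {
    local total_gb=$1 tp=$2 pool_count=$3
    echo $(( total_gb / tp / pool_count ))
}

# Script defaults: 1500 GB total, TP=8, two host pools (KV and Mamba).
hicache_size_gb 1500 8 2    # prints 93

# A total too small for the split floors to 0, which the recipe treats as an error.
hicache_size_gb 8 8 2       # prints 0
```

Because each division floors, the default yields 93 GB per rank per pool. Taking the script's own log wording ("per rank per host pool") at face value, that is 93 x 8 x 2 = 1488 GB in total, slightly under the 1500 GB budget.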
