Commit 405a86a

cquil11 and claude committed
agentic: bump aiperf Configure Profiling timeout 900s -> 1800s
R3 H200 marathon hit aiperf's 900s Configure Profiling timeout on every job (14/14).

Root cause: H200 /tmp is ~3x slower than B300's, and 14 parallel jobs contending for the same shared /tmp pushed dataset config (load + reconstruct + 62GB mmap + 4min inputs.json) past the 900s budget. vLLM started cleanly in every case; aiperf itself timed out during its own dataset materialization step.

Bump the timeout to 1800s (30 min) so worst-case slow-/tmp + contention scenarios still fit. Post-setup measurement window unaffected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
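As a rough sanity check on the budget, the figures quoted in this commit can be combined in a few lines of shell arithmetic. This is only a sketch: it assumes setup time scales linearly with the /tmp slowdown and takes the upper end of the 4-5 min baseline from the code comment in the diff. The variable names are illustrative and do not appear in the repository.

```shell
# Figures taken from the commit message and the code comment in the diff.
baseline_s=$((5 * 60))   # upper end of the 4-5 min setup time on fast /tmp (B300)
slowdown=3               # "~3x slower" /tmp on H200
old_timeout_s=900
new_timeout_s=1800

# Projected setup time from the /tmp slowdown alone, before any contention
# between parallel jobs is added on top (linear scaling is an assumption).
projected_s=$((baseline_s * slowdown))

echo "projected setup: ${projected_s}s"
echo "headroom at old timeout: $((old_timeout_s - projected_s))s"
echo "headroom at new timeout: $((new_timeout_s - projected_s))s"
```

Under these assumptions the slowdown alone already uses up the whole 900s budget, so any contention pushes jobs over it, while 1800s leaves room for contention on top.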
1 parent cc922de commit 405a86a

1 file changed

Lines changed: 7 additions & 6 deletions

benchmarks/benchmark_lib.sh

@@ -956,14 +956,15 @@ build_replay_cmd() {
 
 export AIPERF_DATASET_WEKA_LIVE_ASSISTANT_RESPONSES=1
 # Dataset configuration (load + reconstruct + inputs.json + mmap)
-# routinely takes 4-5 min for the 949-trace weka corpus. The default
-# 300s timeout flips parallel jobs into TimeoutError mid-setup when
-# many launchers contend for the shared HF cache + tmpfs. Bump to
-# 900s — the post-setup measurement window is unaffected.
-export AIPERF_DATASET_CONFIGURATION_TIMEOUT=900
+# routinely takes 4-5 min for the 949-trace weka corpus on fast /tmp
+# (B300) but can stretch to 14 min on slower /tmp + parallel contention
+# (observed on H200 where all 14 R3 jobs hit aiperf's 900s Configure
+# Profiling timeout simultaneously). Bump to 1800s to absorb 3x
+# worst-case slowdown — the post-setup measurement window is unaffected.
+export AIPERF_DATASET_CONFIGURATION_TIMEOUT=1800
 # aiperf validates that SERVICE_PROFILE_CONFIGURE_TIMEOUT >=
 # DATASET_CONFIGURATION_TIMEOUT at startup. Bump it in lockstep.
-export AIPERF_SERVICE_PROFILE_CONFIGURE_TIMEOUT=900
+export AIPERF_SERVICE_PROFILE_CONFIGURE_TIMEOUT=1800
 REPLAY_CMD="aiperf profile --scenario inferencex-agentx-mvp"
 REPLAY_CMD+=" --url http://localhost:$PORT"
 REPLAY_CMD+=" --endpoint /v1/chat/completions"
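
The diff changes two values that must move together, because the code comment says aiperf checks SERVICE_PROFILE_CONFIGURE_TIMEOUT >= DATASET_CONFIGURATION_TIMEOUT at startup. One possible way to keep them from drifting apart is to derive both from a single value. The sketch below is not part of benchmark_lib.sh: the helper name `set_aiperf_setup_timeouts` is hypothetical, and only the two environment variable names come from the diff.

```shell
# Hypothetical helper (not in benchmark_lib.sh): set both aiperf setup
# timeouts from one value so the lockstep requirement cannot be broken.
set_aiperf_setup_timeouts() {
    local seconds="$1"

    # Accept only a positive whole number of seconds.
    case "$seconds" in
        ''|*[!0-9]*)
            echo "invalid timeout: '$seconds'" >&2
            return 1
            ;;
    esac
    if [ "$seconds" -le 0 ]; then
        echo "timeout must be greater than zero" >&2
        return 1
    fi

    export AIPERF_DATASET_CONFIGURATION_TIMEOUT="$seconds"
    # Per the comment in the diff, this value must be >= the dataset
    # timeout; using the same number satisfies that by construction.
    export AIPERF_SERVICE_PROFILE_CONFIGURE_TIMEOUT="$seconds"
}

set_aiperf_setup_timeouts 1800
```

With a helper like this, a future bump would touch one number instead of two, at the cost of an extra indirection in `build_replay_cmd()`.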
