benchmarks(agentic): disable DCGM gpu_telemetry in aiperf invocation

cquil11 · claude · cquil11 · commit fc5a792a869d · 2026-06-03T12:34:14.000-05:00
aiperf's GpuMetricTimeSeries.append_snapshot freezes the metric schema on
the first DCGM scrape. Any optional field that is None on that scrape
(xid_errors most commonly, also power_violation, encoder_utilization)
is missing from the frozen schema, so append_snapshot raises KeyError
when the field first appears mid-run. The exception is caught at
records_manager.py:609 so the run completes, but every late telemetry
sample is dropped silently and the error count grows.
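
The failure mode can be reproduced in isolation. The sketch below is a
hypothetical stand-in, not aiperf's actual code: the class name
FrozenSchemaSeries, its internals, and the gpu_utilization metric name are
assumptions made for illustration. Only the behavior (schema fixed on the
first snapshot, KeyError on a late field, exception swallowed by the
caller) follows the description above.

```python
class FrozenSchemaSeries:
    """Hypothetical stand-in for a time series whose schema is fixed
    by the first snapshot. Not aiperf's GpuMetricTimeSeries."""

    def __init__(self):
        # metric name -> list of values; created on the first append only
        self._columns = None

    def append_snapshot(self, metrics):
        # Fields reported as None are treated as absent from the snapshot.
        present = {k: v for k, v in metrics.items() if v is not None}
        if self._columns is None:
            # Schema is frozen here and never extended afterwards.
            self._columns = {name: [] for name in present}
        for name, value in present.items():
            # Raises KeyError for any name missing from the first snapshot.
            self._columns[name].append(value)


series = FrozenSchemaSeries()
# First scrape: xid_errors is None, so it never enters the schema.
series.append_snapshot({"gpu_utilization": 91.0, "xid_errors": None})

dropped = 0
try:
    # Later scrape: xid_errors appears for the first time.
    series.append_snapshot({"gpu_utilization": 88.0, "xid_errors": 1.0})
except KeyError:
    # Mirrors the caller swallowing the error: the run continues,
    # but this sample is lost and the error count grows.
    dropped += 1
```

A dynamic-schema fix would create the missing column on first sight
instead of raising; this commit avoids the code path altogether.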

We don't consume the gpu_telemetry_export.jsonl artifact in downstream
processing (process_agentic_result.py only reads aiperf's server-metrics
output and the per-request profile export). Server-side /metrics from
vLLM/sglang flows through a separate path and is unaffected: KV cache
usage, prefix cache hit rate, throughput, etc. still populate.

Until the aiperf upstream patch lands (dynamic schema extension in
telemetry_models.py), --no-gpu-telemetry sidesteps the bug entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/benchmarks/benchmark_lib.sh b/benchmarks/benchmark_lib.sh
@@ -1048,6 +1048,14 @@ build_replay_cmd() {
     # CPU on minimax-m2.5 at high concurrency. Lossless for vLLM (server
     # usage is authoritative).
     REPLAY_CMD+=" --use-server-token-count"
+    # Disable DCGM GPU telemetry collection. aiperf's GpuMetricTimeSeries
+    # freezes its metric schema on the first DCGM scrape, then KeyErrors when
+    # an optional field (xid_errors, power_violation, encoder_utilization)
+    # first appears mid-run. We don't consume the gpu_telemetry artifact in
+    # downstream processing, and the server-metrics path (Prometheus /metrics
+    # from vLLM) is unaffected by this flag and still gives us KV usage,
+    # prefix cache hit rate, etc.
+    REPLAY_CMD+=" --no-gpu-telemetry"
     # aiperf's dataset manager (separate from the inference parser) loads
     # the model's tokenizer for trace-prompt tokenization regardless of
     # --use-server-token-count. Models like kimi (amd/Kimi-K2.5-MXFP4,