Commit fc5a792
benchmarks(agentic): disable DCGM gpu_telemetry in aiperf invocation
aiperf's GpuMetricTimeSeries.append_snapshot freezes the metric schema on
the first DCGM scrape; any optional field that's None on the first scrape
(xid_errors most commonly, also power_violation, encoder_utilization)
then raises KeyError when it first appears mid-run. The exception is
caught at records_manager.py:609 so the run completes, but every late
telemetry sample is dropped silently and the error count grows.
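The failure mode above can be reproduced in isolation. What follows is a minimal sketch, not aiperf's actual code: `FrozenSchemaSeries` and `ExtensibleSeries` are hypothetical stand-ins for `GpuMetricTimeSeries`, the metric values are made up, and backfilling earlier rows with `None` is only one assumed way the "dynamic schema extension" could work.

```python
from typing import Optional

# Illustrative scrapes: xid_errors is None at first and appears mid-run.
FIRST_SCRAPE = {"gpu_utilization": 50.0, "xid_errors": None}
LATE_SCRAPE = {"gpu_utilization": 55.0, "xid_errors": 1.0}


class FrozenSchemaSeries:
    """Hypothetical stand-in: columns are fixed by the first snapshot."""

    def __init__(self) -> None:
        self.columns: Optional[dict[str, list]] = None

    def append_snapshot(self, metrics: dict[str, Optional[float]]) -> None:
        present = {k: v for k, v in metrics.items() if v is not None}
        if self.columns is None:
            # Schema freezes here: only fields that are non-None on the
            # first scrape ever get a column.
            self.columns = {name: [] for name in present}
        for name, value in present.items():
            # Raises KeyError for a field that first appears later.
            self.columns[name].append(value)


class ExtensibleSeries:
    """Assumed fix: add a column when a new field appears, backfilled with None."""

    def __init__(self) -> None:
        self.columns: dict[str, list] = {}
        self.length = 0  # number of snapshots appended so far

    def append_snapshot(self, metrics: dict[str, Optional[float]]) -> None:
        for name, value in metrics.items():
            if value is not None and name not in self.columns:
                # Pad the new column so it lines up with earlier snapshots.
                self.columns[name] = [None] * self.length
        for name, column in self.columns.items():
            column.append(metrics.get(name))
        self.length += 1
```

Feeding `FIRST_SCRAPE` and then `LATE_SCRAPE` to `FrozenSchemaSeries` raises `KeyError` on the second call, while `ExtensibleSeries` accepts both and keeps every column the same length.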
We don't consume the gpu_telemetry_export.jsonl artifact in downstream
processing (process_agentic_result.py only reads aiperf's server-metrics
output and the per-request profile export). Server-side /metrics from
vLLM/sglang flows through a separate path and is unaffected: KV cache
usage, prefix cache hit rate, throughput, etc. still populate.
Until the aiperf upstream patch lands (dynamic schema extension in
telemetry_models.py), --no-gpu-telemetry sidesteps the bug entirely.
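One way the workaround might be wired into a script that assembles the aiperf command line is sketched below. Only the `--no-gpu-telemetry` flag comes from this commit; the helper `build_aiperf_argv`, its `gpu_telemetry` parameter, and the placeholder base command are hypothetical and do not reflect the actual 8-line change.

```python
NO_GPU_TELEMETRY_FLAG = "--no-gpu-telemetry"  # flag named in this commit


def build_aiperf_argv(base_args: list[str], gpu_telemetry: bool = False) -> list[str]:
    """Return a copy of base_args with GPU telemetry disabled unless requested.

    Hypothetical helper: the real invocation and its other arguments are
    not shown in this commit.
    """
    argv = list(base_args)  # do not mutate the caller's list
    if not gpu_telemetry and NO_GPU_TELEMETRY_FLAG not in argv:
        argv.append(NO_GPU_TELEMETRY_FLAG)
    return argv


# Placeholder base command; the remaining aiperf arguments are omitted.
argv = build_aiperf_argv(["aiperf"])
```

Keeping the flag behind a single switch would make it easy to re-enable telemetry once the upstream fix is available.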
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d7841d8 commit fc5a792
1 file changed
Lines changed: 8 additions & 0 deletions
Diff: 8 lines added as new lines 1051-1058; the surrounding context lines 1048-1050 and 1051-1053 (now 1059-1061) are unchanged.