Commit b28e464
committed
feat(gpu): add gpu.nccl.* metrics to metadata.csv
Adds 5 new GPU NCCL collective metrics emitted by the Datadog Agent
NCCL check (pkg/collector/corechecks/gpu/nccl) under the gpu.nccl.*
namespace (migrated from the legacy nccl.collective.* prefix):
gpu.nccl.collective.algo_bandwidth_gbps - GB/s algorithmic bandwidth per rank
gpu.nccl.collective.bus_bandwidth_gbps - GB/s bus bandwidth per rank
gpu.nccl.collective.exec_time_us - µs execution time per rank
gpu.nccl.collective.msg_size_bytes - bytes message size per rank
gpu.nccl.rank.seconds_since_last_event - seconds since last event (hang detection)
Inserted alphabetically between gpu.memory.temperature and gpu.nvlink.count.active.1 parent e2e1fb5 commit b28e464
1 file changed
Lines changed: 5 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
35 | 35 | | |
36 | 36 | | |
37 | 37 | | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
38 | 43 | | |
39 | 44 | | |
40 | 45 | | |
| |||
0 commit comments