Commit b28e464

feat(gpu): add gpu.nccl.* metrics to metadata.csv
Adds 5 new GPU NCCL collective metrics emitted by the Datadog Agent NCCL check (pkg/collector/corechecks/gpu/nccl) under the gpu.nccl.* namespace (migrated from the legacy nccl.collective.* prefix):

- gpu.nccl.collective.algo_bandwidth_gbps - GB/s algorithmic bandwidth per rank
- gpu.nccl.collective.bus_bandwidth_gbps - GB/s bus bandwidth per rank
- gpu.nccl.collective.exec_time_us - µs execution time per rank
- gpu.nccl.collective.msg_size_bytes - bytes message size per rank
- gpu.nccl.rank.seconds_since_last_event - seconds since last event (hang detection)

Inserted alphabetically between gpu.memory.temperature and gpu.nvlink.count.active.
1 parent e2e1fb5 commit b28e464

1 file changed

Lines changed: 5 additions & 0 deletions

gpu/metadata.csv

@@ -35,6 +35,11 @@ gpu.memory.free,gauge,16,byte,,Unallocated device memory (in bytes).,0,gpu,memor
 gpu.memory.limit,gauge,16,byte,,The maximum amount of memory a process/container/device could allocate,0,gpu,memory.limit,,
 gpu.memory.reserved,gauge,16,byte,,Device memory (in bytes) reserved for system use (driver or firmware)..,0,gpu,memory.reserved,,
 gpu.memory.temperature,gauge,16,degree celsius,,Temperature of the memory chip,0,gpu,memory.temperature,,
+gpu.nccl.collective.algo_bandwidth_gbps,gauge,16,gigabyte,second,Algorithmic bandwidth of a collective operation per rank,0,gpu,,,
+gpu.nccl.collective.bus_bandwidth_gbps,gauge,16,gigabyte,second,Bus bandwidth of a collective operation per rank,0,gpu,,,
+gpu.nccl.collective.exec_time_us,gauge,16,microsecond,,Execution time of a collective operation per rank,0,gpu,,,
+gpu.nccl.collective.msg_size_bytes,gauge,16,byte,,Message size of a collective operation per rank,0,gpu,,,
+gpu.nccl.rank.seconds_since_last_event,gauge,16,second,,Seconds since the last NCCL event was received for a rank. Used for hang detection.,0,gpu,,,
 gpu.nvlink.count.active,gauge,16,,,Number of active nvlinks for the device,0,gpu,,,
 gpu.nvlink.count.inactive,gauge,16,,,Number of inactive nvlinks for the device,0,gpu,,,
 gpu.nvlink.count.total,gauge,16,,,Number of total nvlinks for the device,0,gpu,,,
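
The commit message says the new rows were inserted alphabetically. Below is a minimal sketch of how a reviewer might check that claim and confirm each row has the same number of fields. The helper name `check_rows` is hypothetical, and the expected width of 11 columns is inferred from the rows in this hunk, not taken from a published schema.

```python
import csv
import io

# Rows copied from the hunk in this commit: one context row on each side
# of the five added gpu.nccl.* rows.
ROWS = """\
gpu.memory.temperature,gauge,16,degree celsius,,Temperature of the memory chip,0,gpu,memory.temperature,,
gpu.nccl.collective.algo_bandwidth_gbps,gauge,16,gigabyte,second,Algorithmic bandwidth of a collective operation per rank,0,gpu,,,
gpu.nccl.collective.bus_bandwidth_gbps,gauge,16,gigabyte,second,Bus bandwidth of a collective operation per rank,0,gpu,,,
gpu.nccl.collective.exec_time_us,gauge,16,microsecond,,Execution time of a collective operation per rank,0,gpu,,,
gpu.nccl.collective.msg_size_bytes,gauge,16,byte,,Message size of a collective operation per rank,0,gpu,,,
gpu.nccl.rank.seconds_since_last_event,gauge,16,second,,Seconds since the last NCCL event was received for a rank. Used for hang detection.,0,gpu,,,
gpu.nvlink.count.active,gauge,16,,,Number of active nvlinks for the device,0,gpu,,,
"""

# Assumption: every row has 11 comma-separated fields, as the rows above do.
EXPECTED_COLUMNS = 11


def check_rows(text, expected_columns=EXPECTED_COLUMNS):
    """Return (is_sorted, names_with_wrong_width) for CSV text without a header."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    names = [row[0] for row in rows]
    wrong_width = [row[0] for row in rows if len(row) != expected_columns]
    return names == sorted(names), wrong_width


if __name__ == "__main__":
    is_sorted, wrong_width = check_rows(ROWS)
    print("sorted by metric name:", is_sorted)
    print("rows with unexpected column count:", wrong_width)
```

The check compares metric names with plain string ordering, which may differ from whatever ordering the repository's own validation applies.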
