Semantic conventions for GPU metrics

Status: Development

GPU metrics hw.gpu.*

Graphics Processing Unit (discrete).

hw.type MUST be set to "gpu".

All GPU metrics may include the following attributes:

Attributes:

| Key | Stability | Requirement Level | Value Type | Description | Example Values |
| --- | --- | --- | --- | --- | --- |
| hw.id | Development | Required | string | An identifier for the hardware component, unique within the monitored host | win32battery_battery_testsysa33_1 |
| hw.driver_version | Development | Recommended | string | Driver version for the hardware component | 10.2.1-3 |
| hw.firmware_version | Development | Recommended | string | Firmware version of the hardware component | 2.0.1 |
| hw.model | Development | Recommended | string | Descriptive model name of the hardware component | PERC H740P; Intel(R) Core(TM) i7-10700K; Dell XPS 15 Battery |
| hw.name | Development | Recommended | string | An easily-recognizable name for the hardware component | eth0 |
| hw.parent | Development | Recommended | string | Unique identifier of the parent component (typically the hw.id attribute of the enclosure, or disk controller) | dellStorage_perc_0 |
| hw.serial_number | Development | Recommended | string | Serial number of the hardware component | CNFCP0123456789 |
| hw.vendor | Development | Recommended | string | Vendor name of the hardware component | Dell; HP; Intel; AMD; LSI; Lenovo |
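
The shared attribute set above could be assembled as in the following minimal sketch. It uses plain Python dictionaries rather than any telemetry SDK. The helper name gpu_attributes, its keyword-argument convention, and the gpu0 identifier are hypothetical and are not part of the convention.

```python
# Attribute keys come from the table above; hw.id is the only Required key.
RECOMMENDED_KEYS = (
    "hw.driver_version",
    "hw.firmware_version",
    "hw.model",
    "hw.name",
    "hw.parent",
    "hw.serial_number",
    "hw.vendor",
)


def gpu_attributes(hw_id, **recommended):
    """Build the shared attribute dict for one GPU (hypothetical helper).

    Keyword names replace the first dot with an underscore,
    for example hw_vendor="AMD" becomes the key "hw.vendor".
    """
    if not hw_id:
        raise ValueError("hw.id is required and must be non-empty")
    attrs = {"hw.id": hw_id}
    for kw, value in recommended.items():
        key = kw.replace("_", ".", 1)  # hw_driver_version -> hw.driver_version
        if key not in RECOMMENDED_KEYS:
            raise KeyError(f"unknown GPU attribute: {key}")
        if value is not None:  # Recommended attributes may be omitted
            attrs[key] = value
    return attrs


if __name__ == "__main__":
    print(gpu_attributes("gpu0", hw_vendor="AMD", hw_driver_version="10.2.1-3"))
```

Omitting a Recommended attribute leaves it out of the dictionary instead of recording an empty value.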

Metric: hw.errors (GPU)

This metric is recommended.

Number of errors encountered by the GPU.

When using this metric for GPU errors, set the following attributes:

  • hw.type MUST be set to "gpu" to indicate that the errors are from a GPU.
  • error.type SHOULD be set to one of the following values to indicate the type of error:
    • "corrected": Errors that were detected and corrected by the GPU.
    • "uncorrected": Errors that were detected but could not be corrected by the GPU.

| Name | Instrument Type | Unit (UCUM) | Description | Stability | Entity Associations |
| --- | --- | --- | --- | --- | --- |
| hw.errors | Counter | {error} | Number of errors encountered by the component. | Development | |

Attributes:

| Key | Stability | Requirement Level | Value Type | Description | Example Values |
| --- | --- | --- | --- | --- | --- |
| hw.id | Development | Required | string | An identifier for the hardware component, unique within the monitored host | win32battery_battery_testsysa33_1 |
| hw.type | Development | Required | string | Type of the component [1] | battery; cpu; disk_controller |
| error.type | Stable | Conditionally Required if and only if an error has occurred | string | The type of error encountered by the component. [2] | uncorrected; zero_buffer_credit; crc; bad_sector |
| hw.name | Development | Recommended | string | An easily-recognizable name for the hardware component | eth0 |
| hw.parent | Development | Recommended | string | Unique identifier of the parent component (typically the hw.id attribute of the enclosure, or disk controller) | dellStorage_perc_0 |
| network.io.direction | Development | Recommended | string | Direction of network traffic for network errors. [3] | receive; transmit |

[1] hw.type: Describes the category of the hardware component for which hw.state is being reported. For example, hw.type=temperature along with hw.state=degraded would indicate that the temperature of the hardware component has been reported as degraded.

[2] error.type: The error.type SHOULD match the error code reported by the component, the canonical name of the error, or another low-cardinality error identifier. Instrumentations SHOULD document the list of errors they report.

[3] network.io.direction: This attribute SHOULD only be used when hw.type is set to "network" to indicate the direction of the error.


error.type has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
| --- | --- | --- |
| _OTHER | A fallback error value to be used when the instrumentation doesn't define a custom value. | Stable |

hw.type has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
| --- | --- | --- |
| battery | Battery | Development |
| cpu | CPU | Development |
| disk_controller | Disk controller | Development |
| enclosure | Enclosure | Development |
| fan | Fan | Development |
| gpu | GPU | Development |
| logical_disk | Logical disk | Development |
| memory | Memory | Development |
| network | Network | Development |
| physical_disk | Physical disk | Development |
| power_supply | Power supply | Development |
| tape_drive | Tape drive | Development |
| temperature | Temperature | Development |
| voltage | Voltage | Development |

network.io.direction has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
| --- | --- | --- |
| receive | receive | Development |
| transmit | transmit | Development |
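
For GPU errors, the attribute rules in this section can be sketched as follows. This is a minimal illustration that uses an in-memory stand-in for a Counter instrument, not a real SDK. The names gpu_error_attributes and ErrorCounter and the gpu0 identifier are hypothetical. network.io.direction is left out because note [3] reserves it for hw.type "network".

```python
def gpu_error_attributes(hw_id, error_type):
    """Attributes for one hw.errors data point reported by a GPU."""
    return {
        "hw.id": hw_id,
        "hw.type": "gpu",          # MUST be "gpu" for GPU errors
        "error.type": error_type,  # SHOULD be "corrected" or "uncorrected"
    }


class ErrorCounter:
    """Monotonic counter keyed by attribute set (stand-in for a Counter)."""

    def __init__(self):
        self._totals = {}

    def add(self, amount, attributes):
        if amount < 0:
            raise ValueError("a Counter only accepts non-negative increments")
        key = frozenset(attributes.items())
        self._totals[key] = self._totals.get(key, 0) + amount

    def value(self, attributes):
        return self._totals.get(frozenset(attributes.items()), 0)


if __name__ == "__main__":
    hw_errors = ErrorCounter()
    corrected = gpu_error_attributes("gpu0", "corrected")
    hw_errors.add(2, corrected)
    hw_errors.add(3, corrected)
    print(hw_errors.value(corrected))
```

Each distinct error.type value produces its own time series, so corrected and uncorrected errors are counted separately.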

Metric: hw.gpu.io

This metric is recommended.

| Name | Instrument Type | Unit (UCUM) | Description | Stability | Entity Associations |
| --- | --- | --- | --- | --- | --- |
| hw.gpu.io | Counter | By | Bytes received and transmitted by the GPU. | Development | |

Attributes:

| Key | Stability | Requirement Level | Value Type | Description | Example Values |
| --- | --- | --- | --- | --- | --- |
| hw.id | Development | Required | string | An identifier for the hardware component, unique within the monitored host | win32battery_battery_testsysa33_1 |
| network.io.direction | Development | Required | string | The network IO operation direction. | receive; transmit |
| hw.driver_version | Development | Recommended | string | Driver version for the hardware component | 10.2.1-3 |
| hw.firmware_version | Development | Recommended | string | Firmware version of the hardware component | 2.0.1 |
| hw.model | Development | Recommended | string | Descriptive model name of the hardware component | PERC H740P; Intel(R) Core(TM) i7-10700K; Dell XPS 15 Battery |
| hw.name | Development | Recommended | string | An easily-recognizable name for the hardware component | eth0 |
| hw.parent | Development | Recommended | string | Unique identifier of the parent component (typically the hw.id attribute of the enclosure, or disk controller) | dellStorage_perc_0 |
| hw.serial_number | Development | Recommended | string | Serial number of the hardware component | CNFCP0123456789 |
| hw.vendor | Development | Recommended | string | Vendor name of the hardware component | Dell; HP; Intel; AMD; LSI; Lenovo |

network.io.direction has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
| --- | --- | --- |
| receive | receive | Development |
| transmit | transmit | Development |
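
Because network.io.direction is Required for hw.gpu.io, each GPU yields one series per direction. The sketch below shows one way to shape those data points, assuming the byte totals are already available as cumulative integers from a driver. The function name gpu_io_points and the gpu0 identifier are hypothetical.

```python
DIRECTIONS = ("receive", "transmit")  # well-known network.io.direction values


def gpu_io_points(hw_id, received_bytes, transmitted_bytes):
    """Return (value, attributes) pairs for hw.gpu.io, one per direction."""
    readings = {"receive": received_bytes, "transmit": transmitted_bytes}
    points = []
    for direction in DIRECTIONS:
        value = readings[direction]
        if value < 0:
            raise ValueError("hw.gpu.io is a Counter; byte totals cannot be negative")
        attributes = {"hw.id": hw_id, "network.io.direction": direction}
        points.append((value, attributes))
    return points


if __name__ == "__main__":
    for value, attributes in gpu_io_points("gpu0", 1024, 2048):
        print(value, attributes)
```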

Metric: hw.gpu.memory.limit

This metric is recommended.

| Name | Instrument Type | Unit (UCUM) | Description | Stability | Entity Associations |
| --- | --- | --- | --- | --- | --- |
| hw.gpu.memory.limit | UpDownCounter | By | Size of the GPU memory. | Development | |

Attributes:

| Key | Stability | Requirement Level | Value Type | Description | Example Values |
| --- | --- | --- | --- | --- | --- |
| hw.id | Development | Required | string | An identifier for the hardware component, unique within the monitored host | win32battery_battery_testsysa33_1 |
| hw.driver_version | Development | Recommended | string | Driver version for the hardware component | 10.2.1-3 |
| hw.firmware_version | Development | Recommended | string | Firmware version of the hardware component | 2.0.1 |
| hw.model | Development | Recommended | string | Descriptive model name of the hardware component | PERC H740P; Intel(R) Core(TM) i7-10700K; Dell XPS 15 Battery |
| hw.name | Development | Recommended | string | An easily-recognizable name for the hardware component | eth0 |
| hw.parent | Development | Recommended | string | Unique identifier of the parent component (typically the hw.id attribute of the enclosure, or disk controller) | dellStorage_perc_0 |
| hw.serial_number | Development | Recommended | string | Serial number of the hardware component | CNFCP0123456789 |
| hw.vendor | Development | Recommended | string | Vendor name of the hardware component | Dell; HP; Intel; AMD; LSI; Lenovo |

Metric: hw.gpu.memory.utilization

This metric is recommended.

| Name | Instrument Type | Unit (UCUM) | Description | Stability | Entity Associations |
| --- | --- | --- | --- | --- | --- |
| hw.gpu.memory.utilization | Gauge | 1 | Fraction of GPU memory used. | Development | |

Attributes:

| Key | Stability | Requirement Level | Value Type | Description | Example Values |
| --- | --- | --- | --- | --- | --- |
| hw.id | Development | Required | string | An identifier for the hardware component, unique within the monitored host | win32battery_battery_testsysa33_1 |
| hw.driver_version | Development | Recommended | string | Driver version for the hardware component | 10.2.1-3 |
| hw.firmware_version | Development | Recommended | string | Firmware version of the hardware component | 2.0.1 |
| hw.model | Development | Recommended | string | Descriptive model name of the hardware component | PERC H740P; Intel(R) Core(TM) i7-10700K; Dell XPS 15 Battery |
| hw.name | Development | Recommended | string | An easily-recognizable name for the hardware component | eth0 |
| hw.parent | Development | Recommended | string | Unique identifier of the parent component (typically the hw.id attribute of the enclosure, or disk controller) | dellStorage_perc_0 |
| hw.serial_number | Development | Recommended | string | Serial number of the hardware component | CNFCP0123456789 |
| hw.vendor | Development | Recommended | string | Vendor name of the hardware component | Dell; HP; Intel; AMD; LSI; Lenovo |

Metric: hw.gpu.memory.usage

This metric is recommended.

| Name | Instrument Type | Unit (UCUM) | Description | Stability | Entity Associations |
| --- | --- | --- | --- | --- | --- |
| hw.gpu.memory.usage | UpDownCounter | By | GPU memory used. | Development | |

Attributes:

| Key | Stability | Requirement Level | Value Type | Description | Example Values |
| --- | --- | --- | --- | --- | --- |
| hw.id | Development | Required | string | An identifier for the hardware component, unique within the monitored host | win32battery_battery_testsysa33_1 |
| hw.driver_version | Development | Recommended | string | Driver version for the hardware component | 10.2.1-3 |
| hw.firmware_version | Development | Recommended | string | Firmware version of the hardware component | 2.0.1 |
| hw.model | Development | Recommended | string | Descriptive model name of the hardware component | PERC H740P; Intel(R) Core(TM) i7-10700K; Dell XPS 15 Battery |
| hw.name | Development | Recommended | string | An easily-recognizable name for the hardware component | eth0 |
| hw.parent | Development | Recommended | string | Unique identifier of the parent component (typically the hw.id attribute of the enclosure, or disk controller) | dellStorage_perc_0 |
| hw.serial_number | Development | Recommended | string | Serial number of the hardware component | CNFCP0123456789 |
| hw.vendor | Development | Recommended | string | Vendor name of the hardware component | Dell; HP; Intel; AMD; LSI; Lenovo |
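
The three memory metrics (hw.gpu.memory.limit, hw.gpu.memory.usage, and hw.gpu.memory.utilization) are naturally related. The sketch below derives the utilization fraction as usage divided by limit. That formula is an assumption based on the description "Fraction of GPU memory used" and is not stated explicitly by the convention. The function name gpu_memory_metrics and the gpu0 identifier are hypothetical.

```python
def gpu_memory_metrics(hw_id, used_bytes, total_bytes):
    """Return metric name -> (value, attributes) for one GPU.

    Assumes utilization = used / total, expressed as a fraction (unit 1).
    """
    if total_bytes <= 0:
        raise ValueError("total GPU memory must be positive")
    if not 0 <= used_bytes <= total_bytes:
        raise ValueError("used memory must be between 0 and the total")
    attributes = {"hw.id": hw_id}
    return {
        "hw.gpu.memory.limit": (total_bytes, attributes),   # bytes (By)
        "hw.gpu.memory.usage": (used_bytes, attributes),    # bytes (By)
        "hw.gpu.memory.utilization": (used_bytes / total_bytes, attributes),
    }


if __name__ == "__main__":
    gib = 1024 ** 3
    metrics = gpu_memory_metrics("gpu0", 4 * gib, 16 * gib)
    for name, (value, _attributes) in metrics.items():
        print(name, value)
```

Reporting utilization as a fraction between 0 and 1 rather than a percentage matches the unit 1 given in the metric table.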

Metric: hw.gpu.utilization

This metric is recommended.

| Name | Instrument Type | Unit (UCUM) | Description | Stability | Entity Associations |
| --- | --- | --- | --- | --- | --- |
| hw.gpu.utilization | Gauge | 1 | Fraction of time spent in a specific task. | Development | |

Attributes:

| Key | Stability | Requirement Level | Value Type | Description | Example Values |
| --- | --- | --- | --- | --- | --- |
| hw.id | Development | Required | string | An identifier for the hardware component, unique within the monitored host | win32battery_battery_testsysa33_1 |
| hw.driver_version | Development | Recommended | string | Driver version for the hardware component | 10.2.1-3 |
| hw.firmware_version | Development | Recommended | string | Firmware version of the hardware component | 2.0.1 |
| hw.gpu.task | Development | Recommended | string | Type of task the GPU is performing | decoder; encoder; general |
| hw.model | Development | Recommended | string | Descriptive model name of the hardware component | PERC H740P; Intel(R) Core(TM) i7-10700K; Dell XPS 15 Battery |
| hw.name | Development | Recommended | string | An easily-recognizable name for the hardware component | eth0 |
| hw.parent | Development | Recommended | string | Unique identifier of the parent component (typically the hw.id attribute of the enclosure, or disk controller) | dellStorage_perc_0 |
| hw.serial_number | Development | Recommended | string | Serial number of the hardware component | CNFCP0123456789 |
| hw.vendor | Development | Recommended | string | Vendor name of the hardware component | Dell; HP; Intel; AMD; LSI; Lenovo |

hw.gpu.task has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
| --- | --- | --- |
| decoder | Decoder | Development |
| encoder | Encoder | Development |
| general | General | Development |
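
Since hw.gpu.utilization is a fraction (unit 1) broken down by hw.gpu.task, a collector that reads percentages has to rescale them. The sketch below assumes the readings arrive as percentages per task, which is an assumption about the data source and not something the convention specifies. The function name gpu_utilization_points and the gpu0 identifier are hypothetical.

```python
def gpu_utilization_points(hw_id, percent_by_task):
    """Return (fraction, attributes) pairs, one per hw.gpu.task value.

    percent_by_task maps a task name such as "general", "encoder", or
    "decoder" to a percentage between 0 and 100 (assumed input format).
    """
    points = []
    for task, percent in sorted(percent_by_task.items()):
        if not 0 <= percent <= 100:
            raise ValueError(f"utilization for {task!r} must be within 0..100")
        attributes = {"hw.id": hw_id, "hw.gpu.task": task}
        points.append((percent / 100, attributes))  # Gauge value as a fraction
    return points


if __name__ == "__main__":
    for value, attributes in gpu_utilization_points(
        "gpu0", {"general": 50, "encoder": 25}
    ):
        print(value, attributes)
```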

Metric: hw.status (GPU)

This metric is recommended.

Operational status: 1 (true) or 0 (false) for each of the possible states.

When using this metric for GPU status, the following attributes MUST be set:

  • hw.type MUST be set to "gpu" to indicate that the status is for a GPU.
  • hw.state MUST be set to one of the following values to indicate the GPU state:
    • "ok": The GPU is operating normally.
    • "degraded": The GPU is operating with reduced functionality or performance.
    • "failed": The GPU has failed and is not operational.
    • "predicted_failure": The GPU is currently operational but is predicted to fail soon.

| Name | Instrument Type | Unit (UCUM) | Description | Stability | Entity Associations |
| --- | --- | --- | --- | --- | --- |
| hw.status | UpDownCounter | 1 | Operational status: 1 (true) or 0 (false) for each of the possible states. [1] | Development | |

[1]: hw.status is currently specified as an UpDownCounter but would ideally be represented using a StateSet as defined in OpenMetrics. This semantic convention will be updated once StateSet is specified in OpenTelemetry. This planned change is not expected to have any consequence on the way users query their timeseries backend to retrieve the values of hw.status over time.

Attributes:

| Key | Stability | Requirement Level | Value Type | Description | Example Values |
| --- | --- | --- | --- | --- | --- |
| hw.id | Development | Required | string | An identifier for the hardware component, unique within the monitored host | win32battery_battery_testsysa33_1 |
| hw.state | Development | Required | string | The current state of the component | degraded; failed; needs_cleaning |
| hw.type | Development | Required | string | Type of the component [1] | battery; cpu; disk_controller |
| hw.name | Development | Recommended | string | An easily-recognizable name for the hardware component | eth0 |
| hw.parent | Development | Recommended | string | Unique identifier of the parent component (typically the hw.id attribute of the enclosure, or disk controller) | dellStorage_perc_0 |

[1] hw.type: Describes the category of the hardware component for which hw.state is being reported. For example, hw.type=temperature along with hw.state=degraded would indicate that the temperature of the hardware component has been reported as degraded.


hw.state has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
| --- | --- | --- |
| degraded | Degraded | Development |
| failed | Failed | Development |
| needs_cleaning | Needs Cleaning | Development |
| ok | OK | Development |
| predicted_failure | Predicted Failure | Development |
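
The "1 (true) or 0 (false) for each of the possible states" encoding of hw.status can be sketched as follows. The example emits one data point per GPU state listed earlier in this section, with the value 1 for the current state and 0 for the others. The function name gpu_status_points and the gpu0 identifier are hypothetical.

```python
# GPU states listed for hw.status in this section.
GPU_STATES = ("ok", "degraded", "failed", "predicted_failure")


def gpu_status_points(hw_id, current_state):
    """Return one (value, attributes) pair per state: 1 if current, else 0."""
    if current_state not in GPU_STATES:
        raise ValueError(f"unexpected GPU state: {current_state!r}")
    return [
        (
            1 if state == current_state else 0,
            {"hw.id": hw_id, "hw.type": "gpu", "hw.state": state},
        )
        for state in GPU_STATES
    ]


if __name__ == "__main__":
    for value, attributes in gpu_status_points("gpu0", "degraded"):
        print(value, attributes["hw.state"])
```

Emitting the zero-valued states as well keeps every series present in the backend, so a query for hw.state "failed" returns 0 rather than no data while the GPU is healthy.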

hw.type has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
| --- | --- | --- |
| battery | Battery | Development |
| cpu | CPU | Development |
| disk_controller | Disk controller | Development |
| enclosure | Enclosure | Development |
| fan | Fan | Development |
| gpu | GPU | Development |
| logical_disk | Logical disk | Development |
| memory | Memory | Development |
| network | Network | Development |
| physical_disk | Physical disk | Development |
| power_supply | Power supply | Development |
| tape_drive | Tape drive | Development |
| temperature | Temperature | Development |
| voltage | Voltage | Development |