Workload Classification & Health States

KempnerPulse classifies each GPU into one of 12 workload categories and one of 4 health states every refresh cycle. The classification uses DCGM profiling counters and follows thresholds recommended by NVIDIA's DCGM profiling metric guidance.


Real Utilization

The dashboard computes a single Real Utilization score per GPU:

Real Util = clamp(0, 100,
              W_sm    × SM_ACTIVE
            + W_tensor × TENSOR_ACTIVE
            + W_dram   × DRAM_ACTIVE
            + W_gr     × GR_ENGINE_ACTIVE)

All four inputs are DCGM profiling-level hardware counters (0 to 1 range, displayed as 0 to 100 %). The weights are configurable via --weights or the convenience preset flags.
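
As a minimal sketch of the formula above (not the dashboard's actual code), the score can be computed from the four counters given as fractions in the 0 to 1 range. The function name `real_util` and the tuple layout of `weights` are assumptions made for illustration; the default weights are the AI/ML preset values.

```python
def real_util(sm, tensor, dram, gr, weights=(0.35, 0.35, 0.20, 0.10)):
    """Return the Real Utilization score in percent.

    Inputs are the raw DCGM profiling fractions (0 to 1).
    `weights` is (W_sm, W_tensor, W_dram, W_gr); the default is the
    AI/ML preset. Names and signature are illustrative assumptions.
    """
    w_sm, w_tensor, w_dram, w_gr = weights
    # Weighted sum of the four activity fractions, scaled to percent.
    score = 100.0 * (w_sm * sm + w_tensor * tensor + w_dram * dram + w_gr * gr)
    # Clamp to the displayed 0 to 100 % range.
    return max(0.0, min(100.0, score))
```

For example, a GPU with SM 0.90, Tensor 0.60, DRAM 0.40, and GR 0.95 scores about 70 % under the default weights.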

Weight Presets

| Preset | Flag | W_sm | W_tensor | W_dram | W_gr | Best For |
|---|---|---|---|---|---|---|
| AI/ML (default) | --ai-weights | 0.35 | 0.35 | 0.20 | 0.10 | Deep-learning training, LLM inference, transformers |
| HPC | --hpc-weights | 0.45 | 0.15 | 0.25 | 0.15 | Scientific computing, mixed CUDA, simulations |
| Memory-bound | --mem-weights | 0.35 | 0.10 | 0.40 | 0.15 | Bandwidth-heavy workloads, stencil codes |

Custom weights: --weights W_SM,W_TENSOR,W_DRAM,W_GR (auto-normalized to sum to 1).
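
The auto-normalization of custom weights could look roughly like the sketch below. The helper name `parse_weights` and its validation rules (exactly four values, no negatives, non-zero sum) are assumptions for illustration; the tool's real parsing may differ.

```python
def parse_weights(spec):
    """Parse 'W_SM,W_TENSOR,W_DRAM,W_GR' and normalize to sum to 1.

    Hypothetical helper; the validation rules are assumptions.
    """
    parts = [float(p) for p in spec.split(",")]
    if len(parts) != 4:
        raise ValueError("expected four comma-separated weights")
    total = sum(parts)
    if any(p < 0 for p in parts) or total <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    # Divide each weight by the total so the result sums to 1.
    return tuple(p / total for p in parts)
```

With this sketch, `parse_weights("2,1,1,0")` yields `(0.5, 0.25, 0.25, 0.0)`, so only the ratios between the supplied values matter.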

What the Components Mean

| Component | Metric | Meaning |
|---|---|---|
| SM Active | DCGM_FI_PROF_SM_ACTIVE | Fraction of cycles with work assigned to streaming multiprocessors. Main compute signal. |
| Tensor Active | DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Fraction of cycles tensor cores are running. Critical for mixed-precision and AI workloads. |
| DRAM Active | DCGM_FI_PROF_DRAM_ACTIVE | Fraction of cycles HBM is moving data. Practical peak ~80 %. |
| GR Engine Active | DCGM_FI_PROF_GR_ENGINE_ACTIVE | Fraction of time the graphics/compute engine is active. Falls back to GPU_UTIL when unavailable. |

NVIDIA Reference Points

The classification thresholds are derived from NVIDIA documentation:

| Metric | Threshold | NVIDIA Guidance |
|---|---|---|
| SM Active | ≥ 80 % | "Necessary, but not sufficient, for effective GPU use" |
| SM Active | < 50 % | "Likely indicates ineffective GPU usage" |
| DRAM Active | ≥ 50 % | Heavy memory traffic (practical peak ~80 %) |
| Tensor Active | ~93 % | Full saturation as measured by dcgmproftester |

Workload Classification Table

Categories are evaluated in order; the first matching rule wins. This means a GPU running tensor-heavy compute will not also be labeled "compute-heavy", even if SM Active ≥ 80 %.

| # | Status | Bottleneck | Thresholds | Rationale |
|---|---|---|---|---|
| 1 | idle | idle | Real Util < 5 %, GR < 5 %, DRAM < 5 %, no I/O | Nothing is running on the GPU. |
| 2 | tensor-heavy compute | compute | Tensor ≥ 50 % and SM ≥ 60 % | DL training or large-scale inference at peak tensor throughput. |
| 3 | tensor compute | compute | Tensor ≥ 15 % and SM ≥ 40 % | Meaningful tensor-core activity: mixed precision, moderate load. |
| 4 | FP64 / HPC compute | compute | FP64 ≥ 20 % and SM ≥ 50 % | Scientific double-precision workload. |
| 5 | I/O or data-loading | io | (Memcpy ≥ 40 % or PCIe RX/TX ≥ 1 GB/s) and SM < 30 % | Heavy host ↔ device transfer; SMs mostly idle. |
| 6 | memory-bound | memory | DRAM ≥ 50 % and SM < 50 % | Bandwidth limited. NVIDIA says SM < 50 % is likely ineffective. |
| 7 | compute-heavy | compute | SM ≥ 80 % | SMs well utilized. NVIDIA says ≥ 80 % is necessary for effective use. |
| 8 | compute-active | compute | SM ≥ 50 % | Moderate SM use, no tensor dominance. |
| 9 | memory-active | memory | DRAM ≥ 40 % | Significant DRAM traffic with some SM activity. |
| 10 | busy, low SM use | mixed | GR ≥ 40 % and SM < 25 % | Engine active but SMs underutilized. Likely overhead, sync, or small kernels. |
| 11 | low utilization | mixed | GR < 15 % and SM < 15 % and DRAM < 15 % | Barely any measurable activity. |
| 12 | mixed / moderate | mixed | (fallthrough) | No single dominant pattern. |
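
The first-match-wins evaluation can be sketched as an ordered rule list. This is an illustration under stated assumptions, not the project's implementation: the `Sample` dataclass and its field names are hypothetical, all activity values are in percent, and the table's "no I/O" condition is interpreted here as "the rule 5 I/O test does not fire", which the source does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical per-GPU snapshot; activity values are percentages."""
    real_util: float
    sm: float
    tensor: float
    dram: float
    gr: float
    fp64: float = 0.0
    memcpy: float = 0.0
    pcie_gbps: float = 0.0  # larger of PCIe RX/TX, in GB/s


def classify(s):
    """Return (status, bottleneck) for the first matching rule."""
    # Assumption: "no I/O" means the rule 5 transfer test is false.
    io_heavy = s.memcpy >= 40 or s.pcie_gbps >= 1.0
    rules = [
        ("idle", "idle",
         s.real_util < 5 and s.gr < 5 and s.dram < 5 and not io_heavy),
        ("tensor-heavy compute", "compute", s.tensor >= 50 and s.sm >= 60),
        ("tensor compute", "compute", s.tensor >= 15 and s.sm >= 40),
        ("FP64 / HPC compute", "compute", s.fp64 >= 20 and s.sm >= 50),
        ("I/O or data-loading", "io", io_heavy and s.sm < 30),
        ("memory-bound", "memory", s.dram >= 50 and s.sm < 50),
        ("compute-heavy", "compute", s.sm >= 80),
        ("compute-active", "compute", s.sm >= 50),
        ("memory-active", "memory", s.dram >= 40),
        ("busy, low SM use", "mixed", s.gr >= 40 and s.sm < 25),
        ("low utilization", "mixed",
         s.gr < 15 and s.sm < 15 and s.dram < 15),
    ]
    # Rules are ordered as in the table; the first match wins.
    for status, bottleneck, matched in rules:
        if matched:
            return status, bottleneck
    return "mixed / moderate", "mixed"
```

Because the tensor rules sit above the SM-only rules, a sample with Tensor 60 % and SM 90 % is reported as tensor-heavy compute rather than compute-heavy.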

Bottleneck Key

The bottleneck key is used for color-coding in the dashboard:

| Key | Color | Meaning |
|---|---|---|
| idle | dim | GPU is not doing work. |
| compute | green | GPU is primarily compute-bound. |
| io | cyan | GPU is transfer/copy-bound. |
| memory | magenta | GPU is memory-bandwidth-bound. |
| mixed | yellow | No single dominant workload pattern. |

Metric Profiles

Each workload category has a distinctive metric signature across the six axes: SM Active, Tensor Active, DRAM Active, GR Engine Active, Memcpy/IO, and FP64 Active.

The overlay shows how all 12 categories compare on a single chart:

[Figure: Classification radar overlay]

Individual profiles for each category:

[Figure: Classification radar grid]


Health States

Health is evaluated independently from workload classification. It checks error counters and temperatures against per-model thresholds.

Health Status Levels

Conditions are evaluated in order; the first matching condition wins.

| Status | Style | Condition | Action |
|---|---|---|---|
| CRIT | bold red | Row-remap failure > 0 or uncorrectable remapped rows > 0 | GPU has hardware memory errors. Remove from production immediately. |
| WARN | yellow | PCIe replay rate > 0/s | PCIe link quality issue; retransmissions occurring. Monitor closely. |
| HOT | yellow | GPU temp ≥ warning threshold or memory temp ≥ warning threshold | Thermal throttling zone. Check cooling and airflow. |
| OK | green | (none of the above) | Normal operating condition. |

Temperature Thresholds by GPU Model

| GPU Model | Normal | Warning | Critical |
|---|---|---|---|
| A100 | 85 °C | 93 °C | 95 °C |
| H100 | 85 °C | 95 °C | 105 °C |
| H200 | 80 °C | 95 °C | 105 °C |
| RTX 6000 | 85 °C | 92 °C | 105 °C |
| Other / unknown | 85 °C | 93 °C | 105 °C |
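
Combining the ordered health conditions with the warning thresholds above, a health check might be sketched as follows. The function name, its parameters, the exact-match model lookup, and the handling of a missing memory temperature (`None`) are all assumptions for illustration; how the tool actually identifies the GPU model is not described here.

```python
# Warning thresholds in °C, taken from the table above.
WARN_TEMP_C = {"A100": 93, "H100": 95, "H200": 95, "RTX 6000": 92}
DEFAULT_WARN_TEMP_C = 93  # "Other / unknown" row


def health_status(model, row_remap_failure, uncorrectable_remapped_rows,
                  pcie_replay_rate, gpu_temp_c, mem_temp_c=None):
    """Return CRIT, WARN, HOT, or OK; the first matching condition wins."""
    if row_remap_failure > 0 or uncorrectable_remapped_rows > 0:
        return "CRIT"
    if pcie_replay_rate > 0:
        return "WARN"
    # Assumption: exact model-name lookup with a fallback threshold.
    warn = WARN_TEMP_C.get(model, DEFAULT_WARN_TEMP_C)
    # Assumption: a missing memory temperature is simply skipped.
    mem_hot = mem_temp_c is not None and mem_temp_c >= warn
    if gpu_temp_c >= warn or mem_hot:
        return "HOT"
    return "OK"
```

Note that the ordering means a GPU with both a remap failure and a high temperature is reported as CRIT, not HOT.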

Health Metrics

| Check | DCGM Metric | Trigger | Health Status |
|---|---|---|---|
| ECC Row Remap Failure | DCGM_FI_DEV_ROW_REMAP_FAILURE | > 0 | CRIT |
| Uncorrectable Remapped Rows | DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS | > 0 | CRIT |
| PCIe Replay Rate | DCGM_FI_DEV_PCIE_REPLAY_COUNTER (rate) | > 0/s | WARN |
| GPU Temperature | DCGM_FI_DEV_GPU_TEMP | ≥ model warning threshold | HOT |
| Memory Temperature | DCGM_FI_DEV_MEMORY_TEMP | ≥ model warning threshold | HOT |
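
The PCIe replay check is expressed as a rate, while the underlying field is a counter. One plausible way to derive a per-second rate is to difference two consecutive readings, as in the sketch below. The function name and the treatment of counter resets and non-positive intervals are assumptions; the source does not say how the rate is computed.

```python
def replay_rate(prev_count, curr_count, dt_seconds):
    """Replays per second between two counter readings.

    Hypothetical helper: returns 0.0 when the interval is not positive
    or when the counter appears to have reset (both are assumptions).
    """
    if dt_seconds <= 0:
        return 0.0
    delta = curr_count - prev_count
    if delta < 0:
        # Counter went backwards, e.g. after a driver or GPU reset.
        return 0.0
    return delta / dt_seconds
```

For instance, readings of 100 and 106 taken two seconds apart give 3.0 replays per second, which would trip the WARN condition.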

Further Reading