KempnerPulse classifies each GPU into one of 12 workload categories and one of 4 health states every refresh cycle. The classification uses DCGM profiling counters and follows thresholds recommended by NVIDIA's DCGM profiling metric guidance.
The dashboard computes a single Real Utilization score per GPU:

    Real Util = clamp(0, 100,
        W_sm     × SM_ACTIVE
      + W_tensor × TENSOR_ACTIVE
      + W_dram   × DRAM_ACTIVE
      + W_gr     × GR_ENGINE_ACTIVE)

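All four inputs are DCGM profiling-level hardware counters. DCGM reports them in the 0 to 1 range; the dashboard displays them as 0 to 100 %, and the formula uses that percentage form, so Real Util is also a percentage. The weights are configurable via --weights or the convenience preset flags.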
| Preset | Flag | W_sm | W_tensor | W_dram | W_gr | Best For |
|---|---|---|---|---|---|---|
| AI/ML (default) | --ai-weights | 0.35 | 0.35 | 0.20 | 0.10 | Deep-learning training, LLM inference, transformers |
| HPC | --hpc-weights | 0.45 | 0.15 | 0.25 | 0.15 | Scientific computing, mixed CUDA, simulations |
| Memory-bound | --mem-weights | 0.35 | 0.10 | 0.40 | 0.15 | Bandwidth-heavy workloads, stencil codes |
Custom weights: --weights W_SM,W_TENSOR,W_DRAM,W_GR (auto-normalized to
sum to 1).
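The weighted sum and the auto-normalization can be illustrated with a minimal sketch. The function names `normalize_weights` and `real_util` and the `PRESETS` dictionary are hypothetical and are not taken from the KempnerPulse source; the preset values are copied from the table above, and the inputs are assumed to already be percentages (0 to 100).

```python
from typing import Sequence, Tuple

# Preset weights from the table above, ordered (W_sm, W_tensor, W_dram, W_gr).
# The dictionary keys are illustrative labels, not real CLI values.
PRESETS = {
    "ai": (0.35, 0.35, 0.20, 0.10),
    "hpc": (0.45, 0.15, 0.25, 0.15),
    "mem": (0.35, 0.10, 0.40, 0.15),
}


def normalize_weights(weights: Sequence[float]) -> Tuple[float, ...]:
    """Scale four non-negative weights so that they sum to 1."""
    if len(weights) != 4:
        raise ValueError("expected exactly four weights")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return tuple(w / total for w in weights)


def real_util(sm: float, tensor: float, dram: float, gr: float,
              weights: Sequence[float] = PRESETS["ai"]) -> float:
    """Weighted utilization score, clamped to the 0-100 range.

    All four inputs are assumed to be percentages (0-100).
    """
    w_sm, w_tensor, w_dram, w_gr = normalize_weights(weights)
    score = w_sm * sm + w_tensor * tensor + w_dram * dram + w_gr * gr
    return max(0.0, min(100.0, score))


if __name__ == "__main__":
    # 0.35*80 + 0.35*60 + 0.20*40 + 0.10*90 = 66
    print(round(real_util(80, 60, 40, 90), 2))
    # Custom weights 2,2,1,0 are rescaled before use.
    print(normalize_weights((2, 2, 1, 0)))
```

With the default AI/ML preset, a GPU at 80 % SM, 60 % tensor, 40 % DRAM, and 90 % GR activity scores 66. The same sample scores lower or higher under the other presets, depending on which counter the preset emphasizes.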
| Component | Metric | Meaning |
|---|---|---|
| SM Active | DCGM_FI_PROF_SM_ACTIVE | Fraction of cycles with work assigned to streaming multiprocessors. Main compute signal. |
| Tensor Active | DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Fraction of cycles tensor cores are running. Critical for mixed-precision and AI workloads. |
| DRAM Active | DCGM_FI_PROF_DRAM_ACTIVE | Fraction of cycles HBM is moving data. Practical peak ~80 %. |
| GR Engine Active | DCGM_FI_PROF_GR_ENGINE_ACTIVE | Fraction of time the graphics/compute engine is active. Falls back to GPU_UTIL when unavailable. |
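The GR Engine Active fallback noted in the table can be sketched as below. The helper name `gr_percent` is hypothetical. The sketch assumes that the profiling counter arrives as a 0 to 1 fraction, that GPU_UTIL arrives already as a percentage, and that an unavailable value is represented as `None`; the actual dashboard may handle all three differently.

```python
from typing import Optional


def gr_percent(gr_engine_active: Optional[float],
               gpu_util: Optional[float]) -> Optional[float]:
    """Return GR engine activity as a percentage, with a GPU_UTIL fallback.

    gr_engine_active: profiling counter as a 0-1 fraction, or None if missing.
    gpu_util: coarse utilization, assumed to be a percentage already.
    """
    if gr_engine_active is not None:
        return 100.0 * gr_engine_active
    # Profiling counter unavailable: use the coarser GPU_UTIL value as-is.
    return gpu_util
```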
The classification thresholds are derived from NVIDIA documentation:
| Metric | Threshold | NVIDIA Guidance |
|---|---|---|
| SM Active | ≥ 80 % | "Necessary, but not sufficient, for effective GPU use" |
| SM Active | < 50 % | "Likely indicates ineffective GPU usage" |
| DRAM Active | ≥ 50 % | Heavy memory traffic (practical peak ~80 %) |
| Tensor Active | ~93 % | Full saturation as measured by dcgmproftester |
Categories are evaluated in order; the first matching rule wins. This means a GPU running tensor-heavy compute will not also be labeled "compute-heavy", even if SM Active ≥ 80 %.
| # | Status | Bottleneck | Thresholds | Rationale |
|---|---|---|---|---|
| 1 | idle | idle | Real Util < 5 %, GR < 5 %, DRAM < 5 %, no I/O | Nothing is running on the GPU. |
| 2 | tensor-heavy compute | compute | Tensor ≥ 50 % and SM ≥ 60 % | DL training or large-scale inference at peak tensor throughput. |
| 3 | tensor compute | compute | Tensor ≥ 15 % and SM ≥ 40 % | Meaningful tensor-core activity: mixed precision, moderate load. |
| 4 | FP64 / HPC compute | compute | FP64 ≥ 20 % and SM ≥ 50 % | Scientific double-precision workload. |
| 5 | I/O or data-loading | io | (Memcpy ≥ 40 % or PCIe RX/TX ≥ 1 GB/s) and SM < 30 % | Heavy host ↔ device transfer; SMs mostly idle. |
| 6 | memory-bound | memory | DRAM ≥ 50 % and SM < 50 % | Bandwidth limited. NVIDIA says SM < 50 % is likely ineffective. |
| 7 | compute-heavy | compute | SM ≥ 80 % | SMs well utilized. NVIDIA says ≥ 80 % is necessary for effective use. |
| 8 | compute-active | compute | SM ≥ 50 % | Moderate SM use, no tensor dominance. |
| 9 | memory-active | memory | DRAM ≥ 40 % | Significant DRAM traffic with some SM activity. |
| 10 | busy, low SM use | mixed | GR ≥ 40 % and SM < 25 % | Engine active but SMs underutilized. Likely overhead, sync, or small kernels. |
| 11 | low utilization | mixed | GR < 15 % and SM < 15 % and DRAM < 15 % | Barely any measurable activity. |
| 12 | mixed / moderate | mixed | (fallthrough) | No single dominant pattern. |
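The first-match-wins evaluation can be sketched as an ordered rule list. This is an illustration of the table, not the KempnerPulse implementation: `GpuSample`, `heavy_io`, `RULES`, and `classify` are hypothetical names. Two readings are assumptions: "PCIe RX/TX ≥ 1 GB/s" is taken to mean either direction reaches 1 GB/s, and "no I/O" in rule 1 is taken to mean that the rule 5 transfer condition is false.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class GpuSample:
    """One refresh-cycle snapshot; activity fields are percentages (0-100)."""
    real_util: float
    sm: float
    tensor: float
    dram: float
    gr: float
    fp64: float = 0.0
    memcpy: float = 0.0
    pcie_rx_gbps: float = 0.0  # GB/s
    pcie_tx_gbps: float = 0.0  # GB/s


def heavy_io(s: GpuSample) -> bool:
    # Assumption: either PCIe direction reaching 1 GB/s counts.
    return s.memcpy >= 40 or s.pcie_rx_gbps >= 1.0 or s.pcie_tx_gbps >= 1.0


# (status, bottleneck, predicate), in the same order as the table.
RULES: List[Tuple[str, str, Callable[[GpuSample], bool]]] = [
    # Assumption: "no I/O" means the rule-5 transfer condition is false.
    ("idle", "idle",
     lambda s: s.real_util < 5 and s.gr < 5 and s.dram < 5 and not heavy_io(s)),
    ("tensor-heavy compute", "compute", lambda s: s.tensor >= 50 and s.sm >= 60),
    ("tensor compute", "compute", lambda s: s.tensor >= 15 and s.sm >= 40),
    ("FP64 / HPC compute", "compute", lambda s: s.fp64 >= 20 and s.sm >= 50),
    ("I/O or data-loading", "io", lambda s: heavy_io(s) and s.sm < 30),
    ("memory-bound", "memory", lambda s: s.dram >= 50 and s.sm < 50),
    ("compute-heavy", "compute", lambda s: s.sm >= 80),
    ("compute-active", "compute", lambda s: s.sm >= 50),
    ("memory-active", "memory", lambda s: s.dram >= 40),
    ("busy, low SM use", "mixed", lambda s: s.gr >= 40 and s.sm < 25),
    ("low utilization", "mixed",
     lambda s: s.gr < 15 and s.sm < 15 and s.dram < 15),
    ("mixed / moderate", "mixed", lambda s: True),  # fallthrough
]


def classify(s: GpuSample) -> Tuple[str, str]:
    """Return (status, bottleneck) for the first rule that matches."""
    for status, bottleneck, matches in RULES:
        if matches(s):
            return status, bottleneck
    raise AssertionError("unreachable: the last rule always matches")
```

Because the rules are checked in order, a sample with 85 % SM and 60 % tensor activity stops at rule 2 and never reaches the compute-heavy rule, which matches the ordering behavior described above.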
The bottleneck key is used for color-coding in the dashboard:
| Key | Color | Meaning |
|---|---|---|
| idle | dim | GPU is not doing work. |
| compute | green | GPU is primarily compute-bound. |
| io | cyan | GPU is transfer/copy-bound. |
| memory | magenta | GPU is memory-bandwidth-bound. |
| mixed | yellow | No single dominant workload pattern. |
Each workload category has a distinctive metric signature across the six axes: SM Active, Tensor Active, DRAM Active, GR Engine Active, Memcpy/IO, and FP64 Active.
[Figure: overlay of all 12 workload categories on a single chart]
[Figure: individual profile for each category]
Health is evaluated independently from workload classification. It checks hardware error counters and the PCIe replay rate, and compares GPU and memory temperatures against per-model thresholds.
Conditions are evaluated in order; the first matching condition wins.
| Status | Style | Condition | Action |
|---|---|---|---|
| CRIT | bold red | Row-remap failure > 0 or uncorrectable remapped rows > 0 | GPU has hardware memory errors. Remove from production immediately. |
| WARN | yellow | PCIe replay rate > 0/s | PCIe link quality issue; retransmissions occurring. Monitor closely. |
| HOT | yellow | GPU temp ≥ warning threshold or memory temp ≥ warning threshold | Thermal throttling zone. Check cooling and airflow. |
| OK | green | (none of the above) | Normal operating condition. |
| GPU Model | Normal | Warning | Critical |
|---|---|---|---|
| A100 | 85 °C | 93 °C | 95 °C |
| H100 | 85 °C | 95 °C | 105 °C |
| H200 | 80 °C | 95 °C | 105 °C |
| RTX 6000 | 85 °C | 92 °C | 105 °C |
| Other / unknown | 85 °C | 93 °C | 105 °C |
| Check | DCGM Metric | Trigger | Health Status |
|---|---|---|---|
| ECC Row Remap Failure | DCGM_FI_DEV_ROW_REMAP_FAILURE | > 0 | CRIT |
| Uncorrectable Remapped Rows | DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS | > 0 | CRIT |
| PCIe Replay Rate | DCGM_FI_DEV_PCIE_REPLAY_COUNTER (rate) | > 0/s | WARN |
| GPU Temperature | DCGM_FI_DEV_GPU_TEMP | ≥ model warning threshold | HOT |
| Memory Temperature | DCGM_FI_DEV_MEMORY_TEMP | ≥ model warning threshold | HOT |
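The health checks above can be sketched as a short ordered evaluation. The names `warning_threshold`, `replay_rate`, and `evaluate_health` are hypothetical. Three details are assumptions rather than documented behavior: the GPU model is matched by substring against the reported device name, the replay rate is computed as a counter delta over elapsed time, and a missing memory temperature is passed as `None`. The threshold values are copied from the per-model table.

```python
from typing import Optional

# (normal, warning, critical) in °C, copied from the per-model table.
TEMP_THRESHOLDS = {
    "A100": (85, 93, 95),
    "H100": (85, 95, 105),
    "H200": (80, 95, 105),
    "RTX 6000": (85, 92, 105),
}
DEFAULT_THRESHOLDS = (85, 93, 105)  # "Other / unknown" row


def warning_threshold(model_name: str) -> int:
    """Look up the warning temperature; substring matching is an assumption."""
    upper = model_name.upper()
    for key, (_normal, warning, _critical) in TEMP_THRESHOLDS.items():
        if key in upper:
            return warning
    return DEFAULT_THRESHOLDS[1]


def replay_rate(prev_count: int, curr_count: int, elapsed_s: float) -> float:
    """PCIe replays per second between two readings of the cumulative counter."""
    if elapsed_s <= 0:
        return 0.0
    # Guard against a counter reset making the delta negative.
    return max(0, curr_count - prev_count) / elapsed_s


def evaluate_health(model_name: str,
                    row_remap_failure: int,
                    uncorrectable_remapped_rows: int,
                    pcie_replay_rate: float,
                    gpu_temp: float,
                    mem_temp: Optional[float] = None) -> str:
    """Return CRIT, WARN, HOT, or OK; the first matching condition wins."""
    if row_remap_failure > 0 or uncorrectable_remapped_rows > 0:
        return "CRIT"
    if pcie_replay_rate > 0:
        return "WARN"
    warn = warning_threshold(model_name)
    if gpu_temp >= warn or (mem_temp is not None and mem_temp >= warn):
        return "HOT"
    return "OK"
```

Under these assumptions, a GPU at 94 °C is reported as HOT on an A100 (warning threshold 93 °C) but as OK on an H100 (warning threshold 95 °C). A nonzero replay rate takes precedence over a high temperature because WARN is checked before HOT.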