|
| 1 | +# Draft: Extensible CPU Power Meter Selection |
| 2 | + |
| 3 | +* **Status**: Draft |
| 4 | +* **Author**: Niki Manoledaki |
| 5 | +* **Created**: 2026-05-03 |
| 6 | + |
| 7 | +## Problem |
| 8 | + |
| 9 | +For context, `cmd/kepler/main.go` constructs CPU and GPU power meters in two different shapes: |
| 10 | + |
| 11 | +* `createCPUMeter` hardcodes RAPL with hwmon as a fallback. It returns `(device.CPUPowerMeter, error)`. This is strict by design: Kepler must produce a CPU meter or fail startup. |
| 12 | +* `createGPUMeters` fans out across registered vendors via `gpu.DiscoverAll`. It returns `[]gpu.GPUPowerMeter` with no error, which is a soft failure mode, also by design, as it assumes that GPU monitoring is not required for Kepler. |
| 13 | + |
| 14 | +Once we agree on these assumptions, here are the problems that follow. |
| 15 | + |
| 16 | +1. **Asymmetry.** Two patterns for the same job (turn config into hardware abstractions) makes `cmd/kepler/main.go` harder to read and audit. |
| 17 | +2. **Hidden GPU failures.** `gpu.Discover` (`internal/device/gpu/registry.go:48`) collapses three failure modes into one `nil` return: |
| 18 | + * factory error (driver missing or unsupported) |
| 19 | + * `Init()` error (driver present but broken) |
| 20 | + * empty `Devices()` (vendor registered, no hardware on this node) |
| 21 | + |
| 22 | + Only the third is a legitimate soft outcome. Real backend failures surface only as a `Warn` log line. |
| 23 | +3. **Closed CPU path.** RAPL and hwmon are hard-coded fallbacks with hwmon force-enable as the only fallback. Operators on infrastructure where neither is the right source cannot extend this model to select a different backend easily. |
| 24 | + |
| 25 | +## Goals |
| 26 | + |
| 27 | +* Extend CPU power meter backend selection to a config-driven list. |
| 28 | +* Unify more of the CPU and GPU lifecycle shape: discovery, failure, initialization. |
| 29 | +* Surface real GPU backend failures as errors instead of logging them. |
| 30 | +* Preserve default behaviour for healthy systems running default config. |
| 31 | + |
| 32 | +## Non-Goals |
| 33 | + |
| 34 | +* This proposal does not build a common interface **per device** across CPU zones and GPU devices. Their measurement shapes differ and a unifying interface would abstract hardware-specific concerns through `interface{}`, which is not ideal. The CPU and GPU attributes should remain separate. |
| 35 | +* This proposal does not replace `monitor.PowerDataProvider`, which will continue to be the consumer for exporters. This proposal unifies the step right before that, which is how to initialize the hardware backend. |
| 36 | + |
| 37 | +## Proposed Solution |
| 38 | + |
| 39 | +### Promote `device.PowerMeter` |
| 40 | + |
| 41 | +Promote the unexported `powerMeter` interface in [internal/device/power_meter.go](../../../internal/device/power_meter.go) to an external interface `PowerMeter`, like this: |
| 42 | + |
| 43 | +```go |
| 44 | +// PowerMeter reads power from a class of hardware (CPU, GPU, etc). |
| 45 | +// |
| 46 | +// Many PowerMeters can be selected. Each contributes its own readings. |
| 47 | +// Domain-specific methods live on subinterfaces that embed PowerMeter. |
| 48 | +type PowerMeter interface { |
| 49 | + service.Service // Name() |
| 50 | + service.Initializer // Init() |
| 51 | +} |
| 52 | +``` |
| 53 | + |
| 54 | +`Shutdown` is optional via `service.Shutdowner`. Some backends like NVML implement it. |
| 55 | + |
| 56 | +`device.CPUPowerMeter` and `gpu.GPUPowerMeter` embed `device.PowerMeter` and add their own methods. No change to those domain methods in this EP. |
| 57 | + |
| 58 | +### Switch-based CPU dispatch |
| 59 | + |
| 60 | +CPU gains a config field for ordering plus a switch that walks the priority list: |
| 61 | + |
| 62 | +```yaml |
| 63 | +cpu: |
| 64 | + meters: ["rapl", "hwmon"] # ordered priority |
| 65 | +``` |
| 66 | +
|
| 67 | +```go |
| 68 | +// internal/device/cpu_power_meter.go |
| 69 | + |
| 70 | +func CreateCPUMeter(logger *slog.Logger, cfg *config.Config) (CPUPowerMeter, error) { |
| 71 | + var errs []error |
| 72 | + for _, name := range cfg.Cpu.Meters { |
| 73 | + meter, err := buildCPUMeter(name, logger, cfg) |
| 74 | + if err != nil { /* aggregate, continue */ } |
| 75 | + if err := meter.Init(); err != nil { /* aggregate, continue */ } |
| 76 | + zones, _ := meter.Zones() |
| 77 | + if len(zones) == 0 { /* soft skip, continue */ } |
| 78 | + return meter, nil |
| 79 | + } |
| 80 | + return nil, errors.Join(errs...) |
| 81 | +} |
| 82 | + |
| 83 | +func buildCPUMeter(name string, logger *slog.Logger, cfg *config.Config) (CPUPowerMeter, error) { |
| 84 | + switch name { |
| 85 | + case "rapl": return NewCPUPowerMeter(...) |
| 86 | + case "hwmon": return NewHwmonPowerMeter(...) |
| 87 | + case "fake": return NewFakeCPUMeter(...) |
| 88 | + default: return nil, fmt.Errorf("unknown cpu meter %q", name) |
| 89 | + } |
| 90 | +} |
| 91 | +``` |
| 92 | + |
| 93 | +`CreateCPUMeter` owns the lifecycle (build, init, zones, error aggregation). `buildCPUMeter` is the dispatch. Adding a backend means a new case + a new constructor, which is straightforward. |
| 94 | + |
| 95 | +Default value `["rapl", "hwmon"]` preserves the prior fallback behaviour. Operators make a selection via the config. |
| 96 | + |
| 97 | +### GPU failure-mode split |
| 98 | + |
| 99 | +As part of this unification, `gpu.Discover` is refactored to split the three failure modes and return `(meters, error)`: |
| 100 | + |
| 101 | +| State | GPU result | |
| 102 | +|------------------------------|-------------------------------------------------| |
| 103 | +| Hardware works | `meters, nil` | |
| 104 | +| Hardware absent on this node | `nil, nil` | |
| 105 | +| Configured backend broken | `meters?, err` (per-vendor failures aggregated) | |
| 106 | +| Feature off | `nil, nil` | |
| 107 | + |
| 108 | +Result per scenario at startup: |
| 109 | + |
| 110 | +| Scenario | Result | |
| 111 | +|---------------------|--------------------------| |
| 112 | +| Factory error | Real failure. Aggregate. | |
| 113 | +| `Init()` error | Real failure. Aggregate. | |
| 114 | +| Empty `Devices()` | Soft skip. Not an error. | |
| 115 | +| Success | Append to result. | |
| 116 | + |
| 117 | +Error aggregation uses `errors.Join`. `gpu.Discover` returns the joined error only when the GPU feature is explicitly enabled and every registered vendor returned a real failure. "No GPU on this node" stays `(nil, nil)`. |
| 118 | + |
| 119 | +### Unified `cmd/kepler/main.go` |
| 120 | + |
| 121 | +End result after the refactor: |
| 122 | + |
| 123 | +```go |
| 124 | +cpuMeter, err := device.CreateCPUMeter(logger, cfg) |
| 125 | +if err != nil { ... } |
| 126 | + |
| 127 | +gpuMeters, err := gpu.Discover(logger, cfg) |
| 128 | +if err != nil { ... } |
| 129 | +``` |
| 130 | + |
| 131 | +Symmetric shape (`(meter(s), error)`), different failure semantics (CPU strict, GPU soft). |
| 132 | + |
| 133 | +## Detailed Design |
| 134 | + |
| 135 | +### Package layout |
| 136 | + |
| 137 | +```text |
| 138 | +internal/device/ |
| 139 | +├── power_meter.go # PowerMeter interface (promoted from private) |
| 140 | +├── cpu_power_meter.go # CPUPowerMeter, CreateCPUMeter, buildCPUMeter |
| 141 | +├── rapl_sysfs_power_meter.go # NewCPUPowerMeter (RAPL) |
| 142 | +├── hwmon_power_meter.go # NewHwmonPowerMeter |
| 143 | +├── fake_cpu_power_meter.go # NewFakeCPUMeter |
| 144 | +└── gpu/ |
| 145 | + ├── interface.go # GPUPowerMeter embeds device.PowerMeter, plus Vendor and GPUDevice types |
| 146 | + ├── registry.go # gpu.Register, gpu.Discover (refactored failure modes) |
| 147 | + └── nvidia/ # init() calls gpu.Register(NVIDIA, ...) |
| 148 | +``` |
| 149 | + |
| 150 | +No new file for CPU dispatch — `CreateCPUMeter` and `buildCPUMeter` live in `cpu_power_meter.go` next to the `CPUPowerMeter` interface they construct. |
| 151 | + |
| 152 | +### Logging |
| 153 | + |
| 154 | +Per-backend `Warn` on failure, `Info` on selection: |
| 155 | + |
| 156 | +```text |
| 157 | +INFO using rapl power meter |
| 158 | +``` |
| 159 | + |
| 160 | +```text |
| 161 | +WARN rapl not available, trying next backend error="..." |
| 162 | +INFO using hwmon power meter |
| 163 | +``` |
| 164 | + |
| 165 | +GPU keeps its existing per-vendor `Warn` plus a one-line summary: |
| 166 | + |
| 167 | +```text |
| 168 | +INFO gpu meter discovery ok=[nvidia] failed=[amd: factory: rocm-smi not found] |
| 169 | +``` |
| 170 | + |
| 171 | +## Configuration |
| 172 | + |
| 173 | +```yaml |
| 174 | +cpu: |
| 175 | + meters: ["rapl", "hwmon"] # ordered priority |
| 176 | +``` |
| 177 | +
|
| 178 | +Per-backend tuning keys (`rapl.zones`, `experimental.hwmon.zones`, `experimental.hwmon.chipRules`, `dev.fake-cpu-meter.zones`) are unchanged. Legacy selectors translate at startup to `cpu.meters` and emit a deprecation warning. See [Backward compatibility](#backward-compatibility) for the full migration. |
| 179 | + |
| 180 | +GPU config does not change in this EP. |
| 181 | + |
| 182 | +## Testing Strategy |
| 183 | + |
| 184 | +* `device.CreateCPUMeter` table-driven tests: unknown name, factory error, `Init()` error, empty zones, success, ordered priority, all-fail aggregation. |
| 185 | +* `gpu.Discover` table-driven tests: all-fail, mixed, all-empty, all-succeed; per-vendor failure modes (factory, `Init()`, empty `Devices()`). |
| 186 | + |
| 187 | +## Backward compatibility |
| 188 | + |
| 189 | +Default `cpu.meters: ["rapl", "hwmon"]` reproduces default behaviour. No breaking change for healthy systems running default config. |
| 190 | + |
| 191 | +Two legacy selectors require migration. They translate at startup to an effective `cpu.meters` value, log a deprecation warning, and stop working in a future release: |
| 192 | + |
| 193 | +* `experimental.hwmon.forceEnabled: true` (today: force hwmon, skip RAPL) → effective `cpu.meters: ["hwmon"]`. |
| 194 | +* `dev.fake-cpu-meter.enabled: true` (today: use fake meter, skip RAPL/hwmon) → effective `cpu.meters: ["fake"]`. |
| 195 | + |
| 196 | +Per-backend tuning keys are unchanged and remain valid: `rapl.zones`, `experimental.hwmon.zones`, `experimental.hwmon.chipRules`, `dev.fake-cpu-meter.zones`. |
| 197 | + |
| 198 | +Operators on broken systems will see `kepler` exit with a clearer error when every CPU backend fails (the prior path also exits, but with the hwmon error only, which can be misleading). |
| 199 | + |
| 200 | +GPU error semantics tighten. A node with `gpu.enabled=true` and a broken NVML driver will fail startup instead of silently running CPU-only. This matches the CPU contract. CPU-only is the default (`gpu.enabled=false`); the tightening only affects operators who explicitly opt in to GPU. |
| 201 | + |
| 202 | +## Implementation |
| 203 | + |
| 204 | +Single PR introduces: |
| 205 | + |
| 206 | +1. Promote `device.PowerMeter` interface (`Name`, `Init`). |
| 207 | +2. Add `cfg.Cpu.Meters` config field with default `["rapl", "hwmon"]`. |
| 208 | +3. Add `device.CreateCPUMeter` + `buildCPUMeter` switch dispatch. |
| 209 | +4. Add `Config.ApplyCpuMeterDeprecations` to translate the two legacy keys. |
| 210 | +5. Replace `cmd/kepler/main.go`'s `createCPUMeter` / `createHwmonMeter` with a call to `device.CreateCPUMeter`. |
| 211 | +6. Update sample configs and docs. |
| 212 | + |
| 213 | +GPU failure-mode split lands as a follow-up PR — separable since CPU and GPU paths are independent in `cmd/kepler/main.go`. |
| 214 | + |
| 215 | +## Risks and Mitigations |
| 216 | + |
| 217 | +### Operational risk: stricter GPU error path (follow-up PR) |
| 218 | + |
| 219 | +* **Risk**: Nodes with GPU explicitly enabled (`gpu.enabled=true`) and broken drivers fail startup where the prior code silently continued without GPU metrics. |
| 220 | +* **Mitigation**: Matches the CPU contract. Default-off (`gpu.enabled=false`) means only opted-in operators are affected; they can revert to CPU-only by removing the explicit opt-in. |
| 221 | + |
| 222 | +### Configuration risk: invalid `cpu.meters` value |
| 223 | + |
| 224 | +* **Risk**: Operator typo (`"rappl"`) is rejected by `buildCPUMeter` with `unknown cpu meter "rappl"`. |
| 225 | +* **Mitigation**: The error names the unknown backend and the operator's `cpu.meters` is preserved in the printed config at startup so the typo is visible in logs. |
| 226 | + |
| 227 | +## Alternatives Considered |
| 228 | + |
| 229 | +### Alternative 1: Registry pattern with `init()`-time registration |
| 230 | + |
| 231 | +A `RegisterCPUMeter(name, factory)` API where each backend's source file calls it from `init()`, plus a `DiscoverCPU` that walks the registry. This is what GPU does today. |
| 232 | + |
| 233 | +**Why rejected**: speculative architecture for three backends. The registry pattern's value doesn't apply. |
| 234 | + |
| 235 | +The switch dispatch achieves the same operator outcome (config-driven priority) with less complexity. If other CPU backends are required, migrating each `case` to a `RegisterCPUMeter` call is possible and the `CPUPowerMeter` interface doesn't change. |
| 236 | + |
| 237 | +### Alternative 2: Unify CPU and GPU under one per-device interface |
| 238 | + |
| 239 | +This would define a common `Device` interface that both CPU zones and GPU devices implement. |
| 240 | + |
| 241 | +This is not viable since CPU zones (logical, energy in µJ per zone) and GPU devices (physical, watts plus per-process attribution) have different shapes. A unifying interface would push hardware-specific concerns through `interface{}` or lose information. |
| 242 | + |
| 243 | +### Alternative 3: Per-backend Prometheus error counter |
| 244 | + |
| 245 | +This would expose `kepler_meter_failures_total{backend, stage}` instead of returning errors. |
| 246 | + |
| 247 | +While this solves visibility for transient failures during runtime, it doesn't address the startup-time question this proposal targets — "which backend was selected and why." Operational metrics could be added orthogonally. |
| 248 | + |
| 249 | +## Next Steps |
| 250 | + |
| 251 | +Out of scope for this EP. |
| 252 | + |
| 253 | +### GPU failure-mode split |
| 254 | + |
| 255 | +Refactor `gpu.Discover` to split factory / `Init()` / empty-devices failure modes and return `(meters, error)`. Separate PR; independent of the CPU work. |
| 256 | + |
| 257 | +### Operational metrics per backend |
| 258 | + |
| 259 | +A small set of operational metrics per backend: discovery success/failure counters, duration. Useful for operators running heterogeneous fleets. |
| 260 | + |
| 261 | +### Migrate to a registry pattern when warranted |
| 262 | + |
| 263 | +If out-of-tree CPU backends or a plugin model becomes a real requirement (BMC, MSR, ARM-vendor counters arriving from external contributors), the switch in `buildCPUMeter` migrates to `RegisterCPUMeter` calls. The interface (`CPUPowerMeter`) and config (`cpu.meters`) don't change, so the migration is mechanical. |
0 commit comments