|
| 1 | +# Draft: Config-Driven CPU Power Meter Selection |
| 2 | + |
| 3 | +* **Status**: Draft |
| 4 | +* **Author**: Niki Manoledaki |
| 5 | +* **Created**: 2026-05-03 |
| 6 | + |
| 7 | +## Problem |
| 8 | + |
| 9 | +For context, `cmd/kepler/main.go` constructs CPU and GPU power meters in two different shapes: |
| 10 | + |
| 11 | +* `createCPUMeter` hardcodes RAPL with hwmon as a fallback. It returns `(device.CPUPowerMeter, error)`. This is strict by design: Kepler must produce a CPU meter or fail startup. |
| 12 | +* `createGPUMeters` fans out across registered vendors via `gpu.DiscoverAll`. It returns `[]gpu.GPUPowerMeter` with no error, which is a soft failure mode, also by design, as it assumes that GPU monitoring is not required for Kepler. |
| 13 | + |
| 14 | +Once we agree on these assumptions, here are the problems that follow. |
| 15 | + |
| 16 | +1. **Asymmetry.** Two patterns for the same job (turn config into hardware abstractions) makes `cmd/kepler/main.go` harder to read and audit. |
| 17 | +2. **Hidden GPU failures.** `gpu.Discover` (`internal/device/gpu/registry.go:48`) collapses three failure modes into one `nil` return: |
| 18 | + * factory error (driver missing or unsupported) |
| 19 | + * `Init()` error (driver present but broken) |
| 20 | + * empty `Devices()` (vendor registered, no hardware on this node) |
| 21 | + |
| 22 | + Only the third is a legitimate soft outcome. Real backend failures surface only as a `Warn` log line. |
| 23 | +3. **Closed CPU path.** RAPL and hwmon are hard-coded fallbacks with hwmon force-enable as the only escape hatch. Operators on platforms where neither is the right source (ARM with vendor-specific counters, BMC-derived per-socket power, external probes via sidecar) cannot reorder backend priority without forking `createCPUMeter`. Mytton's *Model, estimate or measure?* ([link](https://www.devsustainability.com/p/model-estimate-or-measure-what-matters)) argues that operators making sustainability claims need a source-agnostic power-data path. Kepler's CPU path is opinionated about its sources. |
| 24 | + |
| 25 | +## Goals |
| 26 | + |
| 27 | +* Open CPU power-meter selection to a config-driven priority list operators can reorder. |
| 28 | +* Surface real GPU backend failures as errors instead of logging them. |
| 29 | +* Unify the CPU and GPU lifecycle shape: discovery, failure, initialization. |
| 30 | +* Preserve default behaviour for healthy systems running default config. |
| 31 | + |
| 32 | +## Non-Goals |
| 33 | + |
| 34 | +* This proposal does not build a common interface **per device** across CPU zones and GPU devices. Their measurement shapes differ and a unifying interface would abstract hardware-specific concerns through `interface{}`, which is not ideal. The CPU and GPU attributes should remain separate. |
| 35 | +* This proposal does not replace `monitor.PowerDataProvider`, which will continue to be the consumer for exporters. This proposal unifies the step right before that, which is how to construct the hardware backend. |
| 36 | +* This proposal does not introduce a registry pattern with `init()`-time backend registration. Registry-based dispatch is appropriate when there are many backends or third-party plugins; for three known in-tree backends, a switch is simpler. See [Alternatives Considered](#alternatives-considered). |
| 37 | + |
| 38 | +## Proposed Solution |
| 39 | + |
| 40 | +### Promote `device.PowerMeter` |
| 41 | + |
| 42 | +Promote the unexported `powerMeter` interface in [internal/device/power_meter.go](../../../internal/device/power_meter.go) to an external interface `PowerMeter`, like this: |
| 43 | + |
| 44 | +```go |
| 45 | +// PowerMeter is a hardware backend that reads energy or power from a class |
| 46 | +// of hardware (CPU package, GPU device, etc). |
| 47 | +// |
| 48 | +// Many PowerMeters can be selected. Each contributes its own readings. |
| 49 | +// Domain-specific methods live on subinterfaces that embed PowerMeter. |
| 50 | +type PowerMeter interface { |
| 51 | + service.Service // Name() |
| 52 | + service.Initializer // Init() |
| 53 | +} |
| 54 | +``` |
| 55 | + |
| 56 | +`Shutdown` is optional via `service.Shutdowner`; backends with resources (NVIDIA's NVML library state, future BMC HTTP clients, MSR file handles) implement it and the run framework type-asserts before calling. RAPL, hwmon, and fake hold no resources and don't need it. |
| 57 | + |
| 58 | +`device.CPUPowerMeter` and `gpu.GPUPowerMeter` embed `device.PowerMeter` and add their own methods. No change to those domain methods in this EP. |
| 59 | + |
| 60 | +### Switch-based CPU dispatch |
| 61 | + |
| 62 | +CPU gains a config key for ordering plus a switch that walks the priority list: |
| 63 | + |
| 64 | +```yaml |
| 65 | +cpu: |
| 66 | + meters: ["rapl", "hwmon"] # ordered priority |
| 67 | +``` |
| 68 | +
|
| 69 | +```go |
| 70 | +// internal/device/cpu_power_meter.go |
| 71 | + |
| 72 | +func CreateCPUMeter(logger *slog.Logger, cfg *config.Config) (CPUPowerMeter, error) { |
| 73 | + var errs []error |
| 74 | + for _, name := range cfg.Cpu.Meters { |
| 75 | + meter, err := buildCPUMeter(name, logger, cfg) |
| 76 | + if err != nil { /* aggregate, continue */ } |
| 77 | + if err := meter.Init(); err != nil { /* aggregate, continue */ } |
| 78 | + zones, _ := meter.Zones() |
| 79 | + if len(zones) == 0 { /* soft skip, continue */ } |
| 80 | + return meter, nil |
| 81 | + } |
| 82 | + return nil, errors.Join(errs...) |
| 83 | +} |
| 84 | + |
| 85 | +func buildCPUMeter(name string, logger *slog.Logger, cfg *config.Config) (CPUPowerMeter, error) { |
| 86 | + switch name { |
| 87 | + case "rapl": return NewCPUPowerMeter(...) |
| 88 | + case "hwmon": return NewHwmonPowerMeter(...) |
| 89 | + case "fake": return NewFakeCPUMeter(...) |
| 90 | + default: return nil, fmt.Errorf("unknown cpu meter %q", name) |
| 91 | + } |
| 92 | +} |
| 93 | +``` |
| 94 | + |
| 95 | +Two functions, one file. `CreateCPUMeter` owns the lifecycle (build, init, zones, error aggregation). `buildCPUMeter` is the dispatch. Adding a backend means a new case + a new constructor. |
| 96 | + |
| 97 | +Default value `["rapl", "hwmon"]` preserves the prior fallback behaviour. Operators reorder via the config; no fork needed. |
| 98 | + |
| 99 | +### GPU failure-mode split |
| 100 | + |
| 101 | +`gpu.Discover` is refactored to split the three failure modes and return `(meters, error)`: |
| 102 | + |
| 103 | +| State | GPU result | |
| 104 | +|------------------------------|-------------------------------------------------| |
| 105 | +| Hardware works | `meters, nil` | |
| 106 | +| Hardware absent on this node | `nil, nil` | |
| 107 | +| Configured backend broken | `meters?, err` (per-vendor failures aggregated) | |
| 108 | +| Feature off | `nil, nil` | |
| 109 | + |
| 110 | +Result per scenario at startup: |
| 111 | + |
| 112 | +| Scenario | Result | |
| 113 | +|---------------------|--------------------------| |
| 114 | +| Factory error | Real failure. Aggregate. | |
| 115 | +| `Init()` error | Real failure. Aggregate. | |
| 116 | +| Empty `Devices()` | Soft skip. Not an error. | |
| 117 | +| Success | Append to result. | |
| 118 | + |
| 119 | +Error aggregation uses `errors.Join`. `gpu.Discover` returns the joined error only when the GPU feature is explicitly enabled and every registered vendor returned a real failure. "No GPU on this node" stays `(nil, nil)`. |
| 120 | + |
| 121 | +GPU continues to use a vendor registry (NVIDIA registers from `init()` today). The vendor registry is justified because GPU vendor backends live in subpackages (`gpu/nvidia/`) with vendor-specific dependencies (NVML CGO bindings); a switch in `cmd/kepler/main.go` would force every vendor's heavy import into the main binary regardless of whether the operator opted in. CPU backends don't have this constraint — RAPL, hwmon, and fake all live in `internal/device/` with no vendor-specific imports. |
| 122 | + |
| 123 | +### Unified `cmd/kepler/main.go` |
| 124 | + |
| 125 | +End result after the refactor: |
| 126 | + |
| 127 | +```go |
| 128 | +cpuMeter, err := device.CreateCPUMeter(logger, cfg) |
| 129 | +if err != nil { ... } |
| 130 | + |
| 131 | +gpuMeters, err := gpu.Discover(logger, cfg) |
| 132 | +if err != nil { ... } |
| 133 | +``` |
| 134 | + |
| 135 | +Symmetric shape (`(meter(s), error)`), different failure semantics (CPU strict, GPU soft). |
| 136 | + |
| 137 | +## Detailed Design |
| 138 | + |
| 139 | +### Package layout |
| 140 | + |
| 141 | +```text |
| 142 | +internal/device/ |
| 143 | +├── power_meter.go # PowerMeter interface (promoted from private) |
| 144 | +├── cpu_power_meter.go # CPUPowerMeter, CreateCPUMeter, buildCPUMeter |
| 145 | +├── rapl_sysfs_power_meter.go # NewCPUPowerMeter (RAPL) |
| 146 | +├── hwmon_power_meter.go # NewHwmonPowerMeter |
| 147 | +├── fake_cpu_power_meter.go # NewFakeCPUMeter |
| 148 | +└── gpu/ |
| 149 | + ├── interface.go # GPUPowerMeter embeds device.PowerMeter, plus Vendor and GPUDevice types |
| 150 | + ├── registry.go # gpu.Register, gpu.Discover (refactored failure modes) |
| 151 | + └── nvidia/ # init() calls gpu.Register(NVIDIA, ...) |
| 152 | +``` |
| 153 | + |
| 154 | +No new file for CPU dispatch — `CreateCPUMeter` and `buildCPUMeter` live in `cpu_power_meter.go` next to the `CPUPowerMeter` interface they construct. |
| 155 | + |
| 156 | +### Logging |
| 157 | + |
| 158 | +Per-backend `Warn` on failure, `Info` on selection: |
| 159 | + |
| 160 | +```text |
| 161 | +INFO using rapl power meter |
| 162 | +``` |
| 163 | + |
| 164 | +```text |
| 165 | +WARN rapl not available, trying next backend error="..." |
| 166 | +INFO using hwmon power meter |
| 167 | +``` |
| 168 | + |
| 169 | +GPU keeps its existing per-vendor `Warn` plus a one-line summary: |
| 170 | + |
| 171 | +```text |
| 172 | +INFO gpu meter discovery ok=[nvidia] failed=[amd: factory: rocm-smi not found] |
| 173 | +``` |
| 174 | + |
| 175 | +## Configuration |
| 176 | + |
| 177 | +```yaml |
| 178 | +cpu: |
| 179 | + meters: ["rapl", "hwmon"] # ordered priority |
| 180 | +``` |
| 181 | +
|
| 182 | +Per-backend tuning keys (`rapl.zones`, `experimental.hwmon.zones`, `experimental.hwmon.chipRules`, `dev.fake-cpu-meter.zones`) are unchanged. Legacy selectors translate at startup to `cpu.meters` and emit a deprecation warning. See [Backward compatibility](#backward-compatibility) for the full migration. |
| 183 | + |
| 184 | +GPU config does not change in this EP. |
| 185 | + |
| 186 | +## Testing Strategy |
| 187 | + |
| 188 | +* `device.CreateCPUMeter` table-driven tests: unknown name, factory error, `Init()` error, empty zones, success, ordered priority, all-fail aggregation. |
| 189 | +* `gpu.Discover` table-driven tests: all-fail, mixed, all-empty, all-succeed; per-vendor failure modes (factory, `Init()`, empty `Devices()`). |
| 190 | + |
| 191 | +## Backward compatibility |
| 192 | + |
| 193 | +Default `cpu.meters: ["rapl", "hwmon"]` reproduces default behaviour. No breaking change for healthy systems running default config. |
| 194 | + |
| 195 | +Two legacy selectors require migration. They translate at startup to an effective `cpu.meters` value, log a deprecation warning, and stop working in a future release: |
| 196 | + |
| 197 | +* `experimental.hwmon.forceEnabled: true` (today: force hwmon, skip RAPL) → effective `cpu.meters: ["hwmon"]`. |
| 198 | +* `dev.fake-cpu-meter.enabled: true` (today: use fake meter, skip RAPL/hwmon) → effective `cpu.meters: ["fake"]`. |
| 199 | + |
| 200 | +Per-backend tuning keys are unchanged and remain valid: `rapl.zones`, `experimental.hwmon.zones`, `experimental.hwmon.chipRules`, `dev.fake-cpu-meter.zones`. |
| 201 | + |
| 202 | +Operators on broken systems will see `kepler` exit with a clearer error when every CPU backend fails (the prior path also exits, but with the hwmon error only, which can be misleading). |
| 203 | + |
| 204 | +GPU error semantics tighten. A node with `gpu.enabled=true` and a broken NVML driver will fail startup instead of silently running CPU-only. This matches the CPU contract. CPU-only is the default (`gpu.enabled=false`); the tightening only affects operators who explicitly opt in to GPU. |
| 205 | + |
| 206 | +## Implementation |
| 207 | + |
| 208 | +Single PR introduces: |
| 209 | + |
| 210 | +1. Promote `device.PowerMeter` interface (`Name`, `Init`). |
| 211 | +2. Add `cfg.Cpu.Meters` config field with default `["rapl", "hwmon"]`. |
| 212 | +3. Add `device.CreateCPUMeter` + `buildCPUMeter` switch dispatch. |
| 213 | +4. Add `Config.ApplyCpuMeterDeprecations` to translate the two legacy keys. |
| 214 | +5. Replace `cmd/kepler/main.go`'s `createCPUMeter` / `createHwmonMeter` with a call to `device.CreateCPUMeter`. |
| 215 | +6. Update sample configs and docs. |
| 216 | + |
| 217 | +GPU failure-mode split lands as a follow-up PR — separable since CPU and GPU paths are independent in `cmd/kepler/main.go`. |
| 218 | + |
| 219 | +## Risks and Mitigations |
| 220 | + |
| 221 | +### Operational risk: stricter GPU error path (follow-up PR) |
| 222 | + |
| 223 | +* **Risk**: Nodes with GPU explicitly enabled (`gpu.enabled=true`) and broken drivers fail startup where the prior code silently continued without GPU metrics. |
| 224 | +* **Mitigation**: Matches the CPU contract. Default-off (`gpu.enabled=false`) means only opted-in operators are affected; they can revert to CPU-only by removing the explicit opt-in. |
| 225 | + |
| 226 | +### Configuration risk: invalid `cpu.meters` value |
| 227 | + |
| 228 | +* **Risk**: Operator typo (`"rappl"`) is rejected by `buildCPUMeter` with `unknown cpu meter "rappl"`. |
| 229 | +* **Mitigation**: The error names the unknown backend and the operator's `cpu.meters` is preserved in the printed config at startup so the typo is visible in logs. |
| 230 | + |
| 231 | +## Alternatives Considered |
| 232 | + |
| 233 | +### Alternative 1: Registry pattern with `init()`-time registration |
| 234 | + |
| 235 | +A `RegisterCPUMeter(name, factory)` API where each backend's source file calls it from `init()`, plus a `DiscoverCPU` that walks the registry. This is what GPU does today. |
| 236 | + |
| 237 | +**Why rejected**: speculative architecture for three in-tree backends. The registry pattern's value (third-party plugins, out-of-tree backends, no main.go edits) doesn't apply to Kepler today. The cost (a new package-level API surface, `init()` side effects, test isolation via `Clear` helpers, jumping between files to find where backends are registered) is real and visible to every reader. |
| 238 | + |
| 239 | +The switch dispatch achieves the same operator outcome (config-driven priority) with less vocabulary. If out-of-tree CPU backends become a real requirement, migrating each `case` to a `RegisterCPUMeter` call is mechanical — the `CPUPowerMeter` interface doesn't change. |
| 240 | + |
| 241 | +GPU keeps its registry because vendor backends live in subpackages with heavy vendor-specific imports (NVML, future ROCm/oneAPI). A switch would pull every vendor's dependencies into the main binary unconditionally. |
| 242 | + |
| 243 | +### Alternative 2: Unify CPU and GPU under one per-device interface |
| 244 | + |
| 245 | +This would define a common `Device` interface that both CPU zones and GPU devices implement. |
| 246 | + |
| 247 | +This is not viable since CPU zones (logical, energy in µJ per zone) and GPU devices (physical, watts plus per-process attribution) have different shapes. A unifying interface would push hardware-specific concerns through `interface{}` or lose information. |
| 248 | + |
| 249 | +### Alternative 3: Per-backend Prometheus error counter |
| 250 | + |
| 251 | +This would expose `kepler_meter_failures_total{backend, stage}` instead of returning errors. |
| 252 | + |
| 253 | +While this solves visibility for transient failures during runtime, it doesn't address the startup-time question this proposal targets — "which backend was selected and why." Operational metrics could be added orthogonally. |
| 254 | + |
| 255 | +## Next Steps |
| 256 | + |
| 257 | +Out of scope for this EP. |
| 258 | + |
| 259 | +### GPU failure-mode split |
| 260 | + |
| 261 | +Refactor `gpu.Discover` to split factory / `Init()` / empty-devices failure modes and return `(meters, error)`. Separate PR; independent of the CPU work. |
| 262 | + |
| 263 | +### Operational metrics per backend |
| 264 | + |
| 265 | +A small set of operational metrics per backend: discovery success/failure counters, duration. Useful for operators running heterogeneous fleets. |
| 266 | + |
| 267 | +### Migrate to a registry pattern when warranted |
| 268 | + |
| 269 | +If out-of-tree CPU backends or a plugin model becomes a real requirement (BMC, MSR, ARM-vendor counters arriving from external contributors), the switch in `buildCPUMeter` migrates to `RegisterCPUMeter` calls. The interface (`CPUPowerMeter`) and config (`cpu.meters`) don't change, so the migration is mechanical. |
0 commit comments