|
| 1 | +# Draft: Extensible Power Meter Registry |
| 2 | + |
| 3 | +* **Status**: Draft |
| 4 | +* **Author**: Niki Manoledaki |
| 5 | +* **Created**: 2026-05-03 |
| 6 | + |
| 7 | +## Problem |
| 8 | + |
| 9 | +For context, `cmd/kepler/main.go` constructs CPU and GPU power meters in two different shapes: |
| 10 | + |
| 11 | +* `createCPUMeter` hardcodes RAPL with hwmon as a fallback. It returns `(device.CPUPowerMeter, error)`. This is strict by design: Kepler must produce a CPU meter or fail startup. |
| 12 | +* `createGPUMeters` fans out across registered vendors via `gpu.DiscoverAll`. It returns `[]gpu.GPUPowerMeter` with no error, which is a soft failure mode, also by design, as it assumes that GPU monitoring is not required for Kepler. |
| 13 | + |
| 14 | +Once we agree on these assumptions, here are the problems that follow. |
| 15 | + |
| 16 | +1. **Asymmetry.** Two patterns for the same job (turn config into hardware abstractions) makes `cmd/kepler/main.go` harder to read and audit. |
| 17 | +2. **Hidden GPU failures.** `gpu.Discover` (`internal/device/gpu/registry.go:48`) collapses three failure modes into one `nil` return: |
| 18 | + * factory error (driver missing or unsupported) |
| 19 | + * `Init()` error (driver present but broken) |
| 20 | + * empty `Devices()` (vendor registered, no hardware on this node) |
| 21 | + |
| 22 | + Only the third is a legitimate soft outcome. Real backend failures surface only as a `Warn` log line. |
| 23 | +3. **Closed CPU path.** RAPL and hwmon are hard-coded fallbacks. Operators on platforms where neither is the right source (ARM with vendor-specific counters, BMC-derived per-socket power, external probes via sidecar) cannot extend Kepler without forking `createCPUMeter`. Mytton's *Model, estimate or measure?* ([link](https://www.devsustainability.com/p/model-estimate-or-measure-what-matters)) argues that operators making sustainability claims need a source-agnostic power-data path. Kepler's CPU path is opinionated about its sources. |
| 24 | + |
| 25 | +## Goals |
| 26 | + |
| 27 | +* Create an extensible registration surface for CPU and GPU devices i.e. `internal/device/`. |
| 28 | +* Unify the pattern for CPU and GPU device lifecycle: discovery, failure, initialization, shutdown. |
| 29 | +* Surface real backend failures as errors instead of logging them. |
| 30 | +* Open CPU power-meter selection to the same registry pattern GPU already uses. |
| 31 | +* Preserve default behaviour for healthy systems running default config. |
| 32 | + |
| 33 | +## Non-Goals |
| 34 | + |
| 35 | +* This proposal does not build a common interface **per device** across CPU zones and GPU devices. Their measurement shapes differ and a unifying interface would abstract hardware-specific concerns through `interface{}`, which is not ideal. The CPU and GPU attributes should remain separate. |
| 36 | +* This proposal does not replace `monitor.PowerDataProvider`, which will continue to be the consumer for exporters. This proposal unifies the step right before that, which is how to register the hardware backend. |
| 37 | + |
| 38 | +## Proposed Solution |
| 39 | + |
| 40 | +### Promote `device.PowerMeter` |
| 41 | + |
| 42 | +Promote the unexported `powerMeter` interface in [internal/device/power_meter.go](../../../internal/device/power_meter.go) to an external interface `PowerMeter`, like this: |
| 43 | + |
| 44 | +```go |
| 45 | +// PowerMeter is a registered hardware backend that reads energy or power |
| 46 | +// from a class of hardware (CPU package, GPU device, etc). |
| 47 | +// |
| 48 | +// Many PowerMeters can be registered. Each contributes its own readings. |
| 49 | +// Domain-specific methods live on subinterfaces that embed PowerMeter. |
| 50 | +type PowerMeter interface { |
| 51 | + service.Service // Name() |
| 52 | + service.Initializer // Init() |
| 53 | +} |
| 54 | +``` |
| 55 | + |
| 56 | +`device.CPUPowerMeter` and `gpu.GPUPowerMeter` embed `device.PowerMeter` and add their own methods. No change to those domain methods in this EP. |
| 57 | + |
| 58 | +### Registry and factory |
| 59 | + |
| 60 | +Each domain owns its registry: |
| 61 | + |
| 62 | +* CPU registration lives at `internal/device/` -- ideally this would be split out into `cpu/` (see [Next Steps](#next-steps)) |
| 63 | +* GPU registration stays in `internal/device/gpu/` where the GPU types and vendor backends live. |
| 64 | + |
| 65 | +```go |
| 66 | +// internal/device/registry.go |
| 67 | +func RegisterCPUMeter(name string, fn func(*slog.Logger, *config.Config) (CPUPowerMeter, error)) |
| 68 | +func DiscoverCPU(*slog.Logger, *config.Config) (CPUPowerMeter, error) |
| 69 | +``` |
| 70 | + |
| 71 | +Refactor GPU power meter: |
| 72 | + |
| 73 | +```go |
| 74 | +// internal/device/gpu/registry.go |
| 75 | +func Register(vendor Vendor, fn func(*slog.Logger, *config.Config) (GPUPowerMeter, error)) |
| 76 | +func Discover(*slog.Logger, *config.Config) ([]GPUPowerMeter, error) |
| 77 | +``` |
| 78 | + |
| 79 | +NVIDIA's `init()` keeps calling `gpu.Register(NVIDIA, newNvidiaMeter)` — same import as the types it returns. The GPU package remains the single home for GPU-specific concerns: `Vendor` enum, `GPUDevice`, MIG types, dcgm-exporter integration. |
| 80 | + |
| 81 | +### Config-driven CPU backend order |
| 82 | + |
| 83 | +CPU gains a config key that mirrors what GPU's vendor registry already does implicitly: |
| 84 | + |
| 85 | +```yaml |
| 86 | +cpu: |
| 87 | + meters: ["rapl", "hwmon"] # ordered priority |
| 88 | +``` |
| 89 | + |
| 90 | +`DiscoverCPU` reads `cfg.Cpu.Meters`, walks the list, calls each registered factory, calls `Init()`, and returns the first meter that yields zones. Default value preserves default behaviour. Operators can swap or re-order without forking. |
| 91 | + |
| 92 | +### Unified startup approach |
| 93 | + |
| 94 | +Both `device.DiscoverCPU` and `gpu.Discover` return `(meters, error)` with similar semantics but different treatment by design: |
| 95 | + |
| 96 | +| State | CPU result | GPU result | |
| 97 | +|------------------------------|------------------------------------|-------------------------------------------------| |
| 98 | +| Hardware works | `meter, nil` | `meters, nil` | |
| 99 | +| Hardware absent on this node | n/a (CPU is mandatory) | `nil, nil` | |
| 100 | +| Configured backend broken | `nil, err` (after order exhausted) | `meters?, err` (per-vendor failures aggregated) | |
| 101 | +| Feature off | n/a | `nil, nil` | |
| 102 | + |
| 103 | +Result per scenario at startup: |
| 104 | + |
| 105 | +| Scenario | Result | |
| 106 | +|---------------------|--------------------------| |
| 107 | +| Factory error | Real failure. Aggregate. | |
| 108 | +| `Init()` error | Real failure. Aggregate. | |
| 109 | +| Empty zones/devices | Soft skip. Not an error. | |
| 110 | +| Success | Append to result. | |
| 111 | + |
| 112 | +Error aggregation uses `errors.Join`. |
| 113 | + |
| 114 | +`device.DiscoverCPU` returns the joined error only if no backend produced a meter, respecting today's strict CPU contract. On the other hand, `gpu.Discover` returns the joined error only when the GPU feature is explicitly enabled and every registered vendor returned a real failure. "No GPU on this node" stays `(nil, nil)`. |
| 115 | + |
| 116 | +Finally, the end result is a unified `cmd/kepler/main.go` after the refactor: |
| 117 | + |
| 118 | +```go |
| 119 | +cpuMeter, err := device.DiscoverCPU(logger, cfg) |
| 120 | +if err != nil { ... } |
| 121 | + |
| 122 | +gpuMeters, err := gpu.Discover(logger, cfg) |
| 123 | +if err != nil { ... } |
| 124 | +``` |
| 125 | + |
| 126 | +## Detailed Design |
| 127 | + |
| 128 | +### Package layout |
| 129 | + |
| 130 | +```text |
| 131 | +internal/device/ |
| 132 | +├── power_meter.go # PowerMeter interface (promoted from private) |
| 133 | +├── registry.go # RegisterCPUMeter, DiscoverCPU |
| 134 | +├── cpu_power_meter.go # CPUPowerMeter embeds PowerMeter |
| 135 | +├── rapl_sysfs_power_meter.go # init() registers "rapl" |
| 136 | +├── hwmon_power_meter.go # init() registers "hwmon" |
| 137 | +├── fake_cpu_power_meter.go # init() registers "fake" |
| 138 | +└── gpu/ |
| 139 | + ├── interface.go # GPUPowerMeter embeds device.PowerMeter, plus Vendor and GPUDevice types |
| 140 | + ├── registry.go # gpu.Register, gpu.Discover (refactored) |
| 141 | + └── nvidia/ # init() calls gpu.Register(NVIDIA, ...) |
| 142 | +``` |
| 143 | + |
| 144 | +### Registration timing |
| 145 | + |
| 146 | +Built-in CPU backends register from `init()` in their files. They live in `internal/device`, so importing the package activates them; no blank imports needed for built-ins. |
| 147 | + |
| 148 | +Backends in subpackages (today: `gpu/nvidia`; future examples: `cpu/msr` per EP-002, `cpu/bmc`) need blank imports in `cmd/kepler/main.go`, matching the NVIDIA pattern. |
| 149 | + |
| 150 | +### Logging |
| 151 | + |
| 152 | +Each `Discover*` call produces one summary log line: |
| 153 | + |
| 154 | +```text |
| 155 | +INFO cpu meter discovery ok=[rapl] failed=[] |
| 156 | +INFO gpu meter discovery ok=[nvidia] failed=[amd: factory: rocm-smi not found] |
| 157 | +``` |
| 158 | + |
| 159 | +Per-backend errors stay at `Warn` for ops who want detail. |
| 160 | + |
| 161 | +## Configuration |
| 162 | + |
| 163 | +```yaml |
| 164 | +cpu: |
| 165 | + meters: ["rapl", "hwmon"] # ordered priority |
| 166 | +``` |
| 167 | +
|
| 168 | +Per-backend tuning keys (`rapl.zones`, `experimental.hwmon.zones`, `experimental.hwmon.chipRules`, `dev.fakeCpuMeter.Zones`) are unchanged. Legacy selectors (`experimentalHwmonFeature`, `dev.fakeCpuMeter.Enabled`) translate at startup to `cpu.meters` and emit a deprecation warning. See [Backward compatibility](#backward-compatibility) for the full migration. |
| 169 | + |
| 170 | +GPU config does not change in this EP. |
| 171 | + |
| 172 | +## Testing Strategy |
| 173 | + |
| 174 | +* `device.DiscoverCPU` table-driven tests: backend not registered, factory error, `Init()` error, empty zones, success, ordered priority. |
| 175 | +* `gpu.Discover` table-driven tests: all-fail, mixed, all-empty, all-succeed; per-vendor failure modes (factory, `Init()`, empty `Devices()`). |
| 176 | +* `cmd/kepler` test asserting registry-driven main produces the same service set as the prior code for default config. |
| 177 | + |
| 178 | +## Backward compatibility |
| 179 | + |
| 180 | +Default `cpu.meters: ["rapl", "hwmon"]` reproduces default behaviour. No breaking change for healthy systems running default config. |
| 181 | + |
| 182 | +Two legacy selectors require migration. They translate at startup to an effective `cpu.meters` value, log a deprecation warning, and stop working in a future release: |
| 183 | + |
| 184 | +* `experimentalHwmonFeature: true` (today: force hwmon, skip RAPL) → effective `cpu.meters: ["hwmon"]`. |
| 185 | +* `dev.fakeCpuMeter.Enabled: true` (today: use fake meter, skip RAPL/hwmon) → effective `cpu.meters: ["fake"]`. |
| 186 | + |
| 187 | +Per-backend tuning keys are unchanged and remain valid: `rapl.zones`, `experimental.hwmon.zones`, `experimental.hwmon.chipRules`, `dev.fakeCpuMeter.Zones`. |
| 188 | + |
| 189 | +Operators on broken systems will see `kepler` exit with a clearer error when every CPU backend fails (the prior path also exits, but with the hwmon error only, which can be misleading). |
| 190 | + |
| 191 | +GPU error semantics tighten. A node with `gpu.enabled=true` and a broken NVML driver will fail startup instead of silently running CPU-only. This matches the CPU contract. CPU-only is the default (`gpu.enabled=false`); the tightening only affects operators who explicitly opt in to GPU. |
| 192 | + |
| 193 | +## Migration path |
| 194 | + |
| 195 | +1. **Phase 1**: introduce `device.PowerMeter` and `device.RegisterCPUMeter` / `device.DiscoverCPU`. Keep `createCPUMeter` calling into them. |
| 196 | +2. **Phase 2**: refactor `gpu.Discover` to split the three failure modes (factory, `Init()`, empty devices) and aggregate real failures. Update its signature to `(meters, error)`. |
| 197 | +3. **Phase 3**: move RAPL, hwmon, fake under the registry. Add `cpu.meters` config key with default order. |
| 198 | +4. **Phase 4**: simplify `cmd/kepler/main.go` to two `Discover` calls. Move GPU optional-config injection (`IdlePowerConfigurable`, `DCGMEndpointConfigurable`) into the NVIDIA factory. |
| 199 | + |
| 200 | +## Risks and Mitigations |
| 201 | + |
| 202 | +### Operational risk: stricter GPU error path |
| 203 | + |
| 204 | +* **Risk**: Nodes with GPU explicitly enabled (`gpu.enabled=true`) and broken drivers fail startup where the prior code silently continued without GPU metrics. |
| 205 | +* **Mitigation**: Matches the CPU contract. Default-off (`gpu.enabled=false`) means only opted-in operators are affected; they can revert to CPU-only by removing the explicit opt-in. |
| 206 | + |
| 207 | +### Maintenance risk: registry boilerplate |
| 208 | + |
| 209 | +* **Risk**: Two parallel registries duplicate small amounts of code. |
| 210 | +* **Mitigation**: Each registry is ~50 lines. |
| 211 | + |
| 212 | +### Configuration risk: invalid `cpu.meters` value |
| 213 | + |
| 214 | +* **Risk**: Operator typo (`"rappl"`) silently disables that backend. |
| 215 | +* **Mitigation**: `DiscoverCPU` returns an error listing unknown backend names alongside registered ones. Validation runs at startup, not first-use. |
| 216 | + |
| 217 | +## Alternatives Considered |
| 218 | + |
| 219 | +### Alternative 1: Unify error semantics at init time with no registry |
| 220 | + |
| 221 | +Add `error` to `createGPUMeters` and split `gpu.Discover`'s three failure modes. Leave `createCPUMeter`'s fallback chain. |
| 222 | + |
| 223 | +While this is easy to implement, it doesn't address the CPU extensibility problem. Operators still cannot extend non-RAPL/non-hwmon backends easily. |
| 224 | + |
| 225 | +### Alternative 2: Unify CPU and GPU under one per-device interface |
| 226 | + |
| 227 | +This would define a common `Device` interface that both CPU zones and GPU devices implement. |
| 228 | + |
| 229 | +This is not viable since CPU zones (logical, energy in µJ per zone) and GPU devices (physical, watts plus per-process attribution) have different shapes. A unifying interface would push hardware-specific concerns through `interface{}` or lose information. |
| 230 | + |
| 231 | +### Alternative 3: Per-backend Prometheus error counter (cloudcost-exporter pattern) |
| 232 | + |
| 233 | +This would expose `kepler_meter_discovery_failures_total{backend, stage}` instead of returning errors. |
| 234 | + |
| 235 | +While this solves the visibility problem for GPU device registration, it does not provide consistency with the CPU path. That being said, it could be added to this proposal - it would be good to have operational metrics. |
| 236 | + |
| 237 | +## Next Steps |
| 238 | + |
| 239 | +Out of scope for this EP. |
| 240 | + |
| 241 | +### Operational metrics per backend |
| 242 | + |
| 243 | +A small set of operational metrics per registered backend: discovery success/failure counters, duration, latency. Useful for operators running heterogeneous fleets. |
| 244 | + |
| 245 | +### Full `device/` package split by feature |
| 246 | + |
| 247 | +Once CPU gains a second non-trivial backend (e.g., MSR, Redfish-per-CPU, BMC), promote the layout to one subpackage per backend: |
| 248 | + |
| 249 | +```text |
| 250 | +internal/device/ |
| 251 | +├── power_meter.go # PowerMeter, Energy, Power (shared) |
| 252 | +├── cpu/ |
| 253 | +│ ├── meter.go # CPUPowerMeter, EnergyZone |
| 254 | +│ ├── registry.go # cpu.Register, cpu.Discover |
| 255 | +│ ├── rapl/ |
| 256 | +│ ├── hwmon/ |
| 257 | +│ └── fake/ |
| 258 | +└── gpu/ |
| 259 | + ├── meter.go |
| 260 | + ├── registry.go |
| 261 | + └── nvidia/ |
| 262 | +``` |
| 263 | + |
| 264 | +Defer until at least one new backend is on the roadmap. Justifies the move with a concrete addition. |
0 commit comments