docs(proposal): add draft for extensible power meter registry

nikimanoledaki · nikimanoledaki · commit 51fea8e82449 · 2026-05-03T20:24:56.000+02:00
Propose a `device.PowerMeter` interface and per-domain registry to unify
CPU and GPU device discovery in `cmd/kepler/main.go`. Opens the CPU path
to non-RAPL backends (MSR, BMC-derived, external probes) without forking,
matching the registry pattern GPU already uses.

Signed-off-by: nikimanoledaki <niki.manoledaki@grafana.com>
diff --git a/docs/developer/proposal/draft-20260503-extensible-power-meter-registry.md b/docs/developer/proposal/draft-20260503-extensible-power-meter-registry.md
@@ -0,0 +1,264 @@
+# Draft: Extensible Power Meter Registry
+
+* **Status**: Draft
+* **Author**: Niki Manoledaki
+* **Created**: 2026-05-03
+
+## Problem
+
+For context, `cmd/kepler/main.go` constructs CPU and GPU power meters in two different shapes:
+
+* `createCPUMeter` hardcodes RAPL with hwmon as a fallback. It returns `(device.CPUPowerMeter, error)`. This is strict by design: Kepler must produce a CPU meter or fail startup.
+* `createGPUMeters` fans out across registered vendors via `gpu.DiscoverAll`. It returns `[]gpu.GPUPowerMeter` with no error. This soft failure mode is also by design: it assumes GPU monitoring is optional for Kepler.
+
+Given these assumptions, three problems follow.
+
+1. **Asymmetry.** Two patterns for the same job (turning config into hardware abstractions) make `cmd/kepler/main.go` harder to read and audit.
+2. **Hidden GPU failures.** `gpu.Discover` (`internal/device/gpu/registry.go:48`) collapses three failure modes into one `nil` return:
+   * factory error (driver missing or unsupported)
+   * `Init()` error (driver present but broken)
+   * empty `Devices()` (vendor registered, no hardware on this node)
+
+   Only the third is a legitimate soft outcome. Real backend failures surface only as a `Warn` log line.
+3. **Closed CPU path.** RAPL and hwmon are hard-coded fallbacks. Operators on platforms where neither is the right source (ARM with vendor-specific counters, BMC-derived per-socket power, external probes via sidecar) cannot extend Kepler without forking `createCPUMeter`. Mytton's *Model, estimate or measure?* ([link](https://www.devsustainability.com/p/model-estimate-or-measure-what-matters)) argues that operators making sustainability claims need a source-agnostic power-data path. Kepler's CPU path is opinionated about its sources.
+
+## Goals
+
+* Create an extensible registration surface for CPU and GPU devices in `internal/device/`.
+* Unify the pattern for CPU and GPU device lifecycle: discovery, failure, initialization, shutdown.
+* Surface real backend failures as errors instead of logging them.
+* Open CPU power-meter selection to the same registry pattern GPU already uses.
+* Preserve default behaviour for healthy systems running default config.
+
+## Non-Goals
+
+* This proposal does not build a common interface **per device** across CPU zones and GPU devices. Their measurement shapes differ, and a unifying interface would push hardware-specific concerns through `interface{}` at the cost of type safety. The CPU and GPU attributes should remain separate.
+* This proposal does not replace `monitor.PowerDataProvider`, which will continue to be the consumer for exporters. This proposal unifies the step right before that, which is how to register the hardware backend.
+
+## Proposed Solution
+
+### Promote `device.PowerMeter`
+
+Promote the unexported `powerMeter` interface in [internal/device/power_meter.go](../../../internal/device/power_meter.go) to an exported interface `PowerMeter`, like this:
+
+```go
+// PowerMeter is a registered hardware backend that reads energy or power
+// from a class of hardware (CPU package, GPU device, etc).
+//
+// Many PowerMeters can be registered. Each contributes its own readings.
+// Domain-specific methods live on subinterfaces that embed PowerMeter.
+type PowerMeter interface {
+    service.Service     // Name()
+    service.Initializer // Init()
+}
+```
+
+`device.CPUPowerMeter` and `gpu.GPUPowerMeter` embed `device.PowerMeter` and add their own methods. No change to those domain methods in this EP.
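+
+A minimal sketch of the embedding, for illustration only. `Service` and `Initializer` are local stand-ins for the `service` package interfaces with assumed method signatures, and `Zones()`, `fakeMeter`, and `describe` are hypothetical names rather than part of this proposal:
+
+```go
+package main
+
+import "fmt"
+
+// Stand-ins for service.Service and service.Initializer (assumed shapes).
+type Service interface{ Name() string }
+type Initializer interface{ Init() error }
+
+// PowerMeter is the promoted, exported base interface.
+type PowerMeter interface {
+    Service
+    Initializer
+}
+
+// CPUPowerMeter embeds PowerMeter and adds a domain method.
+// Zones() is a hypothetical placeholder for the real CPU methods.
+type CPUPowerMeter interface {
+    PowerMeter
+    Zones() []string
+}
+
+// fakeMeter is an illustrative backend.
+type fakeMeter struct{}
+
+func (fakeMeter) Name() string    { return "fake" }
+func (fakeMeter) Init() error     { return nil }
+func (fakeMeter) Zones() []string { return []string{"package-0"} }
+
+// describe accepts any PowerMeter, regardless of domain.
+func describe(m PowerMeter) string {
+    if err := m.Init(); err != nil {
+        return m.Name() + ": init failed"
+    }
+    return m.Name() + ": ready"
+}
+
+func main() {
+    var cpu CPUPowerMeter = fakeMeter{}
+    fmt.Println(describe(cpu)) // prints "fake: ready"
+}
+```
+
+Because `CPUPowerMeter` embeds `PowerMeter`, shared lifecycle code can take the base interface while the domain methods stay on the subinterface.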
+
+### Registry and factory
+
+Each domain owns its registry:
+
+* CPU registration lives in `internal/device/`. Ideally it would later be split out into `cpu/` (see [Next Steps](#next-steps)).
+* GPU registration stays in `internal/device/gpu/` where the GPU types and vendor backends live.
+
+```go
+// internal/device/registry.go
+func RegisterCPUMeter(name string, fn func(*slog.Logger, *config.Config) (CPUPowerMeter, error))
+func DiscoverCPU(*slog.Logger, *config.Config) (CPUPowerMeter, error)
+```
+
+Refactor GPU power meter:
+
+```go
+// internal/device/gpu/registry.go
+func Register(vendor Vendor, fn func(*slog.Logger, *config.Config) (GPUPowerMeter, error))
+func Discover(*slog.Logger, *config.Config) ([]GPUPowerMeter, error)
+```
+
+NVIDIA's `init()` keeps calling `gpu.Register(NVIDIA, newNvidiaMeter)`, using the same `gpu` import that provides the types it returns. The GPU package remains the single home for GPU-specific concerns: `Vendor` enum, `GPUDevice`, MIG types, dcgm-exporter integration.
+
+### Config-driven CPU backend order
+
+CPU gains a config key that mirrors what GPU's vendor registry already does implicitly:
+
+```yaml
+cpu:
+  meters: ["rapl", "hwmon"]   # ordered priority
+```
+
+`DiscoverCPU` reads `cfg.Cpu.Meters`, walks the list in order, calls each registered factory and then `Init()`, and returns the first meter that yields zones. The default value preserves today's behaviour. Operators can swap or reorder backends without forking.
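+
+The ordered walk could look roughly like the sketch below. It is not the final implementation: the factory drops the logger and config parameters for brevity, the interface is reduced to assumed method shapes, and `stub`, `registry`, and the error strings are hypothetical.
+
+```go
+package main
+
+import (
+    "errors"
+    "fmt"
+)
+
+// CPUPowerMeter is reduced to the methods this sketch needs (assumed shapes).
+type CPUPowerMeter interface {
+    Name() string
+    Init() error
+    Zones() []string
+}
+
+// Factory omits the logger and config parameters for brevity.
+type Factory func() (CPUPowerMeter, error)
+
+var registry = map[string]Factory{}
+
+func RegisterCPUMeter(name string, fn Factory) { registry[name] = fn }
+
+// DiscoverCPU walks order and returns the first meter that yields zones.
+func DiscoverCPU(order []string) (CPUPowerMeter, error) {
+    var errs []error
+    for _, name := range order {
+        fn, ok := registry[name]
+        if !ok {
+            return nil, fmt.Errorf("unknown cpu meter %q", name)
+        }
+        m, err := fn()
+        if err != nil {
+            errs = append(errs, fmt.Errorf("%s: factory: %w", name, err))
+            continue
+        }
+        if err := m.Init(); err != nil {
+            errs = append(errs, fmt.Errorf("%s: init: %w", name, err))
+            continue
+        }
+        if len(m.Zones()) == 0 {
+            continue // soft skip: backend works but exposes no zones here
+        }
+        return m, nil
+    }
+    // CPU is mandatory, so an exhausted list is always an error.
+    errs = append(errs, errors.New("no cpu meter yielded zones"))
+    return nil, errors.Join(errs...)
+}
+
+// stub is an illustrative backend.
+type stub struct {
+    name  string
+    zones []string
+}
+
+func (s stub) Name() string    { return s.name }
+func (s stub) Init() error     { return nil }
+func (s stub) Zones() []string { return s.zones }
+
+func main() {
+    RegisterCPUMeter("rapl", func() (CPUPowerMeter, error) {
+        return nil, errors.New("powercap not available")
+    })
+    RegisterCPUMeter("hwmon", func() (CPUPowerMeter, error) {
+        return stub{"hwmon", []string{"package-0"}}, nil
+    })
+    m, err := DiscoverCPU([]string{"rapl", "hwmon"})
+    fmt.Println(m.Name(), err) // prints "hwmon <nil>"
+}
+```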
+
+### Unified startup approach
+
+Both `device.DiscoverCPU` and `gpu.Discover` return `(meters, error)` with similar semantics but different treatment by design:
+
+| State                        | CPU result                         | GPU result                                      |
+|------------------------------|------------------------------------|-------------------------------------------------|
+| Hardware works               | `meter, nil`                       | `meters, nil`                                   |
+| Hardware absent on this node | n/a (CPU is mandatory)             | `nil, nil`                                      |
+| Configured backend broken    | `nil, err` (after order exhausted) | `meters?, err` (per-vendor failures aggregated) |
+| Feature off                  | n/a                                | `nil, nil`                                      |
+
+Per-backend outcome during discovery:
+
+| Scenario            | Result                   |
+|---------------------|--------------------------|
+| Factory error       | Real failure. Aggregate. |
+| `Init()` error      | Real failure. Aggregate. |
+| Empty zones/devices | Soft skip. Not an error. |
+| Success             | Append to result.        |
+
+Error aggregation uses `errors.Join`.
+
+`device.DiscoverCPU` returns the joined error only if no backend produced a meter, respecting today's strict CPU contract. On the other hand, `gpu.Discover` returns the joined error only when the GPU feature is explicitly enabled and every registered vendor returned a real failure. "No GPU on this node" stays `(nil, nil)`.
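+
+A rough sketch of the GPU side of this rule, under stated assumptions: the `enabled` flag stands in for `gpu.enabled`, `vendorEntry` and `stubGPU` are hypothetical helpers, and the factory signature is simplified. It returns the joined error only when every registered vendor failed for real.
+
+```go
+package main
+
+import (
+    "errors"
+    "fmt"
+)
+
+// GPUPowerMeter is reduced to the methods this sketch needs (assumed shapes).
+type GPUPowerMeter interface {
+    Name() string
+    Init() error
+    Devices() []string
+}
+
+// vendorEntry pairs a vendor name with its factory (hypothetical helper).
+type vendorEntry struct {
+    name    string
+    factory func() (GPUPowerMeter, error)
+}
+
+// Discover tries every vendor, aggregates real failures, and skips
+// vendors that initialise cleanly but report no devices.
+func Discover(enabled bool, vendors []vendorEntry) ([]GPUPowerMeter, error) {
+    if !enabled {
+        return nil, nil // feature off
+    }
+    var meters []GPUPowerMeter
+    var errs []error
+    for _, v := range vendors {
+        m, err := v.factory()
+        if err != nil {
+            errs = append(errs, fmt.Errorf("%s: factory: %w", v.name, err))
+            continue
+        }
+        if err := m.Init(); err != nil {
+            errs = append(errs, fmt.Errorf("%s: init: %w", v.name, err))
+            continue
+        }
+        if len(m.Devices()) == 0 {
+            continue // soft skip: no hardware on this node
+        }
+        meters = append(meters, m)
+    }
+    // Strict only when every registered vendor returned a real failure.
+    if len(vendors) > 0 && len(errs) == len(vendors) {
+        return nil, errors.Join(errs...)
+    }
+    return meters, nil
+}
+
+// stubGPU is an illustrative backend.
+type stubGPU struct{ devices []string }
+
+func (stubGPU) Name() string        { return "stub" }
+func (stubGPU) Init() error         { return nil }
+func (s stubGPU) Devices() []string { return s.devices }
+
+func main() {
+    broken := vendorEntry{"nvidia", func() (GPUPowerMeter, error) {
+        return nil, errors.New("driver not loaded")
+    }}
+    absent := vendorEntry{"amd", func() (GPUPowerMeter, error) {
+        return stubGPU{}, nil
+    }}
+    _, err := Discover(true, []vendorEntry{broken})
+    fmt.Println(err) // prints "nvidia: factory: driver not loaded"
+    meters, err := Discover(true, []vendorEntry{broken, absent})
+    fmt.Println(len(meters), err) // prints "0 <nil>"
+}
+```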
+
+After the refactor, device discovery in `cmd/kepler/main.go` reduces to two symmetric calls:
+
+```go
+cpuMeter, err := device.DiscoverCPU(logger, cfg)
+if err != nil { ... }
+
+gpuMeters, err := gpu.Discover(logger, cfg)
+if err != nil { ... }
+```
+
+## Detailed Design
+
+### Package layout
+
+```text
+internal/device/
+├── power_meter.go              # PowerMeter interface (promoted from private)
+├── registry.go                 # RegisterCPUMeter, DiscoverCPU
+├── cpu_power_meter.go          # CPUPowerMeter embeds PowerMeter
+├── rapl_sysfs_power_meter.go   # init() registers "rapl"
+├── hwmon_power_meter.go        # init() registers "hwmon"
+├── fake_cpu_power_meter.go     # init() registers "fake"
+└── gpu/
+    ├── interface.go            # GPUPowerMeter embeds device.PowerMeter, plus Vendor and GPUDevice types
+    ├── registry.go             # gpu.Register, gpu.Discover (refactored)
+    └── nvidia/                 # init() calls gpu.Register(NVIDIA, ...)
+```
+
+### Registration timing
+
+Built-in CPU backends register from `init()` in their files. They live in `internal/device`, so importing the package activates them; no blank imports needed for built-ins.
+
+Backends in subpackages (today: `gpu/nvidia`; future examples: `cpu/msr` per EP-002, `cpu/bmc`) need blank imports in `cmd/kepler/main.go`, matching the NVIDIA pattern.
+
+### Logging
+
+Each `Discover*` call produces one summary log line:
+
+```text
+INFO cpu meter discovery   ok=[rapl] failed=[]
+INFO gpu meter discovery   ok=[nvidia] failed=[amd: factory: rocm-smi not found]
+```
+
+Per-backend errors stay at `Warn` for operators who want detail.
+
+## Configuration
+
+```yaml
+cpu:
+  meters: ["rapl", "hwmon"]   # ordered priority
+```
+
+Per-backend tuning keys (`rapl.zones`, `experimental.hwmon.zones`, `experimental.hwmon.chipRules`, `dev.fakeCpuMeter.Zones`) are unchanged. Legacy selectors (`experimentalHwmonFeature`, `dev.fakeCpuMeter.Enabled`) translate at startup to `cpu.meters` and emit a deprecation warning. See [Backward compatibility](#backward-compatibility) for the full migration.
+
+GPU config does not change in this EP.
+
+## Testing Strategy
+
+* `device.DiscoverCPU` table-driven tests: backend not registered, factory error, `Init()` error, empty zones, success, ordered priority.
+* `gpu.Discover` table-driven tests: all-fail, mixed, all-empty, all-succeed; per-vendor failure modes (factory, `Init()`, empty `Devices()`).
+* `cmd/kepler` test asserting registry-driven main produces the same service set as the prior code for default config.
+
+## Backward compatibility
+
+Default `cpu.meters: ["rapl", "hwmon"]` reproduces default behaviour. No breaking change for healthy systems running default config.
+
+Two legacy selectors require migration. They translate at startup to an effective `cpu.meters` value, log a deprecation warning, and stop working in a future release:
+
+* `experimentalHwmonFeature: true` (today: force hwmon, skip RAPL) → effective `cpu.meters: ["hwmon"]`.
+* `dev.fakeCpuMeter.Enabled: true` (today: use fake meter, skip RAPL/hwmon) → effective `cpu.meters: ["fake"]`.
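+
+One possible shape for the startup translation is sketched below. The struct and its field names are hypothetical, and two behaviours are assumptions this proposal does not specify: legacy selectors take precedence over an explicit `cpu.meters`, and the fake meter wins if both legacy selectors are set.
+
+```go
+package main
+
+import "fmt"
+
+// legacyConfig holds only the fields this sketch needs (hypothetical names).
+type legacyConfig struct {
+    Meters                   []string // cpu.meters
+    ExperimentalHwmonFeature bool     // experimentalHwmonFeature
+    FakeCPUMeterEnabled      bool     // dev.fakeCpuMeter.Enabled
+}
+
+// effectiveMeters returns the meter order to use plus a deprecation
+// warning, which is empty when no legacy selector was set.
+func effectiveMeters(c legacyConfig) ([]string, string) {
+    switch {
+    case c.FakeCPUMeterEnabled: // assumed precedence: fake first
+        return []string{"fake"}, "dev.fakeCpuMeter.Enabled is deprecated; set cpu.meters to [fake]"
+    case c.ExperimentalHwmonFeature:
+        return []string{"hwmon"}, "experimentalHwmonFeature is deprecated; set cpu.meters to [hwmon]"
+    case len(c.Meters) > 0:
+        return c.Meters, ""
+    default:
+        return []string{"rapl", "hwmon"}, "" // default order
+    }
+}
+
+func main() {
+    meters, warning := effectiveMeters(legacyConfig{ExperimentalHwmonFeature: true})
+    fmt.Println(meters)  // prints "[hwmon]"
+    fmt.Println(warning) // prints the hwmon deprecation message
+}
+```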
+
+Per-backend tuning keys are unchanged and remain valid: `rapl.zones`, `experimental.hwmon.zones`, `experimental.hwmon.chipRules`, `dev.fakeCpuMeter.Zones`.
+
+Operators on broken systems will see `kepler` exit with a clearer error when every CPU backend fails (the prior path also exits, but with the hwmon error only, which can be misleading).
+
+GPU error semantics tighten. A node with `gpu.enabled=true` and a broken NVML driver will fail startup instead of silently running CPU-only. This matches the CPU contract. CPU-only is the default (`gpu.enabled=false`); the tightening only affects operators who explicitly opt in to GPU.
+
+## Migration path
+
+1. **Phase 1**: introduce `device.PowerMeter` and `device.RegisterCPUMeter` / `device.DiscoverCPU`. Keep `createCPUMeter` calling into them.
+2. **Phase 2**: refactor `gpu.Discover` to split the three failure modes (factory, `Init()`, empty devices) and aggregate real failures. Update its signature to `(meters, error)`.
+3. **Phase 3**: move RAPL, hwmon, fake under the registry. Add `cpu.meters` config key with default order.
+4. **Phase 4**: simplify `cmd/kepler/main.go` to two `Discover` calls. Move GPU optional-config injection (`IdlePowerConfigurable`, `DCGMEndpointConfigurable`) into the NVIDIA factory.
+
+## Risks and Mitigations
+
+### Operational risk: stricter GPU error path
+
+* **Risk**: Nodes with GPU explicitly enabled (`gpu.enabled=true`) and broken drivers fail startup where the prior code silently continued without GPU metrics.
+* **Mitigation**: Matches the CPU contract. Default-off (`gpu.enabled=false`) means only opted-in operators are affected; they can revert to CPU-only by removing the explicit opt-in.
+
+### Maintenance risk: registry boilerplate
+
+* **Risk**: Two parallel registries duplicate small amounts of code.
+* **Mitigation**: Each registry is roughly 50 lines, so the duplicated code stays small and cheap to maintain.
+
+### Configuration risk: invalid `cpu.meters` value
+
+* **Risk**: Operator typo (`"rappl"`) silently disables that backend.
+* **Mitigation**: `DiscoverCPU` returns an error listing unknown backend names alongside registered ones. Validation runs at startup, not first-use.
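+
+The validation step could be sketched as follows. `validateMeters` and the error wording are hypothetical; the only behaviour taken from this proposal is that unknown names are reported alongside the registered ones.
+
+```go
+package main
+
+import (
+    "fmt"
+    "sort"
+    "strings"
+)
+
+// validateMeters reports every requested name that has no registered
+// backend, and lists the registered names to help spot typos.
+func validateMeters(requested []string, registered map[string]bool) error {
+    var unknown []string
+    for _, name := range requested {
+        if !registered[name] {
+            unknown = append(unknown, name)
+        }
+    }
+    if len(unknown) == 0 {
+        return nil
+    }
+    known := make([]string, 0, len(registered))
+    for name := range registered {
+        known = append(known, name)
+    }
+    sort.Strings(known) // stable output regardless of map order
+    return fmt.Errorf("unknown cpu.meters %v; registered: %s",
+        unknown, strings.Join(known, ", "))
+}
+
+func main() {
+    registered := map[string]bool{"rapl": true, "hwmon": true, "fake": true}
+    fmt.Println(validateMeters([]string{"rappl", "hwmon"}, registered))
+    // prints "unknown cpu.meters [rappl]; registered: fake, hwmon, rapl"
+}
+```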
+
+## Alternatives Considered
+
+### Alternative 1: Unify error semantics at init time with no registry
+
+Add `error` to `createGPUMeters` and split `gpu.Discover`'s three failure modes. Leave `createCPUMeter`'s fallback chain.
+
+While this is easy to implement, it doesn't address the CPU extensibility problem. Operators still cannot add backends other than RAPL or hwmon without forking.
+
+### Alternative 2: Unify CPU and GPU under one per-device interface
+
+This would define a common `Device` interface that both CPU zones and GPU devices implement.
+
+This is not viable since CPU zones (logical, energy in µJ per zone) and GPU devices (physical, watts plus per-process attribution) have different shapes. A unifying interface would push hardware-specific concerns through `interface{}` or lose information.
+
+### Alternative 3: Per-backend Prometheus error counter (cloudcost-exporter pattern)
+
+This would expose `kepler_meter_discovery_failures_total{backend, stage}` instead of returning errors.
+
+While this solves the visibility problem for GPU device registration, it does not provide consistency with the CPU path. That said, operational metrics would be good to have and could be added alongside this proposal.
+
+## Next Steps
+
+The following items are out of scope for this EP.
+
+### Operational metrics per backend
+
+A small set of operational metrics per registered backend: discovery success/failure counters, duration, latency. Useful for operators running heterogeneous fleets.
+
+### Full `device/` package split by feature
+
+Once CPU gains a second non-trivial backend (e.g., MSR, Redfish-per-CPU, BMC), promote the layout to one subpackage per backend:
+
+```text
+internal/device/
+├── power_meter.go     # PowerMeter, Energy, Power (shared)
+├── cpu/
+│   ├── meter.go       # CPUPowerMeter, EnergyZone
+│   ├── registry.go    # cpu.Register, cpu.Discover
+│   ├── rapl/
+│   ├── hwmon/
+│   └── fake/
+└── gpu/
+    ├── meter.go
+    ├── registry.go
+    └── nvidia/
+```
+
+Defer this until at least one new backend is on the roadmap, so that a concrete addition justifies the move.