Skip to content

Commit 51fea8e

Browse files
docs(proposal): add draft for extensible power meter registry
Propose a `device.PowerMeter` interface and per-domain registry to unify CPU and GPU device discovery in `cmd/kepler/main.go`. Opens the CPU path to non-RAPL backends (MSR, BMC-derived, external probes) without forking, matching the registry pattern GPU already uses. Signed-off-by: nikimanoledaki <niki.manoledaki@grafana.com>
1 parent bab38cb commit 51fea8e

1 file changed

Lines changed: 264 additions & 0 deletions

File tree

Lines changed: 264 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,264 @@
1+
# Draft: Extensible Power Meter Registry
2+
3+
* **Status**: Draft
4+
* **Author**: Niki Manoledaki
5+
* **Created**: 2026-05-03
6+
7+
## Problem
8+
9+
For context, `cmd/kepler/main.go` constructs CPU and GPU power meters in two different shapes:
10+
11+
* `createCPUMeter` hardcodes RAPL with hwmon as a fallback. It returns `(device.CPUPowerMeter, error)`. This is strict by design: Kepler must produce a CPU meter or fail startup.
12+
* `createGPUMeters` fans out across registered vendors via `gpu.DiscoverAll`. It returns `[]gpu.GPUPowerMeter` with no error, which is a soft failure mode, also by design, as it assumes that GPU monitoring is not required for Kepler.
13+
14+
Once we agree on these assumptions, here are the problems that follow.
15+
16+
1. **Asymmetry.** Two patterns for the same job (turn config into hardware abstractions) makes `cmd/kepler/main.go` harder to read and audit.
17+
2. **Hidden GPU failures.** `gpu.Discover` (`internal/device/gpu/registry.go:48`) collapses three failure modes into one `nil` return:
18+
* factory error (driver missing or unsupported)
19+
* `Init()` error (driver present but broken)
20+
* empty `Devices()` (vendor registered, no hardware on this node)
21+
22+
Only the third is a legitimate soft outcome. Real backend failures surface only as a `Warn` log line.
23+
3. **Closed CPU path.** RAPL and hwmon are hard-coded fallbacks. Operators on platforms where neither is the right source (ARM with vendor-specific counters, BMC-derived per-socket power, external probes via sidecar) cannot extend Kepler without forking `createCPUMeter`. Mytton's *Model, estimate or measure?* ([link](https://www.devsustainability.com/p/model-estimate-or-measure-what-matters)) argues that operators making sustainability claims need a source-agnostic power-data path. Kepler's CPU path is opinionated about its sources.
24+
25+
## Goals
26+
27+
* Create an extensible registration surface for CPU and GPU devices i.e. `internal/device/`.
28+
* Unify the pattern for CPU and GPU device lifecycle: discovery, failure, initialization, shutdown.
29+
* Surface real backend failures as errors instead of logging them.
30+
* Open CPU power-meter selection to the same registry pattern GPU already uses.
31+
* Preserve default behaviour for healthy systems running default config.
32+
33+
## Non-Goals
34+
35+
* This proposal does not build a common interface **per device** across CPU zones and GPU devices. Their measurement shapes differ and a unifying interface would abstract hardware-specific concerns through `interface{}`, which is not ideal. The CPU and GPU attributes should remain separate.
36+
* This proposal does not replace `monitor.PowerDataProvider`, which will continue to be the consumer for exporters. This proposal unifies the step right before that, which is how to register the hardware backend.
37+
38+
## Proposed Solution
39+
40+
### Promote `device.PowerMeter`
41+
42+
Promote the unexported `powerMeter` interface in [internal/device/power_meter.go](../../../internal/device/power_meter.go) to an external interface `PowerMeter`, like this:
43+
44+
```go
45+
// PowerMeter is a registered hardware backend that reads energy or power
46+
// from a class of hardware (CPU package, GPU device, etc).
47+
//
48+
// Many PowerMeters can be registered. Each contributes its own readings.
49+
// Domain-specific methods live on subinterfaces that embed PowerMeter.
50+
type PowerMeter interface {
51+
service.Service // Name()
52+
service.Initializer // Init()
53+
}
54+
```
55+
56+
`device.CPUPowerMeter` and `gpu.GPUPowerMeter` embed `device.PowerMeter` and add their own methods. No change to those domain methods in this EP.
57+
58+
### Registry and factory
59+
60+
Each domain owns its registry:
61+
62+
* CPU registration lives at `internal/device/` -- ideally this would be split out into `cpu/` (see [Next Steps](#next-steps))
63+
* GPU registration stays in `internal/device/gpu/` where the GPU types and vendor backends live.
64+
65+
```go
66+
// internal/device/registry.go
67+
func RegisterCPUMeter(name string, fn func(*slog.Logger, *config.Config) (CPUPowerMeter, error))
68+
func DiscoverCPU(*slog.Logger, *config.Config) (CPUPowerMeter, error)
69+
```
70+
71+
Refactor GPU power meter:
72+
73+
```go
74+
// internal/device/gpu/registry.go
75+
func Register(vendor Vendor, fn func(*slog.Logger, *config.Config) (GPUPowerMeter, error))
76+
func Discover(*slog.Logger, *config.Config) ([]GPUPowerMeter, error)
77+
```
78+
79+
NVIDIA's `init()` keeps calling `gpu.Register(NVIDIA, newNvidiaMeter)` — same import as the types it returns. The GPU package remains the single home for GPU-specific concerns: `Vendor` enum, `GPUDevice`, MIG types, dcgm-exporter integration.
80+
81+
### Config-driven CPU backend order
82+
83+
CPU gains a config key that mirrors what GPU's vendor registry already does implicitly:
84+
85+
```yaml
86+
cpu:
87+
meters: ["rapl", "hwmon"] # ordered priority
88+
```
89+
90+
`DiscoverCPU` reads `cfg.Cpu.Meters`, walks the list, calls each registered factory, calls `Init()`, and returns the first meter that yields zones. Default value preserves default behaviour. Operators can swap or re-order without forking.
91+
92+
### Unified startup approach
93+
94+
Both `device.DiscoverCPU` and `gpu.Discover` return `(meters, error)` with similar semantics but different treatment by design:
95+
96+
| State | CPU result | GPU result |
97+
|------------------------------|------------------------------------|-------------------------------------------------|
98+
| Hardware works | `meter, nil` | `meters, nil` |
99+
| Hardware absent on this node | n/a (CPU is mandatory) | `nil, nil` |
100+
| Configured backend broken | `nil, err` (after order exhausted) | `meters?, err` (per-vendor failures aggregated) |
101+
| Feature off | n/a | `nil, nil` |
102+
103+
Result per scenario at startup:
104+
105+
| Scenario | Result |
106+
|---------------------|--------------------------|
107+
| Factory error | Real failure. Aggregate. |
108+
| `Init()` error | Real failure. Aggregate. |
109+
| Empty zones/devices | Soft skip. Not an error. |
110+
| Success | Append to result. |
111+
112+
Error aggregation uses `errors.Join`.
113+
114+
`device.DiscoverCPU` returns the joined error only if no backend produced a meter, respecting today's strict CPU contract. On the other hand, `gpu.Discover` returns the joined error only when the GPU feature is explicitly enabled and every registered vendor returned a real failure. "No GPU on this node" stays `(nil, nil)`.
115+
116+
Finally, the end result is a unified `cmd/kepler/main.go` after the refactor:
117+
118+
```go
119+
cpuMeter, err := device.DiscoverCPU(logger, cfg)
120+
if err != nil { ... }
121+
122+
gpuMeters, err := gpu.Discover(logger, cfg)
123+
if err != nil { ... }
124+
```
125+
126+
## Detailed Design
127+
128+
### Package layout
129+
130+
```text
131+
internal/device/
132+
├── power_meter.go # PowerMeter interface (promoted from private)
133+
├── registry.go # RegisterCPUMeter, DiscoverCPU
134+
├── cpu_power_meter.go # CPUPowerMeter embeds PowerMeter
135+
├── rapl_sysfs_power_meter.go # init() registers "rapl"
136+
├── hwmon_power_meter.go # init() registers "hwmon"
137+
├── fake_cpu_power_meter.go # init() registers "fake"
138+
└── gpu/
139+
├── interface.go # GPUPowerMeter embeds device.PowerMeter, plus Vendor and GPUDevice types
140+
├── registry.go # gpu.Register, gpu.Discover (refactored)
141+
└── nvidia/ # init() calls gpu.Register(NVIDIA, ...)
142+
```
143+
144+
### Registration timing
145+
146+
Built-in CPU backends register from `init()` in their files. They live in `internal/device`, so importing the package activates them; no blank imports needed for built-ins.
147+
148+
Backends in subpackages (today: `gpu/nvidia`; future examples: `cpu/msr` per EP-002, `cpu/bmc`) need blank imports in `cmd/kepler/main.go`, matching the NVIDIA pattern.
149+
150+
### Logging
151+
152+
Each `Discover*` call produces one summary log line:
153+
154+
```text
155+
INFO cpu meter discovery ok=[rapl] failed=[]
156+
INFO gpu meter discovery ok=[nvidia] failed=[amd: factory: rocm-smi not found]
157+
```
158+
159+
Per-backend errors stay at `Warn` for ops who want detail.
160+
161+
## Configuration
162+
163+
```yaml
164+
cpu:
165+
meters: ["rapl", "hwmon"] # ordered priority
166+
```
167+
168+
Per-backend tuning keys (`rapl.zones`, `experimental.hwmon.zones`, `experimental.hwmon.chipRules`, `dev.fakeCpuMeter.Zones`) are unchanged. Legacy selectors (`experimentalHwmonFeature`, `dev.fakeCpuMeter.Enabled`) translate at startup to `cpu.meters` and emit a deprecation warning. See [Backward compatibility](#backward-compatibility) for the full migration.
169+
170+
GPU config does not change in this EP.
171+
172+
## Testing Strategy
173+
174+
* `device.DiscoverCPU` table-driven tests: backend not registered, factory error, `Init()` error, empty zones, success, ordered priority.
175+
* `gpu.Discover` table-driven tests: all-fail, mixed, all-empty, all-succeed; per-vendor failure modes (factory, `Init()`, empty `Devices()`).
176+
* `cmd/kepler` test asserting registry-driven main produces the same service set as the prior code for default config.
177+
178+
## Backward compatibility
179+
180+
Default `cpu.meters: ["rapl", "hwmon"]` reproduces default behaviour. No breaking change for healthy systems running default config.
181+
182+
Two legacy selectors require migration. They translate at startup to an effective `cpu.meters` value, log a deprecation warning, and stop working in a future release:
183+
184+
* `experimentalHwmonFeature: true` (today: force hwmon, skip RAPL) → effective `cpu.meters: ["hwmon"]`.
185+
* `dev.fakeCpuMeter.Enabled: true` (today: use fake meter, skip RAPL/hwmon) → effective `cpu.meters: ["fake"]`.
186+
187+
Per-backend tuning keys are unchanged and remain valid: `rapl.zones`, `experimental.hwmon.zones`, `experimental.hwmon.chipRules`, `dev.fakeCpuMeter.Zones`.
188+
189+
Operators on broken systems will see `kepler` exit with a clearer error when every CPU backend fails (the prior path also exits, but with the hwmon error only, which can be misleading).
190+
191+
GPU error semantics tighten. A node with `gpu.enabled=true` and a broken NVML driver will fail startup instead of silently running CPU-only. This matches the CPU contract. CPU-only is the default (`gpu.enabled=false`); the tightening only affects operators who explicitly opt in to GPU.
192+
193+
## Migration path
194+
195+
1. **Phase 1**: introduce `device.PowerMeter` and `device.RegisterCPUMeter` / `device.DiscoverCPU`. Keep `createCPUMeter` calling into them.
196+
2. **Phase 2**: refactor `gpu.Discover` to split the three failure modes (factory, `Init()`, empty devices) and aggregate real failures. Update its signature to `(meters, error)`.
197+
3. **Phase 3**: move RAPL, hwmon, fake under the registry. Add `cpu.meters` config key with default order.
198+
4. **Phase 4**: simplify `cmd/kepler/main.go` to two `Discover` calls. Move GPU optional-config injection (`IdlePowerConfigurable`, `DCGMEndpointConfigurable`) into the NVIDIA factory.
199+
200+
## Risks and Mitigations
201+
202+
### Operational risk: stricter GPU error path
203+
204+
* **Risk**: Nodes with GPU explicitly enabled (`gpu.enabled=true`) and broken drivers fail startup where the prior code silently continued without GPU metrics.
205+
* **Mitigation**: Matches the CPU contract. Default-off (`gpu.enabled=false`) means only opted-in operators are affected; they can revert to CPU-only by removing the explicit opt-in.
206+
207+
### Maintenance risk: registry boilerplate
208+
209+
* **Risk**: Two parallel registries duplicate small amounts of code.
210+
* **Mitigation**: Each registry is ~50 lines.
211+
212+
### Configuration risk: invalid `cpu.meters` value
213+
214+
* **Risk**: Operator typo (`"rappl"`) silently disables that backend.
215+
* **Mitigation**: `DiscoverCPU` returns an error listing unknown backend names alongside registered ones. Validation runs at startup, not first-use.
216+
217+
## Alternatives Considered
218+
219+
### Alternative 1: Unify error semantics at init time with no registry
220+
221+
Add `error` to `createGPUMeters` and split `gpu.Discover`'s three failure modes. Leave `createCPUMeter`'s fallback chain.
222+
223+
While this is easy to implement, it doesn't address the CPU extensibility problem. Operators still cannot extend non-RAPL/non-hwmon backends easily.
224+
225+
### Alternative 2: Unify CPU and GPU under one per-device interface
226+
227+
This would define a common `Device` interface that both CPU zones and GPU devices implement.
228+
229+
This is not viable since CPU zones (logical, energy in µJ per zone) and GPU devices (physical, watts plus per-process attribution) have different shapes. A unifying interface would push hardware-specific concerns through `interface{}` or lose information.
230+
231+
### Alternative 3: Per-backend Prometheus error counter (cloudcost-exporter pattern)
232+
233+
This would expose `kepler_meter_discovery_failures_total{backend, stage}` instead of returning errors.
234+
235+
While this solves the visibility problem for GPU device registration, it does not provide consistency with the CPU path. That being said, it could be added to this proposal - it would be good to have operational metrics.
236+
237+
## Next Steps
238+
239+
Out of scope for this EP.
240+
241+
### Operational metrics per backend
242+
243+
A small set of operational metrics per registered backend: discovery success/failure counters, duration, latency. Useful for operators running heterogeneous fleets.
244+
245+
### Full `device/` package split by feature
246+
247+
Once CPU gains a second non-trivial backend (e.g., MSR, Redfish-per-CPU, BMC), promote the layout to one subpackage per backend:
248+
249+
```text
250+
internal/device/
251+
├── power_meter.go # PowerMeter, Energy, Power (shared)
252+
├── cpu/
253+
│ ├── meter.go # CPUPowerMeter, EnergyZone
254+
│ ├── registry.go # cpu.Register, cpu.Discover
255+
│ ├── rapl/
256+
│ ├── hwmon/
257+
│ └── fake/
258+
└── gpu/
259+
├── meter.go
260+
├── registry.go
261+
└── nvidia/
262+
```
263+
264+
Defer until at least one new backend is on the roadmap. Justifies the move with a concrete addition.

0 commit comments

Comments
 (0)