Skip to content

Commit e977202

Browse files
docs(proposal): add draft for extensible power meter registry
Propose a `device.PowerMeter` interface and per-domain registry to unify CPU and GPU device discovery in `cmd/kepler/main.go`. Opens the CPU path to non-RAPL backends (MSR, BMC-derived, external probes) without forking, matching the registry pattern GPU already uses. Signed-off-by: nikimanoledaki <niki.manoledaki@grafana.com>
1 parent bab38cb commit e977202

1 file changed

Lines changed: 263 additions & 0 deletions

File tree

Lines changed: 263 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,263 @@
1+
# Draft: Extensible CPU Power Meter Selection
2+
3+
* **Status**: Draft
4+
* **Author**: Niki Manoledaki
5+
* **Created**: 2026-05-03
6+
7+
## Problem
8+
9+
For context, `cmd/kepler/main.go` constructs CPU and GPU power meters in two different shapes:
10+
11+
* `createCPUMeter` hardcodes RAPL with hwmon as a fallback. It returns `(device.CPUPowerMeter, error)`. This is strict by design: Kepler must produce a CPU meter or fail startup.
12+
* `createGPUMeters` fans out across registered vendors via `gpu.DiscoverAll`. It returns `[]gpu.GPUPowerMeter` with no error, which is a soft failure mode, also by design, as it assumes that GPU monitoring is not required for Kepler.
13+
14+
Once we agree on these assumptions, here are the problems that follow.
15+
16+
1. **Asymmetry.** Two patterns for the same job (turn config into hardware abstractions) makes `cmd/kepler/main.go` harder to read and audit.
17+
2. **Hidden GPU failures.** `gpu.Discover` (`internal/device/gpu/registry.go:48`) collapses three failure modes into one `nil` return:
18+
* factory error (driver missing or unsupported)
19+
* `Init()` error (driver present but broken)
20+
* empty `Devices()` (vendor registered, no hardware on this node)
21+
22+
Only the third is a legitimate soft outcome. Real backend failures surface only as a `Warn` log line.
23+
3. **Closed CPU path.** RAPL and hwmon are hard-coded fallbacks with hwmon force-enable as the only fallback. Operators on infrastructure where neither is the right source cannot extend this model to select a different backend easily.
24+
25+
## Goals
26+
27+
* Extend CPU power meter backend selection to a config-driven list.
28+
* Unify more of the CPU and GPU lifecycle shape: discovery, failure, initialization.
29+
* Surface real GPU backend failures as errors instead of logging them.
30+
* Preserve default behaviour for healthy systems running default config.
31+
32+
## Non-Goals
33+
34+
* This proposal does not build a common interface **per device** across CPU zones and GPU devices. Their measurement shapes differ and a unifying interface would abstract hardware-specific concerns through `interface{}`, which is not ideal. The CPU and GPU attributes should remain separate.
35+
* This proposal does not replace `monitor.PowerDataProvider`, which will continue to be the consumer for exporters. This proposal unifies the step right before that, which is how to initialize the hardware backend.
36+
37+
## Proposed Solution
38+
39+
### Promote `device.PowerMeter`
40+
41+
Promote the unexported `powerMeter` interface in [internal/device/power_meter.go](../../../internal/device/power_meter.go) to an external interface `PowerMeter`, like this:
42+
43+
```go
44+
// PowerMeter reads power from a class of hardware (CPU, GPU, etc).
45+
//
46+
// Many PowerMeters can be selected. Each contributes its own readings.
47+
// Domain-specific methods live on subinterfaces that embed PowerMeter.
48+
type PowerMeter interface {
49+
service.Service // Name()
50+
service.Initializer // Init()
51+
}
52+
```
53+
54+
`Shutdown` is optional via `service.Shutdowner`. Some backends like NVML implement it.
55+
56+
`device.CPUPowerMeter` and `gpu.GPUPowerMeter` embed `device.PowerMeter` and add their own methods. No change to those domain methods in this EP.
57+
58+
### Switch-based CPU dispatch
59+
60+
CPU gains a config field for ordering plus a switch that walks the priority list:
61+
62+
```yaml
63+
cpu:
64+
meters: ["rapl", "hwmon"] # ordered priority
65+
```
66+
67+
```go
68+
// internal/device/cpu_power_meter.go
69+
70+
func CreateCPUMeter(logger *slog.Logger, cfg *config.Config) (CPUPowerMeter, error) {
71+
var errs []error
72+
for _, name := range cfg.Cpu.Meters {
73+
meter, err := buildCPUMeter(name, logger, cfg)
74+
if err != nil { /* aggregate, continue */ }
75+
if err := meter.Init(); err != nil { /* aggregate, continue */ }
76+
zones, _ := meter.Zones()
77+
if len(zones) == 0 { /* soft skip, continue */ }
78+
return meter, nil
79+
}
80+
return nil, errors.Join(errs...)
81+
}
82+
83+
func buildCPUMeter(name string, logger *slog.Logger, cfg *config.Config) (CPUPowerMeter, error) {
84+
switch name {
85+
case "rapl": return NewCPUPowerMeter(...)
86+
case "hwmon": return NewHwmonPowerMeter(...)
87+
case "fake": return NewFakeCPUMeter(...)
88+
default: return nil, fmt.Errorf("unknown cpu meter %q", name)
89+
}
90+
}
91+
```
92+
93+
`CreateCPUMeter` owns the lifecycle (build, init, zones, error aggregation). `buildCPUMeter` is the dispatch. Adding a backend means a new case + a new constructor, which is straightforward.
94+
95+
Default value `["rapl", "hwmon"]` preserves the prior fallback behaviour. Operators make a selection via the config.
96+
97+
### GPU failure-mode split
98+
99+
As part of this unification, `gpu.Discover` is refactored to split the three failure modes and return `(meters, error)`:
100+
101+
| State | GPU result |
102+
|------------------------------|-------------------------------------------------|
103+
| Hardware works | `meters, nil` |
104+
| Hardware absent on this node | `nil, nil` |
105+
| Configured backend broken | `meters?, err` (per-vendor failures aggregated) |
106+
| Feature off | `nil, nil` |
107+
108+
Result per scenario at startup:
109+
110+
| Scenario | Result |
111+
|---------------------|--------------------------|
112+
| Factory error | Real failure. Aggregate. |
113+
| `Init()` error | Real failure. Aggregate. |
114+
| Empty `Devices()` | Soft skip. Not an error. |
115+
| Success | Append to result. |
116+
117+
Error aggregation uses `errors.Join`. `gpu.Discover` returns the joined error only when the GPU feature is explicitly enabled and every registered vendor returned a real failure. "No GPU on this node" stays `(nil, nil)`.
118+
119+
### Unified `cmd/kepler/main.go`
120+
121+
End result after the refactor:
122+
123+
```go
124+
cpuMeter, err := device.CreateCPUMeter(logger, cfg)
125+
if err != nil { ... }
126+
127+
gpuMeters, err := gpu.Discover(logger, cfg)
128+
if err != nil { ... }
129+
```
130+
131+
Symmetric shape (`(meter(s), error)`), different failure semantics (CPU strict, GPU soft).
132+
133+
## Detailed Design
134+
135+
### Package layout
136+
137+
```text
138+
internal/device/
139+
├── power_meter.go # PowerMeter interface (promoted from private)
140+
├── cpu_power_meter.go # CPUPowerMeter, CreateCPUMeter, buildCPUMeter
141+
├── rapl_sysfs_power_meter.go # NewCPUPowerMeter (RAPL)
142+
├── hwmon_power_meter.go # NewHwmonPowerMeter
143+
├── fake_cpu_power_meter.go # NewFakeCPUMeter
144+
└── gpu/
145+
├── interface.go # GPUPowerMeter embeds device.PowerMeter, plus Vendor and GPUDevice types
146+
├── registry.go # gpu.Register, gpu.Discover (refactored failure modes)
147+
└── nvidia/ # init() calls gpu.Register(NVIDIA, ...)
148+
```
149+
150+
No new file for CPU dispatch — `CreateCPUMeter` and `buildCPUMeter` live in `cpu_power_meter.go` next to the `CPUPowerMeter` interface they construct.
151+
152+
### Logging
153+
154+
Per-backend `Warn` on failure, `Info` on selection:
155+
156+
```text
157+
INFO using rapl power meter
158+
```
159+
160+
```text
161+
WARN rapl not available, trying next backend error="..."
162+
INFO using hwmon power meter
163+
```
164+
165+
GPU keeps its existing per-vendor `Warn` plus a one-line summary:
166+
167+
```text
168+
INFO gpu meter discovery ok=[nvidia] failed=[amd: factory: rocm-smi not found]
169+
```
170+
171+
## Configuration
172+
173+
```yaml
174+
cpu:
175+
meters: ["rapl", "hwmon"] # ordered priority
176+
```
177+
178+
Per-backend tuning keys (`rapl.zones`, `experimental.hwmon.zones`, `experimental.hwmon.chipRules`, `dev.fake-cpu-meter.zones`) are unchanged. Legacy selectors translate at startup to `cpu.meters` and emit a deprecation warning. See [Backward compatibility](#backward-compatibility) for the full migration.
179+
180+
GPU config does not change in this EP.
181+
182+
## Testing Strategy
183+
184+
* `device.CreateCPUMeter` table-driven tests: unknown name, factory error, `Init()` error, empty zones, success, ordered priority, all-fail aggregation.
185+
* `gpu.Discover` table-driven tests: all-fail, mixed, all-empty, all-succeed; per-vendor failure modes (factory, `Init()`, empty `Devices()`).
186+
187+
## Backward compatibility
188+
189+
Default `cpu.meters: ["rapl", "hwmon"]` reproduces default behaviour. No breaking change for healthy systems running default config.
190+
191+
Two legacy selectors require migration. They translate at startup to an effective `cpu.meters` value, log a deprecation warning, and stop working in a future release:
192+
193+
* `experimental.hwmon.forceEnabled: true` (today: force hwmon, skip RAPL) → effective `cpu.meters: ["hwmon"]`.
194+
* `dev.fake-cpu-meter.enabled: true` (today: use fake meter, skip RAPL/hwmon) → effective `cpu.meters: ["fake"]`.
195+
196+
Per-backend tuning keys are unchanged and remain valid: `rapl.zones`, `experimental.hwmon.zones`, `experimental.hwmon.chipRules`, `dev.fake-cpu-meter.zones`.
197+
198+
Operators on broken systems will see `kepler` exit with a clearer error when every CPU backend fails (the prior path also exits, but with the hwmon error only, which can be misleading).
199+
200+
GPU error semantics tighten. A node with `gpu.enabled=true` and a broken NVML driver will fail startup instead of silently running CPU-only. This matches the CPU contract. CPU-only is the default (`gpu.enabled=false`); the tightening only affects operators who explicitly opt in to GPU.
201+
202+
## Implementation
203+
204+
Single PR introduces:
205+
206+
1. Promote `device.PowerMeter` interface (`Name`, `Init`).
207+
2. Add `cfg.Cpu.Meters` config field with default `["rapl", "hwmon"]`.
208+
3. Add `device.CreateCPUMeter` + `buildCPUMeter` switch dispatch.
209+
4. Add `Config.ApplyCpuMeterDeprecations` to translate the two legacy keys.
210+
5. Replace `cmd/kepler/main.go`'s `createCPUMeter` / `createHwmonMeter` with a call to `device.CreateCPUMeter`.
211+
6. Update sample configs and docs.
212+
213+
GPU failure-mode split lands as a follow-up PR — separable since CPU and GPU paths are independent in `cmd/kepler/main.go`.
214+
215+
## Risks and Mitigations
216+
217+
### Operational risk: stricter GPU error path (follow-up PR)
218+
219+
* **Risk**: Nodes with GPU explicitly enabled (`gpu.enabled=true`) and broken drivers fail startup where the prior code silently continued without GPU metrics.
220+
* **Mitigation**: Matches the CPU contract. Default-off (`gpu.enabled=false`) means only opted-in operators are affected; they can revert to CPU-only by removing the explicit opt-in.
221+
222+
### Configuration risk: invalid `cpu.meters` value
223+
224+
* **Risk**: Operator typo (`"rappl"`) is rejected by `buildCPUMeter` with `unknown cpu meter "rappl"`.
225+
* **Mitigation**: The error names the unknown backend and the operator's `cpu.meters` is preserved in the printed config at startup so the typo is visible in logs.
226+
227+
## Alternatives Considered
228+
229+
### Alternative 1: Registry pattern with `init()`-time registration
230+
231+
A `RegisterCPUMeter(name, factory)` API where each backend's source file calls it from `init()`, plus a `DiscoverCPU` that walks the registry. This is what GPU does today.
232+
233+
**Why rejected**: speculative architecture for three backends. The registry pattern's value doesn't apply.
234+
235+
The switch dispatch achieves the same operator outcome (config-driven priority) with less complexity. If other CPU backends are required, migrating each `case` to a `RegisterCPUMeter` call is possible and the `CPUPowerMeter` interface doesn't change.
236+
237+
### Alternative 2: Unify CPU and GPU under one per-device interface
238+
239+
This would define a common `Device` interface that both CPU zones and GPU devices implement.
240+
241+
This is not viable since CPU zones (logical, energy in µJ per zone) and GPU devices (physical, watts plus per-process attribution) have different shapes. A unifying interface would push hardware-specific concerns through `interface{}` or lose information.
242+
243+
### Alternative 3: Per-backend Prometheus error counter
244+
245+
This would expose `kepler_meter_failures_total{backend, stage}` instead of returning errors.
246+
247+
While this solves visibility for transient failures during runtime, it doesn't address the startup-time question this proposal targets — "which backend was selected and why." Operational metrics could be added orthogonally.
248+
249+
## Next Steps
250+
251+
Out of scope for this EP.
252+
253+
### GPU failure-mode split
254+
255+
Refactor `gpu.Discover` to split factory / `Init()` / empty-devices failure modes and return `(meters, error)`. Separate PR; independent of the CPU work.
256+
257+
### Operational metrics per backend
258+
259+
A small set of operational metrics per backend: discovery success/failure counters, duration. Useful for operators running heterogeneous fleets.
260+
261+
### Migrate to a registry pattern when warranted
262+
263+
If out-of-tree CPU backends or a plugin model becomes a real requirement (BMC, MSR, ARM-vendor counters arriving from external contributors), the switch in `buildCPUMeter` migrates to `RegisterCPUMeter` calls. The interface (`CPUPowerMeter`) and config (`cpu.meters`) don't change, so the migration is mechanical.

0 commit comments

Comments
 (0)