Skip to content

Commit c20aaca

Browse files
docs(proposal): add draft for extensible power meter registry
Propose a `device.PowerMeter` interface and per-domain registry to unify CPU and GPU device discovery in `cmd/kepler/main.go`. Opens the CPU path to non-RAPL backends (MSR, BMC-derived, external probes) without forking, matching the registry pattern GPU already uses. Signed-off-by: nikimanoledaki <niki.manoledaki@grafana.com>
1 parent bab38cb commit c20aaca

1 file changed

Lines changed: 269 additions & 0 deletions

File tree

Lines changed: 269 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,269 @@
1+
# Draft: Config-Driven CPU Power Meter Selection
2+
3+
* **Status**: Draft
4+
* **Author**: Niki Manoledaki
5+
* **Created**: 2026-05-03
6+
7+
## Problem
8+
9+
For context, `cmd/kepler/main.go` constructs CPU and GPU power meters in two different shapes:
10+
11+
* `createCPUMeter` hardcodes RAPL with hwmon as a fallback. It returns `(device.CPUPowerMeter, error)`. This is strict by design: Kepler must produce a CPU meter or fail startup.
12+
* `createGPUMeters` fans out across registered vendors via `gpu.DiscoverAll`. It returns `[]gpu.GPUPowerMeter` with no error, which is a soft failure mode, also by design, as it assumes that GPU monitoring is not required for Kepler.
13+
14+
Once we agree on these assumptions, here are the problems that follow.
15+
16+
1. **Asymmetry.** Two patterns for the same job (turn config into hardware abstractions) makes `cmd/kepler/main.go` harder to read and audit.
17+
2. **Hidden GPU failures.** `gpu.Discover` (`internal/device/gpu/registry.go:48`) collapses three failure modes into one `nil` return:
18+
* factory error (driver missing or unsupported)
19+
* `Init()` error (driver present but broken)
20+
* empty `Devices()` (vendor registered, no hardware on this node)
21+
22+
Only the third is a legitimate soft outcome. Real backend failures surface only as a `Warn` log line.
23+
3. **Closed CPU path.** RAPL and hwmon are hard-coded fallbacks with hwmon force-enable as the only escape hatch. Operators on platforms where neither is the right source (ARM with vendor-specific counters, BMC-derived per-socket power, external probes via sidecar) cannot reorder backend priority without forking `createCPUMeter`. Mytton's *Model, estimate or measure?* ([link](https://www.devsustainability.com/p/model-estimate-or-measure-what-matters)) argues that operators making sustainability claims need a source-agnostic power-data path. Kepler's CPU path is opinionated about its sources.
24+
25+
## Goals
26+
27+
* Open CPU power-meter selection to a config-driven priority list operators can reorder.
28+
* Surface real GPU backend failures as errors instead of logging them.
29+
* Unify the CPU and GPU lifecycle shape: discovery, failure, initialization.
30+
* Preserve default behaviour for healthy systems running default config.
31+
32+
## Non-Goals
33+
34+
* This proposal does not build a common interface **per device** across CPU zones and GPU devices. Their measurement shapes differ and a unifying interface would abstract hardware-specific concerns through `interface{}`, which is not ideal. The CPU and GPU attributes should remain separate.
35+
* This proposal does not replace `monitor.PowerDataProvider`, which will continue to be the consumer for exporters. This proposal unifies the step right before that, which is how to construct the hardware backend.
36+
* This proposal does not introduce a registry pattern with `init()`-time backend registration. Registry-based dispatch is appropriate when there are many backends or third-party plugins; for three known in-tree backends, a switch is simpler. See [Alternatives Considered](#alternatives-considered).
37+
38+
## Proposed Solution
39+
40+
### Promote `device.PowerMeter`
41+
42+
Promote the unexported `powerMeter` interface in [internal/device/power_meter.go](../../../internal/device/power_meter.go) to an external interface `PowerMeter`, like this:
43+
44+
```go
45+
// PowerMeter is a hardware backend that reads energy or power from a class
46+
// of hardware (CPU package, GPU device, etc).
47+
//
48+
// Many PowerMeters can be selected. Each contributes its own readings.
49+
// Domain-specific methods live on subinterfaces that embed PowerMeter.
50+
type PowerMeter interface {
51+
service.Service // Name()
52+
service.Initializer // Init()
53+
}
54+
```
55+
56+
`Shutdown` is optional via `service.Shutdowner`; backends with resources (NVIDIA's NVML library state, future BMC HTTP clients, MSR file handles) implement it and the run framework type-asserts before calling. RAPL, hwmon, and fake hold no resources and don't need it.
57+
58+
`device.CPUPowerMeter` and `gpu.GPUPowerMeter` embed `device.PowerMeter` and add their own methods. No change to those domain methods in this EP.
59+
60+
### Switch-based CPU dispatch
61+
62+
CPU gains a config key for ordering plus a switch that walks the priority list:
63+
64+
```yaml
65+
cpu:
66+
meters: ["rapl", "hwmon"] # ordered priority
67+
```
68+
69+
```go
70+
// internal/device/cpu_power_meter.go
71+
72+
func CreateCPUMeter(logger *slog.Logger, cfg *config.Config) (CPUPowerMeter, error) {
73+
var errs []error
74+
for _, name := range cfg.Cpu.Meters {
75+
meter, err := buildCPUMeter(name, logger, cfg)
76+
if err != nil { /* aggregate, continue */ }
77+
if err := meter.Init(); err != nil { /* aggregate, continue */ }
78+
zones, _ := meter.Zones()
79+
if len(zones) == 0 { /* soft skip, continue */ }
80+
return meter, nil
81+
}
82+
return nil, errors.Join(errs...)
83+
}
84+
85+
func buildCPUMeter(name string, logger *slog.Logger, cfg *config.Config) (CPUPowerMeter, error) {
86+
switch name {
87+
case "rapl": return NewCPUPowerMeter(...)
88+
case "hwmon": return NewHwmonPowerMeter(...)
89+
case "fake": return NewFakeCPUMeter(...)
90+
default: return nil, fmt.Errorf("unknown cpu meter %q", name)
91+
}
92+
}
93+
```
94+
95+
Two functions, one file. `CreateCPUMeter` owns the lifecycle (build, init, zones, error aggregation). `buildCPUMeter` is the dispatch. Adding a backend means a new case + a new constructor.
96+
97+
Default value `["rapl", "hwmon"]` preserves the prior fallback behaviour. Operators reorder via the config; no fork needed.
98+
99+
### GPU failure-mode split
100+
101+
`gpu.Discover` is refactored to split the three failure modes and return `(meters, error)`:
102+
103+
| State | GPU result |
104+
|------------------------------|-------------------------------------------------|
105+
| Hardware works | `meters, nil` |
106+
| Hardware absent on this node | `nil, nil` |
107+
| Configured backend broken | `meters?, err` (per-vendor failures aggregated) |
108+
| Feature off | `nil, nil` |
109+
110+
Result per scenario at startup:
111+
112+
| Scenario | Result |
113+
|---------------------|--------------------------|
114+
| Factory error | Real failure. Aggregate. |
115+
| `Init()` error | Real failure. Aggregate. |
116+
| Empty `Devices()` | Soft skip. Not an error. |
117+
| Success | Append to result. |
118+
119+
Error aggregation uses `errors.Join`. `gpu.Discover` returns the joined error only when the GPU feature is explicitly enabled and every registered vendor returned a real failure. "No GPU on this node" stays `(nil, nil)`.
120+
121+
GPU continues to use a vendor registry (NVIDIA registers from `init()` today). The vendor registry is justified because GPU vendor backends live in subpackages (`gpu/nvidia/`) with vendor-specific dependencies (NVML CGO bindings); a switch in `cmd/kepler/main.go` would force every vendor's heavy import into the main binary regardless of whether the operator opted in. CPU backends don't have this constraint — RAPL, hwmon, and fake all live in `internal/device/` with no vendor-specific imports.
122+
123+
### Unified `cmd/kepler/main.go`
124+
125+
End result after the refactor:
126+
127+
```go
128+
cpuMeter, err := device.CreateCPUMeter(logger, cfg)
129+
if err != nil { ... }
130+
131+
gpuMeters, err := gpu.Discover(logger, cfg)
132+
if err != nil { ... }
133+
```
134+
135+
Symmetric shape (`(meter(s), error)`), different failure semantics (CPU strict, GPU soft).
136+
137+
## Detailed Design
138+
139+
### Package layout
140+
141+
```text
142+
internal/device/
143+
├── power_meter.go # PowerMeter interface (promoted from private)
144+
├── cpu_power_meter.go # CPUPowerMeter, CreateCPUMeter, buildCPUMeter
145+
├── rapl_sysfs_power_meter.go # NewCPUPowerMeter (RAPL)
146+
├── hwmon_power_meter.go # NewHwmonPowerMeter
147+
├── fake_cpu_power_meter.go # NewFakeCPUMeter
148+
└── gpu/
149+
├── interface.go # GPUPowerMeter embeds device.PowerMeter, plus Vendor and GPUDevice types
150+
├── registry.go # gpu.Register, gpu.Discover (refactored failure modes)
151+
└── nvidia/ # init() calls gpu.Register(NVIDIA, ...)
152+
```
153+
154+
No new file for CPU dispatch — `CreateCPUMeter` and `buildCPUMeter` live in `cpu_power_meter.go` next to the `CPUPowerMeter` interface they construct.
155+
156+
### Logging
157+
158+
Per-backend `Warn` on failure, `Info` on selection:
159+
160+
```text
161+
INFO using rapl power meter
162+
```
163+
164+
```text
165+
WARN rapl not available, trying next backend error="..."
166+
INFO using hwmon power meter
167+
```
168+
169+
GPU keeps its existing per-vendor `Warn` plus a one-line summary:
170+
171+
```text
172+
INFO gpu meter discovery ok=[nvidia] failed=[amd: factory: rocm-smi not found]
173+
```
174+
175+
## Configuration
176+
177+
```yaml
178+
cpu:
179+
meters: ["rapl", "hwmon"] # ordered priority
180+
```
181+
182+
Per-backend tuning keys (`rapl.zones`, `experimental.hwmon.zones`, `experimental.hwmon.chipRules`, `dev.fake-cpu-meter.zones`) are unchanged. Legacy selectors translate at startup to `cpu.meters` and emit a deprecation warning. See [Backward compatibility](#backward-compatibility) for the full migration.
183+
184+
GPU config does not change in this EP.
185+
186+
## Testing Strategy
187+
188+
* `device.CreateCPUMeter` table-driven tests: unknown name, factory error, `Init()` error, empty zones, success, ordered priority, all-fail aggregation.
189+
* `gpu.Discover` table-driven tests: all-fail, mixed, all-empty, all-succeed; per-vendor failure modes (factory, `Init()`, empty `Devices()`).
190+
191+
## Backward compatibility
192+
193+
Default `cpu.meters: ["rapl", "hwmon"]` reproduces default behaviour. No breaking change for healthy systems running default config.
194+
195+
Two legacy selectors require migration. They translate at startup to an effective `cpu.meters` value, log a deprecation warning, and stop working in a future release:
196+
197+
* `experimental.hwmon.forceEnabled: true` (today: force hwmon, skip RAPL) → effective `cpu.meters: ["hwmon"]`.
198+
* `dev.fake-cpu-meter.enabled: true` (today: use fake meter, skip RAPL/hwmon) → effective `cpu.meters: ["fake"]`.
199+
200+
Per-backend tuning keys are unchanged and remain valid: `rapl.zones`, `experimental.hwmon.zones`, `experimental.hwmon.chipRules`, `dev.fake-cpu-meter.zones`.
201+
202+
Operators on broken systems will see `kepler` exit with a clearer error when every CPU backend fails (the prior path also exits, but with the hwmon error only, which can be misleading).
203+
204+
GPU error semantics tighten. A node with `gpu.enabled=true` and a broken NVML driver will fail startup instead of silently running CPU-only. This matches the CPU contract. CPU-only is the default (`gpu.enabled=false`); the tightening only affects operators who explicitly opt in to GPU.
205+
206+
## Implementation
207+
208+
Single PR introduces:
209+
210+
1. Promote `device.PowerMeter` interface (`Name`, `Init`).
211+
2. Add `cfg.Cpu.Meters` config field with default `["rapl", "hwmon"]`.
212+
3. Add `device.CreateCPUMeter` + `buildCPUMeter` switch dispatch.
213+
4. Add `Config.ApplyCpuMeterDeprecations` to translate the two legacy keys.
214+
5. Replace `cmd/kepler/main.go`'s `createCPUMeter` / `createHwmonMeter` with a call to `device.CreateCPUMeter`.
215+
6. Update sample configs and docs.
216+
217+
GPU failure-mode split lands as a follow-up PR — separable since CPU and GPU paths are independent in `cmd/kepler/main.go`.
218+
219+
## Risks and Mitigations
220+
221+
### Operational risk: stricter GPU error path (follow-up PR)
222+
223+
* **Risk**: Nodes with GPU explicitly enabled (`gpu.enabled=true`) and broken drivers fail startup where the prior code silently continued without GPU metrics.
224+
* **Mitigation**: Matches the CPU contract. Default-off (`gpu.enabled=false`) means only opted-in operators are affected; they can revert to CPU-only by removing the explicit opt-in.
225+
226+
### Configuration risk: invalid `cpu.meters` value
227+
228+
* **Risk**: Operator typo (`"rappl"`) is rejected by `buildCPUMeter` with `unknown cpu meter "rappl"`.
229+
* **Mitigation**: The error names the unknown backend and the operator's `cpu.meters` is preserved in the printed config at startup so the typo is visible in logs.
230+
231+
## Alternatives Considered
232+
233+
### Alternative 1: Registry pattern with `init()`-time registration
234+
235+
A `RegisterCPUMeter(name, factory)` API where each backend's source file calls it from `init()`, plus a `DiscoverCPU` that walks the registry. This is what GPU does today.
236+
237+
**Why rejected**: speculative architecture for three in-tree backends. The registry pattern's value (third-party plugins, out-of-tree backends, no main.go edits) doesn't apply to Kepler today. The cost (a new package-level API surface, `init()` side effects, test isolation via `Clear` helpers, jumping between files to find where backends are registered) is real and visible to every reader.
238+
239+
The switch dispatch achieves the same operator outcome (config-driven priority) with less vocabulary. If out-of-tree CPU backends become a real requirement, migrating each `case` to a `RegisterCPUMeter` call is mechanical — the `CPUPowerMeter` interface doesn't change.
240+
241+
GPU keeps its registry because vendor backends live in subpackages with heavy vendor-specific imports (NVML, future ROCm/oneAPI). A switch would pull every vendor's dependencies into the main binary unconditionally.
242+
243+
### Alternative 2: Unify CPU and GPU under one per-device interface
244+
245+
This would define a common `Device` interface that both CPU zones and GPU devices implement.
246+
247+
This is not viable since CPU zones (logical, energy in µJ per zone) and GPU devices (physical, watts plus per-process attribution) have different shapes. A unifying interface would push hardware-specific concerns through `interface{}` or lose information.
248+
249+
### Alternative 3: Per-backend Prometheus error counter
250+
251+
This would expose `kepler_meter_failures_total{backend, stage}` instead of returning errors.
252+
253+
While this solves visibility for transient failures during runtime, it doesn't address the startup-time question this proposal targets — "which backend was selected and why." Operational metrics could be added orthogonally.
254+
255+
## Next Steps
256+
257+
Out of scope for this EP.
258+
259+
### GPU failure-mode split
260+
261+
Refactor `gpu.Discover` to split factory / `Init()` / empty-devices failure modes and return `(meters, error)`. Separate PR; independent of the CPU work.
262+
263+
### Operational metrics per backend
264+
265+
A small set of operational metrics per backend: discovery success/failure counters, duration. Useful for operators running heterogeneous fleets.
266+
267+
### Migrate to a registry pattern when warranted
268+
269+
If out-of-tree CPU backends or a plugin model becomes a real requirement (BMC, MSR, ARM-vendor counters arriving from external contributors), the switch in `buildCPUMeter` migrates to `RegisterCPUMeter` calls. The interface (`CPUPowerMeter`) and config (`cpu.meters`) don't change, so the migration is mechanical.

0 commit comments

Comments
 (0)