docs(proposal): Improve extensibility of registering power meter backends by nikimanoledaki · Pull Request #2467 · sustainable-computing-io/kepler

nikimanoledaki · 2026-05-03T14:09:26Z

Why

cmd/kepler/main.go constructs CPU and GPU power meters in two different ways:

createCPUMeter is a strict RAPL-with-hwmon-fallback chain.
createGPUMeters is a soft fan-out across registered vendors.

Three problems follow: asymmetric patterns, hidden GPU init failures, and a strict CPU path that operators cannot extend easily.

What

This proposal:

Promotes the unexported powerMeter interface in internal/device/power_meter.go to an exported PowerMeter (Name, Init). Shutdown is optional via service.Shutdowner.
Adds cpu.meters config (default ["rapl", "hwmon"]) that operators can select.
Replaces the hardcoded fallback chain with a switch dispatch: device.CreateCPUMeter walks the priority list, buildCPUMeter constructs each backend in a switch over backend names.
Splits gpu.Discover into three failure modes so real backend failures surface as errors rather than log lines (separable follow-up PR).
Translates two legacy keys (experimental.hwmon.forceEnabled, dev.fake-cpu-meter.enabled) to cpu.meters at startup with a deprecation warning.

AI tooling disclosure: I used Claude Code to plan and generate some of these ideas. I prompted and heavily edited any AI-generated text and guarantee that I have read, reviewed, and edited every line.

Propose a `device.PowerMeter` interface and per-domain registry to unify CPU and GPU device discovery in `cmd/kepler/main.go`. Opens the CPU path to non-RAPL backends (MSR, BMC-derived, external probes) without forking, matching the registry pattern GPU already uses. Signed-off-by: nikimanoledaki <niki.manoledaki@grafana.com>

- `markdown-table-format`: align column widths in the GPU result-per-scenario table. - `markdownlint MD024`: rename duplicate `### GPU failure-mode split` heading in the Next Steps section to `### GPU failure-mode split (separate PR)` to disambiguate from the same heading in Proposed Solution. Signed-off-by: nikimanoledaki <niki.manoledaki@grafana.com>

vimalk78 · 2026-05-05T06:21:46Z

+```yaml
+cpu:
+  meters: ["rapl", "hwmon"]   # ordered priority
+```


the name cpu.meters doesnt convey that it is a ordered fallback chain. could we use cpu.meterPriority or cpu.fallbackChain ?

I agree with fallbackChain field.
And, in addition to log, Kepler resource of kepler-operator can have status to present the active (selected) meter.

Thank you, I can rename this field 👍

Please take a look: #2471

vimalk78 · 2026-05-05T08:29:54Z

+Once we agree on these assumptions, here are the problems that follow.
+
+1. **Asymmetry.** Two patterns for the same job (turn config into hardware abstractions) makes `cmd/kepler/main.go` harder to read and audit.
+2. **Hidden GPU failures.** `gpu.Discover` (`internal/device/gpu/registry.go:48`) collapses three failure modes into one `nil` return:


there are some prerequisites for gpu initialization.

the node must have a gpu

node feature discovery operator must discover gpu and add some labels to node

nvidia (or AMD) gpu operator must be installed and some CRs created.

wait for drivers to be installed and initialized etc. this ensures nvml is present

now kepler's gpu discovery should succeed

if kepler inits gpu before all the steps are completed, gpu init will fail.
if gpu is enabled, and gpu init fails, kepler cannot separate the two case

gpu is not present on the node

gpu is present but cannot initialize
and a cluster can be heterogeneous, i.e onlt few nodes may have gpu

so there was a deliberate decision to allow cpu power to continue to work if gpu side fails for any reason.
@vprashar2929 do you remember same way or did i miss something?

this proposal will fail kepler startup if gpu is configured and gpu init fails, right?

may be there are ways to detect gpu's presence on the node, and if gpu is present, and if gpu init fails we treat it as kepler failure. but the method to detect gpu presence should work in VM nodes also.

@vimalk78 Yes this is exact flow.

so there was a deliberate decision to allow cpu power to continue to work if gpu side fails for any reason.

IMO CPU is the base/primary power source for Kepler, and GPU is supplementary. Every node has a CPU, so Kepler should always be able to report CPU power. GPU power only makes sense once that whole prerequisite chain has actually converged on the node. In a heterogeneous cluster tying Kepler's startup to a successful GPU init would mean blocking CPU power monitoring on perfectly healthy nodes just because the GPU side hasn't caught up yet.

sunya-ch · 2026-05-07T10:39:20Z

+```text
+internal/device/
+├── power_meter.go              # PowerMeter interface (promoted from private)
+├── cpu_power_meter.go          # CPUPowerMeter, CreateCPUMeter, buildCPUMeter


We may have folder for cpu (same as what we have for gpu)

Yes! I agree. I plan to do this in a follow-up PR 👍

I can update the proposal to reflect this

github-actions Bot added the docs Documentation changes label May 3, 2026

nikimanoledaki changed the title ~~docs(proposal): add draft for extensible power meter registry~~ docs(proposal): Make power meter registration easier to extend May 3, 2026

nikimanoledaki force-pushed the nm/draft-extensible-power-meter-registry branch 2 times, most recently from 072edab to 8acc9f0 Compare May 3, 2026 14:19

nikimanoledaki marked this pull request as ready for review May 3, 2026 14:19

nikimanoledaki requested review from sunya-ch and vimalk78 May 3, 2026 14:24

nikimanoledaki changed the title ~~docs(proposal): Make power meter registration easier to extend~~ docs(proposal): Make registration power meter device easier to extend May 3, 2026

nikimanoledaki changed the title ~~docs(proposal): Make registration power meter device easier to extend~~ docs(proposal): Make registering power meter device more extensible May 3, 2026

nikimanoledaki changed the title ~~docs(proposal): Make registering power meter device more extensible~~ docs(proposal): Improve extensibility of registering power meter backends May 3, 2026

nikimanoledaki force-pushed the nm/draft-extensible-power-meter-registry branch from 8acc9f0 to 51fea8e Compare May 3, 2026 18:26

nikimanoledaki mentioned this pull request May 3, 2026

feat(device): add cpu.meters config and switch-based CPU meter selection #2468

Merged

nikimanoledaki force-pushed the nm/draft-extensible-power-meter-registry branch 2 times, most recently from c20aaca to e977202 Compare May 3, 2026 21:31

nikimanoledaki force-pushed the nm/draft-extensible-power-meter-registry branch from e977202 to 8e887ef Compare May 3, 2026 21:32

github-actions Bot added the fix A bug fix label May 3, 2026

vimalk78 reviewed May 5, 2026

View reviewed changes

sunya-ch reviewed May 7, 2026

View reviewed changes

nikimanoledaki mentioned this pull request May 12, 2026

refactor(config): rename cpu.meters to cpu.preferredMeters #2471

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs(proposal): Improve extensibility of registering power meter backends#2467

docs(proposal): Improve extensibility of registering power meter backends#2467
nikimanoledaki wants to merge 2 commits into
mainfrom
nm/draft-extensible-power-meter-registry

nikimanoledaki commented May 3, 2026 •

edited

Loading

Uh oh!

vimalk78 May 5, 2026

Uh oh!

sunya-ch May 7, 2026

Uh oh!

nikimanoledaki May 12, 2026

Uh oh!

nikimanoledaki May 12, 2026

Uh oh!

vimalk78 May 5, 2026

Uh oh!

vprashar2929 May 8, 2026

Uh oh!

sunya-ch May 7, 2026

Uh oh!

nikimanoledaki May 12, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

nikimanoledaki commented May 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What

Uh oh!

vimalk78 May 5, 2026

Choose a reason for hiding this comment

Uh oh!

sunya-ch May 7, 2026

Choose a reason for hiding this comment

Uh oh!

nikimanoledaki May 12, 2026

Choose a reason for hiding this comment

Uh oh!

nikimanoledaki May 12, 2026

Choose a reason for hiding this comment

Uh oh!

vimalk78 May 5, 2026

Choose a reason for hiding this comment

Uh oh!

vprashar2929 May 8, 2026

Choose a reason for hiding this comment

Uh oh!

sunya-ch May 7, 2026

Choose a reason for hiding this comment

Uh oh!

nikimanoledaki May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

nikimanoledaki commented May 3, 2026 •

edited

Loading

nikimanoledaki May 12, 2026 •

edited

Loading