docs(proposal): Improve extensibility of registering power meter backends#2467
docs(proposal): Improve extensibility of registering power meter backends#2467nikimanoledaki wants to merge 2 commits into
Conversation
072edab to
8acc9f0
Compare
8acc9f0 to
51fea8e
Compare
c20aaca to
e977202
Compare
Propose a `device.PowerMeter` interface and per-domain registry to unify CPU and GPU device discovery in `cmd/kepler/main.go`. Opens the CPU path to non-RAPL backends (MSR, BMC-derived, external probes) without forking, matching the registry pattern GPU already uses. Signed-off-by: nikimanoledaki <niki.manoledaki@grafana.com>
e977202 to
8e887ef
Compare
- `markdown-table-format`: align column widths in the GPU result-per-scenario table. - `markdownlint MD024`: rename duplicate `### GPU failure-mode split` heading in the Next Steps section to `### GPU failure-mode split (separate PR)` to disambiguate from the same heading in Proposed Solution. Signed-off-by: nikimanoledaki <niki.manoledaki@grafana.com>
| ```yaml | ||
| cpu: | ||
| meters: ["rapl", "hwmon"] # ordered priority | ||
| ``` |
There was a problem hiding this comment.
the name cpu.meters doesnt convey that it is a ordered fallback chain. could we use cpu.meterPriority or cpu.fallbackChain ?
There was a problem hiding this comment.
I agree with fallbackChain field.
And, in addition to log, Kepler resource of kepler-operator can have status to present the active (selected) meter.
There was a problem hiding this comment.
Thank you, I can rename this field 👍
| Once we agree on these assumptions, here are the problems that follow. | ||
|
|
||
| 1. **Asymmetry.** Two patterns for the same job (turn config into hardware abstractions) makes `cmd/kepler/main.go` harder to read and audit. | ||
| 2. **Hidden GPU failures.** `gpu.Discover` (`internal/device/gpu/registry.go:48`) collapses three failure modes into one `nil` return: |
There was a problem hiding this comment.
there are some prerequisites for gpu initialization.
- the node must have a gpu
- node feature discovery operator must discover gpu and add some labels to node
- nvidia (or AMD) gpu operator must be installed and some CRs created.
- wait for drivers to be installed and initialized etc. this ensures nvml is present
- now kepler's gpu discovery should succeed
if kepler inits gpu before all the steps are completed, gpu init will fail.
if gpu is enabled, and gpu init fails, kepler cannot separate the two case
- gpu is not present on the node
- gpu is present but cannot initialize
and a cluster can be heterogeneous, i.e onlt few nodes may have gpu
so there was a deliberate decision to allow cpu power to continue to work if gpu side fails for any reason.
@vprashar2929 do you remember same way or did i miss something?
this proposal will fail kepler startup if gpu is configured and gpu init fails, right?
may be there are ways to detect gpu's presence on the node, and if gpu is present, and if gpu init fails we treat it as kepler failure. but the method to detect gpu presence should work in VM nodes also.
There was a problem hiding this comment.
@vimalk78 Yes this is exact flow.
so there was a deliberate decision to allow cpu power to continue to work if gpu side fails for any reason.
IMO CPU is the base/primary power source for Kepler, and GPU is supplementary. Every node has a CPU, so Kepler should always be able to report CPU power. GPU power only makes sense once that whole prerequisite chain has actually converged on the node. In a heterogeneous cluster tying Kepler's startup to a successful GPU init would mean blocking CPU power monitoring on perfectly healthy nodes just because the GPU side hasn't caught up yet.
| ```text | ||
| internal/device/ | ||
| ├── power_meter.go # PowerMeter interface (promoted from private) | ||
| ├── cpu_power_meter.go # CPUPowerMeter, CreateCPUMeter, buildCPUMeter |
There was a problem hiding this comment.
We may have folder for cpu (same as what we have for gpu)
There was a problem hiding this comment.
Yes! I agree. I plan to do this in a follow-up PR 👍
I can update the proposal to reflect this
Why
cmd/kepler/main.goconstructs CPU and GPU power meters in two different ways:createCPUMeteris a strict RAPL-with-hwmon-fallback chain.createGPUMetersis a soft fan-out across registered vendors.Three problems follow: asymmetric patterns, hidden GPU init failures, and a strict CPU path that operators cannot extend easily.
What
This proposal:
internal/device/power_meter.goto an exported PowerMeter (Name, Init). Shutdown is optional via service.Shutdowner.AI tooling disclosure: I used Claude Code to plan and generate some of these ideas. I prompted and heavily edited any AI-generated text and guarantee that I have read, reviewed, and edited every line.