Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,263 @@
# Draft: Extensible Power Meter Selection

* **Status**: Draft
* **Author**: Niki Manoledaki
* **Created**: 2026-05-03

## Problem

For context, `cmd/kepler/main.go` constructs CPU and GPU power meters in two different shapes:

* `createCPUMeter` hardcodes RAPL with hwmon as a fallback. It returns `(device.CPUPowerMeter, error)`. This is strict by design: Kepler must produce a CPU meter or fail startup.
* `createGPUMeters` fans out across registered vendors via `gpu.DiscoverAll`. It returns `[]gpu.GPUPowerMeter` with no error, which is a soft failure mode, also by design, as it assumes that GPU monitoring is not required for Kepler.

Once we agree on these assumptions, here are the problems that follow.

1. **Asymmetry.** Two patterns for the same job (turn config into hardware abstractions) makes `cmd/kepler/main.go` harder to read and audit.
2. **Hidden GPU failures.** `gpu.Discover` (`internal/device/gpu/registry.go:48`) collapses three failure modes into one `nil` return:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are some prerequisites for gpu initialization.

  • the node must have a gpu
  • node feature discovery operator must discover gpu and add some labels to node
  • nvidia (or AMD) gpu operator must be installed and some CRs created.
  • wait for drivers to be installed and initialized etc. this ensures nvml is present
  • now kepler's gpu discovery should succeed

if kepler inits gpu before all the steps are completed, gpu init will fail.
if gpu is enabled, and gpu init fails, kepler cannot separate the two case

  • gpu is not present on the node
  • gpu is present but cannot initialize
    and a cluster can be heterogeneous, i.e onlt few nodes may have gpu

so there was a deliberate decision to allow cpu power to continue to work if gpu side fails for any reason.
@vprashar2929 do you remember same way or did i miss something?

this proposal will fail kepler startup if gpu is configured and gpu init fails, right?

may be there are ways to detect gpu's presence on the node, and if gpu is present, and if gpu init fails we treat it as kepler failure. but the method to detect gpu presence should work in VM nodes also.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vimalk78 Yes this is exact flow.

so there was a deliberate decision to allow cpu power to continue to work if gpu side fails for any reason.

IMO CPU is the base/primary power source for Kepler, and GPU is supplementary. Every node has a CPU, so Kepler should always be able to report CPU power. GPU power only makes sense once that whole prerequisite chain has actually converged on the node. In a heterogeneous cluster tying Kepler's startup to a successful GPU init would mean blocking CPU power monitoring on perfectly healthy nodes just because the GPU side hasn't caught up yet.

* factory error (driver missing or unsupported)
* `Init()` error (driver present but broken)
* empty `Devices()` (vendor registered, no hardware on this node)

Only the third is a legitimate soft outcome. Real backend failures surface only as a `Warn` log line.
3. **Closed CPU path.** RAPL and hwmon are hard-coded fallbacks with hwmon force-enable as the only fallback. Operators on infrastructure where neither is the right source cannot extend this model to select a different backend easily.

## Goals

* Extend CPU power meter backend selection to a config-driven list.
* Unify more of the CPU and GPU lifecycle shape: discovery, failure, initialization.
* Surface real GPU backend failures as errors instead of logging them.
* Preserve default behaviour for healthy systems running default config.

## Non-Goals

* This proposal does not build a common interface **per device** across CPU zones and GPU devices. Their measurement shapes differ and a unifying interface would abstract hardware-specific concerns through `interface{}`, which is not ideal. The CPU and GPU attributes should remain separate.
* This proposal does not replace `monitor.PowerDataProvider`, which will continue to be the consumer for exporters. This proposal unifies the step right before that, which is how to initialize the hardware backend.

## Proposed Solution

### Promote `device.PowerMeter`

Promote the unexported `powerMeter` interface in [internal/device/power_meter.go](../../../internal/device/power_meter.go) to an external interface `PowerMeter`, like this:

```go
// PowerMeter reads power from a class of hardware (CPU, GPU, etc).
//
// Many PowerMeters can be selected. Each contributes its own readings.
// Domain-specific methods live on subinterfaces that embed PowerMeter.
type PowerMeter interface {
service.Service // Name()
service.Initializer // Init()
}
```

`Shutdown` is optional via `service.Shutdowner`. Some backends like NVML implement it.

`device.CPUPowerMeter` and `gpu.GPUPowerMeter` embed `device.PowerMeter` and add their own methods. No change to those domain methods in this EP.

### Switch-based CPU dispatch

CPU gains a config field for ordering plus a switch that walks the priority list:

```yaml
cpu:
meters: ["rapl", "hwmon"] # ordered priority
```

```go
// internal/device/cpu_power_meter.go

func CreateCPUMeter(logger *slog.Logger, cfg *config.Config) (CPUPowerMeter, error) {
var errs []error
for _, name := range cfg.Cpu.Meters {
meter, err := buildCPUMeter(name, logger, cfg)
if err != nil { /* aggregate, continue */ }
if err := meter.Init(); err != nil { /* aggregate, continue */ }
zones, _ := meter.Zones()
if len(zones) == 0 { /* soft skip, continue */ }
return meter, nil
}
return nil, errors.Join(errs...)
}

func buildCPUMeter(name string, logger *slog.Logger, cfg *config.Config) (CPUPowerMeter, error) {
switch name {
case "rapl": return NewCPUPowerMeter(...)
case "hwmon": return NewHwmonPowerMeter(...)
case "fake": return NewFakeCPUMeter(...)
default: return nil, fmt.Errorf("unknown cpu meter %q", name)
}
}
```

`CreateCPUMeter` owns the lifecycle (build, init, zones, error aggregation). `buildCPUMeter` is the dispatch. Adding a backend means a new case + a new constructor, which is straightforward.

Default value `["rapl", "hwmon"]` preserves the prior fallback behaviour. Operators make a selection via the config.

### GPU failure-mode split

As part of this unification, `gpu.Discover` is refactored to split the three failure modes and return `(meters, error)`:

| State | GPU result |
|------------------------------|-------------------------------------------------|
| Hardware works | `meters, nil` |
| Hardware absent on this node | `nil, nil` |
| Configured backend broken | `meters?, err` (per-vendor failures aggregated) |
| Feature off | `nil, nil` |

Result per scenario at startup:

| Scenario | Result |
|-------------------|--------------------------|
| Factory error | Real failure. Aggregate. |
| `Init()` error | Real failure. Aggregate. |
| Empty `Devices()` | Soft skip. Not an error. |
| Success | Append to result. |

Error aggregation uses `errors.Join`. `gpu.Discover` returns the joined error only when the GPU feature is explicitly enabled and every registered vendor returned a real failure. "No GPU on this node" stays `(nil, nil)`.

### Unified `cmd/kepler/main.go`

End result after the refactor:

```go
cpuMeter, err := device.CreateCPUMeter(logger, cfg)
if err != nil { ... }

gpuMeters, err := gpu.Discover(logger, cfg)
if err != nil { ... }
```

Symmetric shape (`(meter(s), error)`), different failure semantics (CPU strict, GPU soft).

## Detailed Design

### Package layout

```text
internal/device/
├── power_meter.go # PowerMeter interface (promoted from private)
├── cpu_power_meter.go # CPUPowerMeter, CreateCPUMeter, buildCPUMeter
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may have folder for cpu (same as what we have for gpu)

Copy link
Copy Markdown
Collaborator Author

@nikimanoledaki nikimanoledaki May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! I agree. I plan to do this in a follow-up PR 👍

I can update the proposal to reflect this

├── rapl_sysfs_power_meter.go # NewCPUPowerMeter (RAPL)
├── hwmon_power_meter.go # NewHwmonPowerMeter
├── fake_cpu_power_meter.go # NewFakeCPUMeter
└── gpu/
├── interface.go # GPUPowerMeter embeds device.PowerMeter, plus Vendor and GPUDevice types
├── registry.go # gpu.Register, gpu.Discover (refactored failure modes)
└── nvidia/ # init() calls gpu.Register(NVIDIA, ...)
```

No new file for CPU dispatch — `CreateCPUMeter` and `buildCPUMeter` live in `cpu_power_meter.go` next to the `CPUPowerMeter` interface they construct.

### Logging

Per-backend `Warn` on failure, `Info` on selection:

```text
INFO using rapl power meter
```

```text
WARN rapl not available, trying next backend error="..."
INFO using hwmon power meter
```

GPU keeps its existing per-vendor `Warn` plus a one-line summary:

```text
INFO gpu meter discovery ok=[nvidia] failed=[amd: factory: rocm-smi not found]
```

## Configuration

```yaml
cpu:
meters: ["rapl", "hwmon"] # ordered priority
```
Comment on lines +173 to +176
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the name cpu.meters doesnt convey that it is a ordered fallback chain. could we use cpu.meterPriority or cpu.fallbackChain ?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with fallbackChain field.
And, in addition to log, Kepler resource of kepler-operator can have status to present the active (selected) meter.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I can rename this field 👍

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take a look: #2471


Per-backend tuning keys (`rapl.zones`, `experimental.hwmon.zones`, `experimental.hwmon.chipRules`, `dev.fake-cpu-meter.zones`) are unchanged. Legacy selectors translate at startup to `cpu.meters` and emit a deprecation warning. See [Backward compatibility](#backward-compatibility) for the full migration.

GPU config does not change in this EP.

## Testing Strategy

* `device.CreateCPUMeter` table-driven tests: unknown name, factory error, `Init()` error, empty zones, success, ordered priority, all-fail aggregation.
* `gpu.Discover` table-driven tests: all-fail, mixed, all-empty, all-succeed; per-vendor failure modes (factory, `Init()`, empty `Devices()`).

## Backward compatibility

Default `cpu.meters: ["rapl", "hwmon"]` reproduces default behaviour. No breaking change for healthy systems running default config.

Two legacy selectors require migration. They translate at startup to an effective `cpu.meters` value, log a deprecation warning, and stop working in a future release:

* `experimental.hwmon.forceEnabled: true` (today: force hwmon, skip RAPL) → effective `cpu.meters: ["hwmon"]`.
* `dev.fake-cpu-meter.enabled: true` (today: use fake meter, skip RAPL/hwmon) → effective `cpu.meters: ["fake"]`.

Per-backend tuning keys are unchanged and remain valid: `rapl.zones`, `experimental.hwmon.zones`, `experimental.hwmon.chipRules`, `dev.fake-cpu-meter.zones`.

Operators on broken systems will see `kepler` exit with a clearer error when every CPU backend fails (the prior path also exits, but with the hwmon error only, which can be misleading).

GPU error semantics tighten. A node with `gpu.enabled=true` and a broken NVML driver will fail startup instead of silently running CPU-only. This matches the CPU contract. CPU-only is the default (`gpu.enabled=false`); the tightening only affects operators who explicitly opt in to GPU.

## Implementation

Single PR introduces:

1. Promote `device.PowerMeter` interface (`Name`, `Init`).
2. Add `cfg.Cpu.Meters` config field with default `["rapl", "hwmon"]`.
3. Add `device.CreateCPUMeter` + `buildCPUMeter` switch dispatch.
4. Add `Config.ApplyCpuMeterDeprecations` to translate the two legacy keys.
5. Replace `cmd/kepler/main.go`'s `createCPUMeter` / `createHwmonMeter` with a call to `device.CreateCPUMeter`.
6. Update sample configs and docs.

GPU failure-mode split lands as a follow-up PR — separable since CPU and GPU paths are independent in `cmd/kepler/main.go`.

## Risks and Mitigations

### Operational risk: stricter GPU error path (follow-up PR)

* **Risk**: Nodes with GPU explicitly enabled (`gpu.enabled=true`) and broken drivers fail startup where the prior code silently continued without GPU metrics.
* **Mitigation**: Matches the CPU contract. Default-off (`gpu.enabled=false`) means only opted-in operators are affected; they can revert to CPU-only by removing the explicit opt-in.

### Configuration risk: invalid `cpu.meters` value

* **Risk**: Operator typo (`"rappl"`) is rejected by `buildCPUMeter` with `unknown cpu meter "rappl"`.
* **Mitigation**: The error names the unknown backend and the operator's `cpu.meters` is preserved in the printed config at startup so the typo is visible in logs.

## Alternatives Considered

### Alternative 1: Registry pattern with `init()`-time registration

A `RegisterCPUMeter(name, factory)` API where each backend's source file calls it from `init()`, plus a `DiscoverCPU` that walks the registry. This is what GPU does today.

**Why rejected**: speculative architecture for three backends. The registry pattern's value doesn't apply.

The switch dispatch achieves the same operator outcome (config-driven priority) with less complexity. If other CPU backends are required, migrating each `case` to a `RegisterCPUMeter` call is possible and the `CPUPowerMeter` interface doesn't change.

### Alternative 2: Unify CPU and GPU under one per-device interface

This would define a common `Device` interface that both CPU zones and GPU devices implement.

This is not viable since CPU zones (logical, energy in µJ per zone) and GPU devices (physical, watts plus per-process attribution) have different shapes. A unifying interface would push hardware-specific concerns through `interface{}` or lose information.

### Alternative 3: Per-backend Prometheus error counter

This would expose `kepler_meter_failures_total{backend, stage}` instead of returning errors.

While this solves visibility for transient failures during runtime, it doesn't address the startup-time question this proposal targets — "which backend was selected and why." Operational metrics could be added orthogonally.

## Next Steps

Out of scope for this EP.

### GPU failure-mode split (separate PR)

Refactor `gpu.Discover` to split factory / `Init()` / empty-devices failure modes and return `(meters, error)`. Separate PR; independent of the CPU work.

### Operational metrics per backend

A small set of operational metrics per backend: discovery success/failure counters, duration. Useful for operators running heterogeneous fleets.

### Migrate to a registry pattern when warranted

If out-of-tree CPU backends or a plugin model becomes a real requirement (BMC, MSR, ARM-vendor counters arriving from external contributors), the switch in `buildCPUMeter` migrates to `RegisterCPUMeter` calls. The interface (`CPUPowerMeter`) and config (`cpu.meters`) don't change, so the migration is mechanical.
Loading