Skip to content

Commit 55f6f4d

Browse files
feat(device): add cpu.meters config and switch-based CPU meter selection (#2468)
* feat(device): promote `PowerMeter` interface Promote the unexported `powerMeter` interface in `internal/device/power_meter.go` to an exported `PowerMeter` requiring `Name` and `Init` (via `service.Service` and `service.Initializer`). Backends with resources can opt into cleanup by implementing `service.Shutdowner`; the run framework already type-asserts it. `device.CPUPowerMeter` embeds `PowerMeter`. The fake CPU meter and `MockCPUPowerMeter` gain a no-op `Init()` to satisfy the interface; no behaviour change. Signed-off-by: nikimanoledaki <niki.manoledaki@grafana.com> * feat(device): add `cpu.meters` config and switch-based CPU meter selection `createCPUMeter` in `cmd/kepler/main.go` hardcoded the RAPL → hwmon fallback chain. This commit replaces it with a config-driven switch: - New `cfg.Cpu.Meters` config (default `["rapl", "hwmon"]`) lets operators reorder backend priority without forking. - New `device.CreateCPUMeter` walks the list, builds each backend, runs `Init()`, and returns the first that reports zones. Real failures aggregate via `errors.Join`; empty zones are a soft skip. - New `buildCPUMeter` is a switch over backend names — adding a new backend means adding one case and one constructor. - `cmd/kepler/main.go` calls `device.CreateCPUMeter` and drops the old `createCPUMeter` / `createHwmonMeter` helpers (~70 lines deleted). Two legacy keys translate to `cpu.meters` at startup with a deprecation warning, preserving operator behaviour: - `dev.fake-cpu-meter.enabled: true` → `cpu.meters: ["fake"]` - `experimental.hwmon.forceEnabled: true` → `cpu.meters: ["hwmon"]` Sample configs (`hack/config.yaml`, helm values, k8s configmap, dev compose) and `docs/user/configuration.md` updated. No behaviour change for healthy systems running default config. Signed-off-by: nikimanoledaki <niki.manoledaki@grafana.com> * fix(ci): unbreak yaml lint and fake-cpu-meter sed target Three CI fixes for PR #2468: - `yamllint`: drop redundant double-quotes in `meters: [rapl, hwmon]` across `hack/config.yaml`, `compose/dev/kepler-dev/etc/kepler/config.yaml`, and `manifests/helm/kepler/values.yaml`. Project's `.yamllint.yaml` enforces `quoted-strings: required: only-when-needed`. - `compose-deploy` / `build-and-deploy`: the CI runs `sed -i '/fake-cpu-meter:/{n;s/enabled: false/enabled: true/}'` to flip the fake meter on for tests. The previous commit added a multi-line deprecation comment between `fake-cpu-meter:` and `enabled: false`, which broke the sed (`n` jumped to the comment instead of `enabled:`). Move the deprecation note to a trailing comment on the same line as `enabled: false` in all four affected files. - `codecov/patch`: add `internal/device/cpu_power_meter_test.go` covering `CreateCPUMeter` and `buildCPUMeter` (success, unknown name, fallthrough, empty meters). Signed-off-by: nikimanoledaki <niki.manoledaki@grafana.com> * test(device): cover rapl/hwmon paths and error branches in CreateCPUMeter Extends `cpu_power_meter_test.go` to exercise the previously-uncovered new lines flagged by codecov/patch: - `buildCPUMeter` "rapl" case (with `cfg.Rapl.Zones` filter logging). - `buildCPUMeter` "hwmon" case with experimental config (zones + chipRules). - `buildCPUMeter` "hwmon" case with nil experimental config. - `CreateCPUMeter` factory-error path (rapl with bogus sysfs). - `CreateCPUMeter` Init-error path (hwmon with bogus sysfs). - `CreateCPUMeter` aggregated-error path (rapl + hwmon both fail). Tests use `/nonexistent/sysfs/path` to force factory or Init failures without requiring real hardware. Signed-off-by: nikimanoledaki <niki.manoledaki@grafana.com> * refactor(cpu): switch dispatch, structured logging, table tests Three polish changes on top of the switch-based design: - `ApplyCpuMeterDeprecations` rewritten as a `switch` for symmetry. The previous `if { ...; return } if { ... }` shape had a load-bearing early return that read as asymmetric. - `cpu_power_meter.go` logging uses static messages with a `"meter"` attr instead of `fmt.Sprintf`-baked dynamic strings. Mirrors the existing convention in `gpu/registry.go` and the rest of `internal/device/`, and gives operators a queryable `"meter"` attr that didn't exist before. - `config_cpu_test.go` collapses `TestCpuMetersDefault` and `TestApplyCpuMeterDeprecations` into a single `TestCpuMeters` table. The default case is one row where `setup` is a no-op. Drops the `io.Discard` boilerplate for `slog.DiscardHandler` (Go 1.24+). Signed-off-by: nikimanoledaki <niki.manoledaki@grafana.com> * fix(device): aggregate empty-zone soft-skips in CreateCPUMeter When all configured backends in `cpu.Meters` returned empty zones, the soft-skip path discarded that signal and the function returned a misleading "cpu.meters is empty" error. Append to `errs` on the empty-zones path so the joined error reports each skipped backend by name. Handle the truly empty-config case with an early return at the top of the function. Addresses review feedback on #2468. Signed-off-by: nikimanoledaki <niki.manoledaki@grafana.com> * feat(config): hard-fail on unknown `cpu.meters` entries Validate `cpu.meters` against the set of known backends in `Config.Validate` and return an explicit error listing the valid names. A typo such as `["rappl", "hwmon"]` is now caught at startup instead of silently falling through to the next backend. The `Warn` in `CreateCPUMeter` stays as defense-in-depth for callers that skip validation. Addresses review feedback on #2468. Signed-off-by: nikimanoledaki <niki.manoledaki@grafana.com> * chore(cpu): use `slog.DiscardHandler` and clarify legacy precedence - Switch `discardLogger()` in `cpu_power_meter_test.go` to `slog.New(slog.DiscardHandler)` to match `config_cpu_test.go`. Drops the unused `io` import. - Note in `ApplyCpuMeterDeprecations` that `fake` takes precedence over `hwmon` when both legacy keys are set. Addresses review feedback on #2468. Signed-off-by: nikimanoledaki <niki.manoledaki@grafana.com> --------- Signed-off-by: nikimanoledaki <niki.manoledaki@grafana.com>
1 parent bab38cb commit 55f6f4d

13 files changed

Lines changed: 567 additions & 95 deletions

File tree

cmd/kepler/main.go

Lines changed: 3 additions & 74 deletions
Original file line numberDiff line numberDiff line change
@@ -45,6 +45,8 @@ func main() {
4545
logVersionInfo(logger)
4646
printConfigInfo(logger, cfg)
4747

48+
cfg.ApplyCpuMeterDeprecations(logger)
49+
4850
services, err := createServices(logger, cfg)
4951
if err != nil {
5052
logger.Error("failed to create services", "error", err)
@@ -127,7 +129,7 @@ Configuration
127129

128130
func createServices(logger *slog.Logger, cfg *config.Config) ([]service.Service, error) {
129131
logger.Debug("Creating all services")
130-
cpuPowerMeter, err := createCPUMeter(logger, cfg)
132+
cpuPowerMeter, err := device.CreateCPUMeter(logger, cfg)
131133
if err != nil {
132134
return nil, fmt.Errorf("failed to create CPU power meter: %w", err)
133135
}
@@ -309,79 +311,6 @@ func createPrometheusExporter(
309311
return promExporter, nil
310312
}
311313

312-
func createCPUMeter(logger *slog.Logger, cfg *config.Config) (device.CPUPowerMeter, error) {
313-
// If fake meter is explicitly enabled, use it directly
314-
if fake := cfg.Dev.FakeCpuMeter; *fake.Enabled {
315-
return device.NewFakeCPUMeter(fake.Zones, device.WithFakeLogger(logger))
316-
}
317-
318-
// If hwmon is explicitly enabled, use it directly
319-
if cfg.IsFeatureEnabled(config.ExperimentalHwmonFeature) {
320-
logger.Info("hwmon explicitly enabled via config")
321-
return createHwmonMeter(logger, cfg)
322-
}
323-
324-
// Try RAPL first (default logic -> attempt RAPL, then attempt HWMON)
325-
if len(cfg.Rapl.Zones) > 0 {
326-
logger.Info("rapl zones are filtered", "zones-enabled", cfg.Rapl.Zones)
327-
}
328-
329-
raplMeter, err := device.NewCPUPowerMeter(
330-
cfg.Host.SysFS,
331-
device.WithRaplLogger(logger),
332-
device.WithZoneFilter(cfg.Rapl.Zones),
333-
)
334-
if err != nil {
335-
logger.Warn("RAPL not available, falling back to hwmon", "error", err)
336-
return createHwmonMeter(logger, cfg)
337-
}
338-
339-
// Verify RAPL can actually read energy data
340-
if initErr := raplMeter.Init(); initErr != nil {
341-
logger.Warn("RAPL initialization failed, falling back to hwmon", "error", initErr)
342-
return createHwmonMeter(logger, cfg)
343-
}
344-
345-
logger.Info("using RAPL power meter")
346-
return raplMeter, nil
347-
}
348-
349-
func createHwmonMeter(logger *slog.Logger, cfg *config.Config) (device.CPUPowerMeter, error) {
350-
var hwmonZones []string
351-
var chipRules []device.ConfigChipRule
352-
353-
if cfg.Experimental != nil {
354-
hwmon := cfg.Experimental.Hwmon
355-
356-
if len(hwmon.Zones) > 0 {
357-
logger.Info("hwmon zones are filtered", "zones-enabled", hwmon.Zones)
358-
hwmonZones = hwmon.Zones
359-
}
360-
361-
for _, cr := range hwmon.ChipRules {
362-
chipRules = append(chipRules, device.ConfigChipRule{
363-
Name: cr.Name,
364-
Pairings: cr.Pairings,
365-
SkipVoltages: cr.SkipVoltages,
366-
SkipCurrents: cr.SkipCurrents,
367-
UseSameIndex: cr.UseSameIndex,
368-
})
369-
}
370-
371-
if len(chipRules) > 0 {
372-
logger.Info("hwmon chip rules configured", "count", len(chipRules))
373-
}
374-
}
375-
376-
logger.Info("using hwmon power meter")
377-
return device.NewHwmonPowerMeter(
378-
cfg.Host.SysFS,
379-
device.WithHwmonLogger(logger),
380-
device.WithHwmonZoneFilter(hwmonZones),
381-
device.WithHwmonChipRules(chipRules),
382-
)
383-
}
384-
385314
// createGPUMeters discovers and initializes GPU power meters for all vendors.
386315
// Uses the registry pattern to support multiple GPU vendors (NVIDIA, AMD, Intel).
387316
// Returns empty slice if GPU is not enabled or no GPUs are available (soft-fail).

compose/dev/kepler-dev/etc/kepler/config.yaml

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -32,6 +32,12 @@ host:
3232
sysfs: /host/sys # Path to sysfs filesystem (default: /sys)
3333
procfs: /host/proc # Path to procfs filesystem (default: /proc)
3434

35+
cpu:
36+
# Ordered list of CPU power-meter backends. The first backend that
37+
# initializes successfully and reports zones is used.
38+
# Built-in backends: rapl, hwmon, fake.
39+
meters: [rapl, hwmon]
40+
3541
rapl:
3642
zones: [] # zones to be enabled, empty enables all default zones
3743

@@ -71,7 +77,7 @@ kube: # kubernetes related config
7177
# WARN DO NOT ENABLE THIS IN PRODUCTION - for development / testing only
7278
dev:
7379
fake-cpu-meter:
74-
enabled: false
80+
enabled: false # DEPRECATED: set cpu.meters: ["fake"] instead. Overrides cpu.meters and emits a deprecation warning when true.
7581
zones: [] # zones to be enabled, empty enables all default zones
7682

7783
# EXPERIMENTAL FEATURES - These features are experimental and may be unstable
@@ -84,7 +90,7 @@ experimental:
8490
nodeName: kepler-dev # Node name to use (overrides Kubernetes node name and hostname fallback)
8591
httpTimeout: 5s # HTTP client timeout for BMC requests (default: 5s)
8692
hwmon:
87-
forceEnabled: false # Force hwmon as the power meter, skipping RAPL auto-detection
93+
forceEnabled: false # DEPRECATED: set cpu.meters: ["hwmon"] instead. Emits a deprecation warning when true.
8894
zones: [] # List of zones to enable (default enable all)
8995
chipRules: [] # User-defined chip pairing rules (override/add to hardcoded defaults)
9096
gpu:

config/config.go

Lines changed: 55 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@ package config
66
import (
77
"fmt"
88
"io"
9+
"log/slog"
910
"net"
1011
"net/url"
1112
"os"
@@ -53,6 +54,13 @@ type (
5354
ProcFS string `yaml:"procfs"`
5455
}
5556

57+
// Cpu configuration controls how CPU power meters are selected.
58+
// Meters lists backends in priority order. The first backend that
59+
// initializes successfully and reports zones is used.
60+
Cpu struct {
61+
Meters []string `yaml:"meters"`
62+
}
63+
5664
// Rapl configuration
5765
Rapl struct {
5866
Zones []string `yaml:"zones"`
@@ -192,6 +200,7 @@ type (
192200
Log Log `yaml:"log"`
193201
Host Host `yaml:"host"`
194202
Monitor Monitor `yaml:"monitor"`
203+
Cpu Cpu `yaml:"cpu"`
195204
Rapl Rapl `yaml:"rapl"`
196205
Exporter Exporter `yaml:"exporter"`
197206
Web Web `yaml:"web"`
@@ -264,6 +273,9 @@ const (
264273
MonitorStaleness = "monitor.staleness" // not a flag
265274
MonitorMaxTerminatedFlag = "monitor.max-terminated"
266275

276+
// CPU
277+
CpuMeters = "cpu.meters" // not a flag
278+
267279
// RAPL
268280
RaplZones = "rapl.zones" // not a flag
269281

@@ -313,6 +325,9 @@ func DefaultConfig() *Config {
313325
SysFS: "/sys",
314326
ProcFS: "/proc",
315327
},
328+
Cpu: Cpu{
329+
Meters: []string{"rapl", "hwmon"},
330+
},
316331
Rapl: Rapl{
317332
Zones: []string{},
318333
},
@@ -358,6 +373,28 @@ func DefaultConfig() *Config {
358373
return cfg
359374
}
360375

376+
// ApplyCpuMeterDeprecations translates legacy CPU-meter selectors into an
377+
// effective cpu.meters value and logs a deprecation warning per translation.
378+
//
379+
// Legacy selectors:
380+
// - experimental.hwmon.forceEnabled=true → cpu.meters: ["hwmon"]
381+
// - dev.fake-cpu-meter.enabled=true → cpu.meters: ["fake"]
382+
//
383+
// Legacy selectors win over an explicit cpu.meters when set, since operators
384+
// who set them today expect the legacy behavior. When both legacy keys are set,
385+
// fake takes precedence over hwmon. The legacy keys will stop working in a
386+
// future release.
387+
func (c *Config) ApplyCpuMeterDeprecations(logger *slog.Logger) {
388+
switch {
389+
case ptr.Deref(c.Dev.FakeCpuMeter.Enabled, false):
390+
logger.Warn(`dev.fake-cpu-meter.enabled is deprecated; set cpu.meters: ["fake"] instead`)
391+
c.Cpu.Meters = []string{"fake"}
392+
case c.Experimental != nil && ptr.Deref(c.Experimental.Hwmon.ForceEnabled, false):
393+
logger.Warn(`experimental.hwmon.forceEnabled is deprecated; set cpu.meters: ["hwmon"] instead`)
394+
c.Cpu.Meters = []string{"hwmon"}
395+
}
396+
}
397+
361398
// Load loads configuration from an io.Reader
362399
func Load(r io.Reader) (*Config, error) {
363400
cfg := DefaultConfig()
@@ -786,6 +823,10 @@ func (c *Config) sanitize() {
786823
c.Web.ListenAddresses[i] = strings.TrimSpace(c.Web.ListenAddresses[i])
787824
}
788825

826+
for i := range c.Cpu.Meters {
827+
c.Cpu.Meters[i] = strings.TrimSpace(c.Cpu.Meters[i])
828+
}
829+
789830
for i := range c.Rapl.Zones {
790831
c.Rapl.Zones[i] = strings.TrimSpace(c.Rapl.Zones[i])
791832
}
@@ -875,6 +916,19 @@ func (c *Config) Validate(skips ...SkipValidation) error {
875916
}
876917
}
877918
}
919+
{ // cpu.meters
920+
// Keep this list in sync with the switch in internal/device/cpu_power_meter.go.
921+
validCpuMeters := map[string]bool{
922+
"rapl": true,
923+
"hwmon": true,
924+
"fake": true,
925+
}
926+
for _, name := range c.Cpu.Meters {
927+
if !validCpuMeters[name] {
928+
errs = append(errs, fmt.Sprintf("invalid cpu.meters entry %q, must be one of %q, %q, %q", name, "rapl", "hwmon", "fake"))
929+
}
930+
}
931+
}
878932
{ // Monitor
879933
if c.Monitor.Interval < 0 {
880934
errs = append(errs, fmt.Sprintf("invalid monitor interval: %s can't be negative", c.Monitor.Interval))
@@ -1076,6 +1130,7 @@ func (c *Config) manualString() string {
10761130
{MonitorIntervalFlag, c.Monitor.Interval.String()},
10771131
{MonitorStaleness, c.Monitor.Staleness.String()},
10781132
{MonitorMaxTerminatedFlag, fmt.Sprintf("%d", c.Monitor.MaxTerminated)},
1133+
{CpuMeters, strings.Join(c.Cpu.Meters, ", ")},
10791134
{RaplZones, strings.Join(c.Rapl.Zones, ", ")},
10801135
{ExporterStdoutEnabledFlag, fmt.Sprintf("%v", c.Exporter.Stdout.Enabled)},
10811136
{ExporterPrometheusEnabledFlag, fmt.Sprintf("%v", c.Exporter.Prometheus.Enabled)},

config/config_cpu_test.go

Lines changed: 147 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,147 @@
1+
// SPDX-FileCopyrightText: 2025 The Kepler Authors
2+
// SPDX-License-Identifier: Apache-2.0
3+
4+
package config
5+
6+
import (
7+
"log/slog"
8+
"strings"
9+
"testing"
10+
11+
"github.com/stretchr/testify/assert"
12+
"github.com/stretchr/testify/require"
13+
"k8s.io/utils/ptr"
14+
)
15+
16+
// TestCpuMeters covers cfg.Cpu.Meters defaults, deprecation translation, and
17+
// precedence. Each case sets up a Config and asserts the resulting Meters
18+
// after ApplyCpuMeterDeprecations runs. Log output is discarded; behaviour
19+
// is verified solely via cfg.Cpu.Meters.
20+
func TestCpuMeters(t *testing.T) {
21+
tests := []struct {
22+
name string
23+
setup func(*Config)
24+
want []string
25+
}{
26+
{
27+
name: "default: rapl then hwmon",
28+
setup: func(*Config) {},
29+
want: []string{"rapl", "hwmon"},
30+
},
31+
{
32+
name: "fake-cpu-meter overrides cpu.meters",
33+
setup: func(c *Config) {
34+
c.Dev.FakeCpuMeter.Enabled = ptr.To(true)
35+
},
36+
want: []string{"fake"},
37+
},
38+
{
39+
name: "hwmon forceEnabled overrides cpu.meters",
40+
setup: func(c *Config) {
41+
c.Experimental = &Experimental{}
42+
c.Experimental.Hwmon.ForceEnabled = ptr.To(true)
43+
},
44+
want: []string{"hwmon"},
45+
},
46+
{
47+
name: "fake wins when both legacy keys are set",
48+
setup: func(c *Config) {
49+
c.Dev.FakeCpuMeter.Enabled = ptr.To(true)
50+
c.Experimental = &Experimental{}
51+
c.Experimental.Hwmon.ForceEnabled = ptr.To(true)
52+
},
53+
want: []string{"fake"},
54+
},
55+
{
56+
name: "no legacy: explicit cpu.meters preserved",
57+
setup: func(c *Config) {
58+
c.Cpu.Meters = []string{"rapl"}
59+
},
60+
want: []string{"rapl"},
61+
},
62+
}
63+
64+
logger := slog.New(slog.DiscardHandler)
65+
66+
for _, tc := range tests {
67+
t.Run(tc.name, func(t *testing.T) {
68+
cfg := DefaultConfig()
69+
tc.setup(cfg)
70+
cfg.ApplyCpuMeterDeprecations(logger)
71+
assert.Equal(t, tc.want, cfg.Cpu.Meters)
72+
})
73+
}
74+
}
75+
76+
func TestCpuMetersValidation(t *testing.T) {
77+
tests := []struct {
78+
name string
79+
meters []string
80+
wantErr string
81+
}{
82+
{
83+
name: "all known backends",
84+
meters: []string{"rapl", "hwmon", "fake"},
85+
},
86+
{
87+
name: "unknown backend",
88+
meters: []string{"rappl"},
89+
wantErr: `invalid cpu.meters entry "rappl"`,
90+
},
91+
{
92+
name: "typo before valid backend still rejected",
93+
meters: []string{"rappl", "hwmon"},
94+
wantErr: `invalid cpu.meters entry "rappl"`,
95+
},
96+
{
97+
name: "empty list passes Validate (CreateCPUMeter handles emptiness)",
98+
meters: []string{},
99+
},
100+
}
101+
102+
for _, tc := range tests {
103+
t.Run(tc.name, func(t *testing.T) {
104+
cfg := DefaultConfig()
105+
cfg.Cpu.Meters = tc.meters
106+
err := cfg.Validate(SkipHostValidation)
107+
if tc.wantErr == "" {
108+
assert.NoError(t, err)
109+
return
110+
}
111+
require.Error(t, err)
112+
assert.Contains(t, err.Error(), tc.wantErr)
113+
})
114+
}
115+
}
116+
117+
func TestLoadCpuMetersFromYAML(t *testing.T) {
118+
tests := []struct {
119+
name string
120+
yamlData string
121+
want []string
122+
}{
123+
{
124+
name: "explicit hwmon-first",
125+
yamlData: "cpu:\n meters: [hwmon, rapl]\n",
126+
want: []string{"hwmon", "rapl"},
127+
},
128+
{
129+
name: "single backend",
130+
yamlData: "cpu:\n meters: [fake]\n",
131+
want: []string{"fake"},
132+
},
133+
{
134+
name: "empty list",
135+
yamlData: "cpu:\n meters: []\n",
136+
want: []string{},
137+
},
138+
}
139+
140+
for _, tc := range tests {
141+
t.Run(tc.name, func(t *testing.T) {
142+
cfg, err := Load(strings.NewReader(tc.yamlData))
143+
require.NoError(t, err)
144+
assert.Equal(t, tc.want, cfg.Cpu.Meters)
145+
})
146+
}
147+
}

0 commit comments

Comments
 (0)