docs(proposal): add EP-003 for VM CPU Power Modeling for Kepler by kaiyiliu-strala · Pull Request #2291 · sustainable-computing-io/kepler

kaiyiliu-strala · 2025-08-25T11:25:48Z

Introduces an enhancement proposal for adding machine learning models to estimate Kepler power metrics in a virtual machine environment when hardware power measurement interfaces like RAPL are not available.

Signed-off-by: Kaiyi Liu <kaliu@redhat.com>

github-actions · 2025-08-25T11:32:34Z

🔆🔆🔆 Validating 🔆🔆🔆
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 Profiling reports are ready to be viewed

⚠️ Variability in pprof CPU and Memory profiles
When comparing pprof profiles of Kepler versions, expect variability in CPU and memory. Focus only on significant, consistent differences.

💻 CPU Comparison with base Kepler

File: kepler
Type: cpu
Time: 2025-08-25 11:29:51 UTC
Duration: 120s, Total samples = 450ms ( 0.37%)
Active filters:
   show=github.com/sustainable-computing-io
Showing nodes accounting for -70ms, 15.56% of 450ms total
      flat  flat%   sum%        cum   cum%
         0     0%     0%      -40ms  8.89%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).scheduleNextCollection.func1
     -30ms  6.67%  6.67%      -30ms  6.67%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*PowerCollector).collectProcessMetrics
         0     0%  6.67%      -30ms  6.67%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).calculatePower
         0     0%  6.67%      -30ms  6.67%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).refreshSnapshot
         0     0%  6.67%      -30ms  6.67%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).synchronizedPowerRefresh
         0     0%  6.67%      -30ms  6.67%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).synchronizedPowerRefresh.func1
     -10ms  2.22%  8.89%      -30ms  6.67%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).Refresh
         0     0%  8.89%      -20ms  4.44%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*PowerCollector).Collect
         0     0%  8.89%      -20ms  4.44%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).refreshProcesses
         0     0%  8.89%       10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).Snapshot
         0     0%  8.89%       10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).ensureFreshData
     -10ms  2.22% 11.11%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/monitor.(*TerminatedResourceTracker[go.shape.*uint8]).Add
      10ms  2.22%  8.89%       10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/monitor.newProcess (inline)
     -10ms  2.22% 11.11%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/resource.(*procFSReader).AllProcs
     -10ms  2.22% 13.33%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/resource.(*procWrapper).CPUTime
     -10ms  2.22% 15.56%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/resource.(*procWrapper).Cgroups
         0     0% 15.56%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).updateProcessCache
         0     0% 15.56%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/resource.computeTypeInfoFromProc.func1
         0     0% 15.56%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/resource.containerInfoFromProc
         0     0% 15.56%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/resource.populateProcessFields

💾 Memory Comparison with base Kepler (Inuse)

File: kepler
Type: inuse_space
Time: 2025-08-25 11:31:51 UTC
Duration: 120.01s, Total samples = 4780.43kB 
Active filters:
   show=github.com/sustainable-computing-io
Showing nodes accounting for -1027.99kB, 21.50% of 4780.43kB total
      flat  flat%   sum%        cum   cum%
         0     0%     0% -1027.99kB 21.50%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*PowerCollector).Collect
         0     0%     0%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).Snapshot
         0     0%     0%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).calculatePower
         0     0%     0%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).ensureFreshData
         0     0%     0%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).refreshSnapshot
         0     0%     0%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).synchronizedPowerRefresh
         0     0%     0%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).synchronizedPowerRefresh.func1
 -516.01kB 10.79% 10.79%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/resource.(*procFSReader).AllProcs
         0     0% 10.79%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).Refresh
         0     0% 10.79%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).refreshProcesses
 -511.98kB 10.71% 21.50%  -511.98kB 10.71%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*PowerCollector).collectProcessMetrics

💾 Memory Comparison with base Kepler (Alloc)

File: kepler
Type: alloc_space
Time: 2025-08-25 11:31:51 UTC
Duration: 120.01s, Total samples = 37784.13kB 
Active filters:
   show=github.com/sustainable-computing-io
Showing nodes accounting for -7187.87kB, 19.02% of 37784.13kB total
Dropped 2 nodes (cum <= 188.92kB)
      flat  flat%   sum%        cum   cum%
         0     0%     0% -3588.92kB  9.50%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).Refresh
         0     0%     0% -3588.92kB  9.50%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).refreshProcesses
         0     0%     0% -3581.06kB  9.48%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).calculatePower
         0     0%     0% -3581.06kB  9.48%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).refreshSnapshot
         0     0%     0% -3581.06kB  9.48%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).synchronizedPowerRefresh
         0     0%     0% -3581.06kB  9.48%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).synchronizedPowerRefresh.func1
         0     0%     0% -3069.68kB  8.12%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).Snapshot
-2571.04kB  6.80%  6.80% -2571.04kB  6.80%  github.com/sustainable-computing-io/kepler/internal/resource.(*procFSReader).CPUUsageRatio
         0     0%  6.80% -2571.04kB  6.80%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).Refresh.func3
         0     0%  6.80% -2571.04kB  6.80%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).refreshNode
-2560.90kB  6.78% 13.58% -2560.90kB  6.78%  github.com/sustainable-computing-io/kepler/internal/resource.(*procWrapper).CPUTime
         0     0% 13.58% -2560.90kB  6.78%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).updateProcessCache
         0     0% 13.58% -2560.90kB  6.78%  github.com/sustainable-computing-io/kepler/internal/resource.populateProcessFields
         0     0% 13.58% -2051.55kB  5.43%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).scheduleNextCollection.func1
         0     0% 13.58% -2045.54kB  5.41%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*PowerCollector).Collect
 -516.01kB  1.37% 14.95% -1540.17kB  4.08%  github.com/sustainable-computing-io/kepler/internal/monitor.(*Snapshot).Clone
         0     0% 14.95% -1529.51kB  4.05%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).ensureFreshData
-1028.02kB  2.72% 17.67% -1028.02kB  2.72%  github.com/sustainable-computing-io/kepler/internal/resource.(*procFSReader).AllProcs
         0     0% 17.67%  1026.38kB  2.72%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*cpuInfoCollector).Collect
 1026.38kB  2.72% 14.95%  1026.38kB  2.72%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*realProcFS).CPUInfo
 -512.02kB  1.36% 16.31% -1024.16kB  2.71%  github.com/sustainable-computing-io/kepler/internal/monitor.(*Process).Clone (inline)
-1024.16kB  2.71% 19.02% -1024.16kB  2.71%  github.com/sustainable-computing-io/kepler/internal/monitor.newProcess (inline)
 1024.06kB  2.71% 16.31%  1024.06kB  2.71%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*PowerCollector).collectNodeMetrics
         0     0% 16.31%     -514kB  1.36%  github.com/sustainable-computing-io/kepler/internal/resource.computeTypeInfoFromProc.func1
    -514kB  1.36% 17.67%     -514kB  1.36%  github.com/sustainable-computing-io/kepler/internal/resource.containerInfoFromCgroupPaths
         0     0% 17.67%     -514kB  1.36%  github.com/sustainable-computing-io/kepler/internal/resource.containerInfoFromProc
 -512.14kB  1.36% 19.02%  -512.14kB  1.36%  maps.Copy[go.shape.map[github.com/sustainable-computing-io/kepler/internal/device.EnergyZone]github.com/sustainable-computing-io/kepler/internal/monitor.Usage,go.shape.map[github.com/sustainable-computing-io/kepler/internal/device.EnergyZone]github.com/sustainable-computing-io/kepler/internal/monitor.Usage,go.shape.interface { Energy ; Index int; MaxEnergy github.com/sustainable-computing-io/kepler/internal/device.Energy; Name string; Path string },go.shape.struct { EnergyTotal github.com/sustainable-computing-io/kepler/internal/device.Energy; Power github.com/sustainable-computing-io/kepler/internal/device.Power }] (inline)

⬇️ Download the Profiling artifacts from the Actions Summary page

📦 Artifact name: profile-artifacts-2291

🔧 Or use GitHub CLI to download artifacts:

gh run download 17207440019 -n profile-artifacts-2291

vimalk78 · 2025-08-28T06:15:04Z

+
+## Problem Statement
+
+Virtual machines lack direct access to hardware power measurement interfaces (RAPL, IPMI, etc.) that are essential for energy monitoring in cloud and virtualized environments. Current Kepler deployments in VMs cannot provide accurate power consumption estimates because they cannot access the underlying hardware power consumption data. This creates a significant gap in energy monitoring capabilities for the growing virtualized infrastructure landscape.


Can we mention that providers usually disable the PMU for VMs, so VMs do not expose any performance counters either?

vimalk78 · 2025-08-28T09:13:25Z

+- **Primary Goal**: Develop zone-specific machine learning models (package, core, DRAM, uncore) for CPU power estimation in VMs
+- **Secondary Goal**: Create a production-ready deployment system for VM power models in Go environments
+- **Tertiary Goal**: Establish best practices for VM power modeling including CPU pinning and isolation requirements
+- **Performance Goal**: Achieve <10% mean absolute percentage error compared to baremetal measurements


Prefer RMS error over MAPE.
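
To make the difference between the two metrics concrete, here is a minimal sketch using made-up power readings (the sample values and helper names are illustrative assumptions, not Kepler data or Kepler code). MAPE divides each error by the measured value, so the same 1 W miss looks far worse at near-idle power than under load, while RMSE stays in watts.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (undefined if any actual value is 0)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error, in the same unit as the inputs (watts here)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical samples in watts: the prediction is off by 1 W in both cases.
actual_idle, pred_idle = [2.0] * 4, [3.0] * 4
actual_load, pred_load = [100.0] * 4, [101.0] * 4

print("idle: MAPE=%.1f%% RMSE=%.2f W" % (mape(actual_idle, pred_idle), rmse(actual_idle, pred_idle)))
print("load: MAPE=%.1f%% RMSE=%.2f W" % (mape(actual_load, pred_load), rmse(actual_load, pred_load)))
```

With these inputs the absolute error is identical in both cases, but MAPE is 50% at idle and 1% under load, which is one reason a percentage target can be misleading for low-power zones.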

vimalk78 · 2025-08-28T09:18:31Z

+### Functional Requirements
+
+- **FR1**: Train separate ML models for each power zone (package, core, DRAM, uncore)
+- **FR2**: Use only VM-accessible OS and memory counters as input features


The code seems to show it, but please add a section listing the features used for model training.

vimalk78 · 2025-08-28T10:00:37Z

+kepler_vm_last_training_timestamp{zone="package"} 1692984532
+```
+
+## Implementation Plan


Enhancement proposals typically do not contain an implementation plan.

vimalk78 · 2025-08-28T10:23:49Z

Please add some detail about:

- Model Architecture, mentioning model type (regression, neural network, xgboost etc.) with list of input features and feature engineering approach
- Training Data Requirements, mentioning the training data source, any preprocessing applied to the data, and training data size requirements
- How to detect and prevent model overfitting
- Hyperparameters
- Is there any dependence on the number of vCPUs? If not, why not? If yes, what does this mean for model training and selection? Consider a typical case of a 128-CPU baremetal machine running multiple VMs, some with 4 vCPUs and some with 8 vCPUs.
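
As a rough illustration of why the vCPU question matters, the arithmetic for the 128-CPU example can be sketched as below. The function names and the idea of normalising CPU time per vCPU are assumptions made for illustration; the proposal itself does not state how (or whether) features are normalised.

```python
HOST_CPUS = 128  # host size taken from the example in the comment


def vcpu_share(vcpus, host_cpus=HOST_CPUS):
    """Fraction of host CPU capacity a VM can use at most."""
    return vcpus / host_cpus


def per_vcpu_utilisation(cpu_seconds, interval_seconds, vcpus):
    """CPU time consumed in an interval, normalised to the 0..1 range per vCPU."""
    return cpu_seconds / (interval_seconds * vcpus)


for vcpus in (4, 8):
    # Both hypothetical VMs burn the same 10 CPU-seconds in a 5-second window.
    print(
        "vCPUs=%d share_of_host=%.5f per_vcpu_util=%.2f"
        % (vcpus, vcpu_share(vcpus), per_vcpu_utilisation(10.0, 5.0, vcpus))
    )
```

The same absolute CPU time yields a per-vCPU utilisation of 0.50 on the 4-vCPU VM and 0.25 on the 8-vCPU VM, so a model trained on ratio features and a model trained on absolute CPU time would see these two VMs differently.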

stephan-rayner · 2025-11-26

> Model Architecture, mentioning model type (regression, neural network, xgboost etc.) with list of input features and feature engineering approach

I think this would be highly data dependent. Dimensionality is a concern of course, but the linearity or non-linearity of the data is also important.

If the data is highly linear, then a neural network or a tree boosting approach like GBDT or GBM (xgboost wraps these) is overkill for our needs and some regression model would be efficient. If the data contains minor non-linearities, then an NN or XGBoost model may still be overkill, because an SVM with an RBF kernel could get you there and be more efficient (old school I know, but there is no school like the old school). If it is highly non-linear, then the NN or XGBoost is totally sensible. Also, if explainability is important, then NNs would struggle there.

The other three all relate to the first question and the data, so the short version is: it depends, and it may be too early to say with any confidence (especially if we don't have the EDA done).
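
One crude way to get a first read on linearity during EDA is to fit a straight line to a single feature and look at R². The sketch below assumes one input feature (CPU utilisation) and uses synthetic power curves; the data, function names, and any threshold applied to R² are illustrative assumptions, not results from the proposal.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys):
    """Share of variance in ys explained by the fitted straight line."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic data: utilisation in 0..1, power in watts.
util = [0.0, 0.25, 0.5, 0.75, 1.0]
linear_power = [10 + 40 * u for u in util]       # straight-line relationship
curved_power = [10 + 40 * u ** 3 for u in util]  # clearly non-linear relationship

print("linear R^2 = %.3f" % r_squared(util, linear_power))
print("curved R^2 = %.3f" % r_squared(util, curved_power))
```

An R² close to 1 on held-out data would suggest a plain regression is enough, while a noticeably lower value is a hint that a kernel method or boosted trees might be worth evaluating. A single-feature check like this is only a starting point for the EDA, not a substitute for it.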

github-actions · 2026-04-09T00:34:30Z

This PR is stale because it has been open 60 days with no activity.

laurall974 · 2026-04-22T07:12:55Z

@vimalk78 @KaiyiLiu1234 @sunya-ch I think this PR can be closed.

github-actions Bot added the docs Documentation changes label Aug 25, 2025

kaiyiliu-strala requested review from sthaha and sunya-ch August 25, 2025 11:46

vimalk78 reviewed Aug 28, 2025


This was referenced Oct 14, 2025

[Track] ensure compliance to the CNCF to get machine sponsorship from external #1914

Closed

Transparency of Kepler information especially the power models that it is using #1456

Closed

[Track] enable e2e process to add new training machine #1910

Closed

github-actions Bot added the stale Stale state - issue will be closed in 7 days label Apr 9, 2026

This was referenced Apr 17, 2026

docs(proposal): convert vm cpu model proposal to explore improvements #2463

Closed

docs(proposal): convert vm cpu model proposal to explore improvements #2464

Merged

kaiyiliu-strala wants to merge 1 commit into sustainable-computing-io:main from kaiyiliu-strala:vm-model-proposal


stephan-rayner Nov 26, 2025


4 participants

