docs(proposal): add EP-003 for VM CPU Power Modeling for Kepler #2291
kaiyiliu-strala wants to merge 1 commit into
Introduces an enhancement proposal for adding machine learning models to estimate Kepler power metrics in a virtual machine environment when hardware power measurement interfaces such as RAPL are not available.
Signed-off-by: Kaiyi Liu <kaliu@redhat.com>
Profiling report: CPU comparison with base Kepler; memory comparison with base Kepler (in-use and alloc).
Download the profiling artifacts from the Actions Summary page, or use the GitHub CLI: gh run download 17207440019 -n profile-artifacts-2291
## Problem Statement

Virtual machines lack direct access to hardware power measurement interfaces (RAPL, IPMI, etc.) that are essential for energy monitoring in cloud and virtualized environments. Current Kepler deployments in VMs cannot provide accurate power consumption estimates because they cannot access the underlying hardware power consumption data. This creates a significant gap in energy monitoring capabilities for the growing virtualized infrastructure landscape.
Can we mention that the PMU is usually disabled for VMs by providers, so VMs do not expose any performance counters either?
- **Primary Goal**: Develop zone-specific machine learning models (package, core, DRAM, uncore) for CPU power estimation in VMs
- **Secondary Goal**: Create a production-ready deployment system for VM power models in Go environments
- **Tertiary Goal**: Establish best practices for VM power modeling including CPU pinning and isolation requirements
- **Performance Goal**: Achieve <10% mean absolute percentage error compared to bare-metal measurements
prefer RMS error over MAPE
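To make the difference between the two metrics concrete, here is a minimal sketch in Python. The power readings are hypothetical values chosen for illustration, not measurements from Kepler, and the helper names `mape` and `rmse` are assumptions, not part of the proposal. The same 1 W error gives a 50% MAPE near idle but only 1% under load, while RMSE reports 1 W in both cases.

```python
import math

def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Undefined if any actual value is 0."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root-mean-square error, in the same unit as the inputs (watts here)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical package-power readings in watts (illustrative only, not measured data).
idle_actual, idle_pred = [2.0, 2.0], [3.0, 1.0]          # predictions off by 1 W near idle
busy_actual, busy_pred = [100.0, 100.0], [101.0, 99.0]   # predictions off by 1 W under load

print("idle: MAPE=%.1f%% RMSE=%.2f W" % (mape(idle_actual, idle_pred), rmse(idle_actual, idle_pred)))
print("busy: MAPE=%.1f%% RMSE=%.2f W" % (mape(busy_actual, busy_pred), rmse(busy_actual, busy_pred)))
```

Because VM power zones can sit close to zero at idle, a percentage-based target can be dominated by low-power samples; an absolute metric in watts avoids that, at the cost of being harder to compare across machines of different sizes.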
### Functional Requirements

- **FR1**: Train separate ML models for each power zone (package, core, DRAM, uncore)
- **FR2**: Use only VM-accessible OS and memory counters as input features
The code seems to show it, but please add a section listing the features used for model training.
```
kepler_vm_last_training_timestamp{zone="package"} 1692984532
```

## Implementation Plan
Typically, enhancement proposals do not contain an implementation plan.
Please add some detail about:
- Model architecture: the model type (regression, neural network, XGBoost, etc.), the list of input features, and the feature engineering approach
- Training data requirements: the training data source, any preprocessing of the data, and training data size requirements
- How to detect and prevent model overfitting
- Hyperparameters

Is there any dependence on the number of vCPUs? If not, why not? If yes, what does this mean for model training and selection?
Consider a typical case of a 128-CPU bare-metal machine running multiple VMs, some with 4 vCPUs and some with 8 vCPUs.
> Model Architecture, mentioning model type (regression, neural network, xgboost etc.) with list of input features and feature engineering approach

I think this would be highly data dependent. Dimensionality is a concern of course, but the linearity or non-linearity of the data is also important.
If the data is highly linear, then a neural network or a tree boosting approach like GBDT or GBM (XGBoost wraps these) is overkill for our needs and a regression model would be sufficient. If the data contains minor non-linearities, then an NN or XGBoost may still be overkill, because an SVM with an RBF kernel could get you there and be more efficient (old school I know, but there is no school like the old school). If it is highly non-linear, then the NN or XGBoost is totally sensible. Also, if explainability is important, then NNs would struggle there.
The other three all relate to the first question and the data, so the short version is: it depends, and it may be too early to say with any confidence (especially if we don't have the EDA done).
This PR is stale because it has been open 60 days with no activity.