Commit 20b5276

author
benoit-cty
committed
Adastra
Handle power and energy_accumulator Adastra Adastra Doc
1 parent 6d69886 commit 20b5276

8 files changed

Lines changed: 703 additions & 7 deletions

codecarbon/core/gpu.py

Lines changed: 13 additions & 7 deletions
@@ -303,20 +303,26 @@ def _get_total_energy_consumption(self):
         energy_count = self._call_amdsmi_with_reinit(
             amdsmi.amdsmi_get_energy_count, self.handle
         )
+        energy_key = None
+        if "energy_accumulator" in energy_count:
+            energy_key = "energy_accumulator"
+        elif "power" in energy_count:
+            energy_key = "power"
+        if energy_key is None:
+            logger.warning(
+                f"Neither 'energy_accumulator' nor 'power' found in energy_count: {energy_count}"
+            )
+            return None
         # The amdsmi library returns a dict with energy counter and resolution
         # The counter is the actual accumulated value, resolution tells us how much each unit is worth
-        counter_value = energy_count.get("energy_accumulator", 0)
+        counter_value = energy_count.get(energy_key, 0)
         counter_resolution_uj = energy_count.get("counter_resolution", 0)
         if counter_value == 0 and counter_resolution_uj > 0:
             # In some cases, the energy_accumulator is 0 but it exist in the metrics info, try to get it from there as a fallback
             metrics_info = self._get_gpu_metrics_info()
-            counter_value = metrics_info.get("energy_accumulator", 0)
+            counter_value = metrics_info.get(energy_key, 0)
             logger.debug(
-                f"Energy accumulator value from metrics info : {counter_value} for GPU {self._gpu_name} with handle {self.handle} {metrics_info=}"
-            )
-
-        if counter_value == 0 or counter_resolution_uj == 0:
-            logger.warning(
+                f"Energy accumulator value from metrics info : {counter_value} for GPU handle {self.handle} {metrics_info=}"
                 f"Failed to retrieve AMD GPU energy accumulator. energy_count: {energy_count} {counter_value=} {counter_resolution_uj=}",
                 exc_info=True,
             )
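
The key-selection logic in this hunk can be read in isolation. Below is a minimal standalone sketch, not the code from `gpu.py`: `pick_energy_key` and `energy_millijoules` are hypothetical helper names, and the conversion from microjoules to millijoules is an assumption inferred from the `counter_resolution_uj` variable name.

```python
from typing import Optional


def pick_energy_key(energy_count: dict) -> Optional[str]:
    """Return the counter key to read, preferring 'energy_accumulator' over 'power'."""
    for key in ("energy_accumulator", "power"):
        if key in energy_count:
            return key
    return None


def energy_millijoules(energy_count: dict) -> Optional[float]:
    """Hypothetical helper: counter * resolution (assumed to be in uJ), converted to mJ."""
    key = pick_energy_key(energy_count)
    if key is None:
        return None
    counter = energy_count.get(key, 0)
    resolution_uj = energy_count.get("counter_resolution", 0)
    if counter == 0 or resolution_uj == 0:
        # Nothing usable was reported by the driver
        return None
    return counter * resolution_uj / 1000.0


if __name__ == "__main__":
    # Made-up sample dict, shaped like the energy_count dict read in the hunk above
    print(energy_millijoules({"power": 2000, "counter_resolution": 0.5}))
```

Returning `None` rather than 0 lets the caller tell "no reading available" apart from "no energy consumed".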

examples/slurm_rocm/README.md

Lines changed: 182 additions & 0 deletions
@@ -0,0 +1,182 @@
# CodeCarbon on CINES Adastra HPC with AMD ROCm

This project was provided with computing and storage resources by GENCI at CINES thanks to the grant AD010615147R1 on the supercomputer Adastra's MI250x/MI300 partition.

Thanks to this grant, we were able to develop and test AMD ROCm support in CodeCarbon and to provide this quick start guide, which helps other Adastra users monitor the carbon emissions of machine learning workloads running on AMD GPUs.

It was tested on Adastra, but it will likely work on any SLURM cluster with AMD GPUs and ROCm support.

## Quick Start Guide

Adastra security rules require users to connect from a fixed IP address. We chose to set up a small host in the cloud to act as a bastion server, which lets us connect to Adastra from anywhere without changing our IP address.

The Adastra architecture is fairly standard for an HPC cluster, with a login node and compute nodes. The login node has internet access and is the only one reachable from outside. The compute nodes run the GPU workloads and have no internet access.

The Python environment is set up on the login node and referenced by the compute nodes.

The job is submitted from the login node with `sbatch`, and the SLURM script takes care of loading the Python environment and running the code on the compute node.

If the `--time` option of `sbatch` is less than 30 minutes, the job is placed in the `debug` partition, which schedules faster but has a shorter maximum runtime.

### Export your configuration

Adapt the following environment variables to your own configuration. You can add them to your `.bashrc` or `.zshrc` for convenience.

```bash
export BASTION_IP="xx.xx.xx.xx"
export BASTION_USER="username"
export HPC_HOST="xx.xx.fr"
export HPC_PASS="xxxxx"
export PROJECT_ID="xxx"
export USER_NAME="username_hpc"
export HPC_PROJECT_FOLDER="/lus/home/xxx"
```

### Connect to CINES Adastra

```bash
sshpass -p "$HPC_PASS" ssh -J $BASTION_USER@$BASTION_IP $USER_NAME@$HPC_HOST
```

The first time, you may want to connect to each host in turn to debug any SSH issues before using `sshpass`:

```bash
ssh -o ServerAliveInterval=60 $BASTION_USER@$BASTION_IP
ssh -o ServerAliveInterval=60 $USER_NAME@$HPC_HOST
```

### Copy your code to Adastra

```bash
sshpass -p "$HPC_PASS" scp -r -J $BASTION_USER@$BASTION_IP /your/folder/* $USER_NAME@$HPC_HOST:$HPC_PROJECT_FOLDER
```

### Install CodeCarbon and dependencies

Be careful to install the version of `amdsmi` that is compatible with the ROCm version on Adastra. The latest version we used is `7.0.1`.

#### Simple installation


```bash
module load python/3.12
module load rocm/7.0.1

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
# Important: Adastra's MI250 runs ROCm 6.4.3 natively.
# With export ROCM_PATH=/opt/rocm-6.4.3 in our SLURM script, this python wheel perfectly matches the C library without symlink issues!
pip install amdsmi==7.0.1
pip install codecarbon
```

#### Use a branch of CodeCarbon with PyTorch

```bash
module load python/3.12
module load rocm/7.0.1
git clone https://github.com/mlco2/codecarbon.git
cd codecarbon
# If you want a specific version, use git checkout <tag> to switch to the desired version.
git checkout -b feat/rocm
python -m venv .venv
source .venv/bin/activate
python -V
# Must be 3.12.x
pip install --upgrade pip
# Important: Adastra's MI250 runs ROCm 6.4.3 natively.
# With export ROCM_PATH=/opt/rocm-6.4.3 in our SLURM script, this python wheel perfectly matches the C library without symlink issues!
pip install amdsmi==7.0.1
# Look at https://download.pytorch.org/whl/torch/ for the correct version matching your Python (cp312) and ROCm version.
# torch-2.10.0+rocm7.0-cp312-cp312-manylinux_2_28_x86_64.whl
pip3 install torch==2.10.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.0
pip install numpy

# Install CodeCarbon in editable mode to allow for live code changes without reinstallation
pip install -e .
```

### Submit a Job

**Option A: Using sbatch (recommended)**
```bash
sbatch examples/slurm_rocm/run_codecarbon_pytorch.slurm
```

### Monitor Job Status
```bash
# View running jobs
squeue -u $USER

# View job output
tail -f logs/<job_id>.out
```

## Troubleshooting


```
Error :
[codecarbon WARNING @ 10:28:46] AMD GPU detected but amdsmi is not properly configured. Please ensure amdsmi is correctly installed to get GPU metrics.Tips : check consistency between Python amdsmi package and ROCm versions, and ensure AMD drivers are up to date. Error: /opt/rocm/lib/libamd_smi.so: undefined symbol: amdsmi_get_cpu_affinity_with_scope
```

This means there is a mismatch between the `amdsmi` Python package and the ROCm version installed on Adastra. To fix it, install the `amdsmi` version that matches the ROCm version (e.g., `amdsmi==7.0.1` for ROCm 7.0.1).

```
KeyError: 'ROCM_PATH'
```
This means the `rocm` module is not loaded. Load it with `module load rocm/7.0.1`.

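Both problems above can be checked before submitting a job. The following is a minimal sketch, not part of CodeCarbon: the helper names are hypothetical, it assumes `ROCM_PATH` ends with a versioned directory such as `/opt/rocm-7.0.1`, and it assumes the wheel version should equal the ROCm version exactly, as in the example above.

```python
import os
from importlib import metadata
from typing import List, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a pip package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def rocm_version_from_path(rocm_path: Optional[str]) -> Optional[str]:
    """Extract '7.0.1' from a path such as '/opt/rocm-7.0.1' (assumed layout)."""
    if not rocm_path:
        return None
    name = os.path.basename(rocm_path.rstrip("/"))
    return name[len("rocm-"):] if name.startswith("rocm-") else None


def check_rocm_setup(env=os.environ) -> List[str]:
    """Collect human-readable problems instead of failing on the first one."""
    problems = []
    rocm_path = env.get("ROCM_PATH")
    if rocm_path is None:
        problems.append("ROCM_PATH is not set: load the rocm module first")
    wheel = installed_version("amdsmi")
    if wheel is None:
        problems.append("amdsmi Python package is not installed")
    rocm = rocm_version_from_path(rocm_path)
    if wheel and rocm and wheel != rocm:
        problems.append(f"amdsmi {wheel} does not match ROCm {rocm}")
    return problems


if __name__ == "__main__":
    for problem in check_rocm_setup():
        print("WARNING:", problem)
```

Run it on the login node after `module load` and after activating the virtual environment. An empty output means neither of the two issues was detected.
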
## Limitations and Future Work

The AMD Instinct MI250 accelerator card contains two Graphics Compute Dies (GCDs) per physical card. However, when monitoring energy consumption (e.g., via `rocm-smi` or tools like CodeCarbon), only one GCD reports power usage, while the other reports zero. This is a problem for accurate energy accounting, especially in HPC/SLURM environments where a job may be allocated a single GCD.

In that case, CodeCarbon displays a warning.

In future work, we plan to use `average_gfx_activity` to estimate the power of each GCD and report that estimate instead of 0.

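One possible way to turn activity into a per-GCD estimate is a proportional split. This is a sketch only, not the planned implementation: `split_power_by_activity` is a hypothetical function, and it assumes the reported power covers the whole card and that power scales linearly with `average_gfx_activity`, which is a rough approximation.

```python
from typing import List


def split_power_by_activity(card_power_w: float, activities: List[float]) -> List[float]:
    """Share the card-level power between GCDs in proportion to their activity.

    `activities` holds one average_gfx_activity value per GCD (any common unit).
    """
    if not activities:
        return []
    total = sum(activities)
    if total <= 0:
        # No activity information: fall back to an even split
        return [card_power_w / len(activities)] * len(activities)
    return [card_power_w * activity / total for activity in activities]


if __name__ == "__main__":
    # Made-up numbers: 300 W reported for the card, GCD 0 at 75 %, GCD 1 at 25 %
    print(split_power_by_activity(300.0, [75.0, 25.0]))
```

A linear split ignores idle power, so an estimate of this kind should be labeled as such in any report.
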
## Documentation

- [CINES Adastra GPU allocation](https://dci.dci-gitlab.cines.fr/webextranet/user_support/index.html#allocating-a-single-gpu)
- [CINES PyTorch on ROCm](https://dci.dci-gitlab.cines.fr/webextranet/software_stack/libraries/index.html#pytorch)
- [AMD SMI library](https://rocm.docs.amd.com/projects/amdsmi/en/latest/reference/amdsmi-py-api.html)


## Annex: Example of Job Details with scontrol

This trace was captured to help adapt `codecarbon/core/util.py` so that it correctly parses the SLURM job details and extracts the GPU and CPU allocation.

```
[$PROJECT_ID] $USER_NAME@login5:~/codecarbon$ scontrol show job 4687018
JobId=4687018 JobName=codecarbon-test
UserId=$USER_NAME(xxx) GroupId=grp_$USER_NAME(xxx) MCS_label=N/A
Priority=900000 Nice=0 Account=xxxxxx QOS=debug
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:24 TimeLimit=00:05:00 TimeMin=N/A
SubmitTime=2026-03-02T17:12:49 EligibleTime=2026-03-02T17:12:49
AccrueTime=2026-03-02T17:12:49
StartTime=2026-03-02T17:12:49 EndTime=2026-03-02T17:13:13 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2026-03-02T17:12:49 Scheduler=Main
Partition=mi250-shared AllocNode:Sid=login5:2553535
ReqNodeList=(null) ExcNodeList=(null)
NodeList=g1341
BatchHost=g1341
NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:1
ReqTRES=cpu=8,mem=29000M,node=1,billing=8,gres/gpu=1
AllocTRES=cpu=16,mem=29000M,energy=10211,node=1,billing=16,gres/gpu=1,gres/gpu:mi250x=1
Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=*
MinCPUsNode=8 MinMemoryNode=29000M MinTmpDiskNode=0
Features=MI250&DEBUG DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/lus/home/CT6/$PROJECT_ID/$USER_NAME/codecarbon/run_codecarbon.sh
WorkDir=/lus/home/CT6/$PROJECT_ID/$USER_NAME/codecarbon
AdminComment=Accounting=1
StdErr=/lus/home/CT6/$PROJECT_ID/$USER_NAME/codecarbon/logs/4687018.err
StdIn=/dev/null
StdOut=/lus/home/CT6/$PROJECT_ID/$USER_NAME/codecarbon/logs/4687018.out
TresPerNode=gres/gpu:1
TresPerTask=cpu=8
```

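A trace like this one can be parsed with plain string splitting. The sketch below is illustrative only and is not the code in `codecarbon/core/util.py`: `parse_scontrol` and `parse_tres` are hypothetical names, and the approach assumes that values contain no spaces, which holds for the trace above but not for every SLURM field.

```python
from typing import Dict


def parse_scontrol(text: str) -> Dict[str, str]:
    """Turn whitespace-separated 'Key=Value' tokens into a dict.

    Only the first '=' splits key from value, so 'AllocTRES=cpu=16,...'
    keeps its full comma-separated value.
    """
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def parse_tres(tres: str) -> Dict[str, str]:
    """Split 'cpu=16,mem=29000M,gres/gpu=1' into {'cpu': '16', ...}."""
    return dict(item.split("=", 1) for item in tres.split(",") if "=" in item)


if __name__ == "__main__":
    sample = (
        "JobId=4687018 JobName=codecarbon-test\n"
        "NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:1\n"
        "AllocTRES=cpu=16,mem=29000M,energy=10211,node=1,billing=16,"
        "gres/gpu=1,gres/gpu:mi250x=1\n"
    )
    job = parse_scontrol(sample)
    allocated = parse_tres(job["AllocTRES"])
    print("CPUs:", allocated["cpu"], "GPUs:", allocated["gres/gpu"])
```

In this trace `AllocTRES` reports 16 CPUs while `ReqTRES` requested 8, so the allocated values are the ones to read when attributing resources to a job.
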

examples/slurm_rocm/amdsmi_demo.py

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
#!/usr/bin/env python3

import amdsmi


def main():
    try:
        # Initialize AMD SMI
        amdsmi.amdsmi_init()

        # Get all GPU handles
        devices = amdsmi.amdsmi_get_processor_handles()

        if not devices:
            print("No AMD GPUs detected.")
            return

        for idx, device in enumerate(devices):
            print(f"\n===== GPU {idx} =====")

            # Get GPU metrics
            metrics = amdsmi.amdsmi_get_gpu_metrics_info(device)

            # Energy (microjoules)
            energy = metrics.get("energy_accumulator", None)

            # Power (watts, matching the labels printed below)
            avg_power = metrics.get("average_socket_power", None)
            cur_power = metrics.get("current_socket_power", None)

            print(f"Energy accumulator : {energy} uJ")
            print(f"Average socket power : {avg_power} W")
            print(f"Current socket power : {cur_power} W")

        amdsmi.amdsmi_shut_down()

    except Exception as e:
        print("Error:", str(e))


if __name__ == "__main__":
    main()
Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
#!/bin/bash
#SBATCH --account=cad15147
#SBATCH --constraint=MI250
#SBATCH --nodes=1
#SBATCH --time=0:25:00
#SBATCH --gpus-per-node=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --threads-per-core=1
#SBATCH --job-name=codecarbon-test
#SBATCH --output=logs/%j.out
#SBATCH --error=logs/%j.err

# Load AMD ROCm environment
module purge
module load cpe/24.07
module load python/3.12
module load rocm/7.0.1

# Print environment info
echo "=== Job Environment ==="
echo "Running on host: $(hostname)"
echo "at: $(date)"
echo "Job ID: $SLURM_JOB_ID"
echo "Number of GPUs: $SLURM_GPUS_PER_NODE"
echo "ROCR_VISIBLE_DEVICES: $ROCR_VISIBLE_DEVICES"
echo "HSA_OVERRIDE_GFX_VERSION: $HSA_OVERRIDE_GFX_VERSION"
echo "LD_LIBRARY_PATH: $LD_LIBRARY_PATH"
echo "PATH: $PATH"
export PYTHONPATH=/opt/rocm-7.0.1/share/amd_smi:$PYTHONPATH
echo "PYTHONPATH: $PYTHONPATH"

rocm-smi
rocm-smi --version
rocm-smi --showmetrics --json


# Create logs directory if it doesn't exist (SLURM opens logs/%j.out before this line runs, so also create it before submitting)
mkdir -p logs

# Install the amdsmi Python package, then print diagnostics
pip install amdsmi==7.0.1
echo "=== Installed AMD SMI Python Package ==="
python3 -m pip list | grep -E "(amd)"
echo "=== AMD SMI Metrics ==="
amd-smi -h
# Verify activation (optional)
echo "=== Python Version ==="
which python3
python3 --version
ls /opt/rocm-7.0.1/share/amd_smi
echo "=== ls /opt ==="
ls /opt
echo "=== Running Training Script ==="
srun python amdsmi_demo.py

examples/slurm_rocm/no_load.py

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
"""
Use CodeCarbon but without loading the AMD GPU.
pip install codecarbon
"""

import time

from codecarbon import track_emissions


@track_emissions(
    measure_power_secs=5,
    log_level="debug",
)
def train_model():
    """
    This function will do nothing.
    """
    print("10 seconds before ending script...")
    time.sleep(10)


if __name__ == "__main__":
    model = train_model()
