|
| 1 | +# CodeCarbon on CINES Adastra HPC with AMD ROCm |
| 2 | + |
| 3 | +This project was provided with computing and storage resources by GENCI at CINES thanks to the grant AD010615147R1 on the supercomputer Adastra's MI250x/MI300 partition. |
| 4 | + |
| 5 | +Thanks to this grant we were able to develop and test the AMD ROCM support in CodeCarbon, and provide this quick start guide to help other users of Adastra HPC to easily monitor the carbon emissions of their machine learning workloads running on AMD GPUs. |
| 6 | + |
| 7 | +This guide was tested on Adastra, but it will likely work on any SLURM cluster with AMD GPUs and ROCm support. |
| 8 | + |
| 9 | +## Quick Start Guide |
| 10 | + |
| 11 | +Adastra security rules require users to connect from a fixed IP address. We chose to set up a small cloud host as a bastion server, which lets us connect to Adastra from anywhere without needing to change our IP address. |
| 12 | + |
| 13 | +Adastra's architecture is fairly standard for an HPC cluster, with a login node and compute nodes. The login node has internet access and is the only one reachable from outside; the compute nodes run the GPU workloads and have no internet access. |
| 14 | + |
| 15 | +The Python environment is set up on the login node and referenced by the compute nodes. |
| 16 | + |
| 17 | +The job is submitted from the login node using `sbatch`, and the SLURM script takes care of loading the Python environment and running the code on the compute node. |
| 18 | + |
| 19 | +If the `--time` option of `sbatch` is less than 30 minutes, the job is placed in the `debug` partition, which offers faster scheduling but a shorter maximum runtime. |
| 20 | + |
| 21 | +### Export your configuration |
| 22 | + |
| 23 | +Adapt the following environment variables with your own configuration. You can add them to your `.bashrc` or `.zshrc` for convenience. |
| 24 | + |
| 25 | +```bash |
| 26 | +export BASTION_IP="xx.xx.xx.xx" |
| 27 | +export BASTION_USER="username" |
| 28 | +export HPC_HOST="xx.xx.fr" |
| 29 | +export HPC_PASS="xxxxx" |
| 30 | +export PROJECT_ID="xxx" |
| 31 | +export USER_NAME="username_hpc" |
| 32 | +export HPC_PROJECT_FOLDER="/lus/home/xxx" |
| 33 | +``` |
| 34 | + |
| 35 | +### Connect to CINES Adastra |
| 36 | + |
| 37 | +```bash |
| 38 | +sshpass -p "$HPC_PASS" ssh -J $BASTION_USER@$BASTION_IP $USER_NAME@$HPC_HOST |
| 39 | +``` |
| 40 | + |
| 41 | +The first time, you may want to connect to each host in turn to debug any SSH issues before using `sshpass`: |
| 42 | + |
| 43 | +```bash |
| 44 | +ssh -o ServerAliveInterval=60 $BASTION_USER@$BASTION_IP |
| 45 | +ssh -o ServerAliveInterval=60 $USER_NAME@$HPC_HOST |
| 46 | +``` |
| 47 | + |
| 48 | +### Copy your code to Adastra |
| 49 | + |
| 50 | +```bash |
| 51 | +sshpass -p "$HPC_PASS" scp -r -J $BASTION_USER@$BASTION_IP /your/folder/* $USER_NAME@$HPC_HOST:$HPC_PROJECT_FOLDER |
| 52 | +``` |
| 53 | + |
| 54 | +### Install CodeCarbon and dependencies |
| 55 | + |
| 56 | +Be careful to install the version of `amdsmi` that is compatible with the ROCm version on Adastra. The latest version we used is `7.0.1`. |
| 57 | + |
| 58 | +#### Simple installation |
| 59 | + |
| 60 | + |
| 61 | +```bash |
| 62 | +module load python/3.12 |
| 63 | +module load rocm/7.0.1 |
| 64 | + |
| 65 | +python -m venv .venv |
| 66 | +source .venv/bin/activate |
| 67 | +pip install --upgrade pip |
| 68 | +# Important: Adastra's MI250 runs ROCm 6.4.3 natively. |
| 69 | +# With export ROCM_PATH=/opt/rocm-6.4.3 in our SLURM script, this python wheel perfectly matches the C library without symlink issues! |
| 70 | +pip install amdsmi==7.0.1 |
| 71 | +pip install codecarbon |
| 72 | +``` |
| 73 | + |
| 74 | +#### Use a branch of CodeCarbon with PyTorch |
| 75 | + |
| 76 | +```bash |
| 77 | +module load python/3.12 |
| 78 | +module load rocm/7.0.1 |
| 79 | +git clone https://github.com/mlco2/codecarbon.git |
| 80 | +cd codecarbon |
| 81 | +# Check out the ROCm branch. If you want a specific version, use git checkout <tag> instead. |
| 82 | +git checkout feat/rocm |
| 83 | +python -m venv .venv |
| 84 | +source .venv/bin/activate |
| 85 | +python -V |
| 86 | +# Must be 3.12.x |
| 87 | +pip install --upgrade pip |
| 88 | +# Important: Adastra's MI250 runs ROCm 6.4.3 natively. |
| 89 | +# With export ROCM_PATH=/opt/rocm-6.4.3 in our SLURM script, this python wheel perfectly matches the C library without symlink issues! |
| 90 | +pip install amdsmi==7.0.1 |
| 91 | +# Look at https://download.pytorch.org/whl/torch/ for the correct version matching your Python (cp312) and ROCM version. |
| 92 | +# torch-2.10.0+rocm7.0-cp312-cp312-manylinux_2_28_x86_64.whl |
| 93 | +pip3 install torch==2.10.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.0 |
| 94 | +pip install numpy |
| 95 | + |
| 96 | +# Install CodeCarbon in editable mode to allow for live code changes without reinstallation |
| 97 | +pip install -e . |
| 98 | +``` |
| 99 | + |
| 100 | +### Submit a Job |
| 101 | + |
| 102 | +**Option A: Using sbatch (recommended)** |
| 103 | +```bash |
| 104 | +sbatch examples/slurm_rocm/run_codecarbon_pytorch.slurm |
| 105 | +``` |
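
The contents of `examples/slurm_rocm/run_codecarbon_pytorch.slurm` are not reproduced in this guide. As a rough sketch only, the snippet below writes a minimal job script whose resource requests mirror the `scontrol` trace in the annex (1 task, 8 CPUs per task, 1 GPU, 5-minute time limit, `MI250` feature). The file name `job_sketch.slurm`, the `--account` placeholder, the `--constraint` value, and the training script `train.py` are assumptions for illustration; adapt them to your project.

```bash
# Slurm does not create the directory used by --output/--error, so create it first.
mkdir -p logs

# Write a minimal job script (hypothetical file name).
cat > job_sketch.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=codecarbon-test
# Placeholder account: replace xxx with your project ID.
#SBATCH --account=xxx
# Assumption: constraint inferred from Features=MI250 in the annex trace.
#SBATCH --constraint=MI250
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1
#SBATCH --time=00:05:00
#SBATCH --output=logs/%j.out
#SBATCH --error=logs/%j.err

# Same modules and virtual environment as in the installation step.
module load python/3.12
module load rocm/7.0.1
source .venv/bin/activate

# train.py is a hypothetical script that uses CodeCarbon.
srun python train.py
EOF

# Submit it from the login node:
# sbatch job_sketch.slurm
```

The `%j` pattern in the output paths is replaced by the job ID, which matches the `logs/<job_id>.out` file used when monitoring the job.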
| 106 | + |
| 107 | +### Monitor Job Status |
| 108 | +```bash |
| 109 | +# View running jobs |
| 110 | +squeue -u $USER |
| 111 | + |
| 112 | +# View job output |
| 113 | +tail -f logs/<job_id>.out |
| 114 | +``` |
| 115 | + |
| 116 | +## Troubleshooting |
| 117 | + |
| 118 | + |
| 119 | +``` |
| 120 | +Error : |
| 121 | +[codecarbon WARNING @ 10:28:46] AMD GPU detected but amdsmi is not properly configured. Please ensure amdsmi is correctly installed to get GPU metrics.Tips : check consistency between Python amdsmi package and ROCm versions, and ensure AMD drivers are up to date. Error: /opt/rocm/lib/libamd_smi.so: undefined symbol: amdsmi_get_cpu_affinity_with_scope |
| 122 | +``` |
| 123 | + |
| 124 | +This means there is a mismatch between the `amdsmi` Python package and the ROCm version installed on Adastra. To fix this, install the version of `amdsmi` that matches the ROCm version (e.g., `amdsmi==7.0.1` for ROCm 7.0.1). |
| 125 | + |
| 126 | +``` |
| 127 | +KeyError: 'ROCM_PATH' |
| 128 | +``` |
| 129 | +This means the `rocm` module is not loaded. Load it with `module load rocm/7.0.1`. |
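
Both problems can be checked before submitting a job. The helpers below are a minimal sketch, not part of CodeCarbon: `rocm_path_is_set` and `same_major_minor` are hypothetical names, and comparing only the major.minor part of two versions is a coarse heuristic assumed here for illustration, not an official compatibility rule.

```bash
# Return 0 if ROCM_PATH is set and non-empty, otherwise print a hint and return 1.
rocm_path_is_set() {
    if [ -z "${ROCM_PATH:-}" ]; then
        echo "ROCM_PATH is not set; run: module load rocm/7.0.1" >&2
        return 1
    fi
}

# Return 0 when two version strings share the same major.minor prefix.
# Assumption: this is only a rough indicator of amdsmi/ROCm compatibility.
same_major_minor() {
    a=$(printf '%s\n' "$1" | cut -d. -f1,2)
    b=$(printf '%s\n' "$2" | cut -d. -f1,2)
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example usage on the login node, after loading the modules and the venv:
# rocm_path_is_set
# installed=$(pip show amdsmi | awk '/^Version:/ {print $2}')
# same_major_minor "$installed" 7.0.1 || echo "amdsmi $installed may not match ROCm 7.0.1"
```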
| 130 | + |
| 131 | +## Limitations and Future Work |
| 132 | + |
| 133 | +The AMD Instinct MI250 accelerator card contains two Graphics Compute Dies (GCDs) per physical card. However, when monitoring energy consumption (e.g., via rocm-smi or tools like CodeCarbon), only one GCD reports power usage, while the other shows zero values. This is problematic for accurate energy accounting, especially in HPC/SLURM environments where jobs may be allocated a single GCD. |
| 134 | + |
| 135 | +In that case, CodeCarbon displays a warning. |
| 136 | + |
| 137 | +In future work, we plan to use `average_gfx_activity` to estimate the power drawn by each of the two GCDs and report that estimate instead of 0. |
| 138 | + |
| 139 | +## Documentation |
| 140 | + |
| 141 | +- [CINES Adastra GPU allocation](https://dci.dci-gitlab.cines.fr/webextranet/user_support/index.html#allocating-a-single-gpu) |
| 142 | +- [CINES PyTorch on ROCM](https://dci.dci-gitlab.cines.fr/webextranet/software_stack/libraries/index.html#pytorch) |
| 143 | +- [AMD SMI library](https://rocm.docs.amd.com/projects/amdsmi/en/latest/reference/amdsmi-py-api.html) |
| 144 | + |
| 145 | + |
| 146 | +## Annex: Example of Job Details with scontrol |
| 147 | + |
| 148 | +This trace was obtained to adapt `codecarbon/core/util.py` to properly parse the SLURM job details and extract the relevant information about GPU and CPU allocation. |
| 149 | + |
| 150 | +``` |
| 151 | +[$PROJECT_ID] $USER_NAME@login5:~/codecarbon$ scontrol show job 4687018 |
| 152 | +JobId=4687018 JobName=codecarbon-test |
| 153 | + UserId=$USER_NAME(xxx) GroupId=grp_$USER_NAME(xxx) MCS_label=N/A |
| 154 | + Priority=900000 Nice=0 Account=xxxxxx QOS=debug |
| 155 | + JobState=COMPLETED Reason=None Dependency=(null) |
| 156 | + Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 |
| 157 | + RunTime=00:00:24 TimeLimit=00:05:00 TimeMin=N/A |
| 158 | + SubmitTime=2026-03-02T17:12:49 EligibleTime=2026-03-02T17:12:49 |
| 159 | + AccrueTime=2026-03-02T17:12:49 |
| 160 | + StartTime=2026-03-02T17:12:49 EndTime=2026-03-02T17:13:13 Deadline=N/A |
| 161 | + SuspendTime=None SecsPreSuspend=0 LastSchedEval=2026-03-02T17:12:49 Scheduler=Main |
| 162 | + Partition=mi250-shared AllocNode:Sid=login5:2553535 |
| 163 | + ReqNodeList=(null) ExcNodeList=(null) |
| 164 | + NodeList=g1341 |
| 165 | + BatchHost=g1341 |
| 166 | + NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:1 |
| 167 | + ReqTRES=cpu=8,mem=29000M,node=1,billing=8,gres/gpu=1 |
| 168 | + AllocTRES=cpu=16,mem=29000M,energy=10211,node=1,billing=16,gres/gpu=1,gres/gpu:mi250x=1 |
| 169 | + Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=* |
| 170 | + MinCPUsNode=8 MinMemoryNode=29000M MinTmpDiskNode=0 |
| 171 | + Features=MI250&DEBUG DelayBoot=00:00:00 |
| 172 | + OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) |
| 173 | + Command=/lus/home/CT6/$PROJECT_ID/$USER_NAME/codecarbon/run_codecarbon.sh |
| 174 | + WorkDir=/lus/home/CT6/$PROJECT_ID/$USER_NAME/codecarbon |
| 175 | + AdminComment=Accounting=1 |
| 176 | + StdErr=/lus/home/CT6/$PROJECT_ID/$USER_NAME/codecarbon/logs/4687018.err |
| 177 | + StdIn=/dev/null |
| 178 | + StdOut=/lus/home/CT6/$PROJECT_ID/$USER_NAME/codecarbon/logs/4687018.out |
| 179 | + TresPerNode=gres/gpu:1 |
| 180 | + TresPerTask=cpu=8 |
| 181 | +``` |
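
The allocation fields in this trace can be pulled out with standard text tools. As a minimal sketch (not the parsing logic used in `codecarbon/core/util.py`), the hypothetical helper `alloc_tres_value` below reads `scontrol show job` output on standard input and prints the value of one `AllocTRES` key.

```bash
# Print the value of one key (e.g. "cpu" or "gres/gpu") from the AllocTRES
# line of `scontrol show job` output read on stdin.
alloc_tres_value() {
    # 1. Keep only the AllocTRES line and drop its "AllocTRES=" prefix.
    # 2. Split the comma-separated key=value pairs, one per line.
    # 3. Print the value whose key matches exactly, then stop.
    sed -n 's/^[[:space:]]*AllocTRES=//p' \
        | tr ',' '\n' \
        | awk -F'=' -v k="$1" '$1 == k { print $2; exit }'
}

# Example usage inside a job, where Slurm sets SLURM_JOB_ID:
# scontrol show job "$SLURM_JOB_ID" | alloc_tres_value gres/gpu
```

For the trace above, `alloc_tres_value cpu` prints `16` and `alloc_tres_value gres/gpu` prints `1`. The exact key match keeps `gres/gpu` from being confused with `gres/gpu:mi250x`.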
| 182 | + |