Commit 7fb7f44

elelaysh and sjpb authored
Switch to cgroup for job control and associated settings (#205)
* Change default ProctrackType, SelectTypeParameters,TaskPlugin

  - switching ProctrackType to proctrack/cgroup, which is the recommended setting (was disabled for CI)
  - Enabling TaskPlugins task/cgroup,task/affinity to bind jobs to allocated cpuset
  - SelectTypeParameters=CR_Core_Memory to also limit memory

* test cgroup enabled

* Enable the cpuset controller in the VM

  enable cpuset for whole user.slice hierarchy

* Running podman pull in RL8 tests to create the missing session.slice in cgroup. So that we can enable the cpuset controller for it.

* Switch to jobacct_gather/cgroup

* [skipci] molecule update readme

* Go back to SelectTypeParameters=CR_Core

* Wait for job to be running before checking cgroup properties

* molecule: add name to a task

  Co-authored-by: Steve Brasier <33413598+sjpb@users.noreply.github.com>

* molecule: remove cgroup memory limit test

---------

Co-authored-by: Steve Brasier <33413598+sjpb@users.noreply.github.com>
1 parent 9177e8b commit 7fb7f44

4 files changed

Lines changed: 49 additions & 5 deletions

.github/workflows/ci.yml

Lines changed: 20 additions & 0 deletions
@@ -79,6 +79,10 @@ jobs:
         working-directory: molecule/images
         if: matrix.image == 'localhost/rocky9systemd'
 
+      - name: Load rocky8 container image
+        run: podman pull ${{ matrix.image }}
+        if: matrix.image != 'localhost/rocky9systemd'
+
       - name: Set up Python 3.
         uses: actions/setup-python@v4
         with:
@@ -99,6 +103,22 @@ jobs:
       - name: Create ansible.cfg with correct roles_path
         run: printf '[defaults]\nroles_path=../' >ansible.cfg
 
+      - name: Enable the cpuset controller
+        run: |
+          echo "======== BEFORE ========"
+          find /sys/fs/cgroup/ -name cgroup.subtree_control -exec grep -H '' '{}' ';'
+          echo "==== CHANGING CPUSET ==="
+          set -x
+          echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/cgroup.subtree_control
+          echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/user-1001.slice/cgroup.subtree_control
+          echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/cgroup.subtree_control
+          echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/session.slice/cgroup.subtree_control
+          echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/cgroup.subtree_control
+          echo +cpuset | sudo tee /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/app.slice/cgroup.subtree_control
+          set +x
+          echo "======= CHECKING ======="
+          find /sys/fs/cgroup/ -name cgroup.subtree_control -exec grep -H '' '{}' ';'
+
       - name: Run Molecule tests.
         run: molecule test -s ${{ matrix.scenario }}
         env:
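
The six near-identical `tee` lines in the new CI step follow one pattern: write `+cpuset` to `cgroup.subtree_control` at each level of the hierarchy, parent before child. A minimal sketch of that pattern as a loop is below. `enable_controller` and the `SUDO` variable are hypothetical names introduced here, not part of the commit, and skipping a cgroup that does not exist is an added assumption rather than what the workflow does.

```shell
# Hypothetical helper, not part of the commit.
# Usage: enable_controller <cgroup_root> <controller> <relative_cgroup>...
# Cgroups are processed in the order given, so list parents before children.
enable_controller() {
    local root="$1" controller="$2" rel ctl
    shift 2
    for rel in "$@"; do
        ctl="$root/$rel/cgroup.subtree_control"
        if [ ! -e "$ctl" ]; then
            # Assumption: a missing cgroup is reported and skipped.
            echo "skipping missing cgroup: $rel" >&2
            continue
        fi
        # SUDO is empty by default; set SUDO=sudo when writing to /sys/fs/cgroup.
        echo "+$controller" | ${SUDO:-} tee "$ctl" >/dev/null
    done
}

# Possible call matching the paths in the workflow step (left commented out):
# base=user.slice/user-1001.slice
# SUDO=sudo enable_controller /sys/fs/cgroup cpuset \
#     user.slice "$base" "$base/user@1001.service" \
#     "$base/user@1001.service/session.slice" \
#     "$base/user@1001.service/user.slice" \
#     "$base/user@1001.service/app.slice"
```

Taking the cgroup root as a parameter keeps the function testable against a scratch directory instead of the live `/sys/fs/cgroup` tree.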

defaults/main.yml

Lines changed: 3 additions & 2 deletions
@@ -16,7 +16,8 @@ openhpc_gres_autodetect: 'off'
 openhpc_default_config:
   # This only defines values which are not Slurm defaults
   SlurmctldHost: "{{ openhpc_slurm_control_host }}{% if openhpc_slurm_control_host_address is defined %}({{ openhpc_slurm_control_host_address }}){% endif %}"
-  ProctrackType: proctrack/linuxproc # TODO: really want cgroup but needs cgroup.conf and workaround for CI
+  ProctrackType: proctrack/cgroup
+  TaskPlugin: task/cgroup,task/affinity
   SlurmdSpoolDir: /var/spool/slurm # NB: not OpenHPC default!
   SlurmUser: slurm
   StateSaveLocation: "{{ openhpc_state_save_location }}"
@@ -80,7 +81,7 @@ openhpc_slurm_accounting_storage_user: slurm
 #openhpc_slurm_accounting_storage_pass:
 
 # Job accounting
-openhpc_slurm_job_acct_gather_type: jobacct_gather/linux
+openhpc_slurm_job_acct_gather_type: jobacct_gather/cgroup
 openhpc_slurm_job_acct_gather_frequency: 30
 openhpc_slurm_job_comp_type: jobcomp/none
 openhpc_slurm_job_comp_loc: /var/log/slurm_jobacct.log
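
To confirm on a running node that these defaults reached Slurm, one option is to filter the output of `scontrol show config`, which prints settings as `Name = value` lines. The helper below is only a sketch: `slurm_setting` is a hypothetical name, and the sample line in the comment is illustrative rather than captured from a real cluster.

```shell
# Hypothetical helper, not part of the commit.
# Reads "Name = value" lines on stdin and prints the value for one setting.
# Example input line (illustrative):
#   ProctrackType           = proctrack/cgroup
slurm_setting() {
    awk -F' *= *' -v key="$1" '$1 == key { print $2; exit }'
}

# Possible use on a node where Slurm is running (left commented out):
# scontrol show config | slurm_setting ProctrackType
# scontrol show config | slurm_setting TaskPlugin
# scontrol show config | slurm_setting JobAcctGatherType
```

Reading from stdin rather than calling `scontrol` inside the function means the parsing can be exercised without a Slurm installation.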

molecule/README.md

Lines changed: 4 additions & 2 deletions
@@ -47,9 +47,11 @@ Build a Rocky Linux 9 image with systemd included:
 Run tests, e.g.:
 
     cd ansible-role-openhpc/
-    MOLECULE_NO_LOG="false" MOLECULE_IMAGE=rockylinux:8 molecule test --all
+    MOLECULE_NO_LOG="false" MOLECULE_IMAGE=rockylinux/rockylinux:8 molecule test --all
 
-where the image may be `rockylinux:8` or `localhost/rocky9systemd`.
+where the image may be `rockylinux/rockylinux:8` or `localhost/rocky9systemd`.
+
+Tested with version 8.7.0 of `ansible`, 2.15.13 of `ansible-core` (installed when python version is 3.9).
 
 Other useful options during development:
 - Prevent destroying instances by using `molecule test --destroy never`

molecule/test1/verify.yml

Lines changed: 22 additions & 1 deletion
@@ -6,7 +6,28 @@
   - name: Get slurm partition info
     command: sinfo --noheader --format="%P,%a,%l,%D,%t,%N" # using --format ensures we control whitespace
     register: sinfo
-  - name:
+  - name: Check nodes are up/idle
     assert: # PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
       that: "sinfo.stdout_lines == ['compute*,up,60-00:00:00,2,idle,testohpc-compute-[0-1]']"
       fail_msg: "FAILED - actual value: {{ sinfo.stdout_lines }}"
+
+- name: Run limited job and check limits
+  hosts: testohpc-compute-0
+  tasks:
+    - name: Assert expected SelectTypeParameters
+      ansible.builtin.assert:
+        that: ansible_local.slurm.SelectTypeParameters in ('CR_CORE', 'CR_CORE_MEMORY')
+    - ansible.builtin.shell: |
+        printf '#!bin/bash\nsleep 500' | sbatch --cpus-per-task=1 --ntasks=1 --mem=200m --nodelist=testohpc-compute-0
+        retry=0
+        while [[ "$retry" -lt 10 ]] && ! squeue -l | grep -q RUNNING; do
+          sleep 1
+          retry=$(( retry + 1))
+        done
+    - name: Get cpuset cgroup limit
+      shell: cat /sys/fs/cgroup/system.slice/testohpc-compute-0_slurmstepd.scope/job_*/cpuset.cpus
+      register: job_cpuset
+    - name: Assert cpuset cgroup presence
+      assert:
+        that: "job_cpuset.stdout_lines[0] in ('0', '0-1')" # depending on the VM's state
+        fail_msg: "FAILED - actual value: {{ job_cpuset.stdout_lines }}"
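
The shell task in the new play polls `squeue` up to ten times but carries on even if the job never reaches RUNNING, and the final assertion accepts only the literal strings `0` and `0-1`. The sketch below shows one way both checks could be generalized: a retry helper that returns non-zero on timeout, and a function that counts the CPUs named in a `cpuset.cpus` list. `wait_until`, `cpuset_count` and `job_cpuset_file` are hypothetical names, not part of the commit.

```shell
# Hypothetical helpers, not part of the commit.

# Run a command once per second until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_until() {
    local attempts="$1" i=0
    shift
    while [ "$i" -lt "$attempts" ]; do
        "$@" && return 0
        sleep 1
        i=$((i + 1))
    done
    return 1
}

# Count the CPUs in a cpuset list such as "0", "0-1" or "0,2-3".
cpuset_count() {
    local total=0 part lo hi old_ifs="$IFS"
    IFS=,
    for part in $1; do
        case "$part" in
            '')  ;;  # ignore empty fields
            *-*) lo="${part%-*}"; hi="${part#*-}"
                 total=$((total + hi - lo + 1)) ;;
            *)   total=$((total + 1)) ;;
        esac
    done
    IFS="$old_ifs"
    echo "$total"
}

# Possible use in the verify step (left commented out):
# wait_until 10 sh -c 'squeue -l | grep -q RUNNING' || exit 1
# cpuset_count "$(cat "$job_cpuset_file")"   # job_cpuset_file: the job's cpuset.cpus path
```

Counting CPUs rather than matching exact strings would make the check independent of which core the job lands on, at the cost of no longer verifying the specific CPU IDs.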
