
Commit 4658af5

author
Sqoia Dev Agent
committed
feat: SLURM Lua job_submit plugin for real-time allocation enforcement
Lua Plugin (job_submit.lua):
- Real-time budget checking at job submission (sbatch/salloc/srun)
- Rejects jobs that would exceed account allocation with clear error message
- Shows remaining balance: Budget, Used, Remaining SU
- Annotates jobs >80% utilization in job comment field
- Audit-only mode toggle (ENABLE_ENFORCEMENT = false)
- Reads allocations from /etc/slurmledger/rates.json
- Pattern-matching JSON parser for SLURM's minimal Lua environment

Balance Enforcer (companion tool, not primary):
- Rewritten as reporting/reconciliation tool
- --check: report allocation status
- --reconcile: compare SLURM GrpTRESMins vs SlurmLedger allocations
- --sync: push limits to SLURM via sacctmgr (alternative to Lua)

Updated PRODUCTION_SETUP.md with Lua plugin install instructions
Updated README with enforcement approach comparison
1 parent 876caaf commit 4658af5

5 files changed

Lines changed: 573 additions & 80 deletions

File tree

PRODUCTION_SETUP.md

Lines changed: 60 additions & 14 deletions
@@ -136,23 +136,61 @@ Navigate to **Administration → Institution Profile**:
136136
3. Enter bank/payment details (for invoice footer)
137137
4. Set payment terms (e.g., "Net 30")
138138

139-
## Step 10: Set Up Balance Enforcement (Optional)
139+
## Step 10: Set Up Balance Enforcement
140140

141-
For pre-paid allocations, install the cron job:
141+
SlurmLedger uses a SLURM Lua job_submit plugin for real-time allocation enforcement.
142+
When a user submits a job that would exceed their account's budget, the job is
143+
rejected with a clear error message showing remaining balance.
144+
145+
### Install the Lua Plugin
142146

143147
```bash
144-
# Create cron job for hourly balance checks
145-
sudo tee /etc/cron.d/slurmledger-enforcer << 'EOF'
146-
# SlurmLedger Balance Enforcer — check allocations hourly
147-
0 * * * * root /usr/bin/python3 /usr/share/cockpit/slurmledger/balance_enforcer.py --enforce --log /var/log/slurmledger/enforcer.log
148-
EOF
149-
sudo chmod 644 /etc/cron.d/slurmledger-enforcer
148+
# Copy the plugin
149+
sudo cp /usr/share/cockpit/slurmledger/job_submit.lua /etc/slurm/job_submit.lua
150+
sudo chmod 644 /etc/slurm/job_submit.lua
151+
152+
# Enable in slurm.conf
153+
echo "JobSubmitPlugins=lua" | sudo tee -a /etc/slurm/slurm.conf
150154

151-
# Test it first (dry run):
152-
sudo python3 /usr/share/cockpit/slurmledger/balance_enforcer.py --check
155+
# Apply configuration
156+
sudo scontrol reconfigure
153157
```
154158

155-
The enforcer uses SLURM's native `GrpTRESMins` limit to cap accounts at their allocation. Jobs submitted after the limit is reached will be held in PENDING state with reason `AssocGrpCPUMinutesLimit`.
159+
### What Users See
160+
161+
When a job is rejected:
162+
```
163+
$ sbatch my_job.sh
164+
sbatch: error: SlurmLedger: Job rejected — account 'physics-lab' has exceeded its allocation.
165+
Budget: 500000 SU | Used: 498200 SU | Remaining: 1800 SU
166+
This job would require ~2400 SU.
167+
Contact your PI or HPC admin to request additional allocation.
168+
```
169+
170+
When an account is above 80% of its allocation, jobs are accepted but annotated in the job comment field:
171+
```
172+
$ squeue -j 12345 -o "%j %k"
173+
my_sim [SlurmLedger] Account 'physics-lab' at 87% of allocation
174+
```
175+
176+
### Enforcement Modes
177+
178+
Edit `/etc/slurm/job_submit.lua`:
179+
- `ENABLE_ENFORCEMENT = true` — reject jobs over budget (default)
180+
- `ENABLE_ENFORCEMENT = false` — audit-only mode (log but don't reject)
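
The reject, annotate, and audit-only behaviours described above can be sketched in a few lines. This is an illustration in Python, not the plugin's Lua source (which is not shown here). The function name `evaluate_job`, the `Decision` type, and `WARN_PERCENT` are hypothetical; only `ENABLE_ENFORCEMENT` and the 80% threshold come from the documentation.

```python
from dataclasses import dataclass

# Assumed names for illustration; the real plugin is written in Lua.
ENABLE_ENFORCEMENT = True
WARN_PERCENT = 80  # annotate jobs once the account is above this utilization


@dataclass
class Decision:
    accept: bool  # whether the job is allowed through
    note: str     # rejection message or job-comment annotation ("" if none)


def evaluate_job(account, budget_su, used_su, job_su, enforce=ENABLE_ENFORCEMENT):
    """Decide whether a job fits in the account's remaining allocation."""
    remaining = budget_su - used_su
    if job_su > remaining:
        note = (
            f"account '{account}' would exceed its allocation. "
            f"Budget: {budget_su} SU | Used: {used_su} SU | "
            f"Remaining: {remaining} SU | Job: ~{job_su} SU"
        )
        # In audit-only mode the would-be rejection is reported but not applied.
        return Decision(accept=not enforce, note=note)

    # Integer arithmetic avoids floating-point surprises in the percentage.
    percent_used = used_su * 100 // budget_su if budget_su > 0 else 100
    if percent_used > WARN_PERCENT:
        note = f"[SlurmLedger] Account '{account}' at {percent_used}% of allocation"
        return Decision(accept=True, note=note)

    return Decision(accept=True, note="")


if __name__ == "__main__":
    print(evaluate_job("physics-lab", 500000, 498200, 2400))
    print(evaluate_job("physics-lab", 500000, 435000, 100))
```

With the numbers from the sample rejection (budget 500000 SU, used 498200 SU, job of about 2400 SU) the job needs more than the 1800 SU remaining, so it is rejected when enforcement is on and merely reported when it is off.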
181+
182+
### Companion Tools
183+
184+
```bash
185+
# Check all account balances
186+
python3 /usr/share/cockpit/slurmledger/balance_enforcer.py --check
187+
188+
# Reconcile SLURM limits with SlurmLedger allocations
189+
python3 /usr/share/cockpit/slurmledger/balance_enforcer.py --reconcile
190+
191+
# Alternative: push limits via GrpTRESMins (instead of Lua)
192+
python3 /usr/share/cockpit/slurmledger/balance_enforcer.py --sync
193+
```
156194

157195
## Step 11: Configure Financial Integration (Optional)
158196

@@ -215,7 +253,8 @@ sudo chmod 755 /etc/cron.daily/slurmledger-backup
215253
- [ ] Institution profile is complete
216254
- [ ] Test invoice generates with correct branding
217255
- [ ] Invoice numbers are sequential
218-
- [ ] Balance enforcer runs without errors (dry run)
256+
- [ ] Lua plugin installed at /etc/slurm/job_submit.lua and enabled in slurm.conf
257+
- [ ] Balance check runs without errors: `balance_enforcer.py --check`
219258
- [ ] Backup cron is active
220259
- [ ] File permissions are correct on /etc/slurmledger/
221260

@@ -231,9 +270,16 @@ sudo chmod 755 /etc/cron.daily/slurmledger-backup
231270
- Upload a logo (PNG/JPG, under 256KB)
232271
- Fill in the bank/payment information
233272

234-
### Balance enforcer says "No allocations configured"
273+
### Balance check says "No allocations configured"
235274
- Set up allocations in Administration → Allocations
236-
- Only "prepaid" allocations are enforced
275+
- Only "prepaid" allocations are enforced by the Lua plugin
276+
277+
### Lua plugin not rejecting jobs
278+
- Verify `JobSubmitPlugins=lua` is in slurm.conf and not commented out: `grep -i JobSubmitPlugins /etc/slurm/slurm.conf`
279+
- Verify the plugin file exists and is readable: `ls -la /etc/slurm/job_submit.lua`
280+
- Check slurmctld logs: `journalctl -u slurmctld | grep SlurmLedger`
281+
- Confirm `ENABLE_ENFORCEMENT = true` in `/etc/slurm/job_submit.lua`
282+
- Run `scontrol reconfigure` after any slurm.conf change
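
The first check in the list above can also be scripted. The sketch below is a hypothetical helper, not part of SlurmLedger. It scans slurm.conf text for uncommented `JobSubmitPlugins` lines and reports whether `lua` is listed. It assumes the plugin is named exactly `lua` and that `#` starts a comment.

```python
def find_job_submit_plugins(slurm_conf_text):
    """Return one plugin list per uncommented JobSubmitPlugins line."""
    found = []
    for raw in slurm_conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition("=")
        # Parameter names in slurm.conf are case insensitive.
        if sep and key.strip().lower() == "jobsubmitplugins":
            found.append([p.strip().lower() for p in value.split(",") if p.strip()])
    return found


def lua_plugin_enabled(slurm_conf_text):
    """True if any uncommented JobSubmitPlugins line lists the lua plugin."""
    return any("lua" in plugins for plugins in find_job_submit_plugins(slurm_conf_text))


if __name__ == "__main__":
    with open("/etc/slurm/slurm.conf") as fh:
        text = fh.read()
    entries = find_job_submit_plugins(text)
    print("lua enabled:", lua_plugin_enabled(text))
    if len(entries) > 1:
        print("warning: JobSubmitPlugins appears", len(entries), "times")
```

More than one entry usually means the line was appended repeatedly during installation and is worth cleaning up by hand.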
237283

238284
### Permission denied on config save
239285
- Check `/etc/slurmledger/` ownership: `ls -la /etc/slurmledger/`

README.md

Lines changed: 49 additions & 5 deletions
@@ -186,20 +186,64 @@ Create custom rules via the admin UI — no config file editing required.
186186

187187
## Balance Enforcement
188188

189-
For pre-paid allocations, `balance_enforcer.py` enforces budget limits via SLURM's native `GrpTRESMins` mechanism. Install the cron job to run it hourly:
189+
SlurmLedger supports two approaches for enforcing prepaid allocation budgets:
190+
191+
### Recommended: Lua job_submit plugin
192+
193+
The `job_submit.lua` plugin is called by SLURM at job submission time (before the
194+
job is accepted). When a user submits a job that would exceed their account's budget,
195+
the job is rejected immediately with a clear, user-visible error message:
190196

191197
```
192-
# /etc/cron.d/slurmledger-enforcer
193-
0 * * * * root /usr/bin/python3 /usr/share/cockpit/slurmledger/balance_enforcer.py --enforce --log /var/log/slurmledger/enforcer.log
198+
$ sbatch my_job.sh
199+
sbatch: error: SlurmLedger: Job rejected — account 'physics-lab' has exceeded its allocation.
200+
Budget: 500000 SU | Used: 498200 SU | Remaining: 1800 SU
201+
This job would require ~2400 SU.
202+
Contact your PI or HPC admin to request additional allocation.
203+
```
204+
205+
Install once per cluster — no cron job required:
206+
207+
```bash
208+
sudo cp /usr/share/cockpit/slurmledger/job_submit.lua /etc/slurm/job_submit.lua
209+
echo "JobSubmitPlugins=lua" | sudo tee -a /etc/slurm/slurm.conf
210+
sudo scontrol reconfigure
194211
```
195212

196-
Run a dry-run check manually at any time:
213+
Set `ENABLE_ENFORCEMENT = false` in the plugin file to switch to audit-only mode
214+
(logs jobs that would have been rejected, without blocking them).
215+
216+
### Alternative: GrpTRESMins via sacctmgr
217+
218+
If your site cannot use Lua plugins, `balance_enforcer.py --sync` pushes allocation
219+
limits to SLURM as `GrpTRESMins` on each account. SLURM holds jobs that would exceed
220+
the limit with reason `AssocGrpCPUMinutesLimit`. This approach is less precise (the
221+
counter tracks lifetime CPU-minutes on the account, not allocation-period usage) and
222+
gives users a less informative error message.
197223

198224
```bash
225+
python3 /usr/share/cockpit/slurmledger/balance_enforcer.py --sync
226+
```
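
To make the `GrpTRESMins` approach concrete, here is a minimal sketch of how an allocation might be turned into a `sacctmgr` command. It does not reproduce the internals of `balance_enforcer.py --sync`, which are not shown here. The helper name `build_sync_command` is hypothetical, and the conversion of 1 SU to 60 CPU-minutes is an assumption; sites that define an SU differently would need another factor.

```python
import shlex

# Assumption: 1 SU equals 1 CPU-hour, i.e. 60 CPU-minutes.
SU_TO_CPU_MINUTES = 60


def build_sync_command(account, budget_su, su_to_cpu_minutes=SU_TO_CPU_MINUTES):
    """Build (but do not run) a sacctmgr command capping an account's CPU-minutes."""
    cpu_minutes = int(budget_su * su_to_cpu_minutes)
    return [
        "sacctmgr", "--immediate",        # apply without the interactive prompt
        "modify", "account",
        "where", f"name={account}",
        "set", f"GrpTRESMins=cpu={cpu_minutes}",
    ]


if __name__ == "__main__":
    cmd = build_sync_command("physics-lab", 500000)
    print(shlex.join(cmd))  # inspect the command before running it by hand
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without SLURM installed.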
227+
228+
### Reporting: balance_enforcer.py --check
229+
230+
`balance_enforcer.py --check` queries sacct to report current usage against each
231+
prepaid allocation. It works regardless of which enforcement approach is active and
232+
is the data source for the **Check Balances** button in the Admin Dashboard.
233+
234+
```bash
235+
# Text report
199236
python3 /usr/share/cockpit/slurmledger/balance_enforcer.py --check
237+
238+
# JSON output (consumed by Cockpit dashboard)
239+
python3 /usr/share/cockpit/slurmledger/balance_enforcer.py --check --json
240+
241+
# Reconcile SLURM GrpTRESMins against SlurmLedger allocations
242+
python3 /usr/share/cockpit/slurmledger/balance_enforcer.py --reconcile
200243
```
201244

202-
The **Check Balances** button in the Admin Dashboard runs the same check interactively and displays results in the UI.
245+
The **Check Balances** button in the Admin Dashboard runs `--check --json` and
246+
displays results in a table with per-account budget, usage, remaining SU, and status.
203247

204248
## Local Development & Testing
205249
