- Planning & Submission
  - `scripts/plan_autoexp.py` renders SBATCH scripts and writes a manifest (`PlanManifest`) containing job metadata (script path, log template, monitoring config, restart policies, etc.).
  - `scripts/submit_autoexp.py` loads the manifest, instantiates a monitor and controller via `build_host_runtime()` and `instantiate_controller()`, submits jobs (or an array) through the SLURM client, and registers them with the `MonitorController`.
  - During registration the controller persists each job in `MonitorStateStore` (`StoredJob` entries keyed by `job_id`).
- Monitoring Loop
  - `run_monitoring()` drives `_monitor_loop()`, which periodically calls `MonitorController.observe_once()`.
  - The controller calls `monitor.watch(...)` (async), merges the result with the latest `squeue()` snapshot, evaluates events, and, if required, applies the restart policy via `_apply_policy()`.
  - Action payloads are written as JSON files to `<monitoring_state_dir>/actions/…`.
  - Whenever job state changes (`_set_state`, `_finalize_job`, `_restart_job`), the controller updates `MonitorStateStore` so the session can be resumed.
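The periodic structure of the loop can be sketched as follows. This is a simplified illustration, not the project's `_monitor_loop()`: the `observe_once` callable, its return value (assumed here to be the number of still-active jobs), and the `max_cycles` parameter are hypothetical.

```python
import asyncio


async def monitor_loop(observe_once, interval_seconds, max_cycles=None):
    """Call observe_once() repeatedly until no jobs remain active.

    observe_once is assumed (for this sketch only) to be an async callable
    that returns the number of jobs still being tracked.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        active = await observe_once()
        cycles += 1
        if not active:
            break  # nothing left to monitor
        await asyncio.sleep(interval_seconds)
    return cycles
```

With a stub that reports 2, 1, and then 0 active jobs, `asyncio.run(monitor_loop(stub, 0))` performs three observation cycles and stops.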
- Persistence Layout
  - `MonitorStateStore` writes `<monitoring_state_dir>/<plan_id>.json`, containing:

    ```json
    {
      "session_id": "...",
      "project_name": "...",
      "config": {... original manifest config ...},
      "jobs": [
        {
          "job_id": "123456",
          "name": "demo_0",
          "script_path": "...sbatch",
          "log_path": "...%A_%a.log",
          "attempts": 1,
          "metadata": {...},
          ...
        }
      ]
    }
    ```
  - Each restart (triggered by policies) rewrites the store with the new `job_id` (for example `123456 -> 123789`).
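To inspect a session file by hand, a small reader along these lines may help. It is a sketch that assumes only the fields shown in the layout above (`session_id`, `jobs`, `job_id`, `name`, `attempts`). The function names are hypothetical and not part of the project.

```python
import json
from pathlib import Path


def load_session_jobs(session_path):
    """Return (session_id, jobs) from a session file in the layout above."""
    data = json.loads(Path(session_path).read_text(encoding="utf-8"))
    # "jobs" is a list of per-job dicts; tolerate a missing key.
    return data.get("session_id"), data.get("jobs", [])


def summarize_jobs(jobs):
    """Map each job name to its stored job_id and attempt count."""
    return {
        job["name"]: (job["job_id"], job.get("attempts", 1))
        for job in jobs
    }
```

Because the store keeps only the current `job_id` per job, the summary shows the latest attempt and nothing about earlier ones.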
- Monitoring Resume (`scripts/monitor_autoexp.py`)
  - Reads the session JSON (via `--session <plan_id>` or `--session path/to/session.json`) to locate the manifest, then loads the monitor runtime, instantiates a controller, and calls `restore_jobs()`.
  - `restore_jobs()` pulls all `StoredJob` entries and re-registers them verbatim (`controller.register_job(job_id, registration, attempts)`). Sessions with no active jobs are skipped by default when running `--all`; add `--include-completed` to reprocess archived sessions.
  - If the monitor is run with `--monitor-override …`, Hydra-style overrides are merged into the monitor config before instantiation (for example `debug_sync=true`).
- The state store persists the exact SLURM job ID active at the time of persistence.
- If monitoring is stopped and the job continues under the same ID, resuming works: the controller finds the job via `squeue()` and proceeds.
- Problem: If the job is resubmitted while monitoring is offline (for example manual restart, scheduler requeue), the new SLURM job ID differs from the stored one. On resume:
  - `restore_jobs()` registers the stale job ID.
  - `observe_once()` sees `squeue()` return `NOT_FOUND`, classifies it as `timeout`, and applies the `timeout` policy. Net result: the controller may immediately attempt a new restart, overriding the in-flight job.
- There is no mechanism to reconcile the stored job ID with the current queue state (for example lookup by job name or metadata).
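One possible shape for such a reconciliation step is sketched below. It is not implemented in the project; the function names are hypothetical, and the input is assumed to be the text produced by `squeue --noheader --format="%i|%j"` (job ID and job name separated by `|`). The sketch remaps a stored ID only when exactly one queued job carries the same name, so ambiguous matches are left for an operator to resolve.

```python
def parse_squeue_lines(text):
    """Parse 'job_id|name' lines into a list of (job_id, name) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, _, name = line.partition("|")
        rows.append((job_id, name))
    return rows


def reconcile_job_ids(stored_jobs, queue_rows):
    """Return {stale_job_id: live_job_id} for stored jobs found under a new ID.

    stored_jobs: list of dicts with "job_id" and "name" keys.
    queue_rows:  list of (job_id, name) tuples from the current queue.
    """
    live_ids = {job_id for job_id, _ in queue_rows}
    ids_by_name = {}
    for job_id, name in queue_rows:
        ids_by_name.setdefault(name, []).append(job_id)

    remapped = {}
    for job in stored_jobs:
        if job["job_id"] in live_ids:
            continue  # stored ID is still in the queue; nothing to do
        candidates = ids_by_name.get(job["name"], [])
        if len(candidates) == 1:  # skip ambiguous or missing matches
            remapped[job["job_id"]] = candidates[0]
    return remapped
```

Matching on name alone is only safe if job names are unique within a session, which is an assumption of this sketch.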
- When monitoring stops before a job has been finalized (`_finalize_job()`), the `StoredJob` entry remains.
- On resume, the job is re-registered:
  - If the log contains a terminal marker (`termination_string`), the monitor emits a `SuccessState` event and `_apply_policy()` (for mode `success`) finalizes the job.
  - If no termination marker exists and the job is missing from `squeue()`, the controller interprets the situation as `timeout`. Depending on policies, this may either stop the job with a “timeout” reason or attempt another restart, even though the job is already done.
- The UI/CLI therefore shows “undefined/pending” jobs until a manual intervention removes them from the state store.
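The decision described above can be modeled as a small pure function. This is a simplified model of the behavior, not the controller's code: `ResumeOutcome` and `classify_on_resume` are hypothetical names, and the real controller works with events and policies rather than a single return value.

```python
from enum import Enum


class ResumeOutcome(Enum):
    SUCCESS = "success"
    RUNNING = "running"
    TIMEOUT = "timeout"


def classify_on_resume(log_text, termination_string, in_queue):
    """Classify a re-registered job from its log contents and queue presence."""
    # A terminal marker in the log wins, whether or not the job is queued.
    if termination_string and termination_string in log_text:
        return ResumeOutcome.SUCCESS
    if in_queue:
        return ResumeOutcome.RUNNING
    # No marker and not in the queue: treated as a timeout, even if the
    # job actually finished without writing the marker.
    return ResumeOutcome.TIMEOUT
```

The last branch is the problematic one: a job that completed while the monitor was down, but whose log lacks the marker, is indistinguishable from one that vanished.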
- `monitor_autoexp.py` restores only the jobs persisted in the state store. Jobs defined in the manifest but never registered previously are not added.
- There is no flag or workflow to “monitor everything from the manifest” after a stop/restart; operators must manually re-register jobs or re-run submission.
- The persistence layer stores the current attempt count but not the mapping of attempt → job ID.
- When resuming, the user cannot tell which SLURM job ID is the active attempt, nor can the controller match a new job ID back to the job it belongs to.
- Log paths now include `%j`/`%A_%a` so restarts write to unique files, but the state store persists the template rather than an expanded path.
- If external tooling truncates or rotates logs differently per attempt, the monitor replays entire files on resume, potentially re-firing events.
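For reference, expanding a persisted template into a concrete path might look like the sketch below. The function name is hypothetical. It handles only the three SLURM filename patterns mentioned here (`%j` for the job ID, `%A` for the array master job ID, `%a` for the array task index) and ignores other patterns, zero-padding forms, and `%%` escapes.

```python
def expand_log_template(template, job_id, array_job_id=None, array_task_id=None):
    """Expand %j, %A and %a in a SLURM-style log path template."""
    path = template
    # Array placeholders are expanded only when both values are known;
    # otherwise they are left in place so the gap stays visible.
    if array_job_id is not None and array_task_id is not None:
        path = path.replace("%A", str(array_job_id))
        path = path.replace("%a", str(array_task_id))
    return path.replace("%j", str(job_id))
```

For example, `expand_log_template("logs/demo_%A_%a.log", 123460, 123456, 4)` yields `logs/demo_123456_4.log`.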
Several of the pain points above are now addressed in code and tooling:
- Persist resolved log paths and last-known states. The monitor writes `resolved_log_path`, `last_monitor_state`, `last_slurm_state`, and a `last_updated` timestamp for every tracked job. Restarted monitors no longer have to expand `%j`/`%A_%a` placeholders and can immediately point operators at concrete log files.
- Re-register SLURM state on resume. `restore_jobs()` and the container orchestrator both call `slurm_client.register_job(...)` when rehydrating a session. Fake and real clients now have the job in their internal registries, making `squeue()` snapshots work again after a pause.
- State store validation script. `scripts/tests/test_monitor_resume.py` automates a plan → submit → interrupt → resume workflow on a login node and asserts that the state file contains resolved log paths. It also checks that the monitor prints a restore message on the second run, providing an end-to-end smoke test for the resume path.
- Documentation updates. README/SPEC now describe the resume flow, the new state fields, and how to use the monitoring test harness on a cluster.
The longer-term restructure ideas remain valid and would further harden the resume path:
- Track logical job instances (name → attempt history) instead of persisting a single job ID.
- Reconcile SLURM job IDs that change while monitoring is offline by querying `squeue`/`sacct` by job name.
- Surface completion state for jobs that finished while the monitor was down without relying solely on log markers.
- Provide a `--monitor-all` option to (re)register jobs defined in the manifest even if they were never tracked before the pause.
- Expand attempt history in the state store so operators can inspect previous retries after the job finalizes.
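As an illustration of the first and last ideas, a logical job with an attempt history could be represented roughly as follows. This is a sketch of one possible data model, not a proposal taken from the codebase: the class names, fields, and the `"restarted"` state label are all hypothetical.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Attempt:
    job_id: str
    final_state: Optional[str] = None  # None while the attempt is active


@dataclass
class LogicalJob:
    name: str
    attempts: List[Attempt] = field(default_factory=list)

    def start_attempt(self, job_id, previous_state="restarted"):
        """Close the current attempt (if any) and open a new one."""
        if self.attempts and self.attempts[-1].final_state is None:
            self.attempts[-1].final_state = previous_state
        self.attempts.append(Attempt(job_id=job_id))

    @property
    def active_job_id(self):
        """SLURM job ID of the latest attempt, or None if never submitted."""
        return self.attempts[-1].job_id if self.attempts else None

    def owns(self, job_id):
        """True if job_id belongs to any attempt of this logical job."""
        return any(a.job_id == job_id for a in self.attempts)
```

`asdict(job)` produces plain dicts and lists, so a structure like this could be written to the existing JSON state file without a custom encoder.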