feat: fetch remote FirecREST logs into sessions #1136
leafty
left a comment
Why read from the file system when stdout and stderr are returned from GET /compute/{system_name}/jobs/{job_id}/metadata ("Get metadata of a job by {job_id}")?
Thanks, I missed this one.
There may be a good case for reading from the file system; I am just making sure we consider the simplest option first.
I now read the paths of the stdout/stderr files from the metadata, with a fallback to the Slurm defaults, in 594448c.
FYI, this is what FirecREST runs to get the metadata: https://github.com/eth-cscs/firecrest-v2/blob/master/src/lib/scheduler_clients/slurm/cli_commands/sacct_job_metadata_command.py#L12
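For illustration, the fallback described above could look roughly like the sketch below. It is not the code from 594448c: the `jobMetadata` struct, its field names, and `resolveLogPaths` are hypothetical stand-ins, and the only behaviour assumed is Slurm's documented default of writing both stdout and stderr to `slurm-<jobid>.out` in the submission directory.

```go
package main

import (
	"fmt"
	"path"
)

// jobMetadata is a hypothetical stand-in for the relevant fields of the
// FirecREST job metadata response; the real response type may differ.
type jobMetadata struct {
	StandardOutput string
	StandardError  string
}

// resolveLogPaths prefers the paths reported in the metadata. If stdout is
// missing it falls back to Slurm's default file, slurm-<jobid>.out in the
// submission directory. If stderr is missing it reuses the stdout path,
// because Slurm sends stderr to the stdout file unless told otherwise.
func resolveLogPaths(md jobMetadata, workDir, jobID string) (stdout, stderr string) {
	stdout, stderr = md.StandardOutput, md.StandardError
	if stdout == "" {
		stdout = path.Join(workDir, fmt.Sprintf("slurm-%s.out", jobID))
	}
	if stderr == "" {
		stderr = stdout
	}
	return stdout, stderr
}

func main() {
	// Empty metadata: both streams resolve to the Slurm default file.
	out, errPath := resolveLogPaths(jobMetadata{}, "/home/user/session", "12345")
	fmt.Println(out)
	fmt.Println(errPath)
}
```

Note that when both paths resolve to the same file, polling them as two independent streams would forward every line twice, so a caller may want to detect that case.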
619df23 to 1dbe1b1
Summary
Fetch remote Slurm session stdout/stderr logs via FirecREST and stream them into the controller pod's stdout, making them visible through `kubectl logs`.
Motivations and Context
Remote sessions running on Slurm clusters write their output to files on the cluster filesystem (e.g., `slurm-<jobid>.out`). Previously, these logs were inaccessible from the Kubernetes sidecar, making it hard for users and operators to debug running or completed remote sessions. This change polls the log files through FirecREST's filesystem view API and forwards any new lines to the pod's stdout.
Changes
- `fetchSessionLogs`: polls the remote stdout and stderr files via `GetViewFilesystemSystemNameOpsViewGet` and writes complete lines to `os.Stdout`, prefixed with `[session/stdout]` or `[session/stderr]`.
- `fetchLogStream`: handles a single stream with buffered reading, byte-offset tracking, and line-by-line flushing, so partial lines survive across polls.
- `savedState`: now persists `stdout_path`, `stderr_path`, `stdout_offset`, and `stderr_offset` alongside the existing `job_id`.
- `saveJobID` → `saveState` and `recoverJobID` → `recoverState`: updated to serialize and deserialize the new fields.
- `periodicSessionStatus`: calls `fetchSessionLogs` on each status tick and persists the offsets to disk.
- `Start`: constructs the expected Slurm log file paths after job submission and calls `saveState`.
Notes
- Log-fetch errors are logged at `Warn` level and do not fail the status loop; missing files early in the job lifecycle are silently ignored.
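
To make the offset tracking and line-by-line flushing described for `fetchLogStream` concrete, here is a minimal sketch. It is not the PR's implementation: `fetchFunc` and `forwardNewLines` are hypothetical names, and `fetchFunc` stands in for the FirecREST filesystem view call, which is assumed here to return the file content from a given byte offset. The sketch also makes one simplifying choice that may differ from the PR: instead of buffering a partial line in memory, it advances the offset only past complete lines, so an unfinished line is fetched again on the next poll.

```go
package main

import (
	"bytes"
	"fmt"
)

// fetchFunc is a hypothetical stand-in for the FirecREST filesystem view
// call: it returns the content of the remote file from offset onwards.
type fetchFunc func(offset int64) ([]byte, error)

// forwardNewLines fetches the bytes after offset, passes each complete line
// to emit, and returns the offset just past the last complete line. A
// trailing partial line is not consumed, so it is fetched again on the next
// poll once the job has finished writing it.
func forwardNewLines(fetch fetchFunc, offset int64, emit func(line string)) (int64, error) {
	chunk, err := fetch(offset)
	if err != nil {
		// Leave the offset untouched so the next poll retries the same range.
		return offset, err
	}
	for {
		i := bytes.IndexByte(chunk, '\n')
		if i < 0 {
			return offset, nil
		}
		emit(string(chunk[:i]))
		offset += int64(i + 1)
		chunk = chunk[i+1:]
	}
}

func main() {
	// Simulated remote log file that grows between two polls.
	remote := []byte("starting\nloading mod")
	fetch := func(off int64) ([]byte, error) { return remote[off:], nil }
	emit := func(line string) { fmt.Printf("[session/stdout] %s\n", line) }

	off, _ := forwardNewLines(fetch, 0, emit) // emits only "starting"
	remote = append(remote, []byte("el\ndone\n")...)
	off, _ = forwardNewLines(fetch, off, emit) // emits "loading model" and "done"
	fmt.Println("offset:", off)
}
```

Because the returned offset always sits on a line boundary, persisting it (as `savedState` does with `stdout_offset` and `stderr_offset`) should be enough to resume after a restart without losing or splitting a line, at the cost of re-reading a few bytes.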