Skip to content

feat: fetch remote FirecREST logs into sessions#1136

Merged
SalimKayal merged 5 commits into
mainfrom
salimkayal-feat-forward-slurm-logs
Jun 17, 2026
Merged

feat: fetch remote FirecREST logs into sessions#1136
SalimKayal merged 5 commits into
mainfrom
salimkayal-feat-forward-slurm-logs

Conversation

@SalimKayal

@SalimKayal SalimKayal commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Summary

Fetch remote Slurm session stdout/stderr logs via FirecREST and stream them into the controller pod's stdout, making them visible through kubectl logs.

Motivations and Context

Remote sessions running on Slurm clusters write their output to files on the cluster filesystem (e.g., slurm-<jobid>.out). Previously, these logs were inaccessible from the Kubernetes sidecar, making it hard for users and operators to debug running or completed remote sessions. This change polls the log files through FirecREST's filesystem view API and forwards any new lines to the pod's stdout.

Changes

  • Added fetchSessionLogs — polls remote stdout and stderr files via GetViewFilesystemSystemNameOpsViewGet and writes complete lines to os.Stdout prefixed with [session/stdout] or [session/stderr].
  • Added fetchLogStream — handles a single stream: buffered reading, byte-offset tracking, and line-by-line flushing so partial lines survive across polls.
  • Extended savedState — now persists stdout_path, stderr_path, stdout_offset, and stderr_offset alongside the existing job_id.
  • Renamed saveJobIDsaveState and recoverJobIDrecoverState — updated to serialize and deserialize the new fields.
  • Updated periodicSessionStatus — calls fetchSessionLogs on each status tick and persists offsets to disk.
  • Updated Start — constructs expected Slurm log file paths after job submission and calls saveState.

Notes

  • Log fetch errors are logged at Warn level and do not fail the status loop; missing files early in the job lifecycle are silently ignored.
  • The buffer/offset approach ensures we don't duplicate lines across polls or after a controller restart.

@SalimKayal SalimKayal changed the title feat: fetch remote logs into sessions feat: fetch remote FirecREST logs into sessions Jun 15, 2026
@SalimKayal SalimKayal marked this pull request as ready for review June 15, 2026 12:07
@SalimKayal SalimKayal requested review from a team and olevski as code owners June 15, 2026 12:07
@SalimKayal SalimKayal requested review from leafty and sambuc June 15, 2026 12:07

@leafty leafty left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why read from the file system when stdout and stderr are returned from GET /compute/{system_name}/jobs/{job_id}/metadata?

Get metadata of a job by {job_id}

Comment thread internal/remote/firecrest/controller.go Outdated
Comment thread internal/remote/firecrest/controller.go
@SalimKayal

Copy link
Copy Markdown
Collaborator Author

Why read from the file system when stdout and stderr are returned from GET /compute/{system_name}/jobs/{job_id}/metadata?

thanks I missed this one.

@leafty

leafty commented Jun 16, 2026

Copy link
Copy Markdown
Member

Why read from the file system when stdout and stderr are returned from GET /compute/{system_name}/jobs/{job_id}/metadata?

thanks I missed this one.

There may be a good case for reading from the file system, just making sure we are considering the simplest option first.

@SalimKayal

Copy link
Copy Markdown
Collaborator Author

I read the path for the stdout/err files from the metadata with a fallback for slurm defaults in 594448c.

@SalimKayal

Copy link
Copy Markdown
Collaborator Author

@SalimKayal SalimKayal force-pushed the salimkayal-feat-forward-slurm-logs branch from 619df23 to 1dbe1b1 Compare June 16, 2026 10:59
@olevski olevski self-requested a review June 16, 2026 12:33
olevski
olevski previously approved these changes Jun 16, 2026
@SalimKayal SalimKayal merged commit 01431a5 into main Jun 17, 2026
8 of 9 checks passed
@SalimKayal SalimKayal deleted the salimkayal-feat-forward-slurm-logs branch June 17, 2026 09:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants