[TRTLLMINF-99][infra] Add SLURM frontend failover to L0#15674
[TRTLLMINF-99][infra] Add SLURM frontend failover to L0#15674dpitman-nvda wants to merge 5 commits into
Conversation
Signed-off-by: Derek Pitman <dpitman@nvidia.com>
b4a2e4b to
742fbcf
Compare
📝 WalkthroughWalkthroughThe Jenkins SLURM pipeline now routes SSH and SCP operations through frontend-failover helpers across result upload, agent bring-up, cleanup, and sbatch execution. The sbatch path reuses active jobs from stored job ids, re-queries allocation state on tracking failures, and raises ChangesSLURM frontend failover
Sequence Diagram(s)sequenceDiagram
participant runLLMTestlistWithSbatch
participant withSlurmFrontendFailover
participant "SLURM frontend" as SLURMFrontend
participant "SLURM allocation" as SLURMAllocation
runLLMTestlistWithSbatch->>withSlurmFrontendFailover: submit Run Pytest and read slurm_job_id.txt
withSlurmFrontendFailover->>SLURMFrontend: SSH commands through frontend failover
SLURMFrontend->>SLURMAllocation: reuse or submit sbatch job
SLURMAllocation-->>SLURMFrontend: job id and state
runLLMTestlistWithSbatch->>withSlurmFrontendFailover: track job via frontend failover
withSlurmFrontendFailover->>SLURMAllocation: query allocation-level state
SLURMAllocation-->>withSlurmFrontendFailover: TIMEOUT
withSlurmFrontendFailover-->>runLLMTestlistWithSbatch: throw UserFailure
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@jenkins/L0_Test.groovy`:
- Around line 200-221: The failover logic in the scp retry block exits too early
on any non-255 return code, so `retryableConnectionCheck` in the `attempts` loop
never gets a chance to decide whether connection errors should be retried.
Update the `attempts` generation inside the script returned by this helper so
that `scpFromRemoteCmd` failures like connection refused/DNS/no route can still
fall through to `retryableConnectionCheck` instead of immediately `exit`ing,
while still preserving the final failure exit behavior in the wrapper around
`__slurm_frontend_rc`.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 755c117c-d635-445d-8070-f415b423cc49
📒 Files selected for processing (1)
jenkins/L0_Test.groovy
| def attempts = remotes.collect { remote -> """ | ||
| echo '[SLURM-FRONTEND] trying ${remote.host}' >&2 | ||
| __slurm_frontend_output=\$({ ${scpFromRemoteCmd(remote, remotePath, localPath)}; } 2>&1) | ||
| __slurm_frontend_rc=\$? | ||
| printf '%s\\n' "\$__slurm_frontend_output" >&2 | ||
| if [ \$__slurm_frontend_rc -eq 0 ]; then | ||
| exit 0 | ||
| fi | ||
| if [ \$__slurm_frontend_rc -ne 255 ]; then | ||
| exit \$__slurm_frontend_rc | ||
| fi | ||
| ${retryableConnectionCheck} | ||
| """.stripIndent().trim() } | ||
|
|
||
| return """ | ||
| ( | ||
| __slurm_frontend_rc=255 | ||
| ${attempts.join("\n")} | ||
| exit \$__slurm_frontend_rc | ||
| ) | ||
| """.stripIndent().replaceAll(/\s+$/, "") | ||
| } |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🌐 Web query:
What exit code does OpenSSH scp return on a connection failure (connection refused, no route to host, could not resolve hostname), and did this change in OpenSSH 9.0 SFTP-mode scp?
💡 Result:
When OpenSSH scp encounters a connection failure, such as "connection refused," "no route to host," or a failure to resolve the hostname, it exits with a non-zero status code, typically 1 [1][2][3][4][5]. Official documentation for OpenSSH scp states generally that the utility exits with 0 on success and >0 if an error occurs [1][4]. The transition in OpenSSH 9.0 to using the SFTP protocol by default did not change this exit behavior [6][4][5]. The primary changes introduced in OpenSSH 9.0 regarding scp involved the underlying protocol used for file transfers (switching from the legacy scp/rcp protocol to SFTP) and differences in how remote path characters are handled [6][7]. The exit status remains defined as 0 for success and a non-zero value (typically 1) when an error occurs [1][4]. Because scp relies on the ssh(1) utility to establish the underlying connection, these connection-level errors (which occur during the ssh handshake phase) result in the same non-zero exit status regardless of whether the subsequent file transfer protocol is legacy SCP or SFTP [8][4][9].
Citations:
- 1: https://man.openbsd.org/OpenBSD-7.2/scp.1
- 2: https://manpages.debian.org/bullseye/openssh-client/scp.1.en.html
- 3: https://man.archlinux.org/man/core/openssh/scp.1.en
- 4: https://man7.org/linux/man-pages/man1/scp.1.html
- 5: https://manpages.debian.org/unstable/openssh-client/scp.1.en.html
- 6: https://www.openssh.org/txt/release-9.0
- 7: http://lists.mindrot.org/pipermail/openssh-unix-dev/2021-September/039648.html
- 8: https://superuser.com/questions/577911/scp-lost-connection-but-ssh-works-fine
- 9: https://stackoverflow.com/questions/62445517/scp-to-a-remote-server-ssh-connect-to-host-serveraddress-it-port-22-connecti
🏁 Script executed:
sed -n '150,240p' jenkins/L0_Test.groovyRepository: NVIDIA/TensorRT-LLM
Length of output: 4353
scp failover never runs for non-255 connection errors.
OpenSSH scp commonly returns 1 for connection failures like connection refused, no route to host, and DNS resolution errors, but this branch exits immediately for any rc != 255 before retryableConnectionCheck can run. That means a dead frontend can still stop failover instead of trying the next remote.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@jenkins/L0_Test.groovy` around lines 200 - 221, The failover logic in the scp
retry block exits too early on any non-255 return code, so
`retryableConnectionCheck` in the `attempts` loop never gets a chance to decide
whether connection errors should be retried. Update the `attempts` generation
inside the script returned by this helper so that `scpFromRemoteCmd` failures
like connection refused/DNS/no route can still fall through to
`retryableConnectionCheck` instead of immediately `exit`ing, while still
preserving the final failure exit behavior in the wrapper around
`__slurm_frontend_rc`.
…rc 255 scp/sshpass return connection failures as rc 1 (connection refused, no route to host, DNS) as often as 255, but the failover script exited on any rc != 255 before retryableConnectionCheck ran -- so a dead frontend returning 1 stopped failover instead of trying the next remote. Drop the exit-code gate and let the message-based check decide for every non-zero rc; its default case still exits with the original rc, so non-connection errors (e.g. missing file) give up as before. Signed-off-by: Derek Pitman <dpitman@nvidia.com>
|
/bot run |
|
PR_Github #56130 [ run ] triggered by Bot. Commit: |
|
PR_Github #56130 [ run ] completed with state
|
|
/bot run |
|
PR_Github #56379 [ run ] triggered by Bot. Commit: |
|
PR_Github #56379 [ run ] completed with state
|
|
/bot run |
👎 Promotion blocked, new vulnerability foundVulnerability report
|
|
/bot run |
|
PR_Github #56392 [ run ] triggered by Bot. Commit: |
…probe selectReachableSlurmRemote ran an ssh-true reachability probe to every frontend (each wrapped in a timeout) on entry to "Check If Node Is Online", but its result was never used -- the Phase-1 poll loop and checkSlurmJobActive both go through the failover wrapper independently. Removing the dead call cuts redundant SSH heartbeat load on the login nodes with no behavior change. Signed-off-by: Derek Pitman <dpitman@nvidia.com>
|
PR_Github #56392 [ run ] completed with state
|
…cycle The per-operation failover helpers each re-randomize the login node, so a stage's submit -> read-metadata -> track -> collect steps scattered across different frontends. On clusters whose login nodes don't share the job workspace, submit wrote slurm_job_id.txt to one frontend while the metadata read hit another -> empty -> "No job ID found". Observed on aws-pdx B300 in build 45235. Run the whole sbatch submit/metadata/track sequence on the single frontend the enclosing withSlurmFrontendFailover already pins (failover re-runs the closure as a unit; the submit script reuses an active job), and pin one reachable frontend for the uploadResults collect. Drop the now-unused scpFromSlurmFrontendCmd. Controller-side sacct/scontrol queries keep per-call failover -- they are frontend-agnostic. Signed-off-by: Derek Pitman <dpitman@nvidia.com>
|
/bot run |
|
PR_Github #56438 [ run ] triggered by Bot. Commit: |
|
PR_Github #56438 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #56464 [ run ] triggered by Bot. Commit: |
|
PR_Github #56464 [ run ] completed with state
|
Summary by CodeRabbit
Description
SLURM frontend swapping capabilities have been added to the infrastructure supporting libraries, now making it so that the L0_Test code makes use of the helper functions that do the frontend swapping.
Test Coverage
N/A, this is a CI change
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either
api-compatibleorapi-breaking. Forapi-breaking, includeBREAKINGin the PR title.Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.