[TRTLLMINF-99][infra] Add SLURM frontend failover to L0 by dpitman-nvda · Pull Request #15674 · NVIDIA/TensorRT-LLM

dpitman-nvda · 2026-06-26T20:29:51Z

Summary by CodeRabbit

Bug Fixes
- Improved SLURM job handling during frontend/controller interruptions, making test runs and cleanup more resilient.
- Test result uploads now recover more reliably when a SLURM frontend becomes unreachable.
- Job tracking now better distinguishes timeouts from transient failures, reducing unnecessary retries.
- Existing active SLURM jobs are now reused more consistently instead of being canceled and resubmitted.
- Startup and metadata checks now surface connection issues more accurately for better retry behavior.

Description

The infrastructure support libraries now provide SLURM frontend swapping (failover) helpers. This PR updates the L0_Test code to call those helpers instead of talking to a single frontend directly.

Test Coverage

N/A, this is a CI change

PR Checklist

Please review the following before submitting your PR:

- PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
- Test cases are provided for new code paths (see test instructions)
- If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
- Any new dependencies have been scanned for license and vulnerabilities
- CODEOWNERS updated if ownership changes
- Documentation updated as needed
- Update tava architecture diagram if there is a significant design change in PR.
- The reviewers assigned automatically/manually are appropriate for the PR.
- Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Derek Pitman <dpitman@nvidia.com>

coderabbitai · 2026-06-26T20:36:36Z

Walkthrough

The Jenkins SLURM pipeline now routes SSH and SCP operations through frontend-failover helpers across result upload, agent bring-up, cleanup, and sbatch execution. The sbatch path reuses active jobs from stored job ids, re-queries allocation state on tracking failures, and raises UserFailure on TIMEOUT.

Changes

SLURM frontend failover

| Layer / File(s) | Summary |
| --- | --- |
| Result upload failover (`jenkins/L0_Test.groovy`) | Adds `scpFromSlurmFrontendCmd()` and switches `uploadResults()` to frontend-aware remotes for timeout, regular, and perf artifact transfers. |
| Agent bring-up checks (`jenkins/L0_Test.groovy`) | Routes `runLLMTestlistWithAgent()` request-node, node-online, Phase 1 polling, and ENROOT log retrieval through reachable remotes and frontend failover. |
| Cleanup and node metadata (`jenkins/L0_Test.groovy`) | Moves SLURM cleanup, job-state lookup, and node-list capture onto `withSlurmFrontendFailover`. |
| SBATCH retry and tracking (`jenkins/L0_Test.groovy`) | Updates `runLLMTestlistWithSbatch()` to reuse active jobs, submit and track via frontend failover, and rethrow frontend connection failures while reading job metadata. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant runLLMTestlistWithSbatch
  participant withSlurmFrontendFailover
  participant "SLURM frontend" as SLURMFrontend
  participant "SLURM allocation" as SLURMAllocation

  runLLMTestlistWithSbatch->>withSlurmFrontendFailover: submit Run Pytest and read slurm_job_id.txt
  withSlurmFrontendFailover->>SLURMFrontend: SSH commands through frontend failover
  SLURMFrontend->>SLURMAllocation: reuse or submit sbatch job
  SLURMAllocation-->>SLURMFrontend: job id and state
  runLLMTestlistWithSbatch->>withSlurmFrontendFailover: track job via frontend failover
  withSlurmFrontendFailover->>SLURMAllocation: query allocation-level state
  SLURMAllocation-->>withSlurmFrontendFailover: TIMEOUT
  withSlurmFrontendFailover-->>runLLMTestlistWithSbatch: throw UserFailure
```
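The tracking decision described in the walkthrough (reuse an active job, stop on TIMEOUT, retry otherwise) can be sketched in shell as a small state classifier. This is only an illustration, not the code in `jenkins/L0_Test.groovy`: the function name `classify_job_state` and the action labels are hypothetical, and mapping every remaining state to `resubmit` is an assumption. The state names are standard SLURM job states.

```shell
#!/bin/sh
# Hypothetical sketch: map a SLURM job state to the action the pipeline
# would take. Names and action labels are illustrative assumptions.
classify_job_state() {
    # Keep only the first word, so "CANCELLED by 1234" becomes "CANCELLED".
    state="${1%% *}"
    case "$state" in
        PENDING|CONFIGURING|RUNNING|COMPLETING)
            # The stored job id still points at an active job: reuse it.
            echo reuse ;;
        TIMEOUT)
            # The allocation ran out of time: report a user failure, do not retry.
            echo user_failure ;;
        COMPLETED)
            echo done ;;
        *)
            # Assumption: anything else is treated as transient and resubmitted.
            echo resubmit ;;
    esac
}
```

The input would typically be the allocation-level state reported by the SLURM controller, with surrounding whitespace trimmed before it is passed in.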

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title is concise, specific, and accurately summarizes the main SLURM frontend failover change. |
| Description check | ✅ Passed | The description follows the template and includes a clear summary, test coverage note, and completed PR checklist. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Comment `@coderabbitai help` to get the list of available commands.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@jenkins/L0_Test.groovy`:
- Around line 200-221: The failover logic in the scp retry block exits too early
on any non-255 return code, so `retryableConnectionCheck` in the `attempts` loop
never gets a chance to decide whether connection errors should be retried.
Update the `attempts` generation inside the script returned by this helper so
that `scpFromRemoteCmd` failures like connection refused/DNS/no route can still
fall through to `retryableConnectionCheck` instead of immediately `exit`ing,
while still preserving the final failure exit behavior in the wrapper around
`__slurm_frontend_rc`.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 755c117c-d635-445d-8070-f415b423cc49

📥 Commits

Reviewing files that changed from the base of the PR and between a9f90a4 and b4a2e4b.

📒 Files selected for processing (1)

jenkins/L0_Test.groovy

coderabbitai · 2026-06-26T20:36:39Z

```diff
+    def attempts = remotes.collect { remote -> """
+        echo '[SLURM-FRONTEND] trying ${remote.host}' >&2
+        __slurm_frontend_output=\$({ ${scpFromRemoteCmd(remote, remotePath, localPath)}; } 2>&1)
+        __slurm_frontend_rc=\$?
+        printf '%s\\n' "\$__slurm_frontend_output" >&2
+        if [ \$__slurm_frontend_rc -eq 0 ]; then
+            exit 0
+        fi
+        if [ \$__slurm_frontend_rc -ne 255 ]; then
+            exit \$__slurm_frontend_rc
+        fi
+        ${retryableConnectionCheck}
+    """.stripIndent().trim() }
+
+    return """
+        (
+            __slurm_frontend_rc=255
+            ${attempts.join("\n")}
+            exit \$__slurm_frontend_rc
+        )
+    """.stripIndent().replaceAll(/\s+$/, "")
+}
```


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

What exit code does OpenSSH scp return on a connection failure (connection refused, no route to host, could not resolve hostname), and did this change in OpenSSH 9.0 SFTP-mode scp?

💡 Result:

When OpenSSH scp encounters a connection failure, such as "connection refused," "no route to host," or a failure to resolve the hostname, it exits with a non-zero status code, typically 1 [1][2][3][4][5]. Official documentation for OpenSSH scp states generally that the utility exits with 0 on success and >0 if an error occurs [1][4]. The transition in OpenSSH 9.0 to using the SFTP protocol by default did not change this exit behavior [6][4][5]. The primary changes introduced in OpenSSH 9.0 regarding scp involved the underlying protocol used for file transfers (switching from the legacy scp/rcp protocol to SFTP) and differences in how remote path characters are handled [6][7]. The exit status remains defined as 0 for success and a non-zero value (typically 1) when an error occurs [1][4]. Because scp relies on the ssh(1) utility to establish the underlying connection, these connection-level errors (which occur during the ssh handshake phase) result in the same non-zero exit status regardless of whether the subsequent file transfer protocol is legacy SCP or SFTP [8][4][9].

Citations:

1: https://man.openbsd.org/OpenBSD-7.2/scp.1

2: https://manpages.debian.org/bullseye/openssh-client/scp.1.en.html

3: https://man.archlinux.org/man/core/openssh/scp.1.en

4: https://man7.org/linux/man-pages/man1/scp.1.html

5: https://manpages.debian.org/unstable/openssh-client/scp.1.en.html

6: https://www.openssh.org/txt/release-9.0

7: http://lists.mindrot.org/pipermail/openssh-unix-dev/2021-September/039648.html

8: https://superuser.com/questions/577911/scp-lost-connection-but-ssh-works-fine

9: https://stackoverflow.com/questions/62445517/scp-to-a-remote-server-ssh-connect-to-host-serveraddress-it-port-22-connecti

🏁 Script executed:

sed -n '150,240p' jenkins/L0_Test.groovy

Repository: NVIDIA/TensorRT-LLM

Length of output: 4353

scp failover never runs for non-255 connection errors.

OpenSSH scp commonly returns 1 for connection failures like connection refused, no route to host, and DNS resolution errors, but this branch exits immediately for any rc != 255 before retryableConnectionCheck can run. That means a dead frontend can still stop failover instead of trying the next remote.


…rc 255

scp/sshpass return connection failures as rc 1 (connection refused, no route to host, DNS) as often as 255, but the failover script exited on any rc != 255 before retryableConnectionCheck ran -- so a dead frontend returning 1 stopped failover instead of trying the next remote.

Drop the exit-code gate and let the message-based check decide for every non-zero rc; its default case still exits with the original rc, so non-connection errors (e.g. missing file) give up as before.

Signed-off-by: Derek Pitman <dpitman@nvidia.com>
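A minimal shell sketch of the behavior this commit describes: every non-zero return code is classified by its error message, connection-level failures move on to the next frontend, and anything else gives up with the original return code. The function names `is_connection_failure` and `copy_with_failover` are hypothetical, and the list of matched messages is an assumption based on common OpenSSH error text; the real `retryableConnectionCheck` body is not shown in this thread.

```shell
#!/bin/sh
# Hypothetical sketch, not the helper from jenkins/L0_Test.groovy.

# Succeeds when the captured output looks like a connection-level failure.
# Assumption: these substrings are the ones worth retrying on.
is_connection_failure() {
    case "$1" in
        *"Connection refused"*|*"No route to host"*|*"Could not resolve hostname"*|*"Connection timed out"*)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Usage: copy_with_failover COPY_FN HOST...
# COPY_FN is a command or function that takes one host and performs the copy.
copy_with_failover() {
    copy_fn="$1"; shift
    rc=255
    for host in "$@"; do
        echo "[SLURM-FRONTEND] trying ${host}" >&2
        output=$("$copy_fn" "$host" 2>&1) && rc=0 || rc=$?
        printf '%s\n' "$output" >&2
        if [ "$rc" -eq 0 ]; then
            return 0
        fi
        # No exit-code gate: the message decides for every non-zero rc.
        if ! is_connection_failure "$output"; then
            # Not a connection problem (for example a missing file): give up
            # with the original return code.
            return "$rc"
        fi
    done
    # Every frontend failed with a connection error.
    return "$rc"
}
```

With this shape, a frontend that fails with rc 1 and "Connection refused" no longer ends the loop, while a missing remote file still fails immediately without trying the other frontends.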

dpitman-nvda · 2026-06-26T21:00:22Z

/bot run

tensorrt-cicd · 2026-06-26T21:05:53Z

PR_Github #56130 [ run ] triggered by Bot. Commit: e875d38 Link to invocation

tensorrt-cicd · 2026-06-27T02:40:57Z

PR_Github #56130 [ run ] completed with state FAILURE. Commit: e875d38
/LLM/main/L0_MergeRequest_PR pipeline #44993 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

- Please check the failed tests and fix your PR
- If you cannot view the failures, ask the CI triggerer to share details
- Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

dpitman-nvda · 2026-06-29T13:49:39Z

/bot run

tensorrt-cicd · 2026-06-29T13:55:50Z

PR_Github #56379 [ run ] triggered by Bot. Commit: e875d38 Link to invocation

tensorrt-cicd · 2026-06-29T14:38:31Z

PR_Github #56379 [ run ] completed with state SUCCESS. Commit: e875d38
/LLM/main/L0_MergeRequest_PR pipeline #45223 completed with status: 'FAILURE'

dpitman-nvda · 2026-06-29T14:40:49Z

/bot run

github-actions · 2026-06-29T14:46:06Z

👎 Promotion blocked, new vulnerability found

Vulnerability report

| Component | Vulnerability | Description | Severity |
| --- | --- | --- | --- |
| pytorch | CVE-2025-3000 | A vulnerability classified as critical has been found in PyTorch 2.6.0. This affects the function torch.jit.script. The manipulation leads to memory corruption. It is possible to launch the attack on the local host. The exploit has been disclosed to the public and may be used. | MEDIUM |

dpitman-nvda · 2026-06-29T14:55:55Z

/bot run

tensorrt-cicd · 2026-06-29T15:02:05Z

PR_Github #56392 [ run ] triggered by Bot. Commit: e875d38 Link to invocation

…probe

selectReachableSlurmRemote ran an ssh-true reachability probe to every frontend (each wrapped in a timeout) on entry to "Check If Node Is Online", but its result was never used -- the Phase-1 poll loop and checkSlurmJobActive both go through the failover wrapper independently. Removing the dead call cuts redundant SSH heartbeat load on the login nodes with no behavior change.

Signed-off-by: Derek Pitman <dpitman@nvidia.com>

tensorrt-cicd · 2026-06-29T17:13:23Z

PR_Github #56392 [ run ] completed with state SUCCESS. Commit: e875d38
/LLM/main/L0_MergeRequest_PR pipeline #45235 completed with status: 'FAILURE'

…cycle

The per-operation failover helpers each re-randomize the login node, so a stage's submit -> read-metadata -> track -> collect steps scattered across different frontends. On clusters whose login nodes don't share the job workspace, submit wrote slurm_job_id.txt to one frontend while the metadata read hit another -> empty -> "No job ID found". Observed on aws-pdx B300 in build 45235.

Run the whole sbatch submit/metadata/track sequence on the single frontend the enclosing withSlurmFrontendFailover already pins (failover re-runs the closure as a unit; the submit script reuses an active job), and pin one reachable frontend for the uploadResults collect. Drop the now-unused scpFromSlurmFrontendCmd. Controller-side sacct/scontrol queries keep per-call failover -- they are frontend-agnostic.

Signed-off-by: Derek Pitman <dpitman@nvidia.com>
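The pinning idea in this commit can be sketched in shell: all steps of one cycle run against the same frontend, and a connection failure restarts the whole cycle on the next frontend rather than moving a single step elsewhere. The function name `run_cycle_with_failover`, the step names, and the use of rc 255 as the "frontend unreachable" signal are simplifying assumptions for illustration; the real wrapper classifies connection errors by message and re-runs a Groovy closure.

```shell
#!/bin/sh
# Hypothetical sketch of "pin one frontend per cycle, fail over as a unit".
# Usage: run_cycle_with_failover RUNNER HOST...
# RUNNER is a command or function called as: RUNNER <host> <step>.
# Assumption: RUNNER returns 255 when the frontend itself is unreachable.
run_cycle_with_failover() {
    runner="$1"; shift
    rc=255
    for host in "$@"; do
        rc=0
        # Every step of the cycle talks to the same frontend, so files such as
        # slurm_job_id.txt are written and read on one login node.
        for step in submit read_metadata track; do
            "$runner" "$host" "$step" || { rc=$?; break; }
        done
        # Success or a non-connection failure ends the loop; only an
        # unreachable frontend moves the whole cycle to the next host.
        [ "$rc" -eq 255 ] || return "$rc"
    done
    return "$rc"
}
```

Restarting from `submit` on the next frontend is only safe because, as the commit notes, the submit script reuses an already active job instead of submitting a second one.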

dpitman-nvda · 2026-06-29T18:26:15Z

/bot run

tensorrt-cicd · 2026-06-29T18:33:13Z

PR_Github #56438 [ run ] triggered by Bot. Commit: 4390ce2 Link to invocation

tensorrt-cicd · 2026-06-29T21:09:40Z

PR_Github #56438 [ run ] completed with state FAILURE. Commit: 4390ce2
/LLM/main/L0_MergeRequest_PR pipeline #45281 completed with status: 'FAILURE'

dpitman-nvda · 2026-06-29T21:19:45Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-29T21:25:11Z

PR_Github #56464 [ run ] triggered by Bot. Commit: fc21e59 Link to invocation

tensorrt-cicd · 2026-06-30T03:01:21Z

PR_Github #56464 [ run ] completed with state FAILURE. Commit: fc21e59
/LLM/main/L0_MergeRequest_PR pipeline #45305 completed with status: 'FAILURE'

dpitman-nvda requested review from a team as code owners June 26, 2026 20:29

dpitman-nvda requested review from mlefeb01 and tburt-nv June 26, 2026 20:29

github-actions Bot assigned dpitman-nvda Jun 26, 2026

dpitman-nvda changed the title ~~[TRTLLMINF-99][test] Add SLURM frontend failover to L0~~ [TRTLLMINF-99][infra] Add SLURM frontend failover to L0 Jun 26, 2026

[TRTLLMINF-81][test] Add SLURM frontend failover to L0

742fbcf

Signed-off-by: Derek Pitman <dpitman@nvidia.com>

dpitman-nvda force-pushed the feat/slurm-frontend-failover-l0 branch from b4a2e4b to 742fbcf Compare June 26, 2026 20:32

coderabbitai Bot reviewed Jun 26, 2026

Merge branch 'main' into feat/slurm-frontend-failover-l0

fc21e59
