Skip to content

[TRTLLMINF-99][infra] Add SLURM frontend failover to L0#15674

Open
dpitman-nvda wants to merge 5 commits into
NVIDIA:mainfrom
dpitman-nvda:feat/slurm-frontend-failover-l0
Open

[TRTLLMINF-99][infra] Add SLURM frontend failover to L0#15674
dpitman-nvda wants to merge 5 commits into
NVIDIA:mainfrom
dpitman-nvda:feat/slurm-frontend-failover-l0

Conversation

@dpitman-nvda

@dpitman-nvda dpitman-nvda commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

Summary by CodeRabbit

  • Bug Fixes
    • Improved SLURM job handling during frontend/controller interruptions, making test runs and cleanup more resilient.
    • Test result uploads now recover more reliably when a SLURM frontend becomes unreachable.
    • Job tracking now better distinguishes timeouts from transient failures, reducing unnecessary retries.
    • Existing active SLURM jobs are now reused more consistently instead of being canceled and resubmitted.
    • Startup and metadata checks now surface connection issues more accurately for better retry behavior.

Description

SLURM frontend swapping capabilities have been added to the infrastructure supporting libraries, now making it so that the L0_Test code makes use of the helper functions that do the frontend swapping.

Test Coverage

N/A, this is a CI change

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@dpitman-nvda dpitman-nvda requested review from a team as code owners June 26, 2026 20:29
@dpitman-nvda dpitman-nvda changed the title [TRTLLMINF-99][test] Add SLURM frontend failover to L0 [TRTLLMINF-99][infra] Add SLURM frontend failover to L0 Jun 26, 2026
Signed-off-by: Derek Pitman <dpitman@nvidia.com>
@dpitman-nvda dpitman-nvda force-pushed the feat/slurm-frontend-failover-l0 branch from b4a2e4b to 742fbcf Compare June 26, 2026 20:32
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

The Jenkins SLURM pipeline now routes SSH and SCP operations through frontend-failover helpers across result upload, agent bring-up, cleanup, and sbatch execution. The sbatch path reuses active jobs from stored job ids, re-queries allocation state on tracking failures, and raises UserFailure on TIMEOUT.

Changes

SLURM frontend failover

Layer / File(s) Summary
Result upload failover
jenkins/L0_Test.groovy
Adds scpFromSlurmFrontendCmd() and switches uploadResults() to frontend-aware remotes for timeout, regular, and perf artifact transfers.
Agent bring-up checks
jenkins/L0_Test.groovy
Routes runLLMTestlistWithAgent() request-node, node-online, Phase 1 polling, and ENROOT log retrieval through reachable remotes and frontend failover.
Cleanup and node metadata
jenkins/L0_Test.groovy
Moves SLURM cleanup, job-state lookup, and node-list capture onto withSlurmFrontendFailover.
SBATCH retry and tracking
jenkins/L0_Test.groovy
Updates runLLMTestlistWithSbatch() to reuse active jobs, submit and track via frontend failover, and rethrow frontend connection failures while reading job metadata.

Sequence Diagram(s)

sequenceDiagram
  participant runLLMTestlistWithSbatch
  participant withSlurmFrontendFailover
  participant "SLURM frontend" as SLURMFrontend
  participant "SLURM allocation" as SLURMAllocation

  runLLMTestlistWithSbatch->>withSlurmFrontendFailover: submit Run Pytest and read slurm_job_id.txt
  withSlurmFrontendFailover->>SLURMFrontend: SSH commands through frontend failover
  SLURMFrontend->>SLURMAllocation: reuse or submit sbatch job
  SLURMAllocation-->>SLURMFrontend: job id and state
  runLLMTestlistWithSbatch->>withSlurmFrontendFailover: track job via frontend failover
  withSlurmFrontendFailover->>SLURMAllocation: query allocation-level state
  SLURMAllocation-->>withSlurmFrontendFailover: TIMEOUT
  withSlurmFrontendFailover-->>runLLMTestlistWithSbatch: throw UserFailure
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise, specific, and accurately summarizes the main SLURM frontend failover change.
Description check ✅ Passed The description follows the template and includes a clear summary, test coverage note, and completed PR checklist.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@jenkins/L0_Test.groovy`:
- Around line 200-221: The failover logic in the scp retry block exits too early
on any non-255 return code, so `retryableConnectionCheck` in the `attempts` loop
never gets a chance to decide whether connection errors should be retried.
Update the `attempts` generation inside the script returned by this helper so
that `scpFromRemoteCmd` failures like connection refused/DNS/no route can still
fall through to `retryableConnectionCheck` instead of immediately `exit`ing,
while still preserving the final failure exit behavior in the wrapper around
`__slurm_frontend_rc`.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 755c117c-d635-445d-8070-f415b423cc49

📥 Commits

Reviewing files that changed from the base of the PR and between a9f90a4 and b4a2e4b.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy

Comment thread jenkins/L0_Test.groovy Outdated
Comment on lines +200 to +221
def attempts = remotes.collect { remote -> """
echo '[SLURM-FRONTEND] trying ${remote.host}' >&2
__slurm_frontend_output=\$({ ${scpFromRemoteCmd(remote, remotePath, localPath)}; } 2>&1)
__slurm_frontend_rc=\$?
printf '%s\\n' "\$__slurm_frontend_output" >&2
if [ \$__slurm_frontend_rc -eq 0 ]; then
exit 0
fi
if [ \$__slurm_frontend_rc -ne 255 ]; then
exit \$__slurm_frontend_rc
fi
${retryableConnectionCheck}
""".stripIndent().trim() }

return """
(
__slurm_frontend_rc=255
${attempts.join("\n")}
exit \$__slurm_frontend_rc
)
""".stripIndent().replaceAll(/\s+$/, "")
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

What exit code does OpenSSH scp return on a connection failure (connection refused, no route to host, could not resolve hostname), and did this change in OpenSSH 9.0 SFTP-mode scp?

💡 Result:

When OpenSSH scp encounters a connection failure, such as "connection refused," "no route to host," or a failure to resolve the hostname, it exits with a non-zero status code, typically 1 [1][2][3][4][5]. Official documentation for OpenSSH scp states generally that the utility exits with 0 on success and >0 if an error occurs [1][4]. The transition in OpenSSH 9.0 to using the SFTP protocol by default did not change this exit behavior [6][4][5]. The primary changes introduced in OpenSSH 9.0 regarding scp involved the underlying protocol used for file transfers (switching from the legacy scp/rcp protocol to SFTP) and differences in how remote path characters are handled [6][7]. The exit status remains defined as 0 for success and a non-zero value (typically 1) when an error occurs [1][4]. Because scp relies on the ssh(1) utility to establish the underlying connection, these connection-level errors (which occur during the ssh handshake phase) result in the same non-zero exit status regardless of whether the subsequent file transfer protocol is legacy SCP or SFTP [8][4][9].

Citations:


🏁 Script executed:

sed -n '150,240p' jenkins/L0_Test.groovy

Repository: NVIDIA/TensorRT-LLM

Length of output: 4353


scp failover never runs for non-255 connection errors.

OpenSSH scp commonly returns 1 for connection failures like connection refused, no route to host, and DNS resolution errors, but this branch exits immediately for any rc != 255 before retryableConnectionCheck can run. That means a dead frontend can still stop failover instead of trying the next remote.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@jenkins/L0_Test.groovy` around lines 200 - 221, The failover logic in the scp
retry block exits too early on any non-255 return code, so
`retryableConnectionCheck` in the `attempts` loop never gets a chance to decide
whether connection errors should be retried. Update the `attempts` generation
inside the script returned by this helper so that `scpFromRemoteCmd` failures
like connection refused/DNS/no route can still fall through to
`retryableConnectionCheck` instead of immediately `exit`ing, while still
preserving the final failure exit behavior in the wrapper around
`__slurm_frontend_rc`.

…rc 255

scp/sshpass return connection failures as rc 1 (connection refused, no
route to host, DNS) as often as 255, but the failover script exited on any
rc != 255 before retryableConnectionCheck ran -- so a dead frontend
returning 1 stopped failover instead of trying the next remote. Drop the
exit-code gate and let the message-based check decide for every non-zero
rc; its default case still exits with the original rc, so non-connection
errors (e.g. missing file) give up as before.

Signed-off-by: Derek Pitman <dpitman@nvidia.com>
@dpitman-nvda

Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56130 [ run ] triggered by Bot. Commit: e875d38 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56130 [ run ] completed with state FAILURE. Commit: e875d38
/LLM/main/L0_MergeRequest_PR pipeline #44993 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@dpitman-nvda

Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56379 [ run ] triggered by Bot. Commit: e875d38 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56379 [ run ] completed with state SUCCESS. Commit: e875d38
/LLM/main/L0_MergeRequest_PR pipeline #45223 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@dpitman-nvda

Copy link
Copy Markdown
Collaborator Author

/bot run

@github-actions

Copy link
Copy Markdown

👎 Promotion blocked, new vulnerability found

Vulnerability report

Component Vulnerability Description Severity
pytorch CVE-2025-3000 A vulnerability classified as critical has been found in PyTorch 2.6.0. This affects the function torch.jit.script. The manipulation leads to memory corruption. It is possible to launch the attack on the local host. The exploit has been disclosed to the public and may be used. MEDIUM

@dpitman-nvda

Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56392 [ run ] triggered by Bot. Commit: e875d38 Link to invocation

…probe

selectReachableSlurmRemote ran an ssh-true reachability probe to every
frontend (each wrapped in a timeout) on entry to "Check If Node Is Online",
but its result was never used -- the Phase-1 poll loop and checkSlurmJobActive
both go through the failover wrapper independently. Removing the dead call
cuts redundant SSH heartbeat load on the login nodes with no behavior change.

Signed-off-by: Derek Pitman <dpitman@nvidia.com>
@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56392 [ run ] completed with state SUCCESS. Commit: e875d38
/LLM/main/L0_MergeRequest_PR pipeline #45235 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

…cycle

The per-operation failover helpers each re-randomize the login node, so a
stage's submit -> read-metadata -> track -> collect steps scattered across
different frontends. On clusters whose login nodes don't share the job
workspace, submit wrote slurm_job_id.txt to one frontend while the metadata
read hit another -> empty -> "No job ID found". Observed on aws-pdx B300 in
build 45235.

Run the whole sbatch submit/metadata/track sequence on the single frontend the
enclosing withSlurmFrontendFailover already pins (failover re-runs the closure
as a unit; the submit script reuses an active job), and pin one reachable
frontend for the uploadResults collect. Drop the now-unused
scpFromSlurmFrontendCmd. Controller-side sacct/scontrol queries keep
per-call failover -- they are frontend-agnostic.

Signed-off-by: Derek Pitman <dpitman@nvidia.com>
@dpitman-nvda

Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56438 [ run ] triggered by Bot. Commit: 4390ce2 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56438 [ run ] completed with state FAILURE. Commit: 4390ce2
/LLM/main/L0_MergeRequest_PR pipeline #45281 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@dpitman-nvda

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56464 [ run ] triggered by Bot. Commit: fc21e59 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56464 [ run ] completed with state FAILURE. Commit: fc21e59
/LLM/main/L0_MergeRequest_PR pipeline #45305 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants