[TRTLLMINF-111][infra] Reuse image sqsh file #15147

Open
EmmaQiaoCh wants to merge 3 commits into NVIDIA:main from EmmaQiaoCh:emma/reduce_image_download_time

Conversation

@EmmaQiaoCh EmmaQiaoCh commented Jun 9, 2026

Collaborator

Summary by CodeRabbit

  • Chores
    • Optimized container image caching in the testing pipeline to improve resource efficiency and reduce redundant downloads across concurrent jobs
    • Enhanced cache management to retain commonly-used images while removing aged artifacts
    • Improved concurrent job handling for safer image cache sharing during parallel test execution

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@EmmaQiaoCh EmmaQiaoCh requested review from a team as code owners June 9, 2026 07:23
@EmmaQiaoCh EmmaQiaoCh requested review from dpitman-nvda and niukuo June 9, 2026 07:23
@EmmaQiaoCh EmmaQiaoCh changed the title from "[TRTLLMINF-112][infra] Reuse image sqsh file" to "[TRTLLMINF-111][infra] Reuse image sqsh file" Jun 9, 2026
@EmmaQiaoCh

Collaborator Author

/bot run

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Contributor

📝 Walkthrough

The Jenkins pipeline updates Enroot container image caching from a per-job model to a shared digest-based approach. The SLURM prologue now computes image digests, caches images in shared directories with flock-based concurrency control, and retries imports with backoff. Cleanup functions are updated to age-prune shared cached images instead of removing job-specific artifacts.
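The lock, temp-file, publish, and retry sequence described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `import_image` is a hypothetical stand-in for the real import step, and `ensure_cached_image`, its arguments, and the one-second initial delay are all assumptions made for the example.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the real import step (for example an `enroot import`
# call). Arguments: <image reference> <output file>.
import_image() {
    printf 'placeholder squashfs for %s\n' "$1" > "$2"
}

# Print the path of the cached image for reference $1 in directory $2,
# importing it first if no other job has done so. $3 = max attempts (default 3).
ensure_cached_image() {
    local container="$1" cache_dir="$2" max_attempts="${3:-3}"
    local key base
    # Key derivation as quoted in the review: a hash of the reference string.
    key=$(printf '%s' "$container" | sha256sum | cut -d' ' -f1)
    base="$cache_dir/container-$key"
    mkdir -p "$cache_dir"

    (
        flock 9 || exit 1                  # exclusive lock; blocks until acquired
        if [ -f "$base.sqsh" ]; then
            touch "$base.sqsh"             # refresh mtime so pruning keeps hot images
            exit 0
        fi
        attempt=1
        delay=1
        while [ "$attempt" -le "$max_attempts" ]; do
            rm -f "$base.tmp"
            if import_image "$container" "$base.tmp"; then
                mv "$base.tmp" "$base.sqsh"   # rename in one directory publishes atomically
                exit 0
            fi
            if [ "$attempt" -lt "$max_attempts" ]; then
                sleep "$delay"
            fi
            attempt=$((attempt + 1))
            delay=$((delay * 2))           # exponential backoff
        done
        rm -f "$base.tmp"
        exit 1
    ) 9>"$base.lock" || return 1

    printf '%s\n' "$base.sqsh"
}
```

A job would call something like `enrootImagePath=$(ensure_cached_image "$container" "$containerDir")`. Writing to the `.tmp` name first means a concurrent reader never sees a half-written `.sqsh`, and the lock ensures only one job per key performs the import while the others wait and then reuse the result.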

Changes

Shared Enroot image caching by digest

| Layer / File(s) | Summary |
| --- | --- |
| Shared image caching prologue with digest and locking (`jenkins/L0_Test.groovy`) | The srunPrologue in runLLMTestlistWithSbatch is rewritten to cache Enroot images by Docker image digest under a shared cluster scratch cache directory. Image import is protected by flock on a lock file, written to a temp file first, and atomically published via mv after successful import. Existing cached images are reused with mtime refresh. Import operations are retried up to max_attempts with exponential backoff. |
| Cleanup commands for shared cache pruning (`jenkins/L0_Test.groovy`) | cleanUpSlurmResources and cleanUpNodeResources are updated to age-prune shared container-*.sqsh cache files and delete stale container-*.tmp and container-*.lock artifacts, replacing the prior deletion of job-specific cached images. |
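
The age-based pruning summarized in the second row might look something like the sketch below. The function name `prune_image_cache`, the retention window passed as minutes, and the choice to apply the same window to `.tmp` and `.lock` files are assumptions for illustration; the actual thresholds used in the PR are not shown in this conversation.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Remove cache entries in directory $1 whose mtime is older than $2 minutes.
# Both arguments are illustrative; the real cleanup may use other thresholds.
prune_image_cache() {
    local cache_dir="$1" max_age_min="$2"

    # Cached images: a reuse refreshes the mtime, so only images that no job
    # has touched within the window match "-mmin +N".
    find "$cache_dir" -maxdepth 1 -type f -name 'container-*.sqsh' \
        -mmin "+$max_age_min" -delete

    # Leftover temp and lock files from interrupted imports.
    find "$cache_dir" -maxdepth 1 -type f \
        \( -name 'container-*.tmp' -o -name 'container-*.lock' \) \
        -mmin "+$max_age_min" -delete
}
```

One caveat with this kind of cleanup: deleting a lock file that a running job still holds open lets a second job create and lock a new file under the same name, so the window for `.lock` files should probably be comfortably longer than the slowest expected import.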

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is empty: only the unfilled repository template remains, with no explanation of the issue, solution, test coverage, or checklist items addressed. | Fill in the Description section to explain the issue and rationale, the Test Coverage section to list relevant safeguarding tests, and mark PR Checklist items to confirm guidelines compliance. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly describes the main change: reusing Enroot container image sqsh files instead of deleting them per-job, which aligns with the AI-generated summary of the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@jenkins/L0_Test.groovy`:
- Around line 1128-1129: The cache key currently uses a sha256 of the container
string (variable container) via imageDigest and then builds enrootImagePath from
containerDir/container-${imageDigest}.sqsh, which yields a tag-based key not the
registry content digest; change the logic to resolve the image's
manifest/registry digest first (or require a digest-pinned reference) and assign
that canonical digest to imageDigest before constructing enrootImagePath (e.g.,
use a registry inspection tool to obtain the manifest digest for the reference
in container and fall back to error if a digest cannot be resolved), ensuring
subsequent cache lookups use the real image content digest rather than a hash of
the reference string.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 384250be-19d0-4970-a6bc-e1ba46f38611

📥 Commits

Reviewing files that changed from the base of the PR and between 6254f3a and f29cfff.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy

Comment thread jenkins/L0_Test.groovy
Comment on lines +1128 to +1129
imageDigest=\$(printf '%s' "$container" | sha256sum | cut -d' ' -f1)
export enrootImagePath="\$containerDir/container-\${imageDigest}.sqsh"

Contributor

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Key the shared cache on the real image digest, not a hash of the image string.

Line 1128 hashes "$container" itself, so this is still a tag/reference cache, not a content-addressed cache. If the same tag is republished, later jobs will keep reusing the stale .sqsh until it ages out, which means CI can run against the wrong container contents. Resolve the registry/manifest digest first (or require a digest-pinned image reference) and use that value for enrootImagePath.
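
As a rough illustration of the "require a digest-pinned image reference" option, the sketch below derives the cache key from the `@sha256:` part of the reference and refuses tag-only references. The helper name `pinned_digest` and its error messages are hypothetical, and the snippet is plain bash without the `\$` escaping the Groovy string in the PR needs. Resolving a tag to its manifest digest through a registry inspection tool would be the alternative and is not shown here.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print the hex digest of a digest-pinned reference (name@sha256:<64 hex chars>).
# Fails for tag-only or malformed references instead of hashing the string.
pinned_digest() {
    local ref="$1" digest
    case "$ref" in
        *@sha256:*) digest="${ref##*@sha256:}" ;;
        *)
            echo "image reference is not digest-pinned: $ref" >&2
            return 1
            ;;
    esac
    if [[ ! "$digest" =~ ^[0-9a-f]{64}$ ]]; then
        echo "malformed sha256 digest in: $ref" >&2
        return 1
    fi
    printf '%s\n' "$digest"
}
```

With such a helper the prologue could set `imageDigest=$(pinned_digest "$container")` before building `enrootImagePath`, so a republished tag maps to a new cache file rather than silently reusing the old `.sqsh`.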


@tensorrt-cicd

Collaborator

PR_Github #53000 [ run ] triggered by Bot. Commit: f29cfff Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53000 [ run ] completed with state SUCCESS. Commit: f29cfff
/LLM/main/L0_MergeRequest_PR pipeline #42227 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
@EmmaQiaoCh EmmaQiaoCh force-pushed the emma/reduce_image_download_time branch from f29cfff to dbdcc82 on June 10, 2026 06:08
@EmmaQiaoCh

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #53249 [ run ] triggered by Bot. Commit: dbdcc82 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53249 [ run ] completed with state SUCCESS. Commit: dbdcc82
/LLM/main/L0_MergeRequest_PR pipeline #42444 completed with status: 'FAILURE'

CI Report

CI Agent Failure Analysis

Link to invocation

@EmmaQiaoCh

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #53494 [ run ] triggered by Bot. Commit: dbdcc82 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53494 [ run ] completed with state SUCCESS. Commit: dbdcc82
/LLM/main/L0_MergeRequest_PR pipeline #42653 completed with status: 'SUCCESS'

CI Report

Link to invocation

@EmmaQiaoCh

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #53544 [ run ] triggered by Bot. Commit: 8a22dd6 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53544 [ run ] completed with state SUCCESS. Commit: 8a22dd6
/LLM/main/L0_MergeRequest_PR pipeline #42695 completed with status: 'FAILURE'

CI Report

CI Agent Failure Analysis

Link to invocation

@EmmaQiaoCh

Collaborator Author

/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1"

@tensorrt-cicd

Collaborator

PR_Github #53833 [ run ] triggered by Bot. Commit: 8a22dd6 Link to invocation

@EmmaQiaoCh

Collaborator Author

/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1"

@tensorrt-cicd

Collaborator

PR_Github #53894 [ run ] triggered by Bot. Commit: 8a22dd6 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53833 [ run ] completed with state ABORTED. Commit: 8a22dd6

Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53894 [ run ] completed with state SUCCESS. Commit: 8a22dd6
/LLM/main/L0_MergeRequest_PR pipeline #42992 (Partly Tested) completed with status: 'FAILURE'

CI Report

CI Agent Failure Analysis

Link to invocation

@EmmaQiaoCh EmmaQiaoCh force-pushed the emma/reduce_image_download_time branch from 8a22dd6 to dbdcc82 on June 13, 2026 02:45
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
@EmmaQiaoCh

Collaborator Author

/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1"

@tensorrt-cicd

Collaborator

PR_Github #53999 [ run ] triggered by Bot. Commit: 155b3ed Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53999 [ run ] completed with state SUCCESS. Commit: 155b3ed
/LLM/main/L0_MergeRequest_PR pipeline #43084 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@EmmaQiaoCh

Collaborator Author

/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1"

@tensorrt-cicd

Collaborator

PR_Github #54032 [ run ] triggered by Bot. Commit: 155b3ed Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54032 [ run ] completed with state SUCCESS. Commit: 155b3ed
/LLM/main/L0_MergeRequest_PR pipeline #43116 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@EmmaQiaoCh

Collaborator Author

/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1" --disable-reuse-test

@tensorrt-cicd

Collaborator

PR_Github #54045 [ run ] triggered by Bot. Commit: 155b3ed Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54045 [ run ] completed with state FAILURE. Commit: 155b3ed
/LLM/main/L0_MergeRequest_PR pipeline #43129 (Partly Tested) completed with status: 'FAILURE'

CI Report

CI Agent Failure Analysis

Link to invocation

@EmmaQiaoCh

Collaborator Author

/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1" --disable-reuse-test

@tensorrt-cicd

Collaborator

PR_Github #54052 [ run ] triggered by Bot. Commit: 155b3ed Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54052 [ run ] completed with state SUCCESS. Commit: 155b3ed
/LLM/main/L0_MergeRequest_PR pipeline #43135 (Partly Tested) completed with status: 'FAILURE'

CI Report

CI Agent Failure Analysis

Link to invocation

@EmmaQiaoCh

Collaborator Author

/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1" --disable-reuse-test

@tensorrt-cicd

Collaborator

PR_Github #54073 [ run ] triggered by Bot. Commit: 155b3ed Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54073 [ run ] completed with state SUCCESS. Commit: 155b3ed
/LLM/main/L0_MergeRequest_PR pipeline #43156 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
@EmmaQiaoCh

Collaborator Author

/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1" --disable-reuse-test

@tensorrt-cicd

Collaborator

PR_Github #54097 [ run ] triggered by Bot. Commit: d2adbe6 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54097 [ run ] completed with state FAILURE. Commit: d2adbe6
/LLM/main/L0_MergeRequest_PR pipeline #43181 (Partly Tested) completed with status: 'FAILURE'

CI Report

CI Agent Failure Analysis

Link to invocation

@EmmaQiaoCh

Collaborator Author

/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1" --disable-reuse-test --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54105 [ run ] triggered by Bot. Commit: d2adbe6 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54105 [ run ] completed with state SUCCESS. Commit: d2adbe6
/LLM/main/L0_MergeRequest_PR pipeline #43188 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation


2 participants