[TRTLLMINF-111][infra] Reuse image sqsh file by EmmaQiaoCh · Pull Request #15147 · NVIDIA/TensorRT-LLM

EmmaQiaoCh · 2026-06-09T07:23:54Z

Summary by CodeRabbit

Chores
- Optimized container image caching in the testing pipeline to improve resource efficiency and reduce redundant downloads across concurrent jobs
- Enhanced cache management to retain commonly-used images while removing aged artifacts
- Improved concurrent job handling for safer image cache sharing during parallel test execution

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

EmmaQiaoCh · 2026-06-09T07:28:17Z

/bot run

coderabbitai · 2026-06-09T07:29:04Z

📝 Walkthrough

Walkthrough

The Jenkins pipeline updates Enroot container image caching from a per-job model to a shared digest-based approach. The SLURM prologue now computes image digests, caches images in shared directories with flock-based concurrency control, and retries imports with backoff. Cleanup functions are updated to age-prune shared cached images instead of removing job-specific artifacts.
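
The caching flow described above can be sketched in plain shell. This is a minimal illustration, not the code from `jenkins/L0_Test.groovy`: the function name `ensure_cached_image`, its argument order, and the injected `import_cmd` callback (standing in for the real Enroot import step) are all assumptions made for the example.

```shell
#!/bin/sh
# Sketch only: function name, arguments, and import_cmd are hypothetical.
#
# ensure_cached_image <cache_dir> <key> <image_ref> <import_cmd> [max_attempts] [delay]
# import_cmd is invoked as: import_cmd <image_ref> <output_path>
ensure_cached_image() {
    local cache_dir=$1 key=$2 image_ref=$3 import_cmd=$4
    local max_attempts=${5:-3} delay=${6:-5}
    local final="$cache_dir/container-$key.sqsh"
    local tmp="$cache_dir/container-$key.tmp"
    local lock="$cache_dir/container-$key.lock"
    local attempt=1

    mkdir -p "$cache_dir" || return 1
    (
        # Block until this job holds the exclusive lock on fd 9.
        flock -x 9 || exit 1

        if [ -f "$final" ]; then
            # Cache hit: refresh mtime so age-based pruning keeps the file.
            touch "$final"
            exit 0
        fi

        while [ "$attempt" -le "$max_attempts" ]; do
            # Drop any partial output left by a previous failed attempt.
            rm -f "$tmp"
            if "$import_cmd" "$image_ref" "$tmp"; then
                # Publish with a rename so readers never see a partial file.
                mv "$tmp" "$final"
                exit $?
            fi
            if [ "$attempt" -lt "$max_attempts" ]; then
                sleep "$delay"
                delay=$((delay * 2))   # exponential backoff
            fi
            attempt=$((attempt + 1))
        done

        rm -f "$tmp"
        exit 1
    ) 9>"$lock"
}
```

A caller would pass a small wrapper around the actual import command as `import_cmd`. Whether `flock` is honored across nodes depends on the shared filesystem and its mount options, so that part would need checking on the target cluster.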

Changes

Shared Enroot image caching by digest

| Layer / File(s) | Summary |
|---|---|
| Shared image caching prologue with digest and locking `jenkins/L0_Test.groovy` | The `srunPrologue` in `runLLMTestlistWithSbatch` is rewritten to cache Enroot images by Docker image digest under a shared cluster scratch cache directory. Image import is protected by `flock` on a lock file, written to a temp file first, and atomically published via `mv` after successful import. Existing cached images are reused with mtime refresh. Import operations are retried up to `max_attempts` with exponential backoff. |
| Cleanup commands for shared cache pruning `jenkins/L0_Test.groovy` | `cleanUpSlurmResources` and `cleanUpNodeResources` are updated to age-prune shared `container-*.sqsh` cache files and delete stale `container-*.tmp` and `container-*.lock` artifacts, replacing the prior deletion of job-specific cached images. |
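
The age-based pruning summarized in the table could look roughly like the following. This is a sketch under stated assumptions: the function name `prune_image_cache`, the `max_age_days` parameter, and the one-day threshold for leftover temp and lock files are illustrative and are not taken from the PR.

```shell
#!/bin/sh
# Sketch only: function name and thresholds are hypothetical.
#
# prune_image_cache <cache_dir> <max_age_days>
prune_image_cache() {
    local cache_dir=$1 max_age_days=$2

    # Nothing to do if the cache directory does not exist yet.
    [ -d "$cache_dir" ] || return 0

    # Cached images: -mtime +N matches files whose age in whole days
    # exceeds N, so recently touched (reused) images are kept.
    find "$cache_dir" -maxdepth 1 -type f -name 'container-*.sqsh' \
        -mtime +"$max_age_days" -delete

    # Leftover temp and lock files whose age in whole days exceeds one.
    find "$cache_dir" -maxdepth 1 -type f \
        \( -name 'container-*.tmp' -o -name 'container-*.lock' \) \
        -mtime +1 -delete
}
```

Because cache hits refresh the image's mtime, frequently used images survive pruning while unused ones age out. Deleting a lock file that another job currently holds would let a third job lock a fresh file with the same name, so a real cleanup would want a threshold comfortably longer than any import.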

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description is empty: only the unfilled repository template remains, with no explanation of the issue, solution, test coverage, or checklist items addressed. | Fill in the Description section to explain the issue and rationale, the Test Coverage section to list relevant safeguarding tests, and mark PR Checklist items to confirm guidelines compliance. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly describes the main change: reusing Enroot container image sqsh files instead of deleting them per-job, which aligns with the AI-generated summary of the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@jenkins/L0_Test.groovy`:
- Around line 1128-1129: The cache key currently uses a sha256 of the container
string (variable container) via imageDigest and then builds enrootImagePath from
containerDir/container-${imageDigest}.sqsh, which yields a tag-based key not the
registry content digest; change the logic to resolve the image's
manifest/registry digest first (or require a digest-pinned reference) and assign
that canonical digest to imageDigest before constructing enrootImagePath (e.g.,
use a registry inspection tool to obtain the manifest digest for the reference
in container and fall back to error if a digest cannot be resolved), ensuring
subsequent cache lookups use the real image content digest rather than a hash of
the reference string.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 384250be-19d0-4970-a6bc-e1ba46f38611

📥 Commits

Reviewing files that changed from the base of the PR and between 6254f3a and f29cfff.

📒 Files selected for processing (1)

jenkins/L0_Test.groovy

coderabbitai · 2026-06-09T07:29:07Z

+                    imageDigest=\$(printf '%s' "$container" | sha256sum | cut -d' ' -f1)
+                    export enrootImagePath="\$containerDir/container-\${imageDigest}.sqsh"


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Key the shared cache on the real image digest, not a hash of the image string.

Line 1128 hashes "$container" itself, so this is still a tag/reference cache, not a content-addressed cache. If the same tag is republished, later jobs will keep reusing the stale .sqsh until it ages out, which means CI can run against the wrong container contents. Resolve the registry/manifest digest first (or require a digest-pinned image reference) and use that value for enrootImagePath.
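
The distinction the reviewer draws can be illustrated with two small shell functions. `reference_cache_key` mirrors the hashing in the quoted diff. `content_cache_key` is a hypothetical alternative that only accepts digest-pinned references of the form `name@sha256:<64 hex characters>`. Both function names are assumptions made for this sketch. Resolving a tag to its manifest digest through a registry inspection tool is left out because it depends on the registry setup.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical.

# What the quoted diff does: hash the reference string itself.
# The same tag always maps to the same key, even after a republish.
reference_cache_key() {
    printf '%s' "$1" | sha256sum | cut -d' ' -f1
}

# Alternative: use the content digest carried by a digest-pinned
# reference, and fail for tag-only references.
content_cache_key() {
    local ref=$1 digest
    case "$ref" in
        *@sha256:*) digest=${ref##*@sha256:} ;;
        *) echo "image reference is not digest-pinned: $ref" >&2
           return 1 ;;
    esac
    # Reject anything other than exactly 64 lowercase hex characters.
    case "$digest" in
        *[!0-9a-f]*) echo "malformed digest in: $ref" >&2
                     return 1 ;;
    esac
    if [ "${#digest}" -ne 64 ]; then
        echo "malformed digest in: $ref" >&2
        return 1
    fi
    printf '%s\n' "$digest"
}
```

With `reference_cache_key`, a republished tag keeps hitting the old `.sqsh` until it ages out. With `content_cache_key`, a new image always gets a new cache file, at the cost of requiring pinned references or an extra registry lookup.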

tensorrt-cicd · 2026-06-09T07:34:00Z

PR_Github #53000 [ run ] triggered by Bot. Commit: f29cfff Link to invocation

tensorrt-cicd · 2026-06-09T13:59:50Z

PR_Github #53000 [ run ] completed with state SUCCESS. Commit: f29cfff
/LLM/main/L0_MergeRequest_PR pipeline #42227 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

EmmaQiaoCh · 2026-06-10T06:09:05Z

/bot run

tensorrt-cicd · 2026-06-10T06:14:31Z

PR_Github #53249 [ run ] triggered by Bot. Commit: dbdcc82 Link to invocation

tensorrt-cicd · 2026-06-10T12:29:10Z

PR_Github #53249 [ run ] completed with state SUCCESS. Commit: dbdcc82
/LLM/main/L0_MergeRequest_PR pipeline #42444 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

EmmaQiaoCh · 2026-06-11T06:01:41Z

/bot run

tensorrt-cicd · 2026-06-11T06:07:15Z

PR_Github #53494 [ run ] triggered by Bot. Commit: dbdcc82 Link to invocation

tensorrt-cicd · 2026-06-11T07:53:19Z

PR_Github #53494 [ run ] completed with state SUCCESS. Commit: dbdcc82
/LLM/main/L0_MergeRequest_PR pipeline #42653 completed with status: 'SUCCESS'

CI Report

Link to invocation

EmmaQiaoCh · 2026-06-11T09:38:09Z

/bot run

tensorrt-cicd · 2026-06-11T09:45:04Z

PR_Github #53544 [ run ] triggered by Bot. Commit: 8a22dd6 Link to invocation

tensorrt-cicd · 2026-06-11T11:45:07Z

PR_Github #53544 [ run ] completed with state SUCCESS. Commit: 8a22dd6
/LLM/main/L0_MergeRequest_PR pipeline #42695 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

EmmaQiaoCh · 2026-06-12T07:30:30Z

/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1"

tensorrt-cicd · 2026-06-12T07:37:34Z

PR_Github #53833 [ run ] triggered by Bot. Commit: 8a22dd6 Link to invocation

EmmaQiaoCh · 2026-06-12T13:57:23Z

/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1"

tensorrt-cicd · 2026-06-12T14:03:12Z

PR_Github #53894 [ run ] triggered by Bot. Commit: 8a22dd6 Link to invocation

tensorrt-cicd · 2026-06-12T14:04:56Z

PR_Github #53833 [ run ] completed with state ABORTED. Commit: 8a22dd6

Link to invocation

tensorrt-cicd · 2026-06-12T17:09:02Z

PR_Github #53894 [ run ] completed with state SUCCESS. Commit: 8a22dd6
/LLM/main/L0_MergeRequest_PR pipeline #42992 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

EmmaQiaoCh · 2026-06-13T02:47:46Z

/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1"

tensorrt-cicd · 2026-06-13T02:53:11Z

PR_Github #53999 [ run ] triggered by Bot. Commit: 155b3ed Link to invocation

tensorrt-cicd · 2026-06-13T05:25:20Z

PR_Github #53999 [ run ] completed with state SUCCESS. Commit: 155b3ed
/LLM/main/L0_MergeRequest_PR pipeline #43084 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

EmmaQiaoCh · 2026-06-13T09:03:36Z

/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1"

tensorrt-cicd · 2026-06-13T09:09:00Z

PR_Github #54032 [ run ] triggered by Bot. Commit: 155b3ed Link to invocation

tensorrt-cicd · 2026-06-13T09:33:57Z

PR_Github #54032 [ run ] completed with state SUCCESS. Commit: 155b3ed
/LLM/main/L0_MergeRequest_PR pipeline #43116 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

EmmaQiaoCh · 2026-06-13T12:24:32Z

/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1" --disable-reuse-test

tensorrt-cicd · 2026-06-13T12:32:00Z

PR_Github #54045 [ run ] triggered by Bot. Commit: 155b3ed Link to invocation

tensorrt-cicd · 2026-06-13T13:05:55Z

PR_Github #54045 [ run ] completed with state FAILURE. Commit: 155b3ed
/LLM/main/L0_MergeRequest_PR pipeline #43129 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

EmmaQiaoCh · 2026-06-13T14:06:21Z

/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1" --disable-reuse-test

tensorrt-cicd · 2026-06-13T14:12:21Z

PR_Github #54052 [ run ] triggered by Bot. Commit: 155b3ed Link to invocation

tensorrt-cicd · 2026-06-13T15:07:11Z

PR_Github #54052 [ run ] completed with state SUCCESS. Commit: 155b3ed
/LLM/main/L0_MergeRequest_PR pipeline #43135 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

EmmaQiaoCh · 2026-06-14T02:03:27Z

/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1" --disable-reuse-test

tensorrt-cicd · 2026-06-14T02:08:50Z

PR_Github #54073 [ run ] triggered by Bot. Commit: 155b3ed Link to invocation

tensorrt-cicd · 2026-06-14T05:06:39Z

PR_Github #54073 [ run ] completed with state SUCCESS. Commit: 155b3ed
/LLM/main/L0_MergeRequest_PR pipeline #43156 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

EmmaQiaoCh · 2026-06-14T08:14:49Z

/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1" --disable-reuse-test

tensorrt-cicd · 2026-06-14T08:20:22Z

PR_Github #54097 [ run ] triggered by Bot. Commit: d2adbe6 Link to invocation

tensorrt-cicd · 2026-06-14T08:56:01Z

PR_Github #54097 [ run ] completed with state FAILURE. Commit: d2adbe6
/LLM/main/L0_MergeRequest_PR pipeline #43181 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

EmmaQiaoCh · 2026-06-14T09:11:22Z

/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1" --disable-reuse-test --disable-fail-fast

tensorrt-cicd · 2026-06-14T09:16:52Z

PR_Github #54105 [ run ] triggered by Bot. Commit: d2adbe6 Link to invocation

tensorrt-cicd · 2026-06-14T11:34:03Z

PR_Github #54105 [ run ] completed with state SUCCESS. Commit: d2adbe6
/LLM/main/L0_MergeRequest_PR pipeline #43188 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

EmmaQiaoCh requested review from a team as code owners June 9, 2026 07:23

EmmaQiaoCh requested review from dpitman-nvda and niukuo June 9, 2026 07:23

github-actions Bot assigned EmmaQiaoCh Jun 9, 2026

EmmaQiaoCh changed the title ~~[TRTLLMINF-112][infra] Reuse image sqsh file~~ [TRTLLMINF-111][infra] Reuse image sqsh file Jun 9, 2026

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

Reuse image sqsh file

dbdcc82

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

EmmaQiaoCh force-pushed the emma/reduce_image_download_time branch from f29cfff to dbdcc82 Compare June 10, 2026 06:08

EmmaQiaoCh force-pushed the emma/reduce_image_download_time branch from 8a22dd6 to dbdcc82 Compare June 13, 2026 02:45

Check for agent flow

155b3ed

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

Add some output to check agent flow

d2adbe6

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
