[TRTLLMINF-111][infra] Reuse image sqsh file #15147
/bot run
📝 Walkthrough
The Jenkins pipeline updates Enroot container image caching from a per-job model to a shared digest-based approach. The SLURM prologue now computes image digests, caches images in shared directories with flock-based concurrency control, and retries imports with backoff. Cleanup functions are updated to age-prune shared cached images instead of removing job-specific artifacts.
Changes
Shared Enroot image caching by digest
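The caching flow described in the walkthrough (take a lock, re-check the cache, import with retries) can be sketched roughly as follows. This is a minimal illustration, not the code in `jenkins/L0_Test.groovy`: the function names `retry_with_backoff` and `ensure_cached_image`, the `.lock` and `.tmp` suffixes, and the retry count and initial delay are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch only: function names, file suffixes, and retry defaults are assumptions.

# Retry a command with exponential backoff.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command...>
retry_with_backoff() {
    local max_attempts=$1 delay=$2 attempt=1
    shift 2
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))
    done
}

# Import an image into the shared cache unless it is already present.
# Usage: ensure_cached_image <image_path> <import_command...>
# The import command is invoked with a temporary output path appended
# as its last argument.
ensure_cached_image() {
    local image_path=$1
    shift
    local lock_file="${image_path}.lock" tmp_path="${image_path}.tmp.$$"
    (
        # Block until this job holds the exclusive lock for this image.
        flock -x 9 || exit 1
        # Another job may have finished the import while we waited.
        if [ -f "$image_path" ]; then
            exit 0
        fi
        # Remove any partial output before each attempt.
        import_once() { rm -f "$tmp_path"; "$@" "$tmp_path"; }
        retry_with_backoff 3 1 import_once "$@" || { rm -f "$tmp_path"; exit 1; }
        # Rename into place so readers never see a partially written file.
        mv "$tmp_path" "$image_path"
    ) 9>"$lock_file"
}
```

A caller would wrap the real import in a small function, for example `do_import() { enroot import -o "$1" "docker://$container"; }`, and then run `ensure_cached_image "$enrootImagePath" do_import`. Writing to a temporary file and renaming it under the lock means a job that loses the race reuses the finished image instead of importing it again.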
🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@jenkins/L0_Test.groovy`:
- Around line 1128-1129: The cache key currently uses a sha256 of the container
string (variable container) via imageDigest and then builds enrootImagePath from
containerDir/container-${imageDigest}.sqsh, which yields a tag-based key not the
registry content digest; change the logic to resolve the image's
manifest/registry digest first (or require a digest-pinned reference) and assign
that canonical digest to imageDigest before constructing enrootImagePath (e.g.,
use a registry inspection tool to obtain the manifest digest for the reference
in container and fall back to error if a digest cannot be resolved), ensuring
subsequent cache lookups use the real image content digest rather than a hash of
the reference string.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 384250be-19d0-4970-a6bc-e1ba46f38611
📒 Files selected for processing (1)
jenkins/L0_Test.groovy
imageDigest=\$(printf '%s' "$container" | sha256sum | cut -d' ' -f1)
export enrootImagePath="\$containerDir/container-\${imageDigest}.sqsh"
Key the shared cache on the real image digest, not a hash of the image string.
Line 1128 hashes "$container" itself, so this is still a tag/reference cache, not a content-addressed cache. If the same tag is republished, later jobs will keep reusing the stale .sqsh until it ages out, which means CI can run against the wrong container contents. Resolve the registry/manifest digest first (or require a digest-pinned image reference) and use that value for enrootImagePath.
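One of the two options mentioned above, requiring a digest-pinned image reference, could look roughly like the sketch below. It is an illustration only: the helper names `image_content_digest` and `cache_path_for_image` are hypothetical, and the PR may resolve the concern differently. The other option, resolving a tag to its manifest digest, needs a registry query and is not shown.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the error message are assumptions.

# Print the sha256 content digest from a digest-pinned image reference
# (a reference ending in "@sha256:" followed by 64 lowercase hex characters).
# Fails for tag-only references, so a moving tag never becomes a cache key.
image_content_digest() {
    local ref=$1
    if [[ "$ref" =~ @sha256:([0-9a-f]{64})$ ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        echo "image reference is not digest-pinned: $ref" >&2
        return 1
    fi
}

# Build the shared cache path for a digest-pinned reference.
# Usage: cache_path_for_image <cache_dir> <image_reference>
cache_path_for_image() {
    local cache_dir=$1 ref=$2 digest
    digest=$(image_content_digest "$ref") || return 1
    printf '%s/container-%s.sqsh\n' "$cache_dir" "$digest"
}
```

With helpers like these, the prologue could set `enrootImagePath=$(cache_path_for_image "$containerDir" "$container") || exit 1`. Because the key comes from the content digest, republishing a tag yields a different cache file rather than a stale hit.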
PR_Github #53000 [ run ] triggered by Bot. Commit:
PR_Github #53000 [ run ] completed with state
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
f29cfff to dbdcc82
/bot run
PR_Github #53249 [ run ] triggered by Bot. Commit:
PR_Github #53249 [ run ] completed with state
/bot run
PR_Github #53494 [ run ] triggered by Bot. Commit:
PR_Github #53494 [ run ] completed with state
/bot run
PR_Github #53544 [ run ] triggered by Bot. Commit:
PR_Github #53544 [ run ] completed with state
/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1"
PR_Github #53833 [ run ] triggered by Bot. Commit:
/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1"
PR_Github #53894 [ run ] triggered by Bot. Commit:
PR_Github #53833 [ run ] completed with state
PR_Github #53894 [ run ] completed with state
8a22dd6 to dbdcc82
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1"
PR_Github #53999 [ run ] triggered by Bot. Commit:
PR_Github #53999 [ run ] completed with state
/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1"
PR_Github #54032 [ run ] triggered by Bot. Commit:
PR_Github #54032 [ run ] completed with state
/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1" --disable-reuse-test
PR_Github #54045 [ run ] triggered by Bot. Commit:
PR_Github #54045 [ run ] completed with state
/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1" --disable-reuse-test
PR_Github #54052 [ run ] triggered by Bot. Commit:
PR_Github #54052 [ run ] completed with state
/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1" --disable-reuse-test
PR_Github #54073 [ run ] triggered by Bot. Commit:
PR_Github #54073 [ run ] completed with state
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1" --disable-reuse-test
PR_Github #54097 [ run ] triggered by Bot. Commit:
PR_Github #54097 [ run ] completed with state
/bot run --stage-list "DGX_H100-PyTorch-1, DGX_B200-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-1" --disable-reuse-test --disable-fail-fast
PR_Github #54105 [ run ] triggered by Bot. Commit:
PR_Github #54105 [ run ] completed with state
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either `api-compatible` or `api-breaking`. For `api-breaking`, include `BREAKING` in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.