engine: support more robust handling of links in tar stream #1377
caxu-rh wants to merge 1 commit into redhat-openshift-ecosystem:main
Conversation
[APPROVAL NOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: caxu-rh. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
5b2aa20 to 755fb96
/retest
Please rebase so the Gemini reviews can run. |
755fb96 to 26c373b
This PR significantly improves the robustness of tar extraction when handling complex link structures (symlinks and hardlinks), particularly in the context of file filtering and directory symlinks. It introduces a LinkGraph and directory alias mapping to correctly identify files that should be extracted based on patterns, even when accessed through various link paths.
🔍 General Feedback
- Robustness: The move to a two-pass approach (planning and execution) is a solid architectural decision that ensures dependencies are correctly identified and resolved before extraction.
- Topological Sorting: Implementing Kahn's algorithm for hardlink dependency sorting is excellent and addresses previous issues where hardlinks could fail if their targets weren't yet extracted.
- Security: The use of `os.Root` (via `os.OpenRoot`) provides strong protection against directory traversal attacks during extraction.
- Bug Fixes: A critical bug in `ExpandFilePathAliases` needs to be addressed, as it could cause a crash due to infinite recursion in the presence of symlink loops.
- Optimization: Minor optimizations in `engine.go` and the general cleanup of the extraction logic are positive highlights.
- Performance: While the two-pass approach is more robust, it does involve reading the tar stream twice, which may impact performance for remote images.
```go
inDegree[hlName] = 0
adjList[hlName] = []string{}
}
```
🟡 If a hardlink cycle exists in the tarball, Kahn's algorithm will complete with len(sorted) < len(hardlinks), and the cyclic links will be silently skipped. It might be better to return an error or log a warning if the number of sorted hardlinks doesn't match the input.
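To make the suggested cycle check concrete, here is a small self-contained sketch of a Kahn-style hardlink sort that returns an error instead of silently dropping cyclic entries. The names (`sortHardlinksByDeps`, the `links` map shape) are hypothetical, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sort"
)

// sortHardlinksByDeps orders hardlink names so that a hardlink whose
// target is itself a hardlink is created after its target (Kahn's
// algorithm). links maps hardlink name -> target name. Instead of
// silently skipping leftover nodes, it reports a cycle.
func sortHardlinksByDeps(links map[string]string) ([]string, error) {
	inDegree := make(map[string]int, len(links))
	adj := make(map[string][]string, len(links))
	for name := range links {
		inDegree[name] = 0
	}
	for name, target := range links {
		// Only edges between hardlinks matter; regular files
		// already exist by the deferred-hardlink phase.
		if _, ok := links[target]; ok {
			adj[target] = append(adj[target], name)
			inDegree[name]++
		}
	}
	var queue []string
	for name, d := range inDegree {
		if d == 0 {
			queue = append(queue, name)
		}
	}
	sort.Strings(queue) // deterministic order for the sketch
	var sorted []string
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		sorted = append(sorted, n)
		for _, m := range adj[n] {
			inDegree[m]--
			if inDegree[m] == 0 {
				queue = append(queue, m)
			}
		}
	}
	// The check the reviewer asks for: leftover nodes mean a cycle.
	if len(sorted) != len(links) {
		return nil, fmt.Errorf("cyclic hardlink dependency among %d entries", len(links)-len(sorted))
	}
	return sorted, nil
}

func main() {
	// "b" links to regular file "a"; "c" links to hardlink "b".
	order, err := sortHardlinksByDeps(map[string]string{"b": "a", "c": "b"})
	fmt.Println(order, err) // [b c] <nil>
}
```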
The pull request introduces a more robust graph-based approach for handling symlinks and hardlinks during image extraction. While the architectural direction significantly improves the resolution of complex link chains and avoids multiple extraction passes, it introduces a critical infinite recursion bug in ExpandFilePathAliases and a high-priority regression in pattern matching support.
🔍 General Feedback
- Robust Link Resolution: The new approach for tracking links via a graph and building an extraction plan is a major improvement over the previous progressive resolution method.
- Topological Sorting of Hardlinks: Deferring hardlink creation and sorting them by dependencies ensures that targets always exist before the links are created.
- Circular Symlink Risks: The current recursive implementation of `ExpandFilePathAliases` does not properly track visited nodes across recursive calls; this must be addressed to prevent stack overflows.
- Pattern Matching Regression: Ensure that the `doublestar` package is still used for pattern matching to maintain support for recursive globbing patterns (`**`).
```go
return linkNode.OriginalLinkname, true
}
if linkNode.VirtualLinkTarget != "" {
	return linkNode.VirtualLinkTarget, true
```
🟡 BuildExtractionPlan has O(N * M) complexity where N is the number of files and M is the number of patterns, and it calls ExpandFilePathAliases for every file. For images with many files and many patterns, this could be a performance bottleneck. Consider pre-compiling patterns or optimization if performance issues arise in large images.
26c373b to b356972
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (5)
📝 Walkthrough (Summary by CodeRabbit)
Refactors tar extraction into a two-phase plan-and-run flow, adds a LinkGraph for symlink/hardlink aliasing and path expansion, changes the per-entry bundle-hash write to use fmt.Fprintf, and adds/updates tests covering alias expansion, deep chains, deferred hardlink ordering, and canonicalization.
Sequence Diagram(s)
sequenceDiagram
participant Client
participant Extractor
participant TarReader as Tar Reader
participant LinkGraph as Link Graph
participant Filesystem as Filesystem Writer
Client->>Extractor: untar(requiredPatterns)
rect rgba(76, 175, 80, 0.5)
Note over Extractor,LinkGraph: Planning Phase
Extractor->>TarReader: read all entries (single pass)
TarReader-->>Extractor: stream entries
Extractor->>LinkGraph: build graph (files, symlinks, hardlinks)
LinkGraph-->>Extractor: alias maps / virtual targets
Extractor->>LinkGraph: ExpandFilePathAliases(matches)
LinkGraph-->>Extractor: expanded paths
Extractor->>Extractor: compute transitive dependencies (plan)
Extractor-->>Client: return extraction plan
end
rect rgba(33, 150, 243, 0.5)
Note over Extractor,Filesystem: Execution Phase
Extractor->>TarReader: read tar stream again
TarReader-->>Extractor: stream planned entries
Extractor->>Filesystem: write regular files
Filesystem-->>Extractor: files written
Extractor->>Filesystem: create symlinks immediately
Filesystem-->>Extractor: symlinks created
Extractor->>LinkGraph: sort hardlinks by dependencies (topo)
LinkGraph-->>Extractor: ordered hardlink list
Extractor->>Filesystem: create hardlinks in order
Filesystem-->>Extractor: hardlinks created
end
Extractor-->>Client: extraction complete
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/engine/untar.go (1)
279-286: ⚠️ Potential issue | 🟠 Major: Same error-overwriting issue as in `planExtraction`. Apply the same fix here to preserve the original error when the drain also fails.
Proposed fix
```diff
 defer func() {
 	// Drain any remaining data from the reader and capture any errors
 	_, drainErr := io.Copy(io.Discard, fs)
 	if drainErr != nil {
-		err = fmt.Errorf("failed to drain io reader: %w", drainErr)
+		if err != nil {
+			err = fmt.Errorf("%w; additionally failed to drain io reader: %v", err, drainErr)
+		} else {
+			err = fmt.Errorf("failed to drain io reader: %w", drainErr)
+		}
 	}
 	fs.Close()
 }()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/engine/untar.go` around lines 279 - 286, The defer currently unconditionally sets err when draining the reader fails, which can overwrite a prior error; update the defer in the untar reader cleanup (the block using fs, io.Copy(io.Discard, fs), drainErr and err) to preserve the original error: after calling io.Copy check if drainErr != nil and then only set err to drainErr if err == nil, or else combine both errors (e.g., errors.Join(err, fmt.Errorf("failed to drain io reader: %w", drainErr))) so the original error is not lost; also ensure fs.Close() still runs.
🧹 Nitpick comments (1)
internal/engine/untar.go (1)
96-98: Stale comment reference. The comment refers to `untarOnce`, which no longer exists; it should reference `runExtraction`.

```diff
- // Order doesn't matter because hardlinks are deferred in untarOnce
+ // Order doesn't matter because hardlinks are deferred in runExtraction
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/engine/untar.go` around lines 96 - 98, The comment above the slice conversion references a removed function name; update the comment to refer to runExtraction instead of untarOnce and keep the rest unchanged — specifically the line around the append to result that uses slices.Collect(maps.Keys(filesToExtract)) to build the extraction plan; ensure the comment now reads something like "Order doesn't matter because hardlinks are deferred in runExtraction" so it correctly points to runExtraction and clarifies why ordering is irrelevant.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/engine/untar.go`:
- Around line 26-34: The defer that drains and closes fs can overwrite an
earlier returned error (e.g., the err set after tr.Next()), so change the defer
in untar.go to preserve and combine errors instead of clobbering them: detect
the existing named return error (err) and if io.Copy or fs.Close produce errors,
combine them with the original using errors.Join (or fmt.Errorf to wrap if
errors.Join unavailable), e.g. produce a combined error like "original error;
drain error: ...; close error: ..." and assign that back to err; reference the
fs variable and the defer block that currently does io.Copy(io.Discard, fs) and
fs.Close() as the location to apply this change.
---
Outside diff comments:
In `@internal/engine/untar.go`:
- Around line 279-286: The defer currently unconditionally sets err when
draining the reader fails, which can overwrite a prior error; update the defer
in the untar reader cleanup (the block using fs, io.Copy(io.Discard, fs),
drainErr and err) to preserve the original error: after calling io.Copy check if
drainErr != nil and then only set err to drainErr if err == nil, or else combine
both errors (e.g., errors.Join(err, fmt.Errorf("failed to drain io reader: %w",
drainErr))) so the original error is not lost; also ensure fs.Close() still
runs.
---
Nitpick comments:
In `@internal/engine/untar.go`:
- Around line 96-98: The comment above the slice conversion references a removed
function name; update the comment to refer to runExtraction instead of untarOnce
and keep the rest unchanged — specifically the line around the append to result
that uses slices.Collect(maps.Keys(filesToExtract)) to build the extraction
plan; ensure the comment now reads something like "Order doesn't matter because
hardlinks are deferred in runExtraction" so it correctly points to runExtraction
and clarifies why ordering is irrelevant.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f13bcffa-896c-4b97-bc14-4cbe0630be34
📒 Files selected for processing (5)
internal/engine/engine.go, internal/engine/graph.go, internal/engine/graph_test.go, internal/engine/untar.go, internal/engine/untar_test.go
b356972 to 7f3bb42
Actionable comments posted: 3
♻️ Duplicate comments (1)
internal/engine/untar.go (1)
224-238: ⚠️ Potential issue | 🟠 Major: Don't silently drop cyclic hardlinks.
If Kahn's pass leaves nodes behind, `sorted` is shorter than `hardlinks`; extraction then never sees those selected entries and still returns success. Please surface that cycle when `len(sorted) != len(hardlinks)`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/engine/untar.go` around lines 224 - 238, After running Kahn's algorithm you must detect and surface cycles instead of silently dropping entries: after the loop that builds `sorted` (the snippet iterating `queue` and decrementing `inDegree`), check if `len(sorted) != len(hardlinks)` and return a descriptive error (or propagate one) indicating a cyclic hardlink dependency and include context (e.g., remaining nodes or their names) so extraction fails rather than succeeding; update the function that performs this topological sort to return the error instead of proceeding when the lengths differ.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/engine/graph.go`:
- Around line 305-313: The code resolves the raw link target returned by
getEffectiveLinkTarget(targetPath, lg) relative to the symlink's path
(linkPath), which mis-resolves chained links; instead, resolve the returned
linkTarget relative to targetPath so nested chains map correctly. Change the
resolveRelativeLinkFrom call to use targetPath (e.g., resolvedFromSymlink :=
resolveRelativeLinkFrom(targetPath, linkTarget)) and then proceed to call
addAliasIfNew(aliases, resolvedFromSymlink, linkPath) so alias matching uses the
target-path-relative resolution for chained symlinks.
- Around line 122-133: walkGraphChain can loop forever on cycles (e.g.,
a->b->a); modify walkGraphChain to track visited node names (use a
map[string]bool or set) and check before visiting/advancing: if current or
node.Deps.Name is already seen, break out to prevent infinite loops. Ensure you
mark nodes as visited when first encountered and use the same visited check
before calling visitor(node.Deps.Name, ... ) or setting current = node.Deps.Name
so cycles stop. Keep function signature (walkGraphChain, graph LinkGraph,
linkNode, visitor) unchanged.
In `@internal/engine/untar.go`:
- Around line 127-141: Normalize symlink targets before inserting into
linkGraph: if header.Linkname is absolute (starts with "/") strip the leading
slash and Clean it (e.g., strings.TrimPrefix(header.Linkname, "/") then
filepath.Clean) so the stored dep key matches tar entry paths (e.g.,
"usr/bin/bash"); otherwise continue to resolve relative targets with
filepath.Join(filepath.Dir(header.Name), header.Linkname) and filepath.Clean.
Update the creation of resolvedTargetName used when constructing linkNode
(symbols: header.Name, header.Linkname, resolvedTargetName, linkGraph, linkNode,
addTransitiveDependencies) and add the necessary import for strings if missing.
---
Duplicate comments:
In `@internal/engine/untar.go`:
- Around line 224-238: After running Kahn's algorithm you must detect and
surface cycles instead of silently dropping entries: after the loop that builds
`sorted` (the snippet iterating `queue` and decrementing `inDegree`), check if
`len(sorted) != len(hardlinks)` and return a descriptive error (or propagate
one) indicating a cyclic hardlink dependency and include context (e.g.,
remaining nodes or their names) so extraction fails rather than succeeding;
update the function that performs this topological sort to return the error
instead of proceeding when the lengths differ.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: fc1e0d55-4937-4abe-9560-10bc9b391e50
📒 Files selected for processing (5)
internal/engine/engine.go, internal/engine/graph.go, internal/engine/graph_test.go, internal/engine/untar.go, internal/engine/untar_test.go
✅ Files skipped from review due to trivial changes (2)
- internal/engine/engine.go
- internal/engine/untar_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/engine/graph_test.go
7f3bb42 to 49d921f
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/engine/graph.go`:
- Around line 273-340: The loop in build of the directory alias map uses a fixed
maxIterations (maxIterations=100) and silently returns a partial aliases map
when the ceiling is hit; change this to either (1) convert the algorithm to a
deterministic worklist/fixpoint approach (process a queue of changed nodes, only
re-enqueue neighbors when addAliasIfNew changes aliases) using the existing
symbols (lg, aliases, addAliasIfNew, resolveRelativeLinkFrom, getLinkTarget,
getEffectiveLinkTarget, linkNode.VirtualLinkTarget) so the loop terminates only
when the worklist is empty, or (2) if keeping the iterative approach, remove the
silent success path and have the function return an error when iteration >=
maxIterations (propagate an error out of the function so callers such as
planExtraction() can fail deterministically) and update relevant call sites to
handle the error; implement one of these fixes and remove the silent
log-that-it-may-be-incomplete behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 06496e13-592c-40d0-8788-6518bbfd8804
📒 Files selected for processing (5)
internal/engine/engine.go, internal/engine/graph.go, internal/engine/graph_test.go, internal/engine/untar.go, internal/engine/untar_test.go
✅ Files skipped from review due to trivial changes (1)
- internal/engine/engine.go
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/engine/untar.go
- internal/engine/graph_test.go
49d921f to c439135
Actionable comments posted: 1
♻️ Duplicate comments (1)
internal/engine/graph.go (1)
215-216: ⚠️ Potential issue | 🟠 Major: Normalize absolute link targets before alias propagation.
Line 216 always joins `linkTarget` under `filepath.Dir(linkPath)`, but `getLinkTarget()` can return an absolute `OriginalLinkname`. For a nested link like `foo/sh -> /usr/bin/bash`, alias propagation becomes `foo/usr/bin/bash`, so chained symlink/hardlink aliases stop matching real tar entries. Please normalize absolute targets as archive-root-relative before further joins, and add a regression case with a non-root link path.
Also applies to: 302-324
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/engine/graph.go` around lines 215 - 216, resolveRelativeLinkFrom currently always joins linkTarget under filepath.Dir(linkPath), which breaks when getLinkTarget() returns an absolute OriginalLinkname; change resolveRelativeLinkFrom to detect absolute targets (filepath.IsAbs) and, for absolutes, normalize them to archive-root-relative by trimming the leading path separator (e.g. strings.TrimPrefix(linkTarget, string(os.PathSeparator))) and returning filepath.Clean of that trimmed value instead of joining with Dir(linkPath); keep the existing join behavior for non-absolute targets. Update the alias-propagation code that consumes getLinkTarget() (the same logic referenced around the other symlink handling block) to use this normalized result, and add a regression test with a non-root link path (e.g. linkPath "foo/sh" and linkTarget "/usr/bin/bash" expecting "usr/bin/bash") to verify correct behavior.
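A minimal runnable sketch of the normalization this prompt asks for, using the regression case from the review (`foo/sh -> /usr/bin/bash`). The function name mirrors the PR's helper, but the body here is illustrative, not the actual implementation:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// resolveLinkFrom resolves a tar link target relative to the link's
// location. Absolute targets are treated as archive-root-relative
// (a tar entry "/usr/bin/bash" names the entry "usr/bin/bash")
// instead of being joined under the link's parent directory.
func resolveLinkFrom(linkPath, linkTarget string) string {
	if strings.HasPrefix(linkTarget, "/") {
		return strings.TrimPrefix(path.Clean(linkTarget), "/")
	}
	// Relative targets resolve against the link's directory.
	return path.Clean(path.Join(path.Dir(linkPath), linkTarget))
}

func main() {
	// Without the absolute-path branch, this would wrongly yield
	// "foo/usr/bin/bash".
	fmt.Println(resolveLinkFrom("foo/sh", "/usr/bin/bash")) // usr/bin/bash
	// The relative case from the PR description still works.
	fmt.Println(resolveLinkFrom("usr/lib/sysimage/rpm", "../../share/rpm")) // usr/share/rpm
}
```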
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/engine/untar.go`:
- Around line 441-458: The code checks containment using resolvedON/finalOldname
but still writes the original header.Linkname for relative symlinks; instead,
after the containment check in the untar logic (around
resolveLinkPaths/header.Linkname/header.Name), compute a sanitized relative
target for non-absolute links by calling filepath.Rel(dstRoot.Name(),
finalOldname) (or equivalent) and verify it doesn't escape (no leading ".."),
then set symlinkTarget to that sanitized relative path before creating the
symlink; keep the absolute-branch behavior (use finalOldname) for absolute
links. Also add a regression test that extracts a tar with an entry like
"malicious-link -> ../../mnt" and asserts the created symlink target is the
sanitized relative path (selecting the link by name).
---
Duplicate comments:
In `@internal/engine/graph.go`:
- Around line 215-216: resolveRelativeLinkFrom currently always joins linkTarget
under filepath.Dir(linkPath), which breaks when getLinkTarget() returns an
absolute OriginalLinkname; change resolveRelativeLinkFrom to detect absolute
targets (filepath.IsAbs) and, for absolutes, normalize them to
archive-root-relative by trimming the leading path separator (e.g.
strings.TrimPrefix(linkTarget, string(os.PathSeparator))) and returning
filepath.Clean of that trimmed value instead of joining with Dir(linkPath); keep
the existing join behavior for non-absolute targets. Update the
alias-propagation code that consumes getLinkTarget() (the same logic referenced
around the other symlink handling block) to use this normalized result, and add
a regression test with a non-root link path (e.g. linkPath "foo/sh" and
linkTarget "/usr/bin/bash" expecting "usr/bin/bash") to verify correct behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: cb37a364-3c84-4374-92c5-19b7fbc6d862
📒 Files selected for processing (5)
internal/engine/engine.go, internal/engine/graph.go, internal/engine/graph_test.go, internal/engine/untar.go, internal/engine/untar_test.go
✅ Files skipped from review due to trivial changes (1)
- internal/engine/engine.go
c439135 to 6e7a11d
♻️ Duplicate comments (2)
internal/engine/graph.go (2)
215-216: ⚠️ Potential issue | 🟠 Major: Normalize absolute `Linkname` values before building alias keys.
When `linkTarget` starts with `/`, this helper still joins it against `filepath.Dir(linkPath)`. In Go that produces paths like `usr/lib/sysimage/usr/share/rpm`, not the archive-relative `usr/share/rpm`, so the alias map gets populated under the wrong key. That breaks both the hardlink propagation at line 302 and the chained-symlink propagation at line 322. The new test in `internal/engine/untar_test.go` only checks `Deps.Name`, so this regression still slips through.
Possible fix
```diff
 import (
 	"archive/tar"
+	"path"
 	"path/filepath"
 	"slices"
+	"strings"
@@
 func resolveRelativeLinkFrom(linkPath, linkTarget string) string {
-	return filepath.Clean(filepath.Join(filepath.Dir(linkPath), linkTarget))
+	if strings.HasPrefix(linkTarget, "/") {
+		return strings.TrimPrefix(path.Clean(linkTarget), "/")
+	}
+	return path.Clean(path.Join(path.Dir(linkPath), linkTarget))
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/engine/graph.go` around lines 215 - 216, resolveRelativeLinkFrom currently joins absolute linkTarget values against filepath.Dir(linkPath), producing wrong archive-relative keys; update resolveRelativeLinkFrom to detect absolute linkTarget (use filepath.IsAbs) and strip the leading path separator (e.g., strings.TrimPrefix(linkTarget, string(os.PathSeparator))) or otherwise convert it to a relative form before calling filepath.Join/Clean so alias keys are built as archive-relative paths; refer to the resolveRelativeLinkFrom function and ensure related callers that populate alias maps (the hardlink and chained-symlink propagation code paths) continue to receive normalized paths.
66-117: ⚠️ Potential issue | 🔴 Critical: `LinkGraph.ExpandFilePathAliases` still has an unbounded cycle case.
The on-stack check at line 103 only catches exact path repetition. A directory symlink back into an ancestor, like `a/x -> .`, yields `a/file`, `a/x/file`, `a/x/x/file`, ... without ever reusing the same full path, so planning can recurse until it exhausts stack or memory on a crafted image. Track repeated directory-alias substitutions on the recursion path instead of only checking the fully expanded file path.
Verify each finding against the current code and only fix it if needed. In `@internal/engine/graph.go` around lines 66 - 117, The recursion only tracks full file paths in expanding, which misses cycles caused by repeating directory substitutions (e.g., substituting dir -> symlinkPath repeatedly); update expandFilePathAliasesRec to track directory-substitution keys (e.g., dir + "->" + symlinkPath) on the recursion stack instead of (or in addition to) the full aliasedPath so you can detect and skip repeated directory-alias substitutions that would loop indefinitely. Concretely: when iterating aliases inside expandFilePathAliasesRec and before recursing for aliasedPath, construct a unique key from the source directory (dir) and symlinkPath, check that key in the expanding map to skip recursion on repeats, add the key to expanding before calling lg.expandFilePathAliasesRec and remove it after (defer) so nested recursion sees correct state; keep visited for deduping returned paths as before.
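A toy, self-contained sketch of the substitution-key guard this prompt describes: instead of tracking fully expanded paths (which never repeat under a loop like `a/x -> .`), the recursion tracks which `dir -> alias` substitution is already active. Names and the simplified alias-map shape here are illustrative, not the PR's actual `LinkGraph` code:

```go
package main

import (
	"fmt"
	"strings"
)

// expandAliases returns every path at which p is reachable through the
// directory aliases (canonical dir -> alias dirs). onStack records
// active substitutions ("dir->alias") on the recursion path; a repeat
// means a cycle, so that branch stops instead of recursing forever.
func expandAliases(p string, aliases map[string][]string, onStack map[string]bool) []string {
	results := []string{p}
	for dir, alts := range aliases {
		if p == dir || strings.HasPrefix(p, dir+"/") {
			for _, alt := range alts {
				key := dir + "->" + alt
				if onStack[key] {
					continue // substitution already active: cycle
				}
				onStack[key] = true
				aliased := alt + strings.TrimPrefix(p, dir)
				results = append(results, expandAliases(aliased, aliases, onStack)...)
				delete(onStack, key)
			}
		}
	}
	return results
}

func main() {
	// Models "a/x -> .": directory "a" is also reachable as "a/x",
	// which would otherwise expand to a/x/x/file, a/x/x/x/file, ...
	aliases := map[string][]string{"a": {"a/x"}}
	out := expandAliases("a/file", aliases, map[string]bool{})
	fmt.Println(out) // terminates: [a/file a/x/file]
}
```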
🧹 Nitpick comments (1)
internal/engine/untar_test.go (1)
728-759: `writeTarballWithLinkChain` never drives the deferred-hardlink branch.
Because lines 732-745 always write the regular file before any link header, every new hardlink-chain integration test stays in the easy ordering where targets already exist in the tar stream. That leaves the new defer-and-replay logic only unit-tested via `sortHardlinksByDependencies`. Consider letting this helper control entry order and adding one end-to-end case with a hardlink before its target.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/engine/untar_test.go` around lines 728 - 759, writeTarballWithLinkChain currently always writes the regular file (fileHeader) before any link entries so the deferred-hardlink code path is never exercised; change writeTarballWithLinkChain to allow emitting link headers before the regular file (e.g., add a boolean parameter or detect a chain ordering flag and write the loop that emits linkHeader entries prior to writing fileHeader when requested) so tests can create a tar where a hardlink appears before its target; update or add an end-to-end test that uses writeTarballWithLinkChain with the new ordering to place a hardlink entry (from chain []linkChainEntry) before its target file to validate the deferred-and-replay hardlink handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@internal/engine/graph.go`:
- Around line 215-216: resolveRelativeLinkFrom currently joins absolute
linkTarget values against filepath.Dir(linkPath), producing wrong
archive-relative keys; update resolveRelativeLinkFrom to detect absolute
linkTarget (use filepath.IsAbs) and strip the leading path separator (e.g.,
strings.TrimPrefix(linkTarget, string(os.PathSeparator))) or otherwise convert
it to a relative form before calling filepath.Join/Clean so alias keys are built
as archive-relative paths; refer to the resolveRelativeLinkFrom function and
ensure related callers that populate alias maps (the hardlink and
chained-symlink propagation code paths) continue to receive normalized paths.
- Around line 66-117: The recursion only tracks full file paths in expanding,
which misses cycles caused by repeating directory substitutions (e.g.,
substituting dir -> symlinkPath repeatedly); update expandFilePathAliasesRec to
track directory-substitution keys (e.g., dir + "->" + symlinkPath) on the
recursion stack instead of (or in addition to) the full aliasedPath so you can
detect and skip repeated directory-alias substitutions that would loop
indefinitely. Concretely: when iterating aliases inside expandFilePathAliasesRec
and before recursing for aliasedPath, construct a unique key from the source
directory (dir) and symlinkPath, check that key in the expanding map to skip
recursion on repeats, add the key to expanding before calling
lg.expandFilePathAliasesRec and remove it after (defer) so nested recursion sees
correct state; keep visited for deduping returned paths as before.
---
Nitpick comments:
In `@internal/engine/untar_test.go`:
- Around line 728-759: writeTarballWithLinkChain currently always writes the
regular file (fileHeader) before any link entries so the deferred-hardlink code
path is never exercised; change writeTarballWithLinkChain to allow emitting link
headers before the regular file (e.g., add a boolean parameter or detect a chain
ordering flag and write the loop that emits linkHeader entries prior to writing
fileHeader when requested) so tests can create a tar where a hardlink appears
before its target; update or add an end-to-end test that uses
writeTarballWithLinkChain with the new ordering to place a hardlink entry (from
chain []linkChainEntry) before its target file to validate the
deferred-and-replay hardlink handling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 85f023cf-752f-4ed6-949d-3eda1e7a3bc0
📒 Files selected for processing (5)
internal/engine/engine.go, internal/engine/graph.go, internal/engine/graph_test.go, internal/engine/untar.go, internal/engine/untar_test.go
✅ Files skipped from review due to trivial changes (2)
- internal/engine/engine.go
- internal/engine/graph_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/engine/untar.go
Signed-off-by: Caleb Xu <caxu@redhat.com>
Assisted-by: Cursor
6e7a11d to 47fc1a1
@CodeRabbit full review
✅ Actions performed: Full review triggered.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Problem(s)
This PR addresses several corner cases in the handling of hardlinks/symlinks in the tar stream:
- `rpmdb.sqlite` may be canonically at `/usr/share/rpm/rpmdb.sqlite`, with a symlink `/usr/lib/sysimage/rpm -> ../../share/rpm`, thereby making the file available at `/usr/lib/sysimage/rpm/rpmdb.sqlite`.

Problems 2 and 3 were introduced with #1364; problem 1 has been present since #1179.
Existing process
Walk through the tar stream; extract regular file entries, symlinks, and hardlinks which match the filters. In the case where the symlink/hardlink points to a target that is not extracted yet, make additional passes through the tar stream as needed to resolve and extract the target (and any intermediary symlinks), repeating until there are no outstanding files to search for.
New process
Walk through the tar stream once, collecting a list (`allFiles`) of the names of all regular files, symlinks, and hardlinks, and also building a graph of all the symlinks and hardlinks.

Using the symlink/hardlink graph, build a map that shows, for a given link, the other paths by which that location is reachable (basically, a backward traversal of the graph); this occurs in `BuildDirectoryAliasMap`. In the example described above, there would be an entry in the map for `/usr/share/rpm -> /usr/lib/sysimage/rpm`, since you can arrive at `/usr/share/rpm` by traversing the `/usr/lib/sysimage/rpm` symlink.

Then, take all files in the tar stream (collected in `allFiles`) and determine all paths by which each location is reachable (this occurs in `ExpandFilePathAliases`). For example, given the input `/usr/share/rpm/rpmdb.sqlite`, it produces the output `["/usr/share/rpm/rpmdb.sqlite", "/usr/lib/sysimage/rpm/rpmdb.sqlite"]`; if any of these "aliases" match the filter pattern, we'll want to extract the original file and any links needed to traverse to the file at that location.
Finally, walk through the tar stream again and extract the exact list of files we need. When encountering hardlinks, if the target already exists, we can create it, otherwise save it for later.
After traversing the tar stream the second time, we may still have some hardlinks that need to be created. Sort them topologically to create the remaining hardlinks in the appropriate order.
Performance
I built a container locally and tried running it on an OpenShift cluster. Runtime is about the same as before, if not a little faster. Memory usage during the image extraction phase looks about the same (probably a few MiB more to store the maps/graphs described above).
How do other tools handle this?
- `oc image extract` allows filtering by paths but doesn't handle links (it tries to link to non-existent files, does not resolve and extract link targets, etc.)
- `crane export` does not allow filtering by paths
- `tar` allows for the extraction of specific files in the archive but will not resolve "virtual paths" that only exist when traversing links

Alternatives
Todo
- `graph.go`. Getting this up early to get some feedback before investing more time.