
engine: support more robust handling of links in tar stream #1377

Draft
caxu-rh wants to merge 1 commit into redhat-openshift-ecosystem:main from caxu-rh:engine-link-handling

Conversation

@caxu-rh
Contributor

@caxu-rh caxu-rh commented Feb 26, 2026

Problem(s)

This PR addresses several corner cases in the handling of hardlinks/symlinks in the tar stream:

  1. A tar entry may be a hardlink to a symlink to a location on disk. The symlink target may be a relative path that resolves successfully when evaluated from the location of the hardlink. We currently rewrite these relative paths into absolute paths during extraction, but the conversion is performed from the location of the symlink, which breaks the hardlink.
  2. Hardlinks can only be created when the target file exists. Since we are now selective about which files we extract, it's possible that a hardlink is eligible for extraction while its link target has been skipped. Unlike the symlink case, we cannot just create the link and come back for the target later; we must extract the link target first and then create the link.
  3. A file might not match the filters for extraction via its canonical path, but one of its parent directories may be a symlink which, if traversed through that path, does match a filter for extraction. For example, rpmdb.sqlite may be canonically at /usr/share/rpm/rpmdb.sqlite, with a symlink /usr/lib/sysimage/rpm -> ../../share/rpm, thereby making the file available at /usr/lib/sysimage/rpm/rpmdb.sqlite.
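To make problem 3 concrete, a relative symlink target resolves against the directory containing the link. A minimal Go sketch of that resolution (resolveRelativeLink is a hypothetical helper, not code from this PR):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveRelativeLink resolves a symlink's relative target against the
// directory containing the link, yielding an archive-root-relative path.
func resolveRelativeLink(linkPath, target string) string {
	// filepath.Join cleans the result, collapsing any ".." components.
	return filepath.Join(filepath.Dir(linkPath), target)
}

func main() {
	// The symlink from problem 3: usr/lib/sysimage/rpm -> ../../share/rpm
	fmt.Println(resolveRelativeLink("usr/lib/sysimage/rpm", "../../share/rpm"))
	// → usr/share/rpm
}
```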

Problems 2 and 3 were introduced with #1364; problem 1 has been present since #1179.

Existing process

Walk through the tar stream; extract regular file entries, symlinks, and hardlinks which match the filters. In the case where the symlink/hardlink points to a target that is not extracted yet, make additional passes through the tar stream as needed to resolve and extract the target (and any intermediary symlinks), repeating until there are no outstanding files to search for.

New process

Walk through the tar stream once, collect a list (allFiles) of the names of all regular files, symlinks, and hardlinks, and also build a graph of all the symlinks and hardlinks.

Using the symlink/hardlink graph, build a map that shows, for a given link, what are the other paths by which that location is reachable (basically, backward traversal of the graph) - this occurs in BuildDirectoryAliasMap. In the example described above, there would be an entry in the map for /usr/share/rpm -> /usr/lib/sysimage/rpm, since you can arrive at /usr/share/rpm by traversing the /usr/lib/sysimage/rpm symlink.
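The backward traversal can be sketched as an inversion of the symlink map. This is a one-level illustration of what BuildDirectoryAliasMap computes; the real implementation also propagates aliases through chains of links (buildDirAliasMap is a hypothetical name):

```go
package main

import "fmt"

// buildDirAliasMap inverts a set of directory symlinks (link path ->
// resolved target) so each target lists the link paths that reach it.
func buildDirAliasMap(symlinks map[string]string) map[string][]string {
	aliases := make(map[string][]string)
	for link, target := range symlinks {
		aliases[target] = append(aliases[target], link)
	}
	return aliases
}

func main() {
	m := buildDirAliasMap(map[string]string{
		"usr/lib/sysimage/rpm": "usr/share/rpm",
	})
	fmt.Println(m["usr/share/rpm"]) // [usr/lib/sysimage/rpm]
}
```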

Then, take all files in the tar stream (collected in allFiles) and determine all paths by which each location is reachable (this occurs in ExpandFilePathAliases). For example, given the input /usr/share/rpm/rpmdb.sqlite, it returns ["/usr/share/rpm/rpmdb.sqlite", "/usr/lib/sysimage/rpm/rpmdb.sqlite"]. If any of these "aliases" match a filter pattern, we extract the original file along with any links needed to traverse to the file at that location.
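The expansion step can be sketched as follows, assuming a directory-to-aliases map like the one described above (expandFileAliases and its signature are illustrative, not the PR's actual ExpandFilePathAliases):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// expandFileAliases lists every path at which filePath is reachable, given a
// map from a directory to the symlinked paths that alias it. It walks each
// parent directory of the file and substitutes any known aliases.
func expandFileAliases(filePath string, dirAliases map[string][]string) []string {
	out := []string{filePath}
	for dir := filepath.Dir(filePath); dir != "." && dir != "/"; dir = filepath.Dir(dir) {
		for _, alias := range dirAliases[dir] {
			rel, err := filepath.Rel(dir, filePath)
			if err != nil {
				continue
			}
			out = append(out, filepath.Join(alias, rel))
		}
	}
	return out
}

func main() {
	aliases := map[string][]string{
		"usr/share/rpm": {"usr/lib/sysimage/rpm"},
	}
	fmt.Println(expandFileAliases("usr/share/rpm/rpmdb.sqlite", aliases))
}
```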

Next, examine each file we want to extract: if it is a symlink or hardlink, traverse it forward in the graph to find any additional links/files that we also need to extract.

Finally, walk through the tar stream again and extract the exact list of files we need. When encountering a hardlink, if its target already exists, create the link; otherwise, defer it for later.

After traversing the tar stream the second time, we may still have some hardlinks that need to be created. Sort them topologically to create the remaining hardlinks in the appropriate order.
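The topological sort can be sketched with Kahn's algorithm. This is illustrative rather than the PR's actual code, and it surfaces cycles as errors rather than dropping unsorted hardlinks silently:

```go
package main

import (
	"fmt"
	"sort"
)

// sortHardlinks orders deferred hardlinks (link path -> target path) so that
// any link whose target is itself a hardlink is created after that target.
func sortHardlinks(hardlinks map[string]string) ([]string, error) {
	inDegree := make(map[string]int)
	adj := make(map[string][]string)
	for link := range hardlinks {
		inDegree[link] = 0
	}
	for link, target := range hardlinks {
		// Only edges between hardlinks matter; regular-file targets are
		// already on disk by the time deferred links are created.
		if _, ok := hardlinks[target]; ok {
			adj[target] = append(adj[target], link)
			inDegree[link]++
		}
	}
	var queue []string
	for link, d := range inDegree {
		if d == 0 {
			queue = append(queue, link)
		}
	}
	sort.Strings(queue) // deterministic order for this sketch
	var sorted []string
	for len(queue) > 0 {
		link := queue[0]
		queue = queue[1:]
		sorted = append(sorted, link)
		for _, next := range adj[link] {
			if inDegree[next]--; inDegree[next] == 0 {
				queue = append(queue, next)
			}
		}
	}
	if len(sorted) != len(hardlinks) {
		return nil, fmt.Errorf("cyclic hardlink dependency among %d entries", len(hardlinks)-len(sorted))
	}
	return sorted, nil
}

func main() {
	order, err := sortHardlinks(map[string]string{
		"bin/b": "bin/a",        // hardlink to another hardlink
		"bin/a": "usr/bin/tool", // hardlink to a regular file
	})
	fmt.Println(order, err) // [bin/a bin/b] <nil>
}
```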

Performance

I built a container locally and tried running it on an OpenShift cluster. Runtime is about the same as before, if not a little faster. Memory usage during the image extraction phase looks about the same (probably a few MiB more to store the maps/graphs described above).

How do other tools handle this?

  • oc image extract allows filtering by paths but doesn't handle links (tries to link to non-existent files, does not resolve and extract link targets, etc.)
  • crane export does not allow filtering by paths
  • tar allows for the extraction of specific files in the archive but will not resolve "virtual paths" that only exist when traversing links

Alternatives

  • Ack and note the limitations and adjust filter paths as necessary when we need to. This may require staying on top of changes in Fedora/UBI.
  • Solve only problem 2 mentioned above; problems 1 and 3 require most of the refactor here, while problem 2 alone could likely be solved more simply.

Todo

  • More unit tests for graph.go. Getting this up early to get some feedback before investing more time.

@openshift-ci openshift-ci Bot requested review from bcrochet and jomkz February 26, 2026 16:30
@openshift-ci

openshift-ci Bot commented Feb 26, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: caxu-rh
Once this PR has been reviewed and has the lgtm label, please assign acornett21 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@caxu-rh caxu-rh force-pushed the engine-link-handling branch from 5b2aa20 to 755fb96 Compare February 26, 2026 16:36
@dcibot

dcibot commented Feb 26, 2026

@bcrochet
Contributor

/retest

@bcrochet bcrochet added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 28, 2026
@bcrochet
Contributor

Please rebase so the Gemini reviews can run.

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 28, 2026
@caxu-rh caxu-rh force-pushed the engine-link-handling branch from 755fb96 to 26c373b Compare March 2, 2026 15:03
@dcibot

dcibot commented Mar 2, 2026


@github-actions github-actions Bot left a comment


Testing review submission


@github-actions github-actions Bot left a comment


## 📋 Review Summary

This PR significantly improves the robustness of tar extraction when handling complex link structures (symlinks and hardlinks), particularly in the context of file filtering and directory symlinks. It introduces a LinkGraph and directory alias mapping to correctly identify files that should be extracted based on patterns, even when accessed through various link paths.

🔍 General Feedback

  • Robustness: The move to a two-pass approach (planning and execution) is a solid architectural decision that ensures dependencies are correctly identified and resolved before extraction.
  • Topological Sorting: Implementing Kahn's algorithm for hardlink dependency sorting is excellent and addresses previous issues where hardlinks could fail if their targets weren't yet extracted.
  • Security: The use of os.Root (via os.OpenRoot) provides strong protection against directory traversal attacks during extraction.
  • Bug Fixes: A critical bug in ExpandFilePathAliases needs to be addressed as it could cause a crash due to infinite recursion in the presence of symlink loops.
  • Optimization: Minor optimizations in engine.go and the general cleanup of the extraction logic are positive highlights.
  • Performance: While the two-pass approach is more robust, it does involve reading the tar stream twice, which may impact performance for remote images.

Comment thread internal/engine/engine.go
Comment thread internal/engine/untar.go Outdated
Comment thread internal/engine/graph.go
Comment thread internal/engine/untar.go
Comment thread internal/engine/untar.go
inDegree[hlName] = 0
adjList[hlName] = []string{}
}



🟡 If a hardlink cycle exists in the tarball, Kahn's algorithm will complete with len(sorted) < len(hardlinks), and the cyclic links will be silently skipped. It might be better to return an error or log a warning if the number of sorted hardlinks doesn't match the input.

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 1, 2026

@github-actions github-actions Bot left a comment


## 📋 Review Summary

The pull request introduces a more robust graph-based approach for handling symlinks and hardlinks during image extraction. While the architectural direction significantly improves the resolution of complex link chains and avoids multiple extraction passes, it introduces a critical infinite recursion bug in ExpandFilePathAliases and a high-priority regression in pattern matching support.

🔍 General Feedback

  • Robust Link Resolution: The new approach for tracking links via a graph and building an extraction plan is a major improvement over the previous progressive resolution method.
  • Topological Sorting of Hardlinks: Deferring hardlink creation and sorting them by dependencies ensures that targets always exist before the links are created.
  • Circular Symlink Risks: The current recursive implementation of ExpandFilePathAliases does not properly track visited nodes across recursive calls, which must be addressed to prevent stack overflows.
  • Pattern Matching Regression: Ensure that the doublestar package is still used for pattern matching to maintain support for recursive globbing patterns (**).

Comment thread internal/engine/graph.go
Comment thread internal/engine/graph.go
return linkNode.OriginalLinkname, true
}
if linkNode.VirtualLinkTarget != "" {
return linkNode.VirtualLinkTarget, true


🟡 BuildExtractionPlan has O(N * M) complexity where N is the number of files and M is the number of patterns, and it calls ExpandFilePathAliases for every file. For images with many files and many patterns, this could be a performance bottleneck. Consider pre-compiling patterns or optimization if performance issues arise in large images.

Comment thread internal/engine/engine.go
Comment thread internal/engine/graph.go
@caxu-rh caxu-rh force-pushed the engine-link-handling branch from 26c373b to b356972 Compare April 1, 2026 19:45
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 1, 2026
@coderabbitai

coderabbitai Bot commented Apr 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d3fbf6cb-de52-4480-9ae1-4b5f1b2ac962

📥 Commits

Reviewing files that changed from the base of the PR and between 9d502fa and 47fc1a1.

📒 Files selected for processing (5)
  • internal/engine/engine.go
  • internal/engine/graph.go
  • internal/engine/graph_test.go
  • internal/engine/untar.go
  • internal/engine/untar_test.go

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes

    • Improved symlink and hardlink handling during bundle extraction to prevent circular dependency issues.
    • Enhanced error detection and reporting for invalid link dependency chains.
  • Refactor

    • Optimized tar extraction pipeline for improved reliability and performance.

Walkthrough

Refactors tar extraction into a two-phase plan-and-run flow, adds a LinkGraph for symlink/hardlink aliasing and path expansion, changes per-entry bundle-hash write to use fmt.Fprintf, and adds/updates tests covering alias expansion, deep chains, deferred hardlink ordering, and canonicalization.

Changes

Cohort / File(s) Summary
Bundle hash write change
internal/engine/engine.go
Replaced per-entry hashBuffer.WriteString(fmt.Sprintf(...)) with fmt.Fprintf(&hashBuffer, ...) for bundle-hash line writes (format and data preserved).
Link graph feature & tests
internal/engine/graph.go, internal/engine/graph_test.go
Added LinkGraph (linkNode, linkType) with BuildDirectoryAliasMap(logger) map[string][]string and ExpandFilePathAliases(filePath, symlinkAliases) []string; implements alias propagation, virtual targets, cycle prevention, and chain traversal. New tests validate expansion, propagation, deep chains, and cycle handling.
Extraction two-phase refactor & tests
internal/engine/untar.go, internal/engine/untar_test.go
Reworked extraction into planExtraction (single tar scan building LinkGraph, matching patterns, computing transitive deps) and runExtraction (single targeted extraction). Deferred hardlink creation until after EOF with topological ordering and cycle detection. Tests updated/added for planning, deferred hardlink ordering/cycles, symlink canonicalization, and multiple complex link-chain scenarios.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Extractor
    participant TarReader as Tar Reader
    participant LinkGraph as Link Graph
    participant Filesystem as Filesystem Writer

    Client->>Extractor: untar(requiredPatterns)

    rect rgba(76, 175, 80, 0.5)
    Note over Extractor,LinkGraph: Planning Phase
    Extractor->>TarReader: read all entries (single pass)
    TarReader-->>Extractor: stream entries
    Extractor->>LinkGraph: build graph (files, symlinks, hardlinks)
    LinkGraph-->>Extractor: alias maps / virtual targets
    Extractor->>LinkGraph: ExpandFilePathAliases(matches)
    LinkGraph-->>Extractor: expanded paths
    Extractor->>Extractor: compute transitive dependencies (plan)
    Extractor-->>Client: return extraction plan
    end

    rect rgba(33, 150, 243, 0.5)
    Note over Extractor,Filesystem: Execution Phase
    Extractor->>TarReader: read tar stream again
    TarReader-->>Extractor: stream planned entries
    Extractor->>Filesystem: write regular files
    Filesystem-->>Extractor: files written
    Extractor->>Filesystem: create symlinks immediately
    Filesystem-->>Extractor: symlinks created
    Extractor->>LinkGraph: sort hardlinks by dependencies (topo)
    LinkGraph-->>Extractor: ordered hardlink list
    Extractor->>Filesystem: create hardlinks in order
    Filesystem-->>Extractor: hardlinks created
    end

    Extractor-->>Client: extraction complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Title check — ✅ Passed: The title accurately summarizes the main change: improving robust handling of hardlinks and symlinks in tar stream extraction, which is the central focus of all modifications across engine.go, graph.go, and untar.go.
  • Description check — ✅ Passed: The description thoroughly explains the three corner cases being addressed, the new extraction process with detailed steps, performance considerations, and alternative approaches considered, all directly relevant to the changeset.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 87.50%, which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coveralls

coveralls commented Apr 1, 2026

Coverage Status

coverage: 84.757% (+0.3%) from 84.442%
when pulling 47fc1a1 on caxu-rh:engine-link-handling
into 9d502fa on redhat-openshift-ecosystem:main.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/engine/untar.go (1)

279-286: ⚠️ Potential issue | 🟠 Major

Same error-overwriting issue as in planExtraction.

Apply the same fix here to preserve the original error when drain also fails.

Proposed fix
 defer func() {
     // Drain any remaining data from the reader and capture any errors
     _, drainErr := io.Copy(io.Discard, fs)
     if drainErr != nil {
-        err = fmt.Errorf("failed to drain io reader: %w", drainErr)
+        if err != nil {
+            err = fmt.Errorf("%w; additionally failed to drain io reader: %v", err, drainErr)
+        } else {
+            err = fmt.Errorf("failed to drain io reader: %w", drainErr)
+        }
     }
     fs.Close()
 }()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/engine/untar.go` around lines 279 - 286, The defer currently
unconditionally sets err when draining the reader fails, which can overwrite a
prior error; update the defer in the untar reader cleanup (the block using fs,
io.Copy(io.Discard, fs), drainErr and err) to preserve the original error: after
calling io.Copy check if drainErr != nil and then only set err to drainErr if
err == nil, or else combine both errors (e.g., errors.Join(err,
fmt.Errorf("failed to drain io reader: %w", drainErr))) so the original error is
not lost; also ensure fs.Close() still runs.
🧹 Nitpick comments (1)
internal/engine/untar.go (1)

96-98: Stale comment reference.

The comment refers to untarOnce which no longer exists; should reference runExtraction.

-		// Order doesn't matter because hardlinks are deferred in untarOnce
+		// Order doesn't matter because hardlinks are deferred in runExtraction
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/engine/untar.go` around lines 96 - 98, The comment above the slice
conversion references a removed function name; update the comment to refer to
runExtraction instead of untarOnce and keep the rest unchanged — specifically
the line around the append to result that uses
slices.Collect(maps.Keys(filesToExtract)) to build the extraction plan; ensure
the comment now reads something like "Order doesn't matter because hardlinks are
deferred in runExtraction" so it correctly points to runExtraction and clarifies
why ordering is irrelevant.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/engine/untar.go`:
- Around line 26-34: The defer that drains and closes fs can overwrite an
earlier returned error (e.g., the err set after tr.Next()), so change the defer
in untar.go to preserve and combine errors instead of clobbering them: detect
the existing named return error (err) and if io.Copy or fs.Close produce errors,
combine them with the original using errors.Join (or fmt.Errorf to wrap if
errors.Join unavailable), e.g. produce a combined error like "original error;
drain error: ...; close error: ..." and assign that back to err; reference the
fs variable and the defer block that currently does io.Copy(io.Discard, fs) and
fs.Close() as the location to apply this change.

---

Outside diff comments:
In `@internal/engine/untar.go`:
- Around line 279-286: The defer currently unconditionally sets err when
draining the reader fails, which can overwrite a prior error; update the defer
in the untar reader cleanup (the block using fs, io.Copy(io.Discard, fs),
drainErr and err) to preserve the original error: after calling io.Copy check if
drainErr != nil and then only set err to drainErr if err == nil, or else combine
both errors (e.g., errors.Join(err, fmt.Errorf("failed to drain io reader: %w",
drainErr))) so the original error is not lost; also ensure fs.Close() still
runs.

---

Nitpick comments:
In `@internal/engine/untar.go`:
- Around line 96-98: The comment above the slice conversion references a removed
function name; update the comment to refer to runExtraction instead of untarOnce
and keep the rest unchanged — specifically the line around the append to result
that uses slices.Collect(maps.Keys(filesToExtract)) to build the extraction
plan; ensure the comment now reads something like "Order doesn't matter because
hardlinks are deferred in runExtraction" so it correctly points to runExtraction
and clarifies why ordering is irrelevant.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f13bcffa-896c-4b97-bc14-4cbe0630be34

📥 Commits

Reviewing files that changed from the base of the PR and between 8381364 and b356972.

📒 Files selected for processing (5)
  • internal/engine/engine.go
  • internal/engine/graph.go
  • internal/engine/graph_test.go
  • internal/engine/untar.go
  • internal/engine/untar_test.go

Comment thread internal/engine/untar.go
@caxu-rh caxu-rh force-pushed the engine-link-handling branch from b356972 to 7f3bb42 Compare April 1, 2026 20:20

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
internal/engine/untar.go (1)

224-238: ⚠️ Potential issue | 🟠 Major

Don't silently drop cyclic hardlinks.

If Kahn's pass leaves nodes behind, sorted is shorter than hardlinks; EOF extraction then never sees those selected entries and still returns success. Please surface that cycle when len(sorted) != len(hardlinks).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/engine/untar.go` around lines 224 - 238, After running Kahn's
algorithm you must detect and surface cycles instead of silently dropping
entries: after the loop that builds `sorted` (the snippet iterating `queue` and
decrementing `inDegree`), check if `len(sorted) != len(hardlinks)` and return a
descriptive error (or propagate one) indicating a cyclic hardlink dependency and
include context (e.g., remaining nodes or their names) so extraction fails
rather than succeeding; update the function that performs this topological sort
to return the error instead of proceeding when the lengths differ.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/engine/graph.go`:
- Around line 305-313: The code resolves the raw link target returned by
getEffectiveLinkTarget(targetPath, lg) relative to the symlink's path
(linkPath), which mis-resolves chained links; instead, resolve the returned
linkTarget relative to targetPath so nested chains map correctly. Change the
resolveRelativeLinkFrom call to use targetPath (e.g., resolvedFromSymlink :=
resolveRelativeLinkFrom(targetPath, linkTarget)) and then proceed to call
addAliasIfNew(aliases, resolvedFromSymlink, linkPath) so alias matching uses the
target-path-relative resolution for chained symlinks.
- Around line 122-133: walkGraphChain can loop forever on cycles (e.g.,
a->b->a); modify walkGraphChain to track visited node names (use a
map[string]bool or set) and check before visiting/advancing: if current or
node.Deps.Name is already seen, break out to prevent infinite loops. Ensure you
mark nodes as visited when first encountered and use the same visited check
before calling visitor(node.Deps.Name, ... ) or setting current = node.Deps.Name
so cycles stop. Keep function signature (walkGraphChain, graph LinkGraph,
linkNode, visitor) unchanged.

In `@internal/engine/untar.go`:
- Around line 127-141: Normalize symlink targets before inserting into
linkGraph: if header.Linkname is absolute (starts with "/") strip the leading
slash and Clean it (e.g., strings.TrimPrefix(header.Linkname, "/") then
filepath.Clean) so the stored dep key matches tar entry paths (e.g.,
"usr/bin/bash"); otherwise continue to resolve relative targets with
filepath.Join(filepath.Dir(header.Name), header.Linkname) and filepath.Clean.
Update the creation of resolvedTargetName used when constructing linkNode
(symbols: header.Name, header.Linkname, resolvedTargetName, linkGraph, linkNode,
addTransitiveDependencies) and add the necessary import for strings if missing.

---

Duplicate comments:
In `@internal/engine/untar.go`:
- Around line 224-238: After running Kahn's algorithm you must detect and
surface cycles instead of silently dropping entries: after the loop that builds
`sorted` (the snippet iterating `queue` and decrementing `inDegree`), check if
`len(sorted) != len(hardlinks)` and return a descriptive error (or propagate
one) indicating a cyclic hardlink dependency and include context (e.g.,
remaining nodes or their names) so extraction fails rather than succeeding;
update the function that performs this topological sort to return the error
instead of proceeding when the lengths differ.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fc1e0d55-4937-4abe-9560-10bc9b391e50

📥 Commits

Reviewing files that changed from the base of the PR and between b356972 and 7f3bb42.

📒 Files selected for processing (5)
  • internal/engine/engine.go
  • internal/engine/graph.go
  • internal/engine/graph_test.go
  • internal/engine/untar.go
  • internal/engine/untar_test.go
✅ Files skipped from review due to trivial changes (2)
  • internal/engine/engine.go
  • internal/engine/untar_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/engine/graph_test.go

Comment thread internal/engine/graph.go
Comment thread internal/engine/graph.go
Comment thread internal/engine/untar.go
@caxu-rh caxu-rh force-pushed the engine-link-handling branch from 7f3bb42 to 49d921f Compare April 1, 2026 20:56

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/engine/graph.go`:
- Around line 273-340: The loop in build of the directory alias map uses a fixed
maxIterations (maxIterations=100) and silently returns a partial aliases map
when the ceiling is hit; change this to either (1) convert the algorithm to a
deterministic worklist/fixpoint approach (process a queue of changed nodes, only
re-enqueue neighbors when addAliasIfNew changes aliases) using the existing
symbols (lg, aliases, addAliasIfNew, resolveRelativeLinkFrom, getLinkTarget,
getEffectiveLinkTarget, linkNode.VirtualLinkTarget) so the loop terminates only
when the worklist is empty, or (2) if keeping the iterative approach, remove the
silent success path and have the function return an error when iteration >=
maxIterations (propagate an error out of the function so callers such as
planExtraction() can fail deterministically) and update relevant call sites to
handle the error; implement one of these fixes and remove the silent
log-that-it-may-be-incomplete behavior.
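The deterministic worklist/fixpoint approach suggested here can be sketched as follows: a node is re-enqueued only when its target's alias set grows, so the loop terminates without an iteration ceiling (propagateAliases is a hypothetical, simplified stand-in for the PR's alias propagation; on cycles it may record self-aliases, which real code would filter):

```go
package main

import "fmt"

// propagateAliases takes symlinks (link path -> resolved target) and computes,
// for each target, every path that eventually reaches it, via a worklist.
func propagateAliases(links map[string]string) map[string][]string {
	aliases := make(map[string][]string)
	add := func(target, alias string) bool {
		for _, a := range aliases[target] {
			if a == alias {
				return false
			}
		}
		aliases[target] = append(aliases[target], alias)
		return true
	}
	worklist := make([]string, 0, len(links))
	for link := range links {
		worklist = append(worklist, link)
	}
	for len(worklist) > 0 {
		link := worklist[0]
		worklist = worklist[1:]
		target := links[link]
		changed := add(target, link)
		// Anything that reaches the link also reaches its target.
		for _, a := range aliases[link] {
			if add(target, a) {
				changed = true
			}
		}
		// If the target is itself a link and gained aliases, reprocess it.
		if _, ok := links[target]; changed && ok {
			worklist = append(worklist, target)
		}
	}
	return aliases
}

func main() {
	// Chain a -> b -> c: both a and b are aliases for c.
	fmt.Println(propagateAliases(map[string]string{"a": "b", "b": "c"})["c"])
}
```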
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 06496e13-592c-40d0-8788-6518bbfd8804

📥 Commits

Reviewing files that changed from the base of the PR and between 7f3bb42 and 49d921f.

📒 Files selected for processing (5)
  • internal/engine/engine.go
  • internal/engine/graph.go
  • internal/engine/graph_test.go
  • internal/engine/untar.go
  • internal/engine/untar_test.go
✅ Files skipped from review due to trivial changes (1)
  • internal/engine/engine.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/engine/untar.go
  • internal/engine/graph_test.go

Comment thread internal/engine/graph.go
Comment thread internal/engine/graph.go Outdated
@dcibot

dcibot commented Apr 1, 2026

@caxu-rh caxu-rh force-pushed the engine-link-handling branch from 49d921f to c439135 Compare April 2, 2026 14:51

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
internal/engine/graph.go (1)

215-216: ⚠️ Potential issue | 🟠 Major

Normalize absolute link targets before alias propagation.

Line 216 always joins linkTarget under filepath.Dir(linkPath), but getLinkTarget() can return an absolute OriginalLinkname. For a nested link like foo/sh -> /usr/bin/bash, alias propagation becomes foo/usr/bin/bash, so chained symlink/hardlink aliases stop matching real tar entries. Please normalize absolute targets as archive-root-relative before further joins, and add a regression case with a non-root link path.

In Go on Unix, what does filepath.Join(filepath.Dir("foo/sh"), "/usr/bin/bash") return, and when a tar symlink target is absolute but needs to be treated as archive-root-relative, how should it be normalized before further joins?

Also applies to: 302-324

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/engine/graph.go` around lines 215 - 216, resolveRelativeLinkFrom
currently always joins linkTarget under filepath.Dir(linkPath), which breaks
when getLinkTarget() returns an absolute OriginalLinkname; change
resolveRelativeLinkFrom to detect absolute targets (filepath.IsAbs) and, for
absolutes, normalize them to archive-root-relative by trimming the leading path
separator (e.g. strings.TrimPrefix(linkTarget, string(os.PathSeparator))) and
returning filepath.Clean of that trimmed value instead of joining with
Dir(linkPath); keep the existing join behavior for non-absolute targets. Update
the alias-propagation code that consumes getLinkTarget() (the same logic
referenced around the other symlink handling block) to use this normalized
result, and add a regression test with a non-root link path (e.g. linkPath
"foo/sh" and linkTarget "/usr/bin/bash" expecting "usr/bin/bash") to verify
correct behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/engine/untar.go`:
- Around line 441-458: The code checks containment using resolvedON/finalOldname
but still writes the original header.Linkname for relative symlinks; instead,
after the containment check in the untar logic (around
resolveLinkPaths/header.Linkname/header.Name), compute a sanitized relative
target for non-absolute links by calling filepath.Rel(dstRoot.Name(),
finalOldname) (or equivalent) and verify it doesn't escape (no leading ".."),
then set symlinkTarget to that sanitized relative path before creating the
symlink; keep the absolute-branch behavior (use finalOldname) for absolute
links. Also add a regression test that extracts a tar with an entry like
"malicious-link -> ../../mnt" and asserts the created symlink target is the
sanitized relative path (selecting the link by name).
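A minimal standalone sketch of the `filepath.Rel`-based sanitization described above (function and variable names are illustrative, not the project's actual code): compute the target relative to the extraction root and reject anything that escapes upward.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// sanitizeRelTarget returns finalOldname expressed relative to dstRoot,
// rejecting targets that would escape the root via leading "..".
func sanitizeRelTarget(dstRoot, finalOldname string) (string, error) {
	rel, err := filepath.Rel(dstRoot, finalOldname)
	if err != nil {
		return "", err
	}
	if rel == ".." || strings.HasPrefix(rel, "../") {
		return "", fmt.Errorf("link target %q escapes %q", finalOldname, dstRoot)
	}
	return rel, nil
}

func main() {
	ok, _ := sanitizeRelTarget("/tmp/extract", "/tmp/extract/usr/bin/bash")
	fmt.Println(ok)
	// A "malicious-link -> ../../mnt" style target resolves outside the root
	// and is rejected rather than written verbatim.
	_, err := sanitizeRelTarget("/tmp/extract", "/mnt")
	fmt.Println(err != nil)
}
```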


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cb37a364-3c84-4374-92c5-19b7fbc6d862

📥 Commits

Reviewing files that changed from the base of the PR and between 49d921f and c439135.

📒 Files selected for processing (5)
  • internal/engine/engine.go
  • internal/engine/graph.go
  • internal/engine/graph_test.go
  • internal/engine/untar.go
  • internal/engine/untar_test.go
✅ Files skipped from review due to trivial changes (1)
  • internal/engine/engine.go

@caxu-rh caxu-rh force-pushed the engine-link-handling branch from c439135 to 6e7a11d Compare April 2, 2026 15:30

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (2)
internal/engine/graph.go (2)

215-216: ⚠️ Potential issue | 🟠 Major

Normalize absolute Linkname values before building alias keys.

When linkTarget starts with /, this helper still joins it against filepath.Dir(linkPath). In Go that produces paths like usr/lib/sysimage/usr/share/rpm, not the archive-relative usr/share/rpm, so the alias map gets populated under the wrong key. That breaks both the hardlink propagation at Line 302 and the chained-symlink propagation at Line 322. The new test in internal/engine/untar_test.go only checks Deps.Name, so this regression still slips through.

Possible fix
 import (
 	"archive/tar"
+	"path"
 	"path/filepath"
 	"slices"
+	"strings"
@@
 func resolveRelativeLinkFrom(linkPath, linkTarget string) string {
-	return filepath.Clean(filepath.Join(filepath.Dir(linkPath), linkTarget))
+	if strings.HasPrefix(linkTarget, "/") {
+		return strings.TrimPrefix(path.Clean(linkTarget), "/")
+	}
+	return path.Clean(path.Join(path.Dir(linkPath), linkTarget))
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/engine/graph.go` around lines 215 - 216, resolveRelativeLinkFrom
currently joins absolute linkTarget values against filepath.Dir(linkPath),
producing wrong archive-relative keys; update resolveRelativeLinkFrom to detect
absolute linkTarget (use filepath.IsAbs) and strip the leading path separator
(e.g., strings.TrimPrefix(linkTarget, string(os.PathSeparator))) or otherwise
convert it to a relative form before calling filepath.Join/Clean so alias keys
are built as archive-relative paths; refer to the resolveRelativeLinkFrom
function and ensure related callers that populate alias maps (the hardlink and
chained-symlink propagation code paths) continue to receive normalized paths.

66-117: ⚠️ Potential issue | 🔴 Critical

LinkGraph.ExpandFilePathAliases still has an unbounded cycle case.

The on-stack check at Line 103 only catches exact path repetition. A directory symlink back into an ancestor, like a/x -> ., yields a/file, a/x/file, a/x/x/file, ... without ever reusing the same full path, so planning can recurse until it exhausts stack or memory on a crafted image. Track repeated directory-alias substitutions on the recursion path instead of only checking the fully expanded file path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/engine/graph.go` around lines 66 - 117, the recursion only tracks
full file paths in expanding, which misses cycles caused by repeating directory
substitutions (e.g., substituting dir -> symlinkPath repeatedly); update
expandFilePathAliasesRec to track directory-substitution keys (e.g., dir + "->"
+ symlinkPath) on the recursion stack instead of (or in addition to) the full
aliasedPath so you can detect and skip repeated directory-alias substitutions
that would loop indefinitely. Concretely: when iterating aliases inside
expandFilePathAliasesRec and before recursing for aliasedPath, construct a
unique key from the source directory (dir) and symlinkPath, check that key in
the expanding map to skip recursion on repeats, add the key to expanding before
calling lg.expandFilePathAliasesRec and remove it after (defer) so nested
recursion sees correct state; keep visited for deduping returned paths as
before.
🧹 Nitpick comments (1)
internal/engine/untar_test.go (1)

728-759: writeTarballWithLinkChain never drives the deferred-hardlink branch.

Because Lines 732-745 always write the regular file before any link header, every new hardlink-chain integration test stays in the easy ordering where targets already exist in the tar stream. That leaves the new defer-and-replay logic only unit-tested via sortHardlinksByDependencies. Consider letting this helper control entry order and add one end-to-end case with a hardlink before its target.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/engine/untar_test.go` around lines 728 - 759,
writeTarballWithLinkChain currently always writes the regular file (fileHeader)
before any link entries so the deferred-hardlink code path is never exercised;
change writeTarballWithLinkChain to allow emitting link headers before the
regular file (e.g., add a boolean parameter or detect a chain ordering flag and
write the loop that emits linkHeader entries prior to writing fileHeader when
requested) so tests can create a tar where a hardlink appears before its target;
update or add an end-to-end test that uses writeTarballWithLinkChain with the
new ordering to place a hardlink entry (from chain []linkChainEntry) before its
target file to validate the deferred-and-replay hardlink handling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 85f023cf-752f-4ed6-949d-3eda1e7a3bc0

📥 Commits

Reviewing files that changed from the base of the PR and between c439135 and 6e7a11d.

📒 Files selected for processing (5)
  • internal/engine/engine.go
  • internal/engine/graph.go
  • internal/engine/graph_test.go
  • internal/engine/untar.go
  • internal/engine/untar_test.go
✅ Files skipped from review due to trivial changes (2)
  • internal/engine/engine.go
  • internal/engine/graph_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/engine/untar.go

Signed-off-by: Caleb Xu <caxu@redhat.com>
Assisted-by: Cursor
@caxu-rh caxu-rh force-pushed the engine-link-handling branch from 6e7a11d to 47fc1a1 Compare April 2, 2026 16:20
@dcibot

dcibot commented Apr 2, 2026

@bcrochet
Contributor

bcrochet commented Apr 8, 2026

@CodeRabbit full review

@coderabbitai

coderabbitai Bot commented Apr 8, 2026

✅ Actions performed

Full review triggered.

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 14, 2026
@openshift-ci

openshift-ci Bot commented Apr 14, 2026

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@caxu-rh caxu-rh marked this pull request as draft April 24, 2026 14:35
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 24, 2026

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants