Skip to content

TIKA-4705 -- resourceName of nested tarball should not contain the pa…#2730

Closed
iachimoe wants to merge 2 commits intoapache:branch_3xfrom
iachimoe:feat/TIKA-4705_draft
Closed

TIKA-4705 -- resourceName of nested tarball should not contain the pa…#2730
iachimoe wants to merge 2 commits intoapache:branch_3xfrom
iachimoe:feat/TIKA-4705_draft

Conversation

@iachimoe
Copy link
Copy Markdown
Contributor

@iachimoe iachimoe commented Apr 2, 2026

Detailed description of issue is at https://issues.apache.org/jira/browse/TIKA-4705

if (StringUtils.isBlank(name)) {
return;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use FilenameUtils.getName(name)

@Test
public void testNestedTarball() throws Exception {
List<Metadata> list = getRecursiveMetadata("test-nested-tarball.tar");
List<String> actualInternalPaths =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be actualResourceNames

.map(m -> m.get(TikaCoreProperties.RESOURCE_NAME_KEY))
.collect(Collectors.toList());

List<String> expectedInternalPaths = Arrays.asList("test-nested-tarball.tar",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be just the file names, right?

"/nested.tgz/nested.tar/testTXT.txt",
"/nested.tgz/nested.tar",
"/nested.tgz"), actualEmbeddedPaths);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add test for internalPaths?

There are three things we care about with this change: resource names (should be just the file name), embedded resource path and the internal path.

Copy link
Copy Markdown
Contributor

@tballison tballison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please aim this against main. I'll cherrypick back to 3.x.

I think this makes sense and is a good catch.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Addresses TIKA-4705 by ensuring the resourceName assigned to a decompressed “inner tar” (e.g., nested.tgznested.tar) does not include the parent archive’s internal path components, and adds a regression test + fixture for nested tarball parsing.

Changes:

  • Add a regression test validating RESOURCE_NAME_KEY, INTERNAL_PATH, and EMBEDDED_RESOURCE_PATH for a nested tar-in-tgz-in-tar scenario.
  • Add a new nested tarball test fixture (test-nested-tarball.tar) under test resources.
  • Update CompressorParser#setName to strip path components from the parent RESOURCE_NAME_KEY before deriving the uncompressed name.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated no comments.

File Description
tika-parsers/.../RecursiveParserWrapperTest.java Adds testNestedTarball() assertions covering nested tarball naming/path metadata.
tika-parsers/.../test-nested-tarball.tar New fixture to reproduce/guard the nested tarball naming behavior.
tika-parsers/.../CompressorParser.java Normalizes derived embedded resource names by dropping parent path segments before extension-based renaming.
Comments suppressed due to low confidence (1)

tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/CompressorParser.java:265

  • In setName(), the condition name.endsWith("gz") is missing the leading dot. This can match filenames that don't have an extension separator (e.g., somethinggz) and then name.substring(0, name.lastIndexOf(".")) will throw when there is no .. Consider changing this to name.endsWith(".gz") (consistent with the other extensions) and guarding against lastIndexOf('.') == -1 (or using a safer uncompress-name helper) before substring.
        if (name.endsWith(".tgz") || name.endsWith(".tbz") || name.endsWith(".tbz2")) {
            name = name.substring(0, name.lastIndexOf(".")) + ".tar";
        } else if (name.endsWith(".bz") || name.endsWith("gz") || name.endsWith(".bz2") || name.endsWith(".xz") || name.endsWith(".zlib") || name.endsWith(".pack") ||
                name.endsWith(".br")) {
            name = name.substring(0, name.lastIndexOf("."));

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@THausherr
Copy link
Copy Markdown
Contributor

"Spotless reports import ordering issues. Specifically, org.apache.tika.io.FilenameUtils needs to be moved to the correct place in the import block (with the rest of the org.apache.tika.* imports), rather than being grouped with the non-Tika imports."


if (name.endsWith(".tgz") || name.endsWith(".tbz") || name.endsWith(".tbz2")) {
name = name.substring(0, name.lastIndexOf(".")) + ".tar";
} else if (name.endsWith(".bz") || name.endsWith("gz") || name.endsWith(".bz2") || name.endsWith(".xz") || name.endsWith(".zlib") || name.endsWith(".pack") ||
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copilot correctly found the typo in "gz", we should fix this to ".gz" while we're working on this topic. It is unrelated and could be a separate PR, but for the sake of efficiency...

@iachimoe
Copy link
Copy Markdown
Contributor Author

iachimoe commented Apr 7, 2026

Hi folks, thanks for the comments. I am closing this PR as I have made a fresh PR against main - https://github.com/apache/tika/pull/2750/changes

@iachimoe iachimoe closed this Apr 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants