TIKA-4705 -- resourceName of nested tarball should not contain the pa…#2730
TIKA-4705 -- resourceName of nested tarball should not contain the pa…#2730iachimoe wants to merge 2 commits intoapache:branch_3xfrom
Conversation
…rent directories of its parent gzip file.
| if (StringUtils.isBlank(name)) { | ||
| return; | ||
| } | ||
|
|
There was a problem hiding this comment.
Maybe use FilenameUtils.getName(name)
| @Test | ||
| public void testNestedTarball() throws Exception { | ||
| List<Metadata> list = getRecursiveMetadata("test-nested-tarball.tar"); | ||
| List<String> actualInternalPaths = |
There was a problem hiding this comment.
should be actualResourceNames
| .map(m -> m.get(TikaCoreProperties.RESOURCE_NAME_KEY)) | ||
| .collect(Collectors.toList()); | ||
|
|
||
| List<String> expectedInternalPaths = Arrays.asList("test-nested-tarball.tar", |
There was a problem hiding this comment.
These should be just the file names, right?
| "/nested.tgz/nested.tar/testTXT.txt", | ||
| "/nested.tgz/nested.tar", | ||
| "/nested.tgz"), actualEmbeddedPaths); | ||
| } |
There was a problem hiding this comment.
Add test for internalPaths?
There are three things we care about with this change: resource names (should be just the file name), embedded resource path and the internal path.
tballison
left a comment
There was a problem hiding this comment.
Please aim this against main. I'll cherrypick back to 3.x.
I think this makes sense and is a good catch.
There was a problem hiding this comment.
Pull request overview
Addresses TIKA-4705 by ensuring the resourceName assigned to a decompressed “inner tar” (e.g., nested.tgz → nested.tar) does not include the parent archive’s internal path components, and adds a regression test + fixture for nested tarball parsing.
Changes:
- Add a regression test validating
RESOURCE_NAME_KEY,INTERNAL_PATH, andEMBEDDED_RESOURCE_PATHfor a nested tar-in-tgz-in-tar scenario. - Add a new nested tarball test fixture (
test-nested-tarball.tar) under test resources. - Update
CompressorParser#setNameto strip path components from the parentRESOURCE_NAME_KEYbefore deriving the uncompressed name.
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
tika-parsers/.../RecursiveParserWrapperTest.java |
Adds testNestedTarball() assertions covering nested tarball naming/path metadata. |
tika-parsers/.../test-nested-tarball.tar |
New fixture to reproduce/guard the nested tarball naming behavior. |
tika-parsers/.../CompressorParser.java |
Normalizes derived embedded resource names by dropping parent path segments before extension-based renaming. |
Comments suppressed due to low confidence (1)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/CompressorParser.java:265
- In setName(), the condition
name.endsWith("gz")is missing the leading dot. This can match filenames that don't have an extension separator (e.g.,somethinggz) and thenname.substring(0, name.lastIndexOf("."))will throw when there is no.. Consider changing this toname.endsWith(".gz")(consistent with the other extensions) and guarding againstlastIndexOf('.') == -1(or using a safer uncompress-name helper) before substring.
if (name.endsWith(".tgz") || name.endsWith(".tbz") || name.endsWith(".tbz2")) {
name = name.substring(0, name.lastIndexOf(".")) + ".tar";
} else if (name.endsWith(".bz") || name.endsWith("gz") || name.endsWith(".bz2") || name.endsWith(".xz") || name.endsWith(".zlib") || name.endsWith(".pack") ||
name.endsWith(".br")) {
name = name.substring(0, name.lastIndexOf("."));
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
"Spotless reports import ordering issues. Specifically, org.apache.tika.io.FilenameUtils needs to be moved to the correct place in the import block (with the rest of the org.apache.tika.* imports), rather than being grouped with the non-Tika imports." |
|
|
||
| if (name.endsWith(".tgz") || name.endsWith(".tbz") || name.endsWith(".tbz2")) { | ||
| name = name.substring(0, name.lastIndexOf(".")) + ".tar"; | ||
| } else if (name.endsWith(".bz") || name.endsWith("gz") || name.endsWith(".bz2") || name.endsWith(".xz") || name.endsWith(".zlib") || name.endsWith(".pack") || |
There was a problem hiding this comment.
copilot correctly found the typo in "gz", we should fix this to ".gz" while we're working on this topic. It is unrelated and could be a separate PR, but for the sake of efficiency...
|
Hi folks, thanks for the comments. I am closing this PR as I have made a fresh PR against main - https://github.com/apache/tika/pull/2750/changes |
Detailed description of issue is at https://issues.apache.org/jira/browse/TIKA-4705