[SPARK-57590][SQL] Read and infer Parquet schema from tar and zip archives#56828
Open
akshatshenoi-db wants to merge 6 commits into
…ve merge conflicts and cover mergeSchema inference
4bd11d0 to 784be96
Now that zip container support is in master, Parquet reads zip archives with no code change (dispatch is via ArchiveReader). Move the Parquet-specific archive tests into ParquetArchiveReadBase (container-agnostic) so that both the tar suite and the new ParquetZipArchiveReadSuite exercise them; generalize the input_file_name assertion to the archive's own name.
…eader.inferArchiveSchema Add ArchiveReader.inferArchiveSchema(looseInfer)(foldEntries) -- shared partition + per-archive unpack/ignore-skip (withLocalizedArchive) + mergeSchema/sample policy + cleanup -- so ParquetFileFormat.inferArchiveSchema is a thin call supplying ParquetUtils.inferSchema + mergeArchiveEntrySchemas, mirroring the read path.
…ArchiveSchema inferArchiveSchema now owns the per-entry infer/merge/delete loop, taking inferOne + mergeSchemas; ParquetFileFormat.mergeArchiveEntrySchemas is removed and the footer reader is supplied inline. Trim verbose comments.
Distill the archive read/infer docs and inline comments to the essential intent.
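The per-entry infer/merge loop that these commits describe can be sketched in a few lines. This is a language-neutral Python illustration, not the PR's Scala code: `infer_archive_schema`, `merge_schemas`, the `merge_all` flag, and the dict-based schema are hypothetical stand-ins for `inferArchiveSchema`, `mergeSchemas`, the `mergeSchema` option, and a real schema type. The conflict rule (raise on differing types) is also an assumption; actual Parquet schema merging has its own rules.

```python
from typing import Callable, Dict, Iterable, Optional, TypeVar

Entry = TypeVar("Entry")
Schema = Dict[str, str]  # column name -> type name; stand-in for a real schema


def merge_schemas(left: Schema, right: Schema) -> Schema:
    """Union two schemas, keeping first-seen column order (assumed policy)."""
    merged = dict(left)
    for name, dtype in right.items():
        if name in merged and merged[name] != dtype:
            raise ValueError(f"conflicting types for column {name!r}")
        merged[name] = dtype
    return merged


def infer_archive_schema(
    entries: Iterable[Entry],
    infer_one: Callable[[Entry], Schema],
    merge_all: bool,
) -> Optional[Schema]:
    """Fold per-entry schemas one at a time.

    With merge_all=False only the first entry is inspected, mirroring the
    "only the first when mergeSchema=false" behaviour in the description.
    """
    result: Optional[Schema] = None
    for entry in entries:
        schema = infer_one(entry)
        result = schema if result is None else merge_schemas(result, schema)
        if not merge_all:
            break  # first entry is enough; later entries are never opened
    return result
```

Because the fold consumes `entries` lazily, it composes with an iterator that unpacks one archive entry at a time, so at most one entry needs to exist on local disk during inference.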
What changes were proposed in this pull request?
Extends the archive-read feature (gated by spark.sql.files.archive.reader.enabled, default off) to Parquet, for both tar (.tar/.tar.gz/.tgz) and .zip archives. The earlier formats, CSV (SPARK-57135, SPARK-57321), JSON (SPARK-57419), text (SPARK-57478), XML (SPARK-57479), Avro (SPARK-57481), and the zip container (SPARK-57705), stream each entry through ArchiveReader. Parquet can't: it needs random access to its footer, so an entry must be a complete, seekable file.

This unpacks entries to local temp files, one at a time:

- ArchiveReader gains localizeEntries/readLocalizedEntries, the random-access counterpart to readEntries: unpack a kept entry to a temp file, read it, and release the reader and file before the next entry opens. The temp dir is removed on task completion, and FileScanRDD closes the entry iterator, so an abandoned read (e.g. a LIMIT) doesn't leak.
- ParquetFileFormat: isSplitable is false for archives (one split each); the per-file read is factored into readSingleFile and reused per entry; input_file_name() / _metadata.file_path stay the archive path, not the temp file. Schema inference reads entry footers driver-side, folding one at a time (only the first when mergeSchema=false). A corrupt archive is skipped under ignoreCorruptFiles, a missing one under ignoreMissingFiles.

Reading works for tar and zip with no container-specific code; dispatch goes through ArchiveReader. V2 sources are untouched; archive dispatch is V1-only.

Why are the changes needed?
Parquet is the most common columnar format, and packing many small part-files into one archive is a natural way to ship them. This completes the archive-read series for Parquet, with the same gated, per-entry semantics as the other formats.
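The one-entry-at-a-time localization described under "What changes were proposed" can be illustrated with a minimal sketch. This is Python over the standard `tarfile` module, not the PR's Scala code: `localize_entries`, the `keep` predicate, and the temp-file naming are hypothetical. Only the shape follows the description: unpack a kept entry to a temp file, hand it to the reader, delete it before the next entry opens, and remove the temp directory even when the consumer stops early.

```python
import os
import shutil
import tarfile
import tempfile
from typing import Callable, Iterator


def localize_entries(archive_path: str, keep: Callable[[str], bool]) -> Iterator[str]:
    """Yield a local, seekable copy of each kept entry, one at a time.

    Hypothetical helper: at most one unpacked entry exists on disk at any
    moment, and the temp directory is removed when iteration ends or the
    generator is closed (an abandoned read).
    """
    tmp_dir = tempfile.mkdtemp(prefix="archive-entries-")
    try:
        # "r:*" auto-detects compression, so .tar, .tar.gz and .tgz all open.
        with tarfile.open(archive_path, "r:*") as tar:
            for index, member in enumerate(tar):
                if not member.isfile() or not keep(member.name):
                    continue
                local_path = os.path.join(tmp_dir, f"entry-{index}")
                source = tar.extractfile(member)
                with source, open(local_path, "wb") as out:
                    shutil.copyfileobj(source, out)
                try:
                    # The caller reads the complete file here (random access is fine).
                    yield local_path
                finally:
                    # Release this entry before the next one is unpacked.
                    if os.path.exists(local_path):
                        os.remove(local_path)
    finally:
        # Runs on normal exhaustion and on generator.close().
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

A consumer that stops after the first entry, analogous to a `LIMIT`, only needs to call `close()` on the generator (or let it be garbage-collected) for both `finally` blocks to run, which is the same role the description assigns to `FileScanRDD` closing the entry iterator.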
Does this PR introduce any user-facing change?
Yes, gated by spark.sql.files.archive.reader.enabled (default false). When enabled, Parquet files in tar or zip archives can be read and their schema inferred; with the flag off there is no change.

How was this patch tested?
ParquetArchiveReadBase holds the Parquet-specific tests (read parity across the vectorized and row-based readers, input_file_name, an abandoned LIMIT, differing fields, and mergeSchema union). ParquetTarArchiveReadSuite and ParquetZipArchiveReadSuite each run those plus the shared ArchiveReadSuiteBase parity/inference tests, over tar and zip respectively.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code