[SPARK-57590][SQL] Read and infer Parquet schema from tar and zip archives by akshatshenoi-db · Pull Request #56828 · apache/spark

akshatshenoi-db · 2026-06-26T23:57:28Z

What changes were proposed in this pull request?

Extends the archive-read feature (gated by spark.sql.files.archive.reader.enabled, default off) to Parquet, for both tar (.tar / .tar.gz / .tgz) and .zip archives. The earlier formats, CSV (SPARK-57135, SPARK-57321), JSON (SPARK-57419), text (SPARK-57478), XML (SPARK-57479), and Avro (SPARK-57481), along with the zip container (SPARK-57705), stream each entry through ArchiveReader. Parquet can't be streamed: a reader needs random access to the footer at the end of the file, so each entry must be available as a complete, seekable file.
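To illustrate why a forward-only stream is not enough, here is a minimal stdlib-only sketch (not code from this PR) of the first step any Parquet reader performs: seeking to the 8-byte trailer at the end of the file, which holds a 4-byte little-endian footer length followed by the `PAR1` magic. The helper name `footer_length` and the stand-in blob are illustrative assumptions; the blob only mimics the outer layout and is not a real Parquet file.

```python
import io
import struct

MAGIC = b"PAR1"


def footer_length(f):
    """Return the footer length recorded in the file's 8-byte trailer.

    Hypothetical helper for illustration; requires a seekable file object.
    """
    f.seek(-8, io.SEEK_END)  # jump to the trailer: impossible on a pure stream
    length, magic = struct.unpack("<I4s", f.read(8))
    if magic != MAGIC:
        raise ValueError("not a Parquet file: trailing magic is missing")
    return length


# Stand-in blob with the same outer layout: magic, data, footer, length, magic.
fake_footer = b"\x00" * 10
blob = (
    MAGIC
    + b"row-group-bytes"
    + fake_footer
    + struct.pack("<I", len(fake_footer))
    + MAGIC
)
print(footer_length(io.BytesIO(blob)))  # 10
```

Because the footer position is only known after reading the end of the file, an entry inside a compressed tar stream has to be fully materialized somewhere seekable before it can be read.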

This PR unpacks entries to local temp files, one at a time:

- ArchiveReader gains localizeEntries / readLocalizedEntries, the random-access counterpart to readEntries: unpack a kept entry to a temp file, read it, and release the reader and file before the next entry opens. The temp dir is removed on task completion, and FileScanRDD closes the entry iterator, so an abandoned read (e.g. a LIMIT) doesn't leak.
- ParquetFileFormat: isSplitable is false for archives (one split each); the per-file read is factored into readSingleFile and reused per entry; input_file_name() / _metadata.file_path stay the archive path, not the temp file. Schema inference reads entry footers driver-side, folding one at a time (only the first when mergeSchema=false). A corrupt archive is skipped under ignoreCorruptFiles, a missing one under ignoreMissingFiles.

Reading works for tar and zip with no container-specific code — dispatch goes through ArchiveReader. V2 sources are untouched; archive dispatch is V1-only.
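The one-entry-at-a-time unpacking and the container-agnostic dispatch described above can be sketched in plain Python. This is an illustration of the idea only, not the Scala implementation in this PR: `localize_entries` and its helpers are hypothetical names, dispatch by file extension is an assumption, and Python's generator cleanup stands in for the task-completion and iterator-close hooks.

```python
import os
import shutil
import tarfile
import tempfile
import zipfile


def _tar_entries(path):
    # "r:*" lets tarfile detect gzip compression transparently.
    with tarfile.open(path, "r:*") as tar:
        for member in tar:
            if member.isfile():
                with tar.extractfile(member) as stream:
                    yield member.name, stream


def _zip_entries(path):
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            if not info.is_dir():
                with zf.open(info) as stream:
                    yield info.filename, stream


def localize_entries(archive_path):
    """Yield (entry_name, local_path), keeping at most one temp file alive."""
    lower = archive_path.lower()
    if lower.endswith((".tar", ".tar.gz", ".tgz")):
        entries = _tar_entries(archive_path)
    elif lower.endswith(".zip"):
        entries = _zip_entries(archive_path)
    else:
        raise ValueError(f"unsupported archive: {archive_path}")

    tmp_dir = tempfile.mkdtemp(prefix="archive-entries-")
    try:
        for index, (name, stream) in enumerate(entries):
            local = os.path.join(tmp_dir, f"entry-{index}")
            with open(local, "wb") as out:
                shutil.copyfileobj(stream, out)
            try:
                yield name, local  # caller reads the seekable local copy
            finally:
                os.remove(local)  # released before the next entry opens
    finally:
        entries.close()  # closes the underlying archive handle
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

A caller that stops early (the analogue of an abandoned LIMIT) should call `close()` on the generator, or wrap it in `contextlib.closing`, so the `finally` blocks run and the temp directory is removed deterministically.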

Why are the changes needed?

Parquet is the most common columnar format, and packing many small part-files into one archive is a natural way to ship them. This completes the archive-read series for Parquet, with the same gated, per-entry semantics as the other formats.

Does this PR introduce any user-facing change?

Yes, gated by spark.sql.files.archive.reader.enabled (default false). When enabled, Parquet files in tar or zip archives can be read and their schema inferred; with the flag off there is no change.

How was this patch tested?

ParquetArchiveReadBase holds the Parquet-specific tests (read parity across the vectorized and row-based readers, input_file_name, an abandoned LIMIT, differing fields, and mergeSchema union). ParquetTarArchiveReadSuite and ParquetZipArchiveReadSuite each run those plus the shared ArchiveReadSuiteBase parity/inference tests, over tar and zip respectively.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

…ve merge conflicts and cover mergeSchema inference

Now that zip container support is in master, Parquet reads zip archives with no code change (dispatch is via ArchiveReader). Move the Parquet-specific archive tests into ParquetArchiveReadBase (container-agnostic) so both ParquetTarArchiveReadSuite and the new ParquetZipArchiveReadSuite exercise them; generalize the input_file_name assertion to the archive's own name.

…eader.inferArchiveSchema

Add ArchiveReader.inferArchiveSchema(looseInfer)(foldEntries) -- shared partition + per-archive unpack/ignore-skip (withLocalizedArchive) + mergeSchema/sample policy + cleanup -- so ParquetFileFormat.inferArchiveSchema is a thin call supplying ParquetUtils.inferSchema + mergeArchiveEntrySchemas, mirroring the read path.

…ArchiveSchema

inferArchiveSchema now owns the per-entry infer/merge/delete loop, taking inferOne + mergeSchemas; ParquetFileFormat.mergeArchiveEntrySchemas is removed and the footer reader is supplied inline. Trim verbose comments.
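The shape of that per-entry infer/merge loop can be sketched as follows. This is a simplified Python illustration, not the Scala code in the PR: schemas are plain `{column: type}` dicts, `infer_one` stands in for the footer reader, and `merge_schemas` uses strict type equality, which is an assumption and not Spark's actual merge rule.

```python
def merge_schemas(left, right):
    """Union two {column: type} schemas; raise on a type conflict."""
    merged = dict(left)
    for column, dtype in right.items():
        if column in merged and merged[column] != dtype:
            raise ValueError(
                f"cannot merge schemas: {column!r} is "
                f"{merged[column]} vs {dtype}"
            )
        merged[column] = dtype
    return merged


def infer_archive_schema(entries, infer_one, merge_schema):
    """Fold per-entry schemas one at a time.

    With merge_schema=False only the first entry is inspected, so a lazy
    `entries` iterable never unpacks the rest of the archive.
    """
    result = None
    for entry in entries:
        schema = infer_one(entry)
        result = schema if result is None else merge_schemas(result, schema)
        if not merge_schema:
            break
    return result


# Stand-in "footers": each entry just carries its schema.
footers = [
    {"schema": {"id": "long", "name": "string"}},
    {"schema": {"id": "long", "score": "double"}},
]
read_footer = lambda entry: entry["schema"]

print(infer_archive_schema(footers, read_footer, merge_schema=True))
# {'id': 'long', 'name': 'string', 'score': 'double'}
print(infer_archive_schema(footers, read_footer, merge_schema=False))
# {'id': 'long', 'name': 'string'}
```

Passing the inference and merge steps in as functions keeps the loop format-agnostic, which matches the intent stated in the commit message of a shared loop with a thin Parquet-specific caller.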

Distill the archive read/infer docs and inline comments to the essential intent.

akshatshenoi-db added 2 commits June 29, 2026 18:27

[SPARK-57590][SQL] Read and infer Parquet schema from tar archives

a831273

[SPARK-57590][SQL] Address review: use CANNOT_MERGE_SCHEMAS for archi…

784be96

akshatshenoi-db force-pushed the archive-parquet branch from 4bd11d0 to 784be96 on June 29, 2026 18:36

akshatshenoi-db changed the title ~~[SPARK-57590][SQL] Read and infer Parquet schema from tar archives~~ [SPARK-57590][SQL] Read and infer Parquet schema from tar and zip archives Jun 29, 2026

akshatshenoi-db added 3 commits June 30, 2026 22:04

[SPARK-57590][SQL] Trim ArchiveReader / ParquetFileFormat comments

cfc97a0

akshatshenoi-db wants to merge 6 commits into apache:master from akshatshenoi-db:archive-parquet
