
[SPARK-57590][SQL] Read and infer Parquet schema from tar and zip archives #56828

Open: akshatshenoi-db wants to merge 6 commits into apache:master from akshatshenoi-db:archive-parquet

akshatshenoi-db commented Jun 26, 2026


What changes were proposed in this pull request?

Extends the archive-read feature (gated by spark.sql.files.archive.reader.enabled, default off) to Parquet, for both tar (.tar / .tar.gz / .tgz) and .zip archives. The earlier formats, CSV (SPARK-57135, SPARK-57321), JSON (SPARK-57419), text (SPARK-57478), XML (SPARK-57479), and Avro (SPARK-57481), along with the zip container (SPARK-57705), stream each entry through ArchiveReader. Parquet cannot be streamed this way: its reader needs random access to the footer at the end of the file, so each entry must be available as a complete, seekable file.
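To illustrate why a forward-only stream is not enough, here is a minimal Python sketch of locating the footer in a Parquet file. It relies only on the published Parquet layout (the file ends with a 4-byte little-endian footer length followed by the magic bytes PAR1). The helper name footer_span and the toy bytes are assumptions made for illustration and are not part of this PR.

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"


def footer_span(f):
    """Return (offset, length) of the footer metadata in a seekable file.

    Hypothetical helper: it only locates the footer, it does not parse it.
    """
    f.seek(0, io.SEEK_END)          # requires a seekable source
    size = f.tell()
    if size < 12:                   # leading magic + footer length + trailing magic
        raise ValueError("too small to be a Parquet file")
    f.seek(size - 8)
    tail = f.read(8)
    (length,) = struct.unpack("<I", tail[:4])
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    return size - 8 - length, length


# Toy bytes laid out like a Parquet file. The "metadata" is a placeholder,
# not real Thrift-encoded file metadata.
fake_metadata = b"not-real-thrift"
fake_file = (
    PARQUET_MAGIC
    + b"row-group-bytes"
    + fake_metadata
    + struct.pack("<I", len(fake_metadata))
    + PARQUET_MAGIC
)
offset, length = footer_span(io.BytesIO(fake_file))
```

The reader has to jump to the end of the file before it can interpret anything else, which is what a compressed tar stream cannot offer without first materializing the entry.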

This PR therefore unpacks entries to local temp files, one at a time:

  • ArchiveReader gains localizeEntries / readLocalizedEntries, the random-access counterpart to readEntries: each kept entry is unpacked to a temp file and read, and both the reader and the file are released before the next entry opens. The temp dir is removed on task completion, and FileScanRDD closes the entry iterator, so an abandoned read (e.g. a LIMIT) doesn't leak.
  • ParquetFileFormat: isSplitable is false for archives (one split each); the per-file read is factored into readSingleFile and reused per entry; input_file_name() / _metadata.file_path stay the archive path, not the temp file. Schema inference reads entry footers driver-side, folding one at a time (only the first when mergeSchema=false). A corrupt archive is skipped under ignoreCorruptFiles, a missing one under ignoreMissingFiles.
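The one-entry-at-a-time lifecycle in the first bullet can be sketched in Python with the standard library. This is an analogy, not the Scala implementation: localized_entries and _entries are hypothetical names, the dispatch on file extension is an assumption, and generator close stands in for FileScanRDD closing the entry iterator.

```python
import io
import os
import shutil
import tarfile
import tempfile
import zipfile
from contextlib import closing


def _entries(archive_path):
    """Yield (name, stream) for each regular file in a tar or zip archive."""
    if archive_path.lower().endswith(".zip"):
        with zipfile.ZipFile(archive_path) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    with zf.open(info) as src:
                        yield info.filename, src
    else:
        # "r:*" lets tarfile detect gzip compression transparently.
        with tarfile.open(archive_path, "r:*") as tf:
            for member in tf:
                if member.isfile():
                    with closing(tf.extractfile(member)) as src:
                        yield member.name, src


def localized_entries(archive_path):
    """Yield (entry_name, local_path), keeping one temp file on disk at a time."""
    tmp_dir = tempfile.mkdtemp(prefix="archive-entry-")
    try:
        with closing(_entries(archive_path)) as entries:
            for index, (name, src) in enumerate(entries):
                local_path = os.path.join(tmp_dir, f"entry-{index}")
                with open(local_path, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                try:
                    yield name, local_path   # caller gets a seekable local file
                finally:
                    os.remove(local_path)    # released before the next entry
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)  # also runs on early close


# Build a small tar.gz with two entries, then abandon the read after the first.
work = tempfile.mkdtemp()
archive = os.path.join(work, "parts.tar.gz")
with tarfile.open(archive, "w:gz") as tf:
    for entry_name, payload in [("a.bin", b"first"), ("b.bin", b"second")]:
        info = tarfile.TarInfo(entry_name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))

entries = localized_entries(archive)
first_name, first_path = next(entries)
with open(first_path, "rb") as f:
    first_bytes = f.read()
entries.close()  # early stop, comparable to a LIMIT cutting the scan short
shutil.rmtree(work)
```

Closing the generator triggers both finally blocks, so the temp file and the temp directory are gone even though the second entry was never read.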

Reading works for tar and zip with no container-specific code — dispatch goes through ArchiveReader. V2 sources are untouched; archive dispatch is V1-only.
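The schema-inference policy described in the ParquetFileFormat bullet (fold entry schemas one at a time, and read only the first entry when mergeSchema=false) can be sketched as follows. The names infer_archive_schema, union_fields, and toy_infer are hypothetical, schemas are modeled as plain dicts, and the merge is a simple field union, so this is a simplification of what Spark's real schema merging does.

```python
def infer_archive_schema(entry_paths, infer_one, merge_schemas, merge_schema):
    """Fold per-entry schemas lazily; stop after the first entry when not merging."""
    result = None
    for path in entry_paths:
        schema = infer_one(path)
        result = schema if result is None else merge_schemas(result, schema)
        if not merge_schema:
            break  # the first entry decides the schema
    return result


def union_fields(left, right):
    """Toy merge: union of fields, rejecting conflicting types."""
    merged = dict(left)
    for name, dtype in right.items():
        if merged.setdefault(name, dtype) != dtype:
            raise ValueError(f"conflicting types for field {name!r}")
    return merged


# Stand-in for footers read from localized entries.
toy_footers = {
    "part-0": {"id": "long"},
    "part-1": {"id": "long", "name": "string"},
}
seen = []


def toy_infer(path):
    seen.append(path)  # record which entries were actually read
    return toy_footers[path]


merged = infer_archive_schema(iter(toy_footers), toy_infer, union_fields, merge_schema=True)
first_only = infer_archive_schema(iter(toy_footers), toy_infer, union_fields, merge_schema=False)
```

Passing the entries as an iterator means that with merge_schema=False the second entry is never touched, which matches the intent of unpacking only what is needed.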

Why are the changes needed?

Parquet is the most common columnar format, and packing many small part-files into one archive is a natural way to ship them. This completes the archive-read series for Parquet, with the same gated, per-entry semantics as the other formats.

Does this PR introduce any user-facing change?

Yes, gated by spark.sql.files.archive.reader.enabled (default false). When enabled, Parquet files in tar or zip archives can be read and their schema inferred; with the flag off there is no change.

How was this patch tested?

ParquetArchiveReadBase holds the Parquet-specific tests (read parity across the vectorized and row-based readers, input_file_name, an abandoned LIMIT, differing fields, and mergeSchema union). ParquetTarArchiveReadSuite and ParquetZipArchiveReadSuite each run those plus the shared ArchiveReadSuiteBase parity/inference tests, over tar and zip respectively.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

Now that zip container support is in master, Parquet reads zip archives with
no code change (dispatch is via ArchiveReader). Move the Parquet-specific
archive tests into ParquetArchiveReadBase (container-agnostic) so both the tar
and the new ParquetZipArchiveReadSuite exercise them; generalize the
input_file_name assertion to the archive's own name.
@akshatshenoi-db akshatshenoi-db changed the title [SPARK-57590][SQL] Read and infer Parquet schema from tar archives [SPARK-57590][SQL] Read and infer Parquet schema from tar and zip archives Jun 29, 2026
…eader.inferArchiveSchema

Add ArchiveReader.inferArchiveSchema(looseInfer)(foldEntries) -- shared partition + per-archive
unpack/ignore-skip (withLocalizedArchive) + mergeSchema/sample policy + cleanup -- so
ParquetFileFormat.inferArchiveSchema is a thin call supplying ParquetUtils.inferSchema +
mergeArchiveEntrySchemas, mirroring the read path.
…ArchiveSchema

inferArchiveSchema now owns the per-entry infer/merge/delete loop, taking inferOne + mergeSchemas;
ParquetFileFormat.mergeArchiveEntrySchemas is removed and the footer reader is supplied inline.
Trim verbose comments.
Distill the archive read/infer docs and inline comments to the essential intent.