[GH-2824] Add geotiff.metadata data source for GeoTIFF file metadata #2846

Merged
jiayuasu merged 11 commits into apache:master from jiayuasu:worktree-issue-2824-v2 on Apr 23, 2026

@jiayuasu jiayuasu commented Apr 21, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

Add a new read-only Spark DataSourceV2 (geotiff.metadata) that reads GeoTIFF file metadata without decoding pixel data, similar to gdalinfo.

Usage

spark.read.format("geotiff.metadata").load("/path/to/rasters/")

// COG detection
spark.read.format("geotiff.metadata").load("/path/to/*.tif")
  .filter("isTiled AND size(overviews) > 0")

Output schema

Returns one row per file with:

| Column | Type | Notes |
|---|---|---|
| path | String | File path |
| driver | String | "GTiff" |
| fileSize | Long | File size in bytes |
| width, height | Int | Pixel dimensions |
| numBands | Int | Number of bands |
| srid, crs | Int / String | EPSG code and WKT |
| geoTransform | Struct | upperLeftX/Y, scaleX/Y, skewX/Y |
| cornerCoordinates | Struct | minX/Y, maxX/Y |
| bands | Array[Struct] | band, dataType, colorInterpretation, noDataValue, blockWidth/Height, description, unit |
| overviews | Array[Struct] | level, width, height |
| metadata | Map[String,String] | File-wide TIFF metadata tags |
| isTiled | Boolean | From TIFF TileWidth tag (not RenderedImage) |
| compression | String | LZW, Deflate, etc. |

Key implementation details

  • isTiled: reads TIFF TileWidth tag (322) from IIO metadata, not RenderedImage tile size (which reports strips as tiles)
  • colorInterpretation: derived from TIFF Photometric tag (262) — Gray, Red, Green, Blue, Alpha, Palette
  • compression: reads TIFF tag 259 description attribute for human-readable names (LZW, Deflate)
  • overviews: uses DatasetLayout.getNumInternalOverviews() for real overview count, not synthetic tile-based levels
  • Schema-aware column pruning via readDataSchema
  • TIFF IIO metadata extracted before reader.read() to avoid stream state issues
  • Read-only: newWriteBuilder throws UnsupportedOperationException
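
To make the tag-based checks above concrete, here is a minimal sketch that reads the first IFD of a classic (non-BigTIFF) file directly from bytes and looks for TileWidth (322) and Compression (259). This is not how the PR does it (the PR reads the same tags through IIO metadata); the object name `TiffTagSketch`, the helper `tinyTiff`, and the name table are hypothetical and only illustrate why the presence of tag 322 is a reliable tiling signal.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object TiffTagSketch {
  val TileWidthTag = 322
  val CompressionTag = 259

  // A few well-known Compression codes; the PR reads names from IIO metadata instead.
  val CompressionNames: Map[Long, String] =
    Map(1L -> "None", 5L -> "LZW", 7L -> "JPEG", 8L -> "Deflate", 32773L -> "PackBits")

  /** Returns tag -> inline value for every entry in the first IFD. */
  def firstIfdTags(bytes: Array[Byte]): Map[Int, Long] = {
    val order =
      if (bytes(0) == 'I'.toByte && bytes(1) == 'I'.toByte) ByteOrder.LITTLE_ENDIAN
      else ByteOrder.BIG_ENDIAN
    val buf = ByteBuffer.wrap(bytes).order(order)
    require(buf.getShort(2) == 42, "not a classic TIFF header")
    val ifdOffset = buf.getInt(4)
    val entryCount = buf.getShort(ifdOffset) & 0xffff
    (0 until entryCount).map { i =>
      val entry = ifdOffset + 2 + i * 12 // each IFD entry is 12 bytes
      val tag = buf.getShort(entry) & 0xffff
      val fieldType = buf.getShort(entry + 2) & 0xffff
      val value: Long =
        if (fieldType == 3) (buf.getShort(entry + 8) & 0xffff).toLong // SHORT, left-justified
        else buf.getInt(entry + 8) & 0xffffffffL // LONG value, or an offset for larger types
      tag -> value
    }.toMap
  }

  /** A file is tiled when the TileWidth tag is present; strip files carry no such tag. */
  def isTiled(tags: Map[Int, Long]): Boolean = tags.contains(TileWidthTag)

  def compressionName(tags: Map[Int, Long]): Option[String] =
    tags.get(CompressionTag).flatMap(CompressionNames.get)

  /** Builds a header-only little-endian TIFF with SHORT tags, for demonstration. */
  def tinyTiff(shortTags: Seq[(Int, Int)]): Array[Byte] = {
    val buf = ByteBuffer.allocate(8 + 2 + shortTags.size * 12 + 4).order(ByteOrder.LITTLE_ENDIAN)
    buf.put('I'.toByte).put('I'.toByte).putShort(42.toShort).putInt(8)
    buf.putShort(shortTags.size.toShort)
    shortTags.foreach { case (tag, value) =>
      buf.putShort(tag.toShort).putShort(3.toShort).putInt(1)
      buf.putShort(value.toShort).putShort(0.toShort)
    }
    buf.putInt(0) // offset of next IFD: none
    buf.array()
  }
}
```

For example, `TiffTagSketch.isTiled(TiffTagSketch.firstIfdTags(TiffTagSketch.tinyTiff(Seq(259 -> 5, 322 -> 256))))` is true, while a file that only carries tag 259 is reported as not tiled.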

Files

  • New package: spark/common/.../io/geotiffmetadata/ — 5 Scala files (GeoTiffMetadataDataSource, GeoTiffMetadataTable, GeoTiffMetadataScanBuilder, GeoTiffMetadataPartitionReaderFactory, GeoTiffMetadataPartitionReader)
  • Service registration: Added to META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
  • Documentation: docs/tutorial/files/geotiffmetadata-sedona-spark.md with schema reference, examples, COG detection
  • SVG visuals: docs/image/geotiff_metadata/schema_overview.svg and cog_structure.svg
  • mkdocs.yml: Added navigation entry under Files

How was this patch tested?

11 tests in geotiffMetadataTest.scala with exact-match assertions:

  • test1.tiff metadata: width=512, height=517, srid=3857, fileSize=174803, band type=UNSIGNED_8BITS, blockSize=256x256, colorInterpretation=Gray, geoTransform values, cornerCoordinates values
  • Cross-validation against format("raster") + RS_Width/RS_Height/RS_NumBands/RS_SRID
  • COG test: generates COG on-the-fly via RS_AsCOG, verifies isTiled=true, 2 overviews, blockSize=256x256
  • Empty overviews for non-COG test1.tiff (verified 1 IFD only)
  • Glob pattern loading (7 .tiff files) and recursive directory loading (9 files total)
  • LIMIT pushdown and column pruning

Did this PR include necessary documentation updates?

  • Yes, I have updated the documentation.

…adata

Add a new Spark DataSourceV2 that reads GeoTIFF file metadata without
decoding pixel data, similar to gdalinfo.

Usage: spark.read.format("geotiff.metadata").load("/path/to/*.tif")

Returns one row per file with: path, driver, fileSize, width, height,
numBands, srid, crs, geoTransform (struct), cornerCoordinates (struct),
bands (array with dataType, noData, blockSize, colorInterpretation),
overviews (array of structs with level/width/height), metadata (map), isTiled,
and compression.

Closes apache#2824
@jiayuasu jiayuasu force-pushed the worktree-issue-2824-v2 branch from 6375cab to 9ced509 Compare April 21, 2026 16:46
Rename package, classes, files, docs, and service registration
from GeoTiffInfo* to GeoTiffMetadata*. The data source shortName
remains "geotiff.metadata".
- schema_overview.svg: shows full output schema with nested structs
- cog_structure.svg: illustrates COG properties mapped to schema fields
- Add COG detection section with visual before the read examples
@jiayuasu jiayuasu changed the title [GH-2824] Add geotiffinfo data source for GeoTIFF file metadata [GH-2824] Add geotiff.metadata data source for GeoTIFF file metadata Apr 22, 2026

Copilot AI left a comment

Pull request overview

Adds a new read-only Spark DataSourceV2 (geotiff.metadata) to extract GeoTIFF file metadata (dimensions, CRS, band/tile info, overviews, compression, TIFF tags) without loading it as a raster column, plus docs and tests to validate expected outputs and pushdowns.

Changes:

  • Introduce geotiff.metadata FileDataSourceV2 implementation (table/scan/reader) and register it via DataSourceRegister.
  • Add a Scala test suite covering single-file reads, directory/glob loading, limit pushdown, column pruning, and COG detection.
  • Add a documentation page and SVG diagrams, and wire the page into MkDocs navigation.

Reviewed changes

Copilot reviewed 9 out of 11 changed files in this pull request and generated 4 comments.

File Description
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataDataSource.scala DataSourceV2 entrypoint and path/glob handling for geotiff.metadata.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataTable.scala Table definition and metadata output schema.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataScanBuilder.scala Scan planning with LIMIT pushdown and partition planning.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReaderFactory.scala Reader factory wiring for file partitions.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReader.scala Per-file metadata extraction logic and TIFF IIO helpers.
spark/common/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister Registers the new GeoTiffMetadataDataSource.
spark/common/src/test/scala/org/apache/sedona/sql/geotiffMetadataTest.scala End-to-end tests for schema/values, pushdowns, and generated COG detection.
docs/tutorial/files/geotiffmetadata-sedona-spark.md User-facing documentation and schema reference for geotiff.metadata.
docs/image/geotiff_metadata/schema_overview.svg Diagram of the output schema.
docs/image/geotiff_metadata/cog_structure.svg Diagram explaining COG-related fields.
mkdocs.yml Adds the new tutorial page to navigation.

… match

- Add `override` to newWriteBuilder in GeoTiffMetadataTable
- Set fallbackFileFormat to null (consistent with GeoParquetMetadataDataSource
  for read-only data sources — prevents V1 fallback attempts)
- Replace unsafe asInstanceOf[FilePartition] in LIMIT pushdown with
  pattern match that throws IllegalArgumentException on unexpected
  partition types
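
The last item can be sketched roughly as follows. The types here are hypothetical local stand-ins for Spark's `InputPartition` and `FilePartition` (the real `FilePartition` holds `PartitionedFile` objects, not strings), and `trimToLimit` is an illustrative name, not the PR's method. Since the source emits one row per file, trimming the file list to the limit is enough.

```scala
object LimitPushdownSketch {
  // Hypothetical stand-ins for Spark's partition types.
  trait InputPartition
  final case class FilePartition(index: Int, files: Seq[String]) extends InputPartition
  final case class OtherPartition(index: Int) extends InputPartition

  /** Keeps at most `limit` files across partitions, preserving partition order. */
  def trimToLimit(partitions: Seq[InputPartition], limit: Int): Seq[FilePartition] = {
    var remaining = limit
    val kept = Seq.newBuilder[FilePartition]
    partitions.foreach {
      case fp: FilePartition =>
        if (remaining > 0) {
          val files = fp.files.take(remaining)
          remaining -= files.size
          kept += fp.copy(files = files)
        }
      case other =>
        // A blind asInstanceOf would surface here as a ClassCastException;
        // the pattern match names the offending type instead.
        throw new IllegalArgumentException(
          s"Unexpected partition type: ${other.getClass.getName}")
    }
    kept.result()
  }
}
```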
@jiayuasu jiayuasu requested a review from Copilot April 22, 2026 06:49

Copilot AI left a comment

Pull request overview

Adds a new Spark DataSourceV2 (geotiff.metadata) for reading GeoTIFF file metadata (dimensions, CRS, tiling, compression, overviews, band info) without loading pixel data, plus documentation and tests to support GeoTIFF cataloging/COG detection workflows in Sedona Spark.

Changes:

  • Introduces geotiff.metadata DataSourceV2 implementation (table/scan/reader) with column pruning and LIMIT pushdown.
  • Registers the new data source and adds end-user documentation (including schema diagrams).
  • Adds a dedicated Scala test suite validating metadata extraction and pushdowns.

Reviewed changes

Copilot reviewed 9 out of 11 changed files in this pull request and generated 2 comments.

File Description
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataDataSource.scala Implements the DataSourceV2 entrypoint and path/glob handling for GeoTIFF metadata reads.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataTable.scala Defines the table + output schema for metadata rows.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataScanBuilder.scala Adds scan planning including LIMIT pushdown and reader factory wiring.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReaderFactory.scala Creates partition readers and attaches partition values.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReader.scala Extracts GeoTIFF metadata (TIFF tags, bands, CRS, overviews) into InternalRow.
spark/common/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister Registers geotiff.metadata for Spark discovery.
spark/common/src/test/scala/org/apache/sedona/sql/geotiffMetadataTest.scala Adds end-to-end tests for correctness, pushdowns, glob + recursive loading, and COG detection.
docs/tutorial/files/geotiffmetadata-sedona-spark.md Documents usage, schema, and example queries for COG detection and metadata inspection.
docs/image/geotiff_metadata/schema_overview.svg Adds a visual overview of the output schema.
docs/image/geotiff_metadata/cog_structure.svg Adds a visual mapping between COG structure and reported fields.
mkdocs.yml Adds the new documentation page to MkDocs navigation.

…non-nullable

- Rebuild ArrayBasedMapData from a single pass over (k,v) entries to
  guarantee key/value index alignment
- Mark fileSize as non-nullable since it's always populated from a
  primitive Long (PartitionedFile.fileSize)
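
A minimal illustration of the alignment point, assuming the entries arrive as key/value pairs: Spark's `ArrayBasedMapData` stores keys and values as two parallel arrays, so deriving both from one traversal keeps index `i` of each array pointing at the same entry. The object and method names below are hypothetical.

```scala
object AlignedMapSketch {
  /** Splits entries into parallel key and value arrays in a single pass. */
  def toAlignedArrays(entries: Iterable[(String, String)]): (Array[String], Array[String]) = {
    val (keys, values) = entries.toSeq.unzip
    (keys.toArray, values.toArray)
  }
}
```

Building the two arrays from separate `keys` and `values` traversals usually yields the same order, but nothing in the collection contract makes that obvious to a reader; the single pass removes the question.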

Copilot AI left a comment

Pull request overview

Adds a new read-only Spark DataSourceV2 (geotiff.metadata) to extract GeoTIFF file metadata (dimensions, CRS, bands, tiling, overviews, compression) in a “gdalinfo-like” way, plus docs and tests to expose/validate the new format.

Changes:

  • Introduces geotiff.metadata DataSourceV2 + table/scan/reader implementation in spark/common.
  • Registers the new data source in Spark’s DataSourceRegister service file.
  • Adds user documentation (with schema reference) and supporting SVG diagrams; adds unit tests covering core behavior (including glob, recursion w/ trailing slash, LIMIT pushdown, column pruning, and generated COG detection).

Reviewed changes

Copilot reviewed 9 out of 11 changed files in this pull request and generated 4 comments.

File Description
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataDataSource.scala DataSourceV2 entry point; path/glob/recursive lookup option handling.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataTable.scala Defines the table and output schema; enforces read-only behavior.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataScanBuilder.scala Scan builder/scan; LIMIT pushdown by trimming file partitions; reader factory wiring.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReaderFactory.scala Creates per-partition readers and attaches partition values.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReader.scala Per-file metadata extraction (TIFF tags, CRS, affine transform, bands, overviews, metadata map).
spark/common/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister Registers GeoTiffMetadataDataSource for Spark discovery.
spark/common/src/test/scala/org/apache/sedona/sql/geotiffMetadataTest.scala New unit tests for metadata correctness, glob/dir loading, pruning, LIMIT pushdown, and COG detection.
docs/tutorial/files/geotiffmetadata-sedona-spark.md New documentation page with schema tables and usage examples.
docs/image/geotiff_metadata/schema_overview.svg Diagram of output schema.
docs/image/geotiff_metadata/cog_structure.svg Diagram explaining COG properties mapping to output fields.
mkdocs.yml Adds the new tutorial page to navigation.

…rectory check

- Gate TIFF IIO metadata calls (hasTiffTag, extractPhotometricInterpretation,
  extractMetadata, extractCompression) on whether their output columns are
  in readDataSchema. Queries like SELECT path, width, height no longer pay
  for unused TIFF tree traversals.
- Skip reader.read(null) entirely when the requested schema only needs
  cheap fields (path, driver, fileSize, overviews, metadata, isTiled,
  compression). Only fields requiring GridCoverage2D (width, height,
  numBands, srid, crs, geoTransform, cornerCoordinates, bands) trigger
  coverage read.
- Skip opening the file entirely when only path/driver/fileSize are
  requested.
- buildOverviewsArray now gets dimensions from reader.getOriginalGridRange
  instead of requiring coverage read.
- Detect directory via Hadoop FileSystem.getFileStatus.isDirectory in
  addition to the trailing-slash check. A directory path without trailing
  slash now correctly applies recursive lookup + *.tif glob filter.
- Add test for loading directory without trailing slash.
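
The directory rule in the last two items might look like the sketch below. `recursiveFileLookup` and `pathGlobFilter` are standard Spark file-source options, but whether the PR sets exactly these keys is an assumption, and the glob value is taken from the commit message rather than from the code. `java.nio.file` stands in for Hadoop's `FileSystem.getFileStatus(...).isDirectory`; all names are hypothetical.

```scala
import java.nio.file.{Files, Paths}

object PathOptionsSketch {
  /**
   * Extra read options for a load path. The directory probe is injected so a
   * Hadoop-backed check can replace the local one.
   */
  def optionsFor(path: String, isDirectory: String => Boolean): Map[String, String] =
    if (path.endsWith("/") || isDirectory(path))
      Map("recursiveFileLookup" -> "true", "pathGlobFilter" -> "*.tif")
    else
      Map.empty // explicit files and glob patterns are passed through unchanged

  // Local-filesystem stand-in for the Hadoop directory check.
  def localIsDirectory(path: String): Boolean = Files.isDirectory(Paths.get(path))
}
```

With this shape, `/data/rasters` and `/data/rasters/` resolve to the same options once the probe confirms the first is a directory, which is the behavior the commit describes.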

Copilot AI left a comment

Pull request overview

Adds a new read-only Spark DataSourceV2 (geotiff.metadata) to extract GeoTIFF file metadata (without decoding pixel data), plus accompanying documentation and test coverage.

Changes:

  • Introduces geotiff.metadata FileDataSourceV2 + Table/Scan/Reader implementation to read per-file GeoTIFF metadata.
  • Registers the new data source for Spark discovery and adds MkDocs navigation + a new tutorial page (with schema diagrams).
  • Adds a Scala test suite validating key metadata extraction behaviors (dimensions/CRS/bands/overviews, glob + recursive loading, LIMIT + column pruning).

Reviewed changes

Copilot reviewed 9 out of 11 changed files in this pull request and generated 3 comments.

File Description
spark/common/src/test/scala/org/apache/sedona/sql/geotiffMetadataTest.scala New end-to-end tests for the geotiff.metadata data source.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataDataSource.scala DataSourceV2 entrypoint (shortName = geotiff.metadata) and path handling (dir/glob).
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataTable.scala Defines table capabilities and the output schema for metadata rows.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataScanBuilder.scala Implements batch scan planning with LIMIT pushdown and column pruning.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReaderFactory.scala Builds partition readers for file partitions.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReader.scala Core per-file metadata extraction logic (GeoTools/IIO metadata + optional coverage read).
spark/common/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister Registers GeoTiffMetadataDataSource for Spark format(...) lookup.
mkdocs.yml Adds the new tutorial page to the docs navigation.
docs/tutorial/files/geotiffmetadata-sedona-spark.md New user documentation (usage, schema reference, examples).
docs/image/geotiff_metadata/schema_overview.svg Diagram of the output schema.
docs/image/geotiff_metadata/cog_structure.svg Diagram explaining COG detection via schema fields.

…ests

- isDirectory now builds Hadoop conf via newHadoopConfWithOptions so
  per-read options (e.g. fs.s3a.*) flow through to directory detection
- buildMetadataMap returns empty MapData instead of null for empty
  metadata, making it consistent with overviews/bands collection columns
- Rewrite metadata extraction to walk TIFFField nodes and map each
  field's `name` attribute to the leaf value (previous approach looked
  for name+value on same node, which never matched the TIFF IIO tree
  structure — metadata column was always empty)
- Add test assertions for compression (test1.tiff=LZW, generated COG=LZW)
  and metadata (non-empty) columns
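
The metadata fix is easier to follow with the tree shape in view. In the TIFF IIO metadata tree, a `TIFFField` element carries `number` and `name`, while the value sits on a nested leaf (for example `TIFFField > TIFFShorts > TIFFShort value="5"`). A walk that expects `name` and `value` on the same node therefore never matches. The sketch below uses only JDK DOM classes; the object name is hypothetical, and it keeps just the first leaf value of multi-valued fields, which may differ from the PR.

```scala
import org.w3c.dom.{Element, Node}

object TiffFieldWalkSketch {
  private def children(node: Node): Seq[Node] = {
    val list = node.getChildNodes
    (0 until list.getLength).map(list.item)
  }

  private def attr(node: Node, name: String): Option[String] = node match {
    case e: Element if e.hasAttribute(name) => Some(e.getAttribute(name))
    case _ => None
  }

  /** Depth-first search for the first descendant that carries a `value` attribute. */
  private def leafValue(node: Node): Option[String] =
    children(node).iterator
      .map(c => attr(c, "value").orElse(leafValue(c)))
      .collectFirst { case Some(v) => v }

  /** Maps each TIFFField's `name` attribute to the value found on its leaf node. */
  def extract(root: Node): Map[String, String] = {
    def loop(node: Node): Seq[(String, String)] =
      if (node.getNodeName == "TIFFField")
        (for (n <- attr(node, "name"); v <- leafValue(node)) yield n -> v).toSeq
      else
        children(node).flatMap(loop)
    loop(root).toMap
  }
}
```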

Copilot AI left a comment

Pull request overview

Adds a new read-only Spark DataSourceV2 (geotiff.metadata) to extract GeoTIFF file metadata (dimensions/CRS/bands/tiling/overviews/compression/TIFF tags) without decoding pixel data, enabling fast cataloging and COG detection workflows in Sedona Spark.

Changes:

  • Introduces geotiff.metadata DataSourceV2 implementation (table/scan/reader pipeline) with column pruning and LIMIT pushdown.
  • Registers the data source and adds a dedicated test suite validating metadata extraction and COG detection.
  • Adds end-user documentation + nav entry and supporting diagrams for schema/COG concepts.

Reviewed changes

Copilot reviewed 9 out of 11 changed files in this pull request and generated 1 comment.

File Description
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataDataSource.scala Data source entry point (shortName = geotiff.metadata), directory/glob handling, read-only config.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataTable.scala Defines output schema and read-only table capabilities/scan builder wiring.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataScanBuilder.scala FileScanBuilder + LIMIT pushdown behavior and reader factory creation.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReaderFactory.scala Creates partition readers for file partitions (with partition values).
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReader.scala Core metadata extraction logic (GeoTools GeoTiffReader + TIFF IIO metadata parsing), builds InternalRows.
spark/common/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister Registers the new data source for Spark discovery.
spark/common/src/test/scala/org/apache/sedona/sql/geotiffMetadataTest.scala Adds tests for exact metadata, directory/glob loading, LIMIT/column pruning, and COG detection.
docs/tutorial/files/geotiffmetadata-sedona-spark.md New user documentation page with schema reference and examples.
mkdocs.yml Adds documentation navigation entry for GeoTIFF metadata.
docs/image/geotiff_metadata/schema_overview.svg Diagram of geotiff.metadata output schema.
docs/image/geotiff_metadata/cog_structure.svg Diagram explaining COG structure mapping to output fields.

…r call

Fetch reader.getMetadata().getRootNode() once per file and pass the root
to the helper methods, so the DOM tree is walked at most once per file
even when multiple TIFF tag-based columns (isTiled, bands, metadata,
compression) are selected together.

Copilot AI left a comment

Pull request overview

Introduces a new read-only Spark DataSourceV2 (geotiff.metadata) to extract GeoTIFF file metadata (dimensions/CRS/bands/tiling/overviews/compression/TIFF tags) without decoding pixel data, enabling cataloging and COG detection workflows similar to gdalinfo.

Changes:

  • Added geotiff.metadata DataSourceV2 implementation (table/scan/reader) with column pruning and LIMIT pushdown.
  • Registered the new data source and added a dedicated Scala test suite validating core fields, COG detection, glob/recursive loading, and pruning.
  • Added end-user documentation (tutorial page + MkDocs nav entry) and supporting schema/COG diagrams.

Reviewed changes

Copilot reviewed 9 out of 11 changed files in this pull request and generated 1 comment.

File Description
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataDataSource.scala Data source entrypoint; path handling (dir vs glob), schema inference, read-only configuration.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataTable.scala Defines table + output schema (structs/arrays/maps) and read-only table behavior.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataScanBuilder.scala File scan planning plus LIMIT pushdown logic and reader factory wiring.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReaderFactory.scala Builds per-partition readers and wraps them with partition values.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReader.scala Core metadata extraction (IIO TIFF tags + optional coverage-derived fields) and row construction.
spark/common/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister Registers GeoTiffMetadataDataSource for Spark discovery.
spark/common/src/test/scala/org/apache/sedona/sql/geotiffMetadataTest.scala Adds test coverage validating metadata extraction, COG detection, and pruning/pushdown behavior.
docs/tutorial/files/geotiffmetadata-sedona-spark.md User documentation for the new data source (usage + schema reference + examples).
mkdocs.yml Adds docs navigation entry for GeoTIFF metadata.
docs/image/geotiff_metadata/schema_overview.svg Diagram of output schema.
docs/image/geotiff_metadata/cog_structure.svg Diagram explaining COG property mapping to schema fields.

@jiayuasu jiayuasu requested a review from Copilot April 23, 2026 04:59

Copilot AI left a comment

Pull request overview

Adds a new read-only Spark DataSourceV2 (geotiff.metadata) to inventory GeoTIFF file metadata (dimensions/CRS/bands/overviews/tiling/compression) without using the raster pixel decode path, plus documentation and test coverage.

Changes:

  • Introduces geotiff.metadata DataSourceV2 implementation (table/scan/reader) and registers it via Spark service loader.
  • Adds Scala test suite covering exact metadata extraction, directory/glob loading behavior, LIMIT pushdown, and column pruning.
  • Adds user documentation and accompanying SVG diagrams, and wires the page into MkDocs navigation.

Reviewed changes

Copilot reviewed 9 out of 11 changed files in this pull request and generated 1 comment.

File Description
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataDataSource.scala DataSourceV2 entrypoint, path/directory handling, and read-only configuration.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataTable.scala Table definition, schema, scan builder hookup, and read-only write rejection.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataScanBuilder.scala Scan planning, LIMIT pushdown partition trimming, and reader-factory creation.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReaderFactory.scala Wires Spark partitions to per-file metadata readers.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReader.scala Core per-file metadata extraction + column-pruning-aware execution.
spark/common/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister Registers GeoTiffMetadataDataSource for format("geotiff.metadata").
spark/common/src/test/scala/org/apache/sedona/sql/geotiffMetadataTest.scala End-to-end tests for metadata correctness, COG detection, and pushdowns.
docs/tutorial/files/geotiffmetadata-sedona-spark.md User docs for usage, schema reference, and examples.
docs/image/geotiff_metadata/schema_overview.svg Diagram of the output schema.
docs/image/geotiff_metadata/cog_structure.svg Diagram explaining COG detection fields and overview structure.
mkdocs.yml Adds the new tutorial page to the docs navigation.

width and height can be obtained from reader.getOriginalGridRange()
without calling reader.read(null). Queries that only project width or
height no longer force a GridCoverage2D to be built.

Introduce READER_ONLY_FIELDS (width, height, overviews) that need the
reader but not the raster coverage.
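
Taken together with the earlier gating commit, the requested columns appear to fall into three tiers of work. The sketch below is one reading of those commit messages, not the PR's code: `ReaderOnlyFields` mirrors the READER_ONLY_FIELDS set named above, while the other set names, the `Work` type, and `workFor` are hypothetical.

```scala
object ReadPlanSketch {
  sealed trait Work
  case object FileStatusOnly extends Work // never opens the file
  case object OpenReader extends Work     // opens the reader, skips reader.read(null)
  case object ReadCoverage extends Work   // builds a GridCoverage2D

  val FileStatusFields: Set[String] = Set("path", "driver", "fileSize")
  // Mirrors READER_ONLY_FIELDS from the commit message.
  val ReaderOnlyFields: Set[String] = Set("width", "height", "overviews")
  // Columns served from TIFF IIO metadata, per the earlier gating commit.
  val TiffTagFields: Set[String] = Set("metadata", "isTiled", "compression")
  val CoverageFields: Set[String] =
    Set("numBands", "srid", "crs", "geoTransform", "cornerCoordinates", "bands")

  /** Picks the most expensive tier that any requested column needs. */
  def workFor(requested: Set[String]): Work =
    if (requested.exists(CoverageFields.contains)) ReadCoverage
    else if (requested.exists(f => ReaderOnlyFields.contains(f) || TiffTagFields.contains(f)))
      OpenReader
    else FileStatusOnly
}
```

Under this reading, `SELECT path, width, height` needs only an open reader, while adding `srid` to the projection moves the query to the coverage tier.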

Copilot AI left a comment

Pull request overview

Adds a new read-only Spark DataSourceV2 (geotiff.metadata) to inventory GeoTIFF file metadata (dimensions, CRS, bands, tiling, overviews, compression, TIFF tags) without decoding pixel data, plus docs and a test suite to validate correctness and pushdowns.

Changes:

  • Introduce geotiff.metadata DataSourceV2 implementation (table/scan/partition reader) with column pruning and LIMIT pushdown.
  • Register the data source via Spark service loader.
  • Add end-to-end tests plus user documentation and schema/COG visuals.

Reviewed changes

Copilot reviewed 9 out of 11 changed files in this pull request and generated 2 comments.

File Description
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataDataSource.scala Table provider + path handling (directory recursion, glob rewrite), read-only V2 source
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataTable.scala Defines fixed output schema and table scan wiring
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataScanBuilder.scala FileScanBuilder + LIMIT pushdown implementation
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReaderFactory.scala Creates partition readers with broadcast Hadoop conf
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReader.scala Core metadata extraction (GeoTools reader + TIFF IIO DOM helpers), schema-aware work avoidance
spark/common/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister Registers GeoTiffMetadataDataSource for format("geotiff.metadata")
spark/common/src/test/scala/org/apache/sedona/sql/geotiffMetadataTest.scala Validates metadata extraction, recursion/glob, LIMIT pushdown, and column pruning
docs/tutorial/files/geotiffmetadata-sedona-spark.md User-facing docs, schema reference, and examples (incl. COG detection)
docs/image/geotiff_metadata/schema_overview.svg Visual overview of output schema
docs/image/geotiff_metadata/cog_structure.svg Visual explanation of COG properties mapped to schema fields
mkdocs.yml Adds docs nav entry for GeoTIFF metadata

@jiayuasu jiayuasu marked this pull request as ready for review April 23, 2026 05:25
@jiayuasu jiayuasu modified the milestones: sedona-1.9.0, sedona-1.9.1 Apr 23, 2026
@jiayuasu jiayuasu merged commit 4b60275 into apache:master Apr 23, 2026
46 checks passed

Development

Successfully merging this pull request may close these issues.

Add a Sedona Spark data source similar to gdalinfo

2 participants