[GH-2824] Add geotiff.metadata data source for GeoTIFF file metadata #2846
…adata
Add a new Spark DataSourceV2 that reads GeoTIFF file metadata without
decoding pixel data, similar to gdalinfo.
Usage: spark.read.format("geotiff.metadata").load("/path/to/*.tif")
Returns one row per file with: path, driver, fileSize, width, height,
numBands, srid, crs, geoTransform (struct), cornerCoordinates (struct),
bands (array with dataType, noData, blockSize, colorInterpretation),
overviews (struct with level/width/height), metadata (map), isTiled,
and compression.
Closes apache#2824
6375cab to 9ced509
Rename package, classes, files, docs, and service registration from GeoTiffInfo* to GeoTiffMetadata*. The data source shortName remains "geotiff.metadata".
- schema_overview.svg: shows full output schema with nested structs
- cog_structure.svg: illustrates COG properties mapped to schema fields
- Add COG detection section with visual before the read examples
Pull request overview
Adds a new read-only Spark DataSourceV2 (geotiff.metadata) to extract GeoTIFF file metadata (dimensions, CRS, band/tile info, overviews, compression, TIFF tags) without loading it as a raster column, plus docs and tests to validate expected outputs and pushdowns.
Changes:
- Introduce `geotiff.metadata` FileDataSourceV2 implementation (table/scan/reader) and register it via `DataSourceRegister`.
- Add a Scala test suite covering single-file reads, directory/glob loading, limit pushdown, column pruning, and COG detection.
- Add a documentation page and SVG diagrams, and wire the page into MkDocs navigation.
Reviewed changes
Copilot reviewed 9 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataDataSource.scala | DataSourceV2 entrypoint and path/glob handling for geotiff.metadata. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataTable.scala | Table definition and metadata output schema. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataScanBuilder.scala | Scan planning with LIMIT pushdown and partition planning. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReaderFactory.scala | Reader factory wiring for file partitions. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReader.scala | Per-file metadata extraction logic and TIFF IIO helpers. |
| spark/common/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister | Registers the new GeoTiffMetadataDataSource. |
| spark/common/src/test/scala/org/apache/sedona/sql/geotiffMetadataTest.scala | End-to-end tests for schema/values, pushdowns, and generated COG detection. |
| docs/tutorial/files/geotiffmetadata-sedona-spark.md | User-facing documentation and schema reference for geotiff.metadata. |
| docs/image/geotiff_metadata/schema_overview.svg | Diagram of the output schema. |
| docs/image/geotiff_metadata/cog_structure.svg | Diagram explaining COG-related fields. |
| mkdocs.yml | Adds the new tutorial page to navigation. |
… match
- Add `override` to newWriteBuilder in GeoTiffMetadataTable
- Set fallbackFileFormat to null (consistent with GeoParquetMetadataDataSource for read-only data sources; prevents V1 fallback attempts)
- Replace unsafe asInstanceOf[FilePartition] in LIMIT pushdown with a pattern match that throws IllegalArgumentException on unexpected partition types
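The LIMIT pushdown change described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `InputPartitionLike`, `FilePartitionLike`, and `OtherPartition` are hypothetical stand-ins for Spark's partition types, and the trimming rule assumes one output row per file, as the PR description states.

```scala
object LimitPushdownSketch {
  // Hypothetical stand-ins for Spark's InputPartition / FilePartition (not the real classes).
  sealed trait InputPartitionLike
  final case class FilePartitionLike(index: Int, files: Seq[String]) extends InputPartitionLike
  final case class OtherPartition(index: Int) extends InputPartitionLike

  /** Keep only as many files as the LIMIT needs, since each file yields one row. */
  def trim(partitions: Seq[InputPartitionLike], limit: Int): Seq[FilePartitionLike] = {
    val (_, kept) = partitions.foldLeft((limit, Vector.empty[FilePartitionLike])) {
      // Pattern match instead of asInstanceOf: the expected type is handled here...
      case ((remaining, acc), fp: FilePartitionLike) =>
        if (remaining <= 0) (remaining, acc)
        else {
          val files = fp.files.take(remaining)
          (remaining - files.size, acc :+ fp.copy(files = files))
        }
      // ...and anything else fails with a descriptive error rather than a ClassCastException.
      case (_, other) =>
        throw new IllegalArgumentException(
          s"Unexpected partition type: ${other.getClass.getName}")
    }
    kept
  }
}
```

With a limit of 1 and two partitions, only the first file of the first partition survives; partitions past the limit are dropped entirely.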
Pull request overview
Adds a new Spark DataSourceV2 (geotiff.metadata) for reading GeoTIFF file metadata (dimensions, CRS, tiling, compression, overviews, band info) without loading pixel data, plus documentation and tests to support GeoTIFF cataloging/COG detection workflows in Sedona Spark.
Changes:
- Introduces `geotiff.metadata` DataSourceV2 implementation (table/scan/reader) with column pruning and LIMIT pushdown.
- Registers the new data source and adds end-user documentation (including schema diagrams).
- Adds a dedicated Scala test suite validating metadata extraction and pushdowns.
Reviewed changes
Copilot reviewed 9 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataDataSource.scala | Implements the DataSourceV2 entrypoint and path/glob handling for GeoTIFF metadata reads. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataTable.scala | Defines the table + output schema for metadata rows. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataScanBuilder.scala | Adds scan planning including LIMIT pushdown and reader factory wiring. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReaderFactory.scala | Creates partition readers and attaches partition values. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReader.scala | Extracts GeoTIFF metadata (TIFF tags, bands, CRS, overviews) into InternalRow. |
| spark/common/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister | Registers geotiff.metadata for Spark discovery. |
| spark/common/src/test/scala/org/apache/sedona/sql/geotiffMetadataTest.scala | Adds end-to-end tests for correctness, pushdowns, glob + recursive loading, and COG detection. |
| docs/tutorial/files/geotiffmetadata-sedona-spark.md | Documents usage, schema, and example queries for COG detection and metadata inspection. |
| docs/image/geotiff_metadata/schema_overview.svg | Adds a visual overview of the output schema. |
| docs/image/geotiff_metadata/cog_structure.svg | Adds a visual mapping between COG structure and reported fields. |
| mkdocs.yml | Adds the new documentation page to MkDocs navigation. |
…non-nullable
- Rebuild ArrayBasedMapData from a single pass over (k,v) entries to guarantee key/value index alignment
- Mark fileSize as non-nullable since it's always populated from a primitive Long (PartitionedFile.fileSize)
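The alignment concern above can be illustrated without Spark. Spark's `ArrayBasedMapData` holds a key array and a value array, so index `i` in one must correspond to index `i` in the other. The sketch below (object and method names are hypothetical) builds both arrays in one traversal of the entries rather than iterating keys and values separately:

```scala
object AlignedMapSketch {
  /** Build parallel key and value arrays from a single traversal of the entries. */
  def toParallelArrays(entries: Iterable[(String, String)]): (Array[String], Array[String]) = {
    val keys = Array.newBuilder[String]
    val values = Array.newBuilder[String]
    // Each (k, v) pair is appended in the same step, so indices cannot drift apart.
    entries.foreach { case (k, v) =>
      keys += k
      values += v
    }
    (keys.result(), values.result())
  }
}
```

Because both builders advance together, the result does not depend on any ordering guarantee of the source collection.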
Pull request overview
Adds a new read-only Spark DataSourceV2 (geotiff.metadata) to extract GeoTIFF file metadata (dimensions, CRS, bands, tiling, overviews, compression) in a “gdalinfo-like” way, plus docs and tests to expose/validate the new format.
Changes:
- Introduces `geotiff.metadata` DataSourceV2 + table/scan/reader implementation in `spark/common`.
- Registers the new data source in Spark’s `DataSourceRegister` service file.
- Adds user documentation (with schema reference) and supporting SVG diagrams; adds unit tests covering core behavior (including glob, recursion w/ trailing slash, LIMIT pushdown, column pruning, and generated COG detection).
Reviewed changes
Copilot reviewed 9 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataDataSource.scala | DataSourceV2 entry point; path/glob/recursive lookup option handling. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataTable.scala | Defines the table and output schema; enforces read-only behavior. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataScanBuilder.scala | Scan builder/scan; LIMIT pushdown by trimming file partitions; reader factory wiring. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReaderFactory.scala | Creates per-partition readers and attaches partition values. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReader.scala | Per-file metadata extraction (TIFF tags, CRS, affine transform, bands, overviews, metadata map). |
| spark/common/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister | Registers GeoTiffMetadataDataSource for Spark discovery. |
| spark/common/src/test/scala/org/apache/sedona/sql/geotiffMetadataTest.scala | New unit tests for metadata correctness, glob/dir loading, pruning, LIMIT pushdown, and COG detection. |
| docs/tutorial/files/geotiffmetadata-sedona-spark.md | New documentation page with schema tables and usage examples. |
| docs/image/geotiff_metadata/schema_overview.svg | Diagram of output schema. |
| docs/image/geotiff_metadata/cog_structure.svg | Diagram explaining COG properties mapping to output fields. |
| mkdocs.yml | Adds the new tutorial page to navigation. |
…rectory check
- Gate TIFF IIO metadata calls (hasTiffTag, extractPhotometricInterpretation, extractMetadata, extractCompression) on whether their output columns are in readDataSchema. Queries like SELECT path, width, height no longer pay for unused TIFF tree traversals.
- Skip reader.read(null) entirely when the requested schema only needs cheap fields (path, driver, fileSize, overviews, metadata, isTiled, compression). Only fields requiring GridCoverage2D (width, height, numBands, srid, crs, geoTransform, cornerCoordinates, bands) trigger coverage read.
- Skip opening the file entirely when only path/driver/fileSize are requested.
- buildOverviewsArray now gets dimensions from reader.getOriginalGridRange instead of requiring coverage read.
- Detect directory via Hadoop FileSystem.getFileStatus.isDirectory in addition to the trailing-slash check. A directory path without trailing slash now correctly applies recursive lookup + *.tif glob filter.
- Add test for loading directory without trailing slash.
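The directory-detection rule above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `java.nio.file` stands in for the Hadoop `FileSystem` check, `ReadPlan` and `plan` are hypothetical names, and the use of Spark's `recursiveFileLookup` and `pathGlobFilter` read options is an assumption about how "recursive lookup + *.tif glob filter" is applied.

```scala
import java.nio.file.{Files, Paths}
import scala.util.Try

object PathResolutionSketch {
  // Hypothetical result type: the path plus the read options derived from it.
  final case class ReadPlan(path: String, options: Map[String, String])

  // Local-filesystem stand-in for Hadoop's FileSystem.getFileStatus(...).isDirectory.
  private def localIsDirectory(path: String): Boolean =
    Try(Files.isDirectory(Paths.get(path))).getOrElse(false)

  /** Treat a path as a directory if it ends with '/' OR the filesystem says so. */
  def plan(path: String, isDirectory: String => Boolean = localIsDirectory): ReadPlan =
    if (path.endsWith("/") || isDirectory(path))
      // Glob pattern taken from the commit message; option names are assumed.
      ReadPlan(path, Map("recursiveFileLookup" -> "true", "pathGlobFilter" -> "*.tif"))
    else
      ReadPlan(path, Map.empty)
}
```

Passing the directory check as a function keeps the rule testable and leaves room for a filesystem-specific implementation.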
Pull request overview
Adds a new read-only Spark DataSourceV2 (geotiff.metadata) to extract GeoTIFF file metadata (without decoding pixel data), plus accompanying documentation and test coverage.
Changes:
- Introduces `geotiff.metadata` FileDataSourceV2 + Table/Scan/Reader implementation to read per-file GeoTIFF metadata.
- Registers the new data source for Spark discovery and adds MkDocs navigation + a new tutorial page (with schema diagrams).
- Adds a Scala test suite validating key metadata extraction behaviors (dimensions/CRS/bands/overviews, glob + recursive loading, LIMIT + column pruning).
Reviewed changes
Copilot reviewed 9 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| spark/common/src/test/scala/org/apache/sedona/sql/geotiffMetadataTest.scala | New end-to-end tests for the geotiff.metadata data source. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataDataSource.scala | DataSourceV2 entrypoint (shortName = geotiff.metadata) and path handling (dir/glob). |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataTable.scala | Defines table capabilities and the output schema for metadata rows. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataScanBuilder.scala | Implements batch scan planning with LIMIT pushdown and column pruning. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReaderFactory.scala | Builds partition readers for file partitions. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReader.scala | Core per-file metadata extraction logic (GeoTools/IIO metadata + optional coverage read). |
| spark/common/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister | Registers GeoTiffMetadataDataSource for Spark format(...) lookup. |
| mkdocs.yml | Adds the new tutorial page to the docs navigation. |
| docs/tutorial/files/geotiffmetadata-sedona-spark.md | New user documentation (usage, schema reference, examples). |
| docs/image/geotiff_metadata/schema_overview.svg | Diagram of the output schema. |
| docs/image/geotiff_metadata/cog_structure.svg | Diagram explaining COG detection via schema fields. |
…ests
- isDirectory now builds Hadoop conf via newHadoopConfWithOptions so per-read options (e.g. fs.s3a.*) flow through to directory detection
- buildMetadataMap returns empty MapData instead of null for empty metadata, making it consistent with overviews/bands collection columns
- Rewrite metadata extraction to walk TIFFField nodes and map each field's `name` attribute to the leaf value (the previous approach looked for name+value on the same node, which never matched the TIFF IIO tree structure, so the metadata column was always empty)
- Add test assertions for compression (test1.tiff=LZW, generated COG=LZW) and metadata (non-empty) columns
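The rewritten metadata walk can be illustrated on a small hand-built DOM. The sample XML below is an assumption modeled on the general shape of the javax.imageio TIFF metadata tree (a `TIFFField` element carrying `name`, with the value on a nested leaf element); the object and helper names are hypothetical, and the PR's actual extraction code may differ.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.{Element, Node}

object TiffFieldWalkSketch {
  // Assumed sample tree: `name` sits on TIFFField, `value` sits on a nested leaf.
  val sample: String =
    """<TIFFIFD>
      |  <TIFFField number="259" name="Compression">
      |    <TIFFShorts><TIFFShort value="5" description="LZW"/></TIFFShorts>
      |  </TIFFField>
      |  <TIFFField number="322" name="TileWidth">
      |    <TIFFShorts><TIFFShort value="256"/></TIFFShorts>
      |  </TIFFField>
      |</TIFFIFD>""".stripMargin

  def parse(xml: String): Node =
    DocumentBuilderFactory.newInstance().newDocumentBuilder()
      .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))

  /** Depth-first search for the first descendant that carries a `value` attribute. */
  private def leafValue(node: Node): Option[String] = node match {
    case e: Element if e.hasAttribute("value") => Some(e.getAttribute("value"))
    case _ =>
      val children = node.getChildNodes
      (0 until children.getLength).iterator
        .map(i => leafValue(children.item(i)))
        .collectFirst { case Some(v) => v }
  }

  /** Map each TIFFField's `name` attribute to the value found on its leaf node. */
  def fields(root: Node): Map[String, String] = {
    val out = Map.newBuilder[String, String]
    def walk(n: Node): Unit = n match {
      case e: Element if e.getTagName == "TIFFField" =>
        leafValue(e).foreach(v => out += (e.getAttribute("name") -> v))
      case _ =>
        val children = n.getChildNodes
        (0 until children.getLength).foreach(i => walk(children.item(i)))
    }
    walk(root)
    out.result()
  }
}
```

Looking for `name` and `value` on the same element would match nothing in this tree, which is the failure mode the commit message describes.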
Pull request overview
Adds a new read-only Spark DataSourceV2 (geotiff.metadata) to extract GeoTIFF file metadata (dimensions/CRS/bands/tiling/overviews/compression/TIFF tags) without decoding pixel data, enabling fast cataloging and COG detection workflows in Sedona Spark.
Changes:
- Introduces `geotiff.metadata` DataSourceV2 implementation (table/scan/reader pipeline) with column pruning and LIMIT pushdown.
- Registers the data source and adds a dedicated test suite validating metadata extraction and COG detection.
- Adds end-user documentation + nav entry and supporting diagrams for schema/COG concepts.
Reviewed changes
Copilot reviewed 9 out of 11 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataDataSource.scala | Data source entry point (shortName = geotiff.metadata), directory/glob handling, read-only config. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataTable.scala | Defines output schema and read-only table capabilities/scan builder wiring. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataScanBuilder.scala | FileScanBuilder + LIMIT pushdown behavior and reader factory creation. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReaderFactory.scala | Creates partition readers for file partitions (with partition values). |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReader.scala | Core metadata extraction logic (GeoTools GeoTiffReader + TIFF IIO metadata parsing), builds InternalRows. |
| spark/common/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister | Registers the new data source for Spark discovery. |
| spark/common/src/test/scala/org/apache/sedona/sql/geotiffMetadataTest.scala | Adds tests for exact metadata, directory/glob loading, LIMIT/column pruning, and COG detection. |
| docs/tutorial/files/geotiffmetadata-sedona-spark.md | New user documentation page with schema reference and examples. |
| mkdocs.yml | Adds documentation navigation entry for GeoTIFF metadata. |
| docs/image/geotiff_metadata/schema_overview.svg | Diagram of geotiff.metadata output schema. |
| docs/image/geotiff_metadata/cog_structure.svg | Diagram explaining COG structure mapping to output fields. |
…r call
Fetch reader.getMetadata().getRootNode() once per file and pass the root to the helper methods, so the DOM tree is walked at most once per file even when multiple TIFF tag-based columns (isTiled, bands, metadata, compression) are selected together.
Pull request overview
Introduces a new read-only Spark DataSourceV2 (geotiff.metadata) to extract GeoTIFF file metadata (dimensions/CRS/bands/tiling/overviews/compression/TIFF tags) without decoding pixel data, enabling cataloging and COG detection workflows similar to gdalinfo.
Changes:
- Added `geotiff.metadata` DataSourceV2 implementation (table/scan/reader) with column pruning and LIMIT pushdown.
- Registered the new data source and added a dedicated Scala test suite validating core fields, COG detection, glob/recursive loading, and pruning.
- Added end-user documentation (tutorial page + MkDocs nav entry) and supporting schema/COG diagrams.
Reviewed changes
Copilot reviewed 9 out of 11 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataDataSource.scala | Data source entrypoint; path handling (dir vs glob), schema inference, read-only configuration. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataTable.scala | Defines table + output schema (structs/arrays/maps) and read-only table behavior. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataScanBuilder.scala | File scan planning plus LIMIT pushdown logic and reader factory wiring. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReaderFactory.scala | Builds per-partition readers and wraps them with partition values. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReader.scala | Core metadata extraction (IIO TIFF tags + optional coverage-derived fields) and row construction. |
| spark/common/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister | Registers GeoTiffMetadataDataSource for Spark discovery. |
| spark/common/src/test/scala/org/apache/sedona/sql/geotiffMetadataTest.scala | Adds test coverage validating metadata extraction, COG detection, and pruning/pushdown behavior. |
| docs/tutorial/files/geotiffmetadata-sedona-spark.md | User documentation for the new data source (usage + schema reference + examples). |
| mkdocs.yml | Adds docs navigation entry for GeoTIFF metadata. |
| docs/image/geotiff_metadata/schema_overview.svg | Diagram of output schema. |
| docs/image/geotiff_metadata/cog_structure.svg | Diagram explaining COG property mapping to schema fields. |
Pull request overview
Adds a new read-only Spark DataSourceV2 (geotiff.metadata) to inventory GeoTIFF file metadata (dimensions/CRS/bands/overviews/tiling/compression) without using the raster pixel decode path, plus documentation and test coverage.
Changes:
- Introduces `geotiff.metadata` DataSourceV2 implementation (table/scan/reader) and registers it via Spark service loader.
- Adds Scala test suite covering exact metadata extraction, directory/glob loading behavior, LIMIT pushdown, and column pruning.
- Adds user documentation and accompanying SVG diagrams, and wires the page into MkDocs navigation.
Reviewed changes
Copilot reviewed 9 out of 11 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataDataSource.scala | DataSourceV2 entrypoint, path/directory handling, and read-only configuration. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataTable.scala | Table definition, schema, scan builder hookup, and read-only write rejection. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataScanBuilder.scala | Scan planning, LIMIT pushdown partition trimming, and reader-factory creation. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReaderFactory.scala | Wires Spark partitions to per-file metadata readers. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReader.scala | Core per-file metadata extraction + column-pruning-aware execution. |
| spark/common/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister | Registers GeoTiffMetadataDataSource for format("geotiff.metadata"). |
| spark/common/src/test/scala/org/apache/sedona/sql/geotiffMetadataTest.scala | End-to-end tests for metadata correctness, COG detection, and pushdowns. |
| docs/tutorial/files/geotiffmetadata-sedona-spark.md | User docs for usage, schema reference, and examples. |
| docs/image/geotiff_metadata/schema_overview.svg | Diagram of the output schema. |
| docs/image/geotiff_metadata/cog_structure.svg | Diagram explaining COG detection fields and overview structure. |
| mkdocs.yml | Adds the new tutorial page to the docs navigation. |
width and height can be obtained from reader.getOriginalGridRange() without calling reader.read(null). Queries that only project width or height no longer force a GridCoverage2D to be built. Introduce READER_ONLY_FIELDS (width, height, overviews) that need the reader but not the raster coverage.
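The tiered approach can be sketched as a pure function from the requested columns to the amount of work needed. The field groupings below combine this commit message with the earlier column-gating commit and are an assumption about the final state; the names `WorkPlanSketch`, `NoFileAccess`, `OpenReader`, and `ReadCoverage` are hypothetical.

```scala
object WorkPlanSketch {
  sealed trait Work
  case object NoFileAccess extends Work // row built from file listing info only
  case object OpenReader extends Work   // reader / TIFF IIO metadata, no coverage
  case object ReadCoverage extends Work // reader.read(null) builds a GridCoverage2D

  // Fields available without opening the file.
  val FileStatusFields: Set[String] = Set("path", "driver", "fileSize")
  // READER_ONLY_FIELDS per the commit message: need the reader, not the coverage.
  val ReaderOnlyFields: Set[String] = Set("width", "height", "overviews")
  // Fields read from TIFF IIO metadata tags.
  val IioMetadataFields: Set[String] = Set("metadata", "isTiled", "compression")
  // Fields that still require the coverage.
  val CoverageFields: Set[String] =
    Set("numBands", "srid", "crs", "geoTransform", "cornerCoordinates", "bands")

  /** Pick the cheapest level of work that can satisfy every requested column. */
  def plan(requested: Set[String]): Work =
    if (requested.exists(CoverageFields)) ReadCoverage
    else if (requested.subsetOf(FileStatusFields)) NoFileAccess
    else OpenReader
}
```

Under these assumptions, a projection of `path, width, height` opens the reader but never builds a coverage, while adding `crs` to the projection forces the coverage read.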
Pull request overview
Adds a new read-only Spark DataSourceV2 (geotiff.metadata) to inventory GeoTIFF file metadata (dimensions, CRS, bands, tiling, overviews, compression, TIFF tags) without decoding pixel data, plus docs and a test suite to validate correctness and pushdowns.
Changes:
- Introduce `geotiff.metadata` DataSourceV2 implementation (table/scan/partition reader) with column pruning and LIMIT pushdown.
- Register the data source via Spark service loader.
- Add end-to-end tests plus user documentation and schema/COG visuals.
Reviewed changes
Copilot reviewed 9 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataDataSource.scala | Table provider + path handling (directory recursion, glob rewrite), read-only V2 source |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataTable.scala | Defines fixed output schema and table scan wiring |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataScanBuilder.scala | FileScanBuilder + LIMIT pushdown implementation |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReaderFactory.scala | Creates partition readers with broadcast Hadoop conf |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/geotiffmetadata/GeoTiffMetadataPartitionReader.scala | Core metadata extraction (GeoTools reader + TIFF IIO DOM helpers), schema-aware work avoidance |
| spark/common/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister | Registers GeoTiffMetadataDataSource for format("geotiff.metadata") |
| spark/common/src/test/scala/org/apache/sedona/sql/geotiffMetadataTest.scala | Validates metadata extraction, recursion/glob, LIMIT pushdown, and column pruning |
| docs/tutorial/files/geotiffmetadata-sedona-spark.md | User-facing docs, schema reference, and examples (incl. COG detection) |
| docs/image/geotiff_metadata/schema_overview.svg | Visual overview of output schema |
| docs/image/geotiff_metadata/cog_structure.svg | Visual explanation of COG properties mapped to schema fields |
| mkdocs.yml | Adds docs nav entry for GeoTIFF metadata |
Did you read the Contributor Guide?
Is this PR related to a ticket?
[GH-XXX] my subject. Closes #2824 (Add a Sedona Spark data source similar to gdalinfo)
What changes were proposed in this PR?
Add a new read-only Spark DataSourceV2 (`geotiff.metadata`) that reads GeoTIFF file metadata without decoding pixel data, similar to gdalinfo.
Usage
`spark.read.format("geotiff.metadata").load("/path/to/*.tif")`
Output schema
Returns one row per file with:
- `path`
- `driver` (`"GTiff"`)
- `fileSize`
- `width`, `height`
- `numBands`
- `srid`, `crs`
- `geoTransform`
- `cornerCoordinates`
- `bands`
- `overviews`
- `metadata`
- `isTiled`
- `compression`
Key implementation details
- `isTiled`: reads TIFF TileWidth tag (322) from IIO metadata, not RenderedImage tile size (which reports strips as tiles)
- `colorInterpretation`: derived from TIFF Photometric tag (262): Gray, Red, Green, Blue, Alpha, Palette
- `compression`: reads TIFF tag 259 description attribute for human-readable names (LZW, Deflate)
- `overviews`: uses `DatasetLayout.getNumInternalOverviews()` for real overview count, not synthetic tile-based levels
- … `readDataSchema`
- … `reader.read()` to avoid stream state issues
- `newWriteBuilder` throws `UnsupportedOperationException`
Files
- `spark/common/.../io/geotiffmetadata/`: 5 Scala files (`GeoTiffMetadataDataSource`, `GeoTiffMetadataTable`, `GeoTiffMetadataScanBuilder`, `GeoTiffMetadataPartitionReaderFactory`, `GeoTiffMetadataPartitionReader`)
- `META-INF/services/org.apache.spark.sql.sources.DataSourceRegister`
- `docs/tutorial/files/geotiffmetadata-sedona-spark.md` with schema reference, examples, COG detection
- `docs/image/geotiff_metadata/schema_overview.svg` and `cog_structure.svg`
How was this patch tested?
11 tests in `geotiffMetadataTest.scala` with exact-match assertions:
- … `format("raster")` + `RS_Width`/`RS_Height`/`RS_NumBands`/`RS_SRID`
- … `RS_AsCOG`, verifies `isTiled=true`, 2 overviews, `blockSize=256x256`
- … `.tiff` files) and recursive directory loading (9 files total)
Did this PR include necessary documentation updates?