Commit 6375cab

[GH-2824] Add geotiffinfo data source for GeoTIFF file metadata
Add a new Spark DataSourceV2 that reads GeoTIFF file metadata without decoding pixel data, similar to gdalinfo.

Usage: spark.read.format("geotiffinfo").load("/path/to/*.tif")

Returns one row per file with: path, driver, fileSize, width, height, numBands, srid, crs, geoTransform (struct), cornerCoordinates (struct), bands (array with dataType, noData, blockSize, colorInterpretation), overviews (struct with level/width/height), metadata (map), isTiled, and compression.

Closes #2824
1 parent 5927588 commit 6375cab

9 files changed

Lines changed: 1206 additions & 0 deletions

Lines changed: 184 additions & 0 deletions
@@ -0,0 +1,184 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# GeoTiffInfo - GeoTIFF File Metadata

GeoTiffInfo is a Spark data source that reads GeoTIFF file metadata without decoding pixel data, similar to [gdalinfo](https://gdal.org/en/stable/programs/gdalinfo.html). It returns one row per file with metadata including dimensions, coordinate system, band information, tiling, overviews, and compression.

This is useful for:

* Cataloging and inventorying large collections of raster files
* Detecting Cloud Optimized GeoTIFFs (COGs) by checking tiling and overview status
* Inspecting file properties before loading full raster data
* Building spatial indexes over raster file collections

## Read GeoTIFF metadata

=== "Scala"

    ```scala
    val df = sedona.read.format("geotiffinfo").load("/path/to/rasters/")
    df.show()
    ```

=== "Java"

    ```java
    Dataset<Row> df = sedona.read().format("geotiffinfo").load("/path/to/rasters/");
    df.show();
    ```

=== "Python"

    ```python
    df = sedona.read.format("geotiffinfo").load("/path/to/rasters/")
    df.show()
    ```

You can also use glob patterns:

```python
df = sedona.read.format("geotiffinfo").load("/path/to/rasters/*.tif")
```

Or load a single file:

```python
df = sedona.read.format("geotiffinfo").load("/path/to/image.tiff")
```

## Output schema

Each row represents one GeoTIFF file with the following columns:

| Column | Type | Description |
|--------|------|-------------|
| `path` | String | File path |
| `driver` | String | Format driver (`"GTiff"`) |
| `fileSize` | Long | File size in bytes |
| `width` | Int | Image width in pixels |
| `height` | Int | Image height in pixels |
| `numBands` | Int | Number of bands |
| `srid` | Int | EPSG code (0 if unknown) |
| `crs` | String | Coordinate Reference System as WKT |
| `geoTransform` | Struct | Affine transform parameters |
| `cornerCoordinates` | Struct | Bounding box |
| `bands` | Array[Struct] | Per-band metadata |
| `overviews` | Array[Struct] | Overview (pyramid) levels |
| `metadata` | Map[String, String] | File-wide TIFF metadata tags |
| `isTiled` | Boolean | Whether the file uses internal tiling |
| `compression` | String | Compression type (e.g., `"LZW"`, `"Deflate"`) |

### geoTransform struct

| Field | Type | Description |
|-------|------|-------------|
| `upperLeftX` | Double | Origin X in world coordinates |
| `upperLeftY` | Double | Origin Y in world coordinates |
| `scaleX` | Double | Pixel size in X direction |
| `scaleY` | Double | Pixel size in Y direction |
| `skewX` | Double | Rotation/shear in X |
| `skewY` | Double | Rotation/shear in Y |

### cornerCoordinates struct

| Field | Type | Description |
|-------|------|-------------|
| `minX` | Double | Minimum X (west) |
| `minY` | Double | Minimum Y (south) |
| `maxX` | Double | Maximum X (east) |
| `maxY` | Double | Maximum Y (north) |

108+
### bands array element
109+
110+
| Field | Type | Description |
111+
|-------|------|-------------|
112+
| `band` | Int | Band number (1-indexed) |
113+
| `dataType` | String | Data type (e.g., `"REAL_32BITS"`) |
114+
| `colorInterpretation` | String | Color interpretation (e.g., `"Gray"`, `"Red"`) |
115+
| `noDataValue` | Double | NoData value (null if not set) |
116+
| `blockWidth` | Int | Internal tile/block width |
117+
| `blockHeight` | Int | Internal tile/block height |
118+
| `description` | String | Band description |
119+
| `unit` | String | Unit type (e.g., `"meters"`) |
120+
121+
### overviews array element
122+
123+
| Field | Type | Description |
124+
|-------|------|-------------|
125+
| `level` | Int | Overview level (1, 2, 3, ...) |
126+
| `width` | Int | Overview width in pixels |
127+
| `height` | Int | Overview height in pixels |
128+
## Examples

### Detect Cloud Optimized GeoTIFFs (COGs)

A COG is a GeoTIFF that is internally tiled and has overview levels:

```python
df = sedona.read.format("geotiffinfo").load("/path/to/rasters/")
cogs = df.filter("isTiled AND size(overviews) > 0")
cogs.select("path", "compression", "overviews").show(truncate=False)
```

### Inspect band information

```python
df = sedona.read.format("geotiffinfo").load("/path/to/image.tif")
df.selectExpr("path", "explode(bands) as band").selectExpr(
    "path",
    "band.band",
    "band.dataType",
    "band.noDataValue",
    "band.blockWidth",
    "band.blockHeight",
).show()
```

### Filter by spatial extent

```python
df = sedona.read.format("geotiffinfo").load("/path/to/rasters/")
df.filter("cornerCoordinates.minX > -120 AND cornerCoordinates.maxX < -100").select(
    "path", "width", "height", "srid"
).show()
```

164+
### Get overview details
165+
166+
```python
167+
df = sedona.read.format("geotiffinfo").load("/path/to/image.tif")
168+
df.selectExpr("path", "explode(overviews) as ovr").selectExpr(
169+
"path", "ovr.level", "ovr.width", "ovr.height"
170+
).show()
171+
```
172+
173+
### Select specific columns
174+
175+
Select only the columns you need:
176+
177+
```python
178+
df = (
179+
sedona.read.format("geotiffinfo")
180+
.load("/path/to/rasters/")
181+
.select("path", "width", "height", "numBands")
182+
)
183+
df.show()
184+
```

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -56,6 +56,7 @@ nav:
 - GeoParquet: tutorial/files/geoparquet-sedona-spark.md
 - GeoJSON: tutorial/files/geojson-sedona-spark.md
 - Shapefiles: tutorial/files/shapefiles-sedona-spark.md
+- GeoTIFF metadata: tutorial/files/geotiffinfo-sedona-spark.md
 - STAC catalog: tutorial/files/stac-sedona-spark.md
 - Concepts:
 - Spatial Joins: tutorial/concepts/spatial-joins.md

spark/common/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister

Lines changed: 1 addition & 0 deletions
@@ -4,3 +4,4 @@ org.apache.sedona.sql.datasources.spider.SpiderDataSource
 org.apache.spark.sql.sedona_sql.io.stac.StacDataSource
 org.apache.sedona.sql.datasources.osm.OsmPbfFormat
 org.apache.spark.sql.execution.datasources.geoparquet.GeoParquetFileFormat
+org.apache.spark.sql.sedona_sql.io.geotiffinfo.GeoTiffInfoDataSource
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
package org.apache.spark.sql.sedona_sql.io.geotiffinfo

import org.apache.spark.sql.connector.catalog.Table
import org.apache.spark.sql.connector.catalog.TableProvider
import org.apache.spark.sql.execution.datasources.FileFormat
import org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2
import org.apache.spark.sql.sedona_sql.io.raster.RasterFileFormat
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

import scala.collection.JavaConverters._

/**
 * A read-only Spark SQL data source that extracts GeoTIFF file metadata (dimensions, CRS, bands,
 * overviews, compression, etc.) without loading raster pixel data into memory.
 */
class GeoTiffInfoDataSource extends FileDataSourceV2 with TableProvider with DataSourceRegister {

  override def shortName(): String = "geotiffinfo"

  private val loadTifPattern = "(.*)/([^/]*\\*[^/]*\\.(?i:tif|tiff))$".r

  private def createTable(
      options: CaseInsensitiveStringMap,
      userSchema: Option[StructType] = None): Table = {
    var paths = getPaths(options)
    var optionsWithoutPaths = getOptionsWithoutPaths(options)
    val tableName = getTableName(options, paths)

    if (paths.size == 1) {
      if (paths.head.endsWith("/")) {
        // Trailing-slash directories: recurse and filter to GeoTIFF files
        val newOptions =
          new java.util.HashMap[String, String](optionsWithoutPaths.asCaseSensitiveMap())
        newOptions.put("recursiveFileLookup", "true")
        if (!newOptions.containsKey("pathGlobFilter")) {
          newOptions.put("pathGlobFilter", "*.{tif,tiff,TIF,TIFF}")
        }
        optionsWithoutPaths = new CaseInsensitiveStringMap(newOptions)
      } else {
        // Rewrite glob patterns like /path/to/some*glob*.tif into /path/to with
        // pathGlobFilter="some*glob*.tif" to avoid listing .tif files as directories
        paths.head match {
          case loadTifPattern(prefix, glob) =>
            paths = Seq(prefix)
            val newOptions =
              new java.util.HashMap[String, String](optionsWithoutPaths.asCaseSensitiveMap())
            newOptions.put("pathGlobFilter", glob)
            optionsWithoutPaths = new CaseInsensitiveStringMap(newOptions)
          case _ =>
        }
      }
    }

    new GeoTiffInfoTable(
      tableName,
      sparkSession,
      optionsWithoutPaths,
      paths,
      userSchema,
      fallbackFileFormat)
  }

  override def getTable(options: CaseInsensitiveStringMap): Table = {
    createTable(options)
  }

  override def getTable(options: CaseInsensitiveStringMap, schema: StructType): Table = {
    createTable(options, Some(schema))
  }

  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    GeoTiffInfoTable.SCHEMA

  override def fallbackFileFormat: Class[_ <: FileFormat] = classOf[RasterFileFormat]
}
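
The path handling in `createTable` can be restated outside Spark to make its two single-path branches easier to follow. The plain-Python sketch below is an illustration, not the data source's code: `rewrite_path` is a hypothetical name and a plain dict stands in for `CaseInsensitiveStringMap`, so it compares option keys case-sensitively. It assumes exactly one input path, as the Scala code requires before rewriting. Scala's extractor match on `loadTifPattern` requires the whole string to match, which `re.fullmatch` reproduces.

```python
import re
from typing import Dict, Tuple

# Transcription of loadTifPattern: the last path segment must contain "*"
# and end in .tif or .tiff (extension matched case-insensitively).
LOAD_TIF_PATTERN = re.compile(r"(.*)/([^/]*\*[^/]*\.(?i:tif|tiff))$")


def rewrite_path(path: str, options: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Return the (path, options) pair that would be handed to the table."""
    new_options = dict(options)  # copy so the caller's dict is not mutated

    if path.endswith("/"):
        # Directory branch: recurse, and filter to GeoTIFF extensions unless
        # the caller already supplied a pathGlobFilter.
        new_options["recursiveFileLookup"] = "true"
        new_options.setdefault("pathGlobFilter", "*.{tif,tiff,TIF,TIFF}")
        return path, new_options

    match = LOAD_TIF_PATTERN.fullmatch(path)
    if match:
        # Glob branch: list the parent directory and move the glob into
        # pathGlobFilter, overwriting any existing filter.
        prefix, glob = match.groups()
        new_options["pathGlobFilter"] = glob
        return prefix, new_options

    # Anything else (for example a single file) passes through unchanged.
    return path, new_options


print(rewrite_path("/data/rasters/", {}))
print(rewrite_path("/data/rasters/scene*.TIF", {}))
print(rewrite_path("/data/image.tif", {}))
```

Note the asymmetry that the sketch preserves from the Scala code: the directory branch respects a user-supplied `pathGlobFilter`, while the glob branch replaces it.
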
