107 changes: 107 additions & 0 deletions docs/image/geotiff_metadata/cog_structure.svg
99 changes: 99 additions & 0 deletions docs/image/geotiff_metadata/schema_overview.svg
188 changes: 188 additions & 0 deletions docs/tutorial/files/geotiffmetadata-sedona-spark.md
@@ -0,0 +1,188 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# GeoTiffMetadata - GeoTIFF File Metadata

GeoTiffMetadata is a Spark data source that reads GeoTIFF file metadata without decoding pixel data, similar to [gdalinfo](https://gdal.org/en/stable/programs/gdalinfo.html). It returns one row per file with metadata including dimensions, coordinate system, band information, tiling, overviews, and compression.

This is useful for:

* Cataloging and inventorying large collections of raster files
* Detecting Cloud Optimized GeoTIFFs (COGs) by checking tiling and overview status
* Inspecting file properties before loading full raster data
* Building spatial indexes over raster file collections

![Schema Overview](../../image/geotiff_metadata/schema_overview.svg "geotiff.metadata output schema")

## COG detection

Cloud Optimized GeoTIFFs (COGs) are GeoTIFF files with internal tiling and overviews optimized for cloud access. The `geotiff.metadata` data source reports these properties directly:

![COG Structure](../../image/geotiff_metadata/cog_structure.svg "How COG properties map to geotiff.metadata fields")

```python
df = sedona.read.format("geotiff.metadata").load("/path/to/rasters/")
cogs = df.filter("isTiled AND size(overviews) > 0")
cogs.select("path", "compression", "overviews").show(truncate=False)
```
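The same predicate can be sketched in plain Python for rows that have already been collected to the driver (for example, converted with `Row.asDict(recursive=True)`). The helper name `looks_like_cog` and the sample rows are hypothetical. The check is a heuristic: a file that passes is likely, though not guaranteed, to be a valid COG, since strict validation also considers the internal layout of the file.

```python
from typing import Any, Mapping


def looks_like_cog(row: Mapping[str, Any]) -> bool:
    """Heuristic: internally tiled and at least one overview level."""
    return bool(row.get("isTiled")) and len(row.get("overviews") or []) > 0


# Hypothetical rows shaped like geotiff.metadata output (not real files).
rows = [
    {"path": "a.tif", "isTiled": True,
     "overviews": [{"level": 1, "width": 512, "height": 512}]},
    {"path": "b.tif", "isTiled": True, "overviews": []},
    {"path": "c.tif", "isTiled": False,
     "overviews": [{"level": 1, "width": 256, "height": 256}]},
]

# Keep only the paths that are tiled AND have overviews.
cog_paths = [r["path"] for r in rows if looks_like_cog(r)]
print(cog_paths)
```

Only `a.tif` satisfies both conditions here; `b.tif` has no overviews and `c.tif` is not tiled.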

## Read GeoTIFF metadata

=== "Scala"

    ```scala
    val df = sedona.read.format("geotiff.metadata").load("/path/to/rasters/")
    df.show()
    ```

=== "Java"

    ```java
    Dataset<Row> df = sedona.read().format("geotiff.metadata").load("/path/to/rasters/");
    df.show();
    ```

=== "Python"

    ```python
    df = sedona.read.format("geotiff.metadata").load("/path/to/rasters/")
    df.show()
    ```

You can also use glob patterns:

```python
df = sedona.read.format("geotiff.metadata").load("/path/to/rasters/*.tif")
```

Or load a single file:

```python
df = sedona.read.format("geotiff.metadata").load("/path/to/image.tiff")
```

## Output schema

Each row represents one GeoTIFF file with the following columns:

| Column | Type | Description |
|--------|------|-------------|
| `path` | String | File path |
| `driver` | String | Format driver (`"GTiff"`) |
| `fileSize` | Long | File size in bytes |
| `width` | Int | Image width in pixels |
| `height` | Int | Image height in pixels |
| `numBands` | Int | Number of bands |
| `srid` | Int | EPSG code (0 if unknown) |
| `crs` | String | Coordinate Reference System as WKT |
| `geoTransform` | Struct | Affine transform parameters |
| `cornerCoordinates` | Struct | Bounding box |
| `bands` | Array[Struct] | Per-band metadata |
| `overviews` | Array[Struct] | Overview (pyramid) levels |
| `metadata` | Map[String, String] | File-wide TIFF metadata tags |
| `isTiled` | Boolean | Whether the file uses internal tiling |
| `compression` | String | Compression type (e.g., `"LZW"`, `"Deflate"`) |

### geoTransform struct

| Field | Type | Description |
|-------|------|-------------|
| `upperLeftX` | Double | Origin X in world coordinates |
| `upperLeftY` | Double | Origin Y in world coordinates |
| `scaleX` | Double | Pixel size in X direction |
| `scaleY` | Double | Pixel size in Y direction |
| `skewX` | Double | Rotation/shear in X |
| `skewY` | Double | Rotation/shear in Y |
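
As a rough illustration of how these six parameters are typically used, the sketch below maps a pixel position to world coordinates. It assumes the GDAL-style affine convention, where `skewX` multiplies the row and `skewY` multiplies the column; whether the data source assigns the skew terms exactly this way is an assumption, and the helper `pixel_to_world` and the sample values are hypothetical.

```python
from typing import Mapping, Tuple


def pixel_to_world(
    gt: Mapping[str, float], col: float, row: float
) -> Tuple[float, float]:
    """Map a pixel (col, row) to world (x, y) with a geoTransform-like mapping.

    Assumes the GDAL-style affine convention (an assumption, see text).
    """
    x = gt["upperLeftX"] + col * gt["scaleX"] + row * gt["skewX"]
    y = gt["upperLeftY"] + col * gt["skewY"] + row * gt["scaleY"]
    return x, y


# Hypothetical north-up raster: 0.5-unit pixels, origin at (100.0, 50.0).
# scaleY is negative because rows increase downward while Y increases upward.
gt = {
    "upperLeftX": 100.0, "upperLeftY": 50.0,
    "scaleX": 0.5, "scaleY": -0.5,
    "skewX": 0.0, "skewY": 0.0,
}

origin = pixel_to_world(gt, 0, 0)         # upper-left corner of the image
lower_right = pixel_to_world(gt, 20, 10)  # corner of a 20 x 10 pixel image
print(origin, lower_right)
```

For a north-up image (both skew terms zero), the upper-left and lower-right corners computed this way should correspond to the extent described by `cornerCoordinates`.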

### cornerCoordinates struct

| Field | Type | Description |
|-------|------|-------------|
| `minX` | Double | Minimum X (west) |
| `minY` | Double | Minimum Y (south) |
| `maxX` | Double | Maximum X (east) |
| `maxY` | Double | Maximum Y (north) |

### bands array element

| Field | Type | Description |
|-------|------|-------------|
| `band` | Int | Band number (1-indexed) |
| `dataType` | String | Data type (e.g., `"REAL_32BITS"`) |
| `colorInterpretation` | String | Color interpretation (e.g., `"Gray"`, `"Red"`) |
| `noDataValue` | Double | NoData value (null if not set) |
| `blockWidth` | Int | Internal tile/block width |
| `blockHeight` | Int | Internal tile/block height |
| `description` | String | Band description |
| `unit` | String | Unit type (e.g., `"meters"`) |

### overviews array element

| Field | Type | Description |
|-------|------|-------------|
| `level` | Int | Overview level (1, 2, 3, ...) |
| `width` | Int | Overview width in pixels |
| `height` | Int | Overview height in pixels |
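
To relate overview levels to the full-resolution image, one option is to compare each overview width with the top-level `width` column. The sketch below does this in plain Python on a hypothetical row; the helper name `overview_factors` and the sample dimensions are illustrative, and power-of-two reduction is only a common choice, not something the data source is known to guarantee.

```python
from typing import Any, List, Mapping, Sequence, Tuple


def overview_factors(
    full_width: int, overviews: Sequence[Mapping[str, Any]]
) -> List[Tuple[int, int]]:
    """Return (level, approximate reduction factor) for each overview."""
    return [(o["level"], round(full_width / o["width"])) for o in overviews]


# Hypothetical row shaped like geotiff.metadata output.
row = {
    "width": 4096,
    "height": 4096,
    "overviews": [
        {"level": 1, "width": 2048, "height": 2048},
        {"level": 2, "width": 1024, "height": 1024},
        {"level": 3, "width": 512, "height": 512},
    ],
}

factors = overview_factors(row["width"], row["overviews"])
print(factors)
```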

## Examples

### Inspect band information

```python
df = sedona.read.format("geotiff.metadata").load("/path/to/image.tif")
df.selectExpr("path", "explode(bands) as band").selectExpr(
    "path",
    "band.band",
    "band.dataType",
    "band.noDataValue",
    "band.blockWidth",
    "band.blockHeight",
).show()
```

### Filter by spatial extent

```python
df = sedona.read.format("geotiff.metadata").load("/path/to/rasters/")
df.filter("cornerCoordinates.minX > -120 AND cornerCoordinates.maxX < -100").select(
    "path", "width", "height", "srid"
).show()
```
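
The filter above keeps files whose X extent lies entirely between -120 and -100, which presumably only makes sense when the file's CRS uses degrees of longitude (for example EPSG:4326); checking `srid` first is a reasonable precaution. To find files that merely overlap a query window, an intersection test is needed instead. The sketch below shows that logic in plain Python on a `cornerCoordinates`-shaped dict; the helper `bbox_intersects` and the sample values are hypothetical.

```python
from typing import Mapping


def bbox_intersects(
    cc: Mapping[str, float],
    min_x: float, min_y: float, max_x: float, max_y: float,
) -> bool:
    """True if the file extent `cc` overlaps (or touches) the query window."""
    return (
        cc["minX"] <= max_x
        and cc["maxX"] >= min_x
        and cc["minY"] <= max_y
        and cc["maxY"] >= min_y
    )


# Hypothetical file extent in degrees.
extent = {"minX": -125.0, "minY": 30.0, "maxX": -115.0, "maxY": 40.0}

overlaps = bbox_intersects(extent, -120.0, 35.0, -100.0, 45.0)
disjoint = bbox_intersects(extent, -100.0, 35.0, -90.0, 45.0)
print(overlaps, disjoint)
```

The first window overlaps the extent even though the extent is not contained in it, so a containment filter would have dropped this file.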

### Get overview details

```python
df = sedona.read.format("geotiff.metadata").load("/path/to/image.tif")
df.selectExpr("path", "explode(overviews) as ovr").selectExpr(
    "path", "ovr.level", "ovr.width", "ovr.height"
).show()
```

### Select specific columns

Select only the columns you need:

```python
df = (
    sedona.read.format("geotiff.metadata")
    .load("/path/to/rasters/")
    .select("path", "width", "height", "numBands")
)
df.show()
```