Skip to content

Commit 443ed78

Browse files
authored
Add dtype, compression_level, and VRT output to geotiff I/O (#1083) (#1085)
* Add design spec for geotiff performance and memory controls Covers dtype on read, compression_level on write, and VRT tiled output from to_geotiff for streaming dask writes. * Add implementation plan for geotiff performance controls (#1083) * Add compression_level parameter to to_geotiff and write Thread a compression_level: int | None = None parameter from the public to_geotiff() API down to the compress() call in the writer. Validation rejects out-of-range levels for deflate (1-9), zstd (1-22), and lz4 (0-16); codecs without a level concept (lzw, packbits, jpeg) accept any value and ignore it. * Add compression_level to write_geotiff_gpu and forward from to_geotiff write_geotiff_gpu now accepts compression_level: int | None = None so callers that pass the parameter through to_geotiff on the GPU path no longer get an unexpected keyword argument error. The value is accepted silently since nvCOMP does not currently expose level control. * Clean up compression_level plumbing: keyword args, docstring, dead import - Use keyword form compression_level=compression_level at the two _write_tiled call sites in write() for clarity and refactor safety. - Update compress() docstring to cover all codecs with level support (deflate, zstd, lz4) instead of only mentioning deflate. - Remove unused tempfile import from test_compression_level.py. * Add dtype parameter to open_geotiff for on-read dtype casting Adds `dtype=None` to `open_geotiff`, `read_geotiff_dask`, `read_geotiff_gpu`, and `read_vrt`. When specified, each path casts the array to the target dtype after nodata masking. Float-to-int casts raise ValueError to prevent accidental fractional data loss. Resolves part of #1083. * Fix dtype cast skipped on GPU non-tiled fallback early return The stripped-file fallback in read_geotiff_gpu returned before reaching the dtype cast block. Add the cast before the early return so the dtype parameter is honored for non-tiled GeoTIFFs on the GPU path. * Fix dask dtype validation against post-nodata-masking dtype The dask path validated the dtype cast against the raw file dtype (e.g. uint16), but nodata masking promotes integer arrays to float64 inside the delayed worker. Requesting dtype='int32' on a uint16 file with nodata would pass validation then silently cast float64+NaN to int32, producing garbage. Now compute the effective post-masking dtype and validate against that. Also remove the dead `dtype` positional parameter from _delayed_read_window and add tests for the nodata+dtype interaction on both eager and dask paths. * Add VRT tiled output to to_geotiff (#1083) When to_geotiff receives a path ending in .vrt, write a directory of tiled GeoTIFFs with a VRT index file instead of a monolithic TIFF. This lets dask arrays stream to disk without materializing the full array in RAM. - Dask inputs get one tile per chunk, written in parallel via dask.delayed - Numpy inputs are sliced by tile_size - cog=True and overview_levels raise ValueError with VRT output - Non-empty tiles directory raises FileExistsError - 11 new tests cover numpy/dask round-trips, tile naming, relative paths, compression forwarding, and edge cases * Fix OOM risk, 3D guard, and unused param in _write_vrt_tiled - Use scheduler='synchronous' in dask.compute so tiles write one at a time, keeping peak memory to a single chunk instead of N_cores chunks - Reject 3D arrays in both the dask and numpy VRT paths with a clear error message, since raw.chunks[0] would read the band dimension instead of rows for (bands, y, x) shaped arrays - Remove unused gpu parameter from _write_vrt_tiled signature and call site; CuPy detection already happens inside _write_single_tile * Add dtype, compression_level, and VRT output examples to README GeoTIFF I/O section * Add user guide notebook for GeoTIFF performance controls Demonstrates dtype on read, compression_level on write, and VRT tiled output using self-contained tempdir examples. * Fix VRT relative paths on Windows: use forward slashes in XML (#1083) os.path.relpath produces backslashes on Windows, but VRT XML expects forward slashes for cross-platform portability.
1 parent 7d25252 commit 443ed78

File tree

11 files changed

+1917
-30
lines changed

11 files changed

+1917
-30
lines changed

README.md

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -165,6 +165,12 @@ to_geotiff(data, 'out.tif', gpu=True) # force GPU compress
165165
to_geotiff(data, 'ortho.tif', compression='jpeg') # JPEG for orthophotos
166166
write_vrt('mosaic.vrt', ['tile1.tif', 'tile2.tif']) # generate VRT
167167

168+
open_geotiff('dem.tif', dtype='float32') # half memory
169+
open_geotiff('dem.tif', dtype='float32', chunks=512) # Dask + half memory
170+
to_geotiff(data, 'out.tif', compression_level=1) # fast scratch write
171+
to_geotiff(data, 'out.tif', compression_level=22) # max compression
172+
to_geotiff(dask_da, 'mosaic.vrt') # stream Dask to VRT
173+
168174
# Accessor methods
169175
da.xrs.to_geotiff('out.tif', compression='lzw') # write from DataArray
170176
ds.xrs.open_geotiff('large_dem.tif') # read windowed to Dataset extent

docs/superpowers/plans/2026-03-30-geotiff-perf-controls.md

Lines changed: 813 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 147 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,147 @@
1+
# GeoTIFF Performance and Memory Controls
2+
3+
Adds three parameters to `open_geotiff` and `to_geotiff` that let callers
4+
control memory usage, compression speed, and large-raster write strategy.
5+
All three are opt-in; default behaviour is unchanged.
6+
7+
## 1. `dtype` parameter on `open_geotiff`
8+
9+
### API
10+
11+
```python
12+
open_geotiff(source, *, dtype=None, ...)
13+
```
14+
15+
`dtype` accepts any numpy dtype string or object (`np.float32`, `'float32'`,
16+
etc.). `None` preserves the file's native dtype (current behaviour).
17+
18+
### Read paths
19+
20+
| Path | Behaviour |
21+
|------|-----------|
22+
| Eager (numpy) | Output array allocated at target dtype. Each decoded tile/strip cast before copy-in. Peak overhead: one tile at native dtype. |
23+
| Dask | Each delayed chunk function casts after decode. Output chunks are target dtype. Same per-tile overhead. |
24+
| GPU (CuPy) | Cast on device after decode. |
25+
| Dask + CuPy | Combination of dask and GPU paths. |
26+
27+
### Numba LZW fast path
28+
29+
The LZW decoder is a numba JIT function that emits values one at a time into a
30+
byte buffer. A variant will decode each value and cast inline to the target
31+
dtype so the per-tile buffer is never allocated at native dtype. Other codecs
32+
(deflate, zstd) return byte buffers from C libraries where per-value
33+
interception isn't possible, so those fall back to the tile-level cast.
34+
35+
### Validation
36+
37+
- Narrowing float casts (float64 to float32): allowed.
38+
- Narrowing int casts (int64 to int16): allowed (user asked for it explicitly).
39+
- Widening casts (float32 to float64, uint8 to int32): allowed.
40+
- Float to int: `ValueError` (lossy in a way users often don't intend).
41+
- Unsupported casts (e.g. complex128 to uint8): `ValueError`.
42+
43+
## 2. `compression_level` parameter on `to_geotiff`
44+
45+
### API
46+
47+
```python
48+
to_geotiff(data, path, *, compression='zstd', compression_level=None, ...)
49+
```
50+
51+
`compression_level` is `int | None`. `None` uses the codec's existing default.
52+
53+
### Ranges
54+
55+
| Codec | Range | Default | Direction |
56+
|-------|-------|---------|-----------|
57+
| deflate | 1 -- 9 | 6 | 1 = fastest, 9 = smallest |
58+
| zstd | 1 -- 22 | 3 | 1 = fastest, 22 = smallest |
59+
| lz4 | 0 -- 16 | 0 | 0 = fastest |
60+
| lzw | n/a | n/a | No level support; ignored silently |
61+
| jpeg | n/a | n/a | Quality is a separate axis; ignored |
62+
| packbits | n/a | n/a | Ignored |
63+
| none | n/a | n/a | Ignored |
64+
65+
### Plumbing
66+
67+
`to_geotiff` passes `compression_level` to `write()`, which passes it to
68+
`compress()`. The internal `compress()` already accepts a `level` argument; we
69+
just thread it through the two intermediate call sites that currently hardcode
70+
it.
71+
72+
### Validation
73+
74+
- Out-of-range level for a codec that supports levels: `ValueError`.
75+
- Level set for a codec without level support: silently ignored.
76+
77+
### GPU path
78+
79+
`write_geotiff_gpu` also accepts and forwards the level to nvCOMP batch
80+
compression, which supports levels for zstd and deflate.
81+
82+
## 3. VRT output from `to_geotiff` via `.vrt` extension
83+
84+
### Trigger
85+
86+
When `path` ends in `.vrt`, `to_geotiff` writes a tiled VRT instead of a
87+
monolithic TIFF. No new parameter needed.
88+
89+
### Output layout
90+
91+
```
92+
output.vrt
93+
output_tiles/
94+
tile_0000_0000.tif # row_col, zero-padded
95+
tile_0000_0001.tif
96+
...
97+
```
98+
99+
Directory name derived from the VRT stem (`foo.vrt` -> `foo_tiles/`).
100+
Zero-padding width scales to the grid dimensions.
101+
102+
### Behaviour per input type
103+
104+
| Input | Tiling strategy | Memory profile |
105+
|-------|----------------|----------------|
106+
| Dask DataArray | One tile per dask chunk. Each task computes its chunk and writes one `.tif`. | One chunk in RAM at a time (scheduler controlled). |
107+
| Dask + CuPy | Same, GPU compress per tile. | One chunk in GPU memory at a time. |
108+
| Numpy / ndarray | Slice into `tile_size`-sized pieces, write each. | Source array already in RAM; tile slices are views (no duplication). |
109+
| CuPy | Same as numpy but GPU compress. | Source on GPU; tiles are views. |
110+
111+
### Per-tile properties
112+
113+
- Same `compression`, `compression_level`, `predictor`, `nodata`, `crs` as the
114+
parent call.
115+
- `tiled=True` with the caller's `tile_size` (internal TIFF tiling within each
116+
chunk-file).
117+
- GeoTransform adjusted to each tile's spatial position (row/col offset from
118+
the full raster origin).
119+
- No COG overviews on individual tiles.
120+
121+
### VRT generation
122+
123+
After all tiles are written, call `write_vrt()` with relative paths. The VRT
124+
XML references each tile by its spatial extent and band mapping.
125+
126+
### Edge cases and validation
127+
128+
- `cog=True` with a `.vrt` path: `ValueError` (mutually exclusive).
129+
- Tiles directory exists and is non-empty: `FileExistsError` to prevent silent
130+
overwrites.
131+
- Tiles directory doesn't exist: created automatically.
132+
- `overview_levels` with `.vrt` path: `ValueError` (overviews don't apply).
133+
134+
### Dask scheduling
135+
136+
For dask inputs, all delayed tile-write tasks are submitted to
137+
`dask.compute()` at once. The scheduler manages parallelism and memory. Each
138+
task is: compute chunk, compress, write tile file. No coordination between
139+
tasks.
140+
141+
## Out of scope
142+
143+
- Streaming write of a monolithic `.tif` from dask input (tracked as a separate
144+
issue). Users who need a single file from a large dask array can write to VRT
145+
and convert externally, or ensure sufficient RAM.
146+
- JPEG quality parameter (separate concern from compression level).
147+
- Automatic chunk-size recommendation based on available memory.
Lines changed: 188 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,188 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# GeoTIFF Performance Controls\n",
8+
"\n",
9+
"Three opt-in controls for memory, compression speed, and large-raster writes:\n",
10+
"\n",
11+
"- **`dtype` on read** — cast to a narrower type at load time to reduce memory use\n",
12+
"- **`compression_level` on write** — trade write speed for file size (or vice versa)\n",
13+
"- **VRT tiled output** — stream a chunked dask array to a directory of tiles without loading it all at once"
14+
]
15+
},
16+
{
17+
"cell_type": "code",
18+
"execution_count": null,
19+
"metadata": {},
20+
"outputs": [],
21+
"source": [
22+
"import numpy as np\n",
23+
"import xarray as xr\n",
24+
"import os\n",
25+
"import tempfile\n",
26+
"from xrspatial.geotiff import open_geotiff, to_geotiff"
27+
]
28+
},
29+
{
30+
"cell_type": "code",
31+
"execution_count": null,
32+
"metadata": {},
33+
"outputs": [],
34+
"source": [
35+
"# Create a 1000x1000 float64 DEM-like raster (~8MB)\n",
36+
"rng = np.random.default_rng(42)\n",
37+
"elevation = rng.normal(loc=500, scale=100, size=(1000, 1000)).astype(np.float64)\n",
38+
"y = np.linspace(40.0, 41.0, 1000)\n",
39+
"x = np.linspace(-106.0, -105.0, 1000)\n",
40+
"dem = xr.DataArray(elevation, dims=['y', 'x'],\n",
41+
" coords={'y': y, 'x': x},\n",
42+
" attrs={'crs': 4326})\n",
43+
"print(f\"DEM shape: {dem.shape}, dtype: {dem.dtype}, size: {dem.nbytes / 1e6:.1f} MB\")"
44+
]
45+
},
46+
{
47+
"cell_type": "markdown",
48+
"metadata": {},
49+
"source": [
50+
"## 1. dtype on read\n",
51+
"\n",
52+
"Pass `dtype` to `open_geotiff` to cast the raster to a narrower type at load time.\n",
53+
"Reading a float64 file as float32 halves memory use without any extra copy.\n",
54+
"The cast happens inside rasterio before the array reaches Python, so it works on\n",
55+
"all read paths: eager, dask, and GPU."
56+
]
57+
},
58+
{
59+
"cell_type": "code",
60+
"execution_count": null,
61+
"metadata": {},
62+
"outputs": [],
63+
"source": [
64+
"with tempfile.TemporaryDirectory() as tmpdir:\n",
65+
" path = os.path.join(tmpdir, 'dem_f64.tif')\n",
66+
" to_geotiff(dem, path)\n",
67+
"\n",
68+
" # Read at native dtype (float64)\n",
69+
" native = open_geotiff(path)\n",
70+
" print(f\"Native: dtype={native.dtype}, size={native.nbytes / 1e6:.1f} MB\")\n",
71+
"\n",
72+
" # Read as float32 -- half the memory\n",
73+
" downcast = open_geotiff(path, dtype='float32')\n",
74+
" print(f\"Downcast: dtype={downcast.dtype}, size={downcast.nbytes / 1e6:.1f} MB\")\n",
75+
"\n",
76+
" # Works with dask too\n",
77+
" dask_f32 = open_geotiff(path, dtype='float32', chunks=256)\n",
78+
" print(f\"Dask chunks dtype: {dask_f32.dtype}\")"
79+
]
80+
},
81+
{
82+
"cell_type": "markdown",
83+
"metadata": {},
84+
"source": [
85+
"## 2. compression_level\n",
86+
"\n",
87+
"`to_geotiff` accepts a `compression_level` argument alongside `compression`.\n",
88+
"For zstd the range is 1–22 (1 = fastest, 22 = smallest file).\n",
89+
"For deflate the range is 1–9.\n",
90+
"The default is the codec's own default when `compression_level` is omitted.\n",
91+
"\n",
92+
"Use a low level when write speed matters (streaming pipelines, scratch files).\n",
93+
"Use a high level for archival or network transfer where file size dominates."
94+
]
95+
},
96+
{
97+
"cell_type": "code",
98+
"execution_count": null,
99+
"metadata": {},
100+
"outputs": [],
101+
"source": [
102+
"import time\n",
103+
"\n",
104+
"with tempfile.TemporaryDirectory() as tmpdir:\n",
105+
" results = []\n",
106+
" for level in [1, 3, 10, 22]:\n",
107+
" path = os.path.join(tmpdir, f'dem_zstd_l{level}.tif')\n",
108+
" t0 = time.perf_counter()\n",
109+
" to_geotiff(dem, path, compression='zstd', compression_level=level)\n",
110+
" elapsed = time.perf_counter() - t0\n",
111+
" size_kb = os.path.getsize(path) / 1024\n",
112+
" results.append((level, elapsed, size_kb))\n",
113+
" print(f\" level={level:2d} time={elapsed:.3f}s size={size_kb:.0f} KB\")\n",
114+
"\n",
115+
" print(f\"\\nLevel 1 vs 22: {results[0][2]/results[-1][2]:.1f}x size difference\")"
116+
]
117+
},
118+
{
119+
"cell_type": "markdown",
120+
"metadata": {},
121+
"source": [
122+
"## 3. VRT tiled output\n",
123+
"\n",
124+
"Pass a `.vrt` path to `to_geotiff` and it writes a directory of GeoTIFF tiles\n",
125+
"plus a VRT index file that GDAL treats as a single dataset.\n",
126+
"\n",
127+
"Each tile corresponds to one dask chunk and is written independently, so only\n",
128+
"one chunk is in memory at a time. This makes it practical to write arrays that\n",
129+
"are larger than RAM.\n",
130+
"\n",
131+
"The VRT uses relative paths, so the whole output directory is portable."
132+
]
133+
},
134+
{
135+
"cell_type": "code",
136+
"execution_count": null,
137+
"metadata": {},
138+
"outputs": [],
139+
"source": [
140+
"with tempfile.TemporaryDirectory() as tmpdir:\n",
141+
" # Chunk the DEM for dask processing\n",
142+
" dask_dem = dem.chunk({'y': 500, 'x': 500})\n",
143+
"\n",
144+
" vrt_path = os.path.join(tmpdir, 'tiled_dem.vrt')\n",
145+
" to_geotiff(dask_dem, vrt_path, compression='zstd')\n",
146+
"\n",
147+
" # Show what was created\n",
148+
" tiles_dir = os.path.join(tmpdir, 'tiled_dem_tiles')\n",
149+
" print(\"Files created:\")\n",
150+
" print(f\" {os.path.basename(vrt_path)}\")\n",
151+
" for f in sorted(os.listdir(tiles_dir)):\n",
152+
" size = os.path.getsize(os.path.join(tiles_dir, f)) / 1024\n",
153+
" print(f\" tiled_dem_tiles/{f} ({size:.0f} KB)\")\n",
154+
"\n",
155+
" # Read it back via VRT\n",
156+
" result = open_geotiff(vrt_path)\n",
157+
" print(f\"\\nRound-trip: shape={result.shape}, dtype={result.dtype}\")\n",
158+
" print(f\"Max difference: {float(np.abs(result.values - dem.values).max()):.2e}\")"
159+
]
160+
},
161+
{
162+
"cell_type": "markdown",
163+
"metadata": {},
164+
"source": [
165+
"## Summary\n",
166+
"\n",
167+
"| Feature | Parameter | Where to use |\n",
168+
"|---|---|---|\n",
169+
"| dtype cast | `open_geotiff(..., dtype='float32')` | Reduce read memory by half |\n",
170+
"| compression level | `to_geotiff(..., compression_level=1)` | Fast scratch writes; set high for archival |\n",
171+
"| VRT tiled output | `to_geotiff(..., 'out.vrt')` | Stream large dask arrays to disk without OOM |"
172+
]
173+
}
174+
],
175+
"metadata": {
176+
"kernelspec": {
177+
"display_name": "Python 3",
178+
"language": "python",
179+
"name": "python3"
180+
},
181+
"language_info": {
182+
"name": "python",
183+
"version": "3.10.0"
184+
}
185+
},
186+
"nbformat": 4,
187+
"nbformat_minor": 5
188+
}

0 commit comments

Comments
 (0)