perf(geotiff): _write_vrt_tiled uses threaded dask scheduler (#1714) #1725
Merged
brendancol merged 2 commits on May 12, 2026
Conversation
Contributor
Pull request overview
Improve performance of VRT tiled GeoTIFF writes by running per-tile dask.delayed write tasks using Dask’s threaded scheduler, and add tests that pin the scheduler choice and validate correctness/determinism of concurrent tile writes.
Changes:
- Switch `_write_vrt_tiled` from `scheduler='synchronous'` to `scheduler='threads'` when calling `dask.compute`.
- Add a new test module validating scheduler selection, tile count output, and deterministic (byte-identical) output across repeated runs.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `xrspatial/geotiff/__init__.py` | Uses Dask's threaded scheduler for VRT tile write task execution and documents rationale. |
| `xrspatial/geotiff/tests/test_vrt_tiled_scheduler_1714.py` | Adds coverage for threaded scheduling and concurrent tile-write correctness/determinism. |
```python
import dask
import dask.array as da
import numpy as np
import pytest
```
Comment on lines +93 to +96
```python
return {
    os.path.basename(p): open(p, "rb").read()
    for p in sorted(glob.glob(os.path.join(tiles_dir, "*.tif")))
}
```
Each delayed task in `_write_vrt_tiled` writes one tile to its own output path with no shared mutable Python state, so the writes are embarrassingly parallel. The prior code called `dask.compute` with `scheduler='synchronous'`, which forced every tile through the calling thread one at a time.

Switch to `scheduler='threads'`. zlib/zstd/LZW release the GIL during compression, so threading delivers real wall-time wins on the compression stage.

Microbench: 4096x4096 float32 dask DataArray with `chunks=256` (256 output tiles) at zstd compression drops from 0.49s to 0.33s (~33% reduction).

Adds tests covering the scheduler choice, the tile-file inventory, and a determinism check that runs the same write twice and compares every tile byte-for-byte to catch any race regression.
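The pattern described here can be sketched as follows. This is an illustrative stand-in, not the actual xrspatial code: `_write_tile`, the payloads, and the tile naming are all hypothetical.

```python
import os
import tempfile

import dask


def _write_tile(path, payload):
    # Each task writes its own file: no shared mutable state,
    # so the tasks are safe to run concurrently.
    with open(path, "wb") as f:
        f.write(payload)
    return path


tiles_dir = tempfile.mkdtemp()
tasks = [
    dask.delayed(_write_tile)(
        os.path.join(tiles_dir, f"tile_{i}.tif"), bytes([i]) * 16
    )
    for i in range(4)
]
# The threaded scheduler overlaps the independent per-file writes,
# instead of serialising them on the calling thread.
paths = dask.compute(*tasks, scheduler="threads")
```

Because every task targets a distinct path, no locking is needed; the scheduler choice only affects how much of the work overlaps in time.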
Force-pushed from c3fc5e4 to 05fea7a.
- Remove unused `pytest` import in `test_vrt_tiled_scheduler_1714.py`
- Use `Path.read_bytes()` in the tile-byte comparison to avoid leaking file descriptors (the previous dict comprehension opened files via `open(p, "rb").read()` without a context manager)
Summary
`_write_vrt_tiled` builds one `dask.delayed` task per output tile, then runs them all through `dask.compute`. Each task writes to its own filepath and never touches shared mutable Python state, so the writes are embarrassingly parallel. The prior code passed `scheduler='synchronous'`, which serialised every tile on the calling thread.

Switch to `scheduler='threads'`. zlib / zstd / LZW release the GIL during compression, so threading delivers real wall-time wins on the compression stage.

Microbench on a 16-thread box, 4096x4096 float32 dask DataArray with `chunks=256` (256 output tiles), zstd compression: 0.49s drops to 0.33s (~33% reduction). The gain grows with tile count and codec cost.
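The GIL point can be demonstrated with the standard library alone. This is a standalone illustration, not code from this PR: `zlib.compress` releases the GIL while compressing, so a plain thread pool genuinely overlaps the CPU-bound work on independent buffers.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Eight independent buffers, compressed concurrently. Because zlib
# releases the GIL inside compress(), the four worker threads make
# real parallel progress on the CPU-bound compression.
payloads = [bytes(range(256)) * 2048 for _ in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    compressed = list(pool.map(zlib.compress, payloads))
```

The same reasoning applies to the zstd and LZW codecs used for GeoTIFF tiles: the compression stage parallelises under threads even though the orchestrating code is pure Python.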
Closes #1714.
Test plan
- `test_vrt_tiled_uses_threaded_scheduler`: patches `dask.compute` and asserts the writer passes `scheduler='threads'`.
- `test_vrt_tiled_threaded_write_produces_all_tiles`: 4x4 chunked input produces exactly 16 tile files.
- `test_vrt_tiled_threaded_write_is_deterministic`: runs the same write twice into separate dirs, byte-compares every tile, catches any concurrent-write race.
- `xrspatial/geotiff/tests/test_vrt_tiled_metadata_1606.py`, `test_polish_1488.py`, `test_vrt_write.py`, `test_streaming_write_parallel.py`, and `test_writer.py` (67 tests) all pass.
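The determinism check can be sketched with the standard library; `write_run` and `snapshot` are hypothetical helpers standing in for the real tiled writer and the test's comparison step.

```python
import hashlib
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_run(out_dir, n_tiles=16):
    # Stand-in for one tiled write: each worker writes its own file,
    # mimicking the embarrassingly parallel per-tile tasks.
    def write_one(i):
        Path(out_dir, f"tile_{i:02d}.tif").write_bytes(bytes([i]) * 64)

    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(write_one, range(n_tiles)))


def snapshot(out_dir):
    # Hash every tile so two runs can be compared byte-for-byte.
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(out_dir).glob("*.tif"))
    }


run_a, run_b = tempfile.mkdtemp(), tempfile.mkdtemp()
write_run(run_a)
write_run(run_b)
assert snapshot(run_a) == snapshot(run_b)  # byte-identical across runs
```

If a concurrency bug ever made tile contents depend on thread scheduling, the two snapshots would diverge and the assertion would fail.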