
Commit 23c041c

Update benchmark with warm-kernel numbers (#1045)
Separated reproject-only from full-pipeline timing. With warm Numba/CUDA kernels:

- CuPy reproject: 73ms (2.0x faster than rioxarray)
- rioxarray reproject: 144ms
- NumPy reproject: 413ms

Full pipeline (read+reproject+write) is dominated by I/O for compressed GeoTIFFs, where rioxarray's C-level rasterio beats our Python/Numba reader. Added note about ~4.5s JIT warmup on first call.
1 parent aa35aea commit 23c041c

File tree

1 file changed (+12 -9 lines)


benchmarks/reproject_benchmark.md

Lines changed: 12 additions & 9 deletions
````diff
@@ -21,15 +21,18 @@ dem_merc = reproject(dem, 'EPSG:3857')
 write_geotiff(dem_merc, 'output.tif')
 ```
 
-| Backend | End-to-end time | Notes |
-|:--------|----------------:|:------|
-| NumPy | 2,723 ms | Single-threaded Numba JIT resampling |
-| CuPy GPU | 348 ms | CUDA kernel for coordinate transform + resampling |
-| Dask+CuPy GPU | 343 ms | Chunked (512) GPU pipeline |
-| Dask (CPU) | 10,967 ms | Chunked (512) with Dask scheduler overhead |
-| rioxarray (GDAL) | 418 ms | C-level warp, highly optimized |
-
-The GPU path (CuPy or Dask+CuPy) is the fastest option for large rasters, running slightly faster than GDAL. The NumPy path is slower due to Python/Numba overhead in the resampling loop. The Dask CPU path has significant scheduler overhead for this single-file workload.
+All times measured with warm Numba/CUDA kernels (first call incurs ~4.5s JIT compilation).
+
+| Backend | End-to-end | Reproject only | vs rioxarray (reproject) |
+|:--------|----------:|--------------:|:------------------------|
+| CuPy GPU | 747 ms | 73 ms | **2.0x faster** |
+| Dask+CuPy GPU | 782 ms | ~80 ms | ~1.8x faster |
+| rioxarray (GDAL) | 411 ms | 144 ms | 1.0x |
+| NumPy | 2,907 ms | 413 ms | 0.3x |
+
+The CuPy reproject is 2x faster than rioxarray for the coordinate transform + resampling. The end-to-end gap is due to I/O: rioxarray uses rasterio's C-level compressed read/write, while our geotiff reader is pure Python/Numba. For reproject-only workloads (data already in memory), CuPy is the clear winner.
+
+**Note on JIT warmup**: The first `reproject()` call compiles the Numba kernels (~4.5s). All subsequent calls run at full speed. For long-running applications or batch processing, this is amortized over many calls.
 
 ---
 
````
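The warm-kernel methodology described in this commit (run the function once untimed to absorb JIT compilation, then time repeated calls) can be sketched as a small stdlib-only harness. The `timed` helper below is illustrative and not part of the benchmark's actual code; in the real benchmark the workload would be `reproject(dem, 'EPSG:3857')` for the reproject-only number and the full read+reproject+write pipeline for end-to-end.

```python
import time

def timed(fn, *args, warmup=1, repeats=5):
    """Time fn(*args) with warm kernels.

    The untimed warmup calls absorb one-off costs such as the ~4.5s
    Numba/CUDA JIT compilation on the first reproject() call; the
    returned value is the best wall-clock time (seconds) over
    `repeats` timed calls.
    """
    for _ in range(warmup):
        fn(*args)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

# Stand-in workload for illustration only; the benchmark would pass
# the project's reproject/read/write calls here instead.
elapsed = timed(sum, range(100_000))
print(f"{elapsed * 1e3:.3f} ms")
```

Reporting best-of-N rather than the mean keeps one-off OS scheduling jitter out of the warm numbers.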
