Commit e55cd98

Align chunk sizes in create_geozarr_compliant_multiscales function for improved compatibility with spatial dimensions.
1 parent 95acba5 commit e55cd98

2 files changed

Lines changed: 66 additions & 3 deletions

README.md

Lines changed: 60 additions & 1 deletion
@@ -14,6 +14,8 @@ This library provides tools to convert EOPF datasets to GeoZarr-spec 0.4 complia
 - **CF Conventions**: Proper CF standard names and grid_mapping attributes
 - **Robust Processing**: Band-by-band writing with validation and retry logic
 - **S3 Support**: Direct output to Amazon S3 buckets with automatic credential validation
+- **Parallel Processing**: Optional dask cluster support for parallel chunk processing
+- **Chunk Alignment**: Automatic chunk alignment to prevent data corruption with dask
 
 ## GeoZarr Compliance Features
 
@@ -51,6 +53,12 @@ eopf-geozarr convert input.zarr output.zarr
 # Convert EOPF dataset to GeoZarr format (S3 output)
 eopf-geozarr convert input.zarr s3://my-bucket/path/to/output.zarr
 
+# Convert with parallel processing using dask cluster
+eopf-geozarr convert input.zarr output.zarr --dask-cluster
+
+# Convert with dask cluster and verbose output
+eopf-geozarr convert input.zarr output.zarr --dask-cluster --verbose
+
 # Get information about a dataset
 eopf-geozarr info input.zarr
 
@@ -111,6 +119,42 @@ aws configure
 - **Error Handling**: Provides helpful error messages for S3 configuration issues
 - **Performance**: Optimized for S3 with proper chunking and retry logic
 
+### Parallel Processing with Dask
+
+The library supports parallel processing using dask clusters for improved performance on large datasets:
+
+```bash
+# Enable dask cluster for parallel processing
+eopf-geozarr convert input.zarr output.zarr --dask-cluster
+
+# With verbose output to see cluster information
+eopf-geozarr convert input.zarr output.zarr --dask-cluster --verbose
+```
+
+#### Dask Features
+
+- **Local Cluster**: Automatically starts a local dask cluster with multiple workers
+- **Dashboard Access**: Provides access to the dask dashboard for monitoring (shown in verbose mode)
+- **Automatic Cleanup**: Properly closes the cluster even if errors occur during processing
+- **Chunk Alignment**: Automatically aligns Zarr chunks with dask chunks to prevent data corruption
+- **Memory Efficiency**: Better memory management through parallel chunk processing
+- **Error Handling**: Graceful handling of dask import errors with helpful installation instructions
+
+#### Chunk Alignment
+
+The library includes advanced chunk alignment logic to prevent the common issue of overlapping chunks when using dask:
+
+- **Smart Detection**: Automatically detects if data is dask-backed and uses existing chunk structure
+- **Aligned Calculation**: Uses `calculate_aligned_chunk_size()` to find optimal chunk sizes that divide evenly into data dimensions
+- **Proper Rechunking**: Ensures datasets are rechunked to match encoding before writing
+- **Fallback Logic**: For non-dask arrays, uses reasonable chunk sizes that don't exceed data dimensions
+
+This prevents errors like:
+```
+❌ Failed to write tci after 2 attempts: Specified Zarr chunks encoding['chunks']=(1, 3660, 3660)
+for variable named 'tci' would overlap multiple Dask chunks
+```
+
 #### S3 Python API
 
 ```python
@@ -202,7 +246,22 @@ Downsample a 2D array using block averaging.
 
 #### `calculate_aligned_chunk_size`
 
-Calculate a chunk size that aligns well with the data dimension.
+Calculate a chunk size that divides evenly into the dimension size. This ensures that Zarr chunks align properly with the data dimensions, preventing chunk overlap issues when writing with Dask.
+
+**Parameters:**
+- `dimension_size` (int): Size of the dimension to chunk
+- `target_chunk_size` (int): Desired chunk size
+
+**Returns:**
+- `int`: Aligned chunk size that divides evenly into dimension_size
+
+**Example:**
+```python
+from eopf_geozarr.conversion.utils import calculate_aligned_chunk_size
+
+# For a dimension of size 5490 with target chunk size 3660
+aligned_size = calculate_aligned_chunk_size(5490, 3660) # Returns 2745
+```
 
 #### `is_grid_mapping_variable`
 
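The overlap error quoted in the README changes can be illustrated without any library installed. The following is a minimal sketch of the kind of check that triggers it, assuming the simplified rule that every Dask chunk except the last must contain a whole number of Zarr chunks. The helper name `zarr_chunk_overlaps_dask` and the Dask chunk layout `(2745, 2745)` are assumptions made for illustration; they are not taken from the commit.

```python
def zarr_chunk_overlaps_dask(dask_chunks, zarr_chunk):
    """Return True if a Zarr chunk would span more than one Dask chunk.

    dask_chunks: tuple of Dask chunk lengths along one dimension.
    zarr_chunk: requested Zarr chunk length along the same dimension.
    """
    # Every Dask chunk except the last must hold a whole number of Zarr
    # chunks; otherwise some Zarr chunk straddles a Dask chunk boundary.
    return any(size % zarr_chunk != 0 for size in dask_chunks[:-1])


# Assumed layout: a 5490-pixel dimension split into two Dask chunks.
dask_chunks = (2745, 2745)

print(zarr_chunk_overlaps_dask(dask_chunks, 3660))  # True: 3660 straddles the boundary
print(zarr_chunk_overlaps_dask(dask_chunks, 2745))  # False: chunks line up
```

Under this assumed layout, a Zarr chunk of 3660 crosses the boundary between the two Dask chunks, while 2745 (the aligned value from the README example) does not.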
eopf_geozarr/conversion/geozarr.py

Lines changed: 6 additions & 2 deletions
@@ -778,9 +778,13 @@ def create_geozarr_compliant_multiscales(
     encoding[var] = {"compressors": None}
 else:
     # Use smaller chunks for overview levels
-    chunk_size = min(256, width, height)
+    spatial_chunk_aligned = min(
+        spatial_chunk,
+        utils.calculate_aligned_chunk_size(width, spatial_chunk),
+        utils.calculate_aligned_chunk_size(height, spatial_chunk),
+    )
     encoding[var] = {
-        "chunks": (chunk_size, chunk_size),
+        "chunks": (1, spatial_chunk_aligned, spatial_chunk_aligned),
         "compressors": compressor,
     }
 
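To make the new chunk computation concrete, here is a minimal sketch that mirrors the `min(...)` expression from this hunk. It assumes `calculate_aligned_chunk_size` returns the largest divisor of the dimension that does not exceed the target, which is consistent with the README example (5490, 3660 → 2745) but is not confirmed by the diff. `largest_divisor_at_most` and `overview_chunks` are hypothetical names, and the overview size 2745 and `spatial_chunk` value 1024 are illustrative assumptions.

```python
def largest_divisor_at_most(dimension_size, target_chunk_size):
    """Hypothetical stand-in for utils.calculate_aligned_chunk_size."""
    # Walk down from the target until a value divides the dimension evenly.
    for candidate in range(min(dimension_size, target_chunk_size), 0, -1):
        if dimension_size % candidate == 0:
            return candidate
    # Only reached for non-positive inputs, where the range above is empty.
    return dimension_size


def overview_chunks(width, height, spatial_chunk):
    """Mirror the min(...) expression from the hunk for one overview level."""
    aligned = min(
        spatial_chunk,
        largest_divisor_at_most(width, spatial_chunk),
        largest_divisor_at_most(height, spatial_chunk),
    )
    # Leading 1 matches the (1, y, x) chunk tuple written to the encoding.
    return (1, aligned, aligned)


print(largest_divisor_at_most(5490, 3660))  # 2745, as in the README example
print(overview_chunks(2745, 2745, 1024))    # (1, 915, 915)
```

In this sketch the result divides both dimensions whenever width and height are equal. For unequal dimensions the smaller of the two aligned values need not divide the other dimension (for example, 10 by 9 with a target of 8 gives 3, which does not divide 10).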