GeoZarr compliant data model for EOPF (Earth Observation Processing Framework) datasets.
This library provides tools to convert EOPF datasets to GeoZarr-spec 0.4 compliant format while maintaining native projections and using /2 downsampling logic for multiscale support.
- GeoZarr Specification Compliance: Full compliance with GeoZarr spec 0.4
- Native CRS Preservation: No reprojection to TMS, maintains original coordinate reference systems
- Multiscale Support: COG-style /2 downsampling with overview levels stored as child groups
- CF Conventions: Proper CF standard names and grid_mapping attributes
- Robust Processing: Band-by-band writing with validation and retry logic
- S3 Support: Direct output to Amazon S3 buckets with automatic credential validation
- Parallel Processing: Optional dask cluster support for parallel chunk processing
- Chunk Alignment: Automatic chunk alignment to prevent data corruption with dask
- _ARRAY_DIMENSIONS attributes on all arrays
- CF standard names for all variables
- grid_mapping attributes referencing CF grid_mapping variables
- GeoTransform attributes in grid_mapping variables
- Proper multiscales metadata structure
- Native CRS tile matrix sets
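The /2 downsampling scheme can be illustrated with a short sketch. This is not the library's implementation: it assumes that each overview level halves both spatial dimensions using integer division, and that levels stop once the smaller dimension would drop below a minimum size. The function name overview_shapes and its min_dimension argument are hypothetical.

```python
def overview_shapes(height, width, min_dimension=256):
    """Return one (height, width) tuple per level; level 0 is the native size.

    Assumption: a level is added only while the halved smaller dimension
    is still at least min_dimension pixels.
    """
    shapes = [(height, width)]
    while min(height, width) // 2 >= min_dimension:
        height, width = height // 2, width // 2
        shapes.append((height, width))
    return shapes


# Example: a 10980 x 10980 pixel grid
for level, shape in enumerate(overview_shapes(10980, 10980)):
    print(level, shape)
```

Under these assumptions, a 10980-pixel grid yields six levels, the coarsest being 343 pixels wide, because halving once more would fall below 256.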
pip install eopf-geozarr
For development:
git clone <repository-url>
cd eopf-geozarr
pip install -e ".[dev]"
After installation, you can use the eopf-geozarr command:
# Convert EOPF dataset to GeoZarr format (local output)
eopf-geozarr convert input.zarr output.zarr
# Convert EOPF dataset to GeoZarr format (S3 output)
eopf-geozarr convert input.zarr s3://my-bucket/path/to/output.zarr
# Convert with parallel processing using dask cluster
eopf-geozarr convert input.zarr output.zarr --dask-cluster
# Convert with dask cluster and verbose output
eopf-geozarr convert input.zarr output.zarr --dask-cluster --verbose
# Get information about a dataset
eopf-geozarr info input.zarr
# Validate GeoZarr compliance
eopf-geozarr validate output.zarr
# Get help
eopf-geozarr --help
The library supports direct output to S3-compatible storage, including custom providers like OVH Cloud. Simply provide an S3 URL as the output path:
# Convert to S3
eopf-geozarr convert local_input.zarr s3://my-bucket/geozarr-data/output.zarr --verbose
Before using S3 output, ensure your S3 credentials are configured:
For AWS S3:
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1
For OVH Cloud Object Storage:
export AWS_ACCESS_KEY_ID=your_ovh_access_key
export AWS_SECRET_ACCESS_KEY=your_ovh_secret_key
export AWS_DEFAULT_REGION=gra # or other OVH region
export AWS_ENDPOINT_URL=https://s3.gra.cloud.ovh.net # OVH endpoint
For other S3-compatible providers:
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=your_region
export AWS_ENDPOINT_URL=https://your-s3-endpoint.com
Alternative: AWS CLI Configuration
aws configure
# Note: For custom endpoints, you'll still need to set AWS_ENDPOINT_URL
- Custom Endpoints: Support for any S3-compatible storage (AWS, OVH Cloud, MinIO, etc.)
- Automatic Validation: The tool validates S3 access before starting conversion
- Credential Detection: Automatically detects and validates S3 credentials
- Error Handling: Provides helpful error messages for S3 configuration issues
- Performance: Optimized for S3 with proper chunking and retry logic
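As a rough illustration of checking credentials before starting a conversion, the sketch below inspects only the environment variables listed above. It is an assumption-laden stand-in, not the tool's validation logic: the function name missing_s3_settings is hypothetical, and credentials supplied through aws configure would not be seen by this check.

```python
import os

# Variables this sketch treats as mandatory for S3 output (an assumption).
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_s3_settings(output_path, env=None):
    """Return the names of required variables that are unset or empty.

    Local output paths need no credentials, so they always yield [].
    """
    env = os.environ if env is None else env
    if not output_path.startswith("s3://"):
        return []
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_s3_settings("s3://my-bucket/geozarr-data/output.zarr")
if missing:
    print("Missing S3 settings:", ", ".join(missing))
```

Passing the environment as an argument keeps the check easy to exercise without touching the real process environment.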
The library supports parallel processing using dask clusters for improved performance on large datasets:
# Enable dask cluster for parallel processing
eopf-geozarr convert input.zarr output.zarr --dask-cluster
# With verbose output to see cluster information
eopf-geozarr convert input.zarr output.zarr --dask-cluster --verbose
- Local Cluster: Automatically starts a local dask cluster with multiple workers
- Dashboard Access: Provides access to the dask dashboard for monitoring (shown in verbose mode)
- Automatic Cleanup: Properly closes the cluster even if errors occur during processing
- Chunk Alignment: Automatically aligns Zarr chunks with dask chunks to prevent data corruption
- Memory Efficiency: Better memory management through parallel chunk processing
- Error Handling: Graceful handling of dask import errors with helpful installation instructions
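The cluster lifecycle described above (optional start, guaranteed cleanup, a helpful message when dask is missing) could look roughly like the following. This is a sketch, not the code behind --dask-cluster: LocalCluster and Client are real classes from dask.distributed, but the context manager optional_dask_client and its error message are hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def optional_dask_client(enabled):
    """Yield a dask Client when enabled, otherwise None.

    The cluster and client are closed even if the body raises.
    """
    if not enabled:
        yield None
        return
    try:
        from dask.distributed import Client, LocalCluster
    except ImportError as exc:
        raise RuntimeError(
            "dask.distributed is required for cluster mode; "
            "try: pip install 'dask[distributed]'"
        ) from exc
    cluster = LocalCluster()
    client = Client(cluster)
    try:
        yield client
    finally:
        client.close()
        cluster.close()


# With enabled=False no cluster is started and the body runs serially.
with optional_dask_client(False) as client:
    print("dashboard:", client.dashboard_link if client else "no cluster")
```

Importing dask lazily inside the function means users who never request a cluster do not need dask installed.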
The library includes advanced chunk alignment logic to prevent the common issue of overlapping chunks when using dask:
- Smart Detection: Automatically detects if data is dask-backed and uses existing chunk structure
- Aligned Calculation: Uses calculate_aligned_chunk_size() to find optimal chunk sizes that divide evenly into data dimensions
- Proper Rechunking: Ensures datasets are rechunked to match encoding before writing
- Fallback Logic: For non-dask arrays, uses reasonable chunk sizes that don't exceed data dimensions
This prevents errors like:
❌ Failed to write tci after 2 attempts: Specified Zarr chunks encoding['chunks']=(1, 3660, 3660)
for variable named 'tci' would overlap multiple Dask chunks
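One way to obtain a chunk size that divides evenly into a dimension is to search downward from the target for the largest divisor. The sketch below does exactly that; it is an illustrative reimplementation under that assumption, with the hypothetical name aligned_chunk_size, and the library's calculate_aligned_chunk_size() may use a different strategy.

```python
def aligned_chunk_size(dimension_size, target_chunk_size):
    """Largest divisor of dimension_size that does not exceed the target.

    Both arguments are assumed to be positive integers.
    """
    if target_chunk_size >= dimension_size:
        # A single chunk covering the whole dimension is always aligned.
        return dimension_size
    for candidate in range(target_chunk_size, 0, -1):
        if dimension_size % candidate == 0:
            return candidate  # 1 always divides, so the loop always returns


print(aligned_chunk_size(5490, 3660))   # 2745, since 5490 = 2 * 2745
print(aligned_chunk_size(10980, 4096))  # 3660, since 10980 = 3 * 3660
```

Because every chunk then has the same size and the chunks tile the dimension exactly, no Zarr chunk straddles a boundary between two Dask chunks of that size.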
import os
import xarray as xr
from eopf_geozarr import create_geozarr_dataset
# Configure for OVH Cloud (example)
os.environ['AWS_ACCESS_KEY_ID'] = 'your_ovh_access_key'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'your_ovh_secret_key'
os.environ['AWS_DEFAULT_REGION'] = 'gra'
os.environ['AWS_ENDPOINT_URL'] = 'https://s3.gra.cloud.ovh.net'
# Load your EOPF DataTree
dt = xr.open_datatree("path/to/eopf/dataset.zarr", engine="zarr")
# Convert directly to S3
dt_geozarr = create_geozarr_dataset(
    dt_input=dt,
    groups=["/measurements/r10m", "/measurements/r20m", "/measurements/r60m"],
    output_path="s3://my-bucket/geozarr-data/output.zarr",
    spatial_chunk=4096,
    min_dimension=256,
    tile_width=256,
    max_retries=3
)
import xarray as xr
from eopf_geozarr import create_geozarr_dataset
# Load your EOPF DataTree
dt = xr.open_datatree("path/to/eopf/dataset.zarr", engine="zarr")
# Define groups to convert (e.g., resolution groups)
groups = ["/measurements/r10m", "/measurements/r20m", "/measurements/r60m"]
# Convert to GeoZarr compliant format
dt_geozarr = create_geozarr_dataset(
    dt_input=dt,
    groups=groups,
    output_path="path/to/output/geozarr.zarr",
    spatial_chunk=4096,
    min_dimension=256,
    tile_width=256,
    max_retries=3
)
Create a GeoZarr-spec 0.4 compliant dataset from EOPF data.
Parameters:
- dt_input (xr.DataTree): Input EOPF DataTree
- groups (List[str]): List of group names to process as GeoZarr datasets
- output_path (str): Output path for the Zarr store
- spatial_chunk (int, default=4096): Spatial chunk size for encoding
- min_dimension (int, default=256): Minimum dimension for overview levels
- tile_width (int, default=256): Tile width for TMS compatibility
- max_retries (int, default=3): Maximum number of retries for network operations
Returns:
xr.DataTree: DataTree containing the GeoZarr compliant data
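To make the role of max_retries concrete, here is a generic retry wrapper. It is only a sketch: it assumes max_retries counts total attempts, and the name call_with_retries and the delay_seconds argument are hypothetical and not part of the library's API.

```python
import time


def call_with_retries(operation, max_retries=3, delay_seconds=0.0):
    """Call operation() until it succeeds or max_retries attempts are used.

    Assumption: max_retries is the total number of attempts, not the
    number of retries after a first failure.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return operation()
        except Exception as exc:  # a real writer would catch narrower errors
            last_error = exc
            if attempt < max_retries:
                time.sleep(delay_seconds)
    raise RuntimeError(f"Failed after {max_retries} attempts") from last_error
```

A band-by-band writer could wrap each write in such a helper, for example call_with_retries(lambda: write_band("tci")), where write_band is a hypothetical function, so that one transient network error does not abort the whole conversion.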
Set up GeoZarr-spec compliant CF standard names and CRS information.
Parameters:
- dt (xr.DataTree): The data tree containing the datasets to process
- groups (List[str]): List of group names to process as GeoZarr datasets
Returns:
Dict[str, xr.Dataset]: Dictionary of datasets with GeoZarr compliance applied
Downsample a 2D array using block averaging.
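Block averaging can be written compactly with NumPy by reshaping the array into blocks and taking the mean over each block. The sketch below illustrates the technique only. The function name block_average is hypothetical, and trimming rows and columns that do not fill a complete block is an assumption about edge handling.

```python
import numpy as np


def block_average(array, factor=2):
    """Downsample a 2D array by averaging non-overlapping factor x factor blocks.

    Trailing rows/columns that do not fill a whole block are dropped.
    """
    rows = (array.shape[0] // factor) * factor
    cols = (array.shape[1] // factor) * factor
    trimmed = array[:rows, :cols]
    # Axes 1 and 3 index the position inside each block.
    blocks = trimmed.reshape(rows // factor, factor, cols // factor, factor)
    return blocks.mean(axis=(1, 3))


data = np.arange(16, dtype=float).reshape(4, 4)
print(block_average(data))
# [[ 2.5  4.5]
#  [10.5 12.5]]
```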
Calculate a chunk size that divides evenly into the dimension size. This ensures that Zarr chunks align properly with the data dimensions, preventing chunk overlap issues when writing with Dask.
Parameters:
- dimension_size (int): Size of the dimension to chunk
- target_chunk_size (int): Desired chunk size
Returns:
int: Aligned chunk size that divides evenly into dimension_size
Example:
from eopf_geozarr.conversion.utils import calculate_aligned_chunk_size
# For a dimension of size 5490 with target chunk size 3660
aligned_size = calculate_aligned_chunk_size(5490, 3660) # Returns 2745
Check if a variable is a grid_mapping variable by looking for references to it.
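The idea of detecting a grid_mapping variable through references can be sketched without any geospatial library. In the example below, plain dictionaries stand in for variable attributes; the function name referenced_as_grid_mapping and the variable names are hypothetical, and the library's own check operates on xarray objects and may differ.

```python
def referenced_as_grid_mapping(variable_attrs, name):
    """Return True if any other variable's grid_mapping attribute names `name`.

    variable_attrs maps each variable name to its attribute dictionary.
    """
    return any(
        attrs.get("grid_mapping") == name
        for other, attrs in variable_attrs.items()
        if other != name
    )


attrs_by_variable = {
    "b02": {"standard_name": "toa_bidirectional_reflectance", "grid_mapping": "spatial_ref"},
    "spatial_ref": {"grid_mapping_name": "transverse_mercator"},
}
print(referenced_as_grid_mapping(attrs_by_variable, "spatial_ref"))  # True
print(referenced_as_grid_mapping(attrs_by_variable, "b02"))          # False
```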
Validate that a specific band exists and is complete in the dataset.
The library is organized into the following modules:
- conversion: Core conversion tools for EOPF to GeoZarr transformation
  - geozarr.py: Main conversion functions and GeoZarr spec compliance
  - utils.py: Utility functions for data processing and validation
- data_api: Data access API (future development with pydantic-zarr)
This library implements the GeoZarr specification 0.4 with the following key requirements:
- Array Dimensions: All arrays must have _ARRAY_DIMENSIONS attributes
- CF Standard Names: All variables must have CF-compliant standard_name attributes
- Grid Mapping: Data variables must reference CF grid_mapping variables via grid_mapping attributes
- Multiscales Structure: Overview levels are stored as child groups with proper tile matrix metadata
- Native CRS: Coordinate reference systems are preserved without reprojection
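The attribute requirements above lend themselves to a simple metadata check. The following sketch works on plain dictionaries rather than a Zarr store and covers only the three attribute rules for data variables. It is not the logic behind eopf-geozarr validate, and the name find_metadata_gaps is hypothetical.

```python
REQUIRED_ATTRS = ("_ARRAY_DIMENSIONS", "standard_name", "grid_mapping")


def find_metadata_gaps(variable_attrs, data_variables):
    """List human-readable problems for the given data variables.

    variable_attrs maps variable names to attribute dictionaries.
    """
    problems = []
    for name in data_variables:
        attrs = variable_attrs.get(name, {})
        for key in REQUIRED_ATTRS:
            if key not in attrs:
                problems.append(f"{name}: missing {key}")
        # A grid_mapping reference must point at a variable that exists.
        target = attrs.get("grid_mapping")
        if target is not None and target not in variable_attrs:
            problems.append(f"{name}: grid_mapping target {target!r} not found")
    return problems


store = {
    "b02": {"_ARRAY_DIMENSIONS": ["y", "x"], "grid_mapping": "spatial_ref"},
    "spatial_ref": {"grid_mapping_name": "transverse_mercator"},
}
print(find_metadata_gaps(store, ["b02"]))  # ['b02: missing standard_name']
```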
# Clone the repository
git clone <repository-url>
cd eopf-geozarr
# Install in development mode with all dependencies
pip install -e ".[dev,docs,all]"
# Install pre-commit hooks
pre-commit install
pytest
The project uses:
- Black for code formatting
- isort for import sorting
- flake8 for linting
- mypy for type checking
- pre-commit for automated checks
cd docs
make html
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Run tests and ensure code quality checks pass
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Built on top of the excellent xarray and zarr libraries
- Follows the GeoZarr specification for geospatial data in Zarr
- Designed for compatibility with EOPF datasets
For questions, issues, or contributions, please visit the GitHub repository.