Common questions and solutions for using the EOPF GeoZarr library.
EOPF GeoZarr is a Python library that converts EOPF (Earth Observation Processing Framework) datasets to GeoZarr-spec 0.4 compliant format. It maintains scientific accuracy while optimizing for cloud-native workflows and performance.
GeoZarr is a specification that extends Zarr with geospatial metadata standards. Our library specifically:
- Ensures GeoZarr 0.4 specification compliance
- Preserves native coordinate reference systems
- Creates multiscale pyramids for efficient visualization
- Maintains CF conventions for scientific interoperability
- Optimizes chunking for Earth observation data patterns
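As an illustration of the multiscale idea, here is a minimal numpy sketch of a 2x block-mean pyramid (a simplified stand-in; the library's actual overview generation and resampling method may differ):

```python
import numpy as np

def build_pyramid(arr: np.ndarray, levels: int) -> list[np.ndarray]:
    """Level 0 is native resolution; each subsequent level halves both dimensions."""
    pyramid = [arr]
    for _ in range(levels):
        a = pyramid[-1]
        h, w = a.shape[0] // 2 * 2, a.shape[1] // 2 * 2  # trim odd edges
        pyramid.append(a[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return pyramid

shapes = [p.shape for p in build_pyramid(np.ones((8, 8)), 2)]
print(shapes)  # [(8, 8), (4, 4), (2, 2)]
```

Each level is roughly a quarter the size of the previous one, which is why two overview levels add only about a third to the total data volume.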
Currently, the library is optimized for:
- Sentinel-2 (L1C and L2A products)
- Sentinel-1 (planned support)
The architecture is designed to support additional missions with minimal modifications.
The library uses modern Python features and depends on recent versions of scientific libraries (xarray, zarr, dask) that require Python 3.11+.
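To check this up front, a small guard (a hypothetical helper, not part of the library) can sit at the top of your scripts:

```python
import sys

def check_python(minimum: tuple[int, int] = (3, 11)) -> bool:
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum

if not check_python():
    print(f"Warning: eopf-geozarr needs Python 3.11+, found {sys.version.split()[0]}")
```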
While the library is primarily distributed via PyPI, you can install it in a conda environment:
conda create -n eopf-geozarr python=3.11
conda activate eopf-geozarr
pip install eopf-geozarr

Multiple options are available:
# Option 1: AWS CLI
aws configure
# Option 2: Environment variables
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
# Option 3: IAM roles (for EC2/ECS)
# No setup needed - automatic detection

Inspect your EOPF dataset first:
# test: skip
import xarray as xr
dt = xr.open_datatree("input.zarr", engine="zarr")
print(dt) # Shows the full structure
# Common Sentinel-2 groups:
groups = [
    "/measurements/r10m",  # 10m bands: B02, B03, B04, B08
    "/measurements/r20m",  # 20m bands: B05, B06, B07, B8A, B11, B12
    "/measurements/r60m"   # 60m bands: B01, B09, B10
]

The optimal chunk size depends on your data and use case:
# test: skip
from eopf_geozarr.conversion.utils import calculate_aligned_chunk_size
# For typical Sentinel-2 data (10980x10980)
optimal_chunk = calculate_aligned_chunk_size(10980, 4096)
print(optimal_chunk) # Returns 3660
# General guidelines:
# - 4096: Good default for most cases
# - 2048: Better for memory-constrained environments
# - 8192: For high-memory systems and large datasets

Use Dask for distributed processing:
# test: skip
from dask.distributed import Client
# Start Dask cluster
client = Client('scheduler-address:8786')
# Use smaller chunks for distributed processing
dt_geozarr = create_geozarr_dataset(
    dt_input=dt,
    groups=["/measurements/r10m"],
    output_path="output.zarr",
    spatial_chunk=2048  # Smaller chunks work better with Dask
)

Currently, the library processes all bands within a group. To process specific bands, you would need to create a subset of your input dataset first:
# test: skip
# Create subset with specific bands
dt_subset = dt.copy()
ds_10m = dt_subset["/measurements/r10m"].ds
ds_subset = ds_10m[["b02", "b03", "b04"]] # Only RGB bands
dt_subset["/measurements/r10m"].ds = ds_subset
# Then convert
dt_geozarr = create_geozarr_dataset(
    dt_input=dt_subset,
    groups=["/measurements/r10m"],
    output_path="rgb_only.zarr"
)

Cause: Library not installed or wrong Python environment.
Solutions:
# test: skip
# Verify installation
pip list | grep eopf-geozarr
# Reinstall if missing
pip install eopf-geozarr
# Check Python environment
which python
python --version

Cause: Specified groups don't exist in the input dataset.
Solution:
# test: skip
# Check available groups
dt = xr.open_datatree("input.zarr", engine="zarr")
print("Available groups:", list(dt.groups))
# Use correct group paths
groups = [g for g in dt.groups if "measurements" in g]

Cause: Dataset too large for available memory.
Solutions:
# test: skip
# 1. Use smaller chunks
dt_geozarr = create_geozarr_dataset(
    dt_input=dt,
    groups=["/measurements/r10m"],
    output_path="output.zarr",
    spatial_chunk=1024  # Smaller chunks
)
# 2. Use Dask for out-of-core processing
from dask.distributed import Client
client = Client()
# 3. Process groups one at a time
for group in ["/measurements/r10m", "/measurements/r20m"]:
    dt_geozarr = create_geozarr_dataset(
        dt_input=dt,
        groups=[group],
        output_path=f"output_{group.split('/')[-1]}.zarr"
    )

Cause: Insufficient S3 permissions or incorrect credentials.
Solutions:
# test: skip
# 1. Verify credentials
from eopf_geozarr.conversion.fs_utils import get_s3_credentials_info
print(get_s3_credentials_info())
# 2. Test S3 access
from eopf_geozarr.conversion.fs_utils import validate_s3_access
is_valid, error = validate_s3_access("s3://your-bucket/test.zarr")
print(f"Valid: {is_valid}, Error: {error}")
# 3. Check IAM permissions (need s3:GetObject, s3:PutObject, s3:ListBucket)

Cause: Corrupted or incomplete Zarr dataset.
Solutions:
# test: skip
# 1. Validate input dataset
try:
    dt = xr.open_datatree("input.zarr", engine="zarr")
    print("Dataset loaded successfully")
except Exception as e:
    print(f"Dataset error: {e}")
# 2. Check for missing arrays
import zarr
store = zarr.open("input.zarr", mode="r")
print("Available arrays:", list(store.array_keys()))
# 3. Consolidate metadata if needed
zarr.consolidate_metadata("input.zarr")

Cause: Missing or invalid coordinate reference system information.
Solutions:
# test: skip
# Check CRS information
ds = dt["/measurements/r10m"].ds
print("CRS variables:", [v for v in ds.data_vars if 'crs' in v.lower() or 'spatial_ref' in v])
# Check coordinate attributes
print("X coord attrs:", ds.x.attrs)
print("Y coord attrs:", ds.y.attrs)
# Verify rioxarray can read CRS
import rioxarray
try:
    crs = ds.rio.crs
    print(f"CRS: {crs}")
except Exception as e:
    print(f"CRS error: {e}")

Several factors affect performance:
- Chunk size: Too small = many operations, too large = memory issues
- Network: S3 operations depend on bandwidth and latency
- CPU: Overview generation is CPU-intensive
- Memory: Insufficient RAM causes swapping
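To make the chunk-size factor concrete: the earlier example showed `calculate_aligned_chunk_size(10980, 4096)` returning 3660, i.e. a chunk that divides the dimension evenly. A simplified sketch of that behavior (an assumption about the rule, not the library's exact implementation):

```python
def aligned_chunk_size(dim_size: int, target: int) -> int:
    """Largest chunk <= target that divides dim_size evenly (no ragged edge chunks)."""
    for chunk in range(min(target, dim_size), 0, -1):
        if dim_size % chunk == 0:
            return chunk
    return dim_size

print(aligned_chunk_size(10980, 4096))  # 3660
print(aligned_chunk_size(10980, 2048))  # 1830
```

Even chunks avoid partial reads and writes at array edges, which matters most for S3-backed stores.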
Optimization strategies:
# test: skip
# 1. Optimal chunking
chunk_size = calculate_aligned_chunk_size(data_width, 4096)
# 2. Use Dask for parallelization
from dask.distributed import Client
client = Client(n_workers=4, threads_per_worker=2)
# 3. Process in batches
for group in groups:
    # Process one group at a time
    pass
# 4. Use SSD storage for temporary files
import tempfile
import os
os.environ['TMPDIR'] = '/path/to/fast/storage'

Enable verbose logging:
# test: skip
import logging
logging.basicConfig(level=logging.INFO)
# Or use the CLI with verbose flag
# eopf-geozarr convert input.zarr output.zarr --verbose

GeoZarr datasets are typically larger than the input due to:
- Multiscale overviews (adds ~33% for 2 overview levels)
- Additional metadata
- Chunk alignment padding
Estimation:
# test: skip
# Rough estimate: output ≈ input_size * 1.4 (with 2 overview levels)
input_size_mb = 400  # e.g. one Sentinel-2 10m band
print(f"Estimated GeoZarr size: ~{input_size_mb * 1.4:.0f} MB")  # ~560 MB

- AWS S3: Full support
- S3-compatible: MinIO, DigitalOcean Spaces, etc.
- Google Cloud Storage: Via S3 compatibility mode
- Azure Blob Storage: Via S3 compatibility (limited)
# test: skip
# 1. Use appropriate region
os.environ['AWS_DEFAULT_REGION'] = 'us-west-2' # Close to your data
# 2. Configure multipart uploads
s3_config = {
    'config_kwargs': {
        'max_pool_connections': 50,
        'multipart_threshold': 64 * 1024 * 1024,  # 64MB
        'multipart_chunksize': 16 * 1024 * 1024   # 16MB
    }
}
# 3. Use VPC endpoints for EC2 instances
# 4. Consider S3 Transfer Acceleration for global access

Yes, the library supports mixed storage:
# test: skip
# Local input, S3 output
dt = xr.open_datatree("local_input.zarr", engine="zarr")
dt_geozarr = create_geozarr_dataset(
    dt_input=dt,
    groups=["/measurements/r10m"],
    output_path="s3://bucket/output.zarr"
)
# S3 input, local output
dt = xr.open_datatree("s3://bucket/input.zarr", engine="zarr")
dt_geozarr = create_geozarr_dataset(
    dt_input=dt,
    groups=["/measurements/r10m"],
    output_path="local_output.zarr"
)

# test: skip
# 1. Use built-in validation
from eopf_geozarr.cli import validate_command
import argparse
args = argparse.Namespace()
args.input_path = "output.zarr"
args.verbose = True
validate_command(args)
# 2. Manual checks
dt = xr.open_datatree("output.zarr", engine="zarr")
# Check multiscales metadata
print("Multiscales:", dt.attrs.get('multiscales', 'Missing'))
# Check overview levels
for level in ['0', '1', '2']:
    path = f"/measurements/r10m/{level}"
    if path in dt.groups:
        ds = dt[path].ds
        print(f"Level {level}: {dict(ds.dims)}")
# Check required attributes
ds = dt["/measurements/r10m/0"].ds
for var_name in ds.data_vars:
    var = ds[var_name]
    print(f"{var_name}: grid_mapping={var.attrs.get('grid_mapping', 'Missing')}")

- Check the error message - it usually indicates the specific issue
- Verify input data - ensure the EOPF dataset is complete and valid
- Check dependencies - ensure all required libraries are up to date
- Try with verbose logging - get more detailed error information
- Report issues - if it seems like a bug, please report it
# test: skip
# Load both datasets
dt_input = xr.open_datatree("input.zarr", engine="zarr")
dt_output = xr.open_datatree("output.zarr", engine="zarr")
# Compare native resolution data
ds_input = dt_input["/measurements/r10m"].ds
ds_output = dt_output["/measurements/r10m/0"].ds
# Check data values (should be identical)
import numpy as np
for band in ["b02", "b03", "b04"]:
    if band in ds_input and band in ds_output:
        diff = np.abs(ds_input[band].values - ds_output[band].values)
        print(f"{band} max difference: {diff.max()}")
# Should be 0 or very close to 0

Yes, you can create STAC items for GeoZarr datasets. See the Examples section for detailed code.
The library works well in Jupyter environments. See Examples for interactive visualization patterns.
Absolutely! The library is designed for integration:
# test: skip
# Example pipeline integration
def process_sentinel2_scene(input_path: str, output_path: str):
    """Process a single Sentinel-2 scene to GeoZarr."""
    try:
        dt = xr.open_datatree(input_path, engine="zarr")
        dt_geozarr = create_geozarr_dataset(
            dt_input=dt,
            groups=["/measurements/r10m", "/measurements/r20m"],
            output_path=output_path,
            spatial_chunk=4096
        )
        return True, "Success"
    except Exception as e:
        return False, str(e)
# Use in batch processing
results = []
for scene in scene_list:
    success, message = process_sentinel2_scene(scene.input, scene.output)
    results.append((scene.id, success, message))

- Documentation: Check the User Guide and API Reference
- Examples: See Examples for common use cases
- GitHub Issues: Report bugs or request features at the GitHub repository
- Community: Join discussions in the GeoZarr community
When reporting issues, please include:
- Version information:

  # test: skip
  eopf-geozarr --version
  python --version
  pip list | grep -E "(xarray|zarr|dask)"

- Error message: Full traceback if available
- Minimal example: Code that reproduces the issue
- Environment: OS, Python version, installation method
- Data information: Dataset type, size, structure (if shareable)
Contributions are welcome! See the project repository for contribution guidelines. Areas where help is needed:
- Additional satellite mission support
- Performance optimizations
- Documentation improvements
- Test coverage expansion
- Bug fixes and feature enhancements