Commit b117364
Add Sentinel-2 Optimization Module with CLI and Data Processing Enhancements (#58)
Squashed commits:

* feat: add sharding support for GeoZarr conversion and CLI
* update launch configurations for GeoZarr conversion with new data sources and adjusted parameters
* feat: enable sharding in GeoZarr conversion launch configuration
* fix: update sharding codec handling in _create_sharded_encoding function
* refactor: streamline sharding configuration in _create_geozarr_encoding function
* feat: enhance sharding logic in _create_geozarr_encoding and add _calculate_shard_dimension utility
* feat: improve sharding configuration and validation in _create_geozarr_encoding
* fix: refine shard dimension calculation and improve divisor check in utility functions
* Add dataset tree structure and test script for sharding fix
  - Introduced a new dataset tree structure for Sentinel-2 data, detailing conditions, quality, and measurements.
  - Added a comprehensive test script to verify the sharding fix for GeoZarr conversion.
  - Implemented tests for shard dimension calculations and encoding creation with sharding enabled/disabled.
  - Enhanced output for better debugging and validation of shard dimensions against chunk dimensions.
* feat: enable sharding in Dask cluster setup and enhance chunking logic for sharded variables
* Add Sentinel-2 Optimization Module with CLI Integration and Data Processing
  - Created the `s2_optimization` module for optimizing Sentinel-2 Zarr datasets.
  - Implemented CLI commands for converting Sentinel-2 datasets to optimized structures.
  - Developed band mapping and resolution definitions for Sentinel-2 optimization.
  - Added the `S2OptimizedConverter` class for handling the conversion process.
  - Implemented data consolidation logic to reorganize the Sentinel-2 structure.
  - Created multiscale pyramid generation for optimized data.
  - Added downsampling operations for various data types (reflectance, classification, quality masks).
  - Implemented validation logic for optimized Sentinel-2 datasets.
  - Developed unit tests for band mapping, converter functionality, and resampling operations.
* feat: enhance S2 data consolidator with comprehensive extraction methods and testing framework
* Add comprehensive tests for S2MultiscalePyramid class
  - Implement unit tests for initialization, pyramid levels structure, chunk alignment, and shard dimension calculations.
  - Create tests for encoding generation, dataset writing, and level dataset creation with various resolutions.
  - Include integration tests for realistic measurements data and edge-case handling.
  - Ensure coverage for time separation logic and coordinate preservation during processing.
* feat: simplify chunk alignment and sharding logic in S2MultiscalePyramid
* feat: integrate S2 optimization commands into CLI and enhance converter functionality
* feat: add S2L2A optimized conversion command to CLI and update launch configuration
* feat: enhance S2 converter and multiscale pyramid with optimized encoding and rechunking
* feat: enhance sharding logic to ensure compatibility with chunk dimensions in S2MultiscalePyramid
* feat: add downsampling for 10m data and adjust dataset creation for levels 3+
* feat: add support for Dask cluster in S2 optimization commands and enhance progress tracking for Zarr writes
* feat: add compression level option for GeoZarr conversion
* feat: implement Dask parallelization for multiscale pyramid creation and downsampling
* feat: enhance multiscale pyramid creation with streaming Dask parallelization and improved memory management
* feat: configure Dask client to use 3 workers with 8GB memory each for improved parallel processing
* fix: update import path for geozarr functions in S2OptimizedConverter
* feat: refactor multiscales metadata handling and root consolidation in S2OptimizedConverter
* feat: add comprehensive unit tests for S2OptimizedConverter and related functionalities
* feat: implement geographic metadata writing in S2MultiscalePyramid and add corresponding unit tests
* feat: skip duplicate variables during downsampling in S2MultiscalePyramid
* feat: enhance CRS handling by adding grid mapping variable to dataset attributes
* feat: add grid mapping variable writing for datasets in S2MultiscalePyramid
* feat: skip already present variables during downsampling in S2MultiscalePyramid
* feat: reduce memory limit for Dask client to 4GB and add geographic metadata writing in S2MultiscalePyramid
* Refactor test cases and improve code formatting in S2 resampling tests and sharding fix
  - Reorganized import statements and improved code formatting for better readability in `test_s2_resampling.py`.
  - Updated sample data creation functions to use consistent array formatting and improved attribute handling.
  - Enhanced assertions in tests to ensure clarity and consistency.
  - Improved test output messages in `test_sharding_fix.py` for better debugging and understanding of test results.
  - Ensured that shard dimensions are properly calculated and validated against chunk dimensions in the sharding tests.
* feat: update memory limit for Dask client to 8GB and adjust spatial chunk size to 256 in S2MultiscalePyramid
* feat: add new CLI command for converting to GeoZarr S2L2A optimized format with sharding support
* feat: implement batched parallel downsampling for S2 datasets and improve classification downsampling method
* fix: update measurement group keys and enhance dataset loading with decoding options
* feat: add streaming support for multiscale pyramid creation in S2 converter
* feat: add --enable-streaming option for experimental streaming mode in S2 optimization command
* fix: avoid passing coordinates in lazy dataset creation to prevent alignment issues
* feat: implement Zarr v3 compatible encoding for optimized datasets
* fix: enhance measurements group writing by consolidating metadata and improving Zarr group handling
* feat: enhance streaming write with advanced chunking and sharding support
* feat: enhance encoding for streaming writes with advanced chunking and sharding support
* fix: improve root-level metadata consolidation with proper Zarr group creation and linking
* feat: add streaming support to S2 optimized converter and update measurements group handling
* fix: change root Zarr group creation mode from 'w' to 'a' for appending data
* refactor: streamline Zarr group handling and metadata consolidation in S2 converter
* fix: streamline root Zarr group creation by removing existence check and ensuring proper attributes are set
* fix: correct multiscales attribute assignment and update group prefix handling
* refactor: replace os.path.exists with fs_utils.path_exists for level path check
* feat: add downsampled coordinates creation for multiscale pyramid levels
* fix: update launch configuration for S2A MSIL2A dataset and adjust grid mapping attributes in streaming pyramid creation
* Refactor downsample factor calculation in S2StreamingMultiscalePyramid
  Updated the downsample factor calculation to use resolution ratios from pyramid_levels. This change improves clarity by explicitly referencing the resolutions of level 2 and the target level, ensuring accurate downsampling based on the defined pyramid structure.
* streaming as default
* refactor: update S2 optimization process to preserve original data structure and enhance multiscale pyramid creation
* refactor: enhance multiscale creation by preserving all original groups and improving empty group handling
* refactor: enhance group writing by preserving original chunking and encoding for non-measurement groups
* refactor: preserve original chunking during dataset writing by rechunking variables individually
* refactor: enhance downsampling process by organizing resolution groups and creating from coarsest available resolution
* refactor: improve error handling and verbosity in downsampling process for multiscale pyramid creation
* refactor: update band mapping for Sentinel-2 by adding 'b10' to native bands and quality data
* refactor: update tile dimensions calculation and enhance multiscales metadata handling in S2 multiscale pyramid creation
* refactor: simplify multiscales metadata addition by removing unnecessary datatree loading and streamline dataset writing return
* refactor: simplify variable naming in multiscale pyramid creation and remove unused downsampling operations
* refactor: streamline zarr group creation and multiscales metadata handling in S2 converter
* refactor: change Zarr write mode from 'a' to 'r+' in S2 converter and multiscale classes
* refactor: change Zarr write mode from 'r+' to 'a' and simplify DataTree initialization in S2 converter and multiscale classes
* fix: correct parameter name from 'modea' to 'mode' in DataTree zarr writing
* feat: add missing parent groups creation in root-level metadata consolidation
* feat: enhance root-level group creation by identifying and creating missing intermediary groups in Zarr structure
* fix: correct parameter name from 'zqarr_format' to 'zarr_format' in S2OptimizedConverter
* fix: store result of multiscales metadata addition in processed_groups
* fix: update NATIVE_BANDS to include 'b10' and adjust pyramid levels count in tests
* fix: update coordinate creation in downsampling for consistency and improve test for geo metadata integration with level creation
* refactor: remove unused fixture and update tests for pyramid levels and chunk dimensions
* Remove Sentinel-2 Zarr Conversion Optimization Plan and associated test script
  - Deleted the comprehensive optimization plan for the Sentinel-2 Zarr conversion, which included details on the current state, proposed structure, technical specifications, implementation plan, and expected benefits.
  - Removed the test script for verifying the sharding fix in GeoZarr conversion, which included tests for shard dimensions and encoding creation.
* Implement feature X to enhance user experience and fix bug Y in module Z
* delete: remove dataset_tree_simplified.txt as it is no longer needed
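Several of the commits above deal with computing shard dimensions that are compatible with chunk dimensions: Zarr v3 sharding requires each shard's shape to be a whole multiple of the inner chunk shape. The commit log names a `_calculate_shard_dimension` utility; the sketch below is only an illustration of that divisibility constraint, not the repository's actual implementation.

```python
def calculate_shard_dimension(dim_size: int, chunk_size: int) -> int:
    """Return the largest shard extent <= dim_size that is a whole
    multiple of chunk_size, so the shard is evenly divisible into
    inner chunks as Zarr v3 sharding requires."""
    if chunk_size >= dim_size:
        # Array smaller than one chunk: a single chunk per shard
        return chunk_size
    whole_chunks = dim_size // chunk_size
    return chunk_size * whole_chunks

# e.g. a 10980-pixel Sentinel-2 axis with 256-pixel chunks:
# 10980 // 256 = 42 whole chunks, so the shard spans 42 * 256 = 10752 pixels
```

Any remainder pixels beyond the last whole chunk fall into a final, smaller shard, which is why the divisor check called out in the commit log matters.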
1 parent 66fc5ac commit b117364

17 files changed: 3558 additions, 5 deletions

.vscode/launch.json (30 additions, 1 deletion)

@@ -156,6 +156,34 @@
                 "AWS_ENDPOINT_URL": "https://s3.de.io.cloud.ovh.net/"
             },
 
+        },
+        {
+            // eopf_geozarr convert https://objectstore.eodc.eu:2222/e05ab01a9d56408d82ac32d69a5aae2a:sample-data/tutorial_data/cpm_v253/S2B_MSIL1C_20250113T103309_N0511_R108_T32TLQ_20250113T122458.zarr /tmp/tmp7mmjkjk3/s2b_subset_test.zarr --groups /measurements/reflectance/r10m --spatial-chunk 512 --min-dimension 128 --tile-width 256 --max-retries 2 --verbose
+            "name": "Convert to GeoZarr S2L2A Optimized (S3)",
+            "type": "debugpy",
+            "request": "launch",
+            "module": "eopf_geozarr",
+            "args": [
+                "convert-s2-optimized",
+                "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:202509-s02msil2a/08/products/cpm_v256/S2A_MSIL2A_20250908T100041_N0511_R122_T32TQM_20250908T115116.zarr",
+                // "s3://esa-zarr-sentinel-explorer-fra/tests-output/sentinel-2-l2a-opt/S2A_MSIL2A_20250908T100041_N0511_R122_T32TQM_20250908T115116.zarr",
+                "./tests-output/eopf_geozarr/s2l2_optimized.zarr",
+                "--spatial-chunk", "1024",
+                "--compression-level", "5",
+                "--enable-sharding",
+                "--dask-cluster",
+                "--verbose"
+            ],
+            "cwd": "${workspaceFolder}",
+            "justMyCode": false,
+            "console": "integratedTerminal",
+            "env": {
+                "PYTHONPATH": "${workspaceFolder}/.venv/bin",
+                "AWS_PROFILE": "eopf-explorer",
+                "AWS_DEFAULT_REGION": "de",
+                "AWS_ENDPOINT_URL": "https://s3.de.io.cloud.ovh.net/"
+            },
+
         },
         {
             "name": "Convert to GeoZarr Sentinel-1 GRD (Local)",

@@ -261,7 +289,8 @@
             "module": "eopf_geozarr",
             "args": [
                 "info",
-                "s3://esa-zarr-sentinel-explorer-fra/tests-output/sentinel-2-l2a/S2A_MSIL2A_20250704T094051_N0511_R036_T33SWB_20250704T115824.zarr",
+                // "s3://esa-zarr-sentinel-explorer-fra/tests-output/sentinel-2-l2a/S2A_MSIL2A_20250704T094051_N0511_R036_T33SWB_20250704T115824.zarr",
+                "s3://esa-zarr-sentinel-explorer-fra/tests-output/sentinel-2-l2a-opt/S2A_MSIL2A_20250908T100041_N0511_R122_T32TQM_20250908T115116.zarr",
                 "--verbose",
                 "--html-output", "dataset_info.html"
             ],

src/eopf_geozarr/cli.py (111 additions, 1 deletion)

@@ -13,6 +13,8 @@
 
 import xarray as xr
 
+from eopf_geozarr.s2_optimization.s2_converter import convert_s2_optimized
+
 from . import create_geozarr_dataset
 from .conversion.fs_utils import (
     get_s3_credentials_info,

@@ -52,7 +54,7 @@ def setup_dask_cluster(enable_dask: bool, verbose: bool = False) -> Any | None:
         from dask.distributed import Client
 
         # Set up local cluster with high memory limits
-        client = Client(memory_limit="8GB")  # set up local cluster
+        client = Client(n_workers=3, memory_limit="8GB")  # local cluster: 3 workers, 8GB memory each
         # client = Client()  # set up local cluster
 
         if verbose:

@@ -1145,9 +1147,117 @@ def create_parser() -> argparse.ArgumentParser:
     )
     validate_parser.set_defaults(func=validate_command)
 
+    # Add S2 optimization commands
+    add_s2_optimization_commands(subparsers)
+
     return parser
 
 
+def add_s2_optimization_commands(subparsers):
+    """Add S2 optimization commands to the CLI parser."""
+    # Convert S2 optimized command
+    s2_parser = subparsers.add_parser(
+        "convert-s2-optimized", help="Convert Sentinel-2 dataset to optimized structure"
+    )
+    s2_parser.add_argument(
+        "input_path", type=str, help="Path to input Sentinel-2 dataset (Zarr format)"
+    )
+    s2_parser.add_argument(
+        "output_path", type=str, help="Path for output optimized dataset"
+    )
+    s2_parser.add_argument(
+        "--spatial-chunk",
+        type=int,
+        default=256,
+        help="Spatial chunk size (default: 256)",
+    )
+    s2_parser.add_argument(
+        "--enable-sharding", action="store_true", help="Enable Zarr v3 sharding"
+    )
+    s2_parser.add_argument(
+        "--compression-level",
+        type=int,
+        default=3,
+        choices=range(1, 10),
+        help="Compression level 1-9 (default: 3)",
+    )
+    s2_parser.add_argument(
+        "--skip-geometry", action="store_true", help="Skip creating geometry group"
+    )
+    s2_parser.add_argument(
+        "--skip-meteorology",
+        action="store_true",
+        help="Skip creating meteorology group",
+    )
+    s2_parser.add_argument(
+        "--skip-validation", action="store_true", help="Skip output validation"
+    )
+    s2_parser.add_argument(
+        "--verbose", action="store_true", help="Enable verbose output"
+    )
+    s2_parser.add_argument(
+        "--dask-cluster",
+        action="store_true",
+        help="Start a local dask cluster for parallel processing and progress bars",
+    )
+    s2_parser.set_defaults(func=convert_s2_optimized_command)
+
+
+def convert_s2_optimized_command(args):
+    """Execute S2 optimized conversion command."""
+    # Set up dask cluster if requested
+    dask_client = setup_dask_cluster(
+        enable_dask=getattr(args, "dask_cluster", False), verbose=args.verbose
+    )
+
+    try:
+        # Load input dataset
+        print(f"Loading Sentinel-2 dataset from: {args.input_path}")
+        storage_options = get_storage_options(str(args.input_path))
+        dt_input = xr.open_datatree(
+            str(args.input_path),
+            engine="zarr",
+            chunks="auto",
+            storage_options=storage_options,
+        )
+
+        # Convert
+        dt_optimized = convert_s2_optimized(
+            dt_input=dt_input,
+            output_path=args.output_path,
+            enable_sharding=args.enable_sharding,
+            spatial_chunk=args.spatial_chunk,
+            compression_level=args.compression_level,
+            create_geometry_group=not args.skip_geometry,
+            create_meteorology_group=not args.skip_meteorology,
+            validate_output=not args.skip_validation,
+            verbose=args.verbose,
+        )
+
+        print(f"✅ S2 optimization completed: {args.output_path}")
+        return 0
+
+    except Exception as e:
+        print(f"❌ Error during S2 optimization: {e}")
+        if args.verbose:
+            import traceback
+
+            traceback.print_exc()
+        return 1
+    finally:
+        # Clean up dask client if it was created
+        if dask_client is not None:
+            try:
+                if hasattr(dask_client, "close"):
+                    dask_client.close()
+                    if args.verbose:
+                        print("🔄 Dask cluster closed")
+            except Exception as e:
+                if args.verbose:
+                    print(f"Warning: Error closing dask cluster: {e}")
+
+
 def main() -> None:
     """Execute main entry point for the CLI."""
     parser = create_parser()
src/eopf_geozarr/conversion/geozarr.py (2 additions, 2 deletions)

@@ -853,8 +853,8 @@ def create_native_crs_tile_matrix_set(
     scale_denominator = cell_size * 3779.5275
 
     # Calculate matrix dimensions
-    tile_width = 256
-    tile_height = 256
+    tile_width = overview["chunks"][1][0] if "chunks" in overview else 256
+    tile_height = overview["chunks"][0][0] if "chunks" in overview else 256
     matrix_width = int(np.ceil(width / tile_width))
     matrix_height = int(np.ceil(height / tile_height))
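This change reads tile dimensions from the overview's chunk metadata instead of hardcoding 256. Note the index order implied by the diff: `chunks[0]` is the row (y/height) axis and `chunks[1]` the column (x/width) axis, and each entry is itself a sequence of per-chunk sizes, hence the trailing `[0]`. A standalone sketch of the surrounding calculation under that assumption (`matrix_dimensions` is a hypothetical wrapper, not a function in the repository):

```python
import math

def matrix_dimensions(width: int, height: int, overview: dict) -> tuple[int, int]:
    # chunks is (row_chunks, col_chunks), e.g. ((512,), (512,));
    # fall back to 256x256 tiles when no chunk info is present
    tile_width = overview["chunks"][1][0] if "chunks" in overview else 256
    tile_height = overview["chunks"][0][0] if "chunks" in overview else 256
    matrix_width = math.ceil(width / tile_width)
    matrix_height = math.ceil(height / tile_height)
    return matrix_width, matrix_height

# A 10980x10980 Sentinel-2 tile with 512-pixel chunks needs a 22x22 tile matrix
print(matrix_dimensions(10980, 10980, {"chunks": ((512,), (512,))}))
# (22, 22)
```

Deriving the tile size from the actual chunking keeps the tile matrix set aligned with how the data is stored, so a tile request maps to whole chunks rather than straddling chunk boundaries.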

New file (2 additions)

@@ -0,0 +1,2 @@
+# Sentinel-2 Optimization Module
+# This package contains tools for optimizing Sentinel-2 Zarr datasets.
New file (51 additions)

@@ -0,0 +1,51 @@
+"""
+Band mapping and resolution definitions for Sentinel-2 optimization.
+"""
+
+from dataclasses import dataclass
+from typing import Dict, List, Set
+
+
+@dataclass
+class BandInfo:
+    """Information about a spectral band."""
+
+    name: str
+    native_resolution: int  # meters
+    data_type: str
+    wavelength_center: float  # nanometers
+    wavelength_width: float  # nanometers
+
+
+# Native resolution definitions
+NATIVE_BANDS: Dict[int, List[str]] = {
+    10: ["b02", "b03", "b04", "b08"],  # Blue, Green, Red, NIR
+    20: ["b05", "b06", "b07", "b11", "b12", "b8a"],  # Red Edge, SWIR
+    60: ["b01", "b09", "b10"],  # Coastal, Water Vapor, Cirrus
+}
+
+# Complete band information
+BAND_INFO: Dict[str, BandInfo] = {
+    "b01": BandInfo("b01", 60, "uint16", 443, 21),  # Coastal aerosol
+    "b02": BandInfo("b02", 10, "uint16", 490, 66),  # Blue
+    "b03": BandInfo("b03", 10, "uint16", 560, 36),  # Green
+    "b04": BandInfo("b04", 10, "uint16", 665, 31),  # Red
+    "b05": BandInfo("b05", 20, "uint16", 705, 15),  # Red Edge 1
+    "b06": BandInfo("b06", 20, "uint16", 740, 15),  # Red Edge 2
+    "b07": BandInfo("b07", 20, "uint16", 783, 20),  # Red Edge 3
+    "b08": BandInfo("b08", 10, "uint16", 842, 106),  # NIR
+    "b8a": BandInfo("b8a", 20, "uint16", 865, 21),  # NIR Narrow
+    "b09": BandInfo("b09", 60, "uint16", 945, 20),  # Water Vapor
+    "b10": BandInfo("b10", 60, "uint16", 1375, 30),  # Cirrus
+    "b11": BandInfo("b11", 20, "uint16", 1614, 91),  # SWIR 1
+    "b12": BandInfo("b12", 20, "uint16", 2202, 175),  # SWIR 2
+}
+
+# Quality data mapping - defines which auxiliary data exists at which resolutions
+QUALITY_DATA_NATIVE: Dict[str, int] = {
+    "scl": 20,  # Scene Classification Layer - native 20m
+    "aot": 20,  # Aerosol Optical Thickness - native 20m
+    "wvp": 20,  # Water Vapor - native 20m
+    "cld": 20,  # Cloud probability - native 20m
+    "snw": 20,  # Snow probability - native 20m
+}
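Given these tables, a converter can group variables by native resolution before building pyramid levels. A small illustrative lookup follows, assuming the `NATIVE_BANDS` structure above; `native_resolution_of` is a hypothetical helper, not part of the module.

```python
# Mirror of the NATIVE_BANDS table from the new module
NATIVE_BANDS = {
    10: ["b02", "b03", "b04", "b08"],
    20: ["b05", "b06", "b07", "b11", "b12", "b8a"],
    60: ["b01", "b09", "b10"],
}

def native_resolution_of(band: str) -> int:
    """Return the native ground-sample distance in meters for a band name."""
    for resolution, bands in NATIVE_BANDS.items():
        if band in bands:
            return resolution
    raise KeyError(f"unknown band: {band}")

print(native_resolution_of("b8a"))  # 20
print(native_resolution_of("b10"))  # 60
```

Inverting the resolution-to-bands mapping this way is what lets downsampling start from the coarsest resolution at which a band natively exists, as described in the commit log.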
