Summary
Add follow-up benchmarks to better understand when kerchunk becomes preferable to the native NetCDF engine for our CMIP use case.
Goals
- Pin down the file-count crossover point where kerchunk starts to outperform the native NetCDF engine
- Test whether daily-frequency datasets change the result
- Validate expectations for repeated access and remote data access
Benchmark additions
1. File-count bins near the crossover
Test datasets in narrower file-count ranges (encoded as bin bounds in the sketch after this list):
- 25-49
- 50-99
- 100-149
- 150-199
- 200-299
- 300-499
- 500+
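A minimal sketch of how these bins might be encoded for the benchmark driver; the name FILE_COUNT_BINS and the (lower, upper) tuple structure are illustrative assumptions, not existing code:

```python
# Illustrative bin encoding for a hypothetical benchmark driver;
# None marks the open-ended upper bound of the 500+ bin.
FILE_COUNT_BINS = [
    (25, 49),
    (50, 99),
    (100, 149),
    (150, 199),
    (200, 299),
    (300, 499),
    (500, None),
]
```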
2. Higher-frequency datasets
Add daily-frequency datasets alongside the cases currently covered, to test whether higher time resolution shifts the crossover.
3. Repeated access
Measure both (see the timing sketch after this list):
- first open
- repeated open of the same dataset
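A rough timing sketch for this comparison, assuming xarray's open_mfdataset and placeholder paths; any OS page cache or library-level caching should show up as the gap between the two measurements:

```python
import glob
import time

import xarray as xr

# Hypothetical dataset location; substitute real CMIP paths.
paths = sorted(glob.glob("/path/to/dataset/*.nc"))

def time_open(paths, engine="netcdf4"):
    # Time a single open_mfdataset call end to end.
    t0 = time.perf_counter()
    ds = xr.open_mfdataset(paths, engine=engine)
    elapsed = time.perf_counter() - t0
    ds.close()
    return elapsed

# First open vs. repeated open of the same dataset.
first_open = time_open(paths)
repeated_open = time_open(paths)
print(f"first: {first_open:.2f}s, repeated: {repeated_open:.2f}s")
```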
4. Remote access
Run the same comparisons for remote data access.
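For reference, a sketch of what the two remote variants could look like. The kerchunk side uses the fsspec reference filesystem; the native side streams a file over HTTPS with h5netcdf, which accepts file-like objects. The URL and reference-file path are placeholders:

```python
import fsspec
import xarray as xr

# kerchunk: open via an fsspec reference filesystem pointing at remote
# NetCDF files. "combined.json" is a placeholder for a previously
# generated kerchunk reference file.
ds_kerchunk = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "combined.json",
            "remote_protocol": "https",
        },
    },
)

# native engine: stream one remote NetCDF file through fsspec.
# The URL is a placeholder; load() materializes the data before the
# file handle closes.
with fsspec.open("https://example.org/data/file.nc") as f:
    ds_netcdf = xr.open_dataset(f, engine="h5netcdf").load()
```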
Dataset sampling
Use 3 datasets per file-count bin for the initial benchmark pass to keep the batch job within a reasonable runtime. If results are noisy or the crossover remains unclear, follow up with additional samples in the most relevant bins.
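One way the per-bin sampling could look, assuming a hypothetical mapping from bin label to candidate datasets; the fixed seed keeps the sample reproducible between passes:

```python
import random

random.seed(0)  # reproducible sampling across benchmark runs

def sample_datasets(datasets_by_bin, n=3):
    # datasets_by_bin: {bin_label: [dataset ids]}; the structure is assumed.
    # Takes up to n datasets per bin, or all of them if the bin is small.
    return {
        label: random.sample(candidates, min(n, len(candidates)))
        for label, candidates in datasets_by_bin.items()
    }
```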
Operations to compare
- open
- load
- temporal average
- spatial average
- subset
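In xarray terms, these operations correspond roughly to the calls below; the glob path, variable name (tas), dimension names, and time range are placeholder assumptions:

```python
import xarray as xr

ds = xr.open_mfdataset("/path/to/dataset/*.nc")           # open
ds.load()                                                 # load into memory
temporal_avg = ds["tas"].mean(dim="time")                 # temporal average
spatial_avg = ds["tas"].mean(dim=["lat", "lon"])          # spatial average
subset = ds["tas"].sel(time=slice("2000-01", "2000-12"))  # subset
```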
Notes
Current results suggest:
- NetCDF is generally favored when data is colocated with compute at NERSC
- kerchunk becomes more attractive as file counts increase
- remote access still needs to be tested