
Follow-up benchmarks for kerchunk vs. NetCDF across file-count ranges, frequency, and access patterns #62

@tomvothecoder

Description

@tomvothecoder

Summary

Add follow-up benchmarks to better understand when kerchunk becomes preferable to the native NetCDF engine for our CMIP use case.

Goals

  • Refine the crossover point where kerchunk starts to outperform NetCDF
  • Test whether daily-frequency datasets change the result
  • Validate expectations for repeated access and remote data access

Benchmark additions

1. File-count bins near the crossover

Test datasets in narrower file-count ranges:

  • 25-49
  • 50-99
  • 100-149
  • 150-199
  • 200-299
  • 300-499
  • 500+
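The bin edges above can be encoded in a small helper so each benchmarked dataset is assigned to exactly one range. This is an illustrative sketch; the names `BIN_EDGES`, `BIN_LABELS`, and `file_count_bin` are not part of the benchmark suite.

```python
import bisect

# Lower edges of the file-count bins listed above; the last bin (500+) is open-ended.
BIN_EDGES = [25, 50, 100, 150, 200, 300, 500]
BIN_LABELS = ["25-49", "50-99", "100-149", "150-199", "200-299", "300-499", "500+"]


def file_count_bin(n_files):
    """Return the bin label for a dataset with n_files files, or None if below 25."""
    if n_files < BIN_EDGES[0]:
        return None
    # bisect_right counts how many lower edges are <= n_files; subtract 1 for the index.
    return BIN_LABELS[bisect.bisect_right(BIN_EDGES, n_files) - 1]
```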

2. Higher-frequency datasets

Add daily-frequency datasets in addition to the current cases.

3. Repeated access

Measure both:

  • first open
  • repeated open of the same dataset
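One way to separate the two measurements is a small timing harness that records the first (cold) call apart from the mean of later (warm) calls, which can benefit from OS page cache or library-level caching. A minimal sketch; `time_repeated` is a hypothetical helper, and in the real benchmark `open_fn` would wrap the NetCDF or kerchunk open call.

```python
import time


def time_repeated(open_fn, n_repeats=3):
    """Time a dataset-open callable: the first call (cold) separately
    from the mean of subsequent calls (warm)."""
    start = time.perf_counter()
    open_fn()
    first = time.perf_counter() - start

    repeats = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        open_fn()
        repeats.append(time.perf_counter() - start)

    return {
        "first_open_s": first,
        "repeat_open_mean_s": sum(repeats) / len(repeats),
    }
```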

4. Remote access

Run the same comparisons for remote data access.
This covers the case where data is not colocated with compute, which the current results have not yet tested.

Dataset sampling

Use 3 datasets per file-count bin for the initial benchmark pass to keep the batch job within a reasonable runtime. If results are noisy or the crossover remains unclear, follow up with additional samples in the most relevant bins.
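The per-bin sampling could be done deterministically (seeded) so the initial pass is reproducible and follow-up passes can add samples without reshuffling earlier picks. A sketch under that assumption; `sample_per_bin` and its arguments are illustrative, not an existing API.

```python
import random


def sample_per_bin(datasets_by_bin, k=3, seed=0):
    """Pick up to k datasets per file-count bin, deterministically via seed,
    so the batch job stays small and the selection is reproducible."""
    rng = random.Random(seed)
    sampled = {}
    for bin_label, datasets in datasets_by_bin.items():
        sampled[bin_label] = rng.sample(datasets, min(k, len(datasets)))
    return sampled
```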

Operations to compare

  • open
  • load
  • temporal average
  • spatial average
  • subset
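The reduction operations above can be sketched as plain array operations to show what each benchmark step computes; the actual runs would use xarray equivalents (e.g. a mean over the `time` dimension, a cosine-of-latitude weighted mean, and label- or index-based slicing). Everything below is a numpy stand-in with illustrative names, not the benchmark code itself.

```python
import numpy as np


def temporal_average(data):
    """Mean over the leading time axis."""
    return data.mean(axis=0)


def spatial_average(data, lat):
    """Cosine-of-latitude weighted mean over the trailing (lat, lon) axes."""
    weights = np.cos(np.deg2rad(lat))[:, np.newaxis]  # shape (n_lat, 1)
    weighted_sum = (data * weights).sum(axis=(-2, -1))
    total_weight = (weights * np.ones(data.shape[-2:])).sum()
    return weighted_sum / total_weight


def subset(data, t0, t1):
    """Slice a time range by index."""
    return data[t0:t1]
```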

Notes

Current results suggest:

  • the native NetCDF engine is generally favored when data is colocated with compute at NERSC
  • kerchunk becomes more attractive as file counts increase
  • remote access has not yet been tested
