Hi everyone,
I've been looking for an existing way to use zarr.sync.ProcessSynchronizer(path) with an S3 location as the path, but no luck.
My scenario: I have a Lambda function that listens to S3 events and writes NetCDF files to a Zarr store (on S3). Each Lambda invocation processes one NetCDF file.
Because Lambda invocations run independently and concurrently, 10 newly uploaded files trigger 10 separate processes that try to write to the Zarr store at almost the same time, and I am seeing data corruption.
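For illustration, here is a minimal sketch of how this kind of race can lose data, assuming two writers touch the same chunk and each does a read-modify-write of the whole chunk. The dict standing in for S3, the key name, and the helper functions are all hypothetical; this is a simulation, not the actual Lambda code.

```python
# Simulated lost update. A plain dict stands in for the object store (hypothetical).
store = {"chunk/0": [0, 0, 0, 0]}

def read_chunk(key):
    # Return a copy, as a writer would hold after downloading the chunk.
    return list(store[key])

def write_chunk(key, values):
    # Overwrite the whole object; there is no partial update.
    store[key] = list(values)

# Both writers read the chunk before either one writes it back.
a = read_chunk("chunk/0")
b = read_chunk("chunk/0")

a[0] = 1  # writer A fills the first element
b[3] = 9  # writer B fills the last element

write_chunk("chunk/0", a)
write_chunk("chunk/0", b)  # silently overwrites writer A's update

print(store["chunk/0"])  # [0, 0, 0, 9]
```

Writer A's value is gone because writer B wrote back a copy that never saw it. A lock around the read-modify-write is what prevents this.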
Passing zarr.sync.ProcessSynchronizer() to xarray.Dataset.to_zarr(synchronizer=...) with a DirectoryStore seems to solve this write consistency issue.
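For reference, a minimal sketch of that local setup, assuming zarr v2 (where zarr.sync.ProcessSynchronizer is available and relies on the fasteners package) and a store in a local directory. The function names and the ".sync" naming convention are hypothetical, chosen only for this example.

```python
import os

def sync_path_for(store_path):
    # Hypothetical convention: keep lock files next to the store,
    # e.g. data/example.zarr -> data/example.sync
    root, _ = os.path.splitext(store_path.rstrip("/"))
    return root + ".sync"

def write_with_process_lock(ds, store_path):
    # Imported lazily so sync_path_for works without zarr installed.
    # Assumes zarr v2; ProcessSynchronizer uses file locks via fasteners.
    from zarr.sync import ProcessSynchronizer

    synchronizer = ProcessSynchronizer(sync_path_for(store_path))
    # ds is an xarray.Dataset; store_path is a local directory, not S3.
    ds.to_zarr(store_path, mode="a", synchronizer=synchronizer)
```

The synchronizer path has to be a directory on a file system that every writing process can see, since the locks are files. That is why this works for processes on one machine and why an S3 path does not fit directly.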
But keeping the Zarr store on S3 is important to us, and a cloud-optimised format like Zarr should be able to fully support S3. So I wonder whether this is a bug, a missing feature, or something I just don't know about yet.
Please advise.
Thanks everyone.