Commit 0645442

Sharing of representative datasets via compressed Zarr zips (#2570)
1 parent 95eed0a commit 0645442

File tree

4 files changed: +184, -32 lines

.github/ISSUE_TEMPLATE/02_bug.yaml

Lines changed: 1 addition & 1 deletion

````diff
@@ -17,7 +17,7 @@ body:
   - type: "textarea"
     attributes:
       label: "Code sample"
-      description: "If relevant, please provide a code example where this bug is shown as well as any error message. A [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) is preffered as it makes it much easier for developers to identify the cause of the bug. This also allows them quickly determine whether the problem is with your code or with Parcels itself. If you want support on a specific dataset, please [follow our instructions on how to share dataset metadata](https://docs.parcels-code.org/en/main/development/posting-issues.html)"
+      description: "If relevant, please provide a code example where this bug is shown as well as any error message. A [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) is preffered as it makes it much easier for developers to identify the cause of the bug. This also allows them quickly determine whether the problem is with your code or with Parcels itself. If you want support on a specific dataset, please [follow our instructions on how to share representative datasets](https://docs.parcels-code.org/en/main/development/posting-issues.html)"
      value: |
        ```python
        # Paste your code within this block
````

docs/development/posting-issues.md

Lines changed: 63 additions & 18 deletions

````diff
@@ -20,51 +20,96 @@ Following these templates provides structure and ensures that we have all the ne
 Parcels is designed to work with a large range of input datasets.
 
 When extending support for various input datasets, or trying to debug problems
-that only occur with specific datasets, having the dataset metadata is very valuable.
+that only occur with specific datasets, having access to your dataset (or a
+close representation of it) is very valuable.
 
-This metadata could include information such as:
+This could include information such as:
 
 - the nature of the array variables (e.g., via CF compliant metadata)
 - descriptions about the origin of the dataset, or additional comments
 - the shapes and data types of the arrays
+- the grid topology (coordinates and key variables)
 
 This also allows us to see if your metadata is broken/non-compliant with standards - where we can then suggest fixes for you (and maybe we can tell the data provider!).
 Since version 4 of Parcels we rely much more on metadata to discover information about your input data.
 
-Sharing this metadata often provides enough debugging information to solve your problem, instead of having to share a whole dataset.
+Sharing a compact representation of your dataset often provides enough information to solve your problem, without having to share the full dataset (which may be very large or contain sensitive data).
 
-Sharing dataset metadata is made easy in Parcels.
+Parcels makes this easy by replacing irrelevant array data with zeros and saving the result as a compressed Zarr zip store, which is typically small enough to attach directly to a GitHub issue.
 
 ### Step 1. Users
 
 As a user with access to your dataset, you would do:
 
 ```{code-cell}
-import json
+:tags: [hide-cell]
 
+# Generate an example dataset to zip. The user would use their own.
 import xarray as xr
+from parcels._datasets.structured.generic import datasets
+datasets['ds_2d_left'].to_netcdf("my_dataset.nc")
+```
+
+```{code-cell}
+import os
+
+import xarray as xr
+import zarr
+
+from parcels._datasets.utils import replace_arrays_with_zeros
+
+# load your dataset
+ds = xr.open_dataset("my_dataset.nc")  # or xr.open_zarr(...), etc.
+
+# Replace all data arrays with zeros, keeping coordinate metadata.
+# This keeps array shapes and metadata while removing actual data.
+#
+# You can customise `except_for` to also retain actual values for specific variables:
+#   except_for='coords' — keep coordinate arrays (useful for grid topology)
+#   except_for=['lon', 'lat'] — keep a specific list of variables
+#   except_for=None — remove all arrays (useful to know about dtypes, structure, and metadata). This is the default for the function.
+ds_trimmed = replace_arrays_with_zeros(ds, except_for=None)
 
-# defining an example dataset to illustrate
-# (you would use `xr.open_dataset(...)` instead)
-ds = xr.Dataset(attrs={"description": "my dataset"})
+# Save to a zipped Zarr store - replace `my_dataset` with a more informative name
+with zarr.storage.ZipStore("my_dataset.zip", mode='w') as store:
+    ds_trimmed.to_zarr(store)
 
-output_file = "my_dataset.json"
-with open(output_file, "w") as f:
-    json.dump(ds.to_dict(data=False), f)  # write your dataset to a JSON excluding array data
+size_mb_original = os.path.getsize("my_dataset.nc") / 1e6
+print(f"Original size: {size_mb_original:.1f} MB")
+
+# Check the file size (aim for < 25 MB so it can be attached to a GitHub issue)
+size_mb = os.path.getsize("my_dataset.zip") / 1e6
+print(f"Zip store size: {size_mb:.1f} MB")
 ```
 
-Then attach the JSON file written above alongside your issue
+Then attach the zip file written above alongside your issue.
+
+If the file is larger than 25 MB, try passing `except_for=None` (the default)
+to ensure all arrays are zeroed out. If it is still too large, consider
+subsetting your dataset to a smaller spatial or temporal region before saving.
 
 ### Step 2. Maintainers and developers
 
-As developers looking to inspect the metadata, we would do:
+As developers looking to inspect the dataset, we would do:
+
+```{code-cell}
+import xarray as xr
+import zarr
+
+ds = xr.open_zarr(zarr.storage.ZipStore("my_dataset.zip", mode="r"))
+ds
+```
 
 ```{code-cell}
-from parcels._datasets.utils import from_xarray_dataset_dict
+:tags: [hide-cell]
+
+# Cleanup files in doc build process
+del ds
+from pathlib import Path
+Path("my_dataset.zip").unlink()
+Path("my_dataset.nc").unlink()
 
-with open(output_file) as f:
-    d = json.load(f)
-ds = from_xarray_dataset_dict(d)
 ```
 
-From there we can take a look the metadata of your dataset!
+From there we can take a look at the structure, metadata, and grid topology of your dataset!
+This also makes it straightforward for us to add this dataset to our test suite.
````
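The "subset before saving" advice in the docs above can be sketched with xarray's `isel`. This is a hypothetical illustration: the toy dataset and its dimension/variable names ("time", "lat", "lon", "U") are invented for this example and are not from the Parcels docs.

```python
import numpy as np
import xarray as xr

# Toy dataset standing in for a user's large input file
# (names and sizes are made up for illustration).
ds = xr.Dataset(
    {"U": (("time", "lat", "lon"), np.ones((100, 50, 60)))},
    coords={
        "time": np.arange(100),
        "lat": np.linspace(-80.0, 80.0, 50),
        "lon": np.linspace(-180.0, 180.0, 60),
    },
)

# Subset to a smaller temporal and spatial region before zeroing/saving;
# this shrinks the resulting zip store beyond what zeroing alone achieves.
ds_small = ds.isel(time=slice(0, 10), lat=slice(0, 20), lon=slice(0, 20))
print(dict(ds_small.sizes))  # {'time': 10, 'lat': 20, 'lon': 20}
```

Using positional `isel` slices keeps the example backend-agnostic; `sel` with coordinate labels would work equally well when a geographic region of interest is known.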

src/parcels/_datasets/utils.py

Lines changed: 41 additions & 13 deletions

```diff
@@ -1,5 +1,5 @@
-import copy
-from typing import Any
+from collections.abc import Hashable
+from typing import Any, Literal
 
 import numpy as np
 import xarray as xr
@@ -186,21 +186,49 @@ def verbose_print(*args, **kwargs):
     verbose_print("=" * 30 + " End of Comparison " + "=" * 30)
 
 
-def from_xarray_dataset_dict(d) -> xr.Dataset:
-    """Reconstruct a dataset with zero data from the output of ``xarray.Dataset.to_dict(data=False)``.
+def replace_arrays_with_zeros(
+    ds: xr.Dataset, except_for: Literal["coords"] | list[Hashable] | None = None
+) -> xr.Dataset:
+    """Replace datavars in the xarray dataset with zeros, except for some.
 
-    Useful in issues helping users debug fieldsets - sharing dataset schemas with associated metadata
-    without sharing the data itself.
+    Parameters
+    ----------
+    ds : xr.Dataset
+        The dataset whose arrays will be replaced with zeros.
+    except_for : "coords" or list of Hashable or None, optional
+        Controls which arrays are preserved:
 
-    Example
+        - ``None``: Replace all arrays with zeros.
+        - ``"coords"``: Replace all arrays with zeros except the non-index coords.
+        - list: Provide a list of variable/coord names to exclude from zeroing.
+
+    Returns
     -------
-    >>> import xarray as xr
-    >>> from parcels._datasets.structured.generic import datasets
-    >>> ds = datasets['ds_2d_left']
-    >>> d = ds.to_dict(data=False)
-    >>> ds2 = from_xarray_dataset_dict(d)
+    xr.Dataset
+        A copy of ``ds`` with the selected arrays replaced by zeros.
     """
-    return xr.Dataset.from_dict(_fill_with_dummy_data(copy.deepcopy(d)))
+    import dask.array as da
+
+    if except_for is None:
+        except_for = []
+    if except_for == "coords":
+        except_for = list(ds.coords.keys())
+
+    ds = ds.copy()
+    ds_keys = set(ds.data_vars) | set(ds.coords)
+    for k in except_for:
+        if k not in ds_keys:
+            raise ValueError(f"Item {k!r} in `except_for` not a valid item in dataset. Got {except_for=!r}.")
+
+    for k in ds_keys - set(except_for):
+        data = da.zeros_like(ds[k].data)
+        try:
+            ds[k].data = data
+        except ValueError:
+            # Cannot assign to dimension coordinate, leave as is
+            pass
+
+    return ds
 
 
 def _fill_with_dummy_data(d: dict[str, dict]):
```
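To illustrate what the new `replace_arrays_with_zeros` does without the Parcels or dask dependencies, here is a simplified standalone sketch using plain numpy. Both `replace_arrays_with_zeros_sketch` and the toy dataset are invented for this example; the shipped function uses `dask.array.zeros_like` so the zeroed arrays stay lazy and compress well.

```python
import numpy as np
import xarray as xr


def replace_arrays_with_zeros_sketch(ds, except_for=None):
    """Simplified numpy-based sketch of the function added in this commit."""
    if except_for is None:
        except_for = []
    if except_for == "coords":
        except_for = list(ds.coords.keys())

    ds = ds.copy()  # shallow copy: originals keep their arrays
    ds_keys = set(ds.data_vars) | set(ds.coords)
    for k in except_for:
        if k not in ds_keys:
            raise ValueError(f"Item {k!r} in `except_for` not a valid item in dataset.")

    for k in ds_keys - set(except_for):
        try:
            ds[k].data = np.zeros_like(ds[k].data)
        except ValueError:
            pass  # dimension coordinates cannot be reassigned; leave as is
    return ds


# Tiny dataset with one data variable and one non-index coordinate.
ds = xr.Dataset(
    {"U": (("y", "x"), np.ones((2, 3)))},
    coords={"lon": (("y", "x"), np.arange(6.0).reshape(2, 3))},
)
out = replace_arrays_with_zeros_sketch(ds, except_for=["lon"])
print(out["U"].values.sum())    # 0.0  -> data variable zeroed
print(out["lon"].values.sum())  # 15.0 -> listed coord preserved
```

The try/except around the `.data` assignment mirrors the shipped code: xarray refuses in-place data assignment on dimension (index) coordinates, so those are silently left intact.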

tests/datasets/test_utils.py

Lines changed: 79 additions & 0 deletions

```diff
@@ -0,0 +1,79 @@
+import numpy as np
+import pytest
+import xarray as xr
+
+from parcels._datasets import utils
+from parcels._datasets.structured.generic import datasets
+
+
+@pytest.fixture
+def nonzero_ds():
+    """Small dataset with nonzero data_vars and non-index coords for replace_arrays_with_zeros tests.
+
+    Uses 2D lon/lat as coords so they are regular (non-index) variables that can be zeroed.
+    """
+    import dask.array as da
+
+    lon = np.array([[1.0, 2.0, 3.0, 4.0]] * 3)
+    lat = np.array([[10.0] * 4, [20.0] * 4, [30.0] * 4])
+    return xr.Dataset(
+        {
+            "U": (["y", "x"], da.from_array(np.ones((3, 4)), chunks=-1)),
+            "V": (["y", "x"], da.from_array(np.full((3, 4), 2.0), chunks=-1)),
+        },
+        coords={
+            "lon": (["y", "x"], da.from_array(lon, chunks=-1)),
+            "lat": (["y", "x"], da.from_array(lat, chunks=-1)),
+        },
+    )
+
+
+@pytest.mark.parametrize("ds", [pytest.param(v, id=k) for k, v in datasets.items()])
+@pytest.mark.parametrize("except_for", [None, "coords"])
+def test_replace_arrays_with_zeros(ds, except_for):
+    # make sure doesn't error with range of datasets
+    utils.replace_arrays_with_zeros(ds, except_for=except_for)
+
+
+def test_replace_arrays_with_zeros_none(nonzero_ds):
+    """except_for=None: all data_vars and coords replaced with zeros."""
+    result = utils.replace_arrays_with_zeros(nonzero_ds, except_for=None)
+
+    for k in set(result.data_vars) | set(result.coords):
+        assert np.all(result[k].values == 0), f"{k!r} should be zero"
+
+
+def test_replace_arrays_with_zeros_coords(nonzero_ds):
+    """except_for='coords': data_vars zeroed, coords preserved."""
+    result = utils.replace_arrays_with_zeros(nonzero_ds, except_for="coords")
+
+    for k in result.data_vars:
+        assert np.all(result[k].values == 0), f"data_var {k!r} should be zero"
+
+    np.testing.assert_array_equal(result["lon"].values, nonzero_ds["lon"].values)
+    np.testing.assert_array_equal(result["lat"].values, nonzero_ds["lat"].values)
+
+
+def test_replace_arrays_with_zeros_list(nonzero_ds):
+    """except_for=[...]: listed variables preserved, others zeroed."""
+    result = utils.replace_arrays_with_zeros(nonzero_ds, except_for=["U", "lon"])
+
+    np.testing.assert_array_equal(result["U"].values, nonzero_ds["U"].values)
+    np.testing.assert_array_equal(result["lon"].values, nonzero_ds["lon"].values)
+    assert np.all(result["V"].values == 0), "V should be zero"
+    assert np.all(result["lat"].values == 0), "lat should be zero"
+
+
+def test_replace_arrays_with_zeros_does_not_mutate(nonzero_ds):
+    """Original dataset is not modified."""
+    original_U = nonzero_ds["U"].values.copy()
+    original_lon = nonzero_ds["lon"].values.copy()
+    utils.replace_arrays_with_zeros(nonzero_ds, except_for=None)
+    np.testing.assert_array_equal(nonzero_ds["U"].values, original_U)
+    np.testing.assert_array_equal(nonzero_ds["lon"].values, original_lon)
+
+
+def test_replace_arrays_with_zeros_invalid_key(nonzero_ds):
+    """Invalid key in except_for raises ValueError."""
+    with pytest.raises(ValueError, match="not a valid item"):
+        utils.replace_arrays_with_zeros(nonzero_ds, except_for=["nonexistent"])
```
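The non-mutation guarantee exercised by `test_replace_arrays_with_zeros_does_not_mutate` rests on `Dataset.copy()` returning new variable objects, so reassigning `.data` on the copy leaves the original arrays untouched. A minimal standalone analogue in plain numpy/xarray, where `zero_all` is a hypothetical stand-in for the Parcels function:

```python
import numpy as np
import xarray as xr


def zero_all(ds):
    """Hypothetical stand-in: zero every data variable on a copy of ds."""
    out = ds.copy()  # shallow copy; variables are new objects sharing old arrays
    for k in out.data_vars:
        # Rebinds the copy's variable to a fresh zeros array;
        # the original dataset's variable still holds its data.
        out[k].data = np.zeros_like(out[k].data)
    return out


ds = xr.Dataset({"U": (("y", "x"), np.full((2, 2), 3.0))})
result = zero_all(ds)

print(result["U"].values.sum())  # 0.0  -> arrays zeroed on the copy
print(ds["U"].values.sum())      # 12.0 -> original left untouched
```

A deep copy is unnecessary here precisely because assignment replaces the array reference rather than writing into the shared buffer; writing elementwise (e.g. `out[k].values[:] = 0`) would mutate the original.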
