Commit ea4e6b9: Add icechunk examples (#709)

Parent: 31eaa2c

1 file changed: docs/examples/icechunk.md (128 additions & 0 deletions)
---
file_format: mystnb
kernelspec:
  name: python3
---

# Icechunk

This example shows how to perform large-scale distributed writes to Icechunk using Cubed
(based on the examples for using [Icechunk with Dask](https://icechunk.io/en/latest/icechunk-python/dask/)).

Install the prerequisites by running the following:

```shell
pip install cubed icechunk
```

Start by creating an Icechunk store.

```{code-cell} ipython3
import icechunk
import tempfile

# initialize the Icechunk store in a temporary directory
storage = icechunk.local_filesystem_storage(tempfile.TemporaryDirectory().name)
icechunk_repo = icechunk.Repository.create(storage)
icechunk_session = icechunk_repo.writable_session("main")
```

## Write to Icechunk

Use `cubed.icechunk.store_icechunk` to write a Cubed array to an Icechunk store.
The API follows that of {py:func}`cubed.store`.

First create a Cubed array to write:

```{code-cell} ipython3
import cubed

shape = (100, 100)
cubed_chunks = (20, 20)
cubed_array = cubed.random.random(shape, chunks=cubed_chunks)
```

Now create the Zarr array you will write to.

```{code-cell} ipython3
import zarr

zarr_chunks = (10, 10)
group = zarr.group(store=icechunk_session.store, overwrite=True)

zarray = group.create_array(
    "array",
    shape=shape,
    chunks=zarr_chunks,
    dtype="f8",
    fill_value=float("nan"),
)
```

Note that each Zarr chunk in the store evenly divides the corresponding Cubed chunk. This means each individual write task is independent and will not conflict with any other. It is your responsibility to ensure that such conflicts are avoided.
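As a quick sanity check (not part of the original example), the divisibility requirement can be verified in plain Python before writing, using the chunk sizes defined above:

```python
# Each Zarr store chunk size must evenly divide the corresponding
# Cubed chunk size, so that no two write tasks touch the same store chunk.
cubed_chunks = (20, 20)
zarr_chunks = (10, 10)

compatible = all(c % z == 0 for c, z in zip(cubed_chunks, zarr_chunks))
print(compatible)  # True for the chunk sizes used in this example
```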

Now write the array:

```{code-cell} ipython3
from cubed.icechunk import store_icechunk

store_icechunk(
    icechunk_session,
    sources=[cubed_array],
    targets=[zarray],
)
```

Finally, commit your changes:

```{code-cell} ipython3
print(icechunk_session.commit("wrote a cubed array!"))
```

## Read from Icechunk

Use {py:func}`cubed.from_zarr` to read from Icechunk. Note that no special Icechunk-specific function is needed in this case.

```{code-cell} ipython3
cubed.from_zarr(store=icechunk_session.store, path="array")
```

## Distributed writes

In distributed contexts where the Session and Zarr Array objects are sent across the network, you must opt in to pickling of a writable store. `cubed.icechunk.store_icechunk` takes care of the hard part of merging Sessions, but you must opt in to pickling before creating the target Zarr array objects.

Here is an example:

```{code-cell} ipython3
from cubed import config

# start a new session; the old session is read-only after committing
icechunk_session = icechunk_repo.writable_session("main")
zarr_chunks = (10, 10)

# use the Cubed processes executor, which requires pickling
with config.set({"spec.executor_name": "processes"}):
    with icechunk_session.allow_pickling():
        cubed_array = cubed.random.random(shape, chunks=cubed_chunks)

        group = zarr.group(
            store=icechunk_session.store,
            overwrite=True,
        )

        zarray = group.create_array(
            "array",
            shape=shape,
            chunks=zarr_chunks,
            dtype="f8",
            fill_value=float("nan"),
        )

        store_icechunk(
            icechunk_session,
            sources=[cubed_array],
            targets=[zarray],
        )

print(icechunk_session.commit("wrote a cubed array!"))
```
