Commit 9172da4

📄 Add an upgrade guide for 0.1.x → v1.0.0 migration (#439)
Co-authored-by: Fabien Casenave <fabien.casenave@safrangroup.com>
1 parent 3b4203c commit 9172da4

4 files changed

Lines changed: 298 additions & 1 deletion

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ### Added

+- (docs) add an upgrade guide describing breaking changes when migrating from the `0.1.x` series to `v1.0.0`.
 - Add files to support mybinder.org
 - (plaid-check) add a simple app to check the integrity of a plaid database
 - (dataset-viewer) add a trame app for dataset visual exploration.

docs/source/tutorials/downloadable_example.md

Lines changed: 1 addition & 1 deletion
@@ -17,6 +17,6 @@ print("samples.vki_ls59:", samples.vki_ls59)

 The first call to `samples.vki_ls59` triggers a download and takes a few seconds.

-## Cached retrieval
+## Cached retrievals

 Subsequent calls are instantaneous because they reuse the cached sample.

docs/source/upgrade_guide.md

Lines changed: 295 additions & 0 deletions
@@ -0,0 +1,295 @@
1+
---
2+
title: Upgrade guide
3+
---
4+
5+
# Upgrade guide
6+
7+
This page explains how to upgrade an existing code base to **PLAID v1.0.0**.
8+
9+
PLAID follows [Semantic Versioning](https://semver.org/). The `v1.0.0` release is
10+
the first major release: it consolidates the data model, removes deprecated and
11+
out-of-scope modules, and simplifies several public APIs. As a major release, it
12+
contains **breaking changes**.
13+
14+
The guide is organized by **version jump**. Read the section that matches the
15+
version you are upgrading *from*. For the exhaustive, change-by-change history,
16+
see the [`CHANGELOG.md`](https://github.com/PLAID-lib/plaid/blob/main/CHANGELOG.md).
17+
18+
!!! tip "Upgrade incrementally"
    If you are several versions behind, pin `pyplaid`, upgrade one step at a
    time, and run your test suite between steps. The last release of the `0.x`
    series is **`0.1.15`**; the sections below describe the jump from `0.1.x` to
    `1.0.0`.
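
To decide which section of this guide applies, you can compare the installed version against `0.1.15`. The sketch below is illustrative only: it assumes the distribution is installed under the name `pyplaid` (as in the tip above), and `parse_version`, `guide_section` and `installed_pyplaid_version` are hypothetical helper names, not part of PLAID.

```python
from importlib import metadata


def parse_version(text):
    """Turn '0.1.15' into (0, 1, 15); non-numeric suffixes are ignored."""
    parts = []
    for chunk in text.split(".")[:3]:
        digits = ""
        for char in chunk:
            if not char.isdigit():
                break
            digits += char
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def guide_section(installed):
    """Name the section of this guide that applies to an installed version."""
    version = parse_version(installed)
    if version >= (1, 0, 0):
        return "already on 1.x"
    if version < (0, 1, 15):
        return "Upgrading from an older 0.1.x release"
    return "Upgrade to v1.0.0 from 0.1.x"


def installed_pyplaid_version():
    """Return the installed `pyplaid` version, or None if it is not installed."""
    try:
        return metadata.version("pyplaid")
    except metadata.PackageNotFoundError:
        return None
```

Calling `guide_section(installed_pyplaid_version())` in an environment where the package is installed returns the heading to read first.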
23+
24+
!!! info "Related documentation"
    This guide focuses on *what changed* and *how to migrate*. For *how the new
    API works*, see:

    - [Quickstart](quickstart.md): the new read/write pattern in a nutshell.
    - [Concepts](concepts.md): [Sample](concepts/sample.md),
      [Problem definition](concepts/problem_definition.md),
      [Infos](concepts/infos.md), [Dataset](concepts/dataset.md),
      [Disk format](concepts/disk_format.md).
    - [Conversion tutorial](tutorials/storage.md): end-to-end storage workflow
      (save, load, backends, Hub, parallel I/O).
    - [API reference](api/index.md), in particular
      [`plaid.storage`](api/storage/backend_api.md).
37+
38+
---
39+
40+
## Upgrade to v1.0.0 from 0.1.x
41+
42+
`v1.0.0` reorganizes the package. The changes most likely to affect your code are
43+
listed below, with before/after examples.
44+
45+
### Top-level imports
46+
47+
The single biggest change is that **the `Dataset` class has been removed**: it is
48+
no longer exported from the top-level `plaid` package *and no longer exists as a
49+
module either*. A new `Infos` object is exported, and `__version__` is now backed
50+
by `plaid.version` instead of `plaid._version`. See [Removing the `Dataset` class](#removing-the-dataset-class-use-plaidstorage)
51+
below for the full migration.
52+
53+
```python
# Before (0.1.x)
from plaid import Dataset, Sample, ProblemDefinition
from plaid import __version__  # backed by plaid._version

# After (1.0.0)
from plaid import Sample, ProblemDefinition, Infos
from plaid import __version__  # backed by plaid.version
# `Dataset` no longer exists: there is no `plaid.containers.dataset` module.
# Use the storage helpers instead:
from plaid.storage import save_to_disk, init_from_disk
```
65+
66+
The helpers `get_number_of_samples` and `get_sample_ids` are still exported from
67+
the top-level package.
68+
69+
### Removing the `Dataset` class: use `plaid.storage`
70+
71+
In the `0.1.x` series, `plaid.Dataset` was a **monolithic, in-memory container**:
72+
you built one `Dataset` object, appended every `Sample` to it, kept the whole
73+
collection in RAM, and called `save_to_dir` / `load` on that object.
74+
75+
In `v1.0.0` this class is **removed entirely**: there is no public high-level
76+
dataset container class anymore, and there is no `plaid.containers.dataset`
77+
module to import from. The data model is now centered on three objects:
78+
[`Sample`](concepts/sample.md), [`ProblemDefinition`](concepts/problem_definition.md)
79+
and [`Infos`](concepts/infos.md), plus the storage helpers in
80+
[`plaid.storage`](api/storage/backend_api.md). A dataset on disk is a shared
81+
metadata layout plus backend-specific sample payloads; loading it back gives you,
82+
**per split**, a backend dataset object and a `Converter` that materializes
83+
individual `Sample` objects lazily.
84+
85+
This is a deliberate shift away from "load the whole dataset into one in-memory
86+
object" toward **backend-agnostic, lazy, per-sample access**, so that large
87+
datasets that do not fit in memory can be streamed sample by sample into ML
88+
pipelines. The concepts are introduced in [Quickstart](quickstart.md) and the
89+
[Dataset concept page](concepts/dataset.md); the end-to-end workflow is in the
90+
[Conversion tutorial](tutorials/storage.md).
91+
92+
#### Writing: build-then-append → `save_to_disk(sample_constructor, ids)`
93+
94+
Instead of building a `Dataset` and appending samples, you provide a
95+
`sample_constructor(id) -> Sample` callable plus an `ids` mapping of split names
96+
to sliceable id sequences. PLAID handles iteration, generator creation and
97+
parallel sharding internally, and writes directly to the chosen backend.
98+
99+
```python
# Before (0.1.x): everything in memory, then dumped
from plaid import Dataset, Sample

dataset = Dataset()
for raw in raw_items:
    sample = Sample()
    # fill the sample: add_tree, add_field, ...
    dataset.add_sample(sample)
dataset.save_to_dir("my_plaid_dataset")

# After (1.0.0): lazy, per-sample, backend-aware
from plaid import Sample
from plaid.storage import save_to_disk

def sample_constructor(sample_id):
    sample = Sample()
    # fill the sample: add_tree, add_field, ...
    return sample

save_to_disk(
    "my_plaid_dataset",
    sample_constructor=sample_constructor,
    ids={"train": [0, 1, 2], "test": [3, 4]},
    backend="zarr",  # one of "hf_datasets", "cgns", "zarr"
)
```
126+
127+
See the [Conversion tutorial](tutorials/storage.md) for a complete example
128+
(including `num_proc` parallel writing and `push_to_hub`) and the
129+
[writer API](api/storage/writer.md).
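
If your `0.1.x` code looped over a collection of raw records, one way to adapt it is to wrap that collection in a closure that builds a single sample per id. This is a minimal sketch: `make_sample_constructor` and `build_sample` are hypothetical names, and it assumes your ids can be used directly as indices (or keys) into `raw_items`.

```python
def make_sample_constructor(raw_items, build_sample):
    """Wrap an indexable collection of raw records as `sample_constructor(id)`.

    `build_sample` is your own function turning one raw record into a Sample.
    """

    def sample_constructor(sample_id):
        # Build one sample on demand instead of keeping them all in memory.
        return build_sample(raw_items[sample_id])

    return sample_constructor


# Usage (commented out because it needs PLAID and your own data):
# sample_constructor = make_sample_constructor(raw_items, build_sample)
# save_to_disk(
#     "my_plaid_dataset",
#     sample_constructor=sample_constructor,
#     ids={"train": [0, 1, 2], "test": [3, 4]},
# )
```

Only the records requested through `ids` are built, one at a time.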
130+
131+
#### Reading: `Dataset.load(...)` → `init_from_disk(...)` + converter
132+
133+
Loading no longer returns a single object you index into. It returns a
134+
dictionary of backend datasets and a dictionary of converters, one per split.
135+
You materialize a `Sample` on demand with `converter.to_plaid(dataset, idx)`.
136+
137+
```python
# Before (0.1.x)
from plaid import Dataset

dataset = Dataset()
dataset.load("my_plaid_dataset")
sample = dataset[0]
n = len(dataset)

# After (1.0.0)
from plaid.storage import init_from_disk

datasetdict, converterdict = init_from_disk("my_plaid_dataset")
dataset = datasetdict["train"]
converter = converterdict["train"]

sample = converter.to_plaid(dataset, 0)  # materialize one Sample lazily
n = len(dataset)
```
156+
157+
The same shape is used for the Hub (`download_from_hub`,
158+
`init_streaming_from_hub`). See the [Dataset concept page](concepts/dataset.md),
159+
the [reader API](api/storage/reader.md), and the [backend API](api/storage/backend_api.md).
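
Code that used to iterate over a `Dataset` can be ported with a small generator that materializes one sample per step. The helper name `iter_samples` is hypothetical; the sketch relies only on `len(dataset)` and `converter.to_plaid(dataset, i)` as shown above, and forwards any extra keyword arguments (for example `features=[...]`) unchanged.

```python
def iter_samples(dataset, converter, indices=None, **read_kwargs):
    """Yield samples one at a time via `converter.to_plaid(dataset, i)`."""
    if indices is None:
        indices = range(len(dataset))
    for index in indices:
        # Each sample is materialized only when the consumer asks for it.
        yield converter.to_plaid(dataset, index, **read_kwargs)


# Usage (commented out because it needs PLAID and a dataset on disk):
# datasetdict, converterdict = init_from_disk("my_plaid_dataset")
# for sample in iter_samples(datasetdict["train"], converterdict["train"]):
#     ...
```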
160+
161+
#### Operation-by-operation map
162+
163+
| `0.1.x` (`Dataset` method) | `1.0.0` (replacement) |
| --- | --- |
| `Dataset()` + `add_sample` / `add_samples` / `from_list_of_samples` | `save_to_disk(sample_constructor=..., ids=...)` |
| `Dataset.save_to_dir(path)` / `add_to_dir` | `save_to_disk(path, sample_constructor=..., ids=...)` |
| `Dataset.load(path)` | `init_from_disk(path)` → `(datasetdict, converterdict)` |
| `dataset[i]` / `get_samples()` | `converter.to_plaid(dataset, i)` |
| `len(dataset)` / `get_number_of_samples()` | `len(dataset)` (per-split backend object) |
| `dataset.set_infos(...)` / `get_infos()` | pass [`Infos`](concepts/infos.md) to `save_to_disk(infos=...)`; read back with `Infos.from_path(path)` |
| persisting a `ProblemDefinition` with the dataset | `save_to_disk(..., pb_defs=...)`; read back with `load_problem_definitions_from_disk(path)` |
| `Dataset.add_features_from_tabular` (ex-`from_tabular`) | build the corresponding `Sample` objects in `sample_constructor` |
| `Dataset.extract_dataset_from_identifier` | request features at read time: `converter.to_plaid(dataset, i, features=[...])` |
| `Dataset.get_tabular_from_stacked_identifiers` | gather features yourself from the materialized `Sample` objects |
| `plaid.examples` | `plaid.downloadable_examples` |
| change backend (e.g. CGNS → HF) | `init_from_disk` then `save_to_disk` with the new `backend` (see the [Conversion tutorial](tutorials/storage.md)) |
177+
178+
If you only need a subset of features or spatial indices, the converter supports
179+
`features=[...]` and `indexers={...}` for partial reads on the `hf_datasets` and
180+
`zarr` backends; see the [Conversion tutorial](tutorials/storage.md#indexed-extraction-with-indexers).
181+
182+
### Removed modules
183+
184+
The following modules were removed from the `plaid` package in `1.0.0`. They were
185+
either out of the scope of the data model or superseded:
186+
187+
| Removed module | What to do instead |
| --- | --- |
| `plaid.pipelines` (`plaid_blocks`, `sklearn_block_wrappers`) | build ML pipelines outside PLAID, on top of the data model |
| `plaid.post` (`bisect`, `metrics`) | compute post-processing / metrics in your own code |
| `plaid.utils.split` | manage dataset splits via `ProblemDefinition` train/test splits |
| `plaid.utils.stats` | compute statistics in your own code |
| `plaid.utils.interpolation` | use an external interpolation routine |
| `plaid.utils.init_with_tabular` | construct samples explicitly |
| `plaid.utils.deprecation`, `plaid.utils.base` | internal helpers, no public replacement |
196+
197+
If you imported any of these, remove the import and move the corresponding logic
198+
into your own project, or rely on the supported data-model APIs.
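
As an example of moving removed logic into your own project, the sketch below shows one possible replacement for a random train/test split. It is not the algorithm that `plaid.utils.split` used; `make_splits` is a hypothetical helper, and the `{"train": [...]}` / `{"test": [...]}` shapes simply mirror the `ProblemDefinition` example in this guide.

```python
import random


def make_splits(sample_ids, test_fraction=0.2, seed=0):
    """Shuffle ids reproducibly and return (train_split, test_split) dicts."""
    if not 0.0 <= test_fraction <= 1.0:
        raise ValueError("test_fraction must be in [0, 1]")
    ids = list(sample_ids)
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    test_ids = sorted(ids[:n_test])
    train_ids = sorted(ids[n_test:])
    return {"train": train_ids}, {"test": test_ids}
```

The two dictionaries can then be passed as `train_split` and `test_split` when constructing a `ProblemDefinition`.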
199+
200+
### `ProblemDefinition`
201+
202+
`ProblemDefinition` was rewritten as a compact [pydantic](https://docs.pydantic.dev/)
203+
model with four required fields: `input_features`, `output_features`,
204+
`train_split` and `test_split`. The many `*_features_identifiers` accessors were
205+
collapsed into two methods, splits became plain model attributes, and YAML key
206+
order is now enforced on save.
207+
208+
```python
# Before (0.1.x)
pb.add_in_features_identifiers([...])
pb.add_out_features_identifiers([...])
pb.set_in_features_identifiers([...])
pb.set_out_features_identifiers([...])
pb.get_in_features_identifiers()
pb.get_out_features_identifiers()
pb.get_split("train")  # split accessors

# After (1.0.0)
from plaid import ProblemDefinition

pb = ProblemDefinition(
    input_features=["Base/Zone/GridCoordinates/CoordinateX"],
    output_features=["Base/Zone/VertexFields/pressure"],
    train_split={"train": [0, 1, 2]},
    test_split={"test": [3, 4]},
)
pb.add_input_features([...])
pb.add_output_features([...])
pb.train_split  # direct attribute access
pb.test_split
```
232+
233+
The public surface of `ProblemDefinition` in `1.0.0` is intentionally small:
234+
`from_path`, `model_validate`, `add_input_features`, `add_output_features`,
235+
`save_to_file`, and the four model fields (`input_features`, `output_features`,
236+
`train_split`, `test_split`). The previous `constant_features_identifiers`
237+
accessors and the `get_*_split_*` / `set_*_split_*` helpers were removed
238+
together with the in/out identifier accessors; splits are now read and assigned
239+
directly via the `train_split` / `test_split` attributes, and feature lists are
240+
normalized (stringified, sorted, deduplicated, non-empty) by pydantic
241+
validators. The problem name is no longer stored in the model: on disk it is
242+
the YAML filename stem, in memory it is the dictionary key returned by
243+
`load_problem_definitions_from_disk`. See the
244+
[Problem definition concept page](concepts/problem_definition.md) and the
245+
[`problem_definition` API](api/problem_definition.md).
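
To make the normalization and naming rules concrete, here is a plain-Python approximation of the behaviour described above. It is not PLAID's validator code: `normalize_features` and `problem_name` are hypothetical helpers, "non-empty" is read here as "the list must not be empty", and the file path in the usage note is an invented example.

```python
from pathlib import Path


def normalize_features(features):
    """Stringify, deduplicate and sort feature identifiers; reject empty lists."""
    normalized = sorted({str(feature) for feature in features})
    if not normalized:
        raise ValueError("feature list must not be empty")
    return normalized


def problem_name(yaml_path):
    """On disk, the problem name is the YAML filename without its extension."""
    return Path(yaml_path).stem
```

For instance, `normalize_features(["b", "a", "b"])` gives `["a", "b"]`, and `problem_name("pb_defs/regression.yaml")` gives `"regression"`.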
246+
247+
### Storage / CGNS backend
248+
249+
The constant/variable mechanism used in the CGNS backend reading and writing
250+
paths was removed. If you relied on that distinction at the storage level, review
251+
your read/write code against the current
252+
[backend API](api/storage/backend_api.md) and the
253+
[CGNS backend API](api/storage/cgns/index.md). The on-disk layout written by
254+
`save_to_disk` (shared metadata + per-backend payloads) is described in the
255+
[Disk format concept page](concepts/disk_format.md), and the three backends
256+
(`hf_datasets`, `cgns`, `zarr`) are compared in the
257+
[Conversion tutorial](tutorials/storage.md#choosing-a-backend).
258+
259+
### New in v1.0.0
260+
261+
`v1.0.0` also introduces new building blocks you can adopt:
262+
263+
- **`plaid.infos`**: a dedicated pydantic `Infos` class, now living at the same
  level as `ProblemDefinition` (see [Infos](concepts/infos.md)).
- **`plaid.viewer`**: an interactive [trame](https://kitware.github.io/trame/)
  application for visual dataset exploration (see [Viewer](concepts/viewer.md)).
- **`plaid-check`**: a CLI tool that validates the integrity of a local PLAID
  dataset (on-disk layout, `infos.yaml`, splits, sample conversion, invalid
  numeric values, duplicated samples, and optional problem definitions); see
  [Dataset check](concepts/check.md).
271+
272+
---
273+
274+
## Upgrading from an older 0.1.x release
275+
276+
If you are upgrading from a release earlier than `0.1.15`, first move up to
277+
`0.1.15` and account for the intermediate breaking changes documented in the
278+
[`CHANGELOG.md`](https://github.com/PLAID-lib/plaid/blob/main/CHANGELOG.md), in
279+
particular:
280+
281+
- **0.1.15**: `save_to_disk` API simplified: `generators` replaced by
  `sample_constructor` and `ids`.
- **0.1.13**: `get_mesh` renamed to `get_tree`; `get_<x>_assignment` renamed to
  `resolve_<x>` (e.g. `get_time_assignment` → `resolve_time`).
- **0.1.11**: `get_all_mesh_times()` renamed to `get_all_time_values()`;
  `FeatureIdentifier` moved from `plaid.types` to `plaid.containers`; Python 3.10
  support dropped.
- **0.1.10**: `Sample` restructured to store globals at time steps (scalars and
  time series unified into CGNS trees).
- **0.1.8**: `Dataset.from_tabular` → `Dataset.add_features_from_tabular`;
  `Dataset.from_features_identifier` → `Dataset.extract_dataset_from_identifier`;
  `Sample.from_features_identifier` → `Sample.extract_sample_from_identifier`.
293+
294+
Once on `0.1.15`, follow the [Upgrade to v1.0.0 from 0.1.x](#upgrade-to-v100-from-01x)
295+
section above.
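
Before stepping through the intermediate releases, it can help to search your code base for the renamed methods listed above. The sketch below is a rough text scan, not an official migration tool: `RENAMES` only encodes the renames named in this section, `find_old_names` is a hypothetical helper, and a plain regex search can report false positives (for example in comments or strings).

```python
import re

# (regex for the old name, suggested replacement), taken from the list above.
RENAMES = [
    (r"\bget_mesh\b", "get_tree"),
    (r"\bget_(\w+)_assignment\b", r"resolve_\1"),
    (r"\bget_all_mesh_times\b", "get_all_time_values"),
    (r"\bfrom_tabular\b", "add_features_from_tabular"),
    (
        r"\bfrom_features_identifier\b",
        "extract_dataset_from_identifier / extract_sample_from_identifier",
    ),
]


def find_old_names(source):
    """Return (line_number, old_name, suggestion) for each outdated name found."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for pattern, replacement in RENAMES:
            for match in re.finditer(pattern, line):
                suggestion = match.expand(replacement)
                findings.append((line_number, match.group(0), suggestion))
    return findings
```

Running `find_old_names` on the text of each source file gives a checklist of call sites to review by hand.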

docs/zensical.toml

Lines changed: 1 addition & 0 deletions
@@ -57,6 +57,7 @@ nav = [
 ] },
 { "plaid-viewer" = "concepts/viewer.md" },
 { "plaid-check" = "concepts/check.md" },
+{ "Upgrade guide" = "upgrade_guide.md" },
 # >>> AUTO-GENERATED API REFERENCE START
 # The block below is overwritten by `python docs/generate_api_stubs.py`.
 # Edit that script (or the markers) instead of changing this section by hand.
