Commit 975744d

Merge branch 'main' into docs/406-upgrade-guide
2 parents 10906a2 + 3b4203c commit 975744d

41 files changed

Lines changed: 1618 additions & 943 deletions

docs/source/concepts/check.md

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+---
+title: Dataset check
+---
+
+# Dataset check
+
+`plaid-check` validates the integrity of a local PLAID dataset.
+
+It checks:
+
+- required on-disk files and directories;
+- `infos.yaml`, metadata and split sample counts;
+- sample conversion through the declared storage backend;
+- invalid numeric values such as `None`, empty arrays, `NaN` and `Inf`;
+- duplicated samples;
+- optional `problem_definitions/` feature names, splits and indices.
+
+## Basic usage
+
+```bash
+plaid-check /path/to/plaid_dataset
+```
+
+A valid dataset prints an `[OK]` line and returns exit code `0`.
+
+## Options
+
+Check only selected splits:
+
+```bash
+plaid-check /path/to/plaid_dataset --split train --split test
+```
+
+Check only selected problem definitions:
+
+```bash
+plaid-check /path/to/plaid_dataset --problem-definition regression_500
+```
+
+Emit a machine-readable report:
+
+```bash
+plaid-check /path/to/plaid_dataset --json
+```
+
+Make warnings fail the command:
+
+```bash
+plaid-check /path/to/plaid_dataset --strict
+```
+
+## Report format
+
+Messages are reported with a severity, a stable code, a location and a short
+description. Errors return exit code `1`; warnings return exit code `2` only in
+strict mode.
+
+## Notes
+
+- For CGNS datasets, only `infos.yaml` and `data/` are required at the root.
+- For other backends, metadata files and `constants/` are checked as well.
+- Without `--problem-definition`, all discovered problem definitions are checked.
+- In JSON mode, progress bars are disabled to keep the output parseable.
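
The exit codes described in the new `check.md` (`0` for a valid dataset, `1` for errors, `2` for warnings in strict mode) lend themselves to scripting. The following is a minimal sketch of how a CI step might wrap the command. `describe_exit_code` and `run_check` are hypothetical helper names, not part of PLAID, and the sketch assumes `plaid-check` is installed and on `PATH`.

```python
import subprocess
from typing import Sequence


def describe_exit_code(code: int) -> str:
    """Map a plaid-check exit code to a label, following the documented codes."""
    if code == 0:
        return "ok"
    if code == 1:
        return "errors"
    if code == 2:
        return "warnings (strict mode)"
    # Any other value is not described in the documentation.
    return f"unexpected exit code {code}"


def run_check(dataset_path: str, extra_args: Sequence[str] = ()) -> str:
    """Run plaid-check on a dataset and return a label for its exit code.

    Hypothetical wrapper: requires the plaid-check executable on PATH.
    """
    result = subprocess.run(
        ["plaid-check", dataset_path, *extra_args],
        capture_output=True,
        text=True,
        check=False,  # inspect the return code instead of raising
    )
    return describe_exit_code(result.returncode)
```

A call such as `run_check("/path/to/plaid_dataset", ["--strict"])` would then return one of the labels above.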

docs/source/concepts/dataset.md

Lines changed: 10 additions & 4 deletions
@@ -117,12 +117,18 @@ dictionary with the same schema:
 
 ```python
 from plaid import ProblemDefinition
-from plaid.infos import DataDescription, Infos, Legal
+from plaid.infos import DataDescription, Infos
 from plaid.storage import save_to_disk
 
-pb_def = ProblemDefinition(name="regression_1")
+pb_def = ProblemDefinition(
+    input_features=["Global/input"],
+    output_features=["Base/Zone/VertexFields/pressure"],
+    train_split={"train": [0, 1, 2]},
+    test_split={"test": "all"},
+)
 infos = Infos(
-    legal=Legal(owner="CompanyX", license="proprietary"),
+    owner="CompanyX",
+    license="proprietary",
     data_description=DataDescription(number_of_samples=3),
     num_samples={"train": 3},
 )
@@ -132,7 +138,7 @@ save_to_disk(
     sample_constructor=sample_constructor,
     ids={"train": [0, 1, 2]},
     infos=infos,
-    pb_defs=pb_def,
+    pb_defs={"regression_1": pb_def},
 )
 ```
 

docs/source/concepts/infos.md

Lines changed: 32 additions & 20 deletions
@@ -9,21 +9,23 @@ root of a PLAID dataset.
 
 In the current API, infos stores:
 
-- `legal`, with required `owner` and `license` entries
+- `owner` and `license`, required string entries describing the dataset
+ownership and licensing
 - `data_production`, for optional production context such as simulator,
 hardware, contact, or location
 - `data_description`, for optional dataset description entries such as the
 number of samples, DOE, inputs, and outputs
-- `num_samples`, as a dictionary keyed by split name
-- `storage_backend`
+- `num_samples`, as a dictionary keyed by split name, populated by storage writers
+- `storage_backend`, as a storage backend identifier, populated by storage writers
 
 ## Basic usage
 
 ```python
-from plaid.infos import DataProduction, Infos, Legal
+from plaid.infos import DataProduction, Infos
 
 infos = Infos(
-    legal=Legal(owner="Safran", license="proprietary"),
+    owner="Safran",
+    license="proprietary",
     data_production=DataProduction(
         type="simulation",
         physics="fluid dynamics",
@@ -36,26 +38,35 @@ infos = Infos(
 Infos can also be built from a plain mapping, for instance after reading YAML:
 
 ```python
-infos = Infos.from_mapping(
+infos = Infos.model_validate(
     {
-        "legal": {
-            "owner": "Safran",
-            "license": "proprietary",
-        },
+        "owner": "Safran",
+        "license": "proprietary",
     }
 )
 ```
 
+`num_samples` and `storage_backend` are derived from the chosen storage backend
+and the saved split contents. They can be omitted when creating an `Infos`
+object that will later be passed to `save_to_disk(...)`; PLAID fills them before
+writing `infos.yaml`.
+
 ## Loading from disk
 
-Load infos from a dataset path or directly from an `infos.yaml` file:
+Load infos from a complete dataset path or directly from an `infos.yaml` file:
 
 ```python
 infos = Infos.from_path("/path/to/plaid_dataset")
 ```
 
 When a directory is provided, `Infos.from_path(...)` looks for `infos.yaml`
-inside that directory.
+inside that directory. By default, loading from disk requires the persisted
+storage metadata (`num_samples` and `storage_backend`) to be present. To load a
+draft infos file that has not been produced by `save_to_disk(...)`, use:
+
+```python
+infos = Infos.from_path("/path/to/draft/infos.yaml", require_persisted=False)
+```
 
 ## Saving
 
@@ -68,20 +79,21 @@ infos.save_to_file("/path/to/plaid_dataset/infos.yaml")
 If a directory path is provided, the file is saved as `infos.yaml` inside that
 directory.
 
-## Mapping-style access
+## Typed access and serialization
 
-`Infos` provides read-only mapping-style helpers for compatibility with code
-expecting a YAML-like dictionary:
+`Infos` is a Pydantic model. Access metadata through typed attributes and use
+Pydantic serialization when a plain mapping is needed:
 
 ```python
-owner = infos["legal"]["owner"]
-backend = infos.get("storage_backend")
-payload = infos.to_dict()
+owner = infos.owner
+backend = infos.storage_backend
+payload = infos.model_dump(exclude_none=True)
 ```
 
 ## Notes
 
-- `legal.owner` and `legal.license` are required when validating complete infos.
-- `num_samples` and `storage_backend` are automatically filled when `save_to_disk(..., infos=...)` is called before writing `infos.yaml`.
+- `owner` and `license` are required when creating infos.
+- `num_samples` and `storage_backend` are required when loading persisted dataset infos.
+- `num_samples` and `storage_backend` are overwritten with the actual saved dataset values when `save_to_disk(..., infos=...)` is called before writing `infos.yaml`.
 - Unknown keys are rejected during validation.
 - `save_to_file(...)` writes YAML using the standard infos key order.
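
Because this change moves `owner` and `license` from a nested `legal` block to top-level keys, and unknown keys are rejected during validation, an `infos.yaml` mapping written in the old layout would presumably need flattening before validation. The following is a minimal sketch of that step on plain dictionaries. `flatten_legacy_infos` is a hypothetical helper, not part of PLAID, and it assumes the old `legal` block carried only `owner` and `license`.

```python
from typing import Any, Mapping


def flatten_legacy_infos(legacy: Mapping[str, Any]) -> dict[str, Any]:
    """Return a copy with a nested ``legal`` block lifted to top-level keys.

    Hypothetical migration helper (not part of PLAID). Only ``owner`` and
    ``license`` are lifted; any other key under ``legal`` is ignored.
    """
    # Copy everything except the legacy block itself.
    flat = {key: value for key, value in legacy.items() if key != "legal"}
    legal = legacy.get("legal") or {}
    for key in ("owner", "license"):
        if key in legal:
            if key in flat:
                raise ValueError(
                    f"{key!r} is set both at top level and under 'legal'"
                )
            flat[key] = legal[key]
    return flat


old_style = {
    "legal": {"owner": "Safran", "license": "proprietary"},
    "num_samples": {"train": 3},
}
new_style = flatten_legacy_infos(old_style)
```

The resulting mapping has the flat shape that the updated documentation passes to `Infos.model_validate(...)`.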

docs/source/concepts/problem_definition.md

Lines changed: 42 additions & 22 deletions
@@ -8,42 +8,59 @@ title: Problem definition
 
 In the current API, a problem definition stores:
 
-- `name`
-- `input_features` (`list[str]`)
-- `output_features` (`list[str]`)
-- `train_split` and `test_split`
+- `input_features` (`list[str]`, required and non-empty)
+- `output_features` (`list[str]`, required and non-empty)
+- `train_split` and `test_split` (required)
+
+The problem identifier is not stored in the model. On disk, it is the YAML
+filename stem; in memory, it is the dictionary key used for the definition.
 
 ## Basic usage
 
 ```python
 from plaid import ProblemDefinition
 
-pb = ProblemDefinition(name="regression_1")
-
-pb.add_input_features([
-    "Base/Zone/GridCoordinates/CoordinateX",
-    "Base/Zone/GridCoordinates/CoordinateY",
-])
-pb.add_output_features([
-    "Base/Zone/VertexFields/pressure",
-])
-
-pb.train_split = {"train": [0, 1, 2]}
-pb.test_split = {"test": [3, 4]}
+pb = ProblemDefinition(
+    input_features=[
+        "Base/Zone/GridCoordinates/CoordinateX",
+        "Base/Zone/GridCoordinates/CoordinateY",
+    ],
+    output_features=[
+        "Base/Zone/VertexFields/pressure",
+    ],
+    train_split={"train": [0, 1, 2]},
+    test_split={"test": [3, 4]},
+)
 ```
 
 Feature lists are normalized by the model: entries are converted to strings,
-sorted, and checked for duplicates.
+sorted, checked for duplicates, and rejected if empty.
+
+Problem definitions can also be validated from a plain mapping, for instance
+after reading YAML:
+
+```python
+pb = ProblemDefinition.model_validate(
+    {
+        "input_features": ["Base/Zone/GridCoordinates/CoordinateX"],
+        "output_features": ["Base/Zone/VertexFields/pressure"],
+        "train_split": {"train": [0, 1, 2]},
+        "test_split": {"test": [3, 4]},
+    }
+)
+```
 
 ## Loading from disk
 
 Load a definition from a dataset path:
 
 ```python
-pb = ProblemDefinition.from_path("/path/to/plaid_dataset", name="regression_1")
+pb = ProblemDefinition.from_path(
+    "/path/to/plaid_dataset/problem_definitions/regression_1.yaml"
+)
 ```
 
-At storage level, problem definitions are loaded as a dictionary keyed by name:
+At storage level, problem definitions are loaded as a dictionary keyed by YAML filename stem:
 
 ```python
 from plaid.storage import load_problem_definitions_from_disk
@@ -60,10 +77,13 @@ Save to YAML:
 pb.save_to_file("problem_definitions/regression_1.yaml")
 ```
 
+This writes no `name:` key; `regression_1` is inferred from the filename by the
+storage loader.
+
 ## Notes
 
-- Input/output features are plain strings correspond to CGNS paths.
-- Splits are represented by `train_split` and `test_split` dictionaries.
+- Input/output features are plain strings corresponding to CGNS paths.
+- Splits are represented by `train_split` and `test_split` dictionaries and are accessed directly as model attributes.
 - Split values can be explicit index sequences or the string `"all"`.
 - `add_input_features(...)` and `add_output_features(...)` accept either a
-single string or a sequence of strings.
+single string or a sequence of strings after initialization.
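
The updated `problem_definition.md` states that feature lists are converted to strings, sorted, checked for duplicates, and rejected if empty. The following is a minimal sketch of what such a normalization step could look like. `normalize_features` is a hypothetical function, not PLAID's implementation, and it assumes that both a duplicate entry and an empty list raise `ValueError`, which the documentation does not spell out.

```python
from typing import Iterable


def normalize_features(features: Iterable[object]) -> list[str]:
    """Convert entries to strings, sort them, and validate the result.

    Hypothetical sketch of the documented rules; the error types are assumed.
    """
    names = sorted(str(feature) for feature in features)
    if not names:
        raise ValueError("feature list must not be empty")
    # Collect every name that occurs more than once, for a readable message.
    duplicates = sorted({name for name in names if names.count(name) > 1})
    if duplicates:
        raise ValueError(f"duplicated features: {duplicates}")
    return names


inputs = normalize_features(
    [
        "Base/Zone/GridCoordinates/CoordinateY",
        "Base/Zone/GridCoordinates/CoordinateX",
    ]
)
```

Here `inputs` holds the two paths in sorted order, with `CoordinateX` before `CoordinateY`.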

docs/source/examples_tutorials.md

Lines changed: 1 addition & 1 deletion
@@ -12,8 +12,8 @@ You can find here detailed examples for different parts of plaid, explained in J
 * [Sample](notebooks/containers/sample_example.md)
 * [Problem definition](notebooks/problem_definition_example.md)
 * [Infos](notebooks/infos_example.md)
-* [Downloadable samples](notebooks/downloadable_example/sample_example.md)
 
 ## Tutorials
 
+* [Downloadable samples](tutorials/downloadable_example.md)
 * [Conversion tutorial](tutorials/storage.md)
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+---
+title: Downloadable samples
+---
+
+# Downloadable samples
+
+## First retrieval
+
+Retrieving sample examples is as easy as:
+
+```python
+from plaid.downloadable_examples import AVAILABLE_EXAMPLES, samples
+
+print(AVAILABLE_EXAMPLES)
+print("samples.vki_ls59:", samples.vki_ls59)
+```
+
+The first call to `samples.vki_ls59` triggers a download and takes a few seconds.
+
+## Cached retrieval
+
+Subsequent calls are instantaneous because they reuse the cached sample.
