PLAID-lib · casenave · Jun 5, 2026 · Jun 4, 2026 · Jun 4, 2026 · Jun 4, 2026
@@ -0,0 +1,63 @@
+---
+title: Dataset check
+---
+
+# Dataset check
+
+`plaid-check` validates the integrity of a local PLAID dataset.
+
+It checks:
+
+- required on-disk files and directories;
+- `infos.yaml`, metadata and split sample counts;
+- sample conversion through the declared storage backend;
+- invalid values such as `None`, empty arrays, `NaN` and `Inf`;
+- duplicated samples;
+- optional `problem_definitions/` feature names, splits and indices.
+
+## Basic usage
+
+```bash
+plaid-check /path/to/plaid_dataset
+```
+
+For a valid dataset, the command prints an `[OK]` line and returns exit code `0`.
+
+## Options
+
+Check only selected splits:
+
+```bash
+plaid-check /path/to/plaid_dataset --split train --split test
+```
+
+Check only selected problem definitions:
+
+```bash
+plaid-check /path/to/plaid_dataset --problem-definition regression_500
+```
+
+Emit a machine-readable report:
+
+```bash
+plaid-check /path/to/plaid_dataset --json
+```
+
+Make warnings fail the command:
+
+```bash
+plaid-check /path/to/plaid_dataset --strict
+```
+
+## Report format
+
+Each message is reported with a severity, a stable code, a location and a short
+description. The command exits with code `1` when errors are reported; warnings
+cause exit code `2` only in strict mode.
+
+## Notes
+
+- For CGNS datasets, only `infos.yaml` and `data/` are required at the root.
+- For other backends, metadata files and `constants/` are checked as well.
+- Without `--problem-definition`, all discovered problem definitions are checked.
+- In JSON mode, progress bars are disabled to keep the output parseable.
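The exit codes described under "Report format" can be consumed from a script. The following is a minimal sketch, not part of PLAID: `describe_exit_code` and `run_plaid_check` are hypothetical helper names, and the wrapper assumes that `plaid-check` is available on `PATH` and accepts the `--strict` flag shown above.

```python
import subprocess

# Exit codes as documented in the "Report format" section.
EXIT_MEANINGS = {
    0: "ok",
    1: "errors",
    2: "warnings (strict mode)",
}


def describe_exit_code(code: int) -> str:
    """Map a plaid-check exit code to a short label (hypothetical helper)."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_plaid_check(dataset_path: str, strict: bool = False) -> str:
    """Run plaid-check on a dataset and return a label for its exit code."""
    cmd = ["plaid-check", dataset_path]
    if strict:
        cmd.append("--strict")
    # check=False: a non-zero exit code is an expected outcome here, not a failure.
    completed = subprocess.run(cmd, check=False)
    return describe_exit_code(completed.returncode)
```

In a CI job, `run_plaid_check("/path/to/plaid_dataset", strict=True)` would then return one of the three labels, or the fallback label for any code the documentation does not list.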
@@ -117,12 +117,18 @@ dictionary with the same schema:
 
 ```python
 from plaid import ProblemDefinition
-from plaid.infos import DataDescription, Infos, Legal
+from plaid.infos import DataDescription, Infos
 from plaid.storage import save_to_disk
 
-pb_def = ProblemDefinition(name="regression_1")
+pb_def = ProblemDefinition(
+    input_features=["Global/input"],
+    output_features=["Base/Zone/VertexFields/pressure"],
+    train_split={"train": [0, 1, 2]},
+    test_split={"test": "all"},
+)
 infos = Infos(
-    legal=Legal(owner="CompanyX", license="proprietary"),
+    owner="CompanyX",
+    license="proprietary",
     data_description=DataDescription(number_of_samples=3),
     num_samples={"train": 3},
 )
@@ -132,7 +138,7 @@ save_to_disk(
     sample_constructor=sample_constructor,
     ids={"train": [0, 1, 2]},
     infos=infos,
-    pb_defs=pb_def,
+    pb_defs={"regression_1": pb_def},
 )
 ```
 

@@ -9,21 +9,23 @@ root of a PLAID dataset.
 
 In the current API, infos stores:
 
-- `legal`, with required `owner` and `license` entries
+- `owner` and `license`, required string entries describing the dataset
+  ownership and licensing
 - `data_production`, for optional production context such as simulator,
   hardware, contact, or location
 - `data_description`, for optional dataset description entries such as the
   number of samples, DOE, inputs, and outputs
-- `num_samples`, as a dictionary keyed by split name
-- `storage_backend`
+- `num_samples`, as a dictionary keyed by split name, populated by storage writers
+- `storage_backend`, as a storage backend identifier, populated by storage writers
 
 ## Basic usage
 
 ```python
-from plaid.infos import DataProduction, Infos, Legal
+from plaid.infos import DataProduction, Infos
 
 infos = Infos(
-    legal=Legal(owner="Safran", license="proprietary"),
+    owner="Safran",
+    license="proprietary",
     data_production=DataProduction(
         type="simulation",
         physics="fluid dynamics",
@@ -36,26 +38,35 @@ infos = Infos(
 Infos can also be built from a plain mapping, for instance after reading YAML:
 
 ```python
-infos = Infos.from_mapping(
+infos = Infos.model_validate(
     {
-        "legal": {
-            "owner": "Safran",
-            "license": "proprietary",
-        },
+        "owner": "Safran",
+        "license": "proprietary",
     }
 )
 ```
 
+`num_samples` and `storage_backend` are derived from the chosen storage backend
+and the saved split contents. They can be omitted when creating an `Infos`
+object that will later be passed to `save_to_disk(...)`; PLAID fills them before
+writing `infos.yaml`.
+
 ## Loading from disk
 
-Load infos from a dataset path or directly from an `infos.yaml` file:
+Load infos from a complete dataset path or directly from an `infos.yaml` file:
 
 ```python
 infos = Infos.from_path("/path/to/plaid_dataset")
 ```
 
 When a directory is provided, `Infos.from_path(...)` looks for `infos.yaml`
-inside that directory.
+inside that directory. By default, loading from disk requires the persisted
+storage metadata (`num_samples` and `storage_backend`) to be present. To load a
+draft infos file that has not been produced by `save_to_disk(...)`, use:
+
+```python
+infos = Infos.from_path("/path/to/draft/infos.yaml", require_persisted=False)
+```
 
 ## Saving
 
@@ -68,20 +79,21 @@ infos.save_to_file("/path/to/plaid_dataset/infos.yaml")
 If a directory path is provided, the file is saved as `infos.yaml` inside that
 directory.
 
-## Mapping-style access
+## Typed access and serialization
 
-`Infos` provides read-only mapping-style helpers for compatibility with code
-expecting a YAML-like dictionary:
+`Infos` is a Pydantic model. Access metadata through typed attributes and use
+Pydantic serialization when a plain mapping is needed:
 
 ```python
-owner = infos["legal"]["owner"]
-backend = infos.get("storage_backend")
-payload = infos.to_dict()
+owner = infos.owner
+backend = infos.storage_backend
+payload = infos.model_dump(exclude_none=True)
 ```
 
 ## Notes
 
-- `legal.owner` and `legal.license` are required when validating complete infos.
-- `num_samples` and `storage_backend` are automatically filled when `save_to_disk(..., infos=...)` is called before writing `infos.yaml`.
+- `owner` and `license` are required when creating infos.
+- `num_samples` and `storage_backend` are required when loading persisted dataset infos.
+- When `save_to_disk(..., infos=...)` is called, `num_samples` and `storage_backend` are overwritten with the actual saved dataset values before `infos.yaml` is written.
 - Unknown keys are rejected during validation.
 - `save_to_file(...)` writes YAML using the standard infos key order.
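Because `owner` and `license` move from the nested `legal` entry to the top level and unknown keys are rejected, a mapping written in the old layout would presumably need to be flattened before validation. The following is a minimal sketch using plain dictionaries only; `flatten_legacy_infos` is a hypothetical helper, not a PLAID function.

```python
from typing import Any, Mapping


def flatten_legacy_infos(mapping: Mapping[str, Any]) -> dict[str, Any]:
    """Return a copy with legacy legal.owner / legal.license lifted to the top level."""
    # Copy every key except the legacy nested entry; the input is left untouched.
    flat = {key: value for key, value in mapping.items() if key != "legal"}
    legal = mapping.get("legal") or {}
    for key in ("owner", "license"):
        # Keep an existing top-level value if both layouts are present.
        if key in legal and key not in flat:
            flat[key] = legal[key]
    return flat


legacy = {
    "legal": {"owner": "Safran", "license": "proprietary"},
    "num_samples": {"train": 3},
}
flat = flatten_legacy_infos(legacy)
```

The resulting `flat` mapping has the shape shown in the `Infos.model_validate(...)` example above and could be passed to it.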
@@ -8,42 +8,59 @@ title: Problem definition
 
 In the current API, a problem definition stores:
 
-- `name`
-- `input_features` (`list[str]`)
-- `output_features` (`list[str]`)
-- `train_split` and `test_split`
+- `input_features` (`list[str]`, required and non-empty)
+- `output_features` (`list[str]`, required and non-empty)
+- `train_split` and `test_split` (required)
+
+The problem identifier is not stored in the model. On disk, it is the YAML
+filename stem; in memory, it is the dictionary key used for the definition.
 
 ## Basic usage
 
 ```python
 from plaid import ProblemDefinition
 
-pb = ProblemDefinition(name="regression_1")
-
-pb.add_input_features([
-    "Base/Zone/GridCoordinates/CoordinateX",
-    "Base/Zone/GridCoordinates/CoordinateY",
-])
-pb.add_output_features([
-    "Base/Zone/VertexFields/pressure",
-])
-
-pb.train_split = {"train": [0, 1, 2]}
-pb.test_split = {"test": [3, 4]}
+pb = ProblemDefinition(
+    input_features=[
+        "Base/Zone/GridCoordinates/CoordinateX",
+        "Base/Zone/GridCoordinates/CoordinateY",
+    ],
+    output_features=[
+        "Base/Zone/VertexFields/pressure",
+    ],
+    train_split={"train": [0, 1, 2]},
+    test_split={"test": [3, 4]},
+)
 ```
 
 Feature lists are normalized by the model: entries are converted to strings,
-sorted, and checked for duplicates.
+sorted, and checked for duplicates; empty feature lists are rejected.
+
+Problem definitions can also be validated from a plain mapping, for instance
+after reading YAML:
+
+```python
+pb = ProblemDefinition.model_validate(
+    {
+        "input_features": ["Base/Zone/GridCoordinates/CoordinateX"],
+        "output_features": ["Base/Zone/VertexFields/pressure"],
+        "train_split": {"train": [0, 1, 2]},
+        "test_split": {"test": [3, 4]},
+    }
+)
+```
 
 ## Loading from disk
 
 Load a definition from a dataset path:
 
 ```python
-pb = ProblemDefinition.from_path("/path/to/plaid_dataset", name="regression_1")
+pb = ProblemDefinition.from_path(
+    "/path/to/plaid_dataset/problem_definitions/regression_1.yaml"
+)
 ```
 
-At storage level, problem definitions are loaded as a dictionary keyed by name:
+At storage level, problem definitions are loaded as a dictionary keyed by YAML filename stem:
 
 ```python
 from plaid.storage import load_problem_definitions_from_disk
@@ -60,10 +77,13 @@ Save to YAML:
 pb.save_to_file("problem_definitions/regression_1.yaml")
 ```
 
+This writes no `name:` key; `regression_1` is inferred from the filename by the
+storage loader.
+
 ## Notes
 
-- Input/output features are plain strings correspond to CGNS paths.
-- Splits are represented by `train_split` and `test_split` dictionaries.
+- Input/output features are plain strings corresponding to CGNS paths.
+- Splits are represented by `train_split` and `test_split` dictionaries and are accessed directly as model attributes.
 - Split values can be explicit index sequences or the string `"all"`.
 - `add_input_features(...)` and `add_output_features(...)` accept either a
-  single string or a sequence of strings.
+  single string or a sequence of strings, and can be called after initialization.
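Since the problem identifier is the YAML filename stem, the identifiers available in a dataset can be listed without loading any definition. The following is a minimal sketch of that naming rule using only `pathlib`; `problem_definition_ids` is a hypothetical helper, it assumes definitions use the `.yaml` extension as in the examples above, and it does not reproduce the storage loader.

```python
import tempfile
from pathlib import Path


def problem_definition_ids(dataset_root: Path) -> list[str]:
    """List problem identifiers as the stems of problem_definitions/*.yaml."""
    pb_dir = dataset_root / "problem_definitions"
    if not pb_dir.is_dir():
        # The directory is optional, so a missing one simply means no definitions.
        return []
    return sorted(path.stem for path in pb_dir.glob("*.yaml"))


# Demonstration on a throwaway directory with two empty definition files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "problem_definitions").mkdir()
    (root / "problem_definitions" / "regression_1.yaml").touch()
    (root / "problem_definitions" / "regression_500.yaml").touch()
    ids = problem_definition_ids(root)

print(ids)  # ['regression_1', 'regression_500']
```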
@@ -12,8 +12,8 @@ You can find here detailed examples for different parts of plaid, explained in J
 * [Sample](notebooks/containers/sample_example.md)
 * [Problem definition](notebooks/problem_definition_example.md)
 * [Infos](notebooks/infos_example.md)
-* [Downloadable samples](notebooks/downloadable_example/sample_example.md)
 
 ## Tutorials
 
+* [Downloadable samples](tutorials/downloadable_example.md)
 * [Conversion tutorial](tutorials/storage.md)
@@ -0,0 +1,22 @@
+---
+title: Downloadable samples
+---
+
+# Downloadable samples
+
+## First retrieval
+
+Retrieve the example samples as follows:
+
+```python
+from plaid.downloadable_examples import AVAILABLE_EXAMPLES, samples
+
+print(AVAILABLE_EXAMPLES)
+print("samples.vki_ls59:", samples.vki_ls59)
+```
+
+The first access to `samples.vki_ls59` triggers a download and takes a few seconds.
+
+## Cached retrieval
+
+Subsequent accesses are near-instantaneous because they reuse the cached sample.
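The download-once, reuse-afterwards behaviour can be illustrated with a generic lazy-loading pattern. The following is a sketch under stated assumptions, not the implementation of `plaid.downloadable_examples`: `LazyExamples` and `fake_download` are hypothetical names, and the "download" is simulated by a plain function.

```python
class LazyExamples:
    """Hypothetical stand-in that loads each example on first attribute access."""

    def __init__(self, loaders):
        self._loaders = loaders  # example name -> zero-argument callable
        self._cache = {}

    def __getattr__(self, name):
        # Only invoked when normal attribute lookup fails, i.e. for example names.
        loaders = self.__dict__.get("_loaders", {})
        if name not in loaders:
            raise AttributeError(name)
        cache = self.__dict__["_cache"]
        if name not in cache:
            # Slow step, performed once per example (a download in the real case).
            cache[name] = loaders[name]()
        return cache[name]


calls = []


def fake_download():
    """Simulated download that records how often it runs."""
    calls.append("download")
    return {"name": "vki_ls59"}


examples = LazyExamples({"vki_ls59": fake_download})
first = examples.vki_ls59   # triggers the simulated download
second = examples.vki_ls59  # served from the in-memory cache
```

Here the loader runs exactly once and both accesses return the same object, which is the behaviour the two sections above describe.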