63 changes: 63 additions & 0 deletions docs/source/concepts/check.md
@@ -0,0 +1,63 @@
---
title: Dataset check
---

# Dataset check

`plaid-check` validates the integrity of a local PLAID dataset.

It checks:

- required on-disk files and directories;
- `infos.yaml`, metadata and split sample counts;
- sample conversion through the declared storage backend;
- invalid numeric values such as `None`, empty arrays, `NaN` and `Inf`;
- duplicated samples;
- optional `problem_definitions/` feature names, splits and indices.

## Basic usage

```bash
plaid-check /path/to/plaid_dataset
```

A valid dataset prints an `[OK]` line and returns exit code `0`.

## Options

Check only selected splits:

```bash
plaid-check /path/to/plaid_dataset --split train --split test
```

Check only selected problem definitions:

```bash
plaid-check /path/to/plaid_dataset --problem-definition regression_500
```

Emit a machine-readable report:

```bash
plaid-check /path/to/plaid_dataset --json
```

Make warnings fail the command:

```bash
plaid-check /path/to/plaid_dataset --strict
```

## Report format

Each message carries a severity, a stable code, a location and a short
description. The command exits with code `1` when it reports errors; warnings
produce exit code `2` only in strict mode (`--strict`).
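
The exit codes above can be consumed from a script. The following is a minimal sketch, assuming `plaid-check` is available on `PATH`; the helper names (`build_plaid_check_command`, `describe_exit_code`, `run_plaid_check`) and the label strings are illustrative and not part of PLAID. Only the flags documented on this page are used.

```python
from __future__ import annotations

import subprocess


def build_plaid_check_command(
    dataset_path: str,
    splits: tuple[str, ...] = (),
    strict: bool = False,
    json_report: bool = False,
) -> list[str]:
    """Assemble a plaid-check command line from the documented options."""
    cmd = ["plaid-check", dataset_path]
    for split in splits:
        cmd += ["--split", split]  # the flag is repeated once per split
    if json_report:
        cmd.append("--json")
    if strict:
        cmd.append("--strict")
    return cmd


def describe_exit_code(code: int) -> str:
    """Map an exit code to a label (labels are illustrative)."""
    if code == 0:
        return "ok"
    if code == 1:
        return "errors"
    if code == 2:
        return "warnings (strict mode)"
    return f"unexpected exit code {code}"


def run_plaid_check(dataset_path: str, strict: bool = False) -> str:
    """Run plaid-check (assumed to be on PATH) and summarize the result."""
    cmd = build_plaid_check_command(dataset_path, strict=strict)
    completed = subprocess.run(cmd, check=False)  # do not raise on non-zero
    return describe_exit_code(completed.returncode)
```

Calling `run_plaid_check("/path/to/plaid_dataset", strict=True)` would then return one of the labels above, which is convenient in CI jobs that should fail on warnings.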

## Notes

- For CGNS datasets, only `infos.yaml` and `data/` are required at the root.
- For other backends, metadata files and `constants/` are checked as well.
- Without `--problem-definition`, all discovered problem definitions are checked.
- In JSON mode, progress bars are disabled to keep the output parseable.
14 changes: 10 additions & 4 deletions docs/source/concepts/dataset.md
@@ -117,12 +117,18 @@ dictionary with the same schema:

```python
from plaid import ProblemDefinition
from plaid.infos import DataDescription, Infos, Legal
from plaid.infos import DataDescription, Infos
from plaid.storage import save_to_disk

pb_def = ProblemDefinition(name="regression_1")
pb_def = ProblemDefinition(
input_features=["Global/input"],
output_features=["Base/Zone/VertexFields/pressure"],
train_split={"train": [0, 1, 2]},
test_split={"test": "all"},
)
infos = Infos(
legal=Legal(owner="CompanyX", license="proprietary"),
owner="CompanyX",
license="proprietary",
data_description=DataDescription(number_of_samples=3),
num_samples={"train": 3},
)
@@ -132,7 +138,7 @@ save_to_disk(
sample_constructor=sample_constructor,
ids={"train": [0, 1, 2]},
infos=infos,
pb_defs=pb_def,
pb_defs={"regression_1": pb_def},
)
```

52 changes: 32 additions & 20 deletions docs/source/concepts/infos.md
@@ -9,21 +9,23 @@ root of a PLAID dataset.

In the current API, infos stores:

- `legal`, with required `owner` and `license` entries
- `owner` and `license`, required string entries describing the dataset
ownership and licensing
- `data_production`, for optional production context such as simulator,
hardware, contact, or location
- `data_description`, for optional dataset description entries such as the
number of samples, DOE, inputs, and outputs
- `num_samples`, as a dictionary keyed by split name
- `storage_backend`
- `num_samples`, as a dictionary keyed by split name, populated by storage writers
- `storage_backend`, as a storage backend identifier, populated by storage writers

## Basic usage

```python
from plaid.infos import DataProduction, Infos, Legal
from plaid.infos import DataProduction, Infos

infos = Infos(
legal=Legal(owner="Safran", license="proprietary"),
owner="Safran",
license="proprietary",
data_production=DataProduction(
type="simulation",
physics="fluid dynamics",
@@ -36,26 +38,35 @@ infos = Infos(
Infos can also be built from a plain mapping, for instance after reading YAML:

```python
infos = Infos.from_mapping(
infos = Infos.model_validate(
{
"legal": {
"owner": "Safran",
"license": "proprietary",
},
"owner": "Safran",
"license": "proprietary",
}
)
```

`num_samples` and `storage_backend` are derived from the chosen storage backend
and the saved split contents. They can be omitted when creating an `Infos`
object that will later be passed to `save_to_disk(...)`; PLAID fills them before
writing `infos.yaml`.

## Loading from disk

Load infos from a dataset path or directly from an `infos.yaml` file:
Load infos from a complete dataset path or directly from an `infos.yaml` file:

```python
infos = Infos.from_path("/path/to/plaid_dataset")
```

When a directory is provided, `Infos.from_path(...)` looks for `infos.yaml`
inside that directory.
inside that directory. By default, loading from disk requires the persisted
storage metadata (`num_samples` and `storage_backend`) to be present. To load a
draft infos file that has not been produced by `save_to_disk(...)`, use:

```python
infos = Infos.from_path("/path/to/draft/infos.yaml", require_persisted=False)
```

## Saving

@@ -68,20 +79,21 @@ infos.save_to_file("/path/to/plaid_dataset/infos.yaml")
If a directory path is provided, the file is saved as `infos.yaml` inside that
directory.

## Mapping-style access
## Typed access and serialization

`Infos` provides read-only mapping-style helpers for compatibility with code
expecting a YAML-like dictionary:
`Infos` is a Pydantic model. Access metadata through typed attributes and use
Pydantic serialization when a plain mapping is needed:

```python
owner = infos["legal"]["owner"]
backend = infos.get("storage_backend")
payload = infos.to_dict()
owner = infos.owner
backend = infos.storage_backend
payload = infos.model_dump(exclude_none=True)
```

## Notes

- `legal.owner` and `legal.license` are required when validating complete infos.
- `num_samples` and `storage_backend` are automatically filled when `save_to_disk(..., infos=...)` is called before writing `infos.yaml`.
- `owner` and `license` are required when creating infos.
- `num_samples` and `storage_backend` are required when loading persisted dataset infos.
- `num_samples` and `storage_backend` are overwritten with the actual saved dataset values when `save_to_disk(..., infos=...)` is called before writing `infos.yaml`.
- Unknown keys are rejected during validation.
- `save_to_file(...)` writes YAML using the standard infos key order.
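
The validation rules in these notes can be illustrated without PLAID installed. This is a plain-Python sketch, not the Pydantic model itself: the function name `check_infos_mapping` and the message strings are hypothetical, and only top-level keys named on this page are considered.

```python
from __future__ import annotations

# Top-level keys named in this page; nested content is not inspected here.
REQUIRED_KEYS = {"owner", "license"}
PERSISTED_KEYS = {"num_samples", "storage_backend"}
OPTIONAL_KEYS = {"data_production", "data_description"}


def check_infos_mapping(mapping: dict, require_persisted: bool = True) -> list[str]:
    """Return a list of problems found in a top-level infos mapping."""
    problems = []
    allowed = REQUIRED_KEYS | PERSISTED_KEYS | OPTIONAL_KEYS
    # Unknown keys are rejected, mirroring the note above.
    for key in sorted(set(mapping) - allowed):
        problems.append(f"unknown key: {key}")
    # Persisted storage metadata is only required for saved datasets.
    required = REQUIRED_KEYS | (PERSISTED_KEYS if require_persisted else set())
    for key in sorted(required - set(mapping)):
        problems.append(f"missing key: {key}")
    return problems
```

With this sketch, a draft mapping containing only `owner` and `license` passes when `require_persisted=False`, while a mapping still using the old nested `legal` key is reported as unknown.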
64 changes: 42 additions & 22 deletions docs/source/concepts/problem_definition.md
@@ -8,42 +8,59 @@ title: Problem definition

In the current API, a problem definition stores:

- `name`
- `input_features` (`list[str]`)
- `output_features` (`list[str]`)
- `train_split` and `test_split`
- `input_features` (`list[str]`, required and non-empty)
- `output_features` (`list[str]`, required and non-empty)
- `train_split` and `test_split` (required)

The problem identifier is not stored in the model. On disk, it is the YAML
filename stem; in memory, it is the dictionary key used for the definition.

## Basic usage

```python
from plaid import ProblemDefinition

pb = ProblemDefinition(name="regression_1")

pb.add_input_features([
"Base/Zone/GridCoordinates/CoordinateX",
"Base/Zone/GridCoordinates/CoordinateY",
])
pb.add_output_features([
"Base/Zone/VertexFields/pressure",
])

pb.train_split = {"train": [0, 1, 2]}
pb.test_split = {"test": [3, 4]}
pb = ProblemDefinition(
input_features=[
"Base/Zone/GridCoordinates/CoordinateX",
"Base/Zone/GridCoordinates/CoordinateY",
],
output_features=[
"Base/Zone/VertexFields/pressure",
],
train_split={"train": [0, 1, 2]},
test_split={"test": [3, 4]},
)
```

Feature lists are normalized by the model: entries are converted to strings,
sorted, and checked for duplicates.
sorted, checked for duplicates, and rejected if empty.
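
As a rough illustration of that normalization, here is a standalone sketch. It is not the model's validator: the function name `normalize_features` is hypothetical, and it assumes that duplicates raise an error rather than being silently dropped.

```python
from __future__ import annotations


def normalize_features(features) -> list[str]:
    """Convert entries to strings, reject empty or duplicated lists, sort."""
    names = [str(feature) for feature in features]
    if not names:
        raise ValueError("feature list must not be empty")
    # Assumption: a duplicated entry is an error, not something to deduplicate.
    duplicates = sorted({name for name in names if names.count(name) > 1})
    if duplicates:
        raise ValueError(f"duplicated features: {duplicates}")
    return sorted(names)
```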

Problem definitions can also be validated from a plain mapping, for instance
after reading YAML:

```python
pb = ProblemDefinition.model_validate(
{
"input_features": ["Base/Zone/GridCoordinates/CoordinateX"],
"output_features": ["Base/Zone/VertexFields/pressure"],
"train_split": {"train": [0, 1, 2]},
"test_split": {"test": [3, 4]},
}
)
```

## Loading from disk

Load a definition from a dataset path:

```python
pb = ProblemDefinition.from_path("/path/to/plaid_dataset", name="regression_1")
pb = ProblemDefinition.from_path(
"/path/to/plaid_dataset/problem_definitions/regression_1.yaml"
)
```

At storage level, problem definitions are loaded as a dictionary keyed by name:
At storage level, problem definitions are loaded as a dictionary keyed by YAML filename stem:

```python
from plaid.storage import load_problem_definitions_from_disk
@@ -60,10 +77,13 @@ Save to YAML:
pb.save_to_file("problem_definitions/regression_1.yaml")
```

This writes no `name:` key; `regression_1` is inferred from the filename by the
storage loader.
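
The filename-stem convention can be sketched with `pathlib` alone. The helper below is hypothetical and does not parse the YAML content; it assumes definitions live in `problem_definitions/` with a `.yaml` extension, as in the examples on this page.

```python
from __future__ import annotations

from pathlib import Path


def discover_problem_definitions(dataset_dir) -> dict[str, Path]:
    """Map each problem identifier (the filename stem) to its YAML file."""
    pb_dir = Path(dataset_dir) / "problem_definitions"
    # "regression_1.yaml" is keyed as "regression_1"; other files are ignored.
    return {path.stem: path for path in sorted(pb_dir.glob("*.yaml"))}
```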

## Notes

- Input/output features are plain strings correspond to CGNS paths.
- Splits are represented by `train_split` and `test_split` dictionaries.
- Input/output features are plain strings corresponding to CGNS paths.
- Splits are represented by `train_split` and `test_split` dictionaries and are accessed directly as model attributes.
- Split values can be explicit index sequences or the string `"all"`.
- `add_input_features(...)` and `add_output_features(...)` accept either a
single string or a sequence of strings.
single string or a sequence of strings after initialization.
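
The two accepted forms of a split value can be handled as in the sketch below. The helper `resolve_split_indices` is hypothetical, and it assumes that `"all"` stands for every index from `0` to `num_samples - 1` of the named split; PLAID's own resolution may differ.

```python
from __future__ import annotations


def resolve_split_indices(value, num_samples: int) -> list[int]:
    """Expand a split value into an explicit list of sample indices."""
    if isinstance(value, str):
        if value != "all":
            raise ValueError(f"unsupported split value: {value!r}")
        # Assumption: "all" covers indices 0 .. num_samples - 1.
        return list(range(num_samples))
    return [int(index) for index in value]
```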
2 changes: 1 addition & 1 deletion docs/source/examples_tutorials.md
@@ -12,8 +12,8 @@ You can find here detailed examples for different parts of plaid, explained in J
* [Sample](notebooks/containers/sample_example.md)
* [Problem definition](notebooks/problem_definition_example.md)
* [Infos](notebooks/infos_example.md)
* [Downloadable samples](notebooks/downloadable_example/sample_example.md)

## Tutorials

* [Downloadable samples](tutorials/downloadable_example.md)
* [Conversion tutorial](tutorials/storage.md)
22 changes: 22 additions & 0 deletions docs/source/tutorials/downloadable_example.md
@@ -0,0 +1,22 @@
---
title: Downloadable samples
---

# Downloadable samples

## First retrieval

Retrieving sample examples is as easy as:

```python
from plaid.downloadable_examples import AVAILABLE_EXAMPLES, samples

print(AVAILABLE_EXAMPLES)
print("samples.vki_ls59:", samples.vki_ls59)
```

The first call to `samples.vki_ls59` triggers a download and takes a few seconds.

## Cached retrieval

Subsequent calls are instantaneous because they reuse the cached sample.
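
The behaviour described here is a lazy, cached attribute. The sketch below shows the general pattern with `functools.cached_property`; the class `LazySamples` is a hypothetical stand-in and not how `plaid.downloadable_examples` is implemented, and the counter replaces the real download.

```python
from functools import cached_property


class LazySamples:
    """Hypothetical stand-in illustrating first-access download plus caching."""

    def __init__(self):
        self.download_count = 0

    @cached_property
    def vki_ls59(self):
        # Stands in for the slow download; runs only on first access.
        self.download_count += 1
        return {"name": "vki_ls59"}
```

Accessing `vki_ls59` twice on the same instance returns the same object and increments the counter only once.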