Commit 8389f83

Implement a portable lockfile. (#743)
1 parent 4fbaba6 commit 8389f83

32 files changed

Lines changed: 1413 additions & 85 deletions

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -7,6 +7,10 @@ releases are available on [PyPI](https://pypi.org/project/pytask) and
 
 ## Unreleased
 
+- {pull}`743` adds the `pytask.lock` lockfile as the primary state backend with a
+  portable format and documentation. When no lockfile exists, pytask reads the legacy
+  SQLite state and writes `pytask.lock`; `pytask build` continues updating the legacy
+  database for downgrade compatibility.
 - {pull}`787` makes the `attributes` field mandatory on `PNode` and
   `PProvisionalNode`, and preserves existing node attributes when loading entries from
   the data catalog.

docs/source/how_to_guides/index.md

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ maxdepth: 1
 ---
 migrating_from_scripts_to_pytask
 interfaces_for_dependencies_products
+portability
 remote_files
 functional_interface
 capture_warnings
Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
+# Portability
+
+This guide explains what you need to do to move a pytask project between machines and
+why the lockfile is central to that process.
+
+```{seealso}
+The lockfile format and behavior are documented in the
+[reference guide](../reference_guides/lockfile.md).
+```
+
+## How to port a project
+
+Use this checklist when you move a project to another machine or environment.
+
+1. **Update state once on the source machine.**
+
+   Run a normal build so `pytask.lock` is up to date:
+
+   ```console
+   $ pytask build
+   ```
+
+   If you already have a recent lockfile and up-to-date outputs, you can skip this step.
+
+1. **Ship the right files.**
+
+   Commit `pytask.lock` to your repository and move it with the project. In practice,
+   you should move:
+
+   - the project files tracked in version control (source, configuration, data inputs
+     and `pytask.lock`)
+   - the build artifacts you want to reuse (often in `bld/` if you follow the tutorial
+     layout)
+   - the `.pytask` folder in case you are using the data catalog and it manages some of
+     the files
+
+1. **Recreate files outside the project.**
+
+   If you have files outside the project root (the folder with the `pyproject.toml`
+   file), you need to make sure that the same relative layout exists on the target
+   machine.
+
+1. **Run pytask on the target machine.**
+
+   When states match, tasks are skipped. When they differ, tasks run and the lockfile is
+   updated.
+
+## What makes a project portable
+
+There are two things that must stay stable across machines:
+
+First, task and node IDs must be stable. An ID is the unique identifier that ties a task
+or node to an entry in `pytask.lock`. pytask builds these IDs from project-relative
+paths anchored at the project root, so most users do not need to do anything. If you
+implement custom nodes, make sure their IDs remain project-relative and stable across
+machines.
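To illustrate the ID requirement, here is a minimal sketch of deriving a project-relative identifier from an absolute path. The helper name `portable_id` and the example paths are hypothetical, and this is not pytask's own implementation; it only shows why an ID anchored at the project root comes out the same on two differently laid out machines.

```python
from pathlib import PurePath, PurePosixPath, PureWindowsPath


def portable_id(path: PurePath, root: PurePath) -> str:
    """Return ``path`` relative to ``root``, written with forward slashes.

    Hypothetical helper for illustration; raises ValueError if ``path`` is
    not located below ``root``.
    """
    return path.relative_to(root).as_posix()


# The same project checked out on a Linux and on a Windows machine.
linux_id = portable_id(
    PurePosixPath("/home/alice/project/data/raw/input.csv"),
    PurePosixPath("/home/alice/project"),
)
windows_id = portable_id(
    PureWindowsPath(r"C:\Users\bob\project\data\raw\input.csv"),
    PureWindowsPath(r"C:\Users\bob\project"),
)

# Both machines produce the identical key.
assert linux_id == windows_id == "data/raw/input.csv"
```

An ID built from the absolute path would differ between the two machines, so the stored entry would never be matched on the target machine.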
+
+Second, state values must be portable. The lockfile stores opaque state strings from
+`PNode.state()` and `PTask.state()`, and pytask uses them to decide whether a task is up
+to date. Content hashes are portable; timestamps or absolute paths are not. This mostly
+matters when you define custom nodes or custom hash functions.
+
+## Tips for stable state values
+
+- Prefer file content hashes over timestamps for custom nodes.
+- For `PythonNode` values that are not natively stable, provide a custom hash function.
+- Avoid machine-specific paths or timestamps in custom `state()` implementations.
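The first tip can be made concrete with a small comparison. The two functions below are hypothetical stand-ins for what a custom node's `state()` might return; they are not part of pytask. The sketch assumes that a missing file is reported as `None` and uses SHA-256 as the content hash.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Optional


def content_state(path: Path) -> Optional[str]:
    """Hash the file's bytes; return None if the file does not exist."""
    if not path.exists():
        return None
    return hashlib.sha256(path.read_bytes()).hexdigest()


def timestamp_state(path: Path) -> Optional[str]:
    """Use the modification time as state; return None if the file is missing."""
    if not path.exists():
        return None
    return str(path.stat().st_mtime)


with tempfile.TemporaryDirectory() as tmp:
    # Simulate the same input file on two machines with different mtimes.
    source = Path(tmp, "machine_a", "input.csv")
    target = Path(tmp, "machine_b", "input.csv")
    for path, mtime in ((source, 1_000_000_000), (target, 1_500_000_000)):
        path.parent.mkdir()
        path.write_text("a,b\n1,2\n")
        os.utime(path, (mtime, mtime))

    same_content_state = content_state(source) == content_state(target)
    same_timestamp_state = timestamp_state(source) == timestamp_state(target)

assert same_content_state  # identical bytes give identical state
assert not same_timestamp_state  # copying the file changed the timestamp
```

With the content hash, the task that depends on `input.csv` would be considered unchanged after the move. With the timestamp it would be rerun.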
+
+```{seealso}
+For custom nodes, see [Writing custom nodes](writing_custom_nodes.md).
+For hashing guidance, see
+[Hashing inputs of tasks](hashing_inputs_of_tasks.md).
+```
+
+## Cleaning up the lockfile
+
+`pytask.lock` is updated incrementally. Entries are only replaced when the corresponding
+tasks run. If tasks are removed or renamed, their old entries remain as stale data and
+are ignored.
+
+To clean up stale entries without deleting the file, run:
+
+```console
+$ pytask build --clean-lockfile
+```
+
+This rewrites the lockfile after a successful build with only the currently collected
+tasks and their current state values.
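Conceptually, the cleanup is a filter over the `task` entries. The sketch below assumes the lockfile has already been parsed into a dictionary shaped like the example in the reference guide. The function name `drop_stale_entries` is hypothetical, and the sketch covers only the filtering step, not the recomputation of state values.

```python
def drop_stale_entries(lock: dict, collected_ids: set) -> dict:
    """Return a copy of the lock data that keeps only collected tasks."""
    kept = [entry for entry in lock.get("task", []) if entry["id"] in collected_ids]
    return {**lock, "task": kept}


lock = {
    "lock-version": "1",
    "task": [
        {"id": "src/tasks/data.py::task_clean_data", "state": "f9e8"},
        # Left over from a task that was renamed in the meantime.
        {"id": "src/tasks/data.py::task_old_name", "state": "0a1b"},
    ],
}

cleaned = drop_stale_entries(lock, {"src/tasks/data.py::task_clean_data"})
remaining_ids = [entry["id"] for entry in cleaned["task"]]
assert remaining_ids == ["src/tasks/data.py::task_clean_data"]
```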

docs/source/how_to_guides/writing_custom_nodes.md

Lines changed: 7 additions & 0 deletions
@@ -89,6 +89,13 @@ Here are some explanations.
   signature is a hash and a unique identifier for the node. For most nodes it will be a
   hash of the path or the name.
 
+- `signature` and lockfile `id` are different concepts.
+
+  - `signature` is the runtime identity in pytask's in-memory DAG.
+  - lockfile `id` is the portable key stored in `pytask.lock`.
+
+  For custom nodes, make sure the lockfile id stays stable and unique within a task.
+
 - The classmethod {meth}`~pytask.PickleNode.from_path` is a convenient method to
   instantiate the class.
 

docs/source/reference_guides/configuration.md

Lines changed: 7 additions & 5 deletions
@@ -44,11 +44,13 @@ are welcome to also support macOS.
 
 ````{confval} database_url
 
-pytask uses a database to keep track of tasks, products, and dependencies over runs. By
-default, it will create an SQLite database in the project's root directory called
-`.pytask/pytask.sqlite3`. If you want to use a different name or a different dialect
-[supported by sqlalchemy](https://docs.sqlalchemy.org/en/latest/core/engines.html#backend-specific-urls),
-use either {option}`pytask build --database-url` or `database_url` in the config.
+SQLite is the legacy state format. pytask uses `pytask.lock` as the primary state
+backend for change detection. When no lockfile exists, pytask reads the configured
+database and writes `pytask.lock`. For downgrade compatibility, `pytask build` also
+keeps the legacy database state updated.
+
+The `database_url` option remains for backward compatibility and controls the legacy
+database location and dialect ([supported by sqlalchemy](https://docs.sqlalchemy.org/en/latest/core/engines.html#backend-specific-urls)).
 
 ```toml
 database_url = "sqlite:///.pytask/pytask.sqlite3"

docs/source/reference_guides/index.md

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ maxdepth: 1
 ---
 command_line_interface
 configuration
+lockfile
 hookspecs
 api
 ```
Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
+# The Lock File
+
+`pytask.lock` is the default state backend. It stores task state in a portable,
+git-friendly format so runs can be resumed or shared across machines.
+
+```{note}
+SQLite is the legacy format. When no lockfile exists, pytask reads the legacy database
+state and writes `pytask.lock`. The lockfile remains the primary backend for skip
+decisions, and `pytask build` also keeps the legacy database updated for downgrade
+compatibility.
+```
+
+## Example
+
+```toml
+# This file is automatically @generated by pytask.
+# It is not intended for manual editing.
+
+lock-version = "1"
+
+[[task]]
+id = "src/tasks/data.py::task_clean_data"
+state = "f9e8d7c6..."
+
+[task.depends_on]
+"data/raw/input.csv" = "e5f6g7h8..."
+
+[task.produces]
+"data/processed/clean.parquet" = "m3n4o5p6..."
+```
+
+## Behavior
+
+On each run, pytask:
+
+1. Reads `pytask.lock` (if present).
+1. Compares current dependency/product/task `state()` to stored `state`.
+1. Skips tasks whose states match; runs the rest.
+1. Updates `pytask.lock` after each completed task (atomic write).
+1. Updates `pytask.lock` after skipping unchanged tasks (unless `--dry-run` or
+   `--explain` is active).
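Steps 2 and 3 amount to comparing stored and current state strings. The following sketch assumes a task entry shaped like the TOML example above, already parsed into a dictionary. The function name `is_up_to_date` and the rule that a missing node has state `None` are assumptions for illustration, not pytask's actual code.

```python
from typing import Dict, Optional


def is_up_to_date(
    entry: Optional[dict],
    task_state: str,
    node_states: Dict[str, Optional[str]],
) -> bool:
    """Decide whether a task can be skipped based on its lockfile entry."""
    if entry is None:
        # The task has never been recorded, so it has to run.
        return False
    if entry["state"] != task_state:
        # The task itself (for example its source file) changed.
        return False
    stored = {**entry.get("depends_on", {}), **entry.get("produces", {})}
    # Every dependency and product must match its recorded state exactly.
    return stored == node_states


entry = {
    "id": "src/tasks/data.py::task_clean_data",
    "state": "f9e8",
    "depends_on": {"data/raw/input.csv": "e5f6"},
    "produces": {"data/processed/clean.parquet": "a3b4"},
}
unchanged = {"data/raw/input.csv": "e5f6", "data/processed/clean.parquet": "a3b4"}
changed_input = {**unchanged, "data/raw/input.csv": "ffff"}
missing_product = {**unchanged, "data/processed/clean.parquet": None}

assert is_up_to_date(entry, "f9e8", unchanged)
assert not is_up_to_date(entry, "f9e8", changed_input)
assert not is_up_to_date(entry, "f9e8", missing_product)
assert not is_up_to_date(None, "f9e8", unchanged)
```

Because the comparison only looks at opaque strings, the decision is only as portable as the `state()` implementations that produced them.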
+
+## Portability
+
+There are two portability concerns:
+
+1. **IDs**: Lockfile IDs must be project‑relative and stable across machines.
+1. **State values**: `state` is opaque; portability depends on each node’s `state()`
+   implementation. Content hashes are portable; timestamps are not.
+
+## Maintenance
+
+Use `pytask build --clean-lockfile` to rewrite `pytask.lock` with only currently
+collected tasks. The rewrite happens after a successful build and recomputes current
+state values without executing tasks again.
+
+## File Format Reference
+
+### Top-Level
+
+| Field          | Required | Description                      |
+| -------------- | -------- | -------------------------------- |
+| `lock-version` | Yes      | Schema version (currently `"1"`) |
+
+### Task Entry
+
+| Field        | Required | Description                   |
+| ------------ | -------- | ----------------------------- |
+| `id`         | Yes      | Portable task identifier      |
+| `state`      | Yes      | Opaque state string           |
+| `depends_on` | No       | Mapping from node id to state |
+| `produces`   | No       | Mapping from node id to state |
+
+### Dependency/Product Entry
+
+Node entries are stored as key-value pairs inside `depends_on` and `produces`, where the
+key is the node id and the value is the node state string.
+
+### IDs vs Signatures
+
+`id` in the lockfile is a portable identifier used to match entries across runs and
+machines. It is not the same as a node or task `signature` used internally in the DAG.
+
+- `signature`: runtime identity in the in-memory DAG.
+- `id`: portable lockfile key persisted to `pytask.lock`.
+
+When implementing custom nodes, keep lockfile IDs stable and unique within a task.
+
+## Version Compatibility
+
+Only lock-version `"1"` is supported. Lockfiles with any other version raise an error
+with a clear upgrade message.
+
+## Implementation Notes
+
+- The lockfile is encoded/decoded with `msgspec`’s TOML support.
+- Writes are atomic: pytask writes a temporary file and replaces `pytask.lock`.
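The write-then-replace pattern from the last note can be sketched with the standard library alone. The helper `atomic_write_text` is hypothetical and leaves out the `msgspec` encoding step. It assumes the temporary file lives in the same directory as the target, so that `os.replace` stays on one filesystem.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write ``text`` to a temporary file, then swap it into place."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temporary file if anything failed before the replace.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as tmp:
    lock_path = Path(tmp, "pytask.lock")
    atomic_write_text(lock_path, 'lock-version = "1"\n')
    atomic_write_text(
        lock_path,
        'lock-version = "1"\n\n[[task]]\nid = "task_a"\nstate = "abc"\n',
    )
    final_text = lock_path.read_text(encoding="utf-8")
    leftovers = [p.name for p in Path(tmp).iterdir() if p.name != "pytask.lock"]

assert final_text.endswith('state = "abc"\n')
assert leftovers == []
```

If the process is interrupted during the write, the previous `pytask.lock` is still intact, which matters because the file is updated after every completed task.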

docs/source/tutorials/making_tasks_persist.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ In this case, you can apply the {func}`@pytask.mark.persist <pytask.mark.persist
 decorator to the task, which will skip its execution as long as all products exist.
 
 Internally, the state of the dependencies, the source file, and the products are updated
-in the database such that the subsequent execution will skip the task successfully.
+in the lockfile such that the subsequent execution will skip the task successfully.
 
 ## When is this useful?
 

pyproject.toml

Lines changed: 5 additions & 1 deletion
@@ -30,6 +30,7 @@ dependencies = [
     "pluggy>=1.3.0",
     "rich>=13.8.0",
     "sqlalchemy>=2.0.31",
+    "msgspec[toml]>=0.18.6",
     'tomli>=1; python_version < "3.11"',
     'typing-extensions>=4.8.0; python_version < "3.11"',
     "universal-pathlib>=0.2.2",
@@ -54,7 +55,7 @@ docs = [
     "matplotlib>=3.5.0",
     "myst-parser>=3.0.0",
     "myst-nb>=1.2.0",
-    "sphinx>=7.0.0",
+    "sphinx>=7.0.0,<9.0.0",
     "sphinx-click>=6.0.0",
     "sphinx-copybutton>=0.5.2",
     "sphinx-design>=0.3",
@@ -138,6 +139,9 @@ ignore = [
 "tests/test_capture.py" = ["T201", "PT011"]
 "tests/*" = ["ANN", "D", "FBT", "PLR2004", "S101"]
 "tests/test_jupyter/*" = ["INP001"]
+"tests/_test_data/*" = ["INP001"]
+"tests/_test_data/*/*" = ["INP001"]
+"tests/_test_data/*/*/*" = ["INP001"]
 "scripts/*" = ["D", "INP001"]
 "docs/source/conf.py" = ["D401", "INP001"]
 "docs_src/*" = ["ARG001", "D", "INP001", "S301"]

src/_pytask/build.py

Lines changed: 10 additions & 0 deletions
@@ -72,6 +72,7 @@ def build( # noqa: PLR0913
     debug_pytask: bool = False,
     disable_warnings: bool = False,
     dry_run: bool = False,
+    clean_lockfile: bool = False,
     editor_url_scheme: Literal["no_link", "file", "vscode", "pycharm"] # noqa: PYI051
     | str = "file",
     explain: bool = False,
@@ -121,6 +122,8 @@ def build( # noqa: PLR0913
         Whether warnings should be disabled and not displayed.
     dry_run
         Whether a dry-run should be performed that shows which tasks need to be rerun.
+    clean_lockfile
+        Whether the lockfile should be rewritten to only include collected tasks.
     editor_url_scheme
         An url scheme that allows to click on task names, node names and filenames and
         jump right into you preferred editor to the right line.
@@ -189,6 +192,7 @@ def build( # noqa: PLR0913
             "debug_pytask": debug_pytask,
             "disable_warnings": disable_warnings,
             "dry_run": dry_run,
+            "clean_lockfile": clean_lockfile,
             "editor_url_scheme": editor_url_scheme,
             "explain": explain,
             "expression": expression,
@@ -305,6 +309,12 @@ def build( # noqa: PLR0913
     default=False,
     help="Execute a task even if it succeeded successfully before.",
 )
+@click.option(
+    "--clean-lockfile",
+    is_flag=True,
+    default=False,
+    help="Rewrite the lockfile with only currently collected tasks.",
+)
 @click.option(
     "--explain",
     is_flag=True,
