
Commit bfc53ac

ko3n1g and claude authored
docs: progressive learning structure with per-executor guides and architecture reference (#467)
* docs: progressive learning structure with per-executor guides and architecture reference

  Restructures the guides into a layered learning path so users get something working first and deepen understanding step by step:

  - Add quickstart.md: 5-minute local run using run.Script + LocalExecutor
  - Add executors/ directory with per-executor guides (local, docker, slurm, skypilot, dgxcloud, lepton, kuberay), each with prerequisites, annotated config, and an end-to-end workflow
  - Add architecture.md: Experiment call chain, Executor→TorchX scheduler mapping, metadata layout, and contributor steps for adding a new executor
  - Update execution.md: remove per-executor sections (now in executors/); add links to executors/ and architecture.md
  - Update management.md: add "Putting it all together" e2e section
  - Update ray.md: add "When to use Ray" decision table, per-backend prerequisites, and cross-links to executor guides before each quick-start
  - Update index.md: reorder toctree to match the learning path

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
  Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix(docs): use GitHub URL for kubeflow example link to avoid MyST xref warning

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
  Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix(docs): update broken SkyPilot managed jobs URL

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
  Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent c2b4b98 commit bfc53ac

15 files changed

Lines changed: 1176 additions & 305 deletions

docs/guides/architecture.md

Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
# Architecture

> **Audience**: Contributors adding new executors, and users who want to understand why something is failing or how to extend NeMo-Run.
>
> **Prerequisite**: Read [Execution](execution.md), at least one [executor guide](executors/index.md), and [Management](management.md) first.

## `run.run()` vs `run.Experiment`

`run.run()` is a thin convenience wrapper. Internally it creates an `Experiment` with a single task and `detach=False`:

```python
import nemo_run as run

# `task` and `executor` are assumed to be defined elsewhere.
# These two are equivalent:
run.run(task, executor=executor)

with run.Experiment("untitled") as exp:
    exp.add(task, executor=executor)
    exp.run(detach=False)
```

All the mechanics described below apply to both.

---

## Call chain

```{mermaid}
flowchart TD
    A["exp.run()"] --> B["Experiment._prepare()"]
    B --> C["Job.prepare()"]
    C --> D["executor.assign(exp_id, exp_dir, task_id, task_dir)"]
    C --> E["executor.create_job_dir()"]
    C --> F["package(task, executor) → AppDef + Role(s)"]
    A --> G["Job.launch(runner)"]
    G --> H["runner.dryrun(AppDef, scheduler_name, cfg=executor)"]
    H --> I["scheduler.submit_dryrun(AppDef, executor)"]
    G --> J["runner.schedule(dryrun_info)"]
    J --> K["scheduler.schedule(dryrun_info) → AppHandle"]
```

1. `_prepare()` calls `Job.prepare()` for each task, which assigns experiment/job directories, syncs code, and builds the TorchX `AppDef`.
2. `Job.launch(runner)` calls `runner.dryrun()` to validate the submission plan, then `runner.schedule()` to submit it.
3. The `AppHandle` returned by `scheduler.schedule()` is stored in the experiment metadata so `Experiment.from_id()` can reconnect.
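
The ordering in the steps above can be illustrated with a toy model. The sketch below is not NeMo-Run or TorchX code: `ToyScheduler`, `ToyJob`, and `run_experiment` are hypothetical stand-ins that only mirror the prepare, dry-run, schedule sequence and the shape of the returned handle. It assumes all jobs are prepared before any is launched, which is how the diagram reads but is not confirmed here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyScheduler:
    """Hypothetical stand-in for a TorchX scheduler."""

    name: str
    submitted: list = field(default_factory=list)

    def submit_dryrun(self, app: dict, cfg: dict) -> dict:
        # Validate the app and build a submission plan without running it.
        if not app.get("roles"):
            raise ValueError("app needs at least one role")
        return {"app": app, "cfg": cfg}

    def schedule(self, dryrun_info: dict) -> str:
        # "Submit" the plan and return an app id.
        self.submitted.append(dryrun_info)
        return f"app-{len(self.submitted)}"


@dataclass
class ToyJob:
    """Hypothetical stand-in for a job inside an experiment."""

    task_id: str
    app: Optional[dict] = None
    handle: Optional[str] = None

    def prepare(self, exp_dir: str) -> None:
        # Assign a job directory and build a minimal app definition.
        self.app = {
            "name": self.task_id,
            "dir": f"{exp_dir}/{self.task_id}",
            "roles": [{"entrypoint": "python"}],
        }

    def launch(self, scheduler: ToyScheduler, cfg: dict) -> None:
        # Dry-run first, then schedule; keep the handle for reconnecting.
        info = scheduler.submit_dryrun(self.app, cfg)
        app_id = scheduler.schedule(info)
        self.handle = f"{scheduler.name}://toy/{app_id}"


def run_experiment(jobs: list, scheduler: ToyScheduler, exp_dir: str) -> dict:
    for job in jobs:
        job.prepare(exp_dir)
    for job in jobs:
        job.launch(scheduler, cfg={})
    return {job.task_id: job.handle for job in jobs}
```

Calling `run_experiment([ToyJob("a"), ToyJob("b")], ToyScheduler("local_persistent"), "/tmp/exp")` returns one handle per task, which is the piece of state the real experiment metadata needs in order to reattach later.
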

---

## Executor → TorchX scheduler mapping

Each executor is backed by a TorchX scheduler registered as an entry point in `pyproject.toml` under `torchx.schedulers`:

| Executor | TorchX Scheduler |
|----------|-----------------|
| `LocalExecutor` | `local_persistent` |
| `DockerExecutor` | `docker_persistent` |
| `SlurmExecutor` | `slurm_tunnel` |
| `SkypilotExecutor` | `skypilot` |
| `SkypilotJobsExecutor` | `skypilot_jobs` |
| `DGXCloudExecutor` | `dgx_cloud` |
| `LeptonExecutor` | `lepton` |

Schedulers are discovered at runtime via `torchx.schedulers.get_scheduler_factories()`.

---

## Key TorchX types

| Type | What it represents |
|------|--------------------|
| `AppDef` | Full application: list of `Role`s + metadata |
| `Role` | One execution unit: entrypoint, args, env, image, num_replicas, resources |
| `AppDryRunInfo` | Validated `AppDef` + submission plan (can be inspected without running) |
| `AppHandle` | Running job ID: `"{scheduler}://{runner}/{app_id}"` |
| `AppState` | Status enum; values include `RUNNING`, `SUCCEEDED`, `FAILED`, `CANCELLED`, `UNKNOWN` |
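
The `AppHandle` format in the table can be split into its three parts with plain string operations. This is a minimal sketch for reading handles found in logs or metadata: `ParsedHandle` and `parse_handle` are hypothetical names, and in real code the handle-parsing helper that TorchX itself provides is the better choice.

```python
from typing import NamedTuple


class ParsedHandle(NamedTuple):
    scheduler: str
    runner: str
    app_id: str


def parse_handle(handle: str) -> ParsedHandle:
    """Split "{scheduler}://{runner}/{app_id}" into its components."""
    scheduler, sep, rest = handle.partition("://")
    if not sep or not scheduler:
        raise ValueError(f"missing scheduler in handle: {handle!r}")
    # Only the first "/" separates runner from app_id; the app id may
    # itself contain further slashes.
    runner, sep, app_id = rest.partition("/")
    if not sep or not app_id:
        raise ValueError(f"missing app id in handle: {handle!r}")
    return ParsedHandle(scheduler, runner, app_id)
```

For example, `parse_handle("slurm_tunnel://nemo_run/12345")` yields the scheduler name `slurm_tunnel`, the runner name `nemo_run`, and the app id `12345`. The runner name in that example is made up for illustration.
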

---

## How `Executor` fields map to TorchX concepts

| Executor field | TorchX mapping |
|----------------|---------------|
| `nnodes()` + `nproc_per_node()` | `Role.num_replicas` + replica topology |
| `launcher` | `AppDef` structure (`torchrun` / `ft` / basic entrypoint) |
| `retries` | `Role.max_retries` |
| `env_vars` | `Role.env` |
| `packager` | Pre-launch code sync strategy |
| `assign(exp_id, exp_dir, task_id, task_dir)` | Sets path metadata consumed by the scheduler |

---

## Metadata storage layout

All experiment metadata is written under `NEMORUN_HOME` (default `~/.nemo_run`):

```
~/.nemo_run/experiments/{title}/{title}_{exp_id}/
├── {task_id}/
│   ├── configs/
│   │   ├── {task_id}_executor.yaml   # serialised executor config
│   │   ├── {task_id}_fn_or_script    # zlib-JSON encoded task
│   │   └── {task_id}_packager        # zlib-JSON encoded packager
│   └── scripts/{task_id}.sh          # generated sbatch/shell script
└── .tasks                            # serialised Job metadata (JSON)
```

`Experiment.from_id()` reads `.tasks` to reconstruct the experiment and reattach to live jobs via the stored `AppHandle`.
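
When debugging a failed reconnect, it is useful to know where these files should be on disk. The sketch below builds the paths from the layout shown above, honouring `NEMORUN_HOME` when it is set. The helper names are hypothetical, and the sketch assumes the directory naming is exactly `{title}/{title}_{exp_id}` as drawn in the tree.

```python
import os
from pathlib import Path


def nemorun_home() -> Path:
    """Resolve NEMORUN_HOME, falling back to the documented default."""
    return Path(os.environ.get("NEMORUN_HOME", "~/.nemo_run")).expanduser()


def experiment_dir(title: str, exp_id: str) -> Path:
    """Directory that holds all metadata for one experiment run."""
    return nemorun_home() / "experiments" / title / f"{title}_{exp_id}"


def tasks_file(title: str, exp_id: str) -> Path:
    """Path of the serialised Job metadata file."""
    return experiment_dir(title, exp_id) / ".tasks"


def executor_config(title: str, exp_id: str, task_id: str) -> Path:
    """Path of the serialised executor config for one task."""
    configs = experiment_dir(title, exp_id) / task_id / "configs"
    return configs / f"{task_id}_executor.yaml"
```

Checking `tasks_file(title, exp_id).exists()` is a quick way to tell whether an experiment directory is complete enough to be reconstructed.
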

---

## Adding a new executor

1. **Subclass `Executor`** in `nemo_run/core/execution/`:

   ```python
   from dataclasses import dataclass

   from nemo_run.core.execution.base import Executor

   @dataclass
   class MyExecutor(Executor):
       my_param: str = "default"
       ...
   ```

2. **Implement a TorchX `Scheduler`** in `nemo_run/run/torchx_backend/schedulers/`:

   ```python
   from torchx.schedulers import Scheduler

   class MyScheduler(Scheduler):
       def submit_dryrun(self, app, cfg): ...
       def schedule(self, dryrun_info): ...
       def describe(self, app_id): ...
       def cancel(self, app_id): ...
   ```

3. **Register the scheduler as an entry point** in `pyproject.toml`:

   ```toml
   [project.entry-points."torchx.schedulers"]
   my_scheduler = "nemo_run.run.torchx_backend.schedulers.my:create_scheduler"
   ```

4. **Add to `EXECUTOR_MAPPING`** in `nemo_run/run/torchx_backend/schedulers/api.py`:

   ```python
   EXECUTOR_MAPPING = {
       # ... existing entries ...
       MyExecutor: "my_scheduler",
   }
   ```
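
To see why step 4 matters, consider how a mapping keyed by executor class can be used to pick a scheduler name. The sketch below is a self-contained toy: `ToyExecutor`, `ToyLocalExecutor`, `TOY_EXECUTOR_MAPPING`, and `scheduler_name_for` are hypothetical stand-ins, and the exact lookup NeMo-Run performs on `EXECUTOR_MAPPING` is an assumption, not something confirmed by this guide.

```python
from dataclasses import dataclass


@dataclass
class ToyExecutor:
    """Hypothetical stand-in for the Executor base class."""

    retries: int = 0


@dataclass
class ToyLocalExecutor(ToyExecutor):
    pass


@dataclass
class MyExecutor(ToyExecutor):
    my_param: str = "default"


# Toy counterpart of EXECUTOR_MAPPING: executor class -> scheduler name.
TOY_EXECUTOR_MAPPING = {
    ToyLocalExecutor: "local_persistent",
    MyExecutor: "my_scheduler",
}


def scheduler_name_for(executor: ToyExecutor) -> str:
    """Look up the scheduler name registered for this executor's exact class."""
    try:
        return TOY_EXECUTOR_MAPPING[type(executor)]
    except KeyError:
        raise ValueError(
            f"no scheduler registered for {type(executor).__name__}"
        ) from None
```

In this toy, an executor class that is missing from the mapping fails at lookup time with a clear error, which is the kind of failure to expect if step 4 is skipped. The scheduler name used as the value should match the entry-point name registered in step 3.
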
