The `pipeline.mode` field is either `dev` or `prod`. Both modes use the same Python code and YAML shape; only execution differs (local files vs. a YT cluster).
**Start in dev**
No `secrets.env` requirement, fast feedback, artifacts under `.dev/`.
- Dev: tables and many operations are simulated on disk; good for development and CI without a cluster.
- Prod: real YT operations and uploads; needs credentials and a compatible cluster image.
**Prod needs credentials**
Set `YT_PROXY` and `YT_TOKEN` in `configs/secrets.env` before switching to `prod`.
- Tables: JSONL files under `.dev/`, keyed from logical YT-style paths.
- Map / vanilla style jobs: local subprocess plus a sandbox directory under `.dev/`.
- Code upload: skipped; Python runs from your working tree.
- YQL: translated through DuckDB where the dev client supports it (not identical to cluster YQL in every edge case).
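As a rough picture of what "JSONL files keyed from logical paths" can mean, here is a minimal sketch. It is not the dev client's code: `dev_file_for`, `write_rows`, and `read_rows` are hypothetical helpers, and the rule "take the last path component and add `.jsonl`" is an assumption that happens to match the `//tmp/my_pipeline/data` to `.dev/data.jsonl` case; the real mapping rules live in the client implementation.

```python
import json
from pathlib import Path, PurePosixPath


def dev_file_for(table_path: str, dev_root: Path = Path(".dev")) -> Path:
    """Hypothetical mapping: keep the last path component and add .jsonl."""
    name = PurePosixPath(table_path).name
    return dev_root / f"{name}.jsonl"


def write_rows(table_path, rows, dev_root=Path(".dev"), append=False):
    """Write rows as one JSON object per line; append=True keeps existing rows."""
    target = dev_file_for(table_path, dev_root)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("a" if append else "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")
    return target


def read_rows(table_path, dev_root=Path(".dev")):
    """Read every non-empty line of the backing JSONL file as a dict."""
    with dev_file_for(table_path, dev_root).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Because a table is just a text file in this picture, you can inspect or hand-edit fixtures with any editor between runs.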
```yaml
# configs/config.yaml
pipeline:
  mode: "dev"
```

```
my_pipeline/
├── .dev/
│   ├── table1.jsonl
│   ├── table2.jsonl
│   └── operation.log
├── configs/
├── stages/
└── pipeline.py
```
Write:

```python
# Writes .dev/data.jsonl for logical path //tmp/my_pipeline/data
self.deps.yt_client.write_table(
    table_path="//tmp/my_pipeline/data",
    rows=[{"id": 1, "name": "Alice"}],
)
```

Append without truncating (same keyword as prod when the target already exists):

```python
self.deps.yt_client.write_table(
    table_path="//tmp/my_pipeline/data",
    rows=[{"id": 2, "name": "Bob"}],
    append=True,
)
```

Read:

```python
rows = list(self.deps.yt_client.read_table("//tmp/my_pipeline/data"))
```

Typical flow:
- Sandbox: `.dev/sandbox_<input>-><output>/` (exact name may vary by config).
- Copy or link input JSONL into the sandbox.
- Run the mapper entrypoint.
- Mapper stdout becomes `.dev/<output>.jsonl`, or appends when the map op sets `append: true`.
See Map operations — Append output (append: true section).
Command-mode mappers only (string commands). TypedJob map legs run on the cluster in prod.
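The dev map flow described above can be approximated in a few lines. This is a sketch under assumptions, not the framework's runner: `run_local_map` and `UPPERCASE_MAPPER` are hypothetical names, the mapper is assumed to read JSONL on stdin, and the command is passed as an argument list rather than a shell string to keep quoting out of the picture.

```python
import subprocess
import sys
from pathlib import Path

# Hypothetical mapper used only for illustration: upper-cases the "name" column.
UPPERCASE_MAPPER = (
    "import sys, json\n"
    "for line in sys.stdin:\n"
    "    row = json.loads(line)\n"
    "    row['name'] = row['name'].upper()\n"
    "    print(json.dumps(row))\n"
)


def run_local_map(command, input_file: Path, output_file: Path, append: bool = False) -> None:
    """Feed input JSONL to the mapper on stdin and store its stdout as the output table."""
    with input_file.open("rb") as stdin:
        # check=True raises CalledProcessError if the mapper exits non-zero.
        result = subprocess.run(command, stdin=stdin, capture_output=True, check=True)
    output_file.parent.mkdir(parents=True, exist_ok=True)
    # Append mode mirrors the `append: true` behaviour: existing rows are kept.
    with output_file.open("ab" if append else "wb") as out:
        out.write(result.stdout)
```

For example, `run_local_map([sys.executable, "-c", UPPERCASE_MAPPER], Path(".dev/in.jsonl"), Path(".dev/out.jsonl"))` would produce an output file with one transformed row per input row.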
Typical flow:
- Sandbox: `.dev/sandbox_mr_<input>-><output>/` with `input.jsonl`, `intermediate.jsonl`, and `output.jsonl`.
- Copy input JSONL into the sandbox and upload file dependencies (same as map).
- Run the mapper command as a subprocess; stdout becomes `intermediate.jsonl`.
- Sort intermediate rows by `sort_by` when set, otherwise by `reduce_by` (loads the JSONL into memory; for small local fixtures only).
- Run the reducer command; stdout becomes the output table at `.dev/<output>.jsonl`.
- Stderr for each leg: `.dev/<output_basename>_mapper.log` and `_reducer.log`.
String commands only; TypedJob MapReduce legs are prod-only (same rule as map).
Dev runs one mapper and one reducer process (no shuffle partitions). For command-mode reducers that expect sorted keys, dev sorting matches what the cluster provides after shuffle.
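To make the in-memory sort step concrete, here is a small sketch of sorting intermediate rows and then walking them group by group. It assumes `sort_by` and `reduce_by` are lists of column names; `sort_intermediate` and `grouped` are hypothetical helpers, not framework functions.

```python
from itertools import groupby
from operator import itemgetter


def sort_intermediate(rows, reduce_by, sort_by=None):
    """Sort rows by sort_by when given, otherwise by reduce_by (all rows held in memory)."""
    keys = sort_by or reduce_by
    return sorted(rows, key=itemgetter(*keys))


def grouped(sorted_rows, reduce_by):
    """Yield (key, rows) for each run of adjacent rows that share the reduce key."""
    for key, group in groupby(sorted_rows, key=itemgetter(*reduce_by)):
        yield key, list(group)


rows = [
    {"user": "b", "n": 1},
    {"user": "a", "n": 2},
    {"user": "b", "n": 3},
]
ordered = sort_intermediate(rows, reduce_by=["user"])
totals = {key: sum(r["n"] for r in group) for key, group in grouped(ordered, ["user"])}
```

Since `sorted` materializes the whole list, memory grows with the fixture size, which is why this approach only suits small local inputs.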
Typical flow:
- Sandbox: `.dev/sandbox_reduce_<input>-><output>/`.
- Copy input JSONL, upload dependencies, auto-sort rows by `sort_by` when set in config, otherwise `reduce_by` (in-memory; small fixtures only).
- Run the reducer subprocess; stdout becomes `.dev/<output>.jsonl`.
- Stderr: `.dev/<output_basename>_reducer.log`.
String commands only. Dev auto-sorts before the reducer so you do not need a separate run_sort stage locally. In prod, the input table must already be sorted by reduce_by (or run sort first).
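The sorting requirement matters because a streaming reducer usually keeps state for only one key at a time. The sketch below is a hypothetical reducer body (`reduce_sorted`, and the `user` / `n` column names, are illustrative assumptions) that sums a value per key and emits a row whenever the key changes.

```python
import json


def reduce_sorted(lines, key="user", value="n"):
    """Sum `value` per `key`, assuming rows with the same key are adjacent."""
    current, total, started = None, 0, False
    for line in lines:
        if not line.strip():
            continue
        row = json.loads(line)
        if started and row[key] != current:
            # Key changed: the previous group is complete, so emit it.
            yield {key: current, "total": total}
            total = 0
        current, started = row[key], True
        total += row[value]
    if started:
        yield {key: current, "total": total}


sorted_lines = ['{"user": "a", "n": 2}', '{"user": "b", "n": 1}', '{"user": "b", "n": 3}']
unsorted_lines = ['{"user": "b", "n": 1}', '{"user": "a", "n": 2}', '{"user": "b", "n": 3}']
good = list(reduce_sorted(sorted_lines))
bad = list(reduce_sorted(unsorted_lines))
```

With sorted input the reducer emits one row per key; with unsorted input the same key shows up in several output rows, which is the kind of failure an unsorted prod input table would cause for a reducer written this way.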
- Sandbox under `.dev/<stage>_sandbox/` (name depends on stage).
- Extract the uploaded archive layout locally.
- Run `vanilla.py` (or configured entry).
- Stdout/stderr captured to `.dev/<stage>.log` (see operations docs for exact file names).
Runs through the dev client’s DuckDB-backed path for supported statements. Treat results as representative, not a full YT SQL conformance suite.
- Writing stages and unit-style checks.
- Debugging mapper I/O with small fixtures.
- CI that should not depend on YT network.
Pros
- Fast edit-run cycles.
- No cluster account required for basic flows.
- Easy to inspect `.jsonl` and logs on disk.
- Works offline for many pipelines.
Cons
- Dataset size bounded by your machine.
- Parallelism and timing differ from prod.
- Rare YT-only behavior may not appear until prod.
- Tables: real Cypress paths on YT.
- Operations: cluster jobs with the resources you request.
- Upload: framework packages code to `build_folder` before starting jobs.
- YQL: cluster YQL engine.
**Image must match imports**
Job code imports `ytjobs` and your own modules. The Docker image for those jobs must ship matching Python deps. See [Cluster requirements](configuration/cluster-requirements.md).
```yaml
# configs/config.yaml
pipeline:
  mode: "prod"
  build_folder: "//tmp/my_pipeline/build"
```

`configs/secrets.env`:

```
YT_PROXY=your-yt-proxy-url
YT_TOKEN=your-yt-token
```

```python
self.deps.yt_client.write_table(
    table_path="//tmp/my_pipeline/data",
    rows=[{"id": 1, "name": "Alice"}],
)
```

```python
rows = list(self.deps.yt_client.read_table("//tmp/my_pipeline/data"))
```

- Upload bundle to `build_folder`.
- YT schedules tasks over input chunks.
- Reducers (if any) follow map semantics you configured.
- Output lands in the configured output table.
Upload, single or few cluster tasks, logs in YT UI.
Distributed engine, cluster-sized inputs.
- Production schedules.
- Data larger than fits comfortably on a laptop disk.
- Real concurrency and YT-native features.
Pros
- Scales with cluster storage and CPU.
- Matches how batch jobs actually run in YT.
Cons
- Needs credentials and network.
- Slower iteration than dev.
- Debugging means YT logs and UI, not only local files.
| Topic | Dev | Prod |
|---|---|---|
| Config snippet | `pipeline.mode: "dev"` | `pipeline.mode: "prod"` plus `build_folder` when uploading |
| Credentials | No YT `secrets.env` for basic flows | `YT_PROXY` / `YT_TOKEN` required |
| Throughput | One machine, subprocess-style map/vanilla | Cluster scheduling and distributed tables |
| Debugging | `.dev/*.jsonl`, local stderr | YT operation UI, remote stderr |
Change one field:

```yaml
pipeline:
  mode: "dev"  # or "prod"
```

**Same repo, different backend**
The framework picks dev vs prod implementations from `mode`; your stage classes stay the same.
Checklist when going to prod:
- Logical table paths stay the same string format; dev maps them to files.
- `secrets.env` exists and points at the right cluster.
- `build_folder` is set and writable for your service user.
- Docker image includes everything imported inside uploaded job code.
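Part of this checklist can be automated before a prod run. The following is a minimal sketch, not a framework feature: `parse_env_file` and `prod_preflight` are hypothetical helpers, `pipeline_cfg` is assumed to be the already-loaded `pipeline:` mapping, and `build_folder` is assumed to sit next to `mode` in it. It cannot verify that `build_folder` is writable or that the Docker image has the right packages; those still need a real cluster check.

```python
from pathlib import Path

REQUIRED_SECRETS = ("YT_PROXY", "YT_TOKEN")


def parse_env_file(path: Path) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def prod_preflight(pipeline_cfg: dict, secrets_path: Path) -> list:
    """Return human-readable problems; an empty list means the basic checks passed."""
    problems = []
    if pipeline_cfg.get("mode") != "prod":
        problems.append('pipeline.mode is not "prod"')
    if not pipeline_cfg.get("build_folder"):
        problems.append("build_folder is not set")
    if not secrets_path.exists():
        problems.append(f"{secrets_path} does not exist")
    else:
        secrets = parse_env_file(secrets_path)
        problems += [
            f"{name} is missing or empty"
            for name in REQUIRED_SECRETS
            if not secrets.get(name)
        ]
    return problems
```

A call such as `prod_preflight(cfg["pipeline"], Path("configs/secrets.env"))` could run in CI or at the top of a launcher script and fail fast when the returned list is non-empty.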
- Dev: `//tmp/.../name` maps to `.dev/name.jsonl` (see client implementation for exact mapping rules).
- Prod: the same string is a Cypress path.
- Dev map runs are closer to “one local subprocess story” than thousands of tiny tasks.
- Prod uses YT scheduling; race conditions that never show up locally can appear under load.
- Dev reads your tree directly.
- Prod needs a successful upload each run; if you change only local files, rerun the pipeline to refresh the bundle.
- Dev: tracebacks in your terminal.
- Prod: fetch stderr and system logs from YT for the failing operation.
- List `.dev/` after the stage runs.
- Open the JSONL you think should have changed.
- Read `operation.log` and stage logs next to it.
- Print debugging is fine; you see it immediately.
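If you check `.dev/` often, a tiny helper can replace the manual listing. This is only a convenience sketch: `summarize_dev_dir` is a hypothetical function, and it assumes tables end in `.jsonl` and logs in `.log` directly under the dev directory.

```python
from pathlib import Path


def summarize_dev_dir(dev_root: Path = Path(".dev")) -> list:
    """Return (file name, non-empty line count) for each .jsonl and .log file, sorted by name."""
    summary = []
    for path in sorted(dev_root.glob("*")):
        if path.is_file() and path.suffix in {".jsonl", ".log"}:
            with path.open(encoding="utf-8", errors="replace") as fh:
                count = sum(1 for line in fh if line.strip())
            summary.append((path.name, count))
    return summary
```

Comparing the summary before and after a stage run shows quickly which tables gained rows and which logs were written.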
- Open the operation in the YT UI and read stderr.
- Use `self.logger` consistently; it ends up in the same places operations already aggregate logs.
- For stubborn issues, reproduce with a tiny input table, then widen.
Table missing in prod
- Path typo or table never created in that Cypress tree.
- Credentials or proxy pointing at the wrong cluster.
Code changes ignored in prod
- You did not re-run the pipeline after editing sources, or upload failed silently earlier (check logs).
Different results dev vs prod
- DuckDB vs cluster SQL differences for YQL-heavy stages.
- Resource limits killing tasks only on the cluster.
**Smoke in prod early**
After dev passes, run prod once on a small slice of real schema before full-scale backfill jobs.
- Implement in dev.
- Promote to prod with a narrow date range or row limit.
- Only then open the floodgates.