
Dev vs prod modes

The pipeline `mode` field is either dev or prod. Both modes use the same Python code and YAML shape; only execution differs (local files in dev, a YT cluster in prod).

Overview

**Start in dev**

No `secrets.env` requirement, fast feedback, artifacts under `.dev/`.

  • Dev: tables and many operations are simulated on disk; good for development and CI without a cluster.
  • Prod: real YT operations and uploads; needs credentials and a compatible cluster image.

**Prod needs credentials**

Set `YT_PROXY` and `YT_TOKEN` in `configs/secrets.env` before switching to `prod`.

Dev mode

Behavior

  • Tables: JSONL files under .dev/, keyed from logical YT-style paths.
  • Map / vanilla style jobs: local subprocess plus a sandbox directory under .dev/.
  • Code upload: skipped; Python runs from your working tree.
  • YQL: translated through DuckDB where the dev client supports it (not identical to cluster YQL in every edge case).

Config

# configs/config.yaml
pipeline:
  mode: "dev"

Layout after a run

my_pipeline/
├── .dev/
│   ├── table1.jsonl
│   ├── table2.jsonl
│   └── operation.log
├── configs/
├── stages/
└── pipeline.py

Tables

Write:

# Writes .dev/data.jsonl for logical path //tmp/my_pipeline/data
self.deps.yt_client.write_table(
    table_path="//tmp/my_pipeline/data",
    rows=[{"id": 1, "name": "Alice"}],
)

Append without truncating (same keyword as prod when the target already exists):

self.deps.yt_client.write_table(
    table_path="//tmp/my_pipeline/data",
    rows=[{"id": 2, "name": "Bob"}],
    append=True,
)

Read:

rows = list(self.deps.yt_client.read_table("//tmp/my_pipeline/data"))
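To make the on-disk behavior concrete, here is a minimal sketch of what JSONL-backed write, append, and read could look like. `write_jsonl` and `read_jsonl` are hypothetical helpers written for this illustration, and the temporary directory stands in for `.dev/`; none of this is the framework's actual dev client.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path: Path, rows, append: bool = False) -> None:
    """Write one JSON object per line; append=True keeps existing lines."""
    path.parent.mkdir(parents=True, exist_ok=True)
    mode = "a" if append else "w"
    with path.open(mode, encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")


def read_jsonl(path: Path):
    """Yield one dict per non-empty line."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


# Hypothetical scratch directory standing in for .dev/
dev_dir = Path(tempfile.mkdtemp()) / ".dev"
table = dev_dir / "data.jsonl"

write_jsonl(table, [{"id": 1, "name": "Alice"}])
write_jsonl(table, [{"id": 2, "name": "Bob"}], append=True)
rows = list(read_jsonl(table))
```

Because each row is one line of plain JSON, a failing stage can usually be inspected with any text editor or `head`.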

Map (dev)

Typical flow:

  1. Sandbox: .dev/sandbox_<input>-><output>/ (exact name may vary by config).
  2. Copy or link input JSONL into the sandbox.
  3. Run the mapper entrypoint.
  4. Mapper stdout becomes .dev/<output>.jsonl, or appends when the map op sets append: true.

See the Append output (append: true) section of Map operations.

Dev supports command-mode mappers only (string commands). TypedJob map legs are prod-only and run on the cluster.
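The flow above can be sketched with the standard library. This is an illustration under assumptions, not the framework's runner: it assumes the mapper reads JSONL on stdin (the text above only states that stdout becomes the output table), and `run_mapper_locally` and the upper-casing mapper are hypothetical. The command is passed as an argument list rather than a single string to avoid shell quoting in the example.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical command-mode mapper: upper-cases "name" in every input row.
MAPPER_SOURCE = """
import json, sys
for line in sys.stdin:
    row = json.loads(line)
    row["name"] = row["name"].upper()
    print(json.dumps(row))
"""


def run_mapper_locally(command, input_path: Path, output_path: Path, append: bool = False) -> None:
    """Feed input JSONL to the command on stdin; store its stdout as the output table."""
    with input_path.open("rb") as stdin:
        result = subprocess.run(command, stdin=stdin, capture_output=True, check=True)
    # Assumption: append mirrors the map op's append: true by keeping existing rows.
    with output_path.open("ab" if append else "wb") as out:
        out.write(result.stdout)


work = Path(tempfile.mkdtemp())  # stand-in for the sandbox directory
mapper_file = work / "mapper.py"
mapper_file.write_text(MAPPER_SOURCE, encoding="utf-8")
input_file = work / "input.jsonl"
input_file.write_text('{"id": 1, "name": "Alice"}\n', encoding="utf-8")
output_file = work / "output.jsonl"

run_mapper_locally([sys.executable, str(mapper_file)], input_file, output_file)
mapped = [json.loads(line) for line in output_file.read_text(encoding="utf-8").splitlines()]
```

With `check=True`, a mapper that exits non-zero raises `subprocess.CalledProcessError`, which is one way a local runner could surface mapper failures early.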

MapReduce (dev)

Typical flow:

  1. Sandbox: .dev/sandbox_mr_<input>-><output>/ with input.jsonl, intermediate.jsonl, and output.jsonl.
  2. Copy input JSONL into the sandbox and upload file dependencies (same as map).
  3. Run the mapper command as a subprocess; stdout becomes intermediate.jsonl.
  4. Sort intermediate rows by sort_by when set, otherwise by reduce_by (loads the JSONL into memory; for small local fixtures only).
  5. Run the reducer command; stdout becomes the output table at .dev/<output>.jsonl.
  6. Stderr for each leg: .dev/<output_basename>_mapper.log and _reducer.log.

String commands only; TypedJob MapReduce legs are prod-only (same rule as map).

Dev runs one mapper and one reducer process (no shuffle partitions). For command-mode reducers that expect sorted keys, dev sorting matches what the cluster provides after shuffle.
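The sort step between the mapper and the reducer can be sketched as follows. `sort_intermediate` is a hypothetical helper, and the field names `user` and `ts` are made up for the example; the real dev client may order ties or missing keys differently.

```python
import json
from operator import itemgetter


def sort_intermediate(lines, reduce_by, sort_by=None):
    """Load JSONL lines into memory and sort by sort_by if given, else by reduce_by."""
    keys = sort_by or reduce_by
    rows = [json.loads(line) for line in lines if line.strip()]
    # itemgetter with several keys returns a tuple, so multi-column sorts work too.
    rows.sort(key=itemgetter(*keys))
    return rows


intermediate = [
    '{"user": "b", "ts": 2}',
    '{"user": "a", "ts": 5}',
    '{"user": "b", "ts": 1}',
]

by_user = sort_intermediate(intermediate, reduce_by=["user"])
by_user_ts = sort_intermediate(intermediate, reduce_by=["user"], sort_by=["user", "ts"])
```

Everything is held in a Python list here, which is why this approach only suits small local fixtures.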

Reduce (dev)

Typical flow:

  1. Sandbox: .dev/sandbox_reduce_<input>-><output>/.
  2. Copy input JSONL and upload dependencies, then auto-sort rows by sort_by when it is set in config, otherwise by reduce_by (in-memory; small fixtures only).
  3. Run the reducer subprocess; stdout becomes .dev/<output>.jsonl.
  4. Stderr: .dev/<output_basename>_reducer.log.

String commands only. Dev auto-sorts before the reducer so you do not need a separate run_sort stage locally. In prod, the input table must already be sorted by reduce_by (or run sort first).
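The sorted-input requirement matters because a streaming reducer typically groups consecutive rows with equal keys. The sketch below uses a hypothetical `reduce_counts` reducer and a made-up `user` key to show what happens when that assumption is violated, as it would be in prod on an unsorted table.

```python
import json
from itertools import groupby


def reduce_counts(lines, reduce_by="user"):
    """Emit one row per run of consecutive equal keys, as a streaming reducer would."""
    rows = (json.loads(line) for line in lines if line.strip())
    for key, group in groupby(rows, key=lambda r: r[reduce_by]):
        yield {reduce_by: key, "count": sum(1 for _ in group)}


sorted_input = ['{"user": "a"}', '{"user": "b"}', '{"user": "b"}']
unsorted_input = ['{"user": "b"}', '{"user": "a"}', '{"user": "b"}']

sorted_result = list(reduce_counts(sorted_input))      # one row per user
unsorted_result = list(reduce_counts(unsorted_input))  # user "b" is split into two rows
```

A reducer like this passes in dev on any input because of the auto-sort, so it is worth testing the prod path with a table that has actually been sorted.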

Vanilla (dev)

  1. Sandbox under .dev/<stage>_sandbox/ (name depends on stage).
  2. Extract the uploaded archive layout locally.
  3. Run vanilla.py (or configured entry).
  4. Stdout/stderr captured to .dev/<stage>.log (see operations docs for exact file names).

YQL (dev)

Runs through the dev client’s DuckDB-backed path for supported statements. Treat results as representative, not a full YT SQL conformance suite.

When dev mode is enough

  • Writing stages and unit-style checks.
  • Debugging mapper I/O with small fixtures.
  • CI that should not depend on YT network.

Tradeoffs (dev)

Pros

  • Fast edit-run cycles.
  • No cluster account required for basic flows.
  • Easy to inspect .jsonl and logs on disk.
  • Works offline for many pipelines.

Cons

  • Dataset size bounded by your machine.
  • Parallelism and timing differ from prod.
  • Rare YT-only behavior may not appear until prod.

Prod mode

Behavior

  • Tables: real Cypress paths on YT.
  • Operations: cluster jobs with the resources you request.
  • Upload: framework packages code to build_folder before starting jobs.
  • YQL: cluster YQL engine.

**Image must match imports**

Job code imports `ytjobs` and your own modules. The Docker image for those jobs must ship matching Python deps. See [Cluster requirements](configuration/cluster-requirements.md).

Config

# configs/config.yaml
pipeline:
  mode: "prod"
  build_folder: "//tmp/my_pipeline/build"

configs/secrets.env:

YT_PROXY=your-yt-proxy-url
YT_TOKEN=your-yt-token
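Before switching to prod, it can help to confirm that both variables are actually present. The sketch below is a hypothetical preflight check, not part of the framework: `parse_env` handles only simple `KEY=value` lines, and `missing_prod_secrets` reports which of the two required names are absent or empty.

```python
REQUIRED_PROD_SECRETS = ("YT_PROXY", "YT_TOKEN")


def parse_env(text: str) -> dict:
    """Parse simple KEY=value lines; blank lines and # comments are skipped."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_prod_secrets(values: dict) -> list:
    """Return the required secret names that are absent or empty."""
    return [name for name in REQUIRED_PROD_SECRETS if not values.get(name)]


complete = parse_env("YT_PROXY=your-yt-proxy-url\nYT_TOKEN=your-yt-token\n")
incomplete = parse_env("# token not set yet\nYT_PROXY=your-yt-proxy-url\n")

complete_missing = missing_prod_secrets(complete)
incomplete_missing = missing_prod_secrets(incomplete)
```

In a real project the text would come from reading `configs/secrets.env`; the inline strings keep the example self-contained.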

Tables (prod)

self.deps.yt_client.write_table(
    table_path="//tmp/my_pipeline/data",
    rows=[{"id": 1, "name": "Alice"}],
)
rows = list(self.deps.yt_client.read_table("//tmp/my_pipeline/data"))

Map (prod)

  1. Upload bundle to build_folder.
  2. YT schedules tasks over input chunks.
  3. If the operation has a reduce leg, reducers run with the semantics you configured.
  4. Output lands in the configured output table.

Vanilla (prod)

The bundle is uploaded, one or a few cluster tasks run, and logs appear in the YT UI.

YQL (prod)

Runs on the distributed cluster engine and handles cluster-sized inputs.

When you need prod

  • Production schedules.
  • Data larger than fits comfortably on a laptop disk.
  • Real concurrency and YT-native features.

Tradeoffs (prod)

Pros

  • Scales with cluster storage and CPU.
  • Matches how batch jobs actually run in YT.

Cons

  • Needs credentials and network.
  • Slower iteration than dev.
  • Debugging means YT logs and UI, not only local files.

Quick comparison

| Topic | Dev | Prod |
| --- | --- | --- |
| Config snippet | pipeline.mode: "dev" | pipeline.mode: "prod" plus build_folder when uploading |
| Credentials | No YT secrets.env for basic flows | YT_PROXY / YT_TOKEN required |
| Throughput | One machine, subprocess-style map/vanilla | Cluster scheduling and distributed tables |
| Debugging | .dev/*.jsonl, local stderr | YT operation UI, remote stderr |

Switching modes

Change one field:

pipeline:
  mode: "dev"   # or "prod"

**Same repo, different backend**

The framework picks dev vs prod implementations from `mode`; your stage classes stay the same.
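One common way to get this behavior is a small factory keyed on `mode`. The sketch below is only an illustration of that pattern: `DevClient`, `ProdClient`, and `make_client` are hypothetical names, and the framework's real selection logic may look quite different.

```python
class DevClient:
    """Hypothetical stand-in for the dev backend."""
    backend = "local files under .dev/"


class ProdClient:
    """Hypothetical stand-in for the prod backend."""
    backend = "YT cluster"


_CLIENTS = {"dev": DevClient, "prod": ProdClient}


def make_client(config: dict):
    """Pick the client implementation from pipeline.mode; reject unknown modes."""
    mode = config["pipeline"]["mode"]
    try:
        return _CLIENTS[mode]()
    except KeyError:
        raise ValueError(f"pipeline.mode must be 'dev' or 'prod', got {mode!r}") from None


dev_client = make_client({"pipeline": {"mode": "dev"}})
prod_client = make_client({"pipeline": {"mode": "prod"}})
```

Stage code that only talks to the returned object never needs to branch on the mode itself, which is what keeps stage classes identical across backends.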

Checklist when going to prod:

  1. Logical table paths stay the same string format; dev maps them to files.
  2. secrets.env exists and points at the right cluster.
  3. build_folder is set and writable for your service user.
  4. Docker image includes everything imported inside uploaded job code.

Where behavior diverges

Paths

  • Dev: //tmp/.../name maps to .dev/name.jsonl (see client implementation for exact mapping rules).
  • Prod: the same string is a Cypress path.
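As a rough illustration of the dev mapping, the sketch below keeps only the last path component. `dev_file_for` is a hypothetical helper and this rule is a simplifying assumption; the client implementation defines the exact mapping.

```python
from pathlib import Path, PurePosixPath


def dev_file_for(table_path: str, dev_dir: Path = Path(".dev")) -> Path:
    """Map a logical //-style path to a JSONL file named after its last component."""
    if not table_path.startswith("//"):
        raise ValueError(f"expected a path starting with //, got {table_path!r}")
    name = PurePosixPath(table_path[1:]).name
    if not name:
        raise ValueError(f"path has no table name: {table_path!r}")
    return dev_dir / f"{name}.jsonl"


data_file = dev_file_for("//tmp/my_pipeline/data")
```

Under this simplified rule, two logical paths with the same last component would map to the same file, so distinct table names are the safer choice for local fixtures.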

Parallelism

  • A dev map run is essentially one local subprocess, not thousands of small tasks.
  • Prod uses YT scheduling; race conditions that never show up locally can appear under load.

Code freshness

  • Dev reads your tree directly.
  • Prod needs a successful upload each run; if you change only local files, rerun the pipeline to refresh the bundle.

Errors

  • Dev: tracebacks in your terminal.
  • Prod: fetch stderr and system logs from YT for the failing operation.

Debugging

Dev

  1. List .dev/ after the stage runs.
  2. Open the JSONL you think should have changed.
  3. Read operation.log and stage logs next to it.
  4. Print debugging is fine; you see it immediately.

Prod

  1. Open the operation in the YT UI and read stderr.
  2. Use self.logger consistently; its output ends up where operations already aggregate logs.
  3. For stubborn issues, reproduce with a tiny input table, then widen.

Common symptoms

Table missing in prod

  • Path typo or table never created in that Cypress tree.
  • Credentials or proxy pointing at the wrong cluster.

Code changes ignored in prod

  • You did not rerun the pipeline after editing sources, or an earlier upload failed silently (check logs).

Different results dev vs prod

  • DuckDB vs cluster SQL differences for YQL-heavy stages.
  • Resource limits killing tasks only on the cluster.

Workflow suggestion

**Smoke in prod early**

After dev passes, run prod once on a small slice of data with the real schema before full-scale backfill jobs.

  1. Implement in dev.
  2. Promote to prod with a narrow date range or row limit.
  3. Only then open the floodgates.
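For step 2, a row limit can be as simple as capping how many rows are consumed from a table iterator. In the sketch below, `take_rows` and `fake_read_table` are hypothetical; in a pipeline the iterator would come from `read_table` as shown in the Tables sections.

```python
from itertools import islice


def take_rows(rows, limit: int) -> list:
    """Return at most `limit` rows without consuming the rest of the iterator."""
    return list(islice(rows, limit))


def fake_read_table():
    """Hypothetical stand-in for a large table read."""
    for i in range(1_000_000):
        yield {"id": i}


smoke_rows = take_rows(fake_read_table(), limit=3)
```

Because `islice` stops pulling from the iterator once the limit is reached, the remaining rows are never generated in this example.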

Next steps