Tokenizer artifacts (tarball uploads)

Tokenizer artifacts are separate from the model files handled by checkpoint management. The framework packages a local directory or an existing .tar.gz (tokenizers, small processor bundles, etc.), uploads it to a Cypress file path, and mounts it into map / map-reduce / vanilla jobs. It also wires file dependencies and environment variables so that StageBootstrapTypedJob (or a command-mode wrapper) can extract the archive in the sandbox.

When to use

  • Reducers or mappers need a tarball of assets that are not a single PyTorch/HF checkpoint file.
  • You want idempotent upload (skip if the YT path already exists) and the same pool/resource model as other operations.

Configuration (tokenizer_artifact)

Under client.operations.<operation>.tokenizer_artifact (map, map_reduce, reduce, vanilla, where enabled):

| Key | Required | Description |
| --- | --- | --- |
| artifact_base | Yes (to enable) | Cypress directory for uploaded tarballs (e.g. //tmp/pipeline/tokenizers). |
| artifact_name | No* | Logical name; defaults from job.tokenizer_name, job.model_name basename, or local_artifact_path filename. |
| local_artifact_path | No | Local directory (packed to temp .tar.gz) or existing .tar.gz to upload if missing in YT. |

*If artifact_base is set, a resolvable artifact_name is required (directly or via job / local path).
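
The fallback chain for artifact_name can be pictured with a small sketch. This is not the framework's code: resolve_artifact_name is a hypothetical helper, the order of the fallbacks simply follows the order in which the table lists them, and stripping a trailing .tar.gz from the local filename is an assumption.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_artifact_name(
    artifact_name: Optional[str] = None,
    tokenizer_name: Optional[str] = None,
    model_name: Optional[str] = None,
    local_artifact_path: Optional[str] = None,
) -> str:
    """Hypothetical helper: pick a logical name using the documented fallbacks."""
    if artifact_name:
        return artifact_name
    if tokenizer_name:
        return tokenizer_name
    if model_name:
        # Basename of the model name, e.g. "org/my-model" -> "my-model".
        return PurePosixPath(model_name).name
    if local_artifact_path:
        filename = PurePosixPath(local_artifact_path).name
        # Assumption: drop a trailing ".tar.gz" so the uploaded file
        # does not end up named "<name>.tar.gz.tar.gz".
        if filename.endswith(".tar.gz"):
            filename = filename[: -len(".tar.gz")]
        return filename
    raise ValueError("artifact_base is set but no artifact_name could be resolved")
```

With only local_artifact_path set to /path/to/my_tokenizer_bundle, this sketch yields my_tokenizer_bundle; the real resolution rules may differ in detail.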

Example (map):

client:
  operations:
    map:
      input_table: //tmp/pipeline/in
      output_table: //tmp/pipeline/out
      tokenizer_artifact:
        artifact_base: //tmp/pipeline/tokenizer_artifacts
        local_artifact_path: /path/to/my_tokenizer_bundle   # dir or .tar.gz
      resources:
        pool: default
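
For a configuration like the one above, the upload step (pack a directory if needed, skip if the YT path already exists) might look roughly like the following sketch. FileClient, pack_directory, and upload_if_missing are hypothetical names, not the framework's API; the exists / write_file method names are assumed to be provided by whatever client object is passed in.

```python
import os
import tarfile
import tempfile
from typing import BinaryIO, Protocol


class FileClient(Protocol):
    """Hypothetical minimal client interface; not the framework's real type."""

    def exists(self, path: str) -> bool: ...

    def write_file(self, path: str, stream: BinaryIO) -> None: ...


def pack_directory(local_dir: str, out_path: str) -> str:
    """Pack the contents of local_dir into a gzip-compressed tarball."""
    with tarfile.open(out_path, "w:gz") as tar:
        # arcname="." stores member paths relative to the bundle root.
        tar.add(local_dir, arcname=".")
    return out_path


def upload_if_missing(
    client: FileClient,
    artifact_base: str,
    artifact_name: str,
    local_artifact_path: str,
) -> str:
    """Upload <artifact_name>.tar.gz under artifact_base unless it already exists."""
    remote_path = f"{artifact_base.rstrip('/')}/{artifact_name}.tar.gz"
    if client.exists(remote_path):
        return remote_path  # Idempotent: the artifact is already in place.
    if os.path.isdir(local_artifact_path):
        # Directory input: pack into a temporary tarball first.
        with tempfile.TemporaryDirectory() as tmp:
            tarball = pack_directory(
                local_artifact_path, os.path.join(tmp, f"{artifact_name}.tar.gz")
            )
            with open(tarball, "rb") as stream:
                client.write_file(remote_path, stream)
    else:
        # Existing .tar.gz input: upload it unchanged.
        with open(local_artifact_path, "rb") as stream:
            client.write_file(remote_path, stream)
    return remote_path
```

Checking for the remote path before packing keeps repeated pipeline runs cheap: the second run neither builds a tarball nor writes to Cypress.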

Sandbox behavior

  1. The uploaded file is named <artifact_name>.tar.gz under artifact_base.
  2. Workers receive TOKENIZER_ARTIFACT_FILE, TOKENIZER_ARTIFACT_DIR, and optionally TOKENIZER_ARTIFACT_NAME.
  3. StageBootstrapTypedJob extracts the tarball once per sandbox (see Environment variables).
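
For command-mode wrappers that need to do the extraction themselves, a minimal sketch of "extract once per sandbox" is shown below. It is not StageBootstrapTypedJob's implementation: ensure_tokenizer_extracted and the .extracted marker file are hypothetical, and it assumes TOKENIZER_ARTIFACT_DIR is the directory the archive should be unpacked into. It does not guard against several processes racing in the same sandbox.

```python
import os
import tarfile
from pathlib import Path


def ensure_tokenizer_extracted(marker_name: str = ".extracted") -> Path:
    """Extract TOKENIZER_ARTIFACT_FILE into TOKENIZER_ARTIFACT_DIR at most once."""
    archive = Path(os.environ["TOKENIZER_ARTIFACT_FILE"])
    # Assumption: TOKENIZER_ARTIFACT_DIR is the extraction target.
    target = Path(os.environ["TOKENIZER_ARTIFACT_DIR"])
    # Hypothetical marker file recording that extraction already happened.
    marker = target / marker_name
    if marker.exists():
        return target
    target.mkdir(parents=True, exist_ok=True)
    # The archive is one the pipeline uploaded itself, so it is treated as trusted.
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target)
    marker.touch()
    return target
```

A job would call ensure_tokenizer_extracted() before loading the tokenizer and then read files from the returned directory; later calls in the same sandbox return immediately.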

Relation to checkpoints

  • Checkpoints (checkpoint: block): usually a single model file mounted as CHECKPOINT_FILE.
  • Tokenizer artifacts: tarball workflow, separate Cypress path and env vars. You can use both in one stage if needed.

API

  • init_tokenizer_artifact_directory and helpers: see "Tokenizer artifact" in the API Reference (yt_framework.operations._internal.tokenizer_artifact).
  • These helpers are also exported from yt_framework.operations for advanced callers.

See also