Tokenizer artifacts are separate from the model files handled by checkpoint management. The tokenizer artifact workflow packages a directory or `.tar.gz` (tokenizers, small processor bundles, etc.), uploads it to a Cypress file path, and mounts it into map / map-reduce / vanilla jobs. The framework wires up file dependencies and environment variables so that `StageBootstrapTypedJob` (or command-mode wrappers) can extract the archive in the sandbox.

Use tokenizer artifacts when:

- Reducers or mappers need a tarball of assets that are not a single PyTorch/HF checkpoint file.
- You want idempotent upload (the upload is skipped if the YT path already exists) and the same pool/resource model as other operations.

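
The idempotent upload described above can be sketched as follows. This is a minimal illustration, not the framework's implementation: `ArtifactStore`, `pack_directory`, and `upload_if_missing` are hypothetical names, and the store interface stands in for whatever client actually talks to Cypress.

```python
import tarfile
import tempfile
from pathlib import Path
from typing import Protocol, Tuple


class ArtifactStore(Protocol):
    """Hypothetical minimal storage interface (an assumption, not the framework's API)."""

    def exists(self, path: str) -> bool: ...

    def write_file(self, path: str, data: bytes) -> None: ...


def pack_directory(src: Path, dest: Path) -> Path:
    """Pack the contents of `src` into a gzip-compressed tarball at `dest`."""
    with tarfile.open(dest, "w:gz") as tar:
        # arcname="." keeps member paths relative to the bundle root.
        tar.add(src, arcname=".")
    return dest


def upload_if_missing(
    store: ArtifactStore,
    artifact_base: str,
    artifact_name: str,
    local_path: Path,
) -> Tuple[str, bool]:
    """Upload the artifact unless the target path already exists.

    Returns the target path and whether an upload actually happened.
    """
    target = f"{artifact_base.rstrip('/')}/{artifact_name}.tar.gz"
    if store.exists(target):
        return target, False  # already uploaded: skip
    if local_path.is_dir():
        # Directories are packed into a temporary .tar.gz first.
        with tempfile.TemporaryDirectory() as tmp:
            tarball = pack_directory(local_path, Path(tmp) / f"{artifact_name}.tar.gz")
            store.write_file(target, tarball.read_bytes())
    else:
        # An existing .tar.gz is uploaded as-is.
        store.write_file(target, local_path.read_bytes())
    return target, True
```

Because the existence check comes first, rerunning a pipeline with an unchanged `artifact_name` does not repack or re-upload the bundle. The flip side is that changed local contents are not picked up until the name (or the YT path) changes.
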
Configure the artifact under `client.operations.<operation>.tokenizer_artifact` (`map`, `map_reduce`, `reduce`, `vanilla`, where enabled):

| Key | Required | Description |
|---|---|---|
| `artifact_base` | Yes (to enable) | Cypress directory for uploaded tarballs (e.g. `//tmp/pipeline/tokenizers`). |
| `artifact_name` | No\* | Logical name; defaults from `job.tokenizer_name`, `job.model_name` basename, or `local_artifact_path` filename. |
| `local_artifact_path` | No | Local directory (packed to a temporary `.tar.gz`) or existing `.tar.gz` to upload if missing in YT. |

\*If `artifact_base` is set, a resolvable `artifact_name` is required (directly or via job / local path).

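
The name resolution rule can be sketched like this. It is an illustration under stated assumptions: the function name `resolve_artifact_name` is hypothetical, the fallback order simply follows the order in which the sources are listed above, and stripping a `.tar.gz` suffix from the local filename is an assumption.

```python
from pathlib import PurePosixPath
from typing import Optional


def _basename_without_tar_gz(path: str) -> str:
    """Return the last path component, dropping a trailing `.tar.gz` if present."""
    name = PurePosixPath(path).name
    if name.endswith(".tar.gz"):
        name = name[: -len(".tar.gz")]
    return name


def resolve_artifact_name(
    artifact_name: Optional[str] = None,
    tokenizer_name: Optional[str] = None,
    model_name: Optional[str] = None,
    local_artifact_path: Optional[str] = None,
) -> str:
    """Pick the first available name source (assumed precedence, see lead-in)."""
    if artifact_name:
        return artifact_name  # explicit config value
    if tokenizer_name:
        return tokenizer_name  # job.tokenizer_name
    if model_name:
        return _basename_without_tar_gz(model_name)  # job.model_name basename
    if local_artifact_path:
        return _basename_without_tar_gz(local_artifact_path)  # local filename
    raise ValueError(
        "artifact_base is set but no artifact_name could be resolved; "
        "set artifact_name, a job tokenizer/model name, or local_artifact_path"
    )
```

For instance, with only `local_artifact_path: /path/to/my_tokenizer_bundle` configured, this sketch yields `my_tokenizer_bundle`, so the uploaded file would be `my_tokenizer_bundle.tar.gz`.
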
Example (map):

```yaml
client:
  operations:
    map:
      input_table: //tmp/pipeline/in
      output_table: //tmp/pipeline/out
      tokenizer_artifact:
        artifact_base: //tmp/pipeline/tokenizer_artifacts
        local_artifact_path: /path/to/my_tokenizer_bundle  # dir or .tar.gz
      resources:
        pool: default
```

- The uploaded file is named `<artifact_name>.tar.gz` under `artifact_base`.
- Workers receive `TOKENIZER_ARTIFACT_FILE`, `TOKENIZER_ARTIFACT_DIR`, and optionally `TOKENIZER_ARTIFACT_NAME`. `StageBootstrapTypedJob` extracts the tarball once per sandbox (see Environment variables).

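
For command-mode wrappers that do not go through `StageBootstrapTypedJob`, a once-per-sandbox extraction might look like the sketch below. It assumes that `TOKENIZER_ARTIFACT_FILE` points at the mounted tarball and `TOKENIZER_ARTIFACT_DIR` at the extraction target; the helper name and the `.extracted` marker file are hypothetical, and the framework's own mechanism may differ.

```python
import os
import tarfile
from pathlib import Path
from typing import Mapping, Optional

MARKER_NAME = ".extracted"  # hypothetical marker; not necessarily what the framework uses


def ensure_tokenizer_artifact_extracted(
    environ: Optional[Mapping[str, str]] = None,
) -> Path:
    """Extract the tokenizer tarball into its target directory at most once."""
    env = os.environ if environ is None else environ
    tarball = Path(env["TOKENIZER_ARTIFACT_FILE"])  # assumed: path to the tarball
    target = Path(env["TOKENIZER_ARTIFACT_DIR"])  # assumed: extraction directory
    marker = target / MARKER_NAME
    if marker.exists():
        return target  # an earlier call in this sandbox already extracted it
    target.mkdir(parents=True, exist_ok=True)
    # The bundle is one you uploaded yourself; do not extract untrusted archives this way.
    with tarfile.open(tarball, "r:gz") as tar:
        tar.extractall(target)
    marker.touch()
    return target
```

A mapper could call this helper once at start-up and then load its tokenizer from the returned directory. The marker file only guards against repeated sequential calls; it does not protect concurrent callers.
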

Compared with checkpoints:

- Checkpoints (`checkpoint:` block): usually a single model file mounted as `CHECKPOINT_FILE`.
- Tokenizer artifacts: tarball workflow, separate Cypress path and env vars. You can use both in one stage if needed.

Python API:

- `init_tokenizer_artifact_directory` and helpers: Tokenizer artifact in API Reference (`yt_framework.operations._internal.tokenizer_artifact`).
- Exported on `yt_framework.operations` for advanced callers.