Commit db2050f

kubernetes: suppress tqdm progress bars in all container pods
Inject TQDM_DISABLE=1 and HF_DATASETS_DISABLE_PROGRESS_BARS=1 into every Kubernetes container's environment unless the component has already set those keys explicitly (user values take precedence).

High-volume tqdm block-glyph output (█▉▊▋▌▍▎▏, 3-byte UTF-8) from concurrent HF datasets workers (num_proc>1) is the dominant source of non-ASCII bytes in pod log streams. Eliminating the glyphs at the source makes the log stream pure ASCII for tokenization/packing phases, removing any possibility of torn multi-byte sequences reaching the Kubernetes API read path regardless of the defensive decode added in the previous commit.

Side effect: log sizes for heavy tokenization jobs drop significantly (observed ~6 MB → tens of KB), since tqdm progress bars account for the bulk of the raw byte volume.
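The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the code from this commit or the previous one: `decodes_cleanly` is a hypothetical helper, and the split points are chosen by hand to simulate a read that ends mid-glyph and a write from another process landing inside a glyph.

```python
import codecs

# One tqdm block glyph; it encodes to three bytes in UTF-8.
glyph = "█"
encoded = glyph.encode("utf-8")
glyph_byte_count = len(encoded)


def decodes_cleanly(chunk: bytes) -> bool:
    """Hypothetical helper: report whether a chunk is valid UTF-8 on its own."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Case 1: a read boundary falls inside the glyph (torn at a chunk edge).
first_chunk, second_chunk = encoded[:2], encoded[2:]
torn_chunk_ok = decodes_cleanly(first_chunk)

# An incremental decoder buffers the partial sequence until the rest arrives,
# so this kind of tear is recoverable on the reader side.
decoder = codecs.getincrementaldecoder("utf-8")()
recovered = decoder.decode(first_chunk) + decoder.decode(second_chunk, final=True)

# Case 2: another writer's byte lands inside the glyph (interleaved writes).
# The stream itself is now invalid, so buffering on the reader cannot repair it.
interleaved_ok = decodes_cleanly(encoded[:2] + b"x" + encoded[2:])

# Pure ASCII output can be split or interleaved anywhere and still decodes.
ascii_ok = decodes_cleanly(b"tokenizing: 50%"[:7])
```

Case 2 is why suppressing the glyphs at the source is useful even with a defensive decode in place: reader-side buffering only helps when the bytes arrive in order.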
1 parent 7a5ff1b commit db2050f

1 file changed

Lines changed: 14 additions & 0 deletions

cloud_pipelines_backend/launchers/kubernetes_launchers.py

@@ -69,6 +69,16 @@
 # Environment variables for multi-node execution.
 _MULTI_NODE_NODE_INDEX_ENV_VAR_NAME = "_TANGLE_MULTI_NODE_NODE_INDEX"
 
+# Environment variables injected into every container to suppress tqdm progress
+# bar output. High-volume block-glyph writes (3-byte UTF-8: █▉▊▋▌▍▎▏) from
+# concurrent worker processes interleave at the OS level, producing torn
+# multi-byte sequences in the pod log stream that cause UnicodeDecodeError.
+# Components may override these by setting the same keys in their own env.
+_TQDM_SUPPRESS_ENV_VARS: dict[str, str] = {
+    "TQDM_DISABLE": "1",
+    "HF_DATASETS_DISABLE_PROGRESS_BARS": "1",
+}
+
 
 _T = typing.TypeVar("_T")
 
@@ -352,6 +362,10 @@ def get_output_path(output_name: str) -> str:
             k8s_client_lib.V1EnvVar(name=name, value=value)
             for name, value in (container_spec.env or {}).items()
         ]
+        user_env_names = {env.name for env in container_env}
+        for name, value in _TQDM_SUPPRESS_ENV_VARS.items():
+            if name not in user_env_names:
+                container_env.append(k8s_client_lib.V1EnvVar(name=name, value=value))
         main_container_spec = k8s_client_lib.V1Container(
             name=_MAIN_CONTAINER_NAME,
             image=container_spec.image,
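
The precedence rule in this hunk (defaults are appended only for keys the component has not set) can be sketched without the Kubernetes client. This is an illustration under assumptions: `merge_suppress_env` is a hypothetical name, and plain dicts stand in for the `V1EnvVar` list that the launcher actually builds.

```python
from typing import Optional

# Same keys and values as _TQDM_SUPPRESS_ENV_VARS in the diff.
TQDM_SUPPRESS_ENV_VARS: dict[str, str] = {
    "TQDM_DISABLE": "1",
    "HF_DATASETS_DISABLE_PROGRESS_BARS": "1",
}


def merge_suppress_env(user_env: Optional[dict[str, str]]) -> dict[str, str]:
    """Hypothetical helper: add suppression defaults without overriding user keys."""
    merged = dict(user_env or {})  # copy so the caller's dict is not mutated
    for name, value in TQDM_SUPPRESS_ENV_VARS.items():
        # setdefault leaves an existing user-provided value untouched.
        merged.setdefault(name, value)
    return merged


# A component with no env gets both defaults.
no_env = merge_suppress_env(None)

# A component that explicitly re-enables tqdm keeps its own value,
# while the key it did not set still receives the default.
custom = merge_suppress_env({"TQDM_DISABLE": "0", "MY_FLAG": "yes"})
```

Checking membership by name, rather than by value, means a component can opt back in to progress bars by setting either key to any value of its own.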
