Commit fa0a2c5

fix(lm): retry HTTP 408 from the HF CDN in the hub retry backend (#561)

Job 703812 died at dataloader setup when a ranged parquet read got a 408 (Request Time-out) from the HF CDN. The #557 retry backend was active but its status_forcelist only covered 429/5xx, so urllib3 returned the 408 unretried and hf_raise_for_status raised, tearing down all 16 ranks.

408 is the same transient-timeout class that backend exists for, and only idempotent methods are retried, so adding it is safe.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
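
The status_forcelist behaviour described in the message can be checked directly against urllib3's `Retry` policy. This is a minimal sketch, not the code in `hf_http.py`: the exact status codes, `total`, and `backoff_factor` below are assumptions (the message only says the old list covered 429/5xx), and `allowed_methods` is left at urllib3's default, which does not include POST.

```python
from urllib3.util.retry import Retry

# Assumed status lists: the commit message says only "429/5xx" for the old one.
OLD_FORCELIST = (429, 500, 502, 503, 504)
NEW_FORCELIST = (408,) + OLD_FORCELIST


def make_retry(forcelist):
    # total and backoff_factor are illustrative values, not taken from the repo.
    # allowed_methods stays at the urllib3 default (idempotent methods only).
    return Retry(total=5, backoff_factor=0.5, status_forcelist=forcelist)


old_policy = make_retry(OLD_FORCELIST)
new_policy = make_retry(NEW_FORCELIST)

# A 408 on a GET is not retried by the old policy, but is by the new one.
print(bool(old_policy.is_retry("GET", 408)))   # False
print(bool(new_policy.is_retry("GET", 408)))   # True
# POST is outside the default allowed methods, so it is never status-retried.
print(bool(new_policy.is_retry("POST", 408)))  # False
```

A policy like this would typically be passed as `max_retries` to a `requests.adapters.HTTPAdapter` mounted on the session; how this repository wires that into the Hub session factory is not shown in the diff.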
1 parent c62531d commit fa0a2c5

1 file changed

Lines changed: 1 addition & 1 deletion

param_decomp_lab/infra/hf_http.py

@@ -4,7 +4,7 @@
 adapter for metadata calls like `HfApi.repo_info` — the call `datasets` makes to resolve
 a streaming dataset's layout at startup. A single `ReadTimeout` there raises, and in a
 DDP job that one rank's failure tears down every rank before training begins. This mounts
-a retrying adapter on the session factory so connect/read timeouts and 5xx/429 are
+a retrying adapter on the session factory so connect/read timeouts and 408/429/5xx are
 retried with jittered backoff across *all* Hub HTTP calls (dataset, tokenizer, model).
 """
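
The docstring mentions jittered backoff without showing the scheme. As a rough illustration only, the sketch below uses "full jitter" (a uniform draw up to an exponentially growing, capped ceiling); the function name, `base`, and `cap` are hypothetical, and the module may use a different scheme. The point of jitter in a multi-rank job is that ranks hitting the same transient error do not all retry at the same instant.

```python
import random


def jittered_backoff(attempt, base=0.5, cap=30.0, rng=random):
    """Return a delay in seconds for the given zero-based retry attempt.

    Full jitter: uniform in [0, min(cap, base * 2**attempt)].
    All parameter values here are illustrative assumptions.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Seeded generator so the sketch is reproducible when run.
rng = random.Random(0)
delays = [jittered_backoff(n, rng=rng) for n in range(8)]
for n, delay in enumerate(delays):
    print(f"attempt {n}: sleep {delay:.2f}s")
```

Each delay is bounded by the ceiling for its attempt, so the worst-case wait stays under `cap` while the expected wait still grows with repeated failures.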
