fix(lm): retry HTTP 408 from the HF CDN in the hub retry backend (#561)

danbraunai-goodfire · claude · web-flow · commit fa0a2c573e9f · 2026-06-29T11:11:21.000+01:00
Job 703812 died at dataloader setup when a ranged parquet read got a 408 (Request Time-out) from the HF CDN. The #557 retry backend was active, but its status_forcelist only covered 429/5xx, so urllib3 returned the 408 unretried and hf_raise_for_status raised, tearing down all 16 ranks.

408 is the same transient-timeout class that the backend exists for, and only idempotent methods are retried, so adding it is safe.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
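For illustration only, the retry decision described above can be modelled in a few lines of plain Python. This is a simplified sketch, not the code in hf_http.py and not urllib3's implementation: `should_retry`, `OLD_FORCELIST` and `NEW_FORCELIST` are hypothetical names, the exact 5xx codes in the old list are an assumption (the message only says "429/5xx"), and the idempotent-method set mirrors what urllib3 is understood to allow by default.

```python
# Simplified model of a status-based retry decision (illustrative only).
# Methods treated as safe to replay; assumed to match urllib3's default set.
IDEMPOTENT_METHODS = frozenset({"HEAD", "GET", "PUT", "DELETE", "OPTIONS", "TRACE"})

# Assumed contents of the forcelist before and after this change.
OLD_FORCELIST = frozenset({429, 500, 502, 503, 504})
NEW_FORCELIST = OLD_FORCELIST | {408}


def should_retry(method: str, status: int, forcelist: frozenset) -> bool:
    """Retry only when the method is idempotent AND the status is listed."""
    return method.upper() in IDEMPOTENT_METHODS and status in forcelist


if __name__ == "__main__":
    # Before the change, a 408 on a ranged GET falls through to the caller.
    print(should_retry("GET", 408, OLD_FORCELIST))
    # After the change, the same response is retried.
    print(should_retry("GET", 408, NEW_FORCELIST))
    # Non-idempotent requests are still never replayed.
    print(should_retry("POST", 408, NEW_FORCELIST))
```

Under this model a 408 on a GET is retried only with the extended list, while a POST is left alone with either list, which is why the widening is described as safe.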
diff --git a/param_decomp_lab/infra/hf_http.py b/param_decomp_lab/infra/hf_http.py
@@ -4,7 +4,7 @@
 adapter for metadata calls like `HfApi.repo_info` — the call `datasets` makes to resolve
 a streaming dataset's layout at startup. A single `ReadTimeout` there raises, and in a
 DDP job that one rank's failure tears down every rank before training begins. This mounts
-a retrying adapter on the session factory so connect/read timeouts and 5xx/429 are
+a retrying adapter on the session factory so connect/read timeouts and 408/429/5xx are
 retried with jittered backoff across *all* Hub HTTP calls (dataset, tokenizer, model).
 """