Raise HF_XET_CLIENT_READ_TIMEOUT to 300s + clean up query_dedup 404 log #808
Open
rajatarya wants to merge 2 commits into
The 120s default was firing before legitimate upload_xorb requests could complete on high-latency / transatlantic / bursty links, producing a chronic 30-50% fleet-wide xorb POST failure rate. Raise the client default to 300s; env override is unchanged. Refs #807 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
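A minimal sketch of the default-plus-override behavior described in this commit. It is not the actual `xet_runtime` config code: the helper names `resolve_read_timeout` and `read_timeout_from_env` are hypothetical, and it assumes the variable holds a plain integer number of seconds (as in `HF_XET_CLIENT_READ_TIMEOUT=120`).

```rust
use std::time::Duration;

/// Default read timeout in seconds after this change (previously 120).
const DEFAULT_READ_TIMEOUT_SECS: u64 = 300;

/// Resolve the read timeout from an optional raw env value.
/// Assumption: the value is a plain integer number of seconds.
/// A missing or unparseable value falls back to the default.
fn resolve_read_timeout(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|v| v.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_READ_TIMEOUT_SECS);
    Duration::from_secs(secs)
}

/// Hypothetical convenience wrapper that reads the process environment.
fn read_timeout_from_env() -> Duration {
    let raw = std::env::var("HF_XET_CLIENT_READ_TIMEOUT").ok();
    resolve_read_timeout(raw.as_deref())
}
```

Under these assumptions, an unset variable yields 300s and `HF_XET_CLIENT_READ_TIMEOUT=120` restores the old value. Falling back to the default on an unparseable value is a choice made for this sketch only; the real config layer may reject it instead.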
A 404 from cas::query_dedup is an expected cache miss — the caller converts it to Ok(None) and proceeds to upload the chunk. Today the retry wrapper logs it as `Fatal Error: "cas::query_dedup" api call failed ... 404 Not Found`, which produces 20+ alarming-looking lines per upload session with no actual failure behind them. Add RetryWrapper::with_expected_404(), mirroring the existing with_expected_416() pattern, and opt query_dedup into it. The 404 still short-circuits retries and surfaces as a fatal error to the caller (preserving the existing Ok(None) conversion), but the log line now reads as a cache miss. Refs #807 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
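The opt-in pattern described in this commit can be sketched roughly as follows. This is an illustration, not the real `RetryWrapper`: the `RetrySketch` type, the `Outcome` enum, and `query_dedup_result` are hypothetical, and treating 429 and 5xx as the retryable statuses is an assumption. Only the `with_expected_404()` builder name and the two log prefixes come from the PR.

```rust
/// How the (hypothetical) retry loop treats a response status.
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    Success,
    /// Try the request again.
    Retry,
    /// Stop retrying and return an error; `log_prefix` is what gets logged.
    Fatal { log_prefix: &'static str },
}

/// Stand-in for the retry wrapper; only the 404 opt-in is modeled.
#[derive(Default)]
struct RetrySketch {
    expect_404: bool,
}

impl RetrySketch {
    /// Builder-style opt-in, named after the method this PR adds.
    fn with_expected_404(mut self) -> Self {
        self.expect_404 = true;
        self
    }

    fn classify(&self, status: u16) -> Outcome {
        match status {
            200..=299 => Outcome::Success,
            // Still fatal (no retries); only the log wording changes.
            404 if self.expect_404 => Outcome::Fatal { log_prefix: "Not Found (cache miss)" },
            // Assumption: these are the retryable statuses.
            429 | 500..=599 => Outcome::Retry,
            _ => Outcome::Fatal { log_prefix: "Fatal Error" },
        }
    }
}

/// Caller side: a 404 is turned into Ok(None), meaning "no dedup hit".
fn query_dedup_result(status: u16, body: Option<String>) -> Result<Option<String>, String> {
    match RetrySketch::default().with_expected_404().classify(status) {
        Outcome::Success => Ok(body),
        Outcome::Fatal { .. } if status == 404 => Ok(None),
        Outcome::Fatal { log_prefix } => Err(format!("{log_prefix}: status {status}")),
        Outcome::Retry => Err(format!("retryable status {status}")),
    }
}
```

In this sketch the control flow for a 404 is identical with and without the opt-in (fatal, not retried); the flag only selects the log prefix, which matches the behavior the commit message describes.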
seanses
approved these changes
Apr 29, 2026
Two connected cleanups from the 2026-04-21 Julien upload-stuck investigation. Closes #807. Docs PR: huggingface/hub-docs#2419.
Change 1: raise HF_XET_CLIENT_READ_TIMEOUT default 120s → 300s

Files: `xet_runtime/src/config/groups/client.rs`, `xet_client/src/cas_client/remote_client.rs` (stale comment).

The 120s client read timeout was firing before legitimate `upload_xorb` requests could complete on high-latency / transatlantic / bursty links. Fleet-wide this produced a chronic 30–50% xorb POST failure rate (1,092–4,196 `error uploading xorb` events per hour sustained over 24h, peaking at 49.1% in the investigation window). 267 successful uploads in the same 24h had latency > 120s (max 37 min), so 120s wasn't protecting anything legitimate; it was only cutting off slow-but-healthy streams.

300s preserves stall-detection semantics (still an order of magnitude under the 3600s ALB idle). The env override `HF_XET_CLIENT_READ_TIMEOUT` is unchanged.

Change 2: log `query_dedup` 404 as cache miss, not "Fatal Error"

Files: `xet_client/src/cas_client/retry_wrapper.rs`, `xet_client/src/cas_client/remote_client.rs`.

A 404 from `cas::query_dedup` is an expected cache miss: the caller converts it to `Ok(None)` and proceeds to upload. Today the retry wrapper logs it as `Fatal Error: "cas::query_dedup" api call failed ... 404 Not Found`, producing 20+ alarming-looking lines per upload session with no actual failure behind them (Hoyt flagged this in the incident Slack thread).

Fix: add `RetryWrapper::with_expected_404()`, mirroring the existing `with_expected_416()` pattern, and opt `query_dedup` into it. The 404 still short-circuits retries and surfaces as a fatal error to the caller (preserving the existing `Ok(None)` conversion), but the log line now reads `Not Found (cache miss): "cas::query_dedup" api call failed ... 404 Not Found`.

Test plan

- `cargo +nightly fmt --all --check` clean
- `cargo test -p xet-client --lib cas_client::retry_wrapper`: 5 passed (incl. new `test_404_expected_is_fatal_and_not_retried`)
- `HF_XET_CLIENT_READ_TIMEOUT=120` still overrides via env
- `Fatal Error:` lines for the `query_dedup` 404s

🤖 Generated with Claude Code
Note
Medium Risk
Adjusts client networking defaults (read timeout) and alters retry-wrapper handling/logging for HTTP 404s, which can change behavior and observability for slow uploads and cache-miss paths.
Overview
Raises the default `HF_XET_CLIENT_READ_TIMEOUT` from 120s to 300s to better tolerate slow-but-progressing transfers.

Adds `RetryWrapper::with_expected_404()` and opts `cas::query_dedup` into it so 404 responses are still non-retried/fatal to the caller but are logged as an expected cache miss (with a new unit test covering the no-retry behavior).

Reviewed by Cursor Bugbot for commit 3e88f9c. Bugbot is set up for automated code reviews on this repo.