Raise HF_XET_CLIENT_READ_TIMEOUT to 300s + clean up query_dedup 404 log #808

Open
rajatarya wants to merge 2 commits into main from rajat/raise-client-read-timeout-to-300s

@rajatarya rajatarya commented Apr 21, 2026

Two connected cleanups from the 2026-04-21 Julien upload-stuck investigation. Closes #807. Docs PR: huggingface/hub-docs#2419.

Change 1 — raise HF_XET_CLIENT_READ_TIMEOUT default 120s → 300s

Files: xet_runtime/src/config/groups/client.rs, xet_client/src/cas_client/remote_client.rs (stale comment).

The 120s client read timeout was firing before legitimate upload_xorb requests could complete on high-latency / transatlantic / bursty links. Fleet-wide this produced a chronic 30–50% xorb POST failure rate (1,092–4,196 "error uploading xorb" events per hour sustained over 24h, peaking at 49.1% in the investigation window). 267 successful uploads in the same 24h had latency > 120s (max 37 min), so 120s wasn't protecting anything legitimate; it was only cutting off slow-but-healthy streams.

300s preserves stall-detection semantics (still an order of magnitude under the 3600s ALB idle timeout). The env override HF_XET_CLIENT_READ_TIMEOUT is unchanged.
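As a rough illustration of "new default, same env override", the resolution logic could look like the sketch below. This is not the code in xet_runtime/src/config/groups/client.rs; the names `DEFAULT_READ_TIMEOUT_SECS`, `resolve_read_timeout` and `read_timeout_from_env` are hypothetical, and the sketch assumes the variable holds a plain integer number of seconds (the real config layer may accept other formats).

```rust
use std::time::Duration;

/// Default read timeout in seconds after this PR (300s, per the description).
/// Hypothetical constant name, for illustration only.
pub const DEFAULT_READ_TIMEOUT_SECS: u64 = 300;

/// Resolve the read timeout from an optional raw override string.
/// Assumption: the override is an integer number of seconds; anything
/// missing or unparseable falls back to the default.
pub fn resolve_read_timeout(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|v| v.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_READ_TIMEOUT_SECS);
    Duration::from_secs(secs)
}

/// Read the override from the process environment, if it is set.
pub fn read_timeout_from_env() -> Duration {
    let raw = std::env::var("HF_XET_CLIENT_READ_TIMEOUT").ok();
    resolve_read_timeout(raw.as_deref())
}
```

Under these assumptions, running with HF_XET_CLIENT_READ_TIMEOUT=120 would restore the previous 120s behavior, while leaving the variable unset would pick up the 300s default.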

Change 2 — log query_dedup 404 as cache miss, not "Fatal Error"

Files: xet_client/src/cas_client/retry_wrapper.rs, xet_client/src/cas_client/remote_client.rs.

A 404 from cas::query_dedup is an expected cache miss: the caller converts it to Ok(None) and proceeds to upload. Today the retry wrapper logs it as Fatal Error: "cas::query_dedup" api call failed ... 404 Not Found, producing 20+ alarming-looking lines per upload session with no actual failure behind them (Hoyt flagged this in the incident Slack thread).

Fix: add RetryWrapper::with_expected_404(), mirroring the existing with_expected_416() pattern, and opt query_dedup into it. The 404 still short-circuits retries and surfaces as a fatal error to the caller (preserving the existing Ok(None) conversion), but the log line now reads Not Found (cache miss): "cas::query_dedup" api call failed ... 404 Not Found.
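To make the intended behavior concrete, here is a minimal standalone sketch of the idea: the opt-in flag changes only the log prefix for a 404, not how the status is classified. It does not reproduce the real RetryWrapper; `SketchPolicy`, `Disposition`, `log_prefix` and `to_dedup_result` are hypothetical names, the set of retryable statuses (429 and 5xx) is an assumption, and the "Retryable Error" and "OK" prefixes are invented for the sketch. Only "Fatal Error" and "Not Found (cache miss)" come from this PR.

```rust
/// How this illustrative policy treats a response status.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Disposition {
    Success,
    Retry,
    Fatal,
}

/// Hypothetical stand-in for the retry wrapper's per-call options.
#[derive(Debug, Default, Clone, Copy)]
pub struct SketchPolicy {
    expect_404: bool,
}

impl SketchPolicy {
    pub fn new() -> Self {
        Self::default()
    }

    /// Opt in to treating 404 as an expected outcome for logging purposes.
    pub fn with_expected_404(mut self) -> Self {
        self.expect_404 = true;
        self
    }

    /// Classification ignores the flag: a 404 is never retried.
    /// Assumption: 429 and 5xx are the retryable statuses.
    pub fn classify(&self, status: u16) -> Disposition {
        match status {
            200..=299 => Disposition::Success,
            429 | 500..=599 => Disposition::Retry,
            _ => Disposition::Fatal,
        }
    }

    /// Only the log prefix changes when a 404 was declared as expected.
    pub fn log_prefix(&self, status: u16) -> &'static str {
        match (self.classify(status), status, self.expect_404) {
            (Disposition::Fatal, 404, true) => "Not Found (cache miss)",
            (Disposition::Fatal, _, _) => "Fatal Error",
            (Disposition::Retry, _, _) => "Retryable Error",
            (Disposition::Success, _, _) => "OK",
        }
    }
}

/// Caller-side conversion described in the PR: a 404 becomes Ok(None),
/// any other non-success status remains an error.
pub fn to_dedup_result(status: u16, body: Option<String>) -> Result<Option<String>, u16> {
    match status {
        200..=299 => Ok(body),
        404 => Ok(None),
        other => Err(other),
    }
}
```

Keeping classification independent of the flag is what preserves the existing control flow: callers still see the same non-retried result, and only observability changes.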

Test plan

  • cargo +nightly fmt --all --check clean
  • cargo test -p xet-client --lib cas_client::retry_wrapper — 5 passed (incl. new test_404_expected_is_fatal_and_not_retried)
  • Manually verify HF_XET_CLIENT_READ_TIMEOUT=120 still overrides via env
  • Confirm a session run produces no Fatal Error: lines for the query_dedup 404s
  • Watch the xorb POST error-rate panel on the CAS Grafana dashboard after release; expect the 120s-clustered p50 to disappear
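The no-retry property that test_404_expected_is_fatal_and_not_retried is described as covering can be pictured with a small attempt-counting loop. The sketch below is not the project's test or retry wrapper; `run_with_retries` is a hypothetical helper, backoff delays are omitted, and the retryable set (429 and 5xx) is an assumption.

```rust
/// Call `call` until it returns a non-retryable status or the attempt
/// budget is exhausted. Returns the last status and the number of attempts.
/// Assumption: only 429 and 5xx are retried; backoff is left out for brevity.
pub fn run_with_retries<F>(max_attempts: u32, mut call: F) -> (u16, u32)
where
    F: FnMut() -> u16,
{
    let mut attempts = 0;
    loop {
        attempts += 1;
        let status = call();
        let retryable = matches!(status, 429 | 500..=599);
        if !retryable || attempts >= max_attempts {
            return (status, attempts);
        }
    }
}
```

With this shape, a call that always answers 404 returns after a single attempt, whereas a call that always answers 503 uses the whole attempt budget.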

🤖 Generated with Claude Code


Note

Medium Risk
Adjusts client networking defaults (read timeout) and alters retry-wrapper handling/logging for HTTP 404s, which can change behavior and observability for slow uploads and cache-miss paths.

Overview
Raises the default HF_XET_CLIENT_READ_TIMEOUT from 120s to 300s to better tolerate slow-but-progressing transfers.

Adds RetryWrapper::with_expected_404() and opts cas::query_dedup into it so 404 responses are still non-retried/fatal to the caller but are logged as an expected cache miss (with a new unit test covering the no-retry behavior).

Reviewed by Cursor Bugbot for commit 3e88f9c. Bugbot is set up for automated code reviews on this repo.

rajatarya and others added 2 commits April 21, 2026 09:43
The 120s default was firing before legitimate upload_xorb requests
could complete on high-latency / transatlantic / bursty links,
producing a chronic 30-50% fleet-wide xorb POST failure rate.
Raise the client default to 300s; env override is unchanged.

Refs #807

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A 404 from cas::query_dedup is an expected cache miss — the caller
converts it to Ok(None) and proceeds to upload the chunk. Today the
retry wrapper logs it as `Fatal Error: "cas::query_dedup" api call
failed ... 404 Not Found`, which produces 20+ alarming-looking lines
per upload session with no actual failure behind them.

Add RetryWrapper::with_expected_404(), mirroring the existing
with_expected_416() pattern, and opt query_dedup into it. The 404
still short-circuits retries and surfaces as a fatal error to the
caller (preserving the existing Ok(None) conversion), but the log
line now reads as a cache miss.

Refs #807

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rajatarya rajatarya changed the title Raise HF_XET_CLIENT_READ_TIMEOUT default from 120s to 300s Raise HF_XET_CLIENT_READ_TIMEOUT to 300s + clean up query_dedup 404 log Apr 21, 2026
@rajatarya rajatarya requested review from assafvayner, hoytak and seanses and removed request for assafvayner April 21, 2026 17:00
@rajatarya rajatarya marked this pull request as ready for review April 22, 2026 22:20