
[hub] XetBlob: use /v2/reconstructions with v1 fallback#2265

Open
assafvayner wants to merge 3 commits into
mainfrom
assaf/xet-reconstruction-v2


@assafvayner assafvayner commented Jun 30, 2026

Collaborator

What

Updates XetBlob (the @huggingface/hub direct-from-Xet blob reader) to use the /v2/reconstructions endpoint, falling back to /v1/reconstructions when v2 is unavailable. This mirrors what xet-core's RemoteClient already does.

Why

The v2 reconstruction API returns multi-range signed URLs: a single URL can carry several byte ranges for a xorb, coalescing what v1 served as many separate single-range URLs. This reduces the number of fetch requests per download. xet-core already prefers v2 with a v1 fallback; this brings the JS client in line.

How

  • Version routing (#loadReconstructionInfo): build the v2 URL (from casUrl, or by swapping /v1/reconstructions/ for /v2/reconstructions/ on an explicit reconstructionUrl), try v2, and fall back to v1 on 404/501. The detected version is cached per CAS endpoint to avoid re-probing.
  • Normalization: both v1 (fetch_info) and v2 (xorbs) responses are converted into a common fetch-group shape — v1 yields one single-range group per entry, v2 keeps the server's multi-range grouping.
  • Reader: the tuned streaming path is unchanged for single-range groups (all v1, and v2 xorbs with one range). For multi-range v2 groups, the reader issues one combined Range: bytes=s1-e1,s2-e2,… request, buffers and splits the multipart/byteranges response (RFC 7233 §4.1), and feeds each part through the existing chunk parser into the RangeList cache.
  • New util multipart.ts: a port of xet-core's parse_multipart_byteranges.
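The version-routing step above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper names (`toV2Url`, `shouldFallBackToV1`, `candidateVersions`, `loadReconstruction`), the cache variable, and the injected `fetchFn` are all hypothetical, and the URL layout `${casUrl}/${version}/reconstructions/${fileHash}` is an assumption.

```typescript
// Hypothetical helper names; none of these are taken from the PR diff.
type ApiVersion = "v1" | "v2";

/** Swap the v1 path segment for v2 on an explicit reconstruction URL. */
function toV2Url(reconstructionUrl: string): string {
  return reconstructionUrl.replace("/v1/reconstructions/", "/v2/reconstructions/");
}

/** Statuses the PR description treats as "v2 is unavailable here". */
function shouldFallBackToV1(status: number): boolean {
  return status === 404 || status === 501;
}

// Detected version per CAS endpoint, so later calls skip the probe.
const versionCache = new Map<string, ApiVersion>();

/** Versions to try, in order: the cached one if known, else v2 then v1. */
function candidateVersions(casUrl: string): ApiVersion[] {
  const known = versionCache.get(casUrl);
  return known ? [known] : ["v2", "v1"];
}

/** Minimal response surface needed by the sketch (a subset of fetch's Response). */
interface MinimalResponse {
  status: number;
  ok: boolean;
  json(): Promise<unknown>;
}

/** Try each candidate version; fall back from v2 to v1 on 404/501. */
async function loadReconstruction(
  casUrl: string,
  fileHash: string,
  fetchFn: (url: string) => Promise<MinimalResponse>
): Promise<{ version: ApiVersion; body: unknown }> {
  for (const version of candidateVersions(casUrl)) {
    // Assumed URL layout, for illustration only.
    const resp = await fetchFn(`${casUrl}/${version}/reconstructions/${fileHash}`);
    if (version === "v2" && shouldFallBackToV1(resp.status)) {
      continue; // v2 not available: try the next candidate
    }
    if (!resp.ok) {
      throw new Error(`reconstruction request failed with status ${resp.status}`);
    }
    versionCache.set(casUrl, version);
    return { version, body: await resp.json() };
  }
  throw new Error("no reconstruction endpoint available");
}
```

Injecting `fetchFn` keeps the sketch testable without network access; the PR's tests appear to take a similar approach by stubbing `/v2/` with 404 responses.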

A v2 multi-range response must be buffered (not streamed) because the signed URL signs the exact multi-range header, so the parts can only be split after the full body arrives.
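To make the buffer-then-split step concrete, here is a simplified sketch of building the combined Range header and splitting a fully buffered multipart/byteranges body. It is not the PR's multipart.ts nor xet-core's parse_multipart_byteranges: the function names are hypothetical, and the sketch assumes every part carries a Content-Range header whose length can be used to slice the payload.

```typescript
interface ByteRange {
  start: number;
  end: number; // inclusive, as in HTTP Range headers
}

interface Part extends ByteRange {
  data: Uint8Array;
}

/** Build "bytes=s1-e1,s2-e2,..." from a list of inclusive byte ranges. */
function buildRangeHeader(ranges: ByteRange[]): string {
  return "bytes=" + ranges.map((r) => `${r.start}-${r.end}`).join(",");
}

/** Find `needle` in `haystack` starting at `from`; returns -1 if absent. */
function indexOfBytes(haystack: Uint8Array, needle: Uint8Array, from: number): number {
  outer: for (let i = from; i <= haystack.length - needle.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle[j]) continue outer;
    }
    return i;
  }
  return -1;
}

/** Split a buffered multipart/byteranges body into its parts. */
function parseMultipartByteranges(body: Uint8Array, boundary: string): Part[] {
  const enc = new TextEncoder();
  const dec = new TextDecoder();
  const delimiter = enc.encode(`--${boundary}`);
  const headerEnd = enc.encode("\r\n\r\n");
  const parts: Part[] = [];

  let pos = indexOfBytes(body, delimiter, 0);
  while (pos !== -1) {
    pos += delimiter.length;
    // "--" right after the delimiter marks the closing boundary.
    if (body[pos] === 0x2d && body[pos + 1] === 0x2d) return parts;

    const headersStop = indexOfBytes(body, headerEnd, pos);
    if (headersStop === -1) throw new Error("multipart: unterminated part headers");
    const headers = dec.decode(body.subarray(pos, headersStop));
    const m = /content-range:\s*bytes\s+(\d+)-(\d+)\/(?:\d+|\*)/i.exec(headers);
    if (!m) throw new Error("multipart: part without Content-Range");

    const start = Number(m[1]);
    const end = Number(m[2]);
    // Slice by the declared length so payload bytes that happen to
    // resemble the boundary cannot confuse the parser.
    const dataStart = headersStop + headerEnd.length;
    const dataEnd = dataStart + (end - start + 1);
    if (dataEnd > body.length) throw new Error("multipart: truncated part body");

    parts.push({ start, end, data: body.subarray(dataStart, dataEnd) });
    pos = indexOfBytes(body, delimiter, dataEnd);
  }
  throw new Error("multipart: missing closing boundary");
}
```

Each returned part can then be matched to the requested range by its `start` and `end` values before being handed to a chunk parser.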

Tests

  • New multipart.spec.ts (8 tests) for the byteranges parser.
  • New XetBlob v2 tests: multi-range multipart fetch, v2→v1 fallback on 404, and single-range v2.
  • Existing XetBlob mocks now simulate v1-only endpoints (404 on /v2/), so the fallback path is exercised; all prior tests still pass.

33 tests passing; eslint, prettier, and tsc clean on the changed files.


Note

Medium Risk
Changes the core Xet download path and byte reconstruction logic; risk is mitigated by v1 fallback, explicit errors on ambiguous multipart responses, and broad new tests, but regressions could still affect large-file downloads.

Overview
XetBlob now prefers /v2/reconstructions (with /v1 fallback on 404/501), caches the detected API version per CAS URL, and normalizes v1 fetch_info and v2 xorbs into a shared fetch-group model.
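As an illustration of the shared fetch-group model, the sketch below normalizes both response styles into one shape. All type and function names are hypothetical, the v1 entry shape is assumed, and the v2 `xorbs` shape in particular is a guess made for illustration only; the real payloads may differ.

```typescript
interface Span {
  start: number;
  end: number;
}

// Assumed shape of one v1 `fetch_info` entry: one URL covering one byte range.
interface V1FetchEntry {
  url: string;
  range: Span; // chunk index range within the xorb
  url_range: Span; // byte range to request from `url`
}

// Hypothetical shape of one v2 `xorbs` entry: one URL covering several ranges.
interface V2XorbEntry {
  url: string;
  ranges: Array<{ chunks: Span; bytes: Span }>;
}

// Common model: one signed URL plus the ranges it serves.
interface FetchGroup {
  url: string;
  ranges: Array<{ chunks: Span; bytes: Span }>;
}

/** v1: every entry becomes its own single-range group. */
function normalizeV1(fetchInfo: Record<string, V1FetchEntry[]>): Record<string, FetchGroup[]> {
  const out: Record<string, FetchGroup[]> = {};
  for (const [xorbHash, entries] of Object.entries(fetchInfo)) {
    out[xorbHash] = entries.map((e) => ({
      url: e.url,
      ranges: [{ chunks: e.range, bytes: e.url_range }],
    }));
  }
  return out;
}

/** v2: keep the server's multi-range grouping as-is. */
function normalizeV2(xorbs: Record<string, V2XorbEntry[]>): Record<string, FetchGroup[]> {
  const out: Record<string, FetchGroup[]> = {};
  for (const [xorbHash, entries] of Object.entries(xorbs)) {
    out[xorbHash] = entries.map((e) => ({
      url: e.url,
      ranges: e.ranges.map((r) => ({ chunks: r.chunks, bytes: r.bytes })),
    }));
  }
  return out;
}
```

With a shared model like this, a reader only needs to branch on `ranges.length`: one range can take the streaming path, more than one takes the buffered multipart path.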

For v2 multi-range signed URLs, the reader issues one combined Range request, buffers the response, parses multipart/byteranges via new multipart.ts, and decodes parts into the existing RangeList cache; single-range paths still stream as before. Chunk header parse/decompress logic is shared (parseChunkHeader, decompressChunk, storeChunks), and incomplete or wrong multipart responses throw instead of degrading to single-range fetches that could corrupt data.

Tests add multipart.spec.ts, a v2 reconstruction suite (multi-range, fallback, error cases), clearReconstructionApiVersionCache in beforeEach, and v2 404 stubs in existing mocks.

Reviewed by Cursor Bugbot for commit 2b1ba93. Bugbot is set up for automated code reviews on this repo.

Prefer the v2 reconstruction endpoint (multi-range signed URLs returning
multipart/byteranges responses), falling back to v1 on 404/501 and caching
the detected version per CAS endpoint. Mirrors xet-core's RemoteClient.

Both v1 and v2 responses are normalized into a common fetch-group shape. The
streaming reader is unchanged for single-range groups; multi-range v2 groups
are fetched in a single request, split per RFC 7233, and fed through the
existing chunk parser.
@assafvayner assafvayner marked this pull request as ready for review June 30, 2026 23:52
@assafvayner assafvayner requested a review from coyotte508 as a code owner June 30, 2026 23:52
Comment thread packages/hub/src/utils/XetBlob.ts
Comment thread packages/hub/src/utils/XetBlob.ts
coyotte508
coyotte508 previously approved these changes Jul 1, 2026
@coyotte508 coyotte508 dismissed their stale review July 1, 2026 20:17

bugbot comment

…eads

Address Bugbot findings on the v2 multi-range path:
- Throw when a multi-range request gets a non-multipart/byteranges
  response (coalesced or ignored ranges can't be mapped back to the
  requested ranges safely).
- Throw when the term cache is still incomplete after a multi-range
  fetch, instead of falling through to the single-range streaming path
  whose Range header doesn't match the multi-range signed URL.
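The first guard in this commit can be pictured as a Content-Type check that runs before any splitting is attempted. The sketch below is an assumption about how such a guard might look, with a hypothetical function name; it is not the code from XetBlob.ts.

```typescript
/**
 * Return the multipart boundary, or throw if the response is not
 * multipart/byteranges. Hypothetical helper, not taken from the PR.
 */
function extractByterangesBoundary(contentType: string | null): string {
  const m = /^multipart\/byteranges\s*;.*?boundary=(?:"([^"]+)"|([^;\s]+))/i.exec(contentType ?? "");
  const boundary = m ? m[1] ?? m[2] : undefined;
  if (!boundary) {
    // A single-part or full-body reply cannot be mapped back to the
    // requested ranges safely, so fail loudly instead of guessing.
    throw new Error(
      `expected a multipart/byteranges response, got ${contentType ?? "no Content-Type"}`
    );
  }
  return boundary;
}
```

Throwing here matches the commit's stated intent: a server that coalesces or ignores the ranges produces an error rather than silently misaligned data.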

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.


}
}

if (termRanges.every((range) => range.data)) {


Partial multipart chunks pass check

High Severity

After a v2 multi-range fetch, completeness is checked with termRanges.every((range) => range.data), which only requires a non-empty chunk array. A multipart/byteranges part that decodes fewer xorb chunks than the descriptor covers can still pass and serve truncated output instead of throwing the intended “did not produce all chunks” error.
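One way to picture the gap Bugbot describes, and a stricter alternative, is sketched below. The `TermRange` shape and its field names are hypothetical and not taken from XetBlob.ts, and this is not necessarily how the PR resolves the finding.

```typescript
// Hypothetical shape: field names are illustrative only.
interface TermRange {
  chunkStart: number; // first xorb chunk index covered by the descriptor
  chunkEnd: number; // exclusive end index
  data?: Uint8Array[]; // decoded chunks, once fetched
}

/** Truthiness check: passes as soon as `data` is set, even if chunks are missing. */
function looksFetched(r: TermRange): boolean {
  return Boolean(r.data);
}

/** Stricter check: the decoded chunk count must match the descriptor. */
function isComplete(r: TermRange): boolean {
  return r.data !== undefined && r.data.length === r.chunkEnd - r.chunkStart;
}

/** Throw if any range decoded fewer chunks than its descriptor covers. */
function assertAllComplete(termRanges: TermRange[]): void {
  const bad = termRanges.find((r) => !isComplete(r));
  if (bad) {
    throw new Error(
      `multi-range fetch did not produce all chunks for [${bad.chunkStart}, ${bad.chunkEnd})`
    );
  }
}
```

Under these assumptions, a range covering three chunks with only one decoded chunk passes `looksFetched` but fails `isComplete`, which is the truncated-output case the comment warns about.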



2 participants