[hub] XetBlob: use /v2/reconstructions with v1 fallback #2265
assafvayner wants to merge 3 commits into
Prefer the v2 reconstruction endpoint (multi-range signed URLs returning multipart/byteranges responses), falling back to v1 on 404/501 and caching the detected version per CAS endpoint. Mirrors xet-core's RemoteClient. Both v1 and v2 responses are normalized into a common fetch-group shape. The streaming reader is unchanged for single-range groups; multi-range v2 groups are fetched in a single request, split per RFC 7233, and fed through the existing chunk parser.
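The normalization step described above could look roughly like the sketch below. The response shapes (`V1Response`, `V2Response`), the `FetchGroup` type, and both function names are hypothetical simplifications for illustration; the real CAS schemas and the PR's own types may differ.

```typescript
// Hypothetical, simplified shapes; not the actual CAS response schemas.
interface ChunkRange { start: number; end: number } // chunk indices (end exclusive, assumed)
interface ByteRange { start: number; end: number }  // byte offsets (end inclusive, assumed)
interface RangePair { chunks: ChunkRange; bytes: ByteRange }

interface V1Response {
  fetch_info: Record<string, Array<{ url: string; range: ChunkRange; url_range: ByteRange }>>;
}
interface V2Response {
  xorbs: Record<string, Array<{ url: string; ranges: RangePair[] }>>;
}

// Common shape: one signed URL plus every range that URL is allowed to serve.
interface FetchGroup { xorbHash: string; url: string; ranges: RangePair[] }

// v1: each entry becomes its own single-range group.
function normalizeV1(resp: V1Response): FetchGroup[] {
  const groups: FetchGroup[] = [];
  for (const [xorbHash, entries] of Object.entries(resp.fetch_info)) {
    for (const entry of entries) {
      groups.push({ xorbHash, url: entry.url, ranges: [{ chunks: entry.range, bytes: entry.url_range }] });
    }
  }
  return groups;
}

// v2: keep the server's grouping, so a group may carry several ranges.
function normalizeV2(resp: V2Response): FetchGroup[] {
  const groups: FetchGroup[] = [];
  for (const [xorbHash, entries] of Object.entries(resp.xorbs)) {
    for (const entry of entries) {
      groups.push({ xorbHash, url: entry.url, ranges: entry.ranges.map((r) => ({ ...r })) });
    }
  }
  return groups;
}
```

With a shape like this, downstream code only needs to branch on `ranges.length`: one range can use the streaming path, several ranges need the multi-range path.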
…eads
Address Bugbot findings on the v2 multi-range path:
- Throw when a multi-range request gets a non-multipart/byteranges response (coalesced or ignored ranges can't be mapped back to the requested ranges safely).
- Throw when the term cache is still incomplete after a multi-range fetch, instead of falling through to the single-range streaming path whose Range header doesn't match the multi-range signed URL.
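A minimal sketch of the first guard, assuming the check is driven by the response's Content-Type header. The function name `requireMultipartBoundary` and its signature are hypothetical and not taken from the PR.

```typescript
// Hypothetical helper: returns the multipart boundary for a multi-range
// response, or throws if the server did not answer with multipart/byteranges.
function requireMultipartBoundary(contentType: string | null, requestedRanges: number): string | null {
  // A single-range request is expected to return a plain 206 body, no boundary.
  if (requestedRanges <= 1) {
    return null;
  }
  const match = /^\s*multipart\/byteranges\s*;.*?\bboundary=(?:"([^"]+)"|([^;\s]+))/i.exec(contentType ?? "");
  const boundary = match ? match[1] ?? match[2] : undefined;
  if (!boundary) {
    // Coalesced or ignored ranges cannot be mapped back to the requested ones.
    throw new Error(
      `Expected multipart/byteranges for ${requestedRanges} ranges, got "${contentType ?? "no content-type"}"`
    );
  }
  return boundary;
}
```

Failing loudly here keeps a server that silently merges or drops ranges from producing bytes that get attributed to the wrong range.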
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 2b1ba93.
}
}

if (termRanges.every((range) => range.data)) {
Partial multipart chunks pass check
High Severity
After a v2 multi-range fetch, completeness is checked with termRanges.every((range) => range.data), which only requires a non-empty chunk array. A multipart/byteranges part that decodes fewer xorb chunks than the descriptor covers can still pass and serve truncated output instead of throwing the intended “did not produce all chunks” error.
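One way to tighten the check flagged above is to compare the number of decoded chunks with the number of chunks the descriptor covers. The sketch below assumes each term range stores one array entry per decoded chunk and uses an exclusive end index; the `TermRange` shape and both function names are hypothetical.

```typescript
// Hypothetical shape: `data` holds one entry per decoded xorb chunk.
interface TermRange {
  start: number; // first chunk index covered by the descriptor
  end: number; // one past the last chunk index (exclusive, assumed)
  data?: Uint8Array[];
}

// Complete only if every chunk in [start, end) was decoded.
function isComplete(range: TermRange): boolean {
  return range.data !== undefined && range.data.length === range.end - range.start;
}

function assertAllComplete(termRanges: TermRange[]): void {
  const incomplete = termRanges.filter((range) => !isComplete(range));
  if (incomplete.length > 0) {
    throw new Error(`Multi-range fetch did not produce all chunks for ${incomplete.length} range(s)`);
  }
}
```

A range covering three chunks with only two decoded passes a truthiness test on `data` but fails `isComplete`, which is the case the finding describes.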


What
Updates XetBlob (the @huggingface/hub direct-from-Xet blob reader) to use the /v2/reconstructions endpoint, falling back to /v1/reconstructions when v2 is unavailable. This mirrors what xet-core's RemoteClient already does.
Why
The v2 reconstruction API returns multi-range signed URLs: a single URL can carry several byte ranges for a xorb, coalescing what v1 served as many separate single-range URLs. This reduces the number of fetch requests per download. xet-core already prefers v2 with a v1 fallback; this brings the JS client in line.
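To illustrate the coalescing, here is a small sketch of turning several byte ranges into one HTTP Range header value. HTTP byte ranges are inclusive on both ends; the `ByteRange` type and `buildRangeHeader` name are hypothetical.

```typescript
// Byte offsets, inclusive on both ends, as in the HTTP Range header.
interface ByteRange {
  start: number;
  end: number;
}

// Builds a single "bytes=s1-e1,s2-e2,..." value for one multi-range request.
function buildRangeHeader(ranges: ByteRange[]): string {
  if (ranges.length === 0) {
    throw new Error("At least one range is required");
  }
  for (const r of ranges) {
    if (!Number.isInteger(r.start) || !Number.isInteger(r.end) || r.start < 0 || r.end < r.start) {
      throw new Error(`Invalid byte range ${r.start}-${r.end}`);
    }
  }
  return "bytes=" + ranges.map((r) => `${r.start}-${r.end}`).join(",");
}

// Two ranges that v1 would serve as two requests become one header value.
const header = buildRangeHeader([
  { start: 0, end: 99 },
  { start: 200, end: 299 },
]);
```

Here `header` is `bytes=0-99,200-299`, sent once against the multi-range signed URL instead of once per range.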
How
- In #loadReconstructionInfo: build the v2 URL (from casUrl, or by swapping /v1/reconstructions/ → /v2/reconstructions/ on an explicit reconstructionUrl), try v2, and fall back to v1 on 404/501. The detected version is cached per CAS endpoint to avoid re-probing.
- v1 (fetch_info) and v2 (xorbs) responses are converted into a common fetch-group shape: v1 yields one single-range group per entry, v2 keeps the server's multi-range grouping.
- For a multi-range group, the reader issues one Range: bytes=s1-e1,s2-e2,… request, buffers and splits the multipart/byteranges response (RFC 7233 §4.1), and feeds each part through the existing chunk parser into the RangeList cache.
- multipart.ts: a port of xet-core's parse_multipart_byteranges.

A v2 multi-range response must be buffered (not streamed) because the signed URL signs the exact multi-range header, so the parts can only be split after the full body arrives.
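The probe-and-fallback flow from the first bullet can be sketched as follows. This is not the PR's #loadReconstructionInfo: the `loadReconstruction` and `toV2Url` names, the `versionCache` map, and the injected `Fetcher` type are assumptions made so the example runs without a network.

```typescript
type ApiVersion = "v1" | "v2";

// Minimal subset of a fetch Response, injected to keep the sketch testable.
interface MinimalResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
type Fetcher = (url: string) => Promise<MinimalResponse>;

// Detected API version per CAS endpoint, so later loads skip the probe.
const versionCache = new Map<string, ApiVersion>();

function toV2Url(v1Url: string): string {
  return v1Url.replace("/v1/reconstructions/", "/v2/reconstructions/");
}

async function loadReconstruction(
  casUrl: string,
  fileHash: string,
  fetcher: Fetcher
): Promise<{ version: ApiVersion; body: unknown }> {
  const v1Url = `${casUrl}/v1/reconstructions/${fileHash}`;

  // Try v2 unless this endpoint is already known to be v1-only.
  if (versionCache.get(casUrl) !== "v1") {
    const v2Resp = await fetcher(toV2Url(v1Url));
    if (v2Resp.ok) {
      versionCache.set(casUrl, "v2");
      return { version: "v2", body: await v2Resp.json() };
    }
    // Only 404 and 501 mean "v2 not available"; anything else is a real error.
    if (v2Resp.status !== 404 && v2Resp.status !== 501) {
      throw new Error(`v2 reconstruction request failed with status ${v2Resp.status}`);
    }
    versionCache.set(casUrl, "v1");
  }

  const v1Resp = await fetcher(v1Url);
  if (!v1Resp.ok) {
    throw new Error(`v1 reconstruction request failed with status ${v1Resp.status}`);
  }
  return { version: "v1", body: await v1Resp.json() };
}
```

Against a v1-only endpoint, the first call costs two requests (v2 probe, then v1) and every later call for the same `casUrl` costs one.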
Tests
- multipart.spec.ts (8 tests) for the byteranges parser.
- XetBlob v2 tests: multi-range multipart fetch, v2→v1 fallback on 404, and single-range v2.
- XetBlob mocks now simulate v1-only endpoints (404 on /v2/), so the fallback path is exercised; all prior tests still pass.

33 tests passing; eslint, prettier, and tsc clean on the changed files.
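For readers unfamiliar with the format under test, the sketch below shows what a byteranges parser has to do. It is an independent illustration, not the PR's multipart.ts or xet-core's parse_multipart_byteranges; the `BytePart` type and function names are hypothetical. It slices each part by the length implied by its Content-Range header, so binary payloads that happen to contain the boundary bytes are not split incorrectly.

```typescript
interface BytePart {
  start: number; // first byte offset (inclusive)
  end: number; // last byte offset (inclusive)
  data: Uint8Array;
}

// Naive byte search; adequate for a sketch.
function indexOfBytes(haystack: Uint8Array, needle: Uint8Array, from: number): number {
  outer: for (let i = from; i + needle.length <= haystack.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle[j]) continue outer;
    }
    return i;
  }
  return -1;
}

function parseMultipartByteranges(body: Uint8Array, boundary: string): BytePart[] {
  const encoder = new TextEncoder();
  const decoder = new TextDecoder();
  const delimiter = encoder.encode(`--${boundary}`);
  const headerTerminator = encoder.encode("\r\n\r\n");
  const parts: BytePart[] = [];

  let pos = indexOfBytes(body, delimiter, 0);
  while (pos !== -1) {
    pos += delimiter.length;
    // "--" directly after the delimiter marks the closing boundary.
    if (body[pos] === 0x2d && body[pos + 1] === 0x2d) {
      return parts;
    }
    const headersEnd = indexOfBytes(body, headerTerminator, pos);
    if (headersEnd === -1) throw new Error("Unterminated part headers");
    const headers = decoder.decode(body.subarray(pos, headersEnd));
    const match = /content-range:\s*bytes\s+(\d+)-(\d+)\/(?:\d+|\*)/i.exec(headers);
    if (!match) throw new Error("Part is missing a Content-Range header");
    const start = Number(match[1]);
    const end = Number(match[2]);
    const dataStart = headersEnd + headerTerminator.length;
    const dataEnd = dataStart + (end - start + 1);
    if (dataEnd > body.length) throw new Error("Part body is truncated");
    parts.push({ start, end, data: body.subarray(dataStart, dataEnd) });
    pos = indexOfBytes(body, delimiter, dataEnd);
  }
  throw new Error("Missing closing boundary");
}
```

A test in the spirit of multipart.spec.ts would build a small body with two parts and a closing `--boundary--` line, then assert on each part's offsets and bytes.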
Note
Medium Risk
Changes the core Xet download path and byte reconstruction logic; risk is mitigated by v1 fallback, explicit errors on ambiguous multipart responses, and broad new tests, but regressions could still affect large-file downloads.
Overview
XetBlob now prefers /v2/reconstructions (with /v1 fallback on 404/501), caches the detected API version per CAS URL, and normalizes v1 fetch_info and v2 xorbs into a shared fetch-group model.
For v2 multi-range signed URLs, the reader issues one combined Range request, buffers the response, parses multipart/byteranges via the new multipart.ts, and decodes parts into the existing RangeList cache; single-range paths still stream as before. Chunk header parse/decompress logic is shared (parseChunkHeader, decompressChunk, storeChunks), and incomplete or wrong multipart responses throw instead of degrading to single-range fetches that could corrupt data.
Tests add multipart.spec.ts, a v2 reconstruction suite (multi-range, fallback, error cases), clearReconstructionApiVersionCache in beforeEach, and v2 404 stubs in existing mocks.