feat: range aware file write #717

Open
XciD wants to merge 36 commits into main from feat/file-chunk-hashes-and-compose

Conversation

@XciD (Member) commented Mar 16, 2026

Summary

APIs for range-aware file writes: instead of re-uploading an entire file when only part of it changed, compose a new CAS file from stable segments + re-chunked dirty windows. Supports resize edits (insert / delete / arbitrary replace) in addition to in-place rewrites.

API: upload_ranges

pub async fn upload_ranges(
    config: Arc<TranslatorConfig>,
    cas_client: Arc<dyn Client>,
    original_hash: MerkleHash,
    original_size: u64,
    dirty_inputs: Vec<DirtyInput>,
) -> Result<XetFileInfo>
/// A single edit applied to the original file: replace `original_range` with
/// `new_length` bytes from `reader`. Edits are expressed in original-file coordinates.
pub struct DirtyInput {
    pub original_range: Range<u64>,
    pub reader: Pin<Box<dyn AsyncRead + Send>>,
    pub new_length: u64,
}

The output file size is derived from the inputs (no total_size parameter): original_size - removed + added.
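
For illustration, a minimal sketch of that derivation (the PR's real helper is compute_total_size; the name and signature below are illustrative, using checked arithmetic so absurd new_length values surface as an error instead of overflowing):

fn derived_total_size(original_size: u64, edits: &[DirtyInput]) -> Option<u64> {
    let mut size = original_size;
    for e in edits {
        let removed = e.original_range.end - e.original_range.start;
        size = size.checked_sub(removed)?;      // drop the replaced original span
        size = size.checked_add(e.new_length)?; // add the caller's new bytes
    }
    Some(size)
}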

Edit shapes (all expressible with the same struct)

Operation        original_range                  new_length
In-place edit    a..b                            b - a
Resize replace   a..b                            any
Pure insert      p..p                            > 0
Pure delete      a..b                            0
Append           original_size..original_size    > 0
Truncate to N    N..original_size                0
No-op            (empty dirty_inputs)            -

Motivating example:

abc + upload_ranges([0..1), "foo", 3) = foobc
abc + upload_ranges([0..0), "foo", 3) = fooabc
abc + upload_ranges([0..1), "",    0) = bc

Per-range AsyncRead instead of ReadSeek over the staging file

The earlier prototype took dirty_ranges: &[(u64, u64)] + dirty_source: &mut dyn ReadSeek. That had a subtle bug: for truncation we silently extended the dirty set with a boundary chunk and read those bytes from the staging file, but if the file was never opened for write the staging file contains zeros at those positions (the real bytes are in CAS) → silent corruption on the truncation boundary chunk. Pairing each edit with its own reader makes that structurally impossible: any byte not provided by the caller is fetched from CAS.

How it works

High level

                           upload_ranges
   +----------------------+   |   +----------------------+
   |  original file (CAS) |---+-->|  composed file (CAS) |
   +----------------------+       +----------------------+
   only the dirty windows are re-uploaded; everything else
   is reused as whole CAS segments.

Step 1 — coalesce + snap edits to segment boundaries

Edits arrive in original-file coordinates (byte ranges). We snap each edit's original_range to the enclosing CAS segments so composition can swap whole segments instead of truncating one mid-chunk. Adjacent or overlapping snapped ranges are then coalesced.

Pure inserts (start == end) snap to the segment that owns start; an insert at original_size snaps to the last segment.
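
Roughly, the snapping behaves like this sketch (illustrative only: the real code works on CAS segment metadata, not a flat offset list, and coalescing happens in a separate step):

// Sketch: seg_starts = sorted segment start offsets, beginning with 0.
fn snap_to_segments(r: std::ops::Range<u64>, seg_starts: &[u64], file_size: u64) -> std::ops::Range<u64> {
    // Snapped start: start of the segment owning r.start (an EOF insert clamps to the last segment).
    let pos = r.start.min(file_size.saturating_sub(1));
    let start = seg_starts.iter().copied().filter(|&s| s <= pos).last().unwrap_or(0);
    // Snapped end: first segment start at or past r.end that also advances past `start`, else EOF.
    let end = seg_starts
        .iter()
        .copied()
        .find(|&s| s >= r.end && s > start)
        .unwrap_or(file_size);
    start..end
}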

Step 2 — server returns windows + gap subtrees

Single CAS call: GET /v2/file-chunk-hashes/{file_id} with the segment-aligned ranges in an X-Range-Dirty: bytes=A-B,C-D header. Response shape (xetcas#987):

struct FileChunkHashesResponse {
    windows:      Vec<ChunkWindow>,         // one per dirty range
    hash_ranges:  Vec<Option<MerkleHashSubtree>>, // N+1 entries: [gap0, gap1, ..., gapN]
}

windows[i].chunks carries the chunk hashes the server actually owns for that window (we re-upload these bytes). hash_ranges[i] is the MerkleHashSubtree for the i-th unmodified gap, or None when there is no gap there. This is the key to composing the final file hash without touching unmodified bytes.
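
Shape illustration for a two-range request (not literal syntax; windows and gaps named only for reference):

GET /v2/file-chunk-hashes/{file_id}
X-Range-Dirty: bytes=A-B,C-D

windows     = [ w0, w1 ]              // chunk hashes we re-chunk + re-upload
hash_ranges = [ gap0, gap1, gap2 ]    // subtrees for [0, w0.start), [w0.end, w1.start), [w1.end, EOF)
                                      // each entry is Some(subtree), or None when that gap is empty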

Step 3 — for each window, stream [CAS prefix | edits | CAS suffix] through a fresh cleaner

window = [w_start ............................................. w_end]
edits in this window:        [edit_a]    [edit_b]
                                ^           ^
streamed input to the cleaner:
  CAS bytes [w_start, edit_a.start)
  reader bytes for edit_a (new_length bytes)
  CAS bytes [edit_a.end, edit_b.start)
  reader bytes for edit_b
  CAS bytes [edit_b.end, w_end)

Pure inserts contribute zero original bytes but still emit new_length reader bytes. Pure deletes contribute zero reader bytes. The cleaner produces a new MDBFileInfo per window and a ChunkHashList.
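
A self-contained model of that interleave (the real code streams CAS ranges and per-edit AsyncReads straight into the cleaner instead of materializing a Vec; edits are assumed sorted and contained in the window):

// compose_window(b"hello world", 0..11, &[(6..11, b"rust!")]) == b"hello rust!"
fn compose_window(original: &[u8], w: std::ops::Range<usize>, edits: &[(std::ops::Range<usize>, &[u8])]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut cursor = w.start;
    for (r, new_bytes) in edits {
        out.extend_from_slice(&original[cursor..r.start]); // CAS bytes before this edit
        out.extend_from_slice(new_bytes);                   // reader bytes (new_length of them)
        cursor = r.end;                                      // skip the replaced original bytes
    }
    out.extend_from_slice(&original[cursor..w.end]);         // CAS suffix up to the window end
    out
}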

Step 4 — compose the file hash via MerkleHashSubtree::merge

merge_seq = [gap0, w0, gap1, w1, ..., wN, gapN]   // skip None gaps

merged          = MerkleHashSubtree::merge(merge_seq)
aggregated_hash = merged.final_hash()
combined_hash   = aggregated_hash.hmac(zero)      // matches cleaner's file_hash

Special case: if total_size == 0 (e.g. truncate to empty) the result is MerkleHash::default() without HMAC, mirroring file_hash([]).
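
Sketched in Rust (method names follow the pseudo-code above; the exact signatures, error type, and the "zero" key type passed to hmac are assumptions, not this PR's code):

use anyhow::Result;

fn compose_file_hash(
    gaps: Vec<Option<MerkleHashSubtree>>, // N+1 entries [gap0 .. gapN], None = empty gap
    windows: Vec<MerkleHashSubtree>,      // N entries, computed client-side per window
    total_size: u64,
) -> Result<MerkleHash> {
    // Truncate-to-empty mirrors file_hash([]): default hash, no HMAC.
    if total_size == 0 {
        return Ok(MerkleHash::default());
    }
    // Interleave [gap0, w0, gap1, w1, ..., wN, gapN], skipping None gaps.
    let mut gaps = gaps.into_iter();
    let mut merge_seq = Vec::new();
    for w in windows {
        merge_seq.extend(gaps.next().flatten());
        merge_seq.push(w);
    }
    merge_seq.extend(gaps.next().flatten());

    let merged = MerkleHashSubtree::merge(merge_seq)?;
    Ok(merged.final_hash().hmac(MerkleHash::default() /* the "zero" key */))
}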

Step 5 — splice segments + register

Walk the original MDBFileInfo.segments and replace any segment that falls inside a window with that window's freshly-uploaded segments. Verification entries follow segment-for-segment when present. metadata_ext = None (no SHA-256, see Limitations). Then register_composed_file + finalize.

Multi-window example

Two edits: replace [50MB, 51MB) and [150MB, 151MB) on a 200MB file:

+-----------+-------+------------+-------+-----------+
|  GAP 0    |  W0   |   GAP 1    |  W1   |  GAP 2    |
|  reused   |upload |  reused    |upload |  reused   |
| (subtree) | ~1MB  | (subtree)  | ~1MB  | (subtree) |
+-----------+-------+------------+-------+-----------+

Wire transfer: ~2MB upload + a few hundred KB of CAS reads for window
boundary chunks. Old approach: 200MB download + 200MB upload.

Empty original short-circuit

When original_size == 0 there is nothing to compose against — every edit's original_range must be 0..0 (validated). We just stream the new bytes through a fresh cleaner (upload_fresh_file).

Reviewer note: chunk_window_builder is a re-implementation of xetcas

xet_client/src/cas_client/chunk_window_builder.rs is a port of the same window-building state machine that already lives in xetcas — it's only used by the local / in-memory simulation clients (local_client, memory_client) so the mock CAS server returns the same shape as the real one in tests. No need to re-review it as part of this PR: it mirrors logic already reviewed and merged in xetcas#987. A follow-up xetcas PR will deduplicate by removing the server-side copy and pulling this one in (or vice versa); the duplication is intentional and temporary.

Limitations

  • No SHA-256 metadata: composed files have metadata_ext = None since recomputing SHA-256 would require reading the full file. Only suitable for contexts that don't require SHA-256 verification (HF buckets, xet-native repos), not for Git LFS-backed repos.
  • Memory: for very large files, the per-window in-memory state (chunk hash list + composed segments) is bounded by the dirty regions, not the whole file. The chunk-hashes response is paginated by the server-defined window granularity.

Tests (27)

Covering all edit shapes + edge cases. Notable:

Test Purpose
test_resize_edits_abc The 3 motivating FUSE examples
test_resize_large_replace_grows_file Replace [a..b) with much more data
test_resize_large_replace_shrinks_file Replace [a..b) with much less data
test_resize_mid_file_insert Pure insert in the middle
test_resize_mid_file_delete Pure delete in the middle
test_resize_multi_edit_mix Insert + replace + delete in one call
test_resize_insert_at_segment_boundary Snapping correctness for inserts
test_upload_ranges_mid_file_edit In-place edit
test_upload_ranges_truncation Pure truncate (sub-segment)
test_upload_ranges_truncation_empty_staging Truncate when staging is all-zero (boundary read from CAS)
test_upload_ranges_truncation_with_overlapping_dirty Truncate + dirty range overlapping the boundary
test_truncate_to_empty_matches_clean_empty Truncating to 0 hashes to MerkleHash::default() (matches a fresh empty cleaner)
test_upload_ranges_append Pure append
test_append_with_gap_before_dirty_range Append where reader covers a sparse gap too
test_append_sparse_staging_file Append on a sparse staging file
test_mid_edit_plus_append Mid-file edit and append in one call (P1 codex regression)
test_empty_original_append original_size == 0 + append falls into the fresh-file path (P2 codex regression)
test_empty_original_validates_ranges original_size == 0 still runs validation (reviewer regression)
test_upload_ranges_at_file_start Edit at offset 0 (no stable prefix)
test_upload_ranges_multiple_regions Two non-adjacent dirty windows with stable gap
test_single_input_spanning_many_chunks One edit covering many CDC chunks
test_data_integrity_scenarios 5 sub-scenarios covering composition correctness
test_noop_returns_original_hash Empty dirty_inputs → no CAS call, original hash returned
test_rejects_dirty_range_past_total_size Validation: range past original_size
test_rejects_overlapping_dirty_ranges Validation: overlapping edits
test_rejects_unsorted_dirty_ranges Validation: unsorted edits
test_upload_ranges_small_file_mid_edit Small files (single segment)

Dependencies


Note

Medium Risk
Adds a new partial-upload composition path (upload_ranges) that changes how file hashes and reconstruction plans are computed and registered, which can impact data integrity if windowing/merge invariants are wrong. Scope is contained to new APIs and plumbing but touches core dedup/finalization return shapes.

Overview
Adds a new range-aware upload API, upload_ranges, which composes an updated file by re-uploading only chunk-aligned “dirty windows” while reusing stable CAS segments for the rest of the file (including resize edits like insert/delete/replace).

Introduces a new CAS client endpoint wrapper get_file_chunk_hashes (HTTP GET /v2/file-chunk-hashes/{file_id} using X-Range-Dirty) and new response types (FileChunkHashesResponse, ChunkWindow, X_RANGE_DIRTY_HEADER) to return dirty windows plus gap MerkleHashSubtree summaries (and segment verification hashes) needed to recompute the final file hash without transferring per-chunk hashes.

Extends the ingestion/dedup pipeline to support composition: FileDeduper::finalize now returns the per-file ChunkHashList, SingleFileCleaner gains finish_with_chunks/finish_with_chunks_detached, and FileUploadSession adds detached completion + register_composed_file so window uploads don’t create orphan shard entries before the final composed MDBFileInfo is registered. Simulation clients implement get_file_chunk_hashes locally via the new chunk_window_builder helper, and tests are added/updated to cover the new edit shapes and regressions.

Reviewed by Cursor Bugbot for commit a1c8b19.

@XciD XciD force-pushed the feat/file-chunk-hashes-and-compose branch 13 times, most recently from fea540f to 981ba18 Compare March 16, 2026 18:07
XciD added 8 commits March 17, 2026 10:33
Three new APIs to support range-aware writes:

1. Client::get_file_chunk_hashes(file_id) -> ChunkHashList
2. FileUploadSession::register_composed_file(file_info)
3. upload_ranges(config, cas_client, original_hash, original_size, dirty_ranges, dirty_source, total_size)

See https://gist.github.com/XciD/198f7e6cfd68f4a0f19c0c4a37c14b61 for a visual explanation.

Supporting changes:
- SingleFileCleaner::finish() now returns ChunkHashList (eliminates CAS round-trip)
- FileUploadSession::file_info_list() for checkpoint-based flow (single session)
- XorbObject::chunk_hash_sizes() helper
- ChunkHashList type alias in xet_core_structures::merklehash
- get_file_chunk_hashes implemented for MemoryClient/LocalClient
… hash

When two dirty regions generate identical content after CDC processing,
they will have the same Merkle hash. Previously, the mdb_by_hash HashMap
would silently drop the second region's MDBFileInfo due to hash collision,
causing the second region to incorrectly use the first region's segments.

The fix changes mdb_by_hash from HashMap<hash, MDBFileInfo> to
HashMap<hash, Vec<MDBFileInfo>>, allowing it to handle multiple regions
with the same hash. Regions are matched in order of upload, preserving
the expected composition.

Added regression test test_two_regions_identical_hash_collision that
verifies two dirty regions with identical content don't cause file
corruption.
- Unify duplicated chunk extraction logic in MemoryClient using Cow for borrowed/owned variants
- Use custom serde deserializer for type-safe MerkleHash deserialization in API responses
- Move hash validation from business logic to deserialization boundary
- Simplify response processing by eliminating manual hash parsing
Four bugs found via Codex review:

1. Truncation boundary leak: boundary suffix fed CAS bytes beyond
   total_size into the cleaner, producing wrong chunks and hash.
   Fix: cap boundary_end at total_size.

2. Identical hash panic: shard manager deduplicates MDBFileInfo by
   file_hash (BTreeMap), so two regions with the same content only
   produce one entry. The old remove(0) would panic on the second
   region. Fix: clone instead of remove (same hash = same segments).

3. Append hash mismatch: last original chunk (EOF-terminated) was
   reused verbatim instead of being re-chunked with appended data,
   producing a different hash than a clean upload. Fix: back up
   first_chunk to include the last original chunk, download its
   bytes from CAS via the boundary prefix mechanism.

4. Append gap loss: a dirty range starting after original_size
   (seek-past-EOF) left bytes [original_size, dirty_start) uncovered.
   Fix: pull dirty range start to original_size during append merge.

All tests now verify hash equality with a clean upload of the same
content. Added sparse staging file test to catch the boundary prefix
vs staging file data source issue.
The dedup_manager.finalize() now returns ChunkHashList as a second
element. Destructure it as _chunk_hashes to fix the WASM build.
- Move DirtyRegion, UploadedRegion, ComposedRegion to module level
- Make register_composed_file and file_info_list pub(crate)
- Add TODO for blocking I/O in async context
- Use pseudo-random data in all tests for reliable multi-chunk CDC
- Replace if-guard with assert! in hash collision test (512KB data)
- Clean up redundant comments
Replace the single buffered boundary download with two targeted
FileReconstructor streams (prefix + suffix) that feed directly into
the cleaner. This eliminates the Vec<u8> intermediate buffer and
the offset arithmetic for slicing prefix/suffix from it.

Each stream covers exactly the bytes needed (typically one CDC chunk),
and most cases only need one stream (prefix or suffix, not both).
Replace debug_assert! with real errors for dirty_ranges validation
(sorted, non-overlapping, non-empty intervals). Move the no-op early
return before validation so empty ranges skip the checks.

Add tests for overlapping, empty, and unsorted dirty ranges.
@XciD XciD force-pushed the feat/file-chunk-hashes-and-compose branch from 5be3af3 to 7eb1169 Compare March 17, 2026 09:33
- Extract helpers: build_dirty_regions, stream_cas_range, merge_or_push
- Replace silent unwrap_or/clamps with explicit debug_assert or runtime errors
- Add precondition validation (dirty_range > total_size)
- Add test: truncation on chunk boundary, no-op, dirty range past total_size,
  build_dirty_regions coalescing, inconsistent chunk data
- Remove duplicate tests (truncation/append hash-only variants)
- Improve comments: ASCII diagrams on tests, clearer doc on truncation/append
- Document SHA-256 limitation in upload_ranges doc
- Move test helpers to bottom of test module
@XciD XciD force-pushed the feat/file-chunk-hashes-and-compose branch from 7eb1169 to e0db8b9 Compare March 17, 2026 09:45
XciD added a commit that referenced this pull request Mar 17, 2026
…hashes-and-compose (#717)

Combined branch for hf-mount that needs both CAS client factory (PR 675)
and range upload/compose APIs (PR 717). Chunk cache integration from 675
is stubbed out (param accepted but unused) since the struct shapes diverged.
@XciD XciD marked this pull request as ready for review March 17, 2026 15:06
@XciD XciD requested review from hoytak, rajatarya and seanses March 17, 2026 15:06
@XciD XciD changed the title feat: add get_file_chunk_hashes and register_composed_file for range writes feat: range aware file write Mar 17, 2026
- Extract caller-input validation into `validate_dirty_ranges` helper, now
  reused by `upload_ranges` and `upload_fresh_file`.
- Replace `DataError::InternalError` with `DataError::ParameterError` for
  parameter-shape errors (sortedness, bounds, append coverage). The
  remaining `InternalError`s are for genuine internal failures
  (missing reconstruction info, MerkleHashSubtree::merge error, etc.).
- Drop the redundant `covered_up_to >= original_size` guard: `covered_up_to`
  starts at `original_size` and only grows via `.max(end)`, so it is always
  `>= original_size` inside the loop body.
- Use `ParameterError("file not found")` when reconstruction info is missing
  for the requested `original_hash`.
Comment thread xet_data/src/processing/range_upload.rs
… replace)

Switch `DirtyInput` from output-coordinate ranges (where the reader had to yield
exactly `range.end - range.start` bytes) to original-coordinate edits with an
explicit `new_length`. This expresses pure inserts, pure deletes, and arbitrary
in-place replaces in a single uniform shape.

API:
- `DirtyInput { original_range: Range<u64>, reader, new_length: u64 }`
- `upload_ranges(...)` no longer takes `total_size`; it derives it from the
  inputs.

Mapping legacy callers:
- in-place edit: `original_range: a..b, new_length: b - a`
- pure insert: `original_range: X..X, new_length: N`
- pure delete: `original_range: a..b, new_length: 0`
- append: `original_range: original_size..original_size, new_length: N`
- truncate to N: `original_range: N..original_size, new_length: 0`

Implementation:
- Snap each edit's `original_range` to enclosing segment boundaries before
  hitting the server. Pure inserts at `original_size` snap to the last segment;
  inserts elsewhere snap to the segment containing the position. Server windows
  come back equal to the snapped ranges.
- Per-window loop walks `&mut dirty_inputs[input_idx..edits_end]` once;
  `edits_end` is found via `take_while` with a closure that tracks the EOF-insert
  boundary. Single fold for `removed`/`added` totals; streaming buffer hoisted
  out of the inner loop.
- Hash composition (`MerkleHashSubtree::merge`) and segment-swap composition
  unchanged.
- `validate_dirty_ranges` now runs at the very top of `upload_ranges`, before
  the empty-original short-circuit. Fixes a real bug in the previously pushed
  code where `original_size == 0` bypassed validation entirely (Adrien spotted
  it on review): the empty-original path's gap-only check accepted overlapping /
  oversize ranges and would feed mismatched bytes to the cleaner.

Tests:
- 6 new resize tests (replace grow, replace shrink, mid-file insert, mid-file
  delete, multi-edit mix, insert at segment boundary).
- `test_resize_edits_abc` covers the three motivating one-liners
  (`abc -> foobc`, `abc -> fooabc`, `abc -> bc`).
- `test_empty_original_validates_ranges` locks the validation order against the
  bug Adrien reported.
- New `assert_edits` helper consumed by all the resize tests; obsolete
  `make_dummy_inputs` validation tests for empty-range / append-without-coverage
  removed (the new model accepts pure inserts and has no implicit coverage rule).

27/27 tests pass; CI clippy clean.
Comment thread xet_data/src/processing/range_upload.rs

@XciD (Member, Author) commented May 2, 2026

@lhoestq I've modified the API to meet your needs. We can't test against prod right now as the code is not deployed yet.

…h_chunks

The chunk hash list returned by SingleFileCleaner::finish is only consumed
by upload_ranges (to build partial MerkleHashSubtree nodes for newly-
uploaded windows). Every other caller threw it away as _chunks /
_chunk_hashes. Keep the 3-tuple variant under finish_with_chunks for the
one real consumer and let everyone else use the simpler 2-tuple finish.
Comment thread xet_data/src/processing/range_upload.rs Outdated
debug_assert_eq! is compiled out in release builds, so a malformed
get_file_chunk_hashes response (wrong number of windows or hash_ranges)
would silently truncate the merge sequence via .pop()/.zip() and produce
an incorrect composed file hash with no error to the caller.

Replace the three correctness-critical checks (server window count,
server hash_ranges count, segment-aligned windows, plus the file_size
sanity check on reconstruction info) with real Err returns so a
misbehaving server is detected in release rather than silently corrupting
the composed reconstruction.
Comment thread xet_data/src/processing/range_upload.rs
@XciD XciD requested review from rajatarya, seanses and sirahd May 2, 2026 14:07
Every dirty edit must land in exactly one returned window. If the server
ever returns narrower dirty_byte_ranges than requested, leftover edits
would silently drop and the composed file would be corrupt with no error.
Promote that invariant to a runtime error, matching the existing
response-shape checks above.

@rajatarya (Collaborator) left a comment

Re-reviewing this since the design has shifted substantially since my March approval (per-edit AsyncRead, resize-edit support, MerkleHashSubtree::merge composition, multi-range API adoption). The redesign is a clear improvement — particularly:

  • Per-edit AsyncRead + reader ownership: structurally fixes the prior truncation-corruption hazard (callers literally cannot supply zero bytes for a region not in their edit set).
  • Defensive contract checks on the server response (response.windows.len() != n_windows, hash_ranges.len() != n_windows + 1, plus the post-loop input_idx != dirty_inputs.len() check) — these turn what would otherwise be silent file-hash corruption into loud errors.
  • Validation runs before the empty-original short-circuit — the regression test test_empty_original_validates_ranges is exactly the right shape.
  • Test matrix is excellent. Round-trip + clean-upload-hash equivalence on every content-modifying test is the gold standard, and the codex regressions are captured (test_mid_edit_plus_append, test_empty_original_append, test_truncate_to_empty_matches_clean_empty).

My approval still stands; logging a few items here that came up on this pass:

  1. Orphan window file entries (still present from my March review). Each window's cleaner.finish_with_chunks() ultimately calls register_single_file_clean_completion → add_file_reconstruction_info(fi) (file_upload_session.rs:494), and then register_composed_file registers the composed MDB on top. So a single upload_ranges call writes N+1 file reconstruction entries to the session shard: one per window plus the composed one. Functionally fine — segments are reachable via the composed file — but the per-window entries are unreferenced and pollute the catalog, and v1 GC isn't designed to spot them. Worth confirming this is intentional for v1 and tracking the cleanup, or filing an issue if not yet logged.
  2. compute_total_size (range_upload.rs:437) does plain arithmetic on caller-supplied new_length. Since validation only checks original_range.end <= original_size (not totals), a malicious or buggy caller passing huge new_length values overflows original_size + added. A checked_add (+ ParameterError) would close that.
  3. nit: hash_ranges.first().is_some_and(Option::is_none) (range_upload.rs:285) reads obliquely — a single-line comment like // leading gap is empty ⇒ first window starts at byte 0 would help future readers parse the double-Option intent.
  4. nit: register_composed_file's lookup-by-hash path (collecting mdb_by_hash from session.file_info_list()) implicitly assumes window content-hashes are unique within the session. If two windows happened to produce identical content (unlikely but possible with synthetic data), the HashMap would dedupe them and the .get(&middle_hash) would still resolve, but the segment splice would reuse the same middle_mdb twice — which is actually correct since the content is identical. Worth a brief comment that this is intentional.

Nothing here changes my approve verdict — items 2-4 are small. Item 1 is the same flag from March; happy to proceed if the call is to ship and clean up later.

Comment thread xet_data/src/processing/range_upload.rs Outdated
Comment thread xet_data/src/processing/range_upload.rs Outdated
Comment thread xet_data/src/processing/range_upload.rs
Comment thread xet_data/src/processing/range_upload.rs Outdated
Comment thread xet_data/src/processing/range_upload.rs
@XciD XciD force-pushed the feat/file-chunk-hashes-and-compose branch 3 times, most recently from 44815c9 to 1ba9082 Compare May 6, 2026 08:48
upload_ranges was emitting composed shards without a verification section, which
cas-server now rejects with `MDBShard("Shard verification failure, missing
verification section")`. Root cause: the May 1 MerkleHashSubtree refactor
(a9d9cfe) dropped client-side computation of FileVerificationEntry for stable
original segments, expecting `original_mdb.verification` to be populated — but
GET /v1/reconstructions has always returned an empty verification list, so the
composed shard always went out without verification.

Fix: extend GET /v2/file-chunk-hashes/{file_id} to return one
FileVerificationEntry per stable original segment (segments wholly outside the
dirty windows), in segment order. range_upload pops these in lockstep with the
segment walk to populate the composed shard's verification section.

This is the minimum data needed: window segments are re-uploaded fresh so their
verification comes from the per-window MDB; only stable segments need the
server's help. ~32 bytes per stable segment, much cheaper than the legacy
"return the whole chunk hash list" endpoint that was replaced by the multi-range
API in 78de688.

Backwards-compat: when the original file has no verification entries
(legacy / test files registered before verification was introduced), the
server returns an empty `gap_verification` and the composed shard is emitted
with `with_verification=false` as before.

Pairs with a xetcas-side change to populate `gap_verification` in the response.
Sim clients (local_client, memory_client) populate it locally via
build_file_chunk_hashes_response so the existing 27 range_upload tests pass
unchanged.
@XciD XciD force-pushed the feat/file-chunk-hashes-and-compose branch from 1ba9082 to 45f5937 Compare May 6, 2026 10:07
XciD added 7 commits May 6, 2026 18:17
The CAS server rejects any shard without a verification section. When
the entire file falls within dirty windows (no stable segments),
gap_verification is empty and original_mdb.verification is stripped,
causing original_has_verification to be false and the composed shard
to omit verification. Since the cleaner always produces verification
entries for its segments, unconditionally set has_verification=true.
- Use checked arithmetic in compute_total_size to prevent overflow
- Add clarifying comments on hash_ranges double-Option intent
- Add comment noting mdb_by_hash dedup-by-content is intentional
Add finish_with_chunks_detached() to SingleFileCleaner that uploads
xorb data but returns the MDBFileInfo directly instead of registering
it in the session shard. upload_ranges now uses the MDBFileInfo from
each window directly for composition, so only the final composed file
gets registered. This eliminates N unreferenced per-window entries
that previously polluted the shard.
…ation

The server always returns this field; no backward compatibility needed.
… improve error msg

- Extract MDB assembly (segment splicing + verification) into compose_mdb()
- Remove hardcoded original_has_verification=true and its dead branches
- Include file hash in "file not found" error for easier debugging
…se error

nightly-2026-05-06 broke __heap_base export needed by wasm-bindgen for
threading support. Pin to the last known-good nightly and remove
+nightly from build_wasm.sh so the CI-controlled toolchain is used.
@XciD XciD requested a review from rajatarya May 7, 2026 09:06

@rajatarya (Collaborator) left a comment

Re-review (post xetcas#987 merge)

This PR has come a long way since my approval on March 27. I went back and re-read the whole thing end-to-end against the new wire shape, and the rework is genuinely good — what shipped is cleaner than what I approved.

What's new since last pass

  1. Wire shape pivot. Composition is now driven by windows[] + hash_ranges[] of opaque MerkleHashSubtree summaries (xetcas#987), not flat per-chunk hashes. Per-chunk hashes never cross the wire, so the response stays O(windows + gaps) regardless of file size. Final hash is rebuilt with MerkleHashSubtree::merge — no ChunkHashList-glue layer, no duplicated tree-folding logic on the client.
  2. Resize edits. DirtyInput { original_range, reader, new_length } collapses in-place edit / replace / pure-insert / pure-delete / append / truncate into a single shape. The original_range.len() != new_length branch falls out cleanly thanks to per-edit readers (no ReadSeek over a sparse staging file — that bug class is structurally gone).
  3. My GC concern is addressed at the client layer. The new register_single_file_clean_completion_detached + register_composed_file split means window-uploads no longer pollute the session shard with orphan MDBFileInfo entries — only the final composed file is registered. Combined with sirahd's GC tracking issue (huggingface-internal/xet-garbage-collection#2), the design now does the right thing: window xorbs are findable through the composed file, and there are no dangling shard references.
  4. Defensive runtime invariants (commit 87b9061): server-contract violations (windows.len() != n_windows, hash_ranges.len() != n_windows + 1, leftover edits not assigned to any window, gap_verification length mismatch with stable-segment count) are now hard runtime errors instead of silent corruption. Exactly the right call for a composition algorithm where zip would otherwise drop windows quietly.

What upload_ranges offers, visually

End-to-end flow:

sequenceDiagram
    autonumber
    participant C as Caller (e.g. hf-mount)
    participant U as upload_ranges
    participant S as CAS server
    participant D as Cleaner / dedup

    C->>U: dirty_inputs[]<br/>(original_range, reader, new_length)
    Note over U: validate + coalesce<br/>snap to segment boundaries
    U->>S: GET /v2/file-chunk-hashes/{id}<br/>X-Range-Dirty: bytes=A-B,C-D
    S-->>U: windows[N] + hash_ranges[N+1]<br/>+ gap_verification[stable_segs]
    loop one cleaner per window
        U->>S: stream CAS prefix (boundary chunks)
        U->>D: feed prefix bytes
        U->>D: feed reader bytes (the edit)
        U->>S: stream CAS suffix
        U->>D: feed suffix bytes
        D-->>U: chunks + new MDBFileInfo (detached)
    end
    Note over U: merge_seq = [gap0, w0, gap1, ..., gapN]<br/>combined_hash = merge(seq).hmac(0)
    U->>S: register_composed_file(MDBFileInfo)
    U->>S: finalize session
    U-->>C: XetFileInfo { hash, size }

Why this is a big deal — the wire savings:

flowchart LR
    subgraph old ["Old: full re-upload"]
        OF["200MB original"] --> OU["download 200MB"]
        OU --> OE["edit 1MB"]
        OE --> OUP["upload 200MB"]
    end
    subgraph new ["upload_ranges: 2MB edit"]
        NF["200MB original"] --> NS["snap edits to<br/>segment boundaries"]
        NS --> NW["server returns<br/>2 windows + 3 gap subtrees"]
        NW --> NB["download ~few KB<br/>boundary chunks"]
        NB --> NU["upload ~2MB<br/>(2 dirty windows)"]
    end
    style old fill:#ffe5e5
    style new fill:#e5ffe5

Hash composition without re-reading the file:

flowchart TB
    F["Original file: chunks 0..M"] --> G0["gap0 subtree<br/>(server-stored)"] & G1["gap1 subtree<br/>(server-stored)"] & G2["gap2 subtree<br/>(server-stored)"]
    R["DirtyInput readers"] --> CL1["cleaner: window 0"] --> W0["w0 subtree<br/>(client-computed)"]
    R --> CL2["cleaner: window 1"] --> W1["w1 subtree<br/>(client-computed)"]
    G0 --> M{{"MerkleHashSubtree::merge<br/>[gap0, w0, gap1, w1, gap2]"}}
    W0 --> M
    G1 --> M
    W1 --> M
    G2 --> M
    M --> H["final_hash().hmac(0)<br/>== file_hash of edited file"]

Edit-shape coverage (one struct, six operations):

flowchart LR
    DI["DirtyInput<br/>{ original_range: a..b, new_length: n }"] --> Q1{"a == b?"}
    Q1 -->|yes| Q2{"n > 0?"}
    Q1 -->|no| Q3{"n == 0?"}
    Q2 -->|yes| INS["pure insert at a"]
    Q2 -->|no| NOOP["degenerate no-op"]
    Q3 -->|yes| DEL["pure delete a..b"]
    Q3 -->|no| Q4{"b - a == n?"}
    Q4 -->|yes| IPE["in-place edit"]
    Q4 -->|no| RES["resize replace"]
    INS -.->|special: a == original_size| APP["append"]
    DEL -.->|special: b == original_size| TRU["truncate to a"]

Items to think about (none blocking)

  1. chunk_window_builder.rs duplication. PR description flags this as intentional and temporary, mirroring xetcas#987 server logic so simulation clients answer get_file_chunk_hashes without HTTP. Worth filing a follow-up issue (or noting #717 in xetcas) to track the eventual dedup so the duplication doesn't quietly stick around as duplications always do without a forcing function.
  2. Empty gap_verification on partial-verification shards. When file_info.verification.len() != file_info.segments.len(), the simulation builder silently emits an empty gap_verification, and then compose_mdb errors out at runtime with a less-targeted message. Inline below — either error up-front or document the silent path.
  3. One readability nit + one defensive question. Inline.

Nothing here blocks the merge — most of these are conversation starters. Approving again on the strength of the current state. Nice work.

response.hash_ranges.len(),
n_windows + 1
)));
}

@rajatarya commented:

These guards (windows count, hash_ranges count at L182, edits-not-assigned-to-window at L269, and the gap_verification length checks in compose_mdb) are exactly the right thing to do. A subtle composition algorithm where zip silently truncates the merge sequence is precisely where you want loud bails over quiet corruption. The error messages even include expected vs actual counts, which makes server contract drift trivial to diagnose.

.take_while(|d| {
let r = &d.original_range;
if r.start == r.end {
r.start < w_end || (r.start == w_end && w_end == original_size)

@rajatarya commented:

The boundary case for pure inserts at exactly w_end is genuinely tricky — an insert at byte w_end belongs to the next window unless w_end == original_size (in which case there is no next window). The comment at L201-203 captures the why. Worth referencing test_resize_insert_at_segment_boundary and test_mid_edit_plus_append here so a future reader can find the exercising tests quickly. nit.

stream_cas_range(&ctx, &cas_client, original_hash, cursor, w_end, &mut cleaner).await?;
}

let (_info, chunks, mdb, _metrics) = cleaner.finish_with_chunks_detached().await?;

@rajatarya commented:

The _detached split is a real improvement — this resolves the GC concern I had on the prior pass. Window uploads now contribute their xorbs without registering an MDBFileInfo in the session shard, so only the final composed file shows up there. Combined with sirahd's tracking issue, the data lifecycle here is much cleaner than what I originally approved.

let mut hash_ranges = response.hash_ranges;
let trailing_gap = hash_ranges.pop().flatten();
// Leading gap is empty (None) => first window starts at byte 0.
let first_window_at_start = hash_ranges.first().is_some_and(Option::is_none);

@rajatarya commented:

nit: hash_ranges.first().is_some_and(Option::is_none) is concise but a bit cryptic — the reader has to remember Vec<Option<T>>::first() returns Option<&Option<T>> and reason about both layers. matches!(hash_ranges.first(), Some(None)) reads more directly. Same below for last_window_at_end if you want symmetry.

gap_idx += 1;
seg_idx += 1;
}
let original_window_end = w.end.min(original_size);

@rajatarya commented:

Defensive question, not a fix: under the current server contract, can w.end ever exceed original_size? The server returns chunk-aligned ranges within the original file, and the snap logic on the client also caps at original_size. If w.end > original_size is structurally impossible, this min is dead defense and a debug_assert!(w.end <= original_size) would document the invariant more loudly. If it can happen (e.g. some forward-compat shape), worth a one-line comment about which case justifies it.


// Emit one range hash per stable segment (= no overlap with any window).
// Segments and windows are both monotonic, so a two-pointer walk is O(S+W).
let gap_verification = if file_info.verification.len() == file_info.segments.len() {

@rajatarya commented:

The branch at L172 silently emits an empty gap_verification when file_info.verification.len() != file_info.segments.len(). Defensible for legacy shards that genuinely have no verification entries — but the failure mode for a partially-populated file_info would be a confusing runtime error in compose_mdb ("ran out of gap_verification entries at stable segment N") rather than something pointed at the actual cause. Two reasonable options:

  • Tighten: emit empty only if file_info.verification.is_empty(), otherwise return ClientError::Other(...) here with a clearer message.
  • Or document this branch's contract explicitly so the runtime error in compose_mdb is interpretable.

Not blocking — calling it out since the simulation-only path is what tests run against.

@@ -0,0 +1,205 @@
//! Server-side state machine for `GET /v2/file-chunk-hashes/{file_id}` (mirrored from
//! xetcas PR #987) plus a small driver helper used by simulation clients to produce a
//! [`FileChunkHashesResponse`] without routing through HTTP.

@rajatarya commented:

The doc comment is honest about the duplication-with-xetcas#987 situation, which I appreciate. Worth filing a follow-up issue to track the eventual dedup (either xetcas pulling this in, or this getting moved to a shared crate) so it doesn't quietly become a permanent fork. Duplications like this almost always do without a forcing function.

/// Distinct from the standard `Range` header (which scopes the response body): this header tags
/// regions that the client intends to re-chunk, and the response covers the whole file (windows +
/// gap subtrees). Value uses the same `bytes=A-B,C-D` syntax as `Range`.
pub const X_RANGE_DIRTY_HEADER: &str = "x-range-dirty";

@rajatarya commented:

nit: "x-range-dirty" lowercase is unusual when most header constants in this crate (e.g. SESSION_ID_HEADER at L17 = "X-Xet-Session-Id", REQUEST_ID_HEADER at L19 = "X-Request-Id") use the X-Foo-Bar capitalized form. HTTP headers are case-insensitive on the wire so this is purely cosmetic, but matching the surrounding style would be a small consistency win.
