feat: range aware file write #717

Open
XciD wants to merge 36 commits into main from feat/file-chunk-hashes-and-compose

Conversation

@XciD (Member) commented Mar 16, 2026

Summary

APIs for range-aware file writes: instead of re-uploading an entire file when only part of it changed, compose a new CAS file from stable segments + re-chunked dirty windows. Supports resize edits (insert / delete / arbitrary replace) in addition to in-place rewrites.

API: upload_ranges

pub async fn upload_ranges(
    config: Arc<TranslatorConfig>,
    cas_client: Arc<dyn Client>,
    original_hash: MerkleHash,
    original_size: u64,
    dirty_inputs: Vec<DirtyInput>,
) -> Result<XetFileInfo>
/// A single edit applied to the original file: replace `original_range` with
/// `new_length` bytes from `reader`. Edits are expressed in original-file coordinates.
pub struct DirtyInput {
    pub original_range: Range<u64>,
    pub reader: Pin<Box<dyn AsyncRead + Send>>,
    pub new_length: u64,
}

The output file size is derived from the inputs (no total_size parameter): original_size - removed + added.
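
For illustration, a minimal sketch of that derivation (the PR's real helper is compute_total_size; the name and signature below are illustrative, using checked arithmetic so absurd new_length values surface as an error instead of overflowing):

fn derived_total_size(original_size: u64, edits: &[DirtyInput]) -> Option<u64> {
    let mut size = original_size;
    for e in edits {
        let removed = e.original_range.end - e.original_range.start;
        size = size.checked_sub(removed)?;      // drop the replaced original span
        size = size.checked_add(e.new_length)?; // add the caller's new bytes
    }
    Some(size)
}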

Edit shapes (all expressible with the same struct)

Operation        original_range                  new_length
In-place edit    a..b                            b - a
Resize replace   a..b                            any
Pure insert      p..p                            > 0
Pure delete      a..b                            0
Append           original_size..original_size    > 0
Truncate to N    N..original_size                0
No-op            (empty dirty_inputs)            -

Motivating example:

abc + upload_ranges([0..1), "foo", 3) = foobc
abc + upload_ranges([0..0), "foo", 3) = fooabc
abc + upload_ranges([0..1), "",    0) = bc

Per-range AsyncRead instead of ReadSeek over the staging file

The earlier prototype took dirty_ranges: &[(u64, u64)] + dirty_source: &mut dyn ReadSeek. That had a subtle bug: for truncation we silently extended the dirty set with a boundary chunk and read those bytes from the staging file, but if the file was never opened for write the staging file contains zeros at those positions (the real bytes are in CAS) → silent corruption on the truncation boundary chunk. Pairing each edit with its own reader makes that structurally impossible: any byte not provided by the caller is fetched from CAS.

How it works

High level

                           upload_ranges
   +----------------------+   |   +----------------------+
   |  original file (CAS) |---+-->|  composed file (CAS) |
   +----------------------+       +----------------------+
   only the dirty windows are re-uploaded; everything else
   is reused as whole CAS segments.

Step 1 — coalesce + snap edits to segment boundaries

Edits arrive in original-file coordinates (byte ranges). We snap each edit's original_range to the enclosing CAS segments so composition can swap whole segments instead of truncating one mid-chunk. Adjacent or overlapping snapped ranges are then coalesced.

Pure inserts (start == end) snap to the segment that owns start; an insert at original_size snaps to the last segment.
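
Roughly, the snapping behaves like this sketch (illustrative only: the real code works on CAS segment metadata, not a flat offset list, and coalescing happens in a separate step):

// Sketch: seg_starts = sorted segment start offsets, beginning with 0.
fn snap_to_segments(r: std::ops::Range<u64>, seg_starts: &[u64], file_size: u64) -> std::ops::Range<u64> {
    // Snapped start: start of the segment owning r.start (an EOF insert clamps to the last segment).
    let pos = r.start.min(file_size.saturating_sub(1));
    let start = seg_starts.iter().copied().filter(|&s| s <= pos).last().unwrap_or(0);
    // Snapped end: first segment start at or past r.end that also advances past `start`, else EOF.
    let end = seg_starts
        .iter()
        .copied()
        .find(|&s| s >= r.end && s > start)
        .unwrap_or(file_size);
    start..end
}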

Step 2 — server returns windows + gap subtrees

Single CAS call: GET /v2/file-chunk-hashes/{file_id} with the segment-aligned ranges in an X-Range-Dirty: bytes=A-B,C-D header. Response shape (xetcas#987):

struct FileChunkHashesResponse {
    windows:      Vec<ChunkWindow>,         // one per dirty range
    hash_ranges:  Vec<Option<MerkleHashSubtree>>, // N+1 entries: [gap0, gap1, ..., gapN]
}

windows[i].chunks carries the chunk hashes the server actually owns for that window (we re-upload these bytes). hash_ranges[i] is the MerkleHashSubtree for the i-th unmodified gap, or None when there is no gap there. This is the key to composing the final file hash without touching unmodified bytes.
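
Shape illustration for a two-range request (not literal syntax; windows and gaps named only for reference):

GET /v2/file-chunk-hashes/{file_id}
X-Range-Dirty: bytes=A-B,C-D

windows     = [ w0, w1 ]              // chunk hashes we re-chunk + re-upload
hash_ranges = [ gap0, gap1, gap2 ]    // subtrees for [0, w0.start), [w0.end, w1.start), [w1.end, EOF)
                                      // each entry is Some(subtree), or None when that gap is empty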

Step 3 — for each window, stream [CAS prefix | edits | CAS suffix] through a fresh cleaner

window = [w_start ............................................. w_end]
edits in this window:        [edit_a]    [edit_b]
                                ^           ^
streamed input to the cleaner:
  CAS bytes [w_start, edit_a.start)
  reader bytes for edit_a (new_length bytes)
  CAS bytes [edit_a.end, edit_b.start)
  reader bytes for edit_b
  CAS bytes [edit_b.end, w_end)

Pure inserts contribute zero original bytes but still emit new_length reader bytes. Pure deletes contribute zero reader bytes. The cleaner produces a new MDBFileInfo per window and a ChunkHashList.
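
A self-contained model of that interleave (the real code streams CAS ranges and per-edit AsyncReads straight into the cleaner instead of materializing a Vec; edits are assumed sorted and contained in the window):

// compose_window(b"hello world", 0..11, &[(6..11, b"rust!")]) == b"hello rust!"
fn compose_window(original: &[u8], w: std::ops::Range<usize>, edits: &[(std::ops::Range<usize>, &[u8])]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut cursor = w.start;
    for (r, new_bytes) in edits {
        out.extend_from_slice(&original[cursor..r.start]); // CAS bytes before this edit
        out.extend_from_slice(new_bytes);                   // reader bytes (new_length of them)
        cursor = r.end;                                      // skip the replaced original bytes
    }
    out.extend_from_slice(&original[cursor..w.end]);         // CAS suffix up to the window end
    out
}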

Step 4 — compose the file hash via MerkleHashSubtree::merge

merge_seq = [gap0, w0, gap1, w1, ..., wN, gapN]   // skip None gaps

merged          = MerkleHashSubtree::merge(merge_seq)
aggregated_hash = merged.final_hash()
combined_hash   = aggregated_hash.hmac(zero)      // matches cleaner's file_hash

Special case: if total_size == 0 (e.g. truncate to empty) the result is MerkleHash::default() without HMAC, mirroring file_hash([]).
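
Sketched in Rust (method names follow the pseudo-code above; the exact signatures, error type, and the "zero" key type passed to hmac are assumptions, not this PR's code):

use anyhow::Result;

fn compose_file_hash(
    gaps: Vec<Option<MerkleHashSubtree>>, // N+1 entries [gap0 .. gapN], None = empty gap
    windows: Vec<MerkleHashSubtree>,      // N entries, computed client-side per window
    total_size: u64,
) -> Result<MerkleHash> {
    // Truncate-to-empty mirrors file_hash([]): default hash, no HMAC.
    if total_size == 0 {
        return Ok(MerkleHash::default());
    }
    // Interleave [gap0, w0, gap1, w1, ..., wN, gapN], skipping None gaps.
    let mut gaps = gaps.into_iter();
    let mut merge_seq = Vec::new();
    for w in windows {
        merge_seq.extend(gaps.next().flatten());
        merge_seq.push(w);
    }
    merge_seq.extend(gaps.next().flatten());

    let merged = MerkleHashSubtree::merge(merge_seq)?;
    Ok(merged.final_hash().hmac(MerkleHash::default() /* the "zero" key */))
}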

Step 5 — splice segments + register

Walk the original MDBFileInfo.segments and replace any segment that falls inside a window with that window's freshly-uploaded segments. Verification entries follow segment-for-segment when present. metadata_ext = None (no SHA-256, see Limitations). Then register_composed_file + finalize.

Multi-window example

Two edits: replace [50MB, 51MB) and [150MB, 151MB) on a 200MB file:

+-----------+-------+------------+-------+-----------+
|  GAP 0    |  W0   |   GAP 1    |  W1   |  GAP 2    |
|  reused   |upload |  reused    |upload |  reused   |
| (subtree) | ~1MB  | (subtree)  | ~1MB  | (subtree) |
+-----------+-------+------------+-------+-----------+

Wire transfer: ~2MB upload + a few hundred KB of CAS reads for window
boundary chunks. Old approach: 200MB download + 200MB upload.

Empty original short-circuit

When original_size == 0 there is nothing to compose against — every edit's original_range must be 0..0 (validated). We just stream the new bytes through a fresh cleaner (upload_fresh_file).

Reviewer note: chunk_window_builder is a re-implementation of xetcas

xet_client/src/cas_client/chunk_window_builder.rs is a port of the same window-building state machine that already lives in xetcas — it's only used by the local / in-memory simulation clients (local_client, memory_client) so the mock CAS server returns the same shape as the real one in tests. No need to re-review it as part of this PR: it mirrors logic already reviewed and merged in xetcas#987. A follow-up xetcas PR will deduplicate by removing the server-side copy and pulling this one in (or vice versa); the duplication is intentional and temporary.

Limitations

  • No SHA-256 metadata: composed files have metadata_ext = None since recomputing SHA-256 would require reading the full file. Only suitable for contexts that don't require SHA-256 verification (HF buckets, xet-native repos), not for Git LFS-backed repos.
  • Memory: for very large files, the per-window in-memory state (chunk hash list + composed segments) is bounded by the dirty regions, not the whole file. The chunk-hashes response is paginated by the server-defined window granularity.

Tests (27)

Covering all edit shapes + edge cases. Notable:

Test Purpose
test_resize_edits_abc The 3 motivating FUSE examples
test_resize_large_replace_grows_file Replace [a..b) with much more data
test_resize_large_replace_shrinks_file Replace [a..b) with much less data
test_resize_mid_file_insert Pure insert in the middle
test_resize_mid_file_delete Pure delete in the middle
test_resize_multi_edit_mix Insert + replace + delete in one call
test_resize_insert_at_segment_boundary Snapping correctness for inserts
test_upload_ranges_mid_file_edit In-place edit
test_upload_ranges_truncation Pure truncate (sub-segment)
test_upload_ranges_truncation_empty_staging Truncate when staging is all-zero (boundary read from CAS)
test_upload_ranges_truncation_with_overlapping_dirty Truncate + dirty range overlapping the boundary
test_truncate_to_empty_matches_clean_empty Truncating to 0 hashes to MerkleHash::default() (matches a fresh empty cleaner)
test_upload_ranges_append Pure append
test_append_with_gap_before_dirty_range Append where reader covers a sparse gap too
test_append_sparse_staging_file Append on a sparse staging file
test_mid_edit_plus_append Mid-file edit and append in one call (P1 codex regression)
test_empty_original_append original_size == 0 + append falls into the fresh-file path (P2 codex regression)
test_empty_original_validates_ranges original_size == 0 still runs validation (reviewer regression)
test_upload_ranges_at_file_start Edit at offset 0 (no stable prefix)
test_upload_ranges_multiple_regions Two non-adjacent dirty windows with stable gap
test_single_input_spanning_many_chunks One edit covering many CDC chunks
test_data_integrity_scenarios 5 sub-scenarios covering composition correctness
test_noop_returns_original_hash Empty dirty_inputs → no CAS call, original hash returned
test_rejects_dirty_range_past_total_size Validation: range past original_size
test_rejects_overlapping_dirty_ranges Validation: overlapping edits
test_rejects_unsorted_dirty_ranges Validation: unsorted edits
test_upload_ranges_small_file_mid_edit Small files (single segment)

Dependencies


Note

Medium Risk
Adds a new partial-upload composition path (upload_ranges) that changes how file hashes and reconstruction plans are computed and registered, which can impact data integrity if windowing/merge invariants are wrong. Scope is contained to new APIs and plumbing but touches core dedup/finalization return shapes.

Overview
Adds a new range-aware upload API, upload_ranges, which composes an updated file by re-uploading only chunk-aligned “dirty windows” while reusing stable CAS segments for the rest of the file (including resize edits like insert/delete/replace).

Introduces a new CAS client endpoint wrapper get_file_chunk_hashes (HTTP GET /v2/file-chunk-hashes/{file_id} using X-Range-Dirty) and new response types (FileChunkHashesResponse, ChunkWindow, X_RANGE_DIRTY_HEADER) to return dirty windows plus gap MerkleHashSubtree summaries (and segment verification hashes) needed to recompute the final file hash without transferring per-chunk hashes.

Extends the ingestion/dedup pipeline to support composition: FileDeduper::finalize now returns the per-file ChunkHashList, SingleFileCleaner gains finish_with_chunks/finish_with_chunks_detached, and FileUploadSession adds detached completion + register_composed_file so window uploads don’t create orphan shard entries before the final composed MDBFileInfo is registered. Simulation clients implement get_file_chunk_hashes locally via the new chunk_window_builder helper, and tests are added/updated to cover the new edit shapes and regressions.

Reviewed by Cursor Bugbot for commit a1c8b19.

@XciD XciD force-pushed the feat/file-chunk-hashes-and-compose branch 13 times, most recently from fea540f to 981ba18 Compare March 16, 2026 18:07
XciD added 8 commits March 17, 2026 10:33
Three new APIs to support range-aware writes:

1. Client::get_file_chunk_hashes(file_id) -> ChunkHashList
2. FileUploadSession::register_composed_file(file_info)
3. upload_ranges(config, cas_client, original_hash, original_size, dirty_ranges, dirty_source, total_size)

See https://gist.github.com/XciD/198f7e6cfd68f4a0f19c0c4a37c14b61 for a visual explanation.

Supporting changes:
- SingleFileCleaner::finish() now returns ChunkHashList (eliminates CAS round-trip)
- FileUploadSession::file_info_list() for checkpoint-based flow (single session)
- XorbObject::chunk_hash_sizes() helper
- ChunkHashList type alias in xet_core_structures::merklehash
- get_file_chunk_hashes implemented for MemoryClient/LocalClient
… hash

When two dirty regions generate identical content after CDC processing,
they will have the same Merkle hash. Previously, the mdb_by_hash HashMap
would silently drop the second region's MDBFileInfo due to hash collision,
causing the second region to incorrectly use the first region's segments.

The fix changes mdb_by_hash from HashMap<hash, MDBFileInfo> to
HashMap<hash, Vec<MDBFileInfo>>, allowing it to handle multiple regions
with the same hash. Regions are matched in order of upload, preserving
the expected composition.

Added regression test test_two_regions_identical_hash_collision that
verifies two dirty regions with identical content don't cause file
corruption.
- Unify duplicated chunk extraction logic in MemoryClient using Cow for borrowed/owned variants
- Use custom serde deserializer for type-safe MerkleHash deserialization in API responses
- Move hash validation from business logic to deserialization boundary
- Simplify response processing by eliminating manual hash parsing
Four bugs found via Codex review:

1. Truncation boundary leak: boundary suffix fed CAS bytes beyond
   total_size into the cleaner, producing wrong chunks and hash.
   Fix: cap boundary_end at total_size.

2. Identical hash panic: shard manager deduplicates MDBFileInfo by
   file_hash (BTreeMap), so two regions with the same content only
   produce one entry. The old remove(0) would panic on the second
   region. Fix: clone instead of remove (same hash = same segments).

3. Append hash mismatch: last original chunk (EOF-terminated) was
   reused verbatim instead of being re-chunked with appended data,
   producing a different hash than a clean upload. Fix: back up
   first_chunk to include the last original chunk, download its
   bytes from CAS via the boundary prefix mechanism.

4. Append gap loss: a dirty range starting after original_size
   (seek-past-EOF) left bytes [original_size, dirty_start) uncovered.
   Fix: pull dirty range start to original_size during append merge.

All tests now verify hash equality with a clean upload of the same
content. Added sparse staging file test to catch the boundary prefix
vs staging file data source issue.
The dedup_manager.finalize() now returns ChunkHashList as a second
element. Destructure it as _chunk_hashes to fix the WASM build.
- Move DirtyRegion, UploadedRegion, ComposedRegion to module level
- Make register_composed_file and file_info_list pub(crate)
- Add TODO for blocking I/O in async context
- Use pseudo-random data in all tests for reliable multi-chunk CDC
- Replace if-guard with assert! in hash collision test (512KB data)
- Clean up redundant comments
Replace the single buffered boundary download with two targeted
FileReconstructor streams (prefix + suffix) that feed directly into
the cleaner. This eliminates the Vec<u8> intermediate buffer and
the offset arithmetic for slicing prefix/suffix from it.

Each stream covers exactly the bytes needed (typically one CDC chunk),
and most cases only need one stream (prefix or suffix, not both).
Replace debug_assert! with real errors for dirty_ranges validation
(sorted, non-overlapping, non-empty intervals). Move the no-op early
return before validation so empty ranges skip the checks.

Add tests for overlapping, empty, and unsorted dirty ranges.
@XciD XciD force-pushed the feat/file-chunk-hashes-and-compose branch from 5be3af3 to 7eb1169 Compare March 17, 2026 09:33
- Extract helpers: build_dirty_regions, stream_cas_range, merge_or_push
- Replace silent unwrap_or/clamps with explicit debug_assert or runtime errors
- Add precondition validation (dirty_range > total_size)
- Add test: truncation on chunk boundary, no-op, dirty range past total_size,
  build_dirty_regions coalescing, inconsistent chunk data
- Remove duplicate tests (truncation/append hash-only variants)
- Improve comments: ASCII diagrams on tests, clearer doc on truncation/append
- Document SHA-256 limitation in upload_ranges doc
- Move test helpers to bottom of test module
@XciD XciD force-pushed the feat/file-chunk-hashes-and-compose branch from 7eb1169 to e0db8b9 Compare March 17, 2026 09:45
XciD added a commit that referenced this pull request Mar 17, 2026
…hashes-and-compose (#717)

Combined branch for hf-mount that needs both CAS client factory (PR 675)
and range upload/compose APIs (PR 717). Chunk cache integration from 675
is stubbed out (param accepted but unused) since the struct shapes diverged.
@XciD XciD marked this pull request as ready for review March 17, 2026 15:06
@XciD XciD requested review from hoytak, rajatarya and seanses March 17, 2026 15:06
@XciD XciD changed the title feat: add get_file_chunk_hashes and register_composed_file for range writes feat: range aware file write Mar 17, 2026
- Extract caller-input validation into `validate_dirty_ranges` helper, now
  reused by `upload_ranges` and `upload_fresh_file`.
- Replace `DataError::InternalError` with `DataError::ParameterError` for
  parameter-shape errors (sortedness, bounds, append coverage). The
  remaining `InternalError`s are for genuine internal failures
  (missing reconstruction info, MerkleHashSubtree::merge error, etc.).
- Drop the redundant `covered_up_to >= original_size` guard: `covered_up_to`
  starts at `original_size` and only grows via `.max(end)`, so it is always
  `>= original_size` inside the loop body.
- Use `ParameterError("file not found")` when reconstruction info is missing
  for the requested `original_hash`.
Comment thread xet_data/src/processing/range_upload.rs
… replace)

Switch `DirtyInput` from output-coordinate ranges (where the reader had to yield
exactly `range.end - range.start` bytes) to original-coordinate edits with an
explicit `new_length`. This expresses pure inserts, pure deletes, and arbitrary
in-place replaces in a single uniform shape.

API:
- `DirtyInput { original_range: Range<u64>, reader, new_length: u64 }`
- `upload_ranges(...)` no longer takes `total_size`; it derives it from the
  inputs.

Mapping legacy callers:
- in-place edit: `original_range: a..b, new_length: b - a`
- pure insert: `original_range: X..X, new_length: N`
- pure delete: `original_range: a..b, new_length: 0`
- append: `original_range: original_size..original_size, new_length: N`
- truncate to N: `original_range: N..original_size, new_length: 0`

Implementation:
- Snap each edit's `original_range` to enclosing segment boundaries before
  hitting the server. Pure inserts at `original_size` snap to the last segment;
  inserts elsewhere snap to the segment containing the position. Server windows
  come back equal to the snapped ranges.
- Per-window loop walks `&mut dirty_inputs[input_idx..edits_end]` once;
  `edits_end` is found via `take_while` with a closure that tracks the EOF-insert
  boundary. Single fold for `removed`/`added` totals; streaming buffer hoisted
  out of the inner loop.
- Hash composition (`MerkleHashSubtree::merge`) and segment-swap composition
  unchanged.
- `validate_dirty_ranges` now runs at the very top of `upload_ranges`, before
  the empty-original short-circuit. Fixes a real bug in the previously pushed
  code where `original_size == 0` bypassed validation entirely (Adrien spotted
  it on review): the empty-original path's gap-only check accepted overlapping /
  oversize ranges and would feed mismatched bytes to the cleaner.

Tests:
- 6 new resize tests (replace grow, replace shrink, mid-file insert, mid-file
  delete, multi-edit mix, insert at segment boundary).
- `test_resize_edits_abc` covers the three motivating one-liners
  (`abc -> foobc`, `abc -> fooabc`, `abc -> bc`).
- `test_empty_original_validates_ranges` locks the validation order against the
  bug Adrien reported.
- New `assert_edits` helper consumed by all the resize tests; obsolete
  `make_dummy_inputs` validation tests for empty-range / append-without-coverage
  removed (the new model accepts pure inserts and has no implicit coverage rule).

27/27 tests pass; CI clippy clean.
Comment thread xet_data/src/processing/range_upload.rs

@XciD (Member, Author) commented May 2, 2026

@lhoestq I've modified the API to meet your needs. We can't test against prod right now as the code is not deployed yet.

…h_chunks

The chunk hash list returned by SingleFileCleaner::finish is only consumed
by upload_ranges (to build partial MerkleHashSubtree nodes for newly-
uploaded windows). Every other caller threw it away as _chunks /
_chunk_hashes. Keep the 3-tuple variant under finish_with_chunks for the
one real consumer and let everyone else use the simpler 2-tuple finish.
Comment thread xet_data/src/processing/range_upload.rs Outdated
debug_assert_eq! is compiled out in release builds, so a malformed
get_file_chunk_hashes response (wrong number of windows or hash_ranges)
would silently truncate the merge sequence via .pop()/.zip() and produce
an incorrect composed file hash with no error to the caller.

Replace the three correctness-critical checks (server window count,
server hash_ranges count, segment-aligned windows, plus the file_size
sanity check on reconstruction info) with real Err returns so a
misbehaving server is detected in release rather than silently corrupting
the composed reconstruction.
Comment thread xet_data/src/processing/range_upload.rs
@XciD XciD requested review from rajatarya, seanses and sirahd May 2, 2026 14:07
Every dirty edit must land in exactly one returned window. If the server
ever returns narrower dirty_byte_ranges than requested, leftover edits
would silently drop and the composed file would be corrupt with no error.
Promote that invariant to a runtime error, matching the existing
response-shape checks above.

@rajatarya (Collaborator) left a comment

Re-reviewing this since the design has shifted substantially since my March approval (per-edit AsyncRead, resize-edit support, MerkleHashSubtree::merge composition, multi-range API adoption). The redesign is a clear improvement — particularly:

  • Per-edit AsyncRead + reader ownership: structurally fixes the prior truncation-corruption hazard (callers literally cannot supply zero bytes for a region not in their edit set).
  • Defensive contract checks on the server response (response.windows.len() != n_windows, hash_ranges.len() != n_windows + 1, plus the post-loop input_idx != dirty_inputs.len() check) — these turn what would otherwise be silent file-hash corruption into loud errors.
  • Validation runs before the empty-original short-circuit — the regression test test_empty_original_validates_ranges is exactly the right shape.
  • Test matrix is excellent. Round-trip + clean-upload-hash equivalence on every content-modifying test is the gold standard, and the codex regressions are captured (test_mid_edit_plus_append, test_empty_original_append, test_truncate_to_empty_matches_clean_empty).

My approval still stands; logging a few items here that came up on this pass:

  1. Orphan window file entries (still present from my March review). Each window's cleaner.finish_with_chunks() ultimately calls register_single_file_clean_completion → add_file_reconstruction_info(fi) (file_upload_session.rs:494), and then register_composed_file registers the composed MDB on top. So a single upload_ranges call writes N+1 file reconstruction entries to the session shard: one per window plus the composed one. Functionally fine — segments are reachable via the composed file — but the per-window entries are unreferenced and pollute the catalog, and v1 GC isn't designed to spot them. Worth confirming this is intentional for v1 and tracking the cleanup, or filing an issue if not yet logged.
  2. compute_total_size (range_upload.rs:437) does plain arithmetic on caller-supplied new_length. Since validation only checks original_range.end <= original_size (not totals), a malicious or buggy caller passing huge new_length values overflows original_size + added. A checked_add (+ ParameterError) would close that.
  3. nit: hash_ranges.first().is_some_and(Option::is_none) (range_upload.rs:285) reads obliquely — a single-line comment like // leading gap is empty ⇒ first window starts at byte 0 would help future readers parse the double-Option intent.
  4. nit: register_composed_file's lookup-by-hash path (collecting mdb_by_hash from session.file_info_list()) implicitly assumes window content-hashes are unique within the session. If two windows happened to produce identical content (unlikely but possible with synthetic data), the HashMap would dedupe them and the .get(&middle_hash) would still resolve, but the segment splice would reuse the same middle_mdb twice — which is actually correct since the content is identical. Worth a brief comment that this is intentional.

Nothing here changes my approve verdict — items 2-4 are small. Item 1 is the same flag from March; happy to proceed if the call is to ship and clean up later.

Comment thread xet_data/src/processing/range_upload.rs Outdated
Comment thread xet_data/src/processing/range_upload.rs Outdated
Comment thread xet_data/src/processing/range_upload.rs
Comment thread xet_data/src/processing/range_upload.rs Outdated
Comment thread xet_data/src/processing/range_upload.rs
@XciD XciD force-pushed the feat/file-chunk-hashes-and-compose branch 3 times, most recently from 44815c9 to 1ba9082 Compare May 6, 2026 08:48
upload_ranges was emitting composed shards without a verification section, which
cas-server now rejects with `MDBShard("Shard verification failure, missing
verification section")`. Root cause: the May 1 MerkleHashSubtree refactor
(a9d9cfe) dropped client-side computation of FileVerificationEntry for stable
original segments, expecting `original_mdb.verification` to be populated — but
GET /v1/reconstructions has always returned an empty verification list, so the
composed shard always went out without verification.

Fix: extend GET /v2/file-chunk-hashes/{file_id} to return one
FileVerificationEntry per stable original segment (segments wholly outside the
dirty windows), in segment order. range_upload pops these in lockstep with the
segment walk to populate the composed shard's verification section.

This is the minimum data needed: window segments are re-uploaded fresh so their
verification comes from the per-window MDB; only stable segments need the
server's help. ~32 bytes per stable segment, much cheaper than the legacy
"return the whole chunk hash list" endpoint that was replaced by the multi-range
API in 78de688.

Backwards-compat: when the original file has no verification entries
(legacy / test files registered before verification was introduced), the
server returns an empty `gap_verification` and the composed shard is emitted
with `with_verification=false` as before.

Pairs with a xetcas-side change to populate `gap_verification` in the response.
Sim clients (local_client, memory_client) populate it locally via
build_file_chunk_hashes_response so the existing 27 range_upload tests pass
unchanged.
@XciD XciD force-pushed the feat/file-chunk-hashes-and-compose branch from 1ba9082 to 45f5937 Compare May 6, 2026 10:07
XciD added 7 commits May 6, 2026 18:17
The CAS server rejects any shard without a verification section. When
the entire file falls within dirty windows (no stable segments),
gap_verification is empty and original_mdb.verification is stripped,
causing original_has_verification to be false and the composed shard
to omit verification. Since the cleaner always produces verification
entries for its segments, unconditionally set has_verification=true.
- Use checked arithmetic in compute_total_size to prevent overflow
- Add clarifying comments on hash_ranges double-Option intent
- Add comment noting mdb_by_hash dedup-by-content is intentional
Add finish_with_chunks_detached() to SingleFileCleaner that uploads
xorb data but returns the MDBFileInfo directly instead of registering
it in the session shard. upload_ranges now uses the MDBFileInfo from
each window directly for composition, so only the final composed file
gets registered. This eliminates N unreferenced per-window entries
that previously polluted the shard.
…ation

The server always returns this field; no backward compatibility needed.
… improve error msg

- Extract MDB assembly (segment splicing + verification) into compose_mdb()
- Remove hardcoded original_has_verification=true and its dead branches
- Include file hash in "file not found" error for easier debugging
…se error

nightly-2026-05-06 broke __heap_base export needed by wasm-bindgen for
threading support. Pin to the last known-good nightly and remove
+nightly from build_wasm.sh so the CI-controlled toolchain is used.
@XciD XciD requested a review from rajatarya May 7, 2026 09:06

@rajatarya (Collaborator) left a comment

Re-review (post xetcas#987 merge)

This PR has come a long way since my approval on March 27. I went back and re-read the whole thing end-to-end against the new wire shape, and the rework is genuinely good — what shipped is cleaner than what I approved.

What's new since last pass

  1. Wire shape pivot. Composition is now driven by windows[] + hash_ranges[] of opaque MerkleHashSubtree summaries (xetcas#987), not flat per-chunk hashes. Per-chunk hashes never cross the wire, so the response stays O(windows + gaps) regardless of file size. Final hash is rebuilt with MerkleHashSubtree::merge — no ChunkHashList-glue layer, no duplicated tree-folding logic on the client.
  2. Resize edits. DirtyInput { original_range, reader, new_length } collapses in-place edit / replace / pure-insert / pure-delete / append / truncate into a single shape. The original_range.len() != new_length branch falls out cleanly thanks to per-edit readers (no ReadSeek over a sparse staging file — that bug class is structurally gone).
  3. My GC concern is addressed at the client layer. The new register_single_file_clean_completion_detached + register_composed_file split means window-uploads no longer pollute the session shard with orphan MDBFileInfo entries — only the final composed file is registered. Combined with sirahd's GC tracking issue (huggingface-internal/xet-garbage-collection#2), the design now does the right thing: window xorbs are findable through the composed file, and there are no dangling shard references.
  4. Defensive runtime invariants (commit 87b9061): server-contract violations (windows.len() != n_windows, hash_ranges.len() != n_windows + 1, leftover edits not assigned to any window, gap_verification length mismatch with stable-segment count) are now hard runtime errors instead of silent corruption. Exactly the right call for a composition algorithm where zip would otherwise drop windows quietly.

What upload_ranges offers, visually

End-to-end flow:

sequenceDiagram
    autonumber
    participant C as Caller (e.g. hf-mount)
    participant U as upload_ranges
    participant S as CAS server
    participant D as Cleaner / dedup

    C->>U: dirty_inputs[]<br/>(original_range, reader, new_length)
    Note over U: validate + coalesce<br/>snap to segment boundaries
    U->>S: GET /v2/file-chunk-hashes/{id}<br/>X-Range-Dirty: bytes=A-B,C-D
    S-->>U: windows[N] + hash_ranges[N+1]<br/>+ gap_verification[stable_segs]
    loop one cleaner per window
        U->>S: stream CAS prefix (boundary chunks)
        U->>D: feed prefix bytes
        U->>D: feed reader bytes (the edit)
        U->>S: stream CAS suffix
        U->>D: feed suffix bytes
        D-->>U: chunks + new MDBFileInfo (detached)
    end
    Note over U: merge_seq = [gap0, w0, gap1, ..., gapN]<br/>combined_hash = merge(seq).hmac(0)
    U->>S: register_composed_file(MDBFileInfo)
    U->>S: finalize session
    U-->>C: XetFileInfo { hash, size }

Why this is a big deal — the wire savings:

flowchart LR
    subgraph old ["Old: full re-upload"]
        OF["200MB original"] --> OU["download 200MB"]
        OU --> OE["edit 1MB"]
        OE --> OUP["upload 200MB"]
    end
    subgraph new ["upload_ranges: 2MB edit"]
        NF["200MB original"] --> NS["snap edits to<br/>segment boundaries"]
        NS --> NW["server returns<br/>2 windows + 3 gap subtrees"]
        NW --> NB["download ~few KB<br/>boundary chunks"]
        NB --> NU["upload ~2MB<br/>(2 dirty windows)"]
    end
    style old fill:#ffe5e5
    style new fill:#e5ffe5

Hash composition without re-reading the file:

flowchart TB
    F["Original file: chunks 0..M"] --> G0["gap0 subtree<br/>(server-stored)"] & G1["gap1 subtree<br/>(server-stored)"] & G2["gap2 subtree<br/>(server-stored)"]
    R["DirtyInput readers"] --> CL1["cleaner: window 0"] --> W0["w0 subtree<br/>(client-computed)"]
    R --> CL2["cleaner: window 1"] --> W1["w1 subtree<br/>(client-computed)"]
    G0 --> M{{"MerkleHashSubtree::merge<br/>[gap0, w0, gap1, w1, gap2]"}}
    W0 --> M
    G1 --> M
    W1 --> M
    G2 --> M
    M --> H["final_hash().hmac(0)<br/>== file_hash of edited file"]

Edit-shape coverage (one struct, six operations):

flowchart LR
    DI["DirtyInput<br/>{ original_range: a..b, new_length: n }"] --> Q1{"a == b?"}
    Q1 -->|yes| Q2{"n > 0?"}
    Q1 -->|no| Q3{"n == 0?"}
    Q2 -->|yes| INS["pure insert at a"]
    Q2 -->|no| NOOP["degenerate no-op"]
    Q3 -->|yes| DEL["pure delete a..b"]
    Q3 -->|no| Q4{"b - a == n?"}
    Q4 -->|yes| IPE["in-place edit"]
    Q4 -->|no| RES["resize replace"]
    INS -.->|special: a == original_size| APP["append"]
    DEL -.->|special: b == original_size| TRU["truncate to a"]

Items to think about (none blocking)

  1. chunk_window_builder.rs duplication. PR description flags this as intentional and temporary, mirroring xetcas#987 server logic so simulation clients answer get_file_chunk_hashes without HTTP. Worth filing a follow-up issue (or noting #717 in xetcas) to track the eventual dedup so the duplication doesn't quietly stick around as duplications always do without a forcing function.
  2. Empty gap_verification on partial-verification shards. When file_info.verification.len() != file_info.segments.len(), the simulation builder silently emits an empty gap_verification, and then compose_mdb errors out at runtime with a less-targeted message. Inline below — either error up-front or document the silent path.
  3. One readability nit + one defensive question. Inline.

Nothing here blocks the merge — most of these are conversation starters. Approving again on the strength of the current state. Nice work.

response.hash_ranges.len(),
n_windows + 1
)));
}

@rajatarya commented:

These guards (windows count, hash_ranges count at L182, edits-not-assigned-to-window at L269, and the gap_verification length checks in compose_mdb) are exactly the right thing to do. A subtle composition algorithm where zip silently truncates the merge sequence is precisely where you want loud bails over quiet corruption. The error messages even include expected vs actual counts, which makes server contract drift trivial to diagnose.

.take_while(|d| {
let r = &d.original_range;
if r.start == r.end {
r.start < w_end || (r.start == w_end && w_end == original_size)

@rajatarya commented:

The boundary case for pure inserts at exactly w_end is genuinely tricky — an insert at byte w_end belongs to the next window unless w_end == original_size (in which case there is no next window). The comment at L201-203 captures the why. Worth referencing test_resize_insert_at_segment_boundary and test_mid_edit_plus_append here so a future reader can find the exercising tests quickly. nit.

stream_cas_range(&ctx, &cas_client, original_hash, cursor, w_end, &mut cleaner).await?;
}

let (_info, chunks, mdb, _metrics) = cleaner.finish_with_chunks_detached().await?;

@rajatarya commented:

The _detached split is a real improvement — this resolves the GC concern I had on the prior pass. Window uploads now contribute their xorbs without registering an MDBFileInfo in the session shard, so only the final composed file shows up there. Combined with sirahd's tracking issue, the data lifecycle here is much cleaner than what I originally approved.

let mut hash_ranges = response.hash_ranges;
let trailing_gap = hash_ranges.pop().flatten();
// Leading gap is empty (None) => first window starts at byte 0.
let first_window_at_start = hash_ranges.first().is_some_and(Option::is_none);

@rajatarya commented:

nit: hash_ranges.first().is_some_and(Option::is_none) is concise but a bit cryptic — the reader has to remember Vec<Option<T>>::first() returns Option<&Option<T>> and reason about both layers. matches!(hash_ranges.first(), Some(None)) reads more directly. Same below for last_window_at_end if you want symmetry.

gap_idx += 1;
seg_idx += 1;
}
let original_window_end = w.end.min(original_size);

@rajatarya commented:

Defensive question, not a fix: under the current server contract, can w.end ever exceed original_size? The server returns chunk-aligned ranges within the original file, and the snap logic on the client also caps at original_size. If w.end > original_size is structurally impossible, this min is dead defense and a debug_assert!(w.end <= original_size) would document the invariant more loudly. If it can happen (e.g. some forward-compat shape), worth a one-line comment about which case justifies it.


// Emit one range hash per stable segment (= no overlap with any window).
// Segments and windows are both monotonic, so a two-pointer walk is O(S+W).
let gap_verification = if file_info.verification.len() == file_info.segments.len() {

@rajatarya commented:

The branch at L172 silently emits an empty gap_verification when file_info.verification.len() != file_info.segments.len(). Defensible for legacy shards that genuinely have no verification entries — but the failure mode for a partially-populated file_info would be a confusing runtime error in compose_mdb ("ran out of gap_verification entries at stable segment N") rather than something pointed at the actual cause. Two reasonable options:

  • Tighten: emit empty only if file_info.verification.is_empty(), otherwise return ClientError::Other(...) here with a clearer message.
  • Or document this branch's contract explicitly so the runtime error in compose_mdb is interpretable.

Not blocking — calling it out since the simulation-only path is what tests run against.

@@ -0,0 +1,205 @@
//! Server-side state machine for `GET /v2/file-chunk-hashes/{file_id}` (mirrored from
//! xetcas PR #987) plus a small driver helper used by simulation clients to produce a
//! [`FileChunkHashesResponse`] without routing through HTTP.

@rajatarya commented:

The doc comment is honest about the duplication-with-xetcas#987 situation, which I appreciate. Worth filing a follow-up issue to track the eventual dedup (either xetcas pulling this in, or this getting moved to a shared crate) so it doesn't quietly become a permanent fork. Duplications like this almost always do without a forcing function.

/// Distinct from the standard `Range` header (which scopes the response body): this header tags
/// regions that the client intends to re-chunk, and the response covers the whole file (windows +
/// gap subtrees). Value uses the same `bytes=A-B,C-D` syntax as `Range`.
pub const X_RANGE_DIRTY_HEADER: &str = "x-range-dirty";

@rajatarya commented:

nit: "x-range-dirty" lowercase is unusual when most header constants in this crate (e.g. SESSION_ID_HEADER at L17 = "X-Xet-Session-Id", REQUEST_ID_HEADER at L19 = "X-Request-Id") use the X-Foo-Bar capitalized form. HTTP headers are case-insensitive on the wire so this is purely cosmetic, but matching the surrounding style would be a small consistency win.
