
Stable chunk boundary detection #815

Open
hoytak wants to merge 2 commits into main from hoytak/260424-stable-cut-points

Conversation

@hoytak
Collaborator

@hoytak hoytak commented Apr 24, 2026

This PR adds a function, next_stable_chunk_boundary, that takes a list of chunk boundary positions and a starting cut point and returns the next chunk boundary after the cut point with the following guarantee: for any possible alteration of the data before the cut point, chunking the entire file produces identical chunk boundaries from that stable boundary onward.

The implication of this is that to alter a specific range of a file [a, b), we would do the following:

  1. Locate the previous chunk boundary before a; call this c_start.
  2. Take the full set of chunk boundary locations and call next_stable_chunk_boundary with b as the cut point; this returns the next stable chunk boundary. Call this c_end.
  3. Make the replacement to [a, b); prepend the original data[c_start, a) and append data[b, c_end); chunk this segment.
  4. Use the merkle hash subtrees for [0, c_start), the new [c_start, c_end), and the original [c_end, end) to calculate the new file hash. This will be the same as chunking the entire new file.
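The caller-side flow above can be sketched as follows. This is a hedged illustration only: `prev_boundary_before` is a hypothetical helper, and the `next_stable_chunk_boundary` stand-in here uses a simplified placeholder rule, not the PR's actual data-independent stability criterion.

```rust
/// Step 1: last boundary at or before `a` (0 if none exists).
fn prev_boundary_before(boundaries: &[usize], a: usize) -> usize {
    boundaries.iter().copied().filter(|&c| c <= a).max().unwrap_or(0)
}

/// Stand-in for the PR's function: first boundary strictly after the cut
/// point, skipping one boundary as slack. The real implementation applies a
/// stability criterion derived from the chunker's min/max chunk sizes.
fn next_stable_chunk_boundary(cut: usize, boundaries: &[usize]) -> Option<usize> {
    let mut after = boundaries.iter().copied().filter(|&c| c > cut);
    after.next(); // illustrative slack only; not the real criterion
    after.next()
}

fn main() {
    let boundaries = [100, 250, 400, 600, 800];
    let (a, b) = (260, 300); // the range [a, b) being replaced
    let c_start = prev_boundary_before(&boundaries, a); // step 1
    let c_end = next_stable_chunk_boundary(b, &boundaries); // step 2
    // Steps 3-4 would rechunk data[c_start..c_end) and splice the
    // merkle subtrees for [0, c_start) and [c_end, end) around it.
    println!("c_start={c_start}, c_end={c_end:?}");
}
```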

Note

Medium Risk
Adds new boundary-selection logic that will influence resume/partial dedup workflows; while largely additive and well-tested, mistakes could cause chunk misalignment and downstream hash/dedup inconsistencies.

Overview
Adds a new public helper, next_stable_chunk_boundary, to compute a restart-safe resume point using only existing chunk-boundary metadata (no file bytes), and re-exports it from xet_data::deduplication for downstream use.

Updates chunking docs/proof notes around guaranteed partition alignment, and adds a dedicated stress/edge-case test suite that mutates file prefixes to assert chunk-boundary alignment after the returned stable boundary. Also bumps rustls-webpki from 0.103.12 to 0.103.13 in Cargo.lock.

Reviewed by Cursor Bugbot for commit ceaabb0.

@hoytak hoytak requested review from XciD, rajatarya and seanses April 24, 2026 08:05

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit d3c3988.

/// `merkle_hash_subtree.rs` for the analogous construction in merkle hashing.
pub fn next_stable_chunk_boundary(starting_position: usize, chunk_boundaries: &[usize]) -> Option<usize> {
    let minimum_chunk = *TARGET_CHUNK_SIZE / *MINIMUM_CHUNK_DIVISOR;
    let maximum_chunk = *TARGET_CHUNK_SIZE * *MAXIMUM_CHUNK_MULTIPLIER;


Hardcoded global constants instead of parameterized chunk size

Low Severity

next_stable_chunk_boundary derives minimum_chunk and maximum_chunk from the global *TARGET_CHUNK_SIZE, while the analogous find_partitions function accepts target_chunk_size as a parameter. If chunk boundaries were produced by a Chunker::new(custom_size) with a non-default target (e.g., the wasm crate passes a user-provided value), the hardcoded globals will compute wrong size thresholds, silently breaking the stability guarantee. Accepting target_chunk_size as a parameter would make the API consistent with find_partitions and safe for non-default configurations.
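The parameterization being suggested could look roughly like the sketch below. The divisor and multiplier values are illustrative placeholders, not the crate's actual constants, and `chunk_size_bounds` is a hypothetical helper name.

```rust
// Illustrative stand-ins for the crate's globals (assumed values, not real).
const MINIMUM_CHUNK_DIVISOR: usize = 8;
const MAXIMUM_CHUNK_MULTIPLIER: usize = 2;

/// Derive min/max chunk thresholds from a caller-supplied target size
/// instead of a global, mirroring how find_partitions takes the target
/// as a parameter.
fn chunk_size_bounds(target_chunk_size: usize) -> (usize, usize) {
    let minimum_chunk = target_chunk_size / MINIMUM_CHUNK_DIVISOR;
    let maximum_chunk = target_chunk_size * MAXIMUM_CHUNK_MULTIPLIER;
    (minimum_chunk, maximum_chunk)
}

fn main() {
    // A Chunker::new(custom_size) caller could pass its target straight through.
    let (min_c, max_c) = chunk_size_bounds(1024);
    println!("min={min_c}, max={max_c}");
}
```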



Collaborator

@rajatarya rajatarya left a comment


Beautiful piece of work. The 2 * min_chunk lower bound (vs. min_chunk in find_partitions) is exactly the right call; the comment explaining why (with no data access, we cannot rule out shadow-zone triggers in [c_k, c_k + min_chunk)) makes the choice obvious to future readers. The proof update in parallel chunking.lyx, which reverses the earlier "I do not believe so" caveat by feeding min_chunk - 64 - 1 bytes of warmup so the gear hash window is fully data-dependent at all accepted trigger positions, is a really nice contribution that makes the broader chunking story sound.

Test coverage is impressive. test_stable_chunk_boundary_is_tight (verifying the predecessor of a stable boundary is actually unstable for some mutation) is the kind of test you don't see often — it guards against the bound becoming silently too conservative over time, which is the failure mode that's hardest to catch otherwise. The mutation-type sweep (zero-fill, 0xFF, reverse, XOR, random) plus the constant-data degenerate case (no stable boundary because all cuts are forced max-chunk size) covers the corners well.

+1 to @cursor's parameterization note on chunking.rs:390 — for what it's worth, the testability angle is the strongest argument: tests already use test_set_constants! { TARGET_CHUNK_SIZE = 1024 } to override the global, but a parameterized next_stable_chunk_boundary(start, boundaries, target_chunk_size) would let tests pass values directly and make this consistent with find_partitions. Not blocking.

One thing worth thinking about for downstream callers (xet-core #717 upload_ranges is the obvious one) — see inline.

Approving.

}

#[cfg(test)]
mod tests {
Collaborator


This returns None for several legitimate cases: small files (< 3 chunks total after start_idx), files with mostly forced max-chunk cuts (constant/highly-compressible data — confirmed by test_stable_chunk_boundary_with_constant_data), and edits very close to EOF.

For the partial-file-editing flow (xet-core #717 upload_ranges), what's the intended fallback when this returns None? Three plausible answers:

  1. Fall back to re-uploading the whole file — safe but defeats the purpose for small or constant-data files.
  2. Use the file's last boundary as a degraded "stable" point — works as long as the caller doesn't rely on bytes after that being identical, which they shouldn't if they're appending/truncating anyway.
  3. Surface the None to the caller as an explicit "not eligible for partial upload" — cleanest contract.

A short doc paragraph here saying "callers receiving None should fall back to a full re-upload, since no portion of the tail can be safely reused without re-reading file bytes" would steer integrators toward the right answer. Otherwise I can see someone treating None as "use file_size as the boundary" and quietly producing wrong hashes. Not blocking but a 2-line doc add would prevent a foot-gun.
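The fallback contract being argued for here (option 1 by default, with option 3's explicitness) could be sketched like this; `UploadPlan` and `plan_upload` are hypothetical names for illustration, not part of the PR.

```rust
/// Hypothetical caller-side policy for interpreting the function's result.
#[derive(Debug, PartialEq)]
enum UploadPlan {
    /// A stable boundary exists; everything from it onward can be reused.
    Partial { stable_boundary: usize },
    /// No stable boundary: no portion of the tail can be safely reused
    /// without re-reading file bytes, so re-upload the whole file.
    FullReupload,
}

fn plan_upload(stable: Option<usize>) -> UploadPlan {
    match stable {
        Some(b) => UploadPlan::Partial { stable_boundary: b },
        // Crucially, None must NOT be treated as "use file_size as the
        // boundary", which would quietly produce wrong hashes.
        None => UploadPlan::FullReupload,
    }
}

fn main() {
    assert_eq!(plan_upload(None), UploadPlan::FullReupload);
    assert_eq!(plan_upload(Some(4096)), UploadPlan::Partial { stable_boundary: 4096 });
}
```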

