Skip to content

Stable chunk boundary detection#815

Merged
hoytak merged 4 commits into
mainfrom
hoytak/260424-stable-cut-points
May 18, 2026
Merged

Stable chunk boundary detection#815
hoytak merged 4 commits into
mainfrom
hoytak/260424-stable-cut-points

Conversation

@hoytak
Copy link
Copy Markdown
Collaborator

@hoytak hoytak commented Apr 24, 2026

This PR adds a function, next_stable_chunk_boundary, that takes a list of chunk boundary positions and a starting cut point and returns the next chunk boundary after the cut point such that, for all possible alterations of the data up to the cut point, the chunk boundaries when chunking the entire file will always be the same starting at the stable chunk boundary.

The implication of this is that to alter a specific range of a file [a, b), we would do the following:

  1. Locate the previous chunk boundary before a; call this c_start.
  2. Take the full set of chunk boundary locations, call next_stable_chunk_boundary with b as the cut point. this will return the next stable chunk boundary. Call this c_end.
  3. Make the replacement to [a, b); prepend the original data[c_start, a) and append data[b, c_end); chunk this segment.
  4. Use the merkle hash subtrees for [0, c_start), the new [c_start, c_end), and the original [c_end, end) to calculate the new file hash. This will be the same as chunking the entire new file.

Note

Medium Risk
Adds new public chunk-boundary selection logic used to make resumed/partial workflows deterministic; mistakes could cause misalignment or incorrect resume behavior in deduplication/chunking paths. Large new randomized/stress tests reduce risk but the algorithm’s correctness assumptions are subtle.

Overview
Introduces a new public helper, next_stable_chunk_boundary, that computes a restart-safe/stable resume boundary from existing chunk-boundary metadata (no byte access) by scanning for two consecutive chunks that fall within a conservative size window derived from chunking constants.

Updates find_partitions documentation to reflect the hash warmup/hidden-trigger verification approach and to reference the new helper, re-exports the function from xet_data::deduplication, and adds extensive edge-case and randomized mutation/stress tests to validate boundary stability under arbitrary prefix changes.

Reviewed by Cursor Bugbot for commit 9841160. Bugbot is set up for automated code reviews on this repo. Configure here.

@hoytak hoytak requested review from XciD, rajatarya and seanses April 24, 2026 08:05
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit d3c3988. Configure here.

/// `merkle_hash_subtree.rs` for the analogous construction in merkle hashing.
pub fn next_stable_chunk_boundary(starting_position: usize, chunk_boundaries: &[usize]) -> Option<usize> {
let minimum_chunk = *TARGET_CHUNK_SIZE / *MINIMUM_CHUNK_DIVISOR;
let maximum_chunk = *TARGET_CHUNK_SIZE * *MAXIMUM_CHUNK_MULTIPLIER;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded global constants instead of parameterized chunk size

Low Severity

next_stable_chunk_boundary derives minimum_chunk and maximum_chunk from the global *TARGET_CHUNK_SIZE, while the analogous find_partitions function accepts target_chunk_size as a parameter. If chunk boundaries were produced by a Chunker::new(custom_size) with a non-default target (e.g., the wasm crate passes a user-provided value), the hardcoded globals will compute wrong size thresholds, silently breaking the stability guarantee. Accepting target_chunk_size as a parameter would make the API consistent with find_partitions and safe for non-default configurations.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit d3c3988. Configure here.

Copy link
Copy Markdown
Collaborator

@rajatarya rajatarya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful piece of work. The 2 * min_chunk lower bound (vs. min_chunk in find_partitions) is exactly the right call — the comment explaining why (no data access means we can't rule out shadow-zone triggers in [c_k, c_k + min_chunk)) makes the choice obvious to future readers. The proof update in parallel chunking.lyx reversing the earlier "I do not believe so" caveat — by feeding min_chunk - 64 - 1 bytes of warmup so the gear hash window is fully data-dependent at all accepted trigger positions — is a really nice contribution that makes the broader chunking story sound.

Test coverage is impressive. test_stable_chunk_boundary_is_tight (verifying the predecessor of a stable boundary is actually unstable for some mutation) is the kind of test you don't see often — it guards against the bound becoming silently too conservative over time, which is the failure mode that's hardest to catch otherwise. The mutation-type sweep (zero-fill, 0xFF, reverse, XOR, random) plus the constant-data degenerate case (no stable boundary because all cuts are forced max-chunk size) covers the corners well.

+1 to @cursor's parameterization note on chunking.rs:390 — for what it's worth, the testability angle is the strongest argument: tests already use test_set_constants! { TARGET_CHUNK_SIZE = 1024 } to override the global, but a parameterized next_stable_chunk_boundary(start, boundaries, target_chunk_size) would let tests pass values directly and make this consistent with find_partitions. Not blocking.

One thing worth thinking about for downstream callers (xet-core #717 upload_ranges is the obvious one) — see inline.

Approving.

}

#[cfg(test)]
mod tests {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This returns None for several legitimate cases: small files (< 3 chunks total after start_idx), files with mostly forced max-chunk cuts (constant/highly-compressible data — confirmed by test_stable_chunk_boundary_with_constant_data), and edits very close to EOF.

For the partial-file-editing flow (xet-core #717 upload_ranges), what's the intended fallback when this returns None? Three plausible answers:

  1. Fall back to re-uploading the whole file — safe but defeats the purpose for small or constant-data files.
  2. Use the file's last boundary as a degraded "stable" point — works as long as the caller doesn't rely on bytes after that being identical, which they shouldn't if they're appending/truncating anyway.
  3. Surface the None to the caller as an explicit "not eligible for partial upload" — cleanest contract.

A short doc paragraph here saying "callers receiving None should fall back to a full re-upload, since no portion of the tail can be safely reused without re-reading file bytes" would steer integrators toward the right answer. Otherwise I can see someone treating None as "use file_size as the boundary" and quietly producing wrong hashes. Not blocking but a 2-line doc add would prevent a foot-gun.

Copy link
Copy Markdown
Collaborator

@seanses seanses left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also keep the pdf file up to date (run pdflatex).

@hoytak hoytak merged commit 6f0cf38 into main May 18, 2026
10 checks passed
@hoytak hoytak deleted the hoytak/260424-stable-cut-points branch May 18, 2026 19:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants