Stable chunk boundary detection by hoytak · Pull Request #815 · huggingface/xet-core

hoytak · 2026-04-24T08:05:54Z

This PR adds a function, next_stable_chunk_boundary, that takes a list of chunk boundary positions and a starting cut point and returns the next chunk boundary after the cut point such that, for all possible alterations of the data up to the cut point, the chunk boundaries when chunking the entire file will always be the same starting at the stable chunk boundary.

The implication of this is that to alter a specific range of a file [a, b), we would do the following:

Locate the previous chunk boundary before a; call this c_start.
Take the full set of chunk boundary locations, call next_stable_chunk_boundary with b as the cut point. this will return the next stable chunk boundary. Call this c_end.
Make the replacement to [a, b); prepend the original data[c_start, a) and append data[b, c_end); chunk this segment.
Use the merkle hash subtrees for [0, c_start), the new [c_start, c_end), and the original [c_end, end) to calculate the new file hash. This will be the same as chunking the entire new file.

Note

Medium Risk
Adds new public chunk-boundary selection logic used to make resumed/partial workflows deterministic; mistakes could cause misalignment or incorrect resume behavior in deduplication/chunking paths. Large new randomized/stress tests reduce risk but the algorithm’s correctness assumptions are subtle.

Overview
Introduces a new public helper, next_stable_chunk_boundary, that computes a restart-safe/stable resume boundary from existing chunk-boundary metadata (no byte access) by scanning for two consecutive chunks that fall within a conservative size window derived from chunking constants.

Updates find_partitions documentation to reflect the hash warmup/hidden-trigger verification approach and to reference the new helper, re-exports the function from xet_data::deduplication, and adds extensive edge-case and randomized mutation/stress tests to validate boundary stability under arbitrary prefix changes.

^{Reviewed by Cursor Bugbot for commit 9841160. Bugbot is set up for automated code reviews on this repo. Configure here.}

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit d3c3988. Configure here.}

cursor · 2026-04-24T08:16:19Z

+/// `merkle_hash_subtree.rs` for the analogous construction in merkle hashing.
+pub fn next_stable_chunk_boundary(starting_position: usize, chunk_boundaries: &[usize]) -> Option<usize> {
+    let minimum_chunk = *TARGET_CHUNK_SIZE / *MINIMUM_CHUNK_DIVISOR;
+    let maximum_chunk = *TARGET_CHUNK_SIZE * *MAXIMUM_CHUNK_MULTIPLIER;


Hardcoded global constants instead of parameterized chunk size

Low Severity

next_stable_chunk_boundary derives minimum_chunk and maximum_chunk from the global *TARGET_CHUNK_SIZE, while the analogous find_partitions function accepts target_chunk_size as a parameter. If chunk boundaries were produced by a Chunker::new(custom_size) with a non-default target (e.g., the wasm crate passes a user-provided value), the hardcoded globals will compute wrong size thresholds, silently breaking the stability guarantee. Accepting target_chunk_size as a parameter would make the API consistent with find_partitions and safe for non-default configurations.

^{Reviewed by Cursor Bugbot for commit d3c3988. Configure here.}

rajatarya

Beautiful piece of work. The 2 * min_chunk lower bound (vs. min_chunk in find_partitions) is exactly the right call — the comment explaining why (no data access means we can't rule out shadow-zone triggers in [c_k, c_k + min_chunk)) makes the choice obvious to future readers. The proof update in parallel chunking.lyx reversing the earlier "I do not believe so" caveat — by feeding min_chunk - 64 - 1 bytes of warmup so the gear hash window is fully data-dependent at all accepted trigger positions — is a really nice contribution that makes the broader chunking story sound.

Test coverage is impressive. test_stable_chunk_boundary_is_tight (verifying the predecessor of a stable boundary is actually unstable for some mutation) is the kind of test you don't see often — it guards against the bound becoming silently too conservative over time, which is the failure mode that's hardest to catch otherwise. The mutation-type sweep (zero-fill, 0xFF, reverse, XOR, random) plus the constant-data degenerate case (no stable boundary because all cuts are forced max-chunk size) covers the corners well.

+1 to @cursor's parameterization note on chunking.rs:390 — for what it's worth, the testability angle is the strongest argument: tests already use test_set_constants! { TARGET_CHUNK_SIZE = 1024 } to override the global, but a parameterized next_stable_chunk_boundary(start, boundaries, target_chunk_size) would let tests pass values directly and make this consistent with find_partitions. Not blocking.

One thing worth thinking about for downstream callers (xet-core #717 upload_ranges is the obvious one) — see inline.

Approving.

rajatarya · 2026-04-28T01:57:43Z

+}
+
 #[cfg(test)]
 mod tests {


This returns None for several legitimate cases: small files (< 3 chunks total after start_idx), files with mostly forced max-chunk cuts (constant/highly-compressible data — confirmed by test_stable_chunk_boundary_with_constant_data), and edits very close to EOF.

For the partial-file-editing flow (xet-core #717 upload_ranges), what's the intended fallback when this returns None? Three plausible answers:

Fall back to re-uploading the whole file — safe but defeats the purpose for small or constant-data files.

Use the file's last boundary as a degraded "stable" point — works as long as the caller doesn't rely on bytes after that being identical, which they shouldn't if they're appending/truncating anyway.

Surface the None to the caller as an explicit "not eligible for partial upload" — cleanest contract.

A short doc paragraph here saying "callers receiving None should fall back to a full re-upload, since no portion of the tail can be safely reused without re-reading file bytes" would steer integrators toward the right answer. Otherwise I can see someone treating None as "use file_size as the boundary" and quietly producing wrong hashes. Not blocking but a 2-line doc add would prevent a foot-gun.

seanses

Let's also keep the pdf file up to date (run pdflatex).

…cut-points

Updated algorithm for stable cut point detection.

d3c3988

hoytak requested review from XciD, rajatarya and seanses April 24, 2026 08:05

cursor Bot reviewed Apr 24, 2026

View reviewed changes

Updated Cargo.lock to eliminate audit warning.

ceaabb0

XciD approved these changes Apr 24, 2026

View reviewed changes

rajatarya approved these changes Apr 28, 2026

View reviewed changes

seanses approved these changes May 14, 2026

View reviewed changes

hoytak added 2 commits May 18, 2026 11:24

Merge remote-tracking branch 'origin/main' into hoytak/260424-stable-…

fe3b143

…cut-points

Updated paper.

9841160

hoytak merged commit 6f0cf38 into main May 18, 2026
10 checks passed

hoytak deleted the hoytak/260424-stable-cut-points branch May 18, 2026 19:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stable chunk boundary detection#815

Stable chunk boundary detection#815
hoytak merged 4 commits into
mainfrom
hoytak/260424-stable-cut-points

hoytak commented Apr 24, 2026 •

edited by cursor Bot

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot Apr 24, 2026

Uh oh!

rajatarya left a comment

Uh oh!

rajatarya Apr 28, 2026

Uh oh!

seanses left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

hoytak commented Apr 24, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Apr 24, 2026

Choose a reason for hiding this comment

Hardcoded global constants instead of parameterized chunk size

Uh oh!

rajatarya left a comment

Choose a reason for hiding this comment

Uh oh!

rajatarya Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

seanses left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

hoytak commented Apr 24, 2026 •

edited by cursor Bot

Loading