Stable chunk boundary detection#815
Conversation
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit d3c3988. Configure here.
| /// `merkle_hash_subtree.rs` for the analogous construction in merkle hashing. | ||
| pub fn next_stable_chunk_boundary(starting_position: usize, chunk_boundaries: &[usize]) -> Option<usize> { | ||
| let minimum_chunk = *TARGET_CHUNK_SIZE / *MINIMUM_CHUNK_DIVISOR; | ||
| let maximum_chunk = *TARGET_CHUNK_SIZE * *MAXIMUM_CHUNK_MULTIPLIER; |
There was a problem hiding this comment.
Hardcoded global constants instead of parameterized chunk size
Low Severity
next_stable_chunk_boundary derives minimum_chunk and maximum_chunk from the global *TARGET_CHUNK_SIZE, while the analogous find_partitions function accepts target_chunk_size as a parameter. If chunk boundaries were produced by a Chunker::new(custom_size) with a non-default target (e.g., the wasm crate passes a user-provided value), the hardcoded globals will compute wrong size thresholds, silently breaking the stability guarantee. Accepting target_chunk_size as a parameter would make the API consistent with find_partitions and safe for non-default configurations.
Reviewed by Cursor Bugbot for commit d3c3988. Configure here.
rajatarya
left a comment
There was a problem hiding this comment.
Beautiful piece of work. The 2 * min_chunk lower bound (vs. min_chunk in find_partitions) is exactly the right call — the comment explaining why (no data access means we can't rule out shadow-zone triggers in [c_k, c_k + min_chunk)) makes the choice obvious to future readers. The proof update in parallel chunking.lyx reversing the earlier "I do not believe so" caveat — by feeding min_chunk - 64 - 1 bytes of warmup so the gear hash window is fully data-dependent at all accepted trigger positions — is a really nice contribution that makes the broader chunking story sound.
Test coverage is impressive. test_stable_chunk_boundary_is_tight (verifying the predecessor of a stable boundary is actually unstable for some mutation) is the kind of test you don't see often — it guards against the bound becoming silently too conservative over time, which is the failure mode that's hardest to catch otherwise. The mutation-type sweep (zero-fill, 0xFF, reverse, XOR, random) plus the constant-data degenerate case (no stable boundary because all cuts are forced max-chunk size) covers the corners well.
+1 to @cursor's parameterization note on chunking.rs:390 — for what it's worth, the testability angle is the strongest argument: tests already use test_set_constants! { TARGET_CHUNK_SIZE = 1024 } to override the global, but a parameterized next_stable_chunk_boundary(start, boundaries, target_chunk_size) would let tests pass values directly and make this consistent with find_partitions. Not blocking.
One thing worth thinking about for downstream callers (xet-core #717 upload_ranges is the obvious one) — see inline.
Approving.
| } | ||
|
|
||
| #[cfg(test)] | ||
| mod tests { |
There was a problem hiding this comment.
This returns None for several legitimate cases: small files (< 3 chunks total after start_idx), files with mostly forced max-chunk cuts (constant/highly-compressible data — confirmed by test_stable_chunk_boundary_with_constant_data), and edits very close to EOF.
For the partial-file-editing flow (xet-core #717 upload_ranges), what's the intended fallback when this returns None? Three plausible answers:
- Fall back to re-uploading the whole file — safe but defeats the purpose for small or constant-data files.
- Use the file's last boundary as a degraded "stable" point — works as long as the caller doesn't rely on bytes after that being identical, which they shouldn't if they're appending/truncating anyway.
- Surface the
Noneto the caller as an explicit "not eligible for partial upload" — cleanest contract.
A short doc paragraph here saying "callers receiving None should fall back to a full re-upload, since no portion of the tail can be safely reused without re-reading file bytes" would steer integrators toward the right answer. Otherwise I can see someone treating None as "use file_size as the boundary" and quietly producing wrong hashes. Not blocking but a 2-line doc add would prevent a foot-gun.
seanses
left a comment
There was a problem hiding this comment.
Let's also keep the pdf file up to date (run pdflatex).


This PR adds a function, next_stable_chunk_boundary, that takes a list of chunk boundary positions and a starting cut point and returns the next chunk boundary after the cut point such that, for all possible alterations of the data up to the cut point, the chunk boundaries when chunking the entire file will always be the same starting at the stable chunk boundary.
The implication of this is that to alter a specific range of a file
[a, b), we would do the following:c_start.c_end.[a, b); prepend the originaldata[c_start, a)and appenddata[b, c_end); chunk this segment.[0, c_start), the new [c_start, c_end), and the original[c_end, end)to calculate the new file hash. This will be the same as chunking the entire new file.Note
Medium Risk
Adds new public chunk-boundary selection logic used to make resumed/partial workflows deterministic; mistakes could cause misalignment or incorrect resume behavior in deduplication/chunking paths. Large new randomized/stress tests reduce risk but the algorithm’s correctness assumptions are subtle.
Overview
Introduces a new public helper,
next_stable_chunk_boundary, that computes a restart-safe/stable resume boundary from existing chunk-boundary metadata (no byte access) by scanning for two consecutive chunks that fall within a conservative size window derived from chunking constants.Updates
find_partitionsdocumentation to reflect the hash warmup/hidden-trigger verification approach and to reference the new helper, re-exports the function fromxet_data::deduplication, and adds extensive edge-case and randomized mutation/stress tests to validate boundary stability under arbitrary prefix changes.Reviewed by Cursor Bugbot for commit 9841160. Bugbot is set up for automated code reviews on this repo. Configure here.