Stable chunk boundary detection#815
Conversation
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit d3c3988. Configure here.
/// `merkle_hash_subtree.rs` for the analogous construction in merkle hashing.
pub fn next_stable_chunk_boundary(starting_position: usize, chunk_boundaries: &[usize]) -> Option<usize> {
    let minimum_chunk = *TARGET_CHUNK_SIZE / *MINIMUM_CHUNK_DIVISOR;
    let maximum_chunk = *TARGET_CHUNK_SIZE * *MAXIMUM_CHUNK_MULTIPLIER;
Hardcoded global constants instead of parameterized chunk size
Low Severity
next_stable_chunk_boundary derives minimum_chunk and maximum_chunk from the global *TARGET_CHUNK_SIZE, while the analogous find_partitions function accepts target_chunk_size as a parameter. If chunk boundaries were produced by a Chunker::new(custom_size) with a non-default target (e.g., the wasm crate passes a user-provided value), the hardcoded globals will compute wrong size thresholds, silently breaking the stability guarantee. Accepting target_chunk_size as a parameter would make the API consistent with find_partitions and safe for non-default configurations.
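A minimal sketch of the suggested parameterized signature, assuming placeholder values for the divisor/multiplier constants and a purely illustrative stability check (the PR's actual selection logic is not reproduced here):

```rust
// Placeholder constants for illustration only; not the crate's real values.
const MINIMUM_CHUNK_DIVISOR: usize = 8;
const MAXIMUM_CHUNK_MULTIPLIER: usize = 2;

// Hypothetical variant that follows the caller's target, matching
// find_partitions, instead of reading the global *TARGET_CHUNK_SIZE statics.
pub fn next_stable_chunk_boundary_with_target(
    starting_position: usize,
    chunk_boundaries: &[usize],
    target_chunk_size: usize,
) -> Option<usize> {
    let minimum_chunk = target_chunk_size / MINIMUM_CHUNK_DIVISOR;
    let maximum_chunk = target_chunk_size * MAXIMUM_CHUNK_MULTIPLIER;

    // Illustrative check only: accept the first boundary past the start
    // whose preceding gap is hash-triggered (< maximum_chunk) and clears
    // the 2 * minimum_chunk lower bound discussed in this review.
    chunk_boundaries.windows(2).find_map(|w| {
        let gap = w[1] - w[0];
        (w[1] > starting_position && gap >= 2 * minimum_chunk && gap < maximum_chunk)
            .then_some(w[1])
    })
}
```

With a parameter, tests could pass a custom target directly instead of overriding the global via test_set_constants!.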
rajatarya left a comment
Beautiful piece of work. The 2 * min_chunk lower bound (vs. min_chunk in find_partitions) is exactly the right call — the comment explaining why (no data access means we can't rule out shadow-zone triggers in [c_k, c_k + min_chunk)) makes the choice obvious to future readers. The proof update in parallel chunking.lyx reversing the earlier "I do not believe so" caveat — by feeding min_chunk - 64 - 1 bytes of warmup so the gear hash window is fully data-dependent at all accepted trigger positions — is a really nice contribution that makes the broader chunking story sound.
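One way to picture that bound, as a toy calculation (placeholder constants and an illustration of the review's reasoning, not the PR's code): a trigger hidden in the shadow zone [c_k, c_k + min_chunk) that an upstream edit exposes would start its own min-size skip, reaching at most c_k + 2 * min_chunk.

```rust
// Toy arithmetic behind the 2 * min_chunk lower bound (assumed divisor of 8;
// this sketches the review's argument, it is not code from the PR).
fn shadow_zone_reach(c_k: usize, target_chunk_size: usize) -> usize {
    let min_chunk = target_chunk_size / 8; // assumption: placeholder divisor
    // Triggers in [c_k, c_k + min_chunk) were suppressed by the min-size
    // rule; if one fires after an edit, its own min-size skip extends at
    // most another min_chunk bytes past the end of the shadow zone.
    c_k + min_chunk + min_chunk
}
```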
Test coverage is impressive. test_stable_chunk_boundary_is_tight (verifying the predecessor of a stable boundary is actually unstable for some mutation) is the kind of test you don't see often — it guards against the bound becoming silently too conservative over time, which is the failure mode that's hardest to catch otherwise. The mutation-type sweep (zero-fill, 0xFF, reverse, XOR, random) plus the constant-data degenerate case (no stable boundary because all cuts are forced max-chunk size) covers the corners well.
+1 to @cursor's parameterization note on chunking.rs:390 — for what it's worth, the testability angle is the strongest argument: tests already use test_set_constants! { TARGET_CHUNK_SIZE = 1024 } to override the global, but a parameterized next_stable_chunk_boundary(start, boundaries, target_chunk_size) would let tests pass values directly and make this consistent with find_partitions. Not blocking.
One thing worth thinking about for downstream callers (xet-core #717 upload_ranges is the obvious one) — see inline.
Approving.
}

#[cfg(test)]
mod tests {
This returns None for several legitimate cases: small files (< 3 chunks total after start_idx), files with mostly forced max-chunk cuts (constant/highly-compressible data — confirmed by test_stable_chunk_boundary_with_constant_data), and edits very close to EOF.
For the partial-file-editing flow (xet-core #717 upload_ranges), what's the intended fallback when this returns None? Three plausible answers:
- Fall back to re-uploading the whole file — safe but defeats the purpose for small or constant-data files.
- Use the file's last boundary as a degraded "stable" point — works as long as the caller doesn't rely on bytes after that being identical, which they shouldn't if they're appending/truncating anyway.
- Surface the None to the caller as an explicit "not eligible for partial upload" — cleanest contract.
A short doc paragraph here saying "callers receiving None should fall back to a full re-upload, since no portion of the tail can be safely reused without re-reading file bytes" would steer integrators toward the right answer. Otherwise I can see someone treating None as "use file_size as the boundary" and quietly producing wrong hashes. Not blocking but a 2-line doc add would prevent a foot-gun.
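A hypothetical caller-side sketch of that cleanest-contract option (UploadPlan and the stand-in boundary lookup are illustrative names, not APIs from this PR):

```rust
#[derive(Debug, PartialEq)]
enum UploadPlan {
    Partial { rechunk_until: usize },
    Full,
}

// Stand-in for the real next_stable_chunk_boundary: here it just returns the
// first boundary strictly past the start, purely so the sketch runs.
fn next_stable_chunk_boundary(start: usize, boundaries: &[usize]) -> Option<usize> {
    boundaries.iter().copied().find(|&b| b > start)
}

fn plan_partial_upload(edit_start: usize, boundaries: &[usize]) -> UploadPlan {
    match next_stable_chunk_boundary(edit_start, boundaries) {
        // A stable boundary exists: bytes from it onward can be reused.
        Some(stable) => UploadPlan::Partial { rechunk_until: stable },
        // No stable boundary: nothing in the tail is provably reusable, so
        // fall back to a full re-upload rather than guessing file_size.
        None => UploadPlan::Full,
    }
}
```

Making the Full fallback explicit in the type keeps an integrator from quietly treating file_size as a boundary.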


This PR adds a function, next_stable_chunk_boundary, that takes a list of chunk boundary positions and a starting cut point and returns the next chunk boundary after the cut point such that, for all possible alterations of the data up to the cut point, the chunk boundaries when chunking the entire file will always be the same starting at the stable chunk boundary.
The implication of this is that to alter a specific range of a file [a, b), we would do the following:
- Find a chunk boundary at or before a; call it c_start.
- Find the next stable chunk boundary at or after b; call it c_end.
- Apply the alteration to [a, b); prepend the original data[c_start, a) and append data[b, c_end); chunk this segment.
- Combine the original chunks covering [0, c_start), the new chunks covering [c_start, c_end), and the original chunks covering [c_end, end) to calculate the new file hash. This will be the same as chunking the entire new file.

Note
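The workflow above can be sketched end to end with a toy fixed-size chunker (every boundary is trivially stable here, and the edit is kept length-preserving to simplify the splice; all names are illustrative, not the PR's API):

```rust
// Toy fixed-size "chunker" so the splice-and-recombine flow is runnable.
const CHUNK: usize = 4;

/// Boundary positions (exclusive chunk ends) for a fixed-size chunker.
fn chunk_positions(data: &[u8]) -> Vec<usize> {
    (1..=data.len() / CHUNK).map(|i| i * CHUNK).collect()
}

/// Re-chunk only the region around an edited range [a, b), then splice the
/// untouched prefix and suffix boundaries back in.
fn edit_and_rechunk(data: &[u8], a: usize, b: usize, replacement: &[u8]) -> Vec<usize> {
    assert_eq!(replacement.len(), b - a, "sketch assumes a same-length edit");

    // Enclosing stable boundaries (trivially chunk-aligned for this chunker).
    let c_start = a - a % CHUNK;
    let c_end = (b + CHUNK - 1) / CHUNK * CHUNK;

    // Segment = original [c_start, a) + new bytes + original [b, c_end).
    let mut segment = data[c_start..a].to_vec();
    segment.extend_from_slice(replacement);
    segment.extend_from_slice(&data[b..c_end]);

    // Boundaries = prefix [0, c_start) + re-chunked segment + suffix past c_end.
    let mut out = chunk_positions(&data[..c_start]);
    out.extend(chunk_positions(&segment).into_iter().map(|p| p + c_start));
    out.extend(chunk_positions(data).into_iter().filter(|&p| p > c_end));
    out
}
```

For the real CDC chunker, the spliced and full-file boundary lists agree from the stable boundary onward, which is the alignment the PR's prefix-mutation tests assert.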
Medium Risk
Adds new boundary-selection logic that will influence resume/partial dedup workflows; while largely additive and well-tested, mistakes could cause chunk misalignment and downstream hash/dedup inconsistencies.
Overview
Adds a new public helper, next_stable_chunk_boundary, to compute a restart-safe resume point using only existing chunk-boundary metadata (no file bytes), and re-exports it from xet_data::deduplication for downstream use. Updates chunking docs/proof notes around guaranteed partition alignment, and adds a dedicated stress/edge-case test suite that mutates file prefixes to assert chunk-boundary alignment after the returned stable boundary. Also bumps rustls-webpki from 0.103.12 to 0.103.13 in Cargo.lock.

Reviewed by Cursor Bugbot for commit ceaabb0. Bugbot is set up for automated code reviews on this repo. Configure here.