-
Notifications
You must be signed in to change notification settings - Fork 76
Stable chunk boundary detection #815
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| This update adds a new public deduplication helper for computing restart-safe chunk boundaries from existing chunk boundary metadata. | ||
|
|
||
| What changed | ||
| - Added `xet_data::deduplication::next_stable_chunk_boundary(starting_position, chunk_boundaries) -> Option<usize>`. | ||
| - Re-exported it from `xet_data::deduplication` so downstream crates can use it directly. | ||
| - The function scans forward from `starting_position` and returns the next chunk boundary that satisfies the stable-boundary condition: | ||
| - two consecutive chunk sizes in `[2 * min_chunk, max_chunk - min_chunk)`, | ||
| - where `min_chunk` and `max_chunk` are derived from chunking constants. | ||
|
|
||
| Why this matters | ||
| - Callers that already have chunk-boundary metadata can locate a stable resume boundary without re-reading file bytes. | ||
| - This enables deterministic alignment behavior for resumed/partial workflows that need chunk boundaries robust to prefix changes. | ||
|
|
||
| Usage notes | ||
| - `chunk_boundaries` should be monotonically increasing chunk-end offsets produced by the same chunking configuration. | ||
| - `starting_position` may be any byte offset (not necessarily a chunk boundary) and is used as the reference offset from which to search for the next stable chunk boundary. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -282,9 +282,15 @@ impl Chunker { | |
| /// partition_scan_bytes is the number of bytes to scan at each | ||
| /// proposed partition boundary in search of a valid chunk. | ||
| /// | ||
| /// Due to a known issue in how we do chunking, note that these | ||
| /// partitions are not 100% guaranteed to align. See the | ||
| /// parallel_chunking.pdf for details. | ||
| /// Partition alignment is guaranteed by the hash warmup fix: the | ||
| /// chunker feeds `min_chunk - 64 - 1` bytes before scanning for | ||
| /// boundaries, ensuring the gear hash window is fully warmed (purely | ||
| /// data-dependent) at all accepted trigger positions. This function | ||
| /// additionally verifies the absence of hidden triggers by re-chunking | ||
| /// with `min_chunk = 0`. See `parallel chunking.lyx` for the proof. | ||
| /// | ||
| /// For finding stable chunk boundaries from existing chunk boundaries (without | ||
| /// data access), see [`next_stable_chunk_boundary`]. | ||
| pub fn find_partitions<R: Read + Seek>( | ||
| reader: &mut R, | ||
| file_size: usize, | ||
|
|
@@ -353,6 +359,53 @@ pub fn find_partitions<R: Read + Seek>( | |
| Ok(partitions) | ||
| } | ||
|
|
||
| /// Given a list of chunk boundaries in a file and an arbitrary reference position, | ||
| /// returns the next stable chunk boundary at or after that position. | ||
| /// | ||
| /// `starting_position` may be any byte offset in the file; it does not need to | ||
| /// be an existing chunk boundary. The search starts at the first chunk boundary | ||
| /// `>= starting_position`. | ||
| /// | ||
| /// A stable chunk boundary is defined such that any possible changes in the data | ||
| /// before `starting_position` would produce the same chunk boundaries at the | ||
| /// stable boundary and later. The fixed data between `starting_position` and | ||
| /// the returned stable boundary is always sufficient to restore the chunker to | ||
| /// its original chunk boundaries. | ||
| /// | ||
| /// The stability condition requires two consecutive chunks after `starting_position`, | ||
| /// both with sizes in `[2 * min_chunk, max_chunk - min_chunk)`. The boundary | ||
| /// at the end of the second such chunk is the stable chunk boundary. | ||
| /// | ||
| /// The lower bound is `2 * min_chunk` rather than `min_chunk` (as used in | ||
| /// [`find_partitions`]) because this function operates on existing chunk | ||
| /// boundaries without data access, and cannot verify the absence of hidden | ||
| /// hash triggers in the `[c_k, c_k + min_chunk)` skip zone. A shadow-zone | ||
| /// trigger can advance a modified chunker by up to `min_chunk`, so the next | ||
| /// chunk must be at least `2 * min_chunk` to remain reachable. | ||
| /// | ||
| /// See `parallel chunking.lyx` for the full proof and `find_stable_start` in | ||
| /// `merkle_hash_subtree.rs` for the analogous construction in merkle hashing. | ||
| pub fn next_stable_chunk_boundary(starting_position: usize, chunk_boundaries: &[usize]) -> Option<usize> { | ||
| let minimum_chunk = *TARGET_CHUNK_SIZE / *MINIMUM_CHUNK_DIVISOR; | ||
| let maximum_chunk = *TARGET_CHUNK_SIZE * *MAXIMUM_CHUNK_MULTIPLIER; | ||
|
|
||
| let start_idx = chunk_boundaries.partition_point(|&x| x < starting_position); | ||
|
|
||
| for i in start_idx..chunk_boundaries.len().saturating_sub(2) { | ||
| let size_a = chunk_boundaries[i + 1] - chunk_boundaries[i]; | ||
| let size_b = chunk_boundaries[i + 2] - chunk_boundaries[i + 1]; | ||
|
|
||
| if size_a >= 2 * minimum_chunk | ||
| && size_a < maximum_chunk - minimum_chunk | ||
| && size_b >= 2 * minimum_chunk | ||
| && size_b < maximum_chunk - minimum_chunk | ||
| { | ||
| return Some(chunk_boundaries[i + 2]); | ||
| } | ||
| } | ||
| None | ||
| } | ||
|
|
||
| #[cfg(test)] | ||
| mod tests { | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This returns For the partial-file-editing flow (xet-core #717
A short doc paragraph here saying "callers receiving |
||
| use std::collections::HashSet; | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hardcoded global constants instead of parameterized chunk size
Low Severity
next_stable_chunk_boundaryderivesminimum_chunkandmaximum_chunkfrom the global*TARGET_CHUNK_SIZE, while the analogousfind_partitionsfunction acceptstarget_chunk_sizeas a parameter. If chunk boundaries were produced by aChunker::new(custom_size)with a non-default target (e.g., the wasm crate passes a user-provided value), the hardcoded globals will compute wrong size thresholds, silently breaking the stability guarantee. Acceptingtarget_chunk_sizeas a parameter would make the API consistent withfind_partitionsand safe for non-default configurations.Reviewed by Cursor Bugbot for commit d3c3988. Configure here.