Skip to content

Commit f54cde1

Browse files
wjones127claude
andauthored
fix: tolerate and auto-heal FRI bitmap straddle corruption (#6623) (#6625)
## Summary - Tables corrupted by the bug fixed in #6610 (a deferred-remap compaction committing concurrently with `optimize_indices` on a stale handle leaves a user index whose `fragment_bitmap` straddles an FRI rewrite group) used to panic in `Dataset::load_indices` and were unreadable. - `remap_fragment_bitmap` now tolerates the straddle on read: it drops the indexed old-frag IDs from the bitmap, logs `tracing::warn!`, and skips the new-frag insertion. Affected rows fall through to flat scan until the next `optimize_indices`. No data lost; no retrain. - The fix auto-heals on the next write. `commit_transaction` already seeds `build_manifest` from `load_indices()`, which now returns the cleaned bitmaps, so any commit (e.g. `delete("false")`) persists the corrected manifest. No new API, no user action required. ## Reproduction & test fixture `rust/examples/src/fri_straddle_datagen.rs` reproduces the corrupt state deterministically on pre-#6610 builds via the public `plan_compaction` / `CompactionTask::execute` / `optimize_indices` / `commit_compaction` APIs. Python can't reach this path — `Compaction.commit` hardcodes `CompactionOptions::default()` (no `defer_index_remap`) — so the generator is in Rust. On a fixed build the generator fails loudly on the conflict-resolver rejection, which is the desired forward-compat behaviour. The output is checked in at `test_data/fri_straddle_pre_6610/` and exercised by `index::tests::test_load_indices_tolerates_fri_straddle` and `test_auto_heal_persists_cleaned_bitmap`. The latter opens the fixture, performs a no-op `delete("false")`, and asserts the raw on-disk `fragment_bitmap` (read via `read_manifest_indexes`) matches the cleaned bitmap. ## Test plan - [x] `cargo test -p lance --lib index::tests::test_load_indices_tolerates_fri_straddle index::tests::test_auto_heal_persists_cleaned_bitmap` - [x] `cargo test -p lance --lib index::tests` — 65 passing - [x] `cargo clippy -p lance --lib -- -D warnings` - [x] `cargo fmt --all` ## Follow-ups (out of scope) - Explicit `Dataset::repair()` API for users who want an auditable trigger / report without writing. - Better `Dataset::validate()` reporting that surfaces the corruption explicitly rather than silently healing. Fixes #6623 --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 95a79de commit f54cde1

14 files changed

Lines changed: 7008 additions & 8 deletions

File tree

rust/lance-index/src/frag_reuse.rs

Lines changed: 18 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -323,14 +323,24 @@ impl FragReuseIndex {
323323

324324
if removed > 0 {
325325
if removed != group.old_frags.len() {
326-
// This should never happen because we always commit a full rewrite group
327-
// and we always reindex either the entire group or nothing.
328-
// We use invalid input to be consistent with
329-
// dataset::transaction::recalculate_fragment_bitmap
330-
return Err(Error::invalid_input(format!(
331-
"The compaction plan included a rewrite group that was a split of indexed and non-indexed data: {:?}",
332-
group.old_frags
333-
)));
326+
// Straddle: the index covered only part of this rewrite
327+
// group. Caused by the bug fixed in
328+
// <https://github.com/lance-format/lance/pull/6610>.
329+
// We've already removed the indexed old_frags from the
330+
// bitmap above; deliberately do NOT insert new_frags,
331+
// since the merged fragment also contains rows that
332+
// were never indexed. Affected rows fall through to
333+
// flat scan until the next optimize_indices. The fix
334+
// is persisted on the next write via build_manifest.
335+
tracing::warn!(
336+
"Healing straddling fragment-reuse rewrite group in index bitmap: \
337+
group {:?} was only partially indexed ({} of {} old fragments). \
338+
Affected rows will use flat scan until the next optimize_indices.",
339+
group.old_frags,
340+
removed,
341+
group.old_frags.len(),
342+
);
343+
continue;
334344
}
335345

336346
for new_frag in group.new_frags.iter() {

rust/lance/src/index.rs

Lines changed: 64 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7138,4 +7138,68 @@ mod tests {
71387138
assert_eq!(desc.index_type(), "IVF_FLAT");
71397139
assert!(!desc.field_ids().is_empty());
71407140
}
7141+
7142+
/// FRI-straddle corruption (PR #6610) used to panic in `load_indices`.
7143+
/// The fixture is a pre-#6610 dataset where a user index's
7144+
/// `fragment_bitmap` only partially covers a rewrite group. After the
7145+
/// tolerant-load fix `load_indices` returns Ok; affected old-frag IDs
7146+
/// are dropped, no new-frag IDs are inserted, and `validate()` succeeds.
7147+
#[tokio::test]
7148+
async fn test_load_indices_tolerates_fri_straddle() {
7149+
let tmp = copy_test_data_to_tmp("fri_straddle_pre_6610/fri_straddle_dataset").unwrap();
7150+
let uri = format!("file://{}", tmp.std_path().display());
7151+
let dataset = Dataset::open(&uri).await.unwrap();
7152+
7153+
let indices = dataset.load_indices().await.unwrap();
7154+
assert!(!indices.is_empty());
7155+
dataset.validate().await.unwrap();
7156+
}
7157+
7158+
/// Any commit reseeds indices via `load_indices` → `build_manifest`,
7159+
/// so a single no-op write persists the cleaned bitmap to disk.
7160+
#[tokio::test]
7161+
async fn test_auto_heal_persists_cleaned_bitmap() {
7162+
use lance_table::io::manifest::read_manifest_indexes;
7163+
7164+
let tmp = copy_test_data_to_tmp("fri_straddle_pre_6610/fri_straddle_dataset").unwrap();
7165+
let uri = format!("file://{}", tmp.std_path().display());
7166+
7167+
// Sanity: the on-disk fixture has at least one straddling segment.
7168+
let pre = Dataset::open(&uri).await.unwrap();
7169+
let raw_pre =
7170+
read_manifest_indexes(&pre.object_store, &pre.manifest_location, &pre.manifest)
7171+
.await
7172+
.unwrap();
7173+
let cleaned = pre.load_indices().await.unwrap();
7174+
let any_changed = raw_pre
7175+
.iter()
7176+
.zip(cleaned.iter())
7177+
.any(|(r, c)| r.fragment_bitmap != c.fragment_bitmap);
7178+
assert!(
7179+
any_changed,
7180+
"fixture should have at least one segment whose bitmap is cleaned at load"
7181+
);
7182+
drop(pre);
7183+
7184+
// No-op delete commits a fresh manifest seeded from cleaned indices.
7185+
let mut dataset = Dataset::open(&uri).await.unwrap();
7186+
dataset.delete("false").await.unwrap();
7187+
7188+
// Reopen and read raw manifest indices: cleaned bitmaps now persisted.
7189+
let healed = Dataset::open(&uri).await.unwrap();
7190+
let raw_post = read_manifest_indexes(
7191+
&healed.object_store,
7192+
&healed.manifest_location,
7193+
&healed.manifest,
7194+
)
7195+
.await
7196+
.unwrap();
7197+
let cleaned_post = healed.load_indices().await.unwrap();
7198+
for (r, c) in raw_post.iter().zip(cleaned_post.iter()) {
7199+
assert_eq!(
7200+
r.fragment_bitmap, c.fragment_bitmap,
7201+
"after auto-heal, raw on-disk bitmap should match cleaned bitmap"
7202+
);
7203+
}
7204+
}
71417205
}
Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,37 @@
1+
# `fri_straddle_pre_6610`
2+
3+
A dataset corrupted by the bug fixed in
4+
[lance-format/lance#6610](https://github.com/lance-format/lance/pull/6610):
5+
a deferred-remap compaction commits concurrently with `optimize_indices`
6+
on a stale handle, leaving a user index whose `fragment_bitmap` straddles
7+
a fragment-reuse rewrite group. `Dataset::list_indices()` panics with
8+
"The compaction plan included a rewrite group that was a split of indexed
9+
and non-indexed data".
10+
11+
Used by `dataset::repair` Rust tests to verify detection,
12+
`Dataset::validate()` error reporting, and `Dataset::repair()` against a
13+
real on-disk corrupt manifest.
14+
15+
## Regenerating
16+
17+
The reproduction is deterministic in Rust because we can drive
18+
`plan_compaction` / `CompactionTask::execute` / `commit_compaction`
19+
directly with `defer_index_remap=true` and interleave `optimize_indices`
20+
between them. Python's `Compaction.commit` hardcodes
21+
`CompactionOptions::default()` (no defer-remap), so a Python datagen
22+
cannot reach this path without a tight commit race.
23+
24+
The generator is a standalone crate at `datagen/` that pins `lance` to
25+
the last pre-#6610 release tag (`v6.0.0-beta.3`). It is deliberately not
26+
part of the parent Cargo workspace so it can compile against the buggy
27+
build. On a fixed build the conflict resolver rejects the rewrite as a
28+
retryable conflict and the generator fails loudly. To regenerate:
29+
30+
```bash
31+
cargo run --release \
32+
--manifest-path test_data/fri_straddle_pre_6610/datagen/Cargo.toml -- \
33+
test_data/fri_straddle_pre_6610/fri_straddle_dataset
34+
```
35+
36+
Bump the pinned tag in `datagen/Cargo.toml` only to another pre-#6610
37+
build.

0 commit comments

Comments
 (0)