fix: fjall compaction, Drop deadlock, and pre-commit hook toolchain by mwiewior · Pull Request #106 · biodatageeks/datafusion-bio-functions

mwiewior · 2026-04-05T06:51:15Z

Summary

load_sift_cache: add missing major_compact() after persist — without this, every SIFT lookup probes hundreds of L0 SSTs
CacheLoader: add mem::forget after persist + compact to avoid fjall Drop deadlock on large datasets (bounded flume channel fills up)
cache_builder: apply mem::forget fix for both build_variation_fjall_from_parquet() and build_sift_fjall() (cherry-pick from unmerged fix: use mem::forget to avoid fjall Drop deadlock (#86) #87)
pre-commit: switch from doublify/pre-commit-rust v1.0 (2019) to local hooks using the project's own cargo toolchain — the old hooks used a rustfmt that didn't support edition 2024 let-chains
compact_fjall: new utility example for re-compacting existing uncompacted stores without re-ingesting data

Supersedes #87. Closes #86.

Root cause of ~10% perf degradation

Both variation.fjall (1,829 files) and translation_sift.fjall (437 files) were fully uncompacted — every table file was an L0 SST at ~62MB with zero merged levels. This means every point lookup probes bloom filters across all files instead of a handful of compacted segments.

Store	L0 SST files	Total size	Expected after compaction
variation.fjall	1,829	109 GB	~5-10 files
translation_sift.fjall	437	26 GB	~3-5 files

Cause: load_sift_cache never called major_compact(). For variation, either the build was killed during the Drop hang (issue #86) or compaction itself was interrupted.

Test plan

All 490 lib tests pass
cargo clippy --all-targets --all-features clean
cargo fmt --all --check clean
Pre-commit hooks pass with new local config
Rebuild fjall stores and verify compaction completes
Run compact_fjall on existing uncompacted stores

🤖 Generated with Claude Code

1. load_sift_cache: add major_compact() after persist — without this, every SIFT lookup probes hundreds of L0 SSTs (~10% slower) 2. CacheLoader (loader.rs): add mem::forget after persist + compact to avoid fjall Drop deadlock on large datasets (bounded flume channel fills up when background compaction is still running) 3. cache_builder.rs: apply mem::forget fix for both build_variation_fjall_from_parquet() and build_sift_fjall() 4. pre-commit: switch from doublify/pre-commit-rust v1.0 (2019) to local hooks using the project's own cargo toolchain — the old hooks used a rustfmt that didn't support edition 2024 let-chains Also adds compact_fjall utility for re-compacting existing uncompacted stores without re-ingesting data. Closes #86 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Picks up the HGNC_ID propagation fix from biodatageeks/datafusion-bio-formats#162. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…dlock Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…gradation load_context_kv was writing translations/exons keyspaces directly into variation.fjall, adding unused keyspaces that increased fjall block cache pressure during point lookups. Profiling showed variation_lookup time growing progressively across chromosomes (+41s over a full WGS run). Changes: - load_context_kv: write to separate context.fjall instead of variation.fjall - profile_annotation: auto-detect variation.fjall and enable use_fjall=true - strip_fjall: new diagnostic tool to check for extra keyspaces - kv_store: add tests ensuring create/open produce only meta+data keyspaces - sift_store: add tests ensuring create/ingest produce only sift keyspace Expected keyspace layout after rebuild: - variation.fjall: default + meta + data (3 dirs) - translation_sift.fjall: default + sift (2 dirs) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mwiewior · 2026-04-05T11:55:57Z

@claude pls review

claude · 2026-04-05T11:56:10Z

Claude finished @mwiewior's task in 1m 56s —— View job

PR Review

Gather context and read changed files
Review core logic changes (fjall compaction, Drop deadlock fixes)
Review pre-commit config changes
Review new examples (compact_fjall.rs, strip_fjall.rs)
Review kv_store and sift_store additions
Summarize findings

Overall the PR is well-motivated and well-documented. The root cause analysis is solid and the fixes are generally correct. One medium-severity bug, two minor issues.

Bug: `compact_fjall.rs` will create the extra keyspaces it's trying to avoid

datafusion/bio-function-vep/examples/compact_fjall.rs:43-51

let known_keyspaces = ["data", "meta", "sift", "translations", "exons"];

for ks_name in &known_keyspaces {
    // ...
    let ks = match db.keyspace(ks_name, fjall::KeyspaceCreateOptions::default) {
        Ok(ks) => ks,
        Err(_) => continue,
    };

db.keyspace() with KeyspaceCreateOptions::default creates the keyspace if it doesn't already exist. Running compact_fjall on variation.fjall will open/create "sift", "translations", and "exons" — exactly the extra keyspaces this PR removes for the ~10% perf regression. By contrast, strip_fjall.rs correctly calls db.keyspace_exists(name) first and skips missing ones.

Fix: add an existence guard before the db.keyspace() call:

if !db.keyspace_exists(ks_name) {
    continue;
}
let ks = match db.keyspace(ks_name, fjall::KeyspaceCreateOptions::default) {

Fix this →

Minor: `strip_fjall.rs` lists "meta" as expected for `translation_sift.fjall`

datafusion/bio-function-vep/examples/strip_fjall.rs:120

check_keyspaces(&sift_fjall, &["meta", "sift"]);

SiftKvStore::create() only opens the "sift" keyspace — there's no "meta" keyspace in translation_sift.fjall (confirmed by test_create_produces_minimal_keyspaces asserting exactly 2 dirs: fjall default + sift). Listing "meta" as expected is harmless (it just won't appear since keyspace_exists("meta") will be false), but it can mislead users of the tool who see no ✓ 'meta' line and wonder if the store is corrupt.

Suggest: check_keyspaces(&sift_fjall, &["sift"]).

Minor: `mem::forget` in `cache_builder.rs` lacks the `#[cfg(not(test))]` guard

datafusion/bio-function-vep/src/cache_builder.rs:908 and :1459

loader.rs wraps its mem::forget in #[cfg(not(test))] so that tests can correctly release the fjall lock file and allow subsequent VepKvStore::open() calls in the same test directory. The cache_builder.rs builder functions don't have this guard. This is probably safe today since tests don't call build_variation_fjall_from_parquet() / build_sift_fjall() directly, but adding the guard consistently would prevent a hard-to-diagnose test failure if that changes.

What's well done

#[cfg(not(test))] guard in loader.rs — exactly right; this is the correct way to handle the test vs. production trade-off for mem::forget.
Compaction before mem::forget — persist() + major_compact() is called before forgetting the store in all paths, so no data is lost.
Deadlock explanation comments — clear, with issue links, explaining the bounded flume channel mechanism. Future maintainers will understand why this isn't a memory leak.
Regression tests for keyspace count — test_create_produces_minimal_keyspaces / test_open_does_not_create_extra_keyspaces in both kv_store.rs and sift_store.rs will catch the previous extra-keyspace bug at build time.
Pre-commit migration — switching from the 2019 doublify/pre-commit-rust (which shipped an old rustfmt without edition 2024 support) to local cargo fmt/cargo clippy is the right call.
load_context_kv.rs behavioral fix — writing context keyspaces to a separate context.fjall instead of polluting variation.fjall is architecturally cleaner; the comment explaining the perf implication is helpful.

… cfg(not(test)) guard 1. compact_fjall: add keyspace_exists() check before db.keyspace() to prevent creating extra keyspaces when compacting (db.keyspace() with default options creates the keyspace if missing) 2. strip_fjall: remove "meta" from expected keyspaces for translation_sift.fjall — SiftKvStore only creates "sift", no "meta" 3. cache_builder: wrap mem::forget in #[cfg(not(test))] consistent with loader.rs, so tests properly release fjall lock files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

These keyspaces should never exist in production fjall stores. Keeping them in the list would compact keyspaces that shouldn't be there. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

variation.fjall should only compact meta+data; translation_sift.fjall should only compact sift. The previous flat list would attempt to open all keyspace names on any store, risking creation of foreign keyspaces. Now auto-detects from path: translation_sift → sift only, else → meta+data. Explicit --keyspace arg still honored for manual use. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mem::forget skipped fjall's Drop handler which runs GC/maintenance to delete old pre-compaction SST files. This left ~220GB of dead files on disk after building variation.fjall (331GB instead of 110GB). After major_compact() + persist(), there is no pending work for the worker thread, so Drop completes without the deadlock from issue #86. Changed in: cache_builder.rs, loader.rs, compact_fjall.rs, load_sift_cache.rs, load_context_kv.rs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cache_builder's build_query() for transcripts used SELECT * which didn't propagate gene_hgnc_id from sibling transcripts sharing the same gene_symbol. This caused 1,730 HGNC_ID mismatches on chr9 (FGF7P3 — 19/46 transcripts missing HGNC:26671). Port the same fix from datafusion-bio-formats PR #162: replace SELECT * with an explicit column list using COALESCE + FIRST_VALUE IGNORE NULLS OVER (PARTITION BY gene_symbol) for gene_hgnc_id. TODO: consolidate build_dedup_query() into bio-formats as a shared library function to avoid duplicate query logic across repos. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

VEP only emits HGNC_ID for transcripts with gene_symbol_source='HGNC'. The cache-level propagation was filling HGNC_ID for EntrezGene/RFAM transcripts too, causing 39 false-positive mismatches on chr9. Add gene_symbol_source = 'HGNC' guard to the COALESCE expression. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

VEP's merge_features() propagates gene_hgnc_id purely by gene_symbol, regardless of gene_symbol_source. The gene_symbol_source='HGNC' guard was wrong at cache build time — FGF7P3 has 45/46 transcripts with source=EntrezGene that still need HGNC_ID in the cache. The source-based filtering is a runtime concern (hgnc_backfill flag, #104) that controls whether HGNC_ID appears in annotation output, not whether it's stored in the cache. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mwiewior and others added 5 commits April 5, 2026 08:50

chore: bump datafusion-bio-format-ensembl-cache to b82396d

8648689

Picks up the HGNC_ID propagation fix from biodatageeks/datafusion-bio-formats#162. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add mem::forget to load_context_kv to prevent potential Drop dea…

c17a4ca

…dlock Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

VEP cache fix

d67560b

mwiewior and others added 8 commits April 5, 2026 14:05

fix: remove translations/exons from compact_fjall known keyspaces

02d39bd

These keyspaces should never exist in production fjall stores. Keeping them in the list would compact keyspaces that shouldn't be there. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

bio-format update

20a7306

mwiewior merged commit 40823b9 into master Apr 5, 2026
2 checks passed

mwiewior deleted the fix/fjall-compaction-and-drop-deadlock branch April 6, 2026 15:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: fjall compaction, Drop deadlock, and pre-commit hook toolchain#106

fix: fjall compaction, Drop deadlock, and pre-commit hook toolchain#106
mwiewior merged 13 commits into
masterfrom
fix/fjall-compaction-and-drop-deadlock

mwiewior commented Apr 5, 2026

Uh oh!

mwiewior commented Apr 5, 2026

Uh oh!

claude Bot commented Apr 5, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mwiewior commented Apr 5, 2026

Summary

Root cause of ~10% perf degradation

Test plan

Uh oh!

mwiewior commented Apr 5, 2026

Uh oh!

claude Bot commented Apr 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review

Bug: compact_fjall.rs will create the extra keyspaces it's trying to avoid

Minor: strip_fjall.rs lists "meta" as expected for translation_sift.fjall

Minor: mem::forget in cache_builder.rs lacks the #[cfg(not(test))] guard

What's well done

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented Apr 5, 2026 •

edited

Loading

Bug: `compact_fjall.rs` will create the extra keyspaces it's trying to avoid

Minor: `strip_fjall.rs` lists "meta" as expected for `translation_sift.fjall`

Minor: `mem::forget` in `cache_builder.rs` lacks the `#[cfg(not(test))]` guard