Skip to content

fix: fjall compaction, Drop deadlock, and pre-commit hook toolchain#106

Merged
mwiewior merged 13 commits into
masterfrom
fix/fjall-compaction-and-drop-deadlock
Apr 5, 2026
Merged

fix: fjall compaction, Drop deadlock, and pre-commit hook toolchain#106
mwiewior merged 13 commits into
masterfrom
fix/fjall-compaction-and-drop-deadlock

Conversation

@mwiewior
Copy link
Copy Markdown
Contributor

@mwiewior mwiewior commented Apr 5, 2026

Summary

  • load_sift_cache: add missing major_compact() after persist — without this, every SIFT lookup probes hundreds of L0 SSTs
  • CacheLoader: add mem::forget after persist + compact to avoid fjall Drop deadlock on large datasets (bounded flume channel fills up)
  • cache_builder: apply mem::forget fix for both build_variation_fjall_from_parquet() and build_sift_fjall() (cherry-pick from unmerged fix: use mem::forget to avoid fjall Drop deadlock (#86) #87)
  • pre-commit: switch from doublify/pre-commit-rust v1.0 (2019) to local hooks using the project's own cargo toolchain — the old hooks used a rustfmt that didn't support edition 2024 let-chains
  • compact_fjall: new utility example for re-compacting existing uncompacted stores without re-ingesting data

Supersedes #87. Closes #86.

Root cause of ~10% perf degradation

Both variation.fjall (1,829 files) and translation_sift.fjall (437 files) were fully uncompacted — every table file was an L0 SST at ~62MB with zero merged levels. This means every point lookup probes bloom filters across all files instead of a handful of compacted segments.

Store L0 SST files Total size Expected after compaction
variation.fjall 1,829 109 GB ~5-10 files
translation_sift.fjall 437 26 GB ~3-5 files

Cause: load_sift_cache never called major_compact(). For variation, either the build was killed during the Drop hang (issue #86) or compaction itself was interrupted.

Test plan

  • All 490 lib tests pass
  • cargo clippy --all-targets --all-features clean
  • cargo fmt --all --check clean
  • Pre-commit hooks pass with new local config
  • Rebuild fjall stores and verify compaction completes
  • Run compact_fjall on existing uncompacted stores

🤖 Generated with Claude Code

mwiewior and others added 5 commits April 5, 2026 08:50
1. load_sift_cache: add major_compact() after persist — without this,
   every SIFT lookup probes hundreds of L0 SSTs (~10% slower)

2. CacheLoader (loader.rs): add mem::forget after persist + compact to
   avoid fjall Drop deadlock on large datasets (bounded flume channel
   fills up when background compaction is still running)

3. cache_builder.rs: apply mem::forget fix for both
   build_variation_fjall_from_parquet() and build_sift_fjall()

4. pre-commit: switch from doublify/pre-commit-rust v1.0 (2019) to
   local hooks using the project's own cargo toolchain — the old hooks
   used a rustfmt that didn't support edition 2024 let-chains

Also adds compact_fjall utility for re-compacting existing uncompacted
stores without re-ingesting data.

Closes #86

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Picks up the HGNC_ID propagation fix from
biodatageeks/datafusion-bio-formats#162.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dlock

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…gradation

load_context_kv was writing translations/exons keyspaces directly into
variation.fjall, adding unused keyspaces that increased fjall block cache
pressure during point lookups. Profiling showed variation_lookup time
growing progressively across chromosomes (+41s over a full WGS run).

Changes:
- load_context_kv: write to separate context.fjall instead of variation.fjall
- profile_annotation: auto-detect variation.fjall and enable use_fjall=true
- strip_fjall: new diagnostic tool to check for extra keyspaces
- kv_store: add tests ensuring create/open produce only meta+data keyspaces
- sift_store: add tests ensuring create/ingest produce only sift keyspace

Expected keyspace layout after rebuild:
- variation.fjall: default + meta + data (3 dirs)
- translation_sift.fjall: default + sift (2 dirs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mwiewior
Copy link
Copy Markdown
Contributor Author

mwiewior commented Apr 5, 2026

@claude pls review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 5, 2026

Claude finished @mwiewior's task in 1m 56s —— View job


PR Review

  • Gather context and read changed files
  • Review core logic changes (fjall compaction, Drop deadlock fixes)
  • Review pre-commit config changes
  • Review new examples (compact_fjall.rs, strip_fjall.rs)
  • Review kv_store and sift_store additions
  • Summarize findings

Overall the PR is well-motivated and well-documented. The root cause analysis is solid and the fixes are generally correct. One medium-severity bug, two minor issues.


Bug: compact_fjall.rs will create the extra keyspaces it's trying to avoid

datafusion/bio-function-vep/examples/compact_fjall.rs:43-51

let known_keyspaces = ["data", "meta", "sift", "translations", "exons"];

for ks_name in &known_keyspaces {
    // ...
    let ks = match db.keyspace(ks_name, fjall::KeyspaceCreateOptions::default) {
        Ok(ks) => ks,
        Err(_) => continue,
    };

db.keyspace() with KeyspaceCreateOptions::default creates the keyspace if it doesn't already exist. Running compact_fjall on variation.fjall will open/create "sift", "translations", and "exons" — exactly the extra keyspaces this PR removes for the ~10% perf regression. By contrast, strip_fjall.rs correctly calls db.keyspace_exists(name) first and skips missing ones.

Fix: add an existence guard before the db.keyspace() call:

if !db.keyspace_exists(ks_name) {
    continue;
}
let ks = match db.keyspace(ks_name, fjall::KeyspaceCreateOptions::default) {

Fix this →


Minor: strip_fjall.rs lists "meta" as expected for translation_sift.fjall

datafusion/bio-function-vep/examples/strip_fjall.rs:120

check_keyspaces(&sift_fjall, &["meta", "sift"]);

SiftKvStore::create() only opens the "sift" keyspace — there's no "meta" keyspace in translation_sift.fjall (confirmed by test_create_produces_minimal_keyspaces asserting exactly 2 dirs: fjall default + sift). Listing "meta" as expected is harmless (it just won't appear since keyspace_exists("meta") will be false), but it can mislead users of the tool who see no ✓ 'meta' line and wonder if the store is corrupt.

Suggest: check_keyspaces(&sift_fjall, &["sift"]).


Minor: mem::forget in cache_builder.rs lacks the #[cfg(not(test))] guard

datafusion/bio-function-vep/src/cache_builder.rs:908 and :1459

loader.rs wraps its mem::forget in #[cfg(not(test))] so that tests can correctly release the fjall lock file and allow subsequent VepKvStore::open() calls in the same test directory. The cache_builder.rs builder functions don't have this guard. This is probably safe today since tests don't call build_variation_fjall_from_parquet() / build_sift_fjall() directly, but adding the guard consistently would prevent a hard-to-diagnose test failure if that changes.


What's well done

  • #[cfg(not(test))] guard in loader.rs — exactly right; this is the correct way to handle the test vs. production trade-off for mem::forget.
  • Compaction before mem::forgetpersist() + major_compact() is called before forgetting the store in all paths, so no data is lost.
  • Deadlock explanation comments — clear, with issue links, explaining the bounded flume channel mechanism. Future maintainers will understand why this isn't a memory leak.
  • Regression tests for keyspace counttest_create_produces_minimal_keyspaces / test_open_does_not_create_extra_keyspaces in both kv_store.rs and sift_store.rs will catch the previous extra-keyspace bug at build time.
  • Pre-commit migration — switching from the 2019 doublify/pre-commit-rust (which shipped an old rustfmt without edition 2024 support) to local cargo fmt/cargo clippy is the right call.
  • load_context_kv.rs behavioral fix — writing context keyspaces to a separate context.fjall instead of polluting variation.fjall is architecturally cleaner; the comment explaining the perf implication is helpful.

mwiewior and others added 8 commits April 5, 2026 14:05
… cfg(not(test)) guard

1. compact_fjall: add keyspace_exists() check before db.keyspace() to
   prevent creating extra keyspaces when compacting (db.keyspace() with
   default options creates the keyspace if missing)

2. strip_fjall: remove "meta" from expected keyspaces for
   translation_sift.fjall — SiftKvStore only creates "sift", no "meta"

3. cache_builder: wrap mem::forget in #[cfg(not(test))] consistent with
   loader.rs, so tests properly release fjall lock files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These keyspaces should never exist in production fjall stores.
Keeping them in the list would compact keyspaces that shouldn't be there.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
variation.fjall should only compact meta+data; translation_sift.fjall
should only compact sift. The previous flat list would attempt to open
all keyspace names on any store, risking creation of foreign keyspaces.

Now auto-detects from path: translation_sift → sift only, else → meta+data.
Explicit --keyspace arg still honored for manual use.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mem::forget skipped fjall's Drop handler which runs GC/maintenance to
delete old pre-compaction SST files. This left ~220GB of dead files
on disk after building variation.fjall (331GB instead of 110GB).

After major_compact() + persist(), there is no pending work for the
worker thread, so Drop completes without the deadlock from issue #86.

Changed in: cache_builder.rs, loader.rs, compact_fjall.rs,
load_sift_cache.rs, load_context_kv.rs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cache_builder's build_query() for transcripts used SELECT * which
didn't propagate gene_hgnc_id from sibling transcripts sharing the
same gene_symbol. This caused 1,730 HGNC_ID mismatches on chr9
(FGF7P3 — 19/46 transcripts missing HGNC:26671).

Port the same fix from datafusion-bio-formats PR #162: replace
SELECT * with an explicit column list using COALESCE + FIRST_VALUE
IGNORE NULLS OVER (PARTITION BY gene_symbol) for gene_hgnc_id.

TODO: consolidate build_dedup_query() into bio-formats as a shared
library function to avoid duplicate query logic across repos.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
VEP only emits HGNC_ID for transcripts with gene_symbol_source='HGNC'.
The cache-level propagation was filling HGNC_ID for EntrezGene/RFAM
transcripts too, causing 39 false-positive mismatches on chr9.

Add gene_symbol_source = 'HGNC' guard to the COALESCE expression.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
VEP's merge_features() propagates gene_hgnc_id purely by gene_symbol,
regardless of gene_symbol_source. The gene_symbol_source='HGNC' guard
was wrong at cache build time — FGF7P3 has 45/46 transcripts with
source=EntrezGene that still need HGNC_ID in the cache.

The source-based filtering is a runtime concern (hgnc_backfill flag,
#104) that controls whether HGNC_ID appears in annotation output,
not whether it's stored in the cache.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mwiewior mwiewior merged commit 40823b9 into master Apr 5, 2026
2 checks passed
@mwiewior mwiewior deleted the fix/fjall-compaction-and-drop-deadlock branch April 6, 2026 15:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fjall Database::drop() deadlocks after cache build on large data

1 participant