fix: fjall compaction, Drop deadlock, and pre-commit hook toolchain#106
Conversation
1. load_sift_cache: add major_compact() after persist — without this, every SIFT lookup probes hundreds of L0 SSTs (~10% slower) 2. CacheLoader (loader.rs): add mem::forget after persist + compact to avoid fjall Drop deadlock on large datasets (bounded flume channel fills up when background compaction is still running) 3. cache_builder.rs: apply mem::forget fix for both build_variation_fjall_from_parquet() and build_sift_fjall() 4. pre-commit: switch from doublify/pre-commit-rust v1.0 (2019) to local hooks using the project's own cargo toolchain — the old hooks used a rustfmt that didn't support edition 2024 let-chains Also adds compact_fjall utility for re-compacting existing uncompacted stores without re-ingesting data. Closes #86 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Picks up the HGNC_ID propagation fix from biodatageeks/datafusion-bio-formats#162. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dlock Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…gradation load_context_kv was writing translations/exons keyspaces directly into variation.fjall, adding unused keyspaces that increased fjall block cache pressure during point lookups. Profiling showed variation_lookup time growing progressively across chromosomes (+41s over a full WGS run). Changes: - load_context_kv: write to separate context.fjall instead of variation.fjall - profile_annotation: auto-detect variation.fjall and enable use_fjall=true - strip_fjall: new diagnostic tool to check for extra keyspaces - kv_store: add tests ensuring create/open produce only meta+data keyspaces - sift_store: add tests ensuring create/ingest produce only sift keyspace Expected keyspace layout after rebuild: - variation.fjall: default + meta + data (3 dirs) - translation_sift.fjall: default + sift (2 dirs) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@claude pls review |
|
Claude finished @mwiewior's task in 1m 56s —— View job PR Review
Overall the PR is well-motivated and well-documented. The root cause analysis is solid and the fixes are generally correct. One medium-severity bug, two minor issues. Bug:
|
… cfg(not(test)) guard 1. compact_fjall: add keyspace_exists() check before db.keyspace() to prevent creating extra keyspaces when compacting (db.keyspace() with default options creates the keyspace if missing) 2. strip_fjall: remove "meta" from expected keyspaces for translation_sift.fjall — SiftKvStore only creates "sift", no "meta" 3. cache_builder: wrap mem::forget in #[cfg(not(test))] consistent with loader.rs, so tests properly release fjall lock files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These keyspaces should never exist in production fjall stores. Keeping them in the list would compact keyspaces that shouldn't be there. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
variation.fjall should only compact meta+data; translation_sift.fjall should only compact sift. The previous flat list would attempt to open all keyspace names on any store, risking creation of foreign keyspaces. Now auto-detects from path: translation_sift → sift only, else → meta+data. Explicit --keyspace arg still honored for manual use. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mem::forget skipped fjall's Drop handler which runs GC/maintenance to delete old pre-compaction SST files. This left ~220GB of dead files on disk after building variation.fjall (331GB instead of 110GB). After major_compact() + persist(), there is no pending work for the worker thread, so Drop completes without the deadlock from issue #86. Changed in: cache_builder.rs, loader.rs, compact_fjall.rs, load_sift_cache.rs, load_context_kv.rs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cache_builder's build_query() for transcripts used SELECT * which didn't propagate gene_hgnc_id from sibling transcripts sharing the same gene_symbol. This caused 1,730 HGNC_ID mismatches on chr9 (FGF7P3 — 19/46 transcripts missing HGNC:26671). Port the same fix from datafusion-bio-formats PR #162: replace SELECT * with an explicit column list using COALESCE + FIRST_VALUE IGNORE NULLS OVER (PARTITION BY gene_symbol) for gene_hgnc_id. TODO: consolidate build_dedup_query() into bio-formats as a shared library function to avoid duplicate query logic across repos. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
VEP only emits HGNC_ID for transcripts with gene_symbol_source='HGNC'. The cache-level propagation was filling HGNC_ID for EntrezGene/RFAM transcripts too, causing 39 false-positive mismatches on chr9. Add gene_symbol_source = 'HGNC' guard to the COALESCE expression. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
VEP's merge_features() propagates gene_hgnc_id purely by gene_symbol, regardless of gene_symbol_source. The gene_symbol_source='HGNC' guard was wrong at cache build time — FGF7P3 has 45/46 transcripts with source=EntrezGene that still need HGNC_ID in the cache. The source-based filtering is a runtime concern (hgnc_backfill flag, #104) that controls whether HGNC_ID appears in annotation output, not whether it's stored in the cache. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
major_compact()after persist — without this, every SIFT lookup probes hundreds of L0 SSTsmem::forgetafter persist + compact to avoid fjallDropdeadlock on large datasets (bounded flume channel fills up)mem::forgetfix for bothbuild_variation_fjall_from_parquet()andbuild_sift_fjall()(cherry-pick from unmerged fix: use mem::forget to avoid fjall Drop deadlock (#86) #87)doublify/pre-commit-rustv1.0 (2019) to local hooks using the project's own cargo toolchain — the old hooks used arustfmtthat didn't support edition 2024 let-chainsSupersedes #87. Closes #86.
Root cause of ~10% perf degradation
Both
variation.fjall(1,829 files) andtranslation_sift.fjall(437 files) were fully uncompacted — every table file was an L0 SST at ~62MB with zero merged levels. This means every point lookup probes bloom filters across all files instead of a handful of compacted segments.Cause:
load_sift_cachenever calledmajor_compact(). For variation, either the build was killed during the Drop hang (issue #86) or compaction itself was interrupted.Test plan
cargo clippy --all-targets --all-featurescleancargo fmt --all --checkcleancompact_fjallon existing uncompacted stores🤖 Generated with Claude Code