Problem
The chr2 merged-cache comparison report shows 294 HGNC_ID false positives where datafusion-bio-function-vep emits an HGNC ID and Ensembl VEP emits an empty value:
- Report:
/Users/mwiewior/research/git/vepyr/e2e-testing/reports/fast_chr2_merged_report.json
- Cache:
/Users/mwiewior/workspace/data_vepyr/115_GRCh38_merged
- 281 extras on
XR_007076390.1 / NBAS / HGNC:15625
- 13 extras on
NR_037931.2 / ANAPC1P1 / HGNC:44150
- All affected rows are
SOURCE=RefSeq
This is not a per-variant propagation bug. VEP propagation is broader than a single variant, but narrower than the current chromosome/global backfill in this crate.
Confirmed VEP release/115 behavior
The relevant behavior is buffer-scoped:
buffer_size is a CLI option for the number of variations read before analysis, defaulting to 5000.
InputBuffer::next() fills up to buffer_size variants, preserves input order, carries a pre-buffer for chromosome transitions, and does not annotate one buffer across a chromosome boundary.
get_all_features_by_InputBuffer() computes cache regions touched by every variant in the current buffer, expands by upstream/downstream distance, fetches/cleans cache features, filters to the buffer min_max range expanded by that same distance, then calls merge_features().
AnnotationType::Transcript::merge_features() builds an HGNC map keyed by _gene_symbol from native _gene_hgnc_id values in the current buffer feature set, then assigns _gene_hgnc_id to same-symbol features in that same feature set. In refseq/merged, it also refills _gene_symbol, _gene_symbol_source, and _gene_hgnc_id from same _gene->stable_id features.
Source links:
Upstream dependency
This crate needs native HGNC provenance from datafusion-bio-format-ensembl-cache so it does not seed VEP-style propagation from cache-propagated values.
Related upstream issue: biodatageeks/datafusion-bio-formats#166
Preferred upstream behavior:
- Add
gene_hgnc_id_native to transcript parquet, populated only from raw VEP object keys gene_hgnc_id / _gene_hgnc_id.
- Remove export-time
gene_hgnc_id propagation, or at minimum ensure this crate never treats propagated gene_hgnc_id as a native donor.
- Keep
gene_hgnc_id_native as a plain pass-through column with no COALESCE / window backfill.
Short-term compatibility in this repo:
- If
gene_hgnc_id_native is absent, parse _gene_hgnc_id / gene_hgnc_id from raw_object_json.
- Do not use promoted
gene_hgnc_id as a donor when native provenance is unavailable unless explicitly running a legacy non-strict mode.
Implementation plan in this repo
-
Extend transcript hydration with native HGNC storage.
Add a gene_hgnc_id_native-style field to TranscriptFeature, and carry gene_symbol, gene_symbol_source, gene_stable_id, transcript source, and stable feature ID needed by VEP's merge_features() logic.
-
Remove chromosome/global HGNC backfill for strict VEP parity.
Delete or bypass backfill_missing_hgnc_ids() in strict parity mode. Reset any working HGNC value to the native value before applying buffer-local propagation.
-
Add VEP-style input-buffer chunking.
Default buffer_size to 5000, chunk the logical input variant stream before CSQ expansion, preserve input order, do not cross chromosome boundaries, compute per-buffer min_max, and expose a config override matching VEP CLI behavior.
-
Compute the buffer-local transcript feature set.
Use the buffer min_max range expanded by the same upstream/downstream distance used for consequence annotation. If transcripts are already loaded per chromosome, filter the in-memory feature set by that expanded range. If transcript loading becomes lazy later, load only the 1 Mb cache regions touched by the buffer before applying the same min_max filter.
-
Port the HGNC-relevant parts of merge_features().
Build hgnc_by_symbol from native _gene_hgnc_id values in the current buffer feature set; apply same-symbol HGNC propagation only inside that feature set; for refseq/merged, build the same-gene refill map keyed by gene_stable_id for gene_symbol, gene_symbol_source, and native HGNC; do not carry propagated values into the next buffer.
-
Keep propagation as a buffer-local overlay.
Mutate a per-buffer working copy or maintain transcript_id -> propagated_hgnc_id overlay state for the current buffer. Do not permanently mutate the shared per-contig transcript cache.
Acceptance criteria
- Unit test symbol-keyed
merge_features() HGNC propagation with native donors and null recipients.
- Unit test the
refseq/merged same-gene_stable_id refill path.
- Regression-test chr2 NBAS: variants before the 5000-record buffer boundary stay empty; variants in the next buffer can receive
HGNC:15625.
- Regression-test chr2 ANAPC1P1: variants before the 5000-record buffer boundary stay empty; variants in the next buffer can receive
HGNC:44150.
- Rerun the chr2 merged report and verify the 294
HGNC_ID false positives drop to zero without introducing post-boundary false negatives.
Performance guard
Do not re-query parquet per buffer if the current execution path already hydrates transcripts per contig. Build a coordinate-sorted transcript index or interval index if repeated per-buffer scans show up in profiles. The semantic requirement is buffer-local propagation; the implementation should be proportional to candidate transcripts near the buffer, not all chromosome transcripts multiplied by number of buffers.
Problem
The chr2 merged-cache comparison report shows 294
HGNC_IDfalse positives wheredatafusion-bio-function-vepemits an HGNC ID and Ensembl VEP emits an empty value:/Users/mwiewior/research/git/vepyr/e2e-testing/reports/fast_chr2_merged_report.json/Users/mwiewior/workspace/data_vepyr/115_GRCh38_mergedXR_007076390.1/NBAS/HGNC:15625NR_037931.2/ANAPC1P1/HGNC:44150SOURCE=RefSeqThis is not a per-variant propagation bug. VEP propagation is broader than a single variant, but narrower than the current chromosome/global backfill in this crate.
Confirmed VEP release/115 behavior
The relevant behavior is buffer-scoped:
buffer_sizeis a CLI option for the number of variations read before analysis, defaulting to5000.InputBuffer::next()fills up tobuffer_sizevariants, preserves input order, carries a pre-buffer for chromosome transitions, and does not annotate one buffer across a chromosome boundary.get_all_features_by_InputBuffer()computes cache regions touched by every variant in the current buffer, expands by upstream/downstream distance, fetches/cleans cache features, filters to the buffermin_maxrange expanded by that same distance, then callsmerge_features().AnnotationType::Transcript::merge_features()builds an HGNC map keyed by_gene_symbolfrom native_gene_hgnc_idvalues in the current buffer feature set, then assigns_gene_hgnc_idto same-symbol features in that same feature set. Inrefseq/merged, it also refills_gene_symbol,_gene_symbol_source, and_gene_hgnc_idfrom same_gene->stable_idfeatures.Source links:
buffer_sizeoption/default: https://github.com/Ensembl/ensembl-vep/blob/release/115/modules/Bio/EnsEMBL/VEP/Config.pm#L118-L119 and https://github.com/Ensembl/ensembl-vep/blob/release/115/modules/Bio/EnsEMBL/VEP/Config.pm#L290-L296min_max: https://github.com/Ensembl/ensembl-vep/blob/release/115/modules/Bio/EnsEMBL/VEP/InputBuffer.pm#L437-L471Upstream dependency
This crate needs native HGNC provenance from
datafusion-bio-format-ensembl-cacheso it does not seed VEP-style propagation from cache-propagated values.Related upstream issue: biodatageeks/datafusion-bio-formats#166
Preferred upstream behavior:
gene_hgnc_id_nativeto transcript parquet, populated only from raw VEP object keysgene_hgnc_id/_gene_hgnc_id.gene_hgnc_idpropagation, or at minimum ensure this crate never treats propagatedgene_hgnc_idas a native donor.gene_hgnc_id_nativeas a plain pass-through column with noCOALESCE/ window backfill.Short-term compatibility in this repo:
gene_hgnc_id_nativeis absent, parse_gene_hgnc_id/gene_hgnc_idfromraw_object_json.gene_hgnc_idas a donor when native provenance is unavailable unless explicitly running a legacy non-strict mode.Implementation plan in this repo
Extend transcript hydration with native HGNC storage.
Add a
gene_hgnc_id_native-style field toTranscriptFeature, and carrygene_symbol,gene_symbol_source,gene_stable_id, transcriptsource, and stable feature ID needed by VEP'smerge_features()logic.Remove chromosome/global HGNC backfill for strict VEP parity.
Delete or bypass
backfill_missing_hgnc_ids()in strict parity mode. Reset any working HGNC value to the native value before applying buffer-local propagation.Add VEP-style input-buffer chunking.
Default
buffer_sizeto5000, chunk the logical input variant stream before CSQ expansion, preserve input order, do not cross chromosome boundaries, compute per-buffermin_max, and expose a config override matching VEP CLI behavior.Compute the buffer-local transcript feature set.
Use the buffer
min_maxrange expanded by the same upstream/downstream distance used for consequence annotation. If transcripts are already loaded per chromosome, filter the in-memory feature set by that expanded range. If transcript loading becomes lazy later, load only the 1 Mb cache regions touched by the buffer before applying the samemin_maxfilter.Port the HGNC-relevant parts of
merge_features().Build
hgnc_by_symbolfrom native_gene_hgnc_idvalues in the current buffer feature set; apply same-symbol HGNC propagation only inside that feature set; forrefseq/merged, build the same-gene refill map keyed bygene_stable_idforgene_symbol,gene_symbol_source, and native HGNC; do not carry propagated values into the next buffer.Keep propagation as a buffer-local overlay.
Mutate a per-buffer working copy or maintain
transcript_id -> propagated_hgnc_idoverlay state for the current buffer. Do not permanently mutate the shared per-contig transcript cache.Acceptance criteria
merge_features()HGNC propagation with native donors and null recipients.refseq/mergedsame-gene_stable_idrefill path.HGNC:15625.HGNC:44150.HGNC_IDfalse positives drop to zero without introducing post-boundary false negatives.Performance guard
Do not re-query parquet per buffer if the current execution path already hydrates transcripts per contig. Build a coordinate-sorted transcript index or interval index if repeated per-buffer scans show up in profiles. The semantic requirement is buffer-local propagation; the implementation should be proportional to candidate transcripts near the buffer, not all chromosome transcripts multiplied by number of buffers.