Skip to content

Match VEP buffer-scoped HGNC_ID propagation for merged RefSeq #143

@mwiewior

Description

@mwiewior

Problem

The chr2 merged-cache comparison report shows 294 HGNC_ID false positives where datafusion-bio-function-vep emits an HGNC ID and Ensembl VEP emits an empty value:

  • Report: /Users/mwiewior/research/git/vepyr/e2e-testing/reports/fast_chr2_merged_report.json
  • Cache: /Users/mwiewior/workspace/data_vepyr/115_GRCh38_merged
  • 281 extras on XR_007076390.1 / NBAS / HGNC:15625
  • 13 extras on NR_037931.2 / ANAPC1P1 / HGNC:44150
  • All affected rows are SOURCE=RefSeq

This is not a per-variant propagation bug. VEP propagation is broader than a single variant, but narrower than the current chromosome/global backfill in this crate.

Confirmed VEP release/115 behavior

The relevant behavior is buffer-scoped:

  • buffer_size is a CLI option for the number of variations read before analysis, defaulting to 5000.
  • InputBuffer::next() fills up to buffer_size variants, preserves input order, carries a pre-buffer for chromosome transitions, and does not annotate one buffer across a chromosome boundary.
  • get_all_features_by_InputBuffer() computes cache regions touched by every variant in the current buffer, expands by upstream/downstream distance, fetches/cleans cache features, filters to the buffer min_max range expanded by that same distance, then calls merge_features().
  • AnnotationType::Transcript::merge_features() builds an HGNC map keyed by _gene_symbol from native _gene_hgnc_id values in the current buffer feature set, then assigns _gene_hgnc_id to same-symbol features in that same feature set. In refseq/merged, it also refills _gene_symbol, _gene_symbol_source, and _gene_hgnc_id from same _gene->stable_id features.

Source links:

Upstream dependency

This crate needs native HGNC provenance from datafusion-bio-format-ensembl-cache so it does not seed VEP-style propagation from cache-propagated values.

Related upstream issue: biodatageeks/datafusion-bio-formats#166

Preferred upstream behavior:

  • Add gene_hgnc_id_native to transcript parquet, populated only from raw VEP object keys gene_hgnc_id / _gene_hgnc_id.
  • Remove export-time gene_hgnc_id propagation, or at minimum ensure this crate never treats propagated gene_hgnc_id as a native donor.
  • Keep gene_hgnc_id_native as a plain pass-through column with no COALESCE / window backfill.

Short-term compatibility in this repo:

  • If gene_hgnc_id_native is absent, parse _gene_hgnc_id / gene_hgnc_id from raw_object_json.
  • Do not use promoted gene_hgnc_id as a donor when native provenance is unavailable unless explicitly running a legacy non-strict mode.

Implementation plan in this repo

  1. Extend transcript hydration with native HGNC storage.

    Add a gene_hgnc_id_native-style field to TranscriptFeature, and carry gene_symbol, gene_symbol_source, gene_stable_id, transcript source, and stable feature ID needed by VEP's merge_features() logic.

  2. Remove chromosome/global HGNC backfill for strict VEP parity.

    Delete or bypass backfill_missing_hgnc_ids() in strict parity mode. Reset any working HGNC value to the native value before applying buffer-local propagation.

  3. Add VEP-style input-buffer chunking.

    Default buffer_size to 5000, chunk the logical input variant stream before CSQ expansion, preserve input order, do not cross chromosome boundaries, compute per-buffer min_max, and expose a config override matching VEP CLI behavior.

  4. Compute the buffer-local transcript feature set.

    Use the buffer min_max range expanded by the same upstream/downstream distance used for consequence annotation. If transcripts are already loaded per chromosome, filter the in-memory feature set by that expanded range. If transcript loading becomes lazy later, load only the 1 Mb cache regions touched by the buffer before applying the same min_max filter.

  5. Port the HGNC-relevant parts of merge_features().

    Build hgnc_by_symbol from native _gene_hgnc_id values in the current buffer feature set; apply same-symbol HGNC propagation only inside that feature set; for refseq/merged, build the same-gene refill map keyed by gene_stable_id for gene_symbol, gene_symbol_source, and native HGNC; do not carry propagated values into the next buffer.

  6. Keep propagation as a buffer-local overlay.

    Mutate a per-buffer working copy or maintain transcript_id -> propagated_hgnc_id overlay state for the current buffer. Do not permanently mutate the shared per-contig transcript cache.

Acceptance criteria

  • Unit test symbol-keyed merge_features() HGNC propagation with native donors and null recipients.
  • Unit test the refseq/merged same-gene_stable_id refill path.
  • Regression-test chr2 NBAS: variants before the 5000-record buffer boundary stay empty; variants in the next buffer can receive HGNC:15625.
  • Regression-test chr2 ANAPC1P1: variants before the 5000-record buffer boundary stay empty; variants in the next buffer can receive HGNC:44150.
  • Rerun the chr2 merged report and verify the 294 HGNC_ID false positives drop to zero without introducing post-boundary false negatives.

Performance guard

Do not re-query parquet per buffer if the current execution path already hydrates transcripts per contig. Build a coordinate-sorted transcript index or interval index if repeated per-buffer scans show up in profiles. The semantic requirement is buffer-local propagation; the implementation should be proportional to candidate transcripts near the buffer, not all chromosome transcripts multiplied by number of buffers.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions