Problem
The transcript dedup + HGNC propagation query is duplicated in two repos:
- datafusion-bio-formats: storable_to_parquet.rs → build_dedup_query()
- datafusion-bio-functions: cache_builder.rs → build_query() / build_query_multi_chrom()
When the HGNC propagation fix (biodatageeks/datafusion-bio-formats#162) was added to bio-formats, the identical query in bio-functions was missed, causing #105 to persist until 33d700f ported the fix manually.
Proposed fix
Export build_dedup_query() from datafusion-bio-format-ensembl-cache as a public library function. Then cache_builder.rs in bio-functions replaces its local build_query() with a call to the shared function.
The parallel I/O pipeline, writer tuning, and chrom-splitting in cache_builder.rs stay untouched — only the SQL query construction moves to bio-formats.
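A minimal sketch of what the split could look like, under stated assumptions: the real signature and SQL of build_dedup_query() are not shown in this issue, so the `table` and `chroms` parameters, the `chrom` column, and the placeholder SQL body below are all hypothetical. The two modules stand in for the two crates; only the delegation pattern is the point.

```rust
// Stand-in for the datafusion-bio-format-ensembl-cache crate.
// Everything in here is an assumption made for illustration.
mod ensembl_cache {
    /// Hypothetical public entry point. The real function would build the
    /// full transcript dedup + HGNC propagation query; this placeholder
    /// only builds a filtered SELECT so the example stays self-contained.
    pub fn build_dedup_query(table: &str, chroms: &[&str]) -> String {
        let mut sql = format!("SELECT * FROM {table}");
        if !chroms.is_empty() {
            // Quote each chromosome name, doubling any embedded single quote.
            let list = chroms
                .iter()
                .map(|c| format!("'{}'", c.replace('\'', "''")))
                .collect::<Vec<_>>()
                .join(", ");
            sql.push_str(&format!(" WHERE chrom IN ({list})"));
        }
        sql
    }
}

// Stand-in for cache_builder.rs in datafusion-bio-functions.
mod cache_builder {
    use super::ensembl_cache::build_dedup_query;

    /// Thin wrapper: no SQL is constructed locally any more.
    pub fn build_query(table: &str, chrom: &str) -> String {
        build_dedup_query(table, &[chrom])
    }

    /// The multi-chromosome variant delegates to the same shared function.
    pub fn build_query_multi_chrom(table: &str, chroms: &[&str]) -> String {
        build_dedup_query(table, chroms)
    }
}
```

With this shape, a future fix to the query lands in one place and both callers pick it up on the next dependency bump. Whether build_query() and build_query_multi_chrom() remain as wrappers or are removed outright is a design choice this sketch leaves open.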
Context