Consolidate build_dedup_query() into bio-formats to eliminate duplicate query logic #107

@mwiewior

Problem

The transcript dedup + HGNC propagation query is duplicated in two repos:

  1. datafusion-bio-formats → storable_to_parquet.rs → build_dedup_query()
  2. datafusion-bio-functions → cache_builder.rs → build_query() / build_query_multi_chrom()

When the HGNC propagation fix (biodatageeks/datafusion-bio-formats#162) was added to bio-formats, the identical query in bio-functions was missed, causing #105 to persist until 33d700f ported the fix manually.

Proposed fix

Export build_dedup_query() from datafusion-bio-format-ensembl-cache as a public library function. Then cache_builder.rs in bio-functions replaces its local build_query() with a call to the shared function.

The parallel I/O pipeline, writer tuning, and chrom-splitting in cache_builder.rs stay untouched; only the SQL query construction moves to bio-formats.
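
A rough sketch of the shape this could take, with both crates modelled as modules in one file so it compiles on its own. The function signature, the `transcripts` table name, the `chrom` column, and the SQL text are all assumptions made for illustration; the real dedup + HGNC propagation query is not reproduced here.

```rust
/// Stand-in for datafusion-bio-format-ensembl-cache (module layout is hypothetical).
pub mod ensembl_cache {
    /// Single source of truth for the query text.
    /// NOTE: the SQL below is a placeholder, NOT the real dedup + HGNC query,
    /// and the `(table, chroms)` signature is an assumption.
    pub fn build_dedup_query(table: &str, chroms: &[&str]) -> String {
        let mut sql = format!("SELECT DISTINCT * FROM {table}");
        if !chroms.is_empty() {
            // Quote each chromosome name, doubling any embedded single quotes.
            let list = chroms
                .iter()
                .map(|c| format!("'{}'", c.replace('\'', "''")))
                .collect::<Vec<_>>()
                .join(", ");
            sql.push_str(&format!(" WHERE chrom IN ({list})"));
        }
        sql
    }
}

/// Stand-in for cache_builder.rs in bio-functions.
pub mod cache_builder {
    use super::ensembl_cache::build_dedup_query;

    /// Former local query builder, now a thin wrapper over the shared function.
    pub fn build_query(table: &str, chrom: &str) -> String {
        build_dedup_query(table, &[chrom])
    }

    /// Multi-chromosome variant, delegating to the same shared function.
    pub fn build_query_multi_chrom(table: &str, chroms: &[&str]) -> String {
        build_dedup_query(table, chroms)
    }
}
```

With this layout, a fix like the HGNC propagation change would land once in `build_dedup_query()` and both callers in bio-functions would pick it up on the next dependency bump, rather than needing a manual port.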

Context
