Skip to content

Feature/optimize unnest groupby#31

Open
psuszyns wants to merge 29 commits into
masterfrom
feature/optimize_unnest_groupby
Open

Feature/optimize unnest groupby#31
psuszyns wants to merge 29 commits into
masterfrom
feature/optimize_unnest_groupby

Conversation

@psuszyns
Copy link
Copy Markdown
Collaborator

No description provided.

@mwiewior
Copy link
Copy Markdown
Contributor

mwiewior commented Mar 11, 2026

@codex pls review

@mwiewior
Copy link
Copy Markdown
Contributor

@claude pls review

@claude
Copy link
Copy Markdown

claude Bot commented Mar 11, 2026

Claude finished @mwiewior's task in 2m 21s —— View job


PR Review: Feature/optimize unnest groupby

Tasks:

  • Gather context and understand the PR
  • Read and analyze changed files
  • Review logical optimizer rule
  • Review physical execution plan
  • Review integration tests
  • Provide comprehensive feedback

Summary

This PR introduces a new datafusion-bio-function-vcftools crate implementing a FusedArrayTransform optimization that replaces the unnest → transform → array_agg pattern with a single streaming operator, avoiding the memory explosion that occurs when GROUP BY materializes all unnested rows. The design is well-motivated and the architecture is sensible. Below are findings ranging from correctness issues to style suggestions.


Critical Issues

1. Ord implementation is inconsistent with PartialEq (UB risk)

FusedArrayTransform implements PartialEq by comparing transform_exprs (via their Debug representation), but Ord::cmp only compares array_columns, passthrough_columns, and output_columns — ignoring transform_exprs. This violates the Rust invariant that a == b implies a.cmp(&b) == Ordering::Equal.

// logical/fused_array_transform.rs:224-229
impl Ord for FusedArrayTransform {
    fn cmp(&self, other: &Self) -> Ordering {
        self.array_columns
            .cmp(&other.array_columns)
            .then_with(|| self.passthrough_columns.cmp(&other.passthrough_columns))
            .then_with(|| self.output_columns.cmp(&other.output_columns))
        // ← transform_exprs NOT compared here, but compared in PartialEq
    }
}

Fix this →

2. LargeList/FixedSizeList silently unsupported in execution

build_element_schema in all three locations handles List, LargeList, and FixedSizeList, but apply_transforms in fused_array_transform_exec.rs only downcasts to ListArray:

// physical/fused_array_transform_exec.rs:314-317
let list_arr = col.as_any().downcast_ref::<ListArray>().ok_or_else(|| {
    DataFusionError::Plan(format!("Column '{col_name}' is not a List type"))
})?;

If the input contains LargeList or FixedSizeList columns (which the schema allows via build_element_schema), execution will fail at runtime with a misleading error. Same issue in apply_identity_transform. Fix this →

3. Failing tests committed without #[ignore]

test_arithmetic_transform and test_conditional_transform (lines 949–1258) are documented with large comments stating they currently fail due to unimplemented transformation CTE support, yet they are not marked #[ignore]:

// tests/integration_test.rs:953-966
// NOTE: This test currently FAILS because the FusedArrayTransform optimization
//       does NOT support transformations in a separate CTE yet.

These tests will break CI. Either mark them #[ignore] or remove them until the feature is implemented. Fix this →


Significant Issues

4. Code duplication: build_element_schema defined 3 times

The function appears identically (or near-identically) in:

  • logical/fused_array_transform.rs
  • physical/extension_planner.rs
  • physical/fused_array_transform_exec.rs

This is a DRY violation and future bug hotspot. Consider extracting to a shared utils.rs or schema_utils.rs module.

5. Silent Float64 fallback on type inference failure

When transform expression type inference fails, the code falls back silently to DataType::Float64 and only logs a warn!. For non-Float64 data (e.g., integer or string arrays), this will produce a schema that mismatches the actual data, causing a runtime error or silent data corruption. The fallback should at minimum fail loudly or propagate the error.

// logical/fused_array_transform.rs:109-119
warn!("Type inference failed for output '{output_col}': {e}. Using Float64 fallback.");
DataType::Float64

6. Hash and PartialEq use Debug string comparison for Expr

// logical/fused_array_transform.rs:195-198
for expr in &self.transform_exprs {
    format!("{expr:?}").hash(state);
}

This is fragile — two semantically identical expressions with different formatting would not match; two expressions with different semantics could theoretically match. This is also potentially slow for large expressions. Since DataFusion Expr implements structural equality (via PartialEq), prefer that directly.

7. Environment variable-based feature toggle is not thread-safe

The is_enabled() check on every optimizer call reads from the environment (fine), but the integration tests use unsafe { std::env::set_var(...) } with serial_test::serial to prevent races. A test panic before remove_var will leave other tests with incorrect state. Consider replacing the env var with a std::sync::atomic::AtomicBool on the optimizer rule itself, or using a proper configuration struct. The README example also uses set_var without unsafe:

// README.md:71 (Rust 2024 edition requires unsafe for set_var)
std::env::set_var("BIO_FUSED_ARRAY_TRANSFORM", "1");

8. Missing FUSED_ARRAY_TRANSFORM.md referenced in lib.rs

lib.rs line 50 references: See FUSED_ARRAY_TRANSFORM.md for detailed design documentation. — but no such file exists in the PR.


Minor Issues

9. VcfQueryPlanner allocates a new DefaultPhysicalPlanner per query

// physical/query_planner.rs:36-38
let physical_planner = DefaultPhysicalPlanner::with_extension_planners(vec![Arc::new(
    FusedArrayTransformPlanner::new(),
)]);

This is correct but creates a new planner object on every call. Consider caching the planner or making it a field of VcfQueryPlanner.

10. Crate name uses singular "vcftools" inconsistently

The Cargo.toml package name is datafusion-bio-function-vcftools (singular) while the directory is bio-function-vcftools. This is fine, but worth a comment in case it's an oversight.

11. parquet dev-dependency is not from workspace

# datafusion/bio-function-vcftools/Cargo.toml:22
parquet = "56.1"

Should likely use { workspace = true } if parquet is used across the workspace, for version consistency.

12. async-trait not from workspace

async-trait = "0.1.88"

This is a pinned version not in workspace dependencies. Consider adding to workspace or at least leaving a comment about the version requirement.


Test Quality

  • Good coverage of edge cases (NULL arrays, empty arrays, NULL elements, mismatched lengths, struct fields).
  • The false-positive tests (ensuring optimization is NOT applied to mixed aggregates) are a nice addition.
  • However, two tests that are expected to fail are not guarded — see issue Add Claude Code GitHub Workflow #3 above.
  • The baseline vs. optimized context comparison pattern is clean and readable.

Overall Assessment

The core idea is solid and well-implemented. The main blockers are:

  1. The Ord/PartialEq inconsistency (correctness)
  2. Incomplete LargeList/FixedSizeList support in execution
  3. Committed failing tests without #[ignore]

@mwiewior
Copy link
Copy Markdown
Contributor

@codex pls review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 28e8755918

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread datafusion/bio-function-vcftools/src/logical/optimizer_rule.rs Outdated
Comment thread datafusion/bio-function-vcftools/src/physical/fused_array_transform_exec.rs Outdated
Comment thread datafusion/bio-function-vcftools/src/physical/fused_array_transform_exec.rs Outdated
@psuszyns
Copy link
Copy Markdown
Collaborator Author

all code review comments addressed but this resulted in significant changes being applied - @claude and @codex please redo the code review

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@claude
Copy link
Copy Markdown

claude Bot commented Mar 11, 2026

Claude finished @psuszyns's task in 3m 57s —— View job


PR Review: Feature/optimize unnest groupby (Re-review)

Tasks:

  • Gather context and understand the PR
  • Read and analyze changed files
  • Review logical optimizer rule and fused_array_transform
  • Review physical execution plan
  • Review integration tests
  • Provide comprehensive feedback

Summary

Good progress from the previous review — most of the critical and significant issues have been addressed. The shared common.rs module cleanly eliminates the previous code duplication, the AtomicBool approach is much better than the env-var-only approach, and the error-on-failure type inference is the right call. Below are remaining issues found in this re-review.


Remaining Issues from Previous Review

1. Ord vs PartialEq still inconsistent (reduced, not eliminated)

PartialEq now uses structural Expr equality (a == b), but Ord uses Display string comparison (a.to_string().cmp(&b.to_string())). The invariant requires that a == b implies a.cmp(b) == Equal AND a.cmp(b) == Equal implies a == b.

The second direction is broken: two structurally different expressions could render to the same Display string, making a.cmp(b) == Equal while a != b. This violates the Ord + PartialEq consistency contract.

// logical/fused_array_transform.rs:231-242
.then_with(|| {
    for (a, b) in self.transform_exprs.iter().zip(other.transform_exprs.iter()) {
        let cmp = a.to_string().cmp(&b.to_string());  // ← Display string
        // ...
    }
})
// But PartialEq uses: a == b  ← structural equality

The safest fix: since Expr doesn't implement Ord, derive a canonical representation used for both. For example, compare by serialized Display in both PartialEq and Ord, or use a dedicated method that's consistent. Fix this →

2. test_conditional_transform has stale "currently fails" comment but no #[ignore]

// tests/integration_test.rs:1311-1315
// NOTE: This test currently FAILS because the FusedArrayTransform optimization
//       does NOT support transformations in a separate CTE yet.
//
// Same limitation as test_arithmetic_transform - transformation CTEs not detected.

This comment says the test currently fails, yet the test isn't marked #[ignore]. If the functionality was implemented (the parquet round-trip test at line 1748 covers an identical pattern), the comment should be removed. If it still fails, the test needs #[ignore]. The stale comment is a CI risk and causes confusion. Fix this →


New Issues Found

3. Dead code: is_list_null in common.rs is never called

// common.rs:140-158
pub fn is_list_null(column: &dyn Array, row_idx: usize, col_name: &str) -> Result<bool> {
    // ...
}

This function is pub but not used anywhere in the codebase. Dead code in a public API is confusing (consumers might think they need it) and adds maintenance burden. Either use it where appropriate (e.g., in build_padded_array) or remove it. Fix this →

4. build_row_mini_batch creates a new Arc<Schema> for every row

// physical/fused_array_transform_exec.rs:503
let schema = Arc::new(Schema::new(fields));
RecordBatch::try_new(schema, columns).map_err(DataFusionError::from)

build_row_mini_batch is called once per input row per transform expression. For a 10,000-row batch, this creates 10,000 heap-allocated Schema objects. Since the schema structure is the same for all rows (same column names and types), it can be pre-computed once and reused. This is a potential performance regression relative to the baseline it's supposed to beat. Fix this →

5. VcfQueryPlanner creates a new planner instance on every create_physical_plan call (unchanged)

// physical/query_planner.rs:36-38
let physical_planner = DefaultPhysicalPlanner::with_extension_planners(vec![Arc::new(
    FusedArrayTransformPlanner::new(),
)]);

This allocates a new DefaultPhysicalPlanner and FusedArrayTransformPlanner on each query. Both are stateless; make physical_planner a field of VcfQueryPlanner and initialize it once. Fix this →

6. test_mismatched_array_lengths has an incorrect error message

// tests/integration_test.rs:960-963
assert!(
    plan_str_optimized.contains("FusedArrayTransform"),
    "FusedArrayTransform optimization was NOT applied for empty arrays case! ..."
//                                                       ^^^^^^^^^^^
// This message says "empty arrays" but the test is for "mismatched lengths"
);

And similarly at lines 974-977:

assert_eq!(baseline_str, optimized_str,
    "Baseline and optimized results differ for empty arrays case");
//                                         ^^^^^^^^^^ should be "mismatched lengths"

Fix this →


Minor Issues (Unchanged from Previous Review)

7. async-trait not from workspaceasync-trait = "0.1.88" in Cargo.toml is a pinned version not in workspace deps. Consider adding to workspace for uniformity. Same for serial_test = "3.1" and parquet = "56.1" in dev-dependencies.


What Was Fixed Well

  • Code duplication: build_element_schema is now cleanly centralized in common.rs with both Arrow and DFSchema variants
  • LargeList/FixedSizeList support: extract_list_values in common.rs properly handles all three list variants
  • Type inference fallback: Now returns a proper DataFusionError instead of silently using Float64
  • Hash implementation: Now uses Display string (more canonical) instead of Debug
  • PartialEq implementation: Now uses structural Expr equality instead of Debug string
  • Thread safety: FUSED_ARRAY_TRANSFORM_ENABLED AtomicBool replaces env-var-only approach; unsafe set_var is no longer needed in tests
  • NULL/empty array semantics: Rows where all arrays are NULL or empty are now correctly dropped to match UNNEST + GROUP BY baseline behavior
  • all_array_agg guard: Correctly rejects ORDER BY, DISTINCT, and FILTER modifiers on array_agg

Overall Assessment

The PR is substantially improved. The two blockers are:

  1. The stale "currently fails" comment in test_conditional_transform — either the test passes (remove the comment) or it doesn't (add #[ignore])
  2. The Ord/PartialEq inconsistency — technically still present, though reduced in impact

@psuszyns
Copy link
Copy Markdown
Collaborator Author

review comments addressed, @claude please redo the review again

@claude
Copy link
Copy Markdown

claude Bot commented Mar 11, 2026

Claude finished @psuszyns's task in 4m 14s —— View job


PR Review: Feature/optimize unnest groupby (3rd Review)

Tasks:

  • Gather context from previous reviews
  • Read and analyze all changed files
  • Review logical optimizer and fused_array_transform
  • Review physical execution plan
  • Review integration tests
  • Check previous issues are resolved
  • Provide comprehensive feedback

Summary

Excellent progress across all three rounds of review. All critical and significant blockers from previous reviews are now resolved. The code is substantially cleaner and more correct. A few minor remaining issues are noted below.


What Was Fixed Well (Since Last Review)

  • Ord/PartialEq consistency — both now consistently use Display to_string() comparison. Hash also uses to_string(). All three traits are self-consistent. ✅
  • Stale "currently FAILS" comments — removed from both test_arithmetic_transform and test_conditional_transform. These tests now have proper assertions and run with the optimization applied. ✅
  • Dead code is_list_null — removed from common.rs. ✅
  • Schema allocation per rowbuild_mini_batch_schema is now pre-computed once before the row loop and passed into build_row_mini_batch. This avoids 10k+ heap allocations per batch. ✅
  • VcfQueryPlanner planner per queryphysical_planner is now a struct field initialized in VcfQueryPlanner::new(), not recreated on each create_physical_plan call. ✅

Remaining Issues

1. Identity transform path doesn't apply row_mask filter to array outputs

In process_batch, passthrough columns are filtered with bool_mask to drop rows where all arrays are null/empty. But in the identity transform branch (transform_exprs.is_empty()), array output columns are pushed unchanged (unfiltered):

// physical/fused_array_transform_exec.rs:321-328
if self.transform_exprs.is_empty() {
    for col_name in &self.array_columns {
        let idx = input_schema.index_of(col_name)?;
        let col = batch.column(idx);
        let output_array = self.apply_identity_transform(col, col_name)?;
        output_columns.push(output_array); // ← NOT filtered by bool_mask!
    }
}

If any row has all-null or empty arrays (row_mask contains false), the passthrough columns will have num_output_rows rows while the array output columns will have num_rows rows. RecordBatch::try_new will then panic with a schema mismatch.

This path isn't reached via the optimizer (which always generates ≥1 transform expression), but the API allows it and the unit test test_identity_transform only uses data with non-empty arrays. The fix is to apply the same bool_mask filter in the identity branch:

let filtered_col = datafusion::arrow::compute::filter(col.as_ref(), &bool_mask)?;
output_columns.push(Arc::new(filtered_col));

Fix this →

2. Stale error message in test_mismatched_array_lengths

This minor issue from the previous review still hasn't been fixed. Lines 962 and 976 say "empty arrays case" but the test is test_mismatched_array_lengths:

// tests/integration_test.rs:961-962
assert!(
    plan_str_optimized.contains("FusedArrayTransform"),
    "FusedArrayTransform optimization was NOT applied for empty arrays case! ..."
//                                                       ^^^^^^^^^^^^ should be "mismatched lengths"
);
// ...
// line 974-977:
assert_eq!(baseline_str, optimized_str,
    "Baseline and optimized results differ for empty arrays case");
//                                         ^^^^^^^^^^^^ should be "mismatched lengths"

Fix this →

3. resolve_expr silently swallows errors

// logical/optimizer_rule.rs:313-325
fn resolve_expr(expr: &Expr, definitions: &HashMap<String, Expr>) -> Expr {
    expr.clone()
        .transform(|e| { ... })
        .map(|t| t.data)
        .unwrap_or_else(|_| expr.clone())  // ← silently returns unresolved expr on error
}

If expression tree traversal fails (e.g., due to an unexpected expression variant), the original unresolved expression is silently returned and passed to the physical planner. This makes bugs in expression resolution very hard to diagnose. Since resolve_expr is called within try_optimize which already returns Option<Result<_>>, it can propagate errors. Consider changing the signature to Result<Expr> and returning Err instead.


Minor Issues (Carried from Previous Review)

4. async-trait, serial_test, parquet dev-deps not from workspace

# Cargo.toml
async-trait = "0.1.88"   # not in workspace
# dev-dependencies:
serial_test = "3.1"       # not in workspace
parquet = "56.1"          # not in workspace

For a workspace crate, version management is cleaner with workspace deps. Low priority, but worth tidying up.


Minor: i32 offset buffer could overflow for large batches

In apply_single_transform (line 375–399), the offset buffer for the output ListArray uses i32:

let mut offsets: Vec<i32> = vec![0];
let mut current_offset: i32 = 0;
// ...
current_offset += result_array.len() as i32;

If the total number of output elements in a single batch exceeds i32::MAX (~2.1 billion), this silently overflows (wraps in debug mode, UB in release). For genomics data with large arrays this is unlikely in a single batch, but Arrow's LargeList uses i64 for this reason. If the input type can be LargeList, the output should also use i64 offsets (i.e., produce a LargeListArray). For the current use case of List output this is probably fine, but worth a comment acknowledging the limitation.


Overall Assessment

The implementation is in very good shape. The primary actionable issue is #1 (identity transform path correctness with null/empty arrays), which is a real correctness bug in the API surface even if it isn't reached by the optimizer today. Issues #2 and #3 are easy one-line fixes. Issue #4 is cosmetic.

@psuszyns psuszyns force-pushed the feature/optimize_unnest_groupby branch from a2d0363 to e42b16b Compare March 12, 2026 10:51
@mwiewior
Copy link
Copy Markdown
Contributor

@claude pls review before merge

@mwiewior
Copy link
Copy Markdown
Contributor

@codex pls review before merge

@claude
Copy link
Copy Markdown

claude Bot commented Mar 12, 2026

Claude finished @mwiewior's task in 2m 49s —— View job


PR Review: Feature/optimize unnest groupby (Final Pre-Merge Review)

Tasks:

  • Gather context from previous reviews
  • Read and analyze all changed files
  • Check remaining issues from 3rd review are resolved
  • Check for any new issues
  • Provide final assessment

What Was Fixed Since Last Review

All three issues flagged as blockers in the 3rd review have been resolved:

1. Identity transform path row_mask filter — ✅ Fixed

fused_array_transform_exec.rs:327-329 now correctly applies bool_mask to array outputs in the identity branch:

let filtered_col = datafusion::arrow::compute::filter(col.as_ref(), &bool_mask)?;
let output_array = self.apply_identity_transform(&filtered_col, col_name)?;

A new unit test test_identity_transform_with_empty_array covers this path and verifies row filtering.

2. Stale "empty arrays" message in test_mismatched_array_lengths — ✅ Fixed

Lines 962 and 976 now correctly say "mismatched lengths case".

3. resolve_expr silently swallowing errors — ✅ Fixed

optimizer_rule.rs:326-337 now returns Result<Expr> and properly propagates errors through ? at the call sites. The unwrap_or_else silent fallback is gone.


Remaining Minor Issues

1. async-trait is a pinned direct dependency, not from workspace

# datafusion/bio-function-vcftools/Cargo.toml:16
async-trait = "0.1.88"

This is the only remaining dependency not managed via { workspace = true }. In Rust 2024 edition (which this crate uses), async functions in traits work natively — async-trait may not be needed at all. DataFusion 50.x's QueryPlanner and ExtensionPlanner traits may already work without it. If the dependency is genuinely needed, it should be added to the workspace. Check this →

2. Minor: Stale comment in test_mismatched_array_lengths

Line 967 still says "filter out the empty-array row" but the test is about mismatched-length arrays, not empty arrays:

// datafusion/bio-function-vcftools/tests/integration_test.rs:967
// Both baseline and optimized should filter out the empty-array row and have identical output
//                                                    ^^^^^^^^^^^^^^ misleading

Fix this →

3. Acknowledged: i32 offset buffer in apply_single_transform

// fused_array_transform_exec.rs:377-401
let mut offsets: Vec<i32> = vec![0];
let mut current_offset: i32 = 0;
// ...
current_offset += result_array.len() as i32;

This was flagged in the previous review. For the current genomics use case (VCF genotype arrays), i32::MAX (~2.1B elements per batch) is unlikely to be hit, so this is acceptable. A short comment documenting the limitation would be good practice.


Overall Assessment

This PR is ready to merge. All critical correctness bugs from three rounds of review have been resolved. The implementation is clean, well-tested, and the architecture is sound. The remaining issues are cosmetic and can be addressed post-merge.

The optimization correctly:

  • Detects Aggregate(array_agg) → Projection* → Unnest patterns
  • Rejects unsupported ORDER BY/DISTINCT/FILTER modifiers
  • Handles NULL and empty arrays with correct UNNEST semantics (row filtering)
  • Supports List, LargeList, and FixedSizeList column types
  • Pre-computes the mini-batch schema once per input batch (not per row)
  • Caches the physical planner instance in VcfQueryPlanner
  • Uses AtomicBool for thread-safe enable/disable

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e42b16b2e9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread datafusion/bio-function-vcftools/src/logical/optimizer_rule.rs
Comment thread datafusion/bio-function-vcftools/src/physical/extension_planner.rs Outdated
Comment thread datafusion/bio-function-vcftools/src/logical/optimizer_rule.rs Outdated
@psuszyns psuszyns force-pushed the feature/optimize_unnest_groupby branch from 6f750ce to 3213fb8 Compare March 12, 2026 16:36
@psuszyns
Copy link
Copy Markdown
Collaborator Author

@claude please:

  • review if the changes applied to the code after the code review didn't introduce any additional problems
  • search for memory leaks and ways to reduce memory usage

@claude
Copy link
Copy Markdown

claude Bot commented Mar 12, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

@psuszyns psuszyns force-pushed the feature/optimize_unnest_groupby branch from d87256b to 87ee1dc Compare March 22, 2026 17:12
psuszyns and others added 9 commits March 22, 2026 19:32
Add a high-performance Table-Valued Function that replaces the
unnest→transform→array_agg SQL pattern with direct Arrow array
processing. Operates on flat List value arrays without materialization,
yielding ~10-20x speedup over FusedArrayTransform for VCF sample QC.

Features:
- Site filtering (QUAL, avg GQ/DP thresholds)
- Per-sample GT normalization, quality gating, PL correction, DS dosage
- Writes multi-sample VCF directly (GT:GQ:DP:PL:DS FORMAT)
- Sample names from comma-separated string, text file, or auto-generated
- Configurable thresholds via optional positional args
- 15 tests covering processing, writing, SQL e2e, and formula correctness

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ensures the TVF is available for any session created via
BioSessionExt, not just create_bio_session().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Output to `.vcf.gz` or `.vcf.bgz` is automatically gzip-compressed
(level 6). Plain `.vcf` remains uncompressed. Uses flate2 with a
type-erased WriterInner enum to avoid generics on every method.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace flate2 GzEncoder with noodles-bgzf Writer so .vcf.gz output
is proper BGZF (blocked gzip with EOF marker), directly indexable
by tabix/bcftools.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… extraction

Two critical performance fixes:

1. Replace flat sample loop with variant-major nested loop, eliminating
   the O(variants) linear search in extract_pl_triple that ran for
   every sample (~180M comparisons/batch → 0).

2. Pre-extract all PL triples into flat Vecs in one pass via PlFlatData,
   using direct offset-based indexing into the inner ListArray's flat
   values buffer. Eliminates ~600K Arc<dyn Array> allocations per batch
   from ListArray::value() and per-sample downcast attempts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix #3: list_avg handles Int32 natively (sum as i64), avoids full
  f64 cast of flat value arrays for GQ/DP averaging.

Fix #4: skip arrow::compute::filter (8 calls) when all variants
  pass site filtering — use original columns directly.

Fix #5: writer buffers entire variant line into Vec<u8> before a
  single write_all call. Eliminates ~10 tiny write syscalls per
  sample (6M → 300 writes/batch for 2000 samples). Uses itoa for
  zero-alloc integer formatting.

Fix #6: pre-computed lookup table for 10^(-PL/10) replaces 3×
  powf() per sample in dosage computation (~60ns → ~3ns each).

Fix #8: FlatNumeric enum keeps GQ/DP as native Int32, compares
  with f64 thresholds directly — no cast to Float64 then back.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants