Feature/optimize unnest groupby by psuszyns · Pull Request #31 · biodatageeks/datafusion-bio-functions

psuszyns · 2026-03-11T16:28:23Z

No description provided.

mwiewior · 2026-03-11T16:29:09Z

@codex pls review

mwiewior · 2026-03-11T16:29:25Z

@claude pls review

claude · 2026-03-11T16:29:39Z

Claude finished @mwiewior's task in 2m 21s —— View job

PR Review: Feature/optimize unnest groupby

Tasks:

Gather context and understand the PR
Read and analyze changed files
Review logical optimizer rule
Review physical execution plan
Review integration tests
Provide comprehensive feedback

Summary

This PR introduces a new datafusion-bio-function-vcftools crate implementing a FusedArrayTransform optimization that replaces the unnest → transform → array_agg pattern with a single streaming operator, avoiding the memory explosion that occurs when GROUP BY materializes all unnested rows. The design is well-motivated and the architecture is sensible. Below are findings ranging from correctness issues to style suggestions.

Critical Issues

1. Ord implementation is inconsistent with PartialEq (UB risk)

FusedArrayTransform implements PartialEq by comparing transform_exprs (via their Debug representation), but Ord::cmp only compares array_columns, passthrough_columns, and output_columns — ignoring transform_exprs. This violates the Rust invariant that a == b implies a.cmp(&b) == Ordering::Equal.

// logical/fused_array_transform.rs:224-229
impl Ord for FusedArrayTransform {
    fn cmp(&self, other: &Self) -> Ordering {
        self.array_columns
            .cmp(&other.array_columns)
            .then_with(|| self.passthrough_columns.cmp(&other.passthrough_columns))
            .then_with(|| self.output_columns.cmp(&other.output_columns))
        // ← transform_exprs NOT compared here, but compared in PartialEq
    }
}

Fix this →

2. LargeList/FixedSizeList silently unsupported in execution

build_element_schema in all three locations handles List, LargeList, and FixedSizeList, but apply_transforms in fused_array_transform_exec.rs only downcasts to ListArray:

// physical/fused_array_transform_exec.rs:314-317
let list_arr = col.as_any().downcast_ref::<ListArray>().ok_or_else(|| {
    DataFusionError::Plan(format!("Column '{col_name}' is not a List type"))
})?;

If the input contains LargeList or FixedSizeList columns (which the schema allows via build_element_schema), execution will fail at runtime with a misleading error. Same issue in apply_identity_transform. Fix this →

3. Failing tests committed without #[ignore]

test_arithmetic_transform and test_conditional_transform (lines 949–1258) are documented with large comments stating they currently fail due to unimplemented transformation CTE support, yet they are not marked #[ignore]:

// tests/integration_test.rs:953-966
// NOTE: This test currently FAILS because the FusedArrayTransform optimization
//       does NOT support transformations in a separate CTE yet.

These tests will break CI. Either mark them #[ignore] or remove them until the feature is implemented. Fix this →

Significant Issues

4. Code duplication: build_element_schema defined 3 times

The function appears identically (or near-identically) in:

logical/fused_array_transform.rs
physical/extension_planner.rs
physical/fused_array_transform_exec.rs

This is a DRY violation and future bug hotspot. Consider extracting to a shared utils.rs or schema_utils.rs module.

5. Silent Float64 fallback on type inference failure

When transform expression type inference fails, the code falls back silently to DataType::Float64 and only logs a warn!. For non-Float64 data (e.g., integer or string arrays), this will produce a schema that mismatches the actual data, causing a runtime error or silent data corruption. The fallback should at minimum fail loudly or propagate the error.

// logical/fused_array_transform.rs:109-119
warn!("Type inference failed for output '{output_col}': {e}. Using Float64 fallback.");
DataType::Float64

6. Hash and PartialEq use Debug string comparison for Expr

// logical/fused_array_transform.rs:195-198
for expr in &self.transform_exprs {
    format!("{expr:?}").hash(state);
}

This is fragile — two semantically identical expressions with different formatting would not match; two expressions with different semantics could theoretically match. This is also potentially slow for large expressions. Since DataFusion Expr implements structural equality (via PartialEq), prefer that directly.

7. Environment variable-based feature toggle is not thread-safe

The is_enabled() check on every optimizer call reads from the environment (fine), but the integration tests use unsafe { std::env::set_var(...) } with serial_test::serial to prevent races. A test panic before remove_var will leave other tests with incorrect state. Consider replacing the env var with a std::sync::atomic::AtomicBool on the optimizer rule itself, or using a proper configuration struct. The README example also uses set_var without unsafe:

// README.md:71 (Rust 2024 edition requires unsafe for set_var)
std::env::set_var("BIO_FUSED_ARRAY_TRANSFORM", "1");

8. Missing FUSED_ARRAY_TRANSFORM.md referenced in lib.rs

lib.rs line 50 references: See FUSED_ARRAY_TRANSFORM.md for detailed design documentation. — but no such file exists in the PR.

Minor Issues

9. VcfQueryPlanner allocates a new DefaultPhysicalPlanner per query

// physical/query_planner.rs:36-38
let physical_planner = DefaultPhysicalPlanner::with_extension_planners(vec![Arc::new(
    FusedArrayTransformPlanner::new(),
)]);

This is correct but creates a new planner object on every call. Consider caching the planner or making it a field of VcfQueryPlanner.

10. Crate name uses singular "vcftools" inconsistently

The Cargo.toml package name is datafusion-bio-function-vcftools (singular) while the directory is bio-function-vcftools. This is fine, but worth a comment in case it's an oversight.

11. parquet dev-dependency is not from workspace

# datafusion/bio-function-vcftools/Cargo.toml:22
parquet = "56.1"

Should likely use { workspace = true } if parquet is used across the workspace, for version consistency.

12. async-trait not from workspace

async-trait = "0.1.88"

This is a pinned version not in workspace dependencies. Consider adding to workspace or at least leaving a comment about the version requirement.

Test Quality

Good coverage of edge cases (NULL arrays, empty arrays, NULL elements, mismatched lengths, struct fields).
The false-positive tests (ensuring optimization is NOT applied to mixed aggregates) are a nice addition.
However, two tests that are expected to fail are not guarded — see issue Add Claude Code GitHub Workflow #3 above.
The baseline vs. optimized context comparison pattern is clean and readable.

Overall Assessment

The core idea is solid and well-implemented. The main blockers are:

The Ord/PartialEq inconsistency (correctness)
Incomplete LargeList/FixedSizeList support in execution
Committed failing tests without #[ignore]

mwiewior · 2026-03-11T16:31:46Z

@codex pls review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 28e8755918

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

psuszyns · 2026-03-11T23:12:24Z

all code review comments addressed but this resulted in significant changes being applied - @claude and @codex please redo the code review

chatgpt-codex-connector · 2026-03-11T23:12:30Z

To use Codex here, create a Codex account and connect to github.

claude · 2026-03-11T23:12:43Z

Claude finished @psuszyns's task in 3m 57s —— View job

PR Review: Feature/optimize unnest groupby (Re-review)

Tasks:

Gather context and understand the PR
Read and analyze changed files
Review logical optimizer rule and fused_array_transform
Review physical execution plan
Review integration tests
Provide comprehensive feedback

Summary

Good progress from the previous review — most of the critical and significant issues have been addressed. The shared common.rs module cleanly eliminates the previous code duplication, the AtomicBool approach is much better than the env-var-only approach, and the error-on-failure type inference is the right call. Below are remaining issues found in this re-review.

Remaining Issues from Previous Review

1. Ord vs PartialEq still inconsistent (reduced, not eliminated)

PartialEq now uses structural Expr equality (a == b), but Ord uses Display string comparison (a.to_string().cmp(&b.to_string())). The invariant requires that a == b implies a.cmp(b) == Equal AND a.cmp(b) == Equal implies a == b.

The second direction is broken: two structurally different expressions could render to the same Display string, making a.cmp(b) == Equal while a != b. This violates the Ord + PartialEq consistency contract.

// logical/fused_array_transform.rs:231-242
.then_with(|| {
    for (a, b) in self.transform_exprs.iter().zip(other.transform_exprs.iter()) {
        let cmp = a.to_string().cmp(&b.to_string());  // ← Display string
        // ...
    }
})
// But PartialEq uses: a == b  ← structural equality

The safest fix: since Expr doesn't implement Ord, derive a canonical representation used for both. For example, compare by serialized Display in both PartialEq and Ord, or use a dedicated method that's consistent. Fix this →

2. test_conditional_transform has stale "currently fails" comment but no #[ignore]

// tests/integration_test.rs:1311-1315
// NOTE: This test currently FAILS because the FusedArrayTransform optimization
//       does NOT support transformations in a separate CTE yet.
//
// Same limitation as test_arithmetic_transform - transformation CTEs not detected.

This comment says the test currently fails, yet the test isn't marked #[ignore]. If the functionality was implemented (the parquet round-trip test at line 1748 covers an identical pattern), the comment should be removed. If it still fails, the test needs #[ignore]. The stale comment is a CI risk and causes confusion. Fix this →

New Issues Found

3. Dead code: is_list_null in common.rs is never called

// common.rs:140-158
pub fn is_list_null(column: &dyn Array, row_idx: usize, col_name: &str) -> Result<bool> {
    // ...
}

This function is pub but not used anywhere in the codebase. Dead code in a public API is confusing (consumers might think they need it) and adds maintenance burden. Either use it where appropriate (e.g., in build_padded_array) or remove it. Fix this →

4. build_row_mini_batch creates a new Arc<Schema> for every row

// physical/fused_array_transform_exec.rs:503
let schema = Arc::new(Schema::new(fields));
RecordBatch::try_new(schema, columns).map_err(DataFusionError::from)

build_row_mini_batch is called once per input row per transform expression. For a 10,000-row batch, this creates 10,000 heap-allocated Schema objects. Since the schema structure is the same for all rows (same column names and types), it can be pre-computed once and reused. This is a potential performance regression relative to the baseline it's supposed to beat. Fix this →

5. VcfQueryPlanner creates a new planner instance on every create_physical_plan call (unchanged)

// physical/query_planner.rs:36-38
let physical_planner = DefaultPhysicalPlanner::with_extension_planners(vec![Arc::new(
    FusedArrayTransformPlanner::new(),
)]);

This allocates a new DefaultPhysicalPlanner and FusedArrayTransformPlanner on each query. Both are stateless; make physical_planner a field of VcfQueryPlanner and initialize it once. Fix this →

6. test_mismatched_array_lengths has an incorrect error message

// tests/integration_test.rs:960-963
assert!(
    plan_str_optimized.contains("FusedArrayTransform"),
    "FusedArrayTransform optimization was NOT applied for empty arrays case! ..."
//                                                       ^^^^^^^^^^^
// This message says "empty arrays" but the test is for "mismatched lengths"
);

And similarly at lines 974-977:

assert_eq!(baseline_str, optimized_str,
    "Baseline and optimized results differ for empty arrays case");
//                                         ^^^^^^^^^^ should be "mismatched lengths"

Fix this →

Minor Issues (Unchanged from Previous Review)

7. async-trait not from workspace — async-trait = "0.1.88" in Cargo.toml is a pinned version not in workspace deps. Consider adding to workspace for uniformity. Same for serial_test = "3.1" and parquet = "56.1" in dev-dependencies.

What Was Fixed Well

Code duplication: build_element_schema is now cleanly centralized in common.rs with both Arrow and DFSchema variants
LargeList/FixedSizeList support: extract_list_values in common.rs properly handles all three list variants
Type inference fallback: Now returns a proper DataFusionError instead of silently using Float64
Hash implementation: Now uses Display string (more canonical) instead of Debug
PartialEq implementation: Now uses structural Expr equality instead of Debug string
Thread safety: FUSED_ARRAY_TRANSFORM_ENABLED AtomicBool replaces env-var-only approach; unsafe set_var is no longer needed in tests
NULL/empty array semantics: Rows where all arrays are NULL or empty are now correctly dropped to match UNNEST + GROUP BY baseline behavior
all_array_agg guard: Correctly rejects ORDER BY, DISTINCT, and FILTER modifiers on array_agg

Overall Assessment

The PR is substantially improved. The two blockers are:

The stale "currently fails" comment in test_conditional_transform — either the test passes (remove the comment) or it doesn't (add #[ignore])
The Ord/PartialEq inconsistency — technically still present, though reduced in impact

psuszyns · 2026-03-11T23:49:47Z

review comments addressed, @claude please redo the review again

claude · 2026-03-11T23:49:59Z

Claude finished @psuszyns's task in 4m 14s —— View job

PR Review: Feature/optimize unnest groupby (3rd Review)

Tasks:

Gather context from previous reviews
Read and analyze all changed files
Review logical optimizer and fused_array_transform
Review physical execution plan
Review integration tests
Check previous issues are resolved
Provide comprehensive feedback

Summary

Excellent progress across all three rounds of review. All critical and significant blockers from previous reviews are now resolved. The code is substantially cleaner and more correct. A few minor remaining issues are noted below.

What Was Fixed Well (Since Last Review)

Ord/PartialEq consistency — both now consistently use Display to_string() comparison. Hash also uses to_string(). All three traits are self-consistent. ✅
Stale "currently FAILS" comments — removed from both test_arithmetic_transform and test_conditional_transform. These tests now have proper assertions and run with the optimization applied. ✅
Dead code is_list_null — removed from common.rs. ✅
Schema allocation per row — build_mini_batch_schema is now pre-computed once before the row loop and passed into build_row_mini_batch. This avoids 10k+ heap allocations per batch. ✅
VcfQueryPlanner planner per query — physical_planner is now a struct field initialized in VcfQueryPlanner::new(), not recreated on each create_physical_plan call. ✅

Remaining Issues

1. Identity transform path doesn't apply row_mask filter to array outputs

In process_batch, passthrough columns are filtered with bool_mask to drop rows where all arrays are null/empty. But in the identity transform branch (transform_exprs.is_empty()), array output columns are pushed unchanged (unfiltered):

// physical/fused_array_transform_exec.rs:321-328
if self.transform_exprs.is_empty() {
    for col_name in &self.array_columns {
        let idx = input_schema.index_of(col_name)?;
        let col = batch.column(idx);
        let output_array = self.apply_identity_transform(col, col_name)?;
        output_columns.push(output_array); // ← NOT filtered by bool_mask!
    }
}

If any row has all-null or empty arrays (row_mask contains false), the passthrough columns will have num_output_rows rows while the array output columns will have num_rows rows. RecordBatch::try_new will then panic with a schema mismatch.

This path isn't reached via the optimizer (which always generates ≥1 transform expression), but the API allows it and the unit test test_identity_transform only uses data with non-empty arrays. The fix is to apply the same bool_mask filter in the identity branch:

let filtered_col = datafusion::arrow::compute::filter(col.as_ref(), &bool_mask)?;
output_columns.push(Arc::new(filtered_col));

Fix this →

2. Stale error message in test_mismatched_array_lengths

This minor issue from the previous review still hasn't been fixed. Lines 962 and 976 say "empty arrays case" but the test is test_mismatched_array_lengths:

// tests/integration_test.rs:961-962
assert!(
    plan_str_optimized.contains("FusedArrayTransform"),
    "FusedArrayTransform optimization was NOT applied for empty arrays case! ..."
//                                                       ^^^^^^^^^^^^ should be "mismatched lengths"
);
// ...
// line 974-977:
assert_eq!(baseline_str, optimized_str,
    "Baseline and optimized results differ for empty arrays case");
//                                         ^^^^^^^^^^^^ should be "mismatched lengths"

Fix this →

3. resolve_expr silently swallows errors

// logical/optimizer_rule.rs:313-325
fn resolve_expr(expr: &Expr, definitions: &HashMap<String, Expr>) -> Expr {
    expr.clone()
        .transform(|e| { ... })
        .map(|t| t.data)
        .unwrap_or_else(|_| expr.clone())  // ← silently returns unresolved expr on error
}

If expression tree traversal fails (e.g., due to an unexpected expression variant), the original unresolved expression is silently returned and passed to the physical planner. This makes bugs in expression resolution very hard to diagnose. Since resolve_expr is called within try_optimize which already returns Option<Result<_>>, it can propagate errors. Consider changing the signature to Result<Expr> and returning Err instead.

Minor Issues (Carried from Previous Review)

4. async-trait, serial_test, parquet dev-deps not from workspace

# Cargo.toml
async-trait = "0.1.88"   # not in workspace
# dev-dependencies:
serial_test = "3.1"       # not in workspace
parquet = "56.1"          # not in workspace

For a workspace crate, version management is cleaner with workspace deps. Low priority, but worth tidying up.

Minor: `i32` offset buffer could overflow for large batches

In apply_single_transform (line 375–399), the offset buffer for the output ListArray uses i32:

let mut offsets: Vec<i32> = vec![0];
let mut current_offset: i32 = 0;
// ...
current_offset += result_array.len() as i32;

If the total number of output elements in a single batch exceeds i32::MAX (~2.1 billion), this silently overflows (wraps in debug mode, UB in release). For genomics data with large arrays this is unlikely in a single batch, but Arrow's LargeList uses i64 for this reason. If the input type can be LargeList, the output should also use i64 offsets (i.e., produce a LargeListArray). For the current use case of List output this is probably fine, but worth a comment acknowledging the limitation.

Overall Assessment

The implementation is in very good shape. The primary actionable issue is #1 (identity transform path correctness with null/empty arrays), which is a real correctness bug in the API surface even if it isn't reached by the optimizer today. Issues #2 and #3 are easy one-line fixes. Issue #4 is cosmetic.

mwiewior · 2026-03-12T11:14:42Z

@claude pls review before merge

mwiewior · 2026-03-12T11:14:51Z

@codex pls review before merge

claude · 2026-03-12T11:14:54Z

Claude finished @mwiewior's task in 2m 49s —— View job

PR Review: Feature/optimize unnest groupby (Final Pre-Merge Review)

Tasks:

Gather context from previous reviews
Read and analyze all changed files
Check remaining issues from 3rd review are resolved
Check for any new issues
Provide final assessment

What Was Fixed Since Last Review

All three issues flagged as blockers in the 3rd review have been resolved:

1. Identity transform path row_mask filter — ✅ Fixed

fused_array_transform_exec.rs:327-329 now correctly applies bool_mask to array outputs in the identity branch:

let filtered_col = datafusion::arrow::compute::filter(col.as_ref(), &bool_mask)?;
let output_array = self.apply_identity_transform(&filtered_col, col_name)?;

A new unit test test_identity_transform_with_empty_array covers this path and verifies row filtering.

2. Stale "empty arrays" message in test_mismatched_array_lengths — ✅ Fixed

Lines 962 and 976 now correctly say "mismatched lengths case".

3. resolve_expr silently swallowing errors — ✅ Fixed

optimizer_rule.rs:326-337 now returns Result<Expr> and properly propagates errors through ? at the call sites. The unwrap_or_else silent fallback is gone.

Remaining Minor Issues

1. async-trait is a pinned direct dependency, not from workspace

# datafusion/bio-function-vcftools/Cargo.toml:16
async-trait = "0.1.88"

This is the only remaining dependency not managed via { workspace = true }. In Rust 2024 edition (which this crate uses), async functions in traits work natively — async-trait may not be needed at all. DataFusion 50.x's QueryPlanner and ExtensionPlanner traits may already work without it. If the dependency is genuinely needed, it should be added to the workspace. Check this →

2. Minor: Stale comment in test_mismatched_array_lengths

Line 967 still says "filter out the empty-array row" but the test is about mismatched-length arrays, not empty arrays:

// datafusion/bio-function-vcftools/tests/integration_test.rs:967
// Both baseline and optimized should filter out the empty-array row and have identical output
//                                                    ^^^^^^^^^^^^^^ misleading

Fix this →

3. Acknowledged: i32 offset buffer in apply_single_transform

// fused_array_transform_exec.rs:377-401
let mut offsets: Vec<i32> = vec![0];
let mut current_offset: i32 = 0;
// ...
current_offset += result_array.len() as i32;

This was flagged in the previous review. For the current genomics use case (VCF genotype arrays), i32::MAX (~2.1B elements per batch) is unlikely to be hit, so this is acceptable. A short comment documenting the limitation would be good practice.

Overall Assessment

This PR is ready to merge. All critical correctness bugs from three rounds of review have been resolved. The implementation is clean, well-tested, and the architecture is sound. The remaining issues are cosmetic and can be addressed post-merge.

The optimization correctly:

Detects Aggregate(array_agg) → Projection* → Unnest patterns
Rejects unsupported ORDER BY/DISTINCT/FILTER modifiers
Handles NULL and empty arrays with correct UNNEST semantics (row filtering)
Supports List, LargeList, and FixedSizeList column types
Pre-computes the mini-batch schema once per input batch (not per row)
Caches the physical planner instance in VcfQueryPlanner
Uses AtomicBool for thread-safe enable/disable

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e42b16b2e9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

psuszyns · 2026-03-12T17:41:24Z

@claude please:

review if the changes applied to the code after the code review didn't introduce any additional problems
search for memory leaks and ways to reduce memory usage

claude · 2026-03-12T17:41:42Z

Claude encountered an error —— View job

I'll analyze this and get back to you.

…ll be useful for processing genotypes from vcf

…tures -- -D warnings'

…are not supported

…ROUP BY clause

…n-ranges and bio-function-vcftools

Add a high-performance Table-Valued Function that replaces the unnest→transform→array_agg SQL pattern with direct Arrow array processing. Operates on flat List value arrays without materialization, yielding ~10-20x speedup over FusedArrayTransform for VCF sample QC. Features: - Site filtering (QUAL, avg GQ/DP thresholds) - Per-sample GT normalization, quality gating, PL correction, DS dosage - Writes multi-sample VCF directly (GT:GQ:DP:PL:DS FORMAT) - Sample names from comma-separated string, text file, or auto-generated - Configurable thresholds via optional positional args - 15 tests covering processing, writing, SQL e2e, and formula correctness Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Ensures the TVF is available for any session created via BioSessionExt, not just create_bio_session(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Output to `.vcf.gz` or `.vcf.bgz` is automatically gzip-compressed (level 6). Plain `.vcf` remains uncompressed. Uses flate2 with a type-erased WriterInner enum to avoid generics on every method. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace flate2 GzEncoder with noodles-bgzf Writer so .vcf.gz output is proper BGZF (blocked gzip with EOF marker), directly indexable by tabix/bcftools. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… extraction Two critical performance fixes: 1. Replace flat sample loop with variant-major nested loop, eliminating the O(variants) linear search in extract_pl_triple that ran for every sample (~180M comparisons/batch → 0). 2. Pre-extract all PL triples into flat Vecs in one pass via PlFlatData, using direct offset-based indexing into the inner ListArray's flat values buffer. Eliminates ~600K Arc<dyn Array> allocations per batch from ListArray::value() and per-sample downcast attempts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix #3: list_avg handles Int32 natively (sum as i64), avoids full f64 cast of flat value arrays for GQ/DP averaging. Fix #4: skip arrow::compute::filter (8 calls) when all variants pass site filtering — use original columns directly. Fix #5: writer buffers entire variant line into Vec<u8> before a single write_all call. Eliminates ~10 tiny write syscalls per sample (6M → 300 writes/batch for 2000 samples). Uses itoa for zero-alloc integer formatting. Fix #6: pre-computed lookup table for 10^(-PL/10) replaces 3× powf() per sample in dosage computation (~60ns → ~3ns each). Fix #8: FlatNumeric enum keeps GQ/DP as native Int32, compares with f64 thresholds directly — no cast to Float64 then back. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Per-test placement (per `porting-tests/detailed_plans/Parser.md`): - `src/allele.rs::tests::port_parser_*` (4 tests): subtests #17 (GA/GT minimise via `vcf_to_vep_allele`), #18/#19 (MNV-preserves-suffix — documented divergence from Perl's broader trim per Ensembl-Variation `trim_sequences()` rule at allele.rs:300-302), Axis B B1 (degenerate same-allele input collapses to "-/-" under MNV rule). - `src/transcript_consequence.rs::tests::port_parser_*` (3 tests): subtests #31 (bare chrom in valid set), #32 (`chr` prefix stripped by `normalize_chrom` then matches), #34 (unknown chrom rejected). - `tests/port_parser.rs` (4 tests): subtests #7 (nonexistent path — catches panic from `object_storage.rs:183` unwrap), #12 (gzipped VCF reads), #57 (CRLF line endings tolerated), Axis B B3 (`.vcf.bgz` extension routes through gzip reader). Axis B B2 (`buffer_variant_bounds` multi-chrom first-chrom-wins) is cross-referenced — already covered by `src/annotate_provider.rs::tests::port_input_buffer_axisB1_buffer_variant_bounds_multi_chrom_binds_to_first_chrom`. Coverage parity: 12 / 77 Perl-denominator unit-ports (incl. 3 Axis B); 47 architectural-no-analogue rows documented inline in `tests/port_parser.rs` (multi-format Parser base class, `validate_vf` warning emission, lowercase coercion, DB mode/LRG, 27 `detect_format` / STDIN / factory-dispatch rows); 21 blocked-future-work rows as commented-out stubs (all fold into existing future-work-vepyr.md entries — SO-term cluster engine blocker #2 × 6, chromosome-synonyms × 4, check_ref × 5, lookup_ref × 4, Uploaded_variation × 1, chr_filter × 1). **0 new future-work entries created**. v2 paradigm anchors: ~/.claude/skills/port-to-vepyr/references/v2-paradigm.md - sztywno 1:1: every Perl subtest gets a Rust analogue (test or documented stub). - standalone: no docker, no golden.vcf, no run_and_compare_csq. cargo test: 11 active passing tests across the 3 files; full lib suite 834 passed / 0 failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…no-analogue + 25 blocked-future-work stubs Adds tests/port_cache_transcript.rs (replaces v1 file deleted earlier): INTEGRATION-PORTS (5 active tests): - #24 transcript_cache_read_returns_at_least_one_row_for_chr21_region (filter_transcript base case — row exists in v115 cache) - #39 cache_read_first_stable_id_at_chr21_25_to_25_1mb (lex-min stable_id at chr21:25.0-25.1 Mb = ENST00000418942) - #47 cache_read_first_stable_id_for_full_buffer_pull (lex-min over 25-26 Mb = ENST00000284971; >=1 row gating) - #49 cache_read_empty_buffer_returns_zero_rows (chr21:1 has 0 overlapping transcripts in v115 fixture) AXIS B (1 active integration + 1 dedup-deferred): - B1 transcript_parquet_read_is_deterministic_across_sessions (two independent SessionContext reads return identical Arrow data) - B4 hgnc_propagation_across_adjacent_buffers_reuses_features (#[ignore] — deduplicates with bio-format-ensembl-cache/tests/ hgnc_propagation_tests.rs at cache-format level; annotate_provider- side coverage implicit in e2e ct_07 test) ARCH-NO-ANALOGUE (28 #[ignore] stubs): #4, #11, #14, #15, #17-#22, #28-#31, #40, #43, #45, #48, #54, #55, #57, #58, #70-#75 — each carries a one-line rationale per detailed_plan §Architectural-no-analogue justifications. BLOCKED-FUTURE-WORK (25 commented-out stubs): #13, #23, #25-#27, #32-#38, #41, #52, #53, #56, and plan-amend reclassifications #61-#69 (NEAREST CSQ column absent from engine output). Each cites its future-work-vepyr.md entry. Oracle values verified via VEP 115 on v115 cache (queried via pyarrow against transcript/21.parquet directly at commit time; Perl test values were v84-fixture-specific).

chatgpt-codex-connector Bot reviewed Mar 11, 2026

View reviewed changes

Comment thread datafusion/bio-function-vcftools/src/logical/optimizer_rule.rs Outdated

Comment thread datafusion/bio-function-vcftools/src/physical/fused_array_transform_exec.rs Outdated

Comment thread datafusion/bio-function-vcftools/src/physical/fused_array_transform_exec.rs Outdated

psuszyns force-pushed the feature/optimize_unnest_groupby branch from a2d0363 to e42b16b Compare March 12, 2026 10:51

chatgpt-codex-connector Bot reviewed Mar 12, 2026

View reviewed changes

Comment thread datafusion/bio-function-vcftools/src/logical/optimizer_rule.rs

Comment thread datafusion/bio-function-vcftools/src/physical/extension_planner.rs Outdated

Comment thread datafusion/bio-function-vcftools/src/logical/optimizer_rule.rs Outdated

psuszyns force-pushed the feature/optimize_unnest_groupby branch from 6f750ce to 3213fb8 Compare March 12, 2026 16:36

psuszyns added 12 commits March 22, 2026 18:11

Optimization of the unnest -> transform -> aggregate pattern which wi…

3bd5933

…ll be useful for processing genotypes from vcf

bio-function-vcftools: fix error with transformed column types

fbb381b

cargo fmt --all

8fb81ab

vcftools: changes after running 'cargo clippy --all-targets --all-fea…

5a789a7

…tures -- -D warnings'

fix incorrect handling of nulls and mismatched array lengths

d43a3e8

making code more readbale, extracting functions

8989be9

cargo fmt --all

4036a98

vcftools: ORDER BY, DISTINCT and FILTER modifiers inside array_agg() …

b19be99

…are not supported

vcftools: address review comments

0d167be

vcftools: update README.md

c20dcf1

vcftools: remove obsolete comment

e347544

vcftools: address most important comments from second review

f562312

psuszyns added 8 commits March 22, 2026 18:11

vcftools: reuse DefaultPhysicalPlanner

0e6d41d

vcftools: address remaining review comments

64d16af

vcftools: check for the presence of a row identifying column in the G…

cef1eb8

…ROUP BY clause

vcftools: address remaining review comments

c5935a0

vcftools: add support for transforming passthrough columns

82c23f5

vcftools: minor changes after review

65c23f1

vcftools: add test for multiple transforms

9f16ae8

formatting

87ee1dc

psuszyns force-pushed the feature/optimize_unnest_groupby branch from d87256b to 87ee1dc Compare March 22, 2026 17:12

psuszyns and others added 9 commits March 22, 2026 19:32

datafusion-bio-query-planner: single QueryPanner for both bio-functio…

9c42a46

…n-ranges and bio-function-vcftools

formatting + clippy

2612a2b

new bio-function module with unified BioSessionExt

f48e1bd

register vcf_sample_qc TVF in with_config_rt_bio session setup

242f56e

Ensures the TVF is available for any session created via BioSessionExt, not just create_bio_session(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vcf_sample_qc: use noodles-bgzf for tabix-compatible BGZF output

735474d

Replace flate2 GzEncoder with noodles-bgzf Writer so .vcf.gz output is proper BGZF (blocked gzip with EOF marker), directly indexable by tabix/bcftools. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

sitekwb mentioned this pull request May 28, 2026

Port Parser.t (v2 paradigm, Tier 3) #171

Open

4 tasks

Uh oh!

Conversation

psuszyns commented Mar 11, 2026

Uh oh!

mwiewior commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mwiewior commented Mar 11, 2026

Uh oh!

claude Bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Feature/optimize unnest groupby

Summary

Critical Issues

Significant Issues

Minor Issues

Test Quality

Overall Assessment

Uh oh!

mwiewior commented Mar 11, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

psuszyns commented Mar 11, 2026

Uh oh!

chatgpt-codex-connector Bot commented Mar 11, 2026

Uh oh!

claude Bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Feature/optimize unnest groupby (Re-review)

Summary

Remaining Issues from Previous Review

New Issues Found

Minor Issues (Unchanged from Previous Review)

What Was Fixed Well

Overall Assessment

Uh oh!

psuszyns commented Mar 11, 2026

Uh oh!

claude Bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Feature/optimize unnest groupby (3rd Review)

Summary

What Was Fixed Well (Since Last Review)

Remaining Issues

Minor Issues (Carried from Previous Review)

Minor: i32 offset buffer could overflow for large batches

Overall Assessment

Uh oh!

mwiewior commented Mar 12, 2026

Uh oh!

mwiewior commented Mar 12, 2026

Uh oh!

claude Bot commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Feature/optimize unnest groupby (Final Pre-Merge Review)

What Was Fixed Since Last Review

Remaining Minor Issues

Overall Assessment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

psuszyns commented Mar 12, 2026

Uh oh!

claude Bot commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

mwiewior commented Mar 11, 2026 •

edited

Loading

claude Bot commented Mar 11, 2026 •

edited

Loading

claude Bot commented Mar 11, 2026 •

edited

Loading

claude Bot commented Mar 11, 2026 •

edited

Loading

Minor: `i32` offset buffer could overflow for large batches

claude Bot commented Mar 12, 2026 •

edited

Loading

claude Bot commented Mar 12, 2026 •

edited

Loading