Feature/optimize unnest groupby#31
Conversation
|
@codex pls review |
|
@claude pls review |
|
Claude finished @mwiewior's task in 2m 21s —— View job PR Review: Feature/optimize unnest groupbyTasks:
SummaryThis PR introduces a new Critical Issues1.
// logical/fused_array_transform.rs:224-229
impl Ord for FusedArrayTransform {
fn cmp(&self, other: &Self) -> Ordering {
self.array_columns
.cmp(&other.array_columns)
.then_with(|| self.passthrough_columns.cmp(&other.passthrough_columns))
.then_with(|| self.output_columns.cmp(&other.output_columns))
// ← transform_exprs NOT compared here, but compared in PartialEq
}
}2.
// physical/fused_array_transform_exec.rs:314-317
let list_arr = col.as_any().downcast_ref::<ListArray>().ok_or_else(|| {
DataFusionError::Plan(format!("Column '{col_name}' is not a List type"))
})?;If the input contains 3. Failing tests committed without
// tests/integration_test.rs:953-966
// NOTE: This test currently FAILS because the FusedArrayTransform optimization
// does NOT support transformations in a separate CTE yet.These tests will break CI. Either mark them Significant Issues4. Code duplication: The function appears identically (or near-identically) in:
This is a DRY violation and future bug hotspot. Consider extracting to a shared 5. Silent When transform expression type inference fails, the code falls back silently to // logical/fused_array_transform.rs:109-119
warn!("Type inference failed for output '{output_col}': {e}. Using Float64 fallback.");
DataType::Float646. // logical/fused_array_transform.rs:195-198
for expr in &self.transform_exprs {
format!("{expr:?}").hash(state);
}This is fragile — two semantically identical expressions with different formatting would not match; two expressions with different semantics could theoretically match. This is also potentially slow for large expressions. Since DataFusion 7. Environment variable-based feature toggle is not thread-safe The // README.md:71 (Rust 2024 edition requires unsafe for set_var)
std::env::set_var("BIO_FUSED_ARRAY_TRANSFORM", "1");8. Missing
Minor Issues9. // physical/query_planner.rs:36-38
let physical_planner = DefaultPhysicalPlanner::with_extension_planners(vec![Arc::new(
FusedArrayTransformPlanner::new(),
)]);This is correct but creates a new planner object on every call. Consider caching the planner or making it a field of 10. Crate name uses singular "vcftools" inconsistently The Cargo.toml package name is 11. # datafusion/bio-function-vcftools/Cargo.toml:22
parquet = "56.1"Should likely use 12. async-trait = "0.1.88"This is a pinned version not in workspace dependencies. Consider adding to workspace or at least leaving a comment about the version requirement. Test Quality
Overall AssessmentThe core idea is solid and well-implemented. The main blockers are:
|
|
@codex pls review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 28e8755918
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
To use Codex here, create a Codex account and connect to github. |
|
Claude finished @psuszyns's task in 3m 57s —— View job PR Review: Feature/optimize unnest groupby (Re-review)Tasks:
SummaryGood progress from the previous review — most of the critical and significant issues have been addressed. The shared Remaining Issues from Previous Review1.
The second direction is broken: two structurally different expressions could render to the same // logical/fused_array_transform.rs:231-242
.then_with(|| {
for (a, b) in self.transform_exprs.iter().zip(other.transform_exprs.iter()) {
let cmp = a.to_string().cmp(&b.to_string()); // ← Display string
// ...
}
})
// But PartialEq uses: a == b ← structural equalityThe safest fix: since 2. // tests/integration_test.rs:1311-1315
// NOTE: This test currently FAILS because the FusedArrayTransform optimization
// does NOT support transformations in a separate CTE yet.
//
// Same limitation as test_arithmetic_transform - transformation CTEs not detected.This comment says the test currently fails, yet the test isn't marked New Issues Found3. Dead code: // common.rs:140-158
pub fn is_list_null(column: &dyn Array, row_idx: usize, col_name: &str) -> Result<bool> {
// ...
}This function is 4. // physical/fused_array_transform_exec.rs:503
let schema = Arc::new(Schema::new(fields));
RecordBatch::try_new(schema, columns).map_err(DataFusionError::from)
5. // physical/query_planner.rs:36-38
let physical_planner = DefaultPhysicalPlanner::with_extension_planners(vec![Arc::new(
FusedArrayTransformPlanner::new(),
)]);This allocates a new 6. // tests/integration_test.rs:960-963
assert!(
plan_str_optimized.contains("FusedArrayTransform"),
"FusedArrayTransform optimization was NOT applied for empty arrays case! ..."
// ^^^^^^^^^^^
// This message says "empty arrays" but the test is for "mismatched lengths"
);And similarly at lines 974-977: assert_eq!(baseline_str, optimized_str,
"Baseline and optimized results differ for empty arrays case");
// ^^^^^^^^^^ should be "mismatched lengths"Minor Issues (Unchanged from Previous Review)7. What Was Fixed Well
Overall AssessmentThe PR is substantially improved. The two blockers are:
|
|
review comments addressed, @claude please redo the review again |
|
Claude finished @psuszyns's task in 4m 14s —— View job PR Review: Feature/optimize unnest groupby (3rd Review)Tasks:
SummaryExcellent progress across all three rounds of review. All critical and significant blockers from previous reviews are now resolved. The code is substantially cleaner and more correct. A few minor remaining issues are noted below. What Was Fixed Well (Since Last Review)
Remaining Issues1. Identity transform path doesn't apply In // physical/fused_array_transform_exec.rs:321-328
if self.transform_exprs.is_empty() {
for col_name in &self.array_columns {
let idx = input_schema.index_of(col_name)?;
let col = batch.column(idx);
let output_array = self.apply_identity_transform(col, col_name)?;
output_columns.push(output_array); // ← NOT filtered by bool_mask!
}
}If any row has all-null or empty arrays ( This path isn't reached via the optimizer (which always generates let filtered_col = datafusion::arrow::compute::filter(col.as_ref(), &bool_mask)?;
output_columns.push(Arc::new(filtered_col));2. Stale error message in This minor issue from the previous review still hasn't been fixed. Lines 962 and 976 say // tests/integration_test.rs:961-962
assert!(
plan_str_optimized.contains("FusedArrayTransform"),
"FusedArrayTransform optimization was NOT applied for empty arrays case! ..."
// ^^^^^^^^^^^^ should be "mismatched lengths"
);
// ...
// line 974-977:
assert_eq!(baseline_str, optimized_str,
"Baseline and optimized results differ for empty arrays case");
// ^^^^^^^^^^^^ should be "mismatched lengths"3. // logical/optimizer_rule.rs:313-325
fn resolve_expr(expr: &Expr, definitions: &HashMap<String, Expr>) -> Expr {
expr.clone()
.transform(|e| { ... })
.map(|t| t.data)
.unwrap_or_else(|_| expr.clone()) // ← silently returns unresolved expr on error
}If expression tree traversal fails (e.g., due to an unexpected expression variant), the original unresolved expression is silently returned and passed to the physical planner. This makes bugs in expression resolution very hard to diagnose. Since Minor Issues (Carried from Previous Review)4. # Cargo.toml
async-trait = "0.1.88" # not in workspace
# dev-dependencies:
serial_test = "3.1" # not in workspace
parquet = "56.1" # not in workspaceFor a workspace crate, version management is cleaner with workspace deps. Low priority, but worth tidying up. Minor:
|
a2d0363 to
e42b16b
Compare
|
@claude pls review before merge |
|
@codex pls review before merge |
|
Claude finished @mwiewior's task in 2m 49s —— View job PR Review: Feature/optimize unnest groupby (Final Pre-Merge Review)Tasks:
What Was Fixed Since Last ReviewAll three issues flagged as blockers in the 3rd review have been resolved: 1. Identity transform path
let filtered_col = datafusion::arrow::compute::filter(col.as_ref(), &bool_mask)?;
let output_array = self.apply_identity_transform(&filtered_col, col_name)?;A new unit test 2. Stale "empty arrays" message in Lines 962 and 976 now correctly say 3.
Remaining Minor Issues1. # datafusion/bio-function-vcftools/Cargo.toml:16
async-trait = "0.1.88"This is the only remaining dependency not managed via 2. Minor: Stale comment in Line 967 still says // datafusion/bio-function-vcftools/tests/integration_test.rs:967
// Both baseline and optimized should filter out the empty-array row and have identical output
// ^^^^^^^^^^^^^^ misleading3. Acknowledged: // fused_array_transform_exec.rs:377-401
let mut offsets: Vec<i32> = vec![0];
let mut current_offset: i32 = 0;
// ...
current_offset += result_array.len() as i32;This was flagged in the previous review. For the current genomics use case (VCF genotype arrays), Overall AssessmentThis PR is ready to merge. All critical correctness bugs from three rounds of review have been resolved. The implementation is clean, well-tested, and the architecture is sound. The remaining issues are cosmetic and can be addressed post-merge. The optimization correctly:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e42b16b2e9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
6f750ce to
3213fb8
Compare
|
@claude please:
|
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
…ll be useful for processing genotypes from vcf
…tures -- -D warnings'
…are not supported
d87256b to
87ee1dc
Compare
…n-ranges and bio-function-vcftools
Add a high-performance Table-Valued Function that replaces the unnest→transform→array_agg SQL pattern with direct Arrow array processing. Operates on flat List value arrays without materialization, yielding ~10-20x speedup over FusedArrayTransform for VCF sample QC. Features: - Site filtering (QUAL, avg GQ/DP thresholds) - Per-sample GT normalization, quality gating, PL correction, DS dosage - Writes multi-sample VCF directly (GT:GQ:DP:PL:DS FORMAT) - Sample names from comma-separated string, text file, or auto-generated - Configurable thresholds via optional positional args - 15 tests covering processing, writing, SQL e2e, and formula correctness Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ensures the TVF is available for any session created via BioSessionExt, not just create_bio_session(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Output to `.vcf.gz` or `.vcf.bgz` is automatically gzip-compressed (level 6). Plain `.vcf` remains uncompressed. Uses flate2 with a type-erased WriterInner enum to avoid generics on every method. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace flate2 GzEncoder with noodles-bgzf Writer so .vcf.gz output is proper BGZF (blocked gzip with EOF marker), directly indexable by tabix/bcftools. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… extraction Two critical performance fixes: 1. Replace flat sample loop with variant-major nested loop, eliminating the O(variants) linear search in extract_pl_triple that ran for every sample (~180M comparisons/batch → 0). 2. Pre-extract all PL triples into flat Vecs in one pass via PlFlatData, using direct offset-based indexing into the inner ListArray's flat values buffer. Eliminates ~600K Arc<dyn Array> allocations per batch from ListArray::value() and per-sample downcast attempts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix #3: list_avg handles Int32 natively (sum as i64), avoids full f64 cast of flat value arrays for GQ/DP averaging. Fix #4: skip arrow::compute::filter (8 calls) when all variants pass site filtering — use original columns directly. Fix #5: writer buffers entire variant line into Vec<u8> before a single write_all call. Eliminates ~10 tiny write syscalls per sample (6M → 300 writes/batch for 2000 samples). Uses itoa for zero-alloc integer formatting. Fix #6: pre-computed lookup table for 10^(-PL/10) replaces 3× powf() per sample in dosage computation (~60ns → ~3ns each). Fix #8: FlatNumeric enum keeps GQ/DP as native Int32, compares with f64 thresholds directly — no cast to Float64 then back. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No description provided.