feat(AliasDataFrame): Phase 9e - Arrow compute disabled, NumPy path optimized

miranov25 · miranov25 · commit ce5c583c624a · 2025-12-02T09:33:04.000+01:00

Findings from micro-benchmark:
- PyArrow compute is ~16x slower than NumPy for element-wise math
- NumPy: 2.6 ms vs Arrow: 41.2 ms (timed without conversion overhead)
- Root cause: Arrow lacks expression fusion (4 kernel launches vs 1 fused NumPy loop)

Changes:
- Add _analyze_expression() for unified AST analysis
- Disable Arrow compute path (returns None immediately)
- Keep Phase 9b Arrow scatter (pc.take) - still beneficial
- Code cleanup in _do_materialize()
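
As a rough illustration of what a single-pass AST analysis of an alias expression can look like, the sketch below collects the referenced names, the called functions, and the number of binary operations. The helper `analyze_expression_sketch`, the `ExpressionInfo` container, and the sample expression are hypothetical and only use the standard `ast` module; the real `_analyze_expression()` may be structured quite differently.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class ExpressionInfo:
    """Hypothetical result container, not the real AliasDataFrame structure."""
    names: set = field(default_factory=set)      # columns/variables referenced
    functions: set = field(default_factory=set)  # called functions, e.g. "np.sqrt"
    n_binops: int = 0                            # count of binary operations


def _dotted_name(node):
    """Return 'a.b.c' for a pure Name/Attribute chain, else None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def analyze_expression_sketch(expr):
    """Parse expr once and summarize what it references."""
    tree = ast.parse(expr, mode="eval")
    info = ExpressionInfo()
    func_roots = set()  # id() of Name nodes that start a called dotted name

    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = _dotted_name(node.func)
            if name is not None:
                info.functions.add(name)
                root = node.func
                while isinstance(root, ast.Attribute):
                    root = root.value
                func_roots.add(id(root))
        elif isinstance(node, ast.BinOp):
            info.n_binops += 1

    # Every remaining Name is treated as a column/variable reference.
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and id(node) not in func_roots:
            info.names.add(node.id)
    return info


info = analyze_expression_sketch("np.sqrt(x**2 + y**2) * scale")
```

A summary of this kind is enough to decide, in one place, which columns must be present and whether an expression is simple element-wise math.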

Performance (real data, 13.5M rows):
- Materialize time: 11.7 s
- Memory: 2.78 GB → 4.18 GB

Strategy: Arrow for I/O + scatter, NumPy for compute

Known issue: cleanTemporary not removing intermediate columns (pre-existing bug)
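
For reference, the cleanTemporary pass in the diff of this commit also drops subframe join columns named `{col}__{subframe_name}`. The selection rule can be sketched as a standalone function, assuming plain lists and sets in place of `self.df.columns` and `self._subframes.subframes`; the function name and the sample column names are hypothetical.

```python
def subframe_join_columns(columns, subframe_names, targets):
    """Return columns matching '{col}__{subframe_name}' that are not targets."""
    targets_set = set(targets)
    names = set(subframe_names)
    to_drop = []
    for col in columns:
        # Requested targets are always kept, even if they match the pattern.
        if col in targets_set or "__" not in col:
            continue
        # Only the suffix after the LAST '__' is compared to subframe names.
        _, suffix = col.rsplit("__", 1)
        if suffix in names:
            to_drop.append(col)
    return to_drop


cols = ["pt", "dEdx__track", "x__y__clusters", "tmp__other", "keep__track"]
dropped = subframe_join_columns(cols, {"track", "clusters"}, ["keep__track"])
```

Note that this rule treats any non-target column whose suffix after the last `__` matches a subframe name as temporary, including user columns that happen to follow the same naming pattern.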

Tests: 727+ passed
Reviewed-by: Claude (Architect), GPT, Gemini
diff --git a/UTILS/dfextensions/AliasDataFrame/AliasDataFrame.py b/UTILS/dfextensions/AliasDataFrame/AliasDataFrame.py
@@ -3381,7 +3381,23 @@ def _do_materialize():
             # BATCH DROP: Single drop instead of per-column removal
             if cleanTemporary and with_dependencies:
                 targets_set = set(targets)
+                
+                # 1. Drop intermediate alias dependencies (existing logic)
                 cols_to_drop = [c for c in added if c not in targets_set and c in self.df.columns]
+                
+                # 2. Drop subframe join columns (NEW: fix for subframe temporaries)
+                # These have pattern: {col}__{subframe_name}
+                if hasattr(self, '_subframes') and hasattr(self._subframes, 'subframes'):
+                    subframe_names = set(self._subframes.subframes.keys())
+                    for col in list(self.df.columns):
+                        # Check if column is a subframe join column
+                        if '__' in col and col not in targets_set:
+                            # Extract suffix after last '__'
+                            parts = col.rsplit('__', 1)
+                            if len(parts) == 2 and parts[1] in subframe_names:
+                                if col not in cols_to_drop:
+                                    cols_to_drop.append(col)
+                
                 if cols_to_drop:
                     self.df.drop(columns=cols_to_drop, inplace=True)
                     if verbose: