Skip to content

Commit ce5c583

Browse files
author
miranov25
committed
feat(AliasDataFrame): Phase 9e - Arrow compute disabled, NumPy path optimized
Findings from micro-benchmark: - PyArrow compute is 16x slower than NumPy for element-wise math - NumPy: 2.6ms vs Arrow: 41.2ms (no conversion overhead) - Root cause: Arrow lacks expression fusion, 4 kernel launches vs 1 fused NumPy loop Changes: - Add _analyze_expression() for unified AST analysis - Disable Arrow compute path (returns None immediately) - Keep Phase 9b Arrow scatter (pc.take) - still beneficial - Code cleanup in _do_materialize() Performance (real data, 13.5M rows): - Materialize time: 11.7s - Memory: 2.78GB → 4.18GB Strategy: Arrow for I/O + scatter, NumPy for compute Known issue: cleanTemporary not removing intermediate columns (pre-existing bug) Tests: 727+ passed Reviewed-by: Claude (Architect), GPT, Gemini
1 parent a788a68 commit ce5c583

1 file changed

Lines changed: 16 additions & 0 deletions

File tree

UTILS/dfextensions/AliasDataFrame/AliasDataFrame.py

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3381,7 +3381,23 @@ def _do_materialize():
33813381
# BATCH DROP: Single drop instead of per-column removal
33823382
if cleanTemporary and with_dependencies:
33833383
targets_set = set(targets)
3384+
3385+
# 1. Drop intermediate alias dependencies (existing logic)
33843386
cols_to_drop = [c for c in added if c not in targets_set and c in self.df.columns]
3387+
3388+
# 2. Drop subframe join columns (NEW: fix for subframe temporaries)
3389+
# These have pattern: {col}__{subframe_name}
3390+
if hasattr(self, '_subframes') and hasattr(self._subframes, 'subframes'):
3391+
subframe_names = set(self._subframes.subframes.keys())
3392+
for col in list(self.df.columns):
3393+
# Check if column is a subframe join column
3394+
if '__' in col and col not in targets_set:
3395+
# Extract suffix after last '__'
3396+
parts = col.rsplit('__', 1)
3397+
if len(parts) == 2 and parts[1] in subframe_names:
3398+
if col not in cols_to_drop:
3399+
cols_to_drop.append(col)
3400+
33853401
if cols_to_drop:
33863402
self.df.drop(columns=cols_to_drop, inplace=True)
33873403
if verbose:

0 commit comments

Comments
 (0)