11# AliasDataFrame Phase History
22
33> ** Purpose** : Development history for architecture reviews and restart prompts.
4- > ** Last Updated** : 2025-12-11
4+ > ** Last Updated** : 2026-03-27
55> ** Maintained By** : Marian Ivanov (miranov25)
66
77## How to Use This File
@@ -20,6 +20,7 @@ This file is intended for AI reviewers and human collaborators as a **restart co
2020## Table of Contents
2121
2222- [ Overview] ( #overview )
23+ - [ Phase 13: Advanced Features] ( #phase-13-advanced-features )
2324- [ Phase 9: PyArrow Acceleration] ( #phase-9-pyarrow-acceleration )
2425- [ Phase 8: Numba Acceleration] ( #phase-8-numba-acceleration )
2526- [ Phase 7: Lazy Loading] ( #phase-7-lazy-loading )
@@ -29,6 +30,7 @@ This file is intended for AI reviewers and human collaborators as a **restart co
2930- [ Phase 3: Benchmark Infrastructure] ( #phase-3-benchmark-infrastructure )
3031- [ Phase 2: Batch Optimization] ( #phase-2-batch-optimization )
3132- [ Phase 1: Foundation & Schema] ( #phase-1-foundation--schema )
33+ - [ Bug Fixes] ( #bug-fixes )
3234- [ Architecture Decisions] ( #architecture-decisions )
3335- [ Performance Summary] ( #performance-summary )
3436
@@ -40,7 +42,7 @@ AliasDataFrame is a high-performance data analysis framework for particle physic
4042
4143** Key Metrics:**
4244- Performance: 60-770x speedups achieved
43- - Test Coverage: 1178 + tests passing
45+ - Test Coverage: 1425 + tests passing
4446- Lines of Code: ~ 10,000 (AliasDataFrame.py)
4547
4648** Development Team:**
@@ -50,6 +52,85 @@ AliasDataFrame is a high-performance data analysis framework for particle physic
5052
5153---
5254
55+ ## Phase 13: Advanced Features
56+
57+ ** Dates** : 2026-03-22 to 2026-03-27
58+ ** Status** : 🔄 In Progress
59+
60+ ### Phase 13.12.DF: Profile Enhancements
61+ ** Date** : 2026-03-22
62+ ** Commit** : ` 8e6e5d87d60206ad24b39af5847424d1d5db3c9d `
63+
64+ dfdraw profile() enhancements:
65+ - F1: ` return_data=True ` exports profile statistics as DataFrame
66+ - F2: ` min_entries=3 ` suppresses low-statistics bins
67+ - F3: ` group_by_bins ` /` group_by_quantiles ` auto-bins float columns
68+ - F4: ` sort_groups=True ` sorts legend numerically/alphabetically
69+ - v1.1: ` weights ` parameter for weighted mean/std/sem
70+
71+ ** Tests** : 19 tests (16 original + 3 weights)
72+
73+ ### Phase 13.9.ADF: PolynomialSpec
74+ ** Date** : 2026-03-23
75+ ** Commit** : ` eb54d3bc16e216e057da32c50a7a67e9a5dac568 `
76+
77+ N-dimensional polynomial specification for calibration workflows:
78+ - ` PolynomialSpec.py ` (337 lines) — polynomial term generation
79+ - ` basis_expressions() ` → (key_name, expression) tuples for GB Regression
80+ - ` numba_evaluator() ` → Numba JIT with subframe coefficient lookup (42× vs eval)
81+ - ` to_schema() ` /` from_schema() ` — JSON serialization
82+ - ` to_root_expression() ` — C++ expression for TTree::Draw
83+
84+ New methods on AliasDataFrame:
85+ - ` register_function(name, func, overwrite=False) ` — generic callable registry
86+ - ` register_polynomial_from_subframe() ` — polynomial alias from subframe coefficients
87+
88+ ** Architecture** : Coefficient matrix stays at subframe size (e.g., 36×96 = 27 KB). No materialization to main frame length.
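
The memory argument above can be illustrated with a minimal sketch. It is not the `numba_evaluator()` implementation: the function name `evaluate_with_lookup`, the 1D polynomial in `x`, and the per-row `group_index` array are assumptions made for illustration. The sketch gathers one coefficient column at a time from the small matrix instead of expanding the full matrix to main-frame length.

```python
import numpy as np

def evaluate_with_lookup(coeffs, group_index, x):
    """Evaluate sum_k coeffs[g, k] * x**k per row, with g taken from group_index.

    Hypothetical helper: coeffs keeps its (n_groups, n_terms) shape throughout;
    only one gathered column of length N exists at any time.
    """
    x = np.asarray(x, dtype=np.float64)
    out = np.zeros(x.shape[0], dtype=np.float64)
    power = np.ones_like(x)           # x**0
    for k in range(coeffs.shape[1]):
        out += coeffs[group_index, k] * power
        power = power * x             # advance to x**(k+1)
    return out

# Size check for the 36x96 example quoted in the text (float64 assumed).
matrix_bytes = np.zeros((36, 96), dtype=np.float64).nbytes

# Tiny usage example: two groups, linear polynomials.
coeffs = np.array([[1.0, 2.0],   # group 0: 1 + 2x
                   [0.0, 3.0]])  # group 1: 3x
group_index = np.array([0, 1, 0])
values = evaluate_with_lookup(coeffs, group_index, np.array([1.0, 2.0, 3.0]))
```

With float64 storage, 36 × 96 values occupy 27,648 bytes, which matches the 27 KB figure.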
89+
90+ ** Tests** : 26 new tests, 1402 total passed
91+
92+ ### Phase 13.10.ADF: register_evaluator
93+ ** Date** : 2026-03-27
94+ ** Commit** : ` 66164aadd6dd8216d18e545e1df5dac87ba8a087 `
95+
96+ GroupByRegressionEvaluator integration:
97+ - ` register_evaluator(name, evaluator, coord_columns, predictor_columns, overwrite) `
98+ - Wraps any object with ` .evaluate(positions: dict) → ndarray `
99+ - Enables lazy evaluation via ` add_alias('dy_corr', 'corr_I1(xM, driftM, dsectorM)') `
100+ - Multi-predictor: auto-selects the predictor when only one exists; raises ValueError when several exist and ` predictor_columns ` is not given
101+ - Schema stores interface contract only — evaluator must be re-registered after load
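
The duck-typed contract listed above can be sketched as follows. This is an illustration under stated assumptions, not the `register_evaluator()` code: `LinearEvaluator`, `wrap_evaluator`, and `select_predictor` are hypothetical names, and the wrapper only shows how positional alias arguments could be mapped onto the `positions` dict.

```python
import numpy as np

class LinearEvaluator:
    """Toy object satisfying the contract: .evaluate(positions: dict) -> ndarray."""
    def evaluate(self, positions):
        return 2.0 * positions["xM"] + positions["driftM"]

def select_predictor(available, predictor_columns=None):
    """Assumed selection rule: explicit choice wins, a single predictor is implicit."""
    if predictor_columns is not None:
        return list(predictor_columns)
    if len(available) == 1:
        return list(available)
    raise ValueError("multiple predictors available; pass predictor_columns")

def wrap_evaluator(evaluator, coord_columns):
    """Return a callable taking one array per coordinate, in coord_columns order."""
    def call(*arrays):
        if len(arrays) != len(coord_columns):
            raise ValueError(f"expected {len(coord_columns)} arguments, got {len(arrays)}")
        positions = {name: np.asarray(a, dtype=np.float64)
                     for name, a in zip(coord_columns, arrays)}
        return np.asarray(evaluator.evaluate(positions))
    return call

corr_I1 = wrap_evaluator(LinearEvaluator(), ["xM", "driftM", "dsectorM"])
result = corr_I1([1.0, 2.0], [10.0, 20.0], [0.5, 0.5])
```

Because the wrapper holds a live Python object, only the interface (name and column lists) can be stored in a schema, which is consistent with the re-registration requirement after load.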
102+
103+ ** Tests** : 24 new tests, 1425 total passed
104+
105+ ---
106+
107+ ## Bug Fixes
108+
109+ ### BUG_AliasDataFrame_20260324_draw_subframe_resolution
110+ ** Date** : 2026-03-25
111+ ** Status** : ✅ Fixed
112+
113+ ** Problem** : ` adf.draw("Side.dy:row") ` fails because dfdraw receives the unresolved ` Side.dy ` and pandas eval interprets the dot as attribute access.
114+
115+ ** Commits** :
116+ - ` 75609ea2 ` — draw() fix (5 tests)
117+ - ` 3c069a55 ` — draw_batch() fix
118+ - ` b4c845c1 ` — consolidated fix (19 tests, 1392 passed)
119+ - ` 0f4fd63f ` — draw_figures() extension (4 tests, 23 total draw subframe tests)
120+
121+ ** Solution** : Detect ` Subframe.column ` patterns in expr/selection/group_by, materialize via ` pd.merge ` on index columns only, rename to underscore format (` Side_dy ` ) for pandas eval safety, rewrite all expressions to match.
122+
123+ ** Key insight** : ` pd.merge ` operates only on index columns, NOT full DataFrame — memory is O(N × index_cols), not O(N × all_cols).
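
The resolution steps described above can be sketched in a few lines. This is a simplified illustration, not the code from the listed commits: `resolve_subframe_columns`, the `subframes` dict, and the `index_columns` mapping are hypothetical names, and each subframe is assumed to have unique index keys.

```python
import re
import pandas as pd

# Matches dotted tokens such as Side.dy; plain numbers like 1.5 do not match.
_DOTTED = re.compile(r"\b([A-Za-z_]\w*)\.([A-Za-z_]\w*)\b")

def resolve_subframe_columns(expr, main, subframes, index_columns):
    """Return (rewritten expression, {underscore_name: values aligned to main})."""
    resolved = {}
    for name, col in _DOTTED.findall(expr):
        if name not in subframes:
            continue  # not a registered subframe (e.g. np.sqrt): leave untouched
        keys = index_columns[name]
        # Only the key columns of the main frame take part in the merge.
        merged = pd.merge(main[keys], subframes[name][keys + [col]],
                          on=keys, how="left")
        resolved[f"{name}_{col}"] = merged[col].to_numpy()

    def _rename(match):
        name, col = match.groups()
        return f"{name}_{col}" if name in subframes else match.group(0)

    return _DOTTED.sub(_rename, expr), resolved

main = pd.DataFrame({"sector": [0, 1, 1, 0], "row": [10, 20, 30, 40]})
side = pd.DataFrame({"sector": [0, 1], "dy": [0.5, -0.25]})
rewritten, columns = resolve_subframe_columns(
    "Side.dy:row", main, {"Side": side}, {"Side": ["sector"]})
```

The left merge keeps the row order of the main frame, so the resolved values can be attached as a new column named `Side_dy` before the rewritten expression is evaluated.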
124+
125+ ### Production Fixes (2026-03-24 to 2026-03-26)
126+ ** Commits** :
127+ - ` 4c45c4dd ` , ` abd5d518 ` , ` cef9bfef ` — ` fill_value ` parameter for ` add_alias() ` (division-by-zero handling)
128+ - ` f7d2335c ` — ` register_polynomial_from_subframe() ` accepts ` overwrite ` parameter
129+ - ` f7d2335c ` — ` _draw_single_figure() ` prints errors to console when verbose=True
130+ - ` 486ff33c ` , ` 6930690d ` — ` draw_figures() ` per-figure defaults cascade (temporary fix)
131+
132+ ---
133+
53134## Phase 9: PyArrow Acceleration
54135
55136** Dates** : 2025-12-01 to 2025-12-02
@@ -85,8 +166,7 @@ AliasDataFrame is a high-performance data analysis framework for particle physic
85166** Date** : 2025-12-02
86167** Commit** : ` a788a687c6d0684fbaa71427e443451dd09445b7 `
87168
88- ** Finding** : PyArrow compute 16x slower than NumPy for element-wise math
89- - NumPy: 2.6ms vs Arrow: 41.2ms
169+ ** Finding** : PyArrow compute 8-10× slower than NumPy for element-wise math
90170- Root cause: Arrow lacks expression fusion
91171
92172** Design Decision** : Keep Arrow for I/O + scatter operations, use NumPy for compute. This hybrid approach gets the best of both worlds.
@@ -125,7 +205,6 @@ AliasDataFrame is a high-performance data analysis framework for particle physic
125205** Performance** :
126206- Phase 7: 0.34s → Phase 8c: 0.228s (1.5x faster)
127207- Total speedup vs baseline: 10.8x
128- - Efficiency vs theoretical: 6.8%
129208
130209---
131210
@@ -179,11 +258,6 @@ AliasDataFrame is a high-performance data analysis framework for particle physic
179258- ` __file_idx__ ` column for file provenance
180259- Custom exception hierarchy
181260
182- ** Architecture** :
183- - Composition pattern ✅
184- - LRU cache K=8 ✅
185- - Single pd.concat with ignore_index=True ✅
186-
187261** Tests** : 51 new chain tests, 136 total lazy/chain
188262
189263### Phase 7.5a: Lazy Single-File Subframes
@@ -205,22 +279,14 @@ AliasDataFrame is a high-performance data analysis framework for particle physic
205279- Reuses LazyChainReader from Phase 7.4
206280- Validation modes: first, strict, intersection, union
207281
208- ** Architecture Decisions** (4-reviewer consensus):
209- - Add 'type': 'file' to single-file config ✅
210- - No file index for subframes ✅
211- - Trust reader protocol ✅
212- - Schema has both 'index' and 'index_columns' ✅
213-
214282** Tests** : 19 new, 52 total lazy subframe tests
215283
216284---
217285
218286## Phase 6: dfdraw Integration & Drawing Validation
219287
220288** Dates** : 2025-12-07 to 2025-12-11
221- ** Status** : 🔄 In Progress (6.8e/c)
222-
223- > ** Note** : The "6.8" label has been extended. Original 6.8 (dfdraw integration: axis titles + slice-first evaluation) is now part of the broader "Phase 6.8: Drawing Validation & Benchmarks" package.
289+ ** Status** : ✅ Complete
224290
225291### Phase 6.8 Core: Duck-Typed Axis Titles
226292** Date** : 2025-12-08
@@ -240,33 +306,6 @@ AliasDataFrame is a high-performance data analysis framework for particle physic
240306- ` _eval_alias_on_df() ` for arbitrary DataFrame subset
241307- Performance: 12M rows with 10K mask evaluates on 10K only
242308
243- ### Phase 6.8d: Synthetic Data Generator
244- ** Date** : 2025-12-11
245-
246- - Extended ` generate_synthetic_data.py ` with ` --chain ` and ` --subframe-chain ` flags
247- - Created examples/ directory with four documented examples for draw invariance and lazy loading
248- - Known relationships for testing:
249- - ` y_derived = 2 * x ` (exact, no noise)
250- - ` gain[sec] = 1.0 + 0.01 * sec `
251- - ` gain[run,sec] = 1.0 + 0.01*sec + 0.001*(run-1000) `
252-
253- ### Phase 6.8e/c: Draw Invariance Tests (Current)
254- ** Date** : 2025-12-11
255-
256- - ` test_draw_invariance.py ` : Core invariants, subframe joins, dtype preservation
257- - ` test_draw_chain_integration.py ` : Chain loading modes, lazy subframes, error handling
258- - Tests validate eager vs lazy equivalence across all loading modes
259-
260- ** draw() API** (from dfdraw):
261- - Returns: ` (fig, ax, stats_dict) ` tuple
262- - ` stats_dict ` keys: ` 'n' ` , ` 'mean' ` , ` 'std' ` , ` 'min' ` , ` 'max' `
263- - Use ` lazy=True ` or call ` materialize_alias() ` before draw
264-
265- ### Phase 6.8 Remaining (per spec)
266- - ** 6.8a** : Execution-order fixes (if needed from test failures)
267- - ** 6.8b** : Per-spec selection (nice-to-have)
268- - ** 6.8f** : Performance benchmarks (nice-to-have)
269-
270309---
271310
272311## Phase 5: RDataFrame Integration
@@ -293,7 +332,7 @@ AliasDataFrame is a high-performance data analysis framework for particle physic
293332
294333### Phase 5.3: Runtime Composite Keys
295334** Date** : 2025-12-06
296- ** Commit** : ` d5591a07b252fe432afefe68b94ceced68cbfbb1 `
335+ ** Commit** : ` d5591a07b252fe432afefe68b94ceced68cbfbb1 `
297336
298337- Runtime composite key generation for >2 index columns
299338- TMemFile + SetFile approach
@@ -444,6 +483,15 @@ All major decisions require consensus from 3+ AI reviewers:
444483| intersection | Common branches only |
445484| union | All branches, fill missing |
446485
486+ ### Phase 13 Decisions
487+
488+ | Decision | Rationale |
489+ | ----------| -----------|
490+ | PolynomialSpec ` tgSlp ` as standard dimension | Coupling via alias algebra, not special parameter |
491+ | Evaluator schema stores contract only | User must re-register after load; GBAI handles serialization |
492+ | Multi-predictor requires explicit selection | Silent default on multi-predictor is P1 violation |
493+ | pd.merge on index columns only | Memory O(N × index_cols), not O(N × all_cols) |
494+
447495---
448496
449497## Performance Summary
@@ -456,13 +504,15 @@ All major decisions require consensus from 3+ AI reviewers:
456504| Phase 7 | 86% | Join caching |
457505| Phase 8 | 10.8× | Numba acceleration |
458506| Phase 5 | 25× | vs TTree::Draw |
507+ | Phase 13.9 | 42× | Numba polynomial evaluator vs eval |
459508
460509### Memory Savings
461510
462511| Feature | Savings |
463512| ---------| ---------|
464513| Lazy loading | 90%+ |
465514| Compression | 50-80% |
515+ | Coefficient matrix (Phase 13.9) | 36×96 = 27 KB vs 13.5M rows |
466516
467517### Current Efficiency
468518
@@ -484,7 +534,23 @@ Remaining overhead is Python/Pandas framework cost.
484534| 7.4 | 51 | 136 (lazy/chain) |
485535| 7.5a | 33 | 33 (lazy subframe) |
486536| 7.5b | 19 | 52 (lazy subframe) |
487- | 6.8e/c | 43 | 1178+ |
537+ | 6.8 | 43 | 1178+ |
538+ | 13.9.ADF | 26 | 1402 |
539+ | 13.10.ADF | 24 | 1425 |
540+ | BUG draw_subframe | 23 | 1392 (at fix time) |
541+
542+ ---
543+
544+ ## Pending Items
545+
546+ - [ ] GBAI benchmark: evaluator at 82M rows D=3/D=4 with peak RSS
547+ - [ ] Fix ` register_subframe_lazy() ` bug (BUG_AliasDataFrame_20260116)
548+ - [ ] SCHEMA_VERSION export in ` __init__.py `
549+ - [ ] P1 tests: I2_6, I4_2, I4_3 fixes
550+ - [ ] CAPABILITY_MATRIX.md creation
551+ - [ ] PHASE_BEGIN_AliasDataFrame tag
552+ - [ ] Axis title lookup for subframe columns (` Sub_dy ` vs ` Side.dy ` )
553+ - [ ] dfdraw ` same=True ` — awaiting Team 3 response
488554
489555---
490556