Commit 33c0d2c

author
miranov25
committed
docs: Update PHASE_HISTORY.md with Phase 13 work
- Add Phase 13.12.DF (profile enhancements)
- Add Phase 13.9.ADF (PolynomialSpec)
- Add Phase 13.10.ADF (register_evaluator)
- Add Bug Fixes section (draw_subframe_resolution)
- Add Pending Items section
- Update test counts to 1425+
- Fix PyArrow performance finding (8-10× not 16×)
1 parent 66164aa commit 33c0d2c

1 file changed

Lines changed: 114 additions & 48 deletions

UTILS/dfextensions/AliasDataFrame/docs/PHASE_HISTORY.md

@@ -1,7 +1,7 @@
11
# AliasDataFrame Phase History
22

33
> **Purpose**: Development history for architecture reviews and restart prompts.
4-
> **Last Updated**: 2025-12-11
4+
> **Last Updated**: 2026-03-27
55
> **Maintained By**: Marian Ivanov (miranov25)
66
77
## How to Use This File
@@ -20,6 +20,7 @@ This file is intended for AI reviewers and human collaborators as a **restart co
2020
## Table of Contents
2121

2222
- [Overview](#overview)
23+
- [Phase 13: Advanced Features](#phase-13-advanced-features)
2324
- [Phase 9: PyArrow Acceleration](#phase-9-pyarrow-acceleration)
2425
- [Phase 8: Numba Acceleration](#phase-8-numba-acceleration)
2526
- [Phase 7: Lazy Loading](#phase-7-lazy-loading)
@@ -29,6 +30,7 @@ This file is intended for AI reviewers and human collaborators as a **restart co
2930
- [Phase 3: Benchmark Infrastructure](#phase-3-benchmark-infrastructure)
3031
- [Phase 2: Batch Optimization](#phase-2-batch-optimization)
3132
- [Phase 1: Foundation & Schema](#phase-1-foundation--schema)
33+
- [Bug Fixes](#bug-fixes)
3234
- [Architecture Decisions](#architecture-decisions)
3335
- [Performance Summary](#performance-summary)
3436

@@ -40,7 +42,7 @@ AliasDataFrame is a high-performance data analysis framework for particle physic
4042

4143
**Key Metrics:**
4244
- Performance: 60-770x speedups achieved
43-
- Test Coverage: 1178+ tests passing
45+
- Test Coverage: 1425+ tests passing
4446
- Lines of Code: ~10,000 (AliasDataFrame.py)
4547

4648
**Development Team:**
@@ -50,6 +52,85 @@ AliasDataFrame is a high-performance data analysis framework for particle physic
5052

5153
---
5254

55+
## Phase 13: Advanced Features
56+
57+
**Dates**: 2026-03-22 to 2026-03-27
58+
**Status**: 🔄 In Progress
59+
60+
### Phase 13.12.DF: Profile Enhancements
61+
**Date**: 2026-03-22
62+
**Commit**: `8e6e5d87d60206ad24b39af5847424d1d5db3c9d`
63+
64+
dfdraw profile() enhancements:
65+
- F1: `return_data=True` exports profile statistics as DataFrame
66+
- F2: `min_entries=3` suppresses low-statistics bins
67+
- F3: `group_by_bins`/`group_by_quantiles` auto-bins float columns
68+
- F4: `sort_groups=True` sorts legend numerically/alphabetically
69+
- v1.1: `weights` parameter for weighted mean/std/sem
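
As an illustration of what F1 and F2 describe, the following is a minimal sketch of a binned profile in plain pandas. It is not the dfdraw implementation: the helper name `profile_stats` and its signature are hypothetical, and only the `min_entries` cut and the idea of returning the statistics as a DataFrame are taken from the list above.

```python
import numpy as np
import pandas as pd


def profile_stats(df, x, y, bins=10, min_entries=3):
    """Hypothetical helper: per-bin count/mean/std/sem of y in bins of x."""
    binned = pd.cut(df[x], bins=bins)
    stats = df.groupby(binned, observed=True)[y].agg(["count", "mean", "std"])
    stats["sem"] = stats["std"] / np.sqrt(stats["count"])
    # Drop low-statistics bins, which is what min_entries is described to do.
    return stats[stats["count"] >= min_entries]


rng = np.random.default_rng(0)
# 200 points in [0, 1) plus one isolated point that lands alone in its bin.
x_values = np.append(rng.uniform(0.0, 1.0, 200), 5.0)
y_values = 2.0 * x_values + rng.normal(0.0, 0.1, x_values.size)
demo = pd.DataFrame({"x": x_values, "y": y_values})

result = profile_stats(demo, "x", "y", bins=10, min_entries=3)
```

The single-entry bin around `x = 5.0` is removed by the cut, so `result` holds only the bins that actually carry statistics.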
70+
71+
**Tests**: 19 tests (16 original + 3 weights)
72+
73+
### Phase 13.9.ADF: PolynomialSpec
74+
**Date**: 2026-03-23
75+
**Commit**: `eb54d3bc16e216e057da32c50a7a67e9a5dac568`
76+
77+
N-dimensional polynomial specification for calibration workflows:
78+
- `PolynomialSpec.py` (337 lines) — polynomial term generation
79+
- `basis_expressions()` → (key_name, expression) tuples for GB Regression
80+
- `numba_evaluator()` → Numba JIT with subframe coefficient lookup (42× vs eval)
81+
- `to_schema()`/`from_schema()` — JSON serialization
82+
- `to_root_expression()` — C++ expression for TTree::Draw
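
To make the term-generation idea concrete, here is a small sketch that enumerates monomials up to a total degree and emits `(key_name, expression)` tuples. The function name `polynomial_terms`, the key naming scheme `c_<powers>`, and the total-degree rule are assumptions for illustration; the actual `PolynomialSpec` may generate and name terms differently.

```python
from itertools import product


def polynomial_terms(variables, max_degree):
    """Hypothetical sketch: all monomials with total degree <= max_degree."""
    terms = []
    for powers in product(range(max_degree + 1), repeat=len(variables)):
        if sum(powers) > max_degree:
            continue
        # Build the factor list, skipping variables raised to the power 0.
        factors = [
            v if p == 1 else f"{v}**{p}"
            for v, p in zip(variables, powers)
            if p > 0
        ]
        key = "c_" + "_".join(str(p) for p in powers)
        expr = "*".join(factors) if factors else "1"
        terms.append((key, expr))
    return terms


terms = polynomial_terms(["x", "y"], max_degree=2)
```

For two variables and degree 2 this yields six terms, from the constant `("c_0_0", "1")` to the mixed term `("c_1_1", "x*y")`.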
83+
84+
New methods on AliasDataFrame:
85+
- `register_function(name, func, overwrite=False)` — generic callable registry
86+
- `register_polynomial_from_subframe()` — polynomial alias from subframe coefficients
87+
88+
**Architecture**: Coefficient matrix stays at subframe size (e.g., 36×96 = 27 KB). No materialization to main frame length.
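
The size quoted above follows from 36 × 96 × 8 bytes = 27,648 bytes, assuming `float64` coefficients. The sketch below shows one way to evaluate per-row values by indexing into such a matrix term by term, so that no array of shape (N, 96) is ever built. It uses plain NumPy rather than Numba, a one-dimensional polynomial, and made-up coefficient values; the meaning of the two matrix axes (groups × terms) is also an assumption.

```python
import numpy as np

n_groups, n_terms = 36, 96  # shape quoted above; axis meaning is assumed
coeffs = np.zeros((n_groups, n_terms), dtype=np.float64)
coeffs[:, 0] = 1.0  # constant term (illustrative value)
coeffs[:, 1] = 0.5  # linear term in x (illustrative value)

size_bytes = coeffs.nbytes  # 36 * 96 * 8 bytes


def evaluate(group_idx, x, n_used=2):
    """Accumulate sum_k coeffs[group, k] * x**k, one term at a time.

    Only length-N temporaries are created; the coefficient matrix is
    indexed per row and never expanded to main-frame length.
    """
    out = np.zeros(len(x))
    for k in range(n_used):
        out += coeffs[group_idx, k] * x**k
    return out


group_idx = np.array([0, 5, 35])  # per-row group (subframe) index
x = np.array([0.0, 2.0, 4.0])
values = evaluate(group_idx, x)
```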
89+
90+
**Tests**: 26 new tests, 1402 total passed
91+
92+
### Phase 13.10.ADF: register_evaluator
93+
**Date**: 2026-03-27
94+
**Commit**: `66164aadd6dd8216d18e545e1df5dac87ba8a087`
95+
96+
GroupByRegressionEvaluator integration:
97+
- `register_evaluator(name, evaluator, coord_columns, predictor_columns, overwrite)`
98+
- Wraps any object with `.evaluate(positions: dict) → ndarray`
99+
- Enables lazy evaluation via `add_alias('dy_corr', 'corr_I1(xM, driftM, dsectorM)')`
100+
- Multi-predictor: auto-selects single, raises ValueError if multiple without `predictor_columns`
101+
- Schema stores interface contract only — evaluator must be re-registered after load
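
The registration rules listed above can be sketched with a small stand-alone registry. This is not the AliasDataFrame code: the class `EvaluatorRegistry`, the `call` method, and the `predictors` attribute on the evaluator are hypothetical, and alias-expression parsing is not modelled. Only the `register_evaluator` argument names, the `.evaluate(positions: dict)` contract, and the ValueError on ambiguous predictors come from the text.

```python
import numpy as np


class EvaluatorRegistry:
    """Hypothetical stand-in for the registration logic described above."""

    def __init__(self):
        self._evaluators = {}

    def register_evaluator(self, name, evaluator, coord_columns,
                           predictor_columns=None, overwrite=False):
        if name in self._evaluators and not overwrite:
            raise ValueError(f"evaluator '{name}' already registered")
        if not callable(getattr(evaluator, "evaluate", None)):
            raise TypeError("evaluator must provide .evaluate(positions: dict)")
        # 'predictors' is an assumed attribute listing what the evaluator predicts.
        available = list(getattr(evaluator, "predictors", []))
        if predictor_columns is None:
            if len(available) > 1:
                raise ValueError("multiple predictors: pass predictor_columns")
            predictor_columns = available  # auto-select the single predictor
        self._evaluators[name] = (
            evaluator, list(coord_columns), list(predictor_columns)
        )

    def call(self, name, **columns):
        evaluator, coords, _ = self._evaluators[name]
        return evaluator.evaluate({c: np.asarray(columns[c]) for c in coords})


class LinearEvaluator:
    predictors = ["dy"]

    def evaluate(self, positions):
        return 2.0 * positions["xM"] + positions["driftM"]


class TwoPredictors(LinearEvaluator):
    predictors = ["dy", "dz"]


registry = EvaluatorRegistry()
registry.register_evaluator("corr_I1", LinearEvaluator(), ["xM", "driftM"])
corrected = registry.call("corr_I1", xM=[1.0, 2.0], driftM=[0.5, 0.5])

try:
    registry.register_evaluator("ambiguous", TwoPredictors(), ["xM", "driftM"])
    ambiguous_rejected = False
except ValueError:
    ambiguous_rejected = True
```

Because only the interface contract would be stored in a schema, a registry like this has to be repopulated after loading, which matches the re-registration requirement noted in the last bullet.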
102+
103+
**Tests**: 24 new tests, 1425 total passed
104+
105+
---
106+
107+
## Bug Fixes
108+
109+
### BUG_AliasDataFrame_20260324_draw_subframe_resolution
110+
**Dates**: 2026-03-25
111+
**Status**: ✅ Fixed
112+
113+
**Problem**: `adf.draw("Side.dy:row")` fails — dfdraw receives unresolved `Side.dy`, pandas eval interprets dots as attribute access.
114+
115+
**Commits**:
116+
- `75609ea2` — draw() fix (5 tests)
117+
- `3c069a55` — draw_batch() fix
118+
- `b4c845c1` — consolidated fix (19 tests, 1392 passed)
119+
- `0f4fd63f` — draw_figures() extension (4 tests, 23 total draw subframe tests)
120+
121+
**Solution**: Detect `Subframe.column` patterns in expr/selection/group_by, materialize via `pd.merge` on index columns only, rename to underscore format (`Side_dy`) for pandas eval safety, rewrite all expressions to match.
122+
123+
**Key insight**: `pd.merge` operates only on index columns, NOT full DataFrame — memory is O(N × index_cols), not O(N × all_cols).
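
The approach described in the solution can be sketched as follows. The function `resolve_subframe_refs`, its arguments, and the regular expression are hypothetical and simplified; the sketch only shows the three steps named above: detect `Subframe.column`, merge on index columns only, and rewrite the expression to the underscore form.

```python
import re

import pandas as pd

# Matches identifier.identifier, e.g. "Side.dy".
SUBFRAME_REF = re.compile(r"\b([A-Za-z_]\w*)\.([A-Za-z_]\w*)\b")


def resolve_subframe_refs(main, subframes, index_columns, expr):
    """Hypothetical sketch: materialize Sub.col references as Sub_col columns."""
    new_cols = {}
    for sub, col in sorted(set(SUBFRAME_REF.findall(expr))):
        if sub not in subframes:
            continue  # not a registered subframe, leave the text untouched
        keys = index_columns[sub]
        flat = f"{sub}_{col}"
        right = subframes[sub][keys + [col]].rename(columns={col: flat})
        # Only the index columns of the main frame take part in the merge.
        merged = main[keys].merge(right, on=keys, how="left")
        new_cols[flat] = merged[flat].to_numpy()
        expr = re.sub(rf"\b{sub}\.{col}\b", flat, expr)
    return main.assign(**new_cols), expr


main = pd.DataFrame({"side": [0, 1, 0, 1], "row": [1, 2, 3, 4]})
side = pd.DataFrame({"side": [0, 1], "dy": [0.1, -0.2]})

resolved, new_expr = resolve_subframe_refs(
    main, {"Side": side}, {"Side": ["side"]}, "Side.dy * row"
)
result = resolved.eval(new_expr)
```

After the rewrite the expression reads `Side_dy * row`, which pandas eval can handle because it no longer contains a dot.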
124+
125+
### Production Fixes (2026-03-24 to 2026-03-26)
126+
**Commits**:
127+
- `4c45c4dd`, `abd5d518`, `cef9bfef`: `fill_value` parameter for `add_alias()` (division-by-zero handling)
128+
- `f7d2335c`: `register_polynomial_from_subframe()` accepts `overwrite` parameter
129+
- `f7d2335c`: `_draw_single_figure()` prints errors to console when verbose=True
130+
- `486ff33c`, `6930690d`: `draw_figures()` per-figure defaults cascade (temporary fix)
131+
132+
---
133+
53134
## Phase 9: PyArrow Acceleration
54135

55136
**Dates**: 2025-12-01 to 2025-12-02
@@ -85,8 +166,7 @@ AliasDataFrame is a high-performance data analysis framework for particle physic
85166
**Date**: 2025-12-02
86167
**Commit**: `a788a687c6d0684fbaa71427e443451dd09445b7`
87168

88-
**Finding**: PyArrow compute 16x slower than NumPy for element-wise math
89-
- NumPy: 2.6ms vs Arrow: 41.2ms
169+
**Finding**: PyArrow compute 8-10× slower than NumPy for element-wise math
90170
- Root cause: Arrow lacks expression fusion
91171

92172
**Design Decision**: Keep Arrow for I/O + scatter operations, use NumPy for compute. This hybrid approach gets the best of both worlds.
@@ -125,7 +205,6 @@ AliasDataFrame is a high-performance data analysis framework for particle physic
125205
**Performance**:
126206
- Phase 7: 0.34s → Phase 8c: 0.228s (1.5x faster)
127207
- Total speedup vs baseline: 10.8x
128-
- Efficiency vs theoretical: 6.8%
129208

130209
---
131210

@@ -179,11 +258,6 @@ AliasDataFrame is a high-performance data analysis framework for particle physic
179258
- `__file_idx__` column for file provenance
180259
- Custom exception hierarchy
181260

182-
**Architecture**:
183-
- Composition pattern ✅
184-
- LRU cache K=8 ✅
185-
- Single pd.concat with ignore_index=True ✅
186-
187261
**Tests**: 51 new chain tests, 136 total lazy/chain
188262

189263
### Phase 7.5a: Lazy Single-File Subframes
@@ -205,22 +279,14 @@ AliasDataFrame is a high-performance data analysis framework for particle physic
205279
- Reuses LazyChainReader from Phase 7.4
206280
- Validation modes: first, strict, intersection, union
207281

208-
**Architecture Decisions** (4-reviewer consensus):
209-
- Add 'type': 'file' to single-file config ✅
210-
- No file index for subframes ✅
211-
- Trust reader protocol ✅
212-
- Schema has both 'index' and 'index_columns' ✅
213-
214282
**Tests**: 19 new, 52 total lazy subframe tests
215283

216284
---
217285

218286
## Phase 6: dfdraw Integration & Drawing Validation
219287

220288
**Dates**: 2025-12-07 to 2025-12-11
221-
**Status**: 🔄 In Progress (6.8e/c)
222-
223-
> **Note**: The "6.8" label has been extended. Original 6.8 (dfdraw integration: axis titles + slice-first evaluation) is now part of the broader "Phase 6.8: Drawing Validation & Benchmarks" package.
289+
**Status**: ✅ Complete
224290

225291
### Phase 6.8 Core: Duck-Typed Axis Titles
226292
**Date**: 2025-12-08
@@ -240,33 +306,6 @@ AliasDataFrame is a high-performance data analysis framework for particle physic
240306
- `_eval_alias_on_df()` for arbitrary DataFrame subset
241307
- Performance: 12M rows with 10K mask evaluates on 10K only
242308

243-
### Phase 6.8d: Synthetic Data Generator
244-
**Date**: 2025-12-11
245-
246-
- Extended `generate_synthetic_data.py` with `--chain` and `--subframe-chain` flags
247-
- Created examples/ directory with four documented examples for draw invariance and lazy loading
248-
- Known relationships for testing:
249-
- `y_derived = 2 * x` (exact, no noise)
250-
- `gain[sec] = 1.0 + 0.01 * sec`
251-
- `gain[run,sec] = 1.0 + 0.01*sec + 0.001*(run-1000)`
252-
253-
### Phase 6.8e/c: Draw Invariance Tests (Current)
254-
**Date**: 2025-12-11
255-
256-
- `test_draw_invariance.py`: Core invariants, subframe joins, dtype preservation
257-
- `test_draw_chain_integration.py`: Chain loading modes, lazy subframes, error handling
258-
- Tests validate eager vs lazy equivalence across all loading modes
259-
260-
**draw() API** (from dfdraw):
261-
- Returns: `(fig, ax, stats_dict)` tuple
262-
- `stats_dict` keys: `'n'`, `'mean'`, `'std'`, `'min'`, `'max'`
263-
- Use `lazy=True` or call `materialize_alias()` before draw
264-
265-
### Phase 6.8 Remaining (per spec)
266-
- **6.8a**: Execution-order fixes (if needed from test failures)
267-
- **6.8b**: Per-spec selection (nice-to-have)
268-
- **6.8f**: Performance benchmarks (nice-to-have)
269-
270309
---
271310

272311
## Phase 5: RDataFrame Integration
@@ -293,7 +332,7 @@ AliasDataFrame is a high-performance data analysis framework for particle physic
293332

294333
### Phase 5.3: Runtime Composite Keys
295334
**Date**: 2025-12-06
296-
**Commit**: `d5591a07b252fe432afefe68b94ceced68cbfbb1`
335+
**Commit**: `d5591a07b252fe432afefe68b94ceced68cbfbb11`
297336

298337
- Runtime composite key generation for >2 index columns
299338
- TMemFile + SetFile approach
@@ -444,6 +483,15 @@ All major decisions require consensus from 3+ AI reviewers:
444483
| intersection | Common branches only |
445484
| union | All branches, fill missing |
446485

486+
### Phase 13 Decisions
487+
488+
| Decision | Rationale |
489+
|----------|-----------|
490+
| PolynomialSpec `tgSlp` as standard dimension | Coupling via alias algebra, not special parameter |
491+
| Evaluator schema stores contract only | User must re-register after load; GBAI handles serialization |
492+
| Multi-predictor requires explicit selection | Silent default on multi-predictor is P1 violation |
493+
| pd.merge on index columns only | Memory O(N × index_cols), not O(N × all_cols) |
494+
447495
---
448496

449497
## Performance Summary
@@ -456,13 +504,15 @@ All major decisions require consensus from 3+ AI reviewers:
456504
| Phase 7 | 86% | Join caching |
457505
| Phase 8 | 10.8× | Numba acceleration |
458506
| Phase 5 | 25× | vs TTree::Draw |
507+
| Phase 13.9 | 42× | Numba polynomial evaluator vs eval |
459508

460509
### Memory Savings
461510

462511
| Feature | Savings |
463512
|---------|---------|
464513
| Lazy loading | 90%+ |
465514
| Compression | 50-80% |
515+
| Coefficient matrix (Phase 13.9) | 36×96 = 27 KB vs 13.5M rows |
466516

467517
### Current Efficiency
468518

@@ -484,7 +534,23 @@ Remaining overhead is Python/Pandas framework cost.
484534
| 7.4 | 51 | 136 (lazy/chain) |
485535
| 7.5a | 33 | 33 (lazy subframe) |
486536
| 7.5b | 19 | 52 (lazy subframe) |
487-
| 6.8e/c | 43 | 1178+ |
537+
| 6.8 | 43 | 1178+ |
538+
| 13.9.ADF | 26 | 1402 |
539+
| 13.10.ADF | 24 | 1425 |
540+
| BUG draw_subframe | 23 | 1392 (at fix time) |
541+
542+
---
543+
544+
## Pending Items
545+
546+
- [ ] GBAI benchmark: evaluator at 82M rows D=3/D=4 with peak RSS
547+
- [ ] Fix `register_subframe_lazy()` bug (BUG_AliasDataFrame_20260116)
548+
- [ ] SCHEMA_VERSION export in `__init__.py`
549+
- [ ] P1 tests: I2_6, I4_2, I4_3 fixes
550+
- [ ] CAPABILITY_MATRIX.md creation
551+
- [ ] PHASE_BEGIN_AliasDataFrame tag
552+
- [ ] Axis title lookup for subframe columns (`Sub_dy` vs `Side.dy`)
553+
- [ ] dfdraw `same=True` — awaiting Team 3 response
488554

489555
---
490556
