Commit 6dc69aa

Author: miranov25
docs: update PHASE_HISTORY (13.18-13.21) + fix SCHEMA.md stale reference
PHASE_HISTORY.md: add Phase 13.18 (regression bridge), 13.19.FIX1 (vector draw diagnostic), 13.20 (export_tree batching), 13.21 (join caching + dematerialize). Update metrics, test counts, performance table, pending items. SCHEMA.md: drop_materialized_aliases() -> dematerialize() (Phase 13.21).
1 parent 4c269c3 commit 6dc69aa

2 files changed

Lines changed: 102 additions & 10 deletions

UTILS/dfextensions/AliasDataFrame/docs/PHASE_HISTORY.md

Lines changed: 101 additions & 9 deletions
@@ -41,9 +41,10 @@ This file is intended for AI reviewers and human collaborators as a **restart co
 AliasDataFrame is a high-performance data analysis framework for particle physics research at CERN's ALICE experiment. It provides schema-driven, lazy-evaluated columns and hierarchical joins for ROOT/Parquet data.

 **Key Metrics:**
-- Performance: 60-770x speedups achieved
-- Test Coverage: 1480+ tests passing (1463 baseline + 31 Phase 13.12 invariance tests; 17 net after ROOT/numba skips)
-- Lines of Code: ~10,000 (AliasDataFrame.py)
+- Performance: 60-770x speedups achieved; production pipeline 2× faster (1452s → 722s)
+- Test Coverage: 1521 tests passing, 125 invariance tests
+- Lines of Code: ~12,800 (AliasDataFrame.py)
+- Features: 44 in taxonomy (26 verified, 14 smoke-only, 3 broken, 1 planned)

 **Development Team:**
 - Coordinator: Marian Ivanov (miranov25)
@@ -156,6 +157,83 @@ Mode #11 discipline).

 ---

+### Phase 13.18.ADF: Regression Metadata Bridge
+**Dates**: 2026-04-12 to 2026-04-13
+**Status**: ✅ Merged
+**Commits**: `2bc65bb`, `2662cee`, `d0d020a`
+
+**Deliverables**:
+- 3 new public API methods: `register_regression_metadata()`, `update_regression_metadata()`, `register_evaluator_from_metadata()`
+- `describe_regression()` for lazy-status-aware introspection
+- Schema persistence for regression_metadata (mirrors registered_functions pattern)
+- Natural-label → compact-index remap in `_bridge_eval_func` (cross-team Q&A with GB team)
+- feature_taxonomy.py: +3 entries (FUNC.regression_metadata, FUNC.evaluator_from_metadata, FUNC.regression_persistence), 41 → 44 features
+
+**Tests**: 11/11 passing (7 pass + 4 xfail resolved after bridge fix)
+
+**Key decision**: `_bridge_eval_func` uses `register_function` directly (not `register_evaluator` wrapper) because evaluator requires compact-index input, not natural labels. Bridge handles the remap.
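
The natural-label to compact-index remap mentioned in this entry can be pictured with a small NumPy sketch. It is an illustration only: `remap_to_compact_index` and its arguments are hypothetical names, not the actual `_bridge_eval_func` code, and it assumes the evaluator addresses each category by its position in the list of known labels.

```python
import numpy as np


def remap_to_compact_index(labels, known_labels):
    """Map natural labels (e.g. 10, 20, 40) to positions 0..n-1 in known_labels.

    Hypothetical helper for illustration; not part of AliasDataFrame.
    """
    labels = np.asarray(labels)
    known = np.asarray(known_labels)
    order = np.argsort(known, kind="stable")      # positions that sort known_labels
    sorted_known = known[order]
    pos = np.searchsorted(sorted_known, labels)   # candidate slot for each label
    pos = np.clip(pos, 0, len(sorted_known) - 1)  # keep indices in range
    missing = sorted_known[pos] != labels         # candidate slot holds another label
    if missing.any():
        unknown = np.unique(labels[missing]).tolist()
        raise KeyError(f"labels not in known_labels: {unknown}")
    return order[pos]                             # position in the original list


compact = remap_to_compact_index([40, 10, 10, 20], known_labels=[10, 20, 40])
```

Under these assumptions, passing natural labels straight to an evaluator that expects compact indices would silently address the wrong grid cells, which is why the sketch raises on unknown labels instead of guessing.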
+
+**Lesson learned**: Reviewer-discipline failure — `_eval_lookup` docstring says "raw grid indices" but coder read "raw" as "natural values." Recovery required cross-team Q&A. Future: hand-trace one example through the target method before escalating.
+
+---
+
+### Phase 13.19.ADF.FIX1: Vector Draw Kwarg Diagnostic
+**Dates**: 2026-04-14 to 2026-04-18
+**Status**: ✅ Merged
+**Commits**: `f9679f7`, `1fa5145b` (combined with 13.20)
+
+**Problem**: `adf.draw('[y1..y6]:staveITS', group_by='mP3', group_by_bins=6)` produced 421 legend entries instead of 6.
+
+**Diagnostic approach** (architect direction: *"Make tests first"*):
+- K1 tests (boundary diagnostic): monkey-patch DFDraw methods to capture kwargs at ADF→DFDraw boundary. **Conclusion: ADF forwards kwargs correctly. Bug was dfdraw-internal.**
+- K2 tests (end-to-end): count rendered legend entries post-dfdraw FIX1 (`fe007b7c`)
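
A boundary diagnostic of the K1 kind can be sketched as a small spy that records the keyword arguments crossing from one layer to the next. `Renderer`, `Frontend`, and `capture_kwargs` below are hypothetical stand-ins for illustration; they are not the DFDraw or AliasDataFrame classes, and the real tests may be structured differently.

```python
class Renderer:
    """Stand-in for the downstream drawing layer (hypothetical)."""

    def draw(self, expr, **kwargs):
        return f"drawn {expr}"


class Frontend:
    """Stand-in for the calling layer that should forward kwargs unchanged."""

    def __init__(self, renderer):
        self.renderer = renderer

    def draw(self, expr, **kwargs):
        return self.renderer.draw(expr, **kwargs)


def capture_kwargs(obj, method_name):
    """Replace obj.method_name with a spy; return the list it records into."""
    captured = []
    original = getattr(obj, method_name)  # bound method, kept for delegation

    def spy(*args, **kwargs):
        captured.append(dict(kwargs))     # record what crossed the boundary
        return original(*args, **kwargs)

    setattr(obj, method_name, spy)
    return captured


renderer = Renderer()
calls = capture_kwargs(renderer, "draw")
result = Frontend(renderer).draw("[y1..y6]:staveITS", group_by="mP3", group_by_bins=6)
```

A check like this only shows that the arguments arrive intact. It says nothing about what the downstream layer does with them, which is the gap the K2 output-counting tests close.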
+
+**Tests**: K1 (4 pass, 1 skip) + K2 (4 pass) = 8 permanent regression tests
+
+**Lesson learned**: K1 ruled out wrong hypothesis; K2 (output counting) should have been v0.2 from the start. *"Diagnostic tests must include the user-visible symptom, not just intermediate-layer plumbing."*
+
+---
+
+### Phase 13.20.ADF: export_tree Metadata Batching
+**Dates**: 2026-04-18 to 2026-04-19
+**Status**: ✅ Merged
+**Commit**: `1fa5145b` (combined with 13.19.FIX1)
+
+**Problem**: `export_tree` with N subframes opened ROOT file N+1 times for metadata writing. Profile: 159s / 11% of 1452s total.
+
+**Fix**: Separate data-write (uproot) from metadata-write (ROOT). 4 new methods: `_write_all_data_to_uproot`, `_collect_metadata_targets`, `_write_all_metadata_to_root`, `_write_metadata_to_tree`. 1 `TFile.Open` instead of 16+.
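
The structural change, collecting the metadata targets first and then writing them all under a single open, can be sketched without ROOT. `CountingStore` is a hypothetical stand-in that only counts opens; the two functions are illustrations of the before and after shapes, not the `_write_all_metadata_to_root` implementation.

```python
from contextlib import contextmanager


class CountingStore:
    """Stand-in for an output file; counts how often it is opened (hypothetical)."""

    def __init__(self):
        self.opens = 0
        self.metadata = {}

    @contextmanager
    def open(self):
        self.opens += 1
        yield self.metadata


def write_metadata_per_tree(store, targets):
    # Pre-batching shape: one open per tree.
    for tree_name, meta in targets:
        with store.open() as handle:
            handle[tree_name] = meta


def write_metadata_batched(store, targets):
    # Batched shape: open once, then write every collected target.
    with store.open() as handle:
        for tree_name, meta in targets:
            handle[tree_name] = meta


targets = [(f"subframe_{i}", {"aliases": i}) for i in range(15)]
slow, fast = CountingStore(), CountingStore()
write_metadata_per_tree(slow, targets)
write_metadata_batched(fast, targets)
```

Both variants leave the same metadata behind; only the number of opens differs, so any saving depends on how expensive each open is for the file in question.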
+
+**Measured savings**: ~29s (not 80-130s predicted — O2DistAI's single-TF filter reduced file sizes, making each TFile.Open cheaper). Fix is structurally correct; savings scale with file size.
+
+**Tests**: E1 (4 pass) + E2 (3 pass, 1 xfail for pre-existing `read_tree` nested subframe limitation)
+
+---
+
+### Phase 13.21.ADF: Join Index Caching + dematerialize() API
+**Dates**: 2026-04-19
+**Status**: ✅ Merged
+**Commits**: `fa7cd11d` (v1.0 caching), `4c269c3c` (v1.1 dematerialize + remove drop_materialized)
+
+**Origin**: PHASE_13_19_ADF_PERF_Summary.md item A1, unanimous ADF team agreement (Claude31, Claude30, Claude32).
+
+**Fix A — Join index caching** (4 changes, +35 lines):
+- Root cause: line 4411 cleared ENTIRE `_join_index_cache` after every `materialize_aliases` call, even though only value columns changed
+- Added `_index_column_signature()` — O(1) content-based cache validation (first/last/dtype)
+- Added targeted cache invalidation in `register_subframe` instead of blanket clear
+- Profile evidence: 56s / 141 cache misses → expected ~5-10s / ~15 misses
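
One way to picture signature-based validation is a cache that stores a cheap fingerprint of the index column next to each built join index and rebuilds only when the fingerprint changes. The sketch below is an assumption-laden illustration: `index_column_signature` and `JoinIndexCache` are hypothetical names, and including the column length in the signature is an addition of this sketch beyond the first/last/dtype fields named in the entry.

```python
import numpy as np


def index_column_signature(values):
    """O(1) fingerprint: length, first value, last value, dtype (hypothetical)."""
    arr = np.asarray(values)
    if arr.size == 0:
        return (0, None, None, str(arr.dtype))
    return (arr.size, arr[0], arr[-1], str(arr.dtype))


class JoinIndexCache:
    """Keeps a built index per key and reuses it while the signature matches."""

    def __init__(self):
        self._entries = {}
        self.builds = 0

    def get(self, key, index_values, build):
        sig = index_column_signature(index_values)
        entry = self._entries.get(key)
        if entry is not None and entry[0] == sig:
            return entry[1]                # hit: index column looks unchanged
        self.builds += 1                   # miss: rebuild and remember
        result = build(index_values)
        self._entries[key] = (sig, result)
        return result


def build_lookup(ids):
    # Toy "join index": value -> row position.
    return {int(v): i for i, v in enumerate(ids)}


cache = JoinIndexCache()
track_ids = np.array([3, 5, 8], dtype=np.int64)
first = cache.get("tracks", track_ids, build_lookup)
second = cache.get("tracks", track_ids, build_lookup)   # value columns changed elsewhere
third = cache.get("tracks", np.array([3, 5, 9], dtype=np.int64), build_lookup)
```

A constant-time fingerprint of this kind cannot see changes confined to interior rows, so a design along these lines still needs explicit invalidation wherever the index column itself can be rewritten.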
+
+**Fix B — `dematerialize(drop=, keep=)` API** (+72 lines):
+- Three modes: `drop=[...]`, `keep=[...]`, no args (drop all)
+- Raw columns always protected. Mutually exclusive drop/keep (`ValueError`)
+- Replaced `drop_materialized()` — strict superset, 3 internal call sites updated
+- Composes with join caching: re-materialization reuses cached indices
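
The three modes and the raw-column protection listed for `dematerialize` can be expressed as a small column-selection function. This is a sketch of the described semantics only: `select_columns_to_drop` and its `materialized` and `raw` parameters are hypothetical, and the real method may treat unknown names or ordering differently.

```python
def select_columns_to_drop(materialized, raw, drop=None, keep=None):
    """Return the materialized columns a dematerialize-style call would remove.

    Hypothetical helper: `materialized` and `raw` are column-name collections.
    """
    if drop is not None and keep is not None:
        raise ValueError("pass either drop= or keep=, not both")
    # Raw columns are never candidates, whatever the caller asks for.
    candidates = [c for c in materialized if c not in raw]
    if drop is not None:
        wanted = set(drop)
        return [c for c in candidates if c in wanted]
    if keep is not None:
        kept = set(keep)
        return [c for c in candidates if c not in kept]
    return candidates  # no arguments: drop every materialized alias


cols = ["dy_corr", "dz_corr", "x"]
drop_all = select_columns_to_drop(cols, raw={"x"})
drop_some = select_columns_to_drop(cols, raw={"x"}, drop=["dy_corr", "x"])
keep_some = select_columns_to_drop(cols, raw={"x"}, keep=["dy_corr"])
```

In this sketch a raw column named in `drop=` is skipped silently; whether the real API skips or raises in that case is not stated in the entry.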
+
+**Tests**: J1_1..J1_10 (correctness + dematerialize) + J2_1 (performance gate) = 11 tests. Updated 2 pre-existing `test_join_index_caching.py` tests to expect new cache-survives behavior.
+
+**Test results**: 1521 passed, 7 failed (all pre-existing), 1 error (pre-existing)
+
+---
+
 ## Bug Fixes

 ### BUG_AliasDataFrame_20260324_draw_subframe_resolution
@@ -557,6 +635,9 @@ All major decisions require consensus from 3+ AI reviewers:
 | Phase 8 | 10.8× | Numba acceleration |
 | Phase 5 | 25× | vs TTree::Draw |
 | Phase 13.9 | 42× | Numba polynomial evaluator vs eval |
+| Phase 13.20 | ~29s saved | export_tree metadata batching (1 TFile.Open vs N+1) |
+| Phase 13.21 | ~40-50s expected | Join cache survives materialize_aliases |
+| **Production** | **2× (1452→722s)** | **Cross-team: GB + ADF + O2DistAI fixes combined** |

 ### Memory Savings

@@ -590,19 +671,30 @@ Remaining overhead is Python/Pandas framework cost.
 | 13.9.ADF | 26 | 1402 |
 | 13.10.ADF | 24 | 1425 |
 | BUG draw_subframe | 23 | 1392 (at fix time) |
+| BUG fill_value | 7 | 1432 |
+| BUG draw_lazy_compound | 13 | 1441 |
+| 13.11.ADF | infra | 1441 |
+| 13.11.B | taxonomy | 1441 |
+| 13.12.ADF | 31 | 1480 |
+| 13.18.ADF | 11 | 1491 |
+| 13.19.ADF.FIX1 | 8 (K1+K2) | 1499 |
+| 13.20.ADF | 8 (E1+E2) | 1510 |
+| 13.21.ADF | 11 (J1+J2) | 1521 |

 ---

 ## Pending Items

-- [ ] GBAI benchmark: evaluator at 82M rows D=3/D=4 with peak RSS
-- [ ] Fix `register_subframe_lazy()` bug (BUG_AliasDataFrame_20260116)
-- [ ] SCHEMA_VERSION export in `__init__.py`
+- [x] ~~CAPABILITY_MATRIX.md creation~~ (Phase 13.11)
+- [x] ~~PHASE_BEGIN_AliasDataFrame tag~~ (Phase 13.20 close)
+- [ ] A2 — LZ4 default compression (one-line + compat test, ~15-20s savings)
+- [ ] A3 — Batch metadata serialization (~20-30s savings)
+- [ ] `read_tree` recursive subframe loading (line 5172 `load_subframes=False` → `True`)
+- [ ] GB tuple support for `linear_columns` (PolynomialSpec production blocker)
+- [ ] Technical Summary v1.6 full public API documentation (~90 methods)
 - [ ] P1 tests: I2_6, I4_2, I4_3 fixes
-- [ ] CAPABILITY_MATRIX.md creation
-- [ ] PHASE_BEGIN_AliasDataFrame tag
+- [ ] Fix `register_subframe_lazy()` bug (BUG_AliasDataFrame_20260116)
 - [ ] Axis title lookup for subframe columns (`Sub_dy` vs `Side.dy`)
-- [ ] dfdraw `same=True` — awaiting Team 3 response

 ---

UTILS/dfextensions/AliasDataFrame/docs/SCHEMA.md

Lines changed: 1 addition & 1 deletion
@@ -203,7 +203,7 @@ This is the primary workflow for physics analysis:
 adf = AliasDataFrame.read_tree("clusters.root", "tree")

 # Step 2: Drop any previously materialized columns
-adf.drop_materialized_aliases()
+adf.dematerialize()

 # Step 3: Register subframes and auto-alias
 adf_tracks = AliasDataFrame.read_tree("tracks.root", "tree")
