Skip to content

Commit 50a697f

Browse files
committed
TECHNICAL_SUMMARY v4.0: Phases 13.20-13.23a, K=2.9, 575 tests
New sections: Performance Optimization (1452s→722s), Roofline Framework (K=2.9). Current State v4.0: 575/3/0, 10 roofline pass, pipeline 722s. Profile decomposition: OLS kernel 45%, Python orchestration 36%.
1 parent f45a7ac commit 50a697f

1 file changed

Lines changed: 77 additions & 17 deletions

File tree

UTILS/dfextensions/groupby_regression/docs/TECHNICAL_SUMMARY.md

Lines changed: 77 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,16 @@
11
# Technical Summary: GroupBy Regression
22

3-
**Version:** 3.4
4-
**Phase:** 13.17.GB
5-
**Last Updated:** 2026-04-11
3+
**Version:** 4.0
4+
**Phase:** 13.23a.GB-RooflineEasyWins
5+
**Last Updated:** 2026-05-10
66
**Coder:** Claude22 (GBAI team)
7-
**Suggested archive filename:** `GroupByRegression_Technical_Summary_PHASE_13_17_GB_v3_4.md`
7+
**Suggested archive filename:** `GroupByRegression_Technical_Summary_PHASE_13_23A_GB_v4_0.md`
8+
9+
> **Changes from v3.5:**
10+
> - **Performance Optimization section (NEW):** Phases 13.20–13.21 pipeline wall time 1452s→722s (2×). CSR gather kernel, SW fit wrapper vectorization, prange gather, batch MAD.
11+
> - **Roofline Performance Framework section (NEW):** Phase 13.22 K metric (K=2.9 baseline, 10/10 pass). Phase 13.23a profile decomposition: OLS kernel 45%, Python orchestration 36%. cProfile 61ms→45ms (-26%), 63K→37K function calls (-41%).
12+
> - **Current State v4.0 column:** 575 tests passed, 0 broken, 10 roofline tests (opt-in), K=2.9, pipeline 722s.
13+
> - **Phase 13.23a K≤2.2 acceptance gate NOT MET** (K=3.0, wall time unchanged at 42ms). Documented: easy wins reduced cProfile overhead, not steady-state wall time. Remaining gap is structural (BLAS kernel + Python orchestration).
814
915
> **Changes from v3.3:**
1016
> - **F1 fixed:** `make_sliding_window_aggregate` and its parallel sibling now honour `boundary='symmetric'` and `boundary='periodic'` for the mean/std/count output columns. Both the primary kernel call and the sigma-cut recompute kernel call thread a Path 2 structure `(valid_offset_mask, wrap_flag, wrap_idx, wrapped_coords)` computed once per call by the new helper `_precompute_aggregate_boundary_mask`. Zero behavioural change on the default `boundary='full'` path (verified strictly by T9 against a captured literal baseline).
@@ -437,18 +443,21 @@ The v3.3 split between fit-path (`⚠️`) and aggregate-path (`🧨 silently br
437443

438444
## [UPDATED] Current State
439445

440-
| Metric | v2.1 (Feb 2026) | v3.0 (Mar 2026) | v3.2 (Apr 2026) | v3.3 corrected¹ | **v3.4 (Apr 2026)** |
441-
|--------|-----------------|-----------------|-----------------|-----------------|---------------------|
442-
| Test count | 338 passed | 500 passed | 517 passed | 529 passed¹ | **556 passed²** |
443-
| Pre-existing failures | 3 | 4 | 3 | 2 | **** |
444-
| Skipped ||| 19 | 19 | **19** |
445-
| Features | 102 | 133 | 133 | 133 | **133** |
446-
| Verified (✅) | 25 (24.5%) | 43 (32.3%) | 43 (32.3%) | 43 (32.3%) | **43 (32.3%) + 28 new pytest runs (T1–T17)** |
447-
| Smoke-only (☑️) ||| 89 (66.9%) | 89 (66.9%) | **89 (66.9%)** |
448-
| Broken (🧨) | 0 | 0 | 2 (F1, F2) | 1 (F1) | **0** |
449-
| Partial (⚠️) | 0 | 0 | 0 | 0 | **0** (D1 median deferral is recorded as a Known Limitation, not as a Capability Matrix Partial entry, since it does not map to a distinct capability row) |
450-
| Public functions | 14 | 20 | 20 | 20 | **20** |
451-
| Evaluator method values | 2 | 3 | 6 | 6 | **6** |
446+
| Metric | v2.1 (Feb 2026) | v3.0 (Mar 2026) | v3.2 (Apr 2026) | v3.3 corrected¹ | v3.4 (Apr 2026) | **v4.0 (May 2026)** |
447+
|--------|-----------------|-----------------|-----------------|-----------------|---------------------|---------------------|
448+
| Test count | 338 passed | 500 passed | 517 passed | 529 passed¹ | 556 passed² | **575 passed** |
449+
| Pre-existing failures | 3 | 4 | 3 | 2 || **3** |
450+
| Skipped ||| 19 | 19 | 19 | **19** |
451+
| Features | 102 | 133 | 133 | 133 | 133 | **133** |
452+
| Verified (✅) | 25 (24.5%) | 43 (32.3%) | 43 (32.3%) | 43 (32.3%) | 43 (32.3%) + 28 new | **43 (32.3%)** |
453+
| Smoke-only (☑️) ||| 89 (66.9%) | 89 (66.9%) | 89 (66.9%) | **89 (66.9%)** |
454+
| Broken (🧨) | 0 | 0 | 2 (F1, F2) | 1 (F1) | 0 | **0** |
455+
| Partial (⚠️) | 0 | 0 | 0 | 0 | 0 | **0** |
456+
| Public functions | 14 | 20 | 20 | 20 | 20 | **20** |
457+
| Evaluator method values | 2 | 3 | 6 | 6 | 6 | **6** |
458+
| Roofline tests (opt-in) |||||| **10/10 pass** |
459+
| Pipeline wall time (82M rows) ||||| ~1452s | **~722s (2×)** |
460+
| Roofline K (SW fit, 100K rows) |||||| **K=2.9** |
452461

453462
¹ **v3.3 on disk records 528 passed.** This is an inherited arithmetic error carried over from the FIX2 coder packet: `517 + 11 new = 528` forgot the F5 flip (one failing test moved to passing). Correct arithmetic: `517 + 11 + 1 = 529`. Reference: `SUMMARY_20260409_112129.txt` from the FIX2 reviewer packet. v3.4 corrects this.
454463

@@ -564,6 +573,56 @@ Row positions of output are not stable across `algorithm=` values; they were not
564573

565574
**Performance:** V1/V2 recompute path internal routing changed from `_build_bin_index_map` (Python dict, O(n²) for n=rows) to `_assign_bin_ids_fast` (vectorized numpy, O(n)). `_get_neighbor_bins` V3a (2.5M per-bin Python function calls) replaced by inline vectorized offset + dense array lookup in `_aggregate_window_dense`. Predicted savings: ~355s on the 82M-row `makeSmoothMapsWithTPC` calibration workload (profile: `profile_gr11_tf0.prof`). T2 re-profile pending.
566575

576+
## [NEW] Performance Optimization (Phases 13.20–13.21)
577+
578+
Production pipeline wall time reduced from **1452s to 722s (2×)** on the 82M-row TPC calibration workload across two phases:
579+
580+
**Phase 13.20.GB-PERF**`_aggregate_window_dense` inner loop numba-ized:
581+
- `_get_gather_window_rows_kernel`: JIT-compiled two-pass CSR kernel (count → prefix sum → fill) replaces per-bin Python gather loop
582+
- Removed redundant `flat_indices.clip` (16s) and `np.unique(concatenate)` (37s — rows are disjoint by counting-sort guarantee)
583+
- Numpy fallback: `GBAI_DISABLE_AGG_DENSE_NUMBA=1`
584+
585+
**Phase 13.21.GB-PERF** — SW fit wrapper vectorization + batch MAD:
586+
- F1: `_fit_window_regression_numba` rewritten — per-bin Python loops eliminated (47.8s self-time → vectorized). One `np.column_stack` for all bins, kernel handles NaN via `INVALID_FILTER`, vectorized R² from kernel `out_sum_y`/`out_sum_y2` sufficient statistics.
587+
- F2: `_gather_rows` kernel parallelized with `nb.prange`
588+
- F3: `_get_batch_mad_kernel` replaces 1.26M per-bin `np.median` calls (52s) with one numba prange kernel call in V4 pipeline. Fallback: `GBAI_DISABLE_BATCH_MEDIAN=1`
589+
590+
## [NEW] Roofline Performance Framework (Phase 13.22)
591+
592+
CI-integrated roofline metric measuring distance from ideal compiled code:
593+
594+
```
595+
K = T_observed / T_expected, where T_expected = Σ(n_ops × t_primitive)
596+
```
597+
598+
K=1 means at hardware roofline. K>1 quantifies the optimization gap. Primitives from phase 12.12b Appendix A (M1 stream read, M2 gather, M3w scatter write, C1 XᵀWX, C2 Cholesky solve, C6 MAD), re-measured on target machine at test fixture working-set size.
599+
600+
**Current K values (alma2, 100K rows, 1024 bins):**
601+
602+
| Pipeline | K | T_obs | T_exp | Gap source |
603+
|---|---|---|---|---|
604+
| SW fit (`make_sliding_window_fit`) | **2.9** | 42ms | 15ms | OLS kernel 45%, Python orchestration 36% |
605+
| V4 (`make_parallel_fit_v4`, PyArrow) | **2.5** | 63ms | 25ms | BLAS-vs-numba scalar, pandas overhead |
606+
| Bin-id assignment | **2.3** | 1.1ms | 0.5ms | At roofline ✓ |
607+
| Fit kernel (OLS, C1+C2 denom) | **3.3** | 41ms | 13ms | BLAS-vs-numba scalar gap |
608+
609+
**Test suite:** 10 tests (4 K-roofline + 1 dispatch + 5 meta), opt-in via `WITH_ROOFLINE=1` in `run_tests.sh`. Threshold formula: `K_threshold = K_floor + 6 × 1.4826 × MAD(K)`.
610+
611+
**Calibration pipeline:** `measure_primitives.py``calibrate_roofline_K.py``update_roofline_baseline.py``roofline_baseline.json`. Fully automated, no hand-written thresholds.
612+
613+
**Profile decomposition of K=2.9 (Phase 13.23a):**
614+
615+
| Component | Time | % | Optimization target |
616+
|---|---|---|---|
617+
| OLS kernel (`fit_groups_single_numba`) | 19ms | 45% | BLAS-batched XtX (Phase 13.25) |
618+
| Window array build | 7ms | 17% | Memory layout optimization |
619+
| Other wrapper Python | 8ms | 19% | Vectorized post-kernel assembly |
620+
| Pandas output (`_assemble_results`) | 3ms | 7% | Arrow-based output (future) |
621+
| Aggregate orchestration | 2ms | 5% | V5-style numba (Phase 13.24) |
622+
| Gather + sort + bin-id | 3ms | 7% | At roofline ✓ |
623+
624+
**Phase 13.23a easy wins:** Module-level Dispatcher cache (A1), listcomp vectorization (A2), V5-style preallocated `_assemble_results` (A3). cProfile 61ms→45ms (-26%), 63K→37K function calls (-41%). Wall time unchanged at 42ms — gains are cProfile-overhead reduction, not steady-state wall-time. K=2.9 is structural.
625+
567626
### Unverified Claims
568627

569628
| Claim | Status | Action |
@@ -589,4 +648,5 @@ Row positions of output are not stable across `algorithm=` values; they were not
589648
| **3.2** | **2026-04-07** | **Corrections from v3.1 multi-reviewer cycle. Phase references corrected to `13.16.GB-FIX2` and `13.17.GB`. `boundary='symmetric'` split into two rows (⚠️ fit path tested, 🧨 aggregate path silently broken). All three Quick Start examples made runnable with keyword `df=`. `register_fit_model()` correct name in Public Interface Catalog. `'nearest_fast'` added to Evaluator Method Reference. `from_dfGB()` example uses real keyword arguments. F1 and F2 added to Current State "Broken" count. Governance reference updated to Org-structure v1.25. All v3.0 sections preserved verbatim. Drafted by Main Reviewer (Claude20, GBAI) at architect request after v3.1 review cycle.** |
590649
| **3.3** | **2026-04-07** | **Phase 13.16.GB-FIX2 landed. F2/F3/F4/F5 fixed in source (evaluator method=dict validation, docstring coverage, lookup+extrapolate rejection, stale backend test). C3/C8/C10 addressed as part of the same commit ('nearest'/'nearest_fast' equivalence, runnable docstring examples, unknown-method detection). 11 new tests in `test_evaluator_lookup.py`, all with explicit path parameters per failure mode #11 and `pytest.raises(..., match=...)` per C7. Test count 517 → 528 canonical / 493 → 505 coder-env. Pre-existing failures 3 → 2 (F5 flipped). Broken count 2 → 1 (F2 fixed; F1 remains for 13.17.GB). All v3.0 sections preserved verbatim. Drafted by Coder (Claude21, GBAI) during implementation of PHASE_13_16_GB_FIX2_v1.0 proposal as consolidated in Claude20's review summary.** |
591650
| **3.4** | **2026-04-11** | **Phase 13.17.GB landed.**
592-
| **3.5** | **2026-04-18** | **Phase 13.19.GB-PERF landed.** V1/V2 recompute path routed through dense-lookup infrastructure (F1). `_build_bin_index_map` (205s) replaced by `_assign_bin_ids_fast`. `_get_neighbor_bins` V3a (152s) inlined into `_aggregate_window_dense`. `[FOUND-WHILE-IMPLEMENTING-F1]` `_fit_window_regression_numba` gains `fit_intercept` parameter — parameter-not-propagated class instance #8. V1/V2 boundary parameter silently dropped documented as Known Limitation (instance #9, pre-existing). Output row ordering changed to lexicographic (Behavior Changes section added). 14 T1 invariance tests in `test_fit_path_perf_invariance.py`. F2 (median batching) deferred. T2 re-profile pending. Coder: Claude22. Reviewers: Claude20 (Main), Claude21, Claude23, Claude24, Claude25. | F1 fixed for mean/std/count in both the primary kernel call and the sigma-cut recompute kernel call via the new `_precompute_aggregate_boundary_mask` helper (Path 2: mask + per-bin wrap flag + compact per-edge-bin wrapped-coords table). Both numba JIT and numpy fallback receive the new `(valid_offset_mask, wrap_flag, wrap_idx, wrapped_coords)` parameters in identical order at all 4 call sites (verified via signature-chain grep). Zero behavioural change on the default `boundary='full'` path — verified strictly by T9 against a literal baseline array hardcoded in the test file. Parallel wrapper inherits the fix automatically (T10 parallel ≡ serial invariance, 2-D fixture). D1 (median path boundary honouring) deferred to Phase 13.17.GB-MedianFix per architect direction 2026-04-09. **28 new pytest-level test runs** in `tests/test_aggregate_boundary.py` from 16 test functions: T1–T11 (T11 with second invariance assertion block), T13 oracle, T14 cross-backend × 6, T15 shift + topology, T16 window=0 × 3, T17 canary × 6. Unified Claude22 + Claude23 joint plan, **50% invariance ratio** (8/16 functions are invariance-style — up from v1.2's 25%). T12 removed per D1 deferral. **Test count: 528 corrected baseline + 28 new = 556 passed canonical** on alma2 commit `85713774`, branch `feature/groupby-optimization`, `test_logs/SUMMARY_20260411_090322.txt`. Pre-existing failures: 3 (was 2 in v3.3 corrected; +1 from `test_v3_numpy_faster_than_v1_numpy` newly listed). Capability Matrix: 0 broken / 0 partial / 43 verified / 89 smoke-only / 1 planned / 133 total. **Inherited arithmetic correction:** v3.3 records 528 passed canonical (off by one — forgot the F5 flip in FIX2); the *correct* v3.3 baseline is 529, but the *empirical* pre-13.17.GB passed-count was 528 because the timing test was already silently failing. v3.4 surfaces both numbers transparently. **New Known Limitation rows added:** D1 median deferral (with verbatim architect quotes including typos), pre-existing broken `test_aggregate_numba_matches_numpy` (superseded by T14, NOT deleted in-phase), `test_v3_numpy_faster_than_v1_numpy` timing test. **Public Interface Catalog unchanged** — Phase 13.17.GB extends behaviour of an existing parameter (`boundary`) without changing any function signature. **All v3.0 / v3.2 / v3.3 sections marked `[UNCHANGED]` preserved verbatim.** Drafted by Coder Claude22 during Phase 13.17.GB implementation; commit-time Main Reviewer Claude21. Suggested archive filename: `GroupByRegression_Technical_Summary_PHASE_13_17_GB_v3_4.md`. |
651+
| **3.5** | **2026-04-18** | **Phase 13.19.GB-PERF landed.** V1/V2 recompute path routed through dense-lookup infrastructure (F1). `_build_bin_index_map` (205s) replaced by `_assign_bin_ids_fast`. `_get_neighbor_bins` V3a (152s) inlined into `_aggregate_window_dense`. `[FOUND-WHILE-IMPLEMENTING-F1]` `_fit_window_regression_numba` gains `fit_intercept` parameter — parameter-not-propagated class instance #8. V1/V2 boundary parameter silently dropped documented as Known Limitation (instance #9, pre-existing). Output row ordering changed to lexicographic (Behavior Changes section added). 14 T1 invariance tests in `test_fit_path_perf_invariance.py`. F2 (median batching) deferred. T2 re-profile pending. Coder: Claude22. Reviewers: Claude20 (Main), Claude21, Claude23, Claude24, Claude25. |
652+
| **4.0** | **2026-05-10** | **Phases 13.20–13.23a performance optimization sequence.** Pipeline wall time 1452s→722s (2×). Phase 13.20: numba-ize `_aggregate_window_dense` (CSR gather kernel). Phase 13.21: vectorize SW fit wrapper + prange gather + batch MAD (F1+F2+F3). Phase 13.22: CI roofline regression tests (K=2.9 baseline, 12-cycle proposal v1.12, 10/10 pass). Phase 13.23a: profile-driven easy wins (cProfile 61ms→45ms, -41% function calls). **New sections:** Performance Optimization (13.20–13.21), Roofline Performance Framework (13.22+13.23a profile decomposition). **Current State updated** to v4.0 column (575 tests, 0 broken, 10 roofline, K=2.9). Parameter-not-propagated instances #10 (calibration wrong-module) and #11 (pytest venv-path) documented in PHASE_HISTORY v7.0. Drafted by Claude22 at architect request 2026-05-10. |

0 commit comments

Comments
 (0)