> - **Phase 13.23a K≤2.2 acceptance gate NOT MET** (K=3.0, wall time unchanged at 42ms). Documented: easy wins reduced cProfile overhead, not steady-state wall time. Remaining gap is structural (BLAS kernel + Python orchestration).
> **Changes from v3.3:**
> - **F1 fixed:** `make_sliding_window_aggregate` and its parallel sibling now honour `boundary='symmetric'` and `boundary='periodic'` for the mean/std/count output columns. Both the primary kernel call and the sigma-cut recompute kernel call thread a Path 2 structure `(valid_offset_mask, wrap_flag, wrap_idx, wrapped_coords)` computed once per call by the new helper `_precompute_aggregate_boundary_mask`. Zero behavioural change on the default `boundary='full'` path (verified strictly by T9 against a captured literal baseline).
@@ -437,18 +443,21 @@ The v3.3 split between fit-path (`⚠️`) and aggregate-path (`🧨 silently br
| Partial (⚠️) | 0 | 0 | 0 | 0 | **0** (D1 median deferral is recorded as a Known Limitation, not as a Capability Matrix Partial entry, since it does not map to a distinct capability row) |
¹ **v3.3 on disk records 528 passed.** This is an inherited arithmetic error carried over from the FIX2 coder packet: `517 + 11 new = 528` forgot the F5 flip (one failing test moved to passing). Correct arithmetic: `517 + 11 + 1 = 529`. Reference: `SUMMARY_20260409_112129.txt` from the FIX2 reviewer packet. v3.4 corrects this.
@@ -564,6 +573,56 @@ Row positions of output are not stable across `algorithm=` values; they were not
**Performance:** V1/V2 recompute path internal routing changed from `_build_bin_index_map` (Python dict, O(n²) for n=rows) to `_assign_bin_ids_fast` (vectorized numpy, O(n)). `_get_neighbor_bins` V3a (2.5M per-bin Python function calls) replaced by inline vectorized offset + dense array lookup in `_aggregate_window_dense`. Predicted savings: ~355s on the 82M-row `makeSmoothMapsWithTPC` calibration workload (profile: `profile_gr11_tf0.prof`). T2 re-profile pending.
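The routing change described above can be illustrated with a minimal NumPy sketch. It assumes integer bin coordinates per row and a known number of bins per dimension. The helper names (`assign_bin_ids`, `build_dense_lookup`, `neighbor_positions`) are hypothetical and do not reproduce the signatures of `_assign_bin_ids_fast` or `_aggregate_window_dense`.

```python
import itertools
import numpy as np

def assign_bin_ids(bin_coords, shape):
    """Map (n_rows, n_dims) integer bin coordinates to flat bin ids in O(n)."""
    return np.ravel_multi_index(tuple(bin_coords.T), shape)

def build_dense_lookup(bin_ids, n_total):
    """Dense array: flat bin id -> position among occupied bins, -1 if empty."""
    unique_ids = np.unique(bin_ids)
    lookup = np.full(n_total, -1, dtype=np.int64)
    lookup[unique_ids] = np.arange(unique_ids.size)
    return lookup

def neighbor_positions(center, offsets, shape, lookup):
    """Occupied-bin positions inside the window around `center`.

    Replaces a per-bin Python function call with one vectorized
    offset addition and one dense array lookup.
    """
    nb = center + offsets                       # (n_offsets, n_dims)
    inside = np.all((nb >= 0) & (nb < np.asarray(shape)), axis=1)
    flat = np.ravel_multi_index(tuple(nb[inside].T), shape)
    pos = lookup[flat]
    return pos[pos >= 0]                        # drop empty bins

# Hypothetical 3x3 grid with four rows falling into three distinct bins.
shape = (3, 3)
coords = np.array([[0, 0], [0, 1], [2, 2], [0, 1]])
ids = assign_bin_ids(coords, shape)
lookup = build_dense_lookup(ids, int(np.prod(shape)))
offsets = np.array(list(itertools.product((-1, 0, 1), repeat=2)))
found = neighbor_positions(np.array([0, 0]), offsets, shape, lookup)
```

The dense lookup trades memory (one entry per possible bin) for constant-time access, which is only reasonable when the full grid fits in memory; that is an assumption of this sketch, not a statement about the production code.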
**Phase 13.21.GB-PERF** — SW fit wrapper vectorization + batch MAD:
- F1: `_fit_window_regression_numba` rewritten — per-bin Python loops eliminated (47.8s self-time → vectorized). One `np.column_stack` for all bins, kernel handles NaN via `INVALID_FILTER`, vectorized R² from kernel `out_sum_y`/`out_sum_y2` sufficient statistics.
- F2: `_gather_rows` kernel parallelized with `nb.prange`
- F3: `_get_batch_mad_kernel` replaces 1.26M per-bin `np.median` calls (52s) with one numba prange kernel call in V4 pipeline. Fallback: `GBAI_DISABLE_BATCH_MEDIAN=1`
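The idea behind F3, one batched call instead of a `np.median` call per bin, can be sketched as follows. The source uses a numba `prange` kernel; this sketch swaps in a sort-based pure-NumPy approach so that it runs without numba. The function names are hypothetical, and the input is assumed to contain only finite values.

```python
import numpy as np

def _group_median(values, bin_ids, n_bins):
    """Median of `values` within each bin, computed for all bins at once."""
    order = np.lexsort((values, bin_ids))       # sort by bin, then by value
    sorted_vals = values[order]
    counts = np.bincount(bin_ids, minlength=n_bins)
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))
    med = np.full(n_bins, np.nan)               # empty bins stay NaN
    ok = counts > 0
    lo = starts[ok] + (counts[ok] - 1) // 2     # lower middle element
    hi = starts[ok] + counts[ok] // 2           # upper middle element
    med[ok] = 0.5 * (sorted_vals[lo] + sorted_vals[hi])
    return med

def batch_mad(values, bin_ids, n_bins):
    """Median absolute deviation per bin without a per-bin Python loop."""
    med = _group_median(values, bin_ids, n_bins)
    dev = np.abs(values - med[bin_ids])
    return _group_median(dev, bin_ids, n_bins)

# Hypothetical data: bin 0 has five rows, bin 1 is empty, bin 2 has three.
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0, 10.0, 10.0, 10.0])
bin_ids = np.array([0, 0, 0, 0, 0, 2, 2, 2])
mad = batch_mad(values, bin_ids, n_bins=3)
```

Two global sorts replace the per-bin median calls. Whether this is faster than a compiled `prange` kernel depends on data size and core count, which this sketch does not measure.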
CI-integrated roofline metric measuring distance from ideal compiled code:

```
K = T_observed / T_expected, where T_expected = Σ(n_ops × t_primitive)
```

K=1 means at hardware roofline. K>1 quantifies the optimization gap. Primitives from phase 12.12b Appendix A (M1 stream read, M2 gather, M3w scatter write, C1 XᵀWX, C2 Cholesky solve, C6 MAD), re-measured on target machine at test fixture working-set size.
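The metric defined above can be computed directly from operation counts and measured primitive timings. The following is a minimal sketch; the function name `roofline_k` and the numbers in the example are hypothetical and are not taken from the phase 12.12b measurements.

```python
def roofline_k(t_observed, op_counts, primitive_times):
    """Return K = T_observed / T_expected.

    T_expected is the sum over primitives of (number of operations x
    measured time per operation). Both dicts are keyed by primitive name.
    """
    t_expected = sum(n * primitive_times[name] for name, n in op_counts.items())
    if t_expected <= 0:
        raise ValueError("expected time must be positive")
    return t_observed / t_expected

# Hypothetical example: two stream reads (M1) and one XᵀWX step (C1).
op_counts = {"M1": 2, "C1": 1}
primitive_times = {"M1": 0.005, "C1": 0.004}    # seconds per call, made up
k = roofline_k(0.042, op_counts, primitive_times)
```

A CI check could then compare `k` against a threshold such as the K≤2.2 acceptance gate mentioned in this document.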
**Current K values (alma2, 100K rows, 1024 bins):**
**Phase 13.23a easy wins:** Module-level Dispatcher cache (A1), listcomp vectorization (A2), V5-style preallocated `_assemble_results` (A3). cProfile 61ms→45ms (-26%), 63K→37K function calls (-41%). Wall time unchanged at 42ms — gains are cProfile-overhead reduction, not steady-state wall-time. K=2.9 is structural.
### Unverified Claims
| Claim | Status | Action |
@@ -589,4 +648,5 @@ Row positions of output are not stable across `algorithm=` values; they were not
|**3.2**|**2026-04-07**|**Corrections from v3.1 multi-reviewer cycle. Phase references corrected to `13.16.GB-FIX2` and `13.17.GB`. `boundary='symmetric'` split into two rows (⚠️ fit path tested, 🧨 aggregate path silently broken). All three Quick Start examples made runnable with keyword `df=`. `register_fit_model()` correct name in Public Interface Catalog. `'nearest_fast'` added to Evaluator Method Reference. `from_dfGB()` example uses real keyword arguments. F1 and F2 added to Current State "Broken" count. Governance reference updated to Org-structure v1.25. All v3.0 sections preserved verbatim. Drafted by Main Reviewer (Claude20, GBAI) at architect request after v3.1 review cycle.**|
|**3.3**|**2026-04-07**|**Phase 13.16.GB-FIX2 landed. F2/F3/F4/F5 fixed in source (evaluator method=dict validation, docstring coverage, lookup+extrapolate rejection, stale backend test). C3/C8/C10 addressed as part of the same commit ('nearest'/'nearest_fast' equivalence, runnable docstring examples, unknown-method detection). 11 new tests in `test_evaluator_lookup.py`, all with explicit path parameters per failure mode #11 and `pytest.raises(..., match=...)` per C7. Test count 517 → 528 canonical / 493 → 505 coder-env. Pre-existing failures 3 → 2 (F5 flipped). Broken count 2 → 1 (F2 fixed; F1 remains for 13.17.GB). All v3.0 sections preserved verbatim. Drafted by Coder (Claude21, GBAI) during implementation of PHASE_13_16_GB_FIX2_v1.0 proposal as consolidated in Claude20's review summary.**|
| **3.5** | **2026-04-18** | **Phase 13.19.GB-PERF landed.** V1/V2 recompute path routed through dense-lookup infrastructure (F1). `_build_bin_index_map` (205s) replaced by `_assign_bin_ids_fast`. `_get_neighbor_bins` V3a (152s) inlined into `_aggregate_window_dense`. `[FOUND-WHILE-IMPLEMENTING-F1]` `_fit_window_regression_numba` gains `fit_intercept` parameter — parameter-not-propagated class instance #8. V1/V2 boundary parameter silently dropped documented as Known Limitation (instance #9, pre-existing). Output row ordering changed to lexicographic (Behavior Changes section added). 14 T1 invariance tests in `test_fit_path_perf_invariance.py`. F2 (median batching) deferred. T2 re-profile pending. Coder: Claude22. Reviewers: Claude20 (Main), Claude21, Claude23, Claude24, Claude25. | F1 fixed for mean/std/count in both the primary kernel call and the sigma-cut recompute kernel call via the new `_precompute_aggregate_boundary_mask` helper (Path 2: mask + per-bin wrap flag + compact per-edge-bin wrapped-coords table). Both numba JIT and numpy fallback receive the new `(valid_offset_mask, wrap_flag, wrap_idx, wrapped_coords)` parameters in identical order at all 4 call sites (verified via signature-chain grep). Zero behavioural change on the default `boundary='full'` path — verified strictly by T9 against a literal baseline array hardcoded in the test file. Parallel wrapper inherits the fix automatically (T10 parallel ≡ serial invariance, 2-D fixture). D1 (median path boundary honouring) deferred to Phase 13.17.GB-MedianFix per architect direction 2026-04-09. **28 new pytest-level test runs** in `tests/test_aggregate_boundary.py` from 16 test functions: T1–T11 (T11 with second invariance assertion block), T13 oracle, T14 cross-backend × 6, T15 shift + topology, T16 window=0 × 3, T17 canary × 6. Unified Claude22 + Claude23 joint plan, **50% invariance ratio** (8/16 functions are invariance-style — up from v1.2's 25%). T12 removed per D1 deferral. 
**Test count: 528 corrected baseline + 28 new = 556 passed canonical** on alma2 commit `85713774`, branch `feature/groupby-optimization`, `test_logs/SUMMARY_20260411_090322.txt`. Pre-existing failures: 3 (was 2 in v3.3 corrected; +1 from `test_v3_numpy_faster_than_v1_numpy` newly listed). Capability Matrix: 0 broken / 0 partial / 43 verified / 89 smoke-only / 1 planned / 133 total. **Inherited arithmetic correction:** v3.3 records 528 passed canonical (off by one — forgot the F5 flip in FIX2); the *correct* v3.3 baseline is 529, but the *empirical* pre-13.17.GB passed-count was 528 because the timing test was already silently failing. v3.4 surfaces both numbers transparently. **New Known Limitation rows added:** D1 median deferral (with verbatim architect quotes including typos), pre-existing broken `test_aggregate_numba_matches_numpy` (superseded by T14, NOT deleted in-phase), `test_v3_numpy_faster_than_v1_numpy` timing test. **Public Interface Catalog unchanged** — Phase 13.17.GB extends behaviour of an existing parameter (`boundary`) without changing any function signature. **All v3.0 / v3.2 / v3.3 sections marked `[UNCHANGED]` preserved verbatim.** Drafted by Coder Claude22 during Phase 13.17.GB implementation; commit-time Main Reviewer Claude21. Suggested archive filename: `GroupByRegression_Technical_Summary_PHASE_13_17_GB_v3_4.md`. |
|**3.5**|**2026-04-18**|**Phase 13.19.GB-PERF landed.** V1/V2 recompute path routed through dense-lookup infrastructure (F1). `_build_bin_index_map` (205s) replaced by `_assign_bin_ids_fast`. `_get_neighbor_bins` V3a (152s) inlined into `_aggregate_window_dense`. `[FOUND-WHILE-IMPLEMENTING-F1]` `_fit_window_regression_numba` gains `fit_intercept` parameter — parameter-not-propagated class instance #8. V1/V2 boundary parameter silently dropped documented as Known Limitation (instance #9, pre-existing). Output row ordering changed to lexicographic (Behavior Changes section added). 14 T1 invariance tests in `test_fit_path_perf_invariance.py`. F2 (median batching) deferred. T2 re-profile pending. Coder: Claude22. Reviewers: Claude20 (Main), Claude21, Claude23, Claude24, Claude25. |
|**4.0**|**2026-05-10**|**Phases 13.20–13.23a performance optimization sequence.** Pipeline wall time 1452s→722s (2×). Phase 13.20: numba-ize `_aggregate_window_dense` (CSR gather kernel). Phase 13.21: vectorize SW fit wrapper + prange gather + batch MAD (F1+F2+F3). Phase 13.22: CI roofline regression tests (K=2.9 baseline, 12-cycle proposal v1.12, 10/10 pass). Phase 13.23a: profile-driven easy wins (cProfile 61ms→45ms, -41% function calls). **New sections:** Performance Optimization (13.20–13.21), Roofline Performance Framework (13.22+13.23a profile decomposition). **Current State updated** to v4.0 column (575 tests, 0 broken, 10 roofline, K=2.9). Parameter-not-propagated instances #10 (calibration wrong-module) and #11 (pytest venv-path) documented in PHASE_HISTORY v7.0. Drafted by Claude22 at architect request 2026-05-10. |