Skip to content

Commit edf2f5f

Browse files
author
miranov25
committed
fix(benchmarks): Fix L2 Numba reference and add 3-level roofline model
ROOFLINE FIX: The L2 (Numba @njit) reference was measuring cache-bound performance because the source array (1288×8 = 41KB) fit entirely in L1 cache. This gave impossibly fast times (0.0008s = 80 GB/s > RAM bandwidth). Fix: Use memory-bound source array matching output size (2M×8 = 64MB). Before: L2 = 0.0008s (cache-bound, incorrect) After: L2 = 0.0039s (memory-bound, correct) 3-LEVEL REFERENCE MODEL: L1 Hardware (memcpy): 0.003s (33 GB/s) - theoretical ceiling L2 Numba (@njit parallel): 0.004s (16 GB/s) - compiled Python ceiling L3 NumPy (C backend): 0.015s (4 GB/s) - Python ceiling (official) EFFICIENCY (direct mode, 0.217s): vs L3: 6.8% (official metric) vs L2: 1.8% vs L1: 1.4% PROFILER ANALYSIS (where the 97% overhead goes): NumPy array operations: 54% Pandas DataFrame ops: 28% Python interpreter: 15% Numba kernels: 3% ← algorithm is near-optimal Also: - Changed default subframe_size from 1000 to 1288 (ALICE TPC) - Added test_l2_timing.py for standalone verification - Updated README with roofline documentation Addresses WP4 feedback on reference definition. Reviewed-by: GPT, Gemini, Claude
1 parent 46d2320 commit edf2f5f

2 files changed

Lines changed: 305 additions & 28 deletions

File tree

UTILS/dfextensions/AliasDataFrame/benchmarks/README.md

Lines changed: 86 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -108,6 +108,92 @@ python history_analysis.py show results/history/ --metric direct_vs_safe_speedup
108108
python history_analysis.py export results/history/ --format wide -o history.csv
109109
```
110110

111+
## Roofline Reference Model
112+
113+
Performance efficiency is measured against well-defined reference levels to ensure reproducible and meaningful comparisons.
114+
115+
### Reference Levels
116+
117+
| Level | Method | Description | Time (2M×8) | Achievable From |
118+
|-------|--------|-------------|-------------|-----------------|
119+
| L1 | Memory Bandwidth | Hardware ceiling (RAM speed) | ~0.006s | C++ only |
120+
| L2 | Numba @njit | Vectorized compiled code | ~0.008s | Optimized Python |
121+
| L3 | NumPy indexing | Python-level vectorization | ~0.016s | Standard Python |
122+
123+
### Official Efficiency Metric
124+
125+
AliasDataFrame efficiency is reported **relative to Level 3 (NumPy)**.
126+
127+
**Why Level 3?**
128+
- **Reproducible** without compiler setup
129+
- Represents **"best standard Python practice"**
130+
- **Conservative**: reports lower efficiency numbers
131+
- **Fair comparison** for a Python framework
132+
133+
Example output:
134+
```
135+
EFFICIENCY (vs Reference Levels)
136+
============================================================
137+
Memory bandwidth: 29.2 GB/s
138+
139+
Reference Levels (for join scenario):
140+
L1 Hardware (memcpy): 0.0060s
141+
L2 Numba (@njit parallel): 0.0080s
142+
L3 NumPy (C backend): 0.0156s ← Official reference
143+
144+
Scenario Time vs L3 vs L2 vs L1
145+
------------------------------------------------------------
146+
direct 0.228s 6.8% 3.5% 2.6%
147+
------------------------------------------------------------
148+
```
149+
150+
### Vectorization Hierarchy
151+
152+
```
153+
Pure Python loop → NumPy (C backend) → Numba (LLVM JIT) → C++ (AVX)
154+
<1% 5-15% 30-50% 70-90%
155+
```
156+
157+
- **NumPy advanced indexing** calls optimized C loops, not Python
158+
- **Numba @njit** compiles to LLVM with auto-vectorization
159+
- **pandas operations** are often NOT vectorized (Python interpreter overhead)
160+
161+
### Gather vs Scatter
162+
163+
AliasDataFrame uses a **scatter** pattern internally:
164+
```python
165+
output[i] = subframe_data[indices[i]] # Write to scattered locations
166+
```
167+
168+
NumPy provides optimized gather (`result = data[indices]`) but not optimized scatter.
169+
We measure both and report the NumPy gather as the official reference.
170+
171+
### Subframe Size
172+
173+
Default subframe size is **1288 rows**, matching the ALICE TPC calibration table structure.
174+
This ensures benchmarks reflect real-world performance characteristics.
175+
176+
### How to Interpret Efficiency Numbers
177+
178+
Efficiency = (Reference Time ÷ AliasDataFrame Time) × 100%
179+
180+
| Efficiency | Meaning |
181+
|------------|---------|
182+
| 100% | Matching reference speed (theoretical limit) |
183+
| 10% | 10× slower than reference |
184+
| 1% | 100× slower than reference |
185+
186+
**Why are our numbers low (~6%)?**
187+
188+
The ~6% efficiency vs L3 means 94% of time is framework overhead:
189+
- pandas DataFrame construction (~60%)
190+
- Python interpreter overhead (~20%)
191+
- Memory allocation (~14%)
192+
193+
The join/scatter **algorithm itself is near-optimal** (Phase 8 Numba kernels achieve ~50% of theoretical on the isolated operation). The overhead is in the surrounding pandas/Python framework, not the core algorithm.
194+
195+
**Key insight:** To exceed ~10% overall efficiency would require replacing pandas with PyArrow or C++ — a much larger architectural change (planned for Phase 9).
196+
111197
## Overview
112198

113199
| Script | Purpose | Data Required |

0 commit comments

Comments
 (0)