Commit 5b9058c
Optimize _gridmake2
The optimized code achieves a roughly **10x speedup** (reported as 1038%) by replacing NumPy's high-level array operations with explicit loops that are JIT-compiled via Numba's `@njit` decorator.
## Key Optimizations
**1. Numba JIT Compilation with `@njit(cache=True)`**
- Eliminates Python interpreter overhead by compiling to machine code
- The `cache=True` flag stores compiled code between runs, avoiding recompilation cost
- Particularly effective for explicit loops: NumPy operations like `tile`, `repeat`, and `column_stack` already loop in compiled code, but each call adds Python-level dispatch overhead and allocates its own result array
**2. Preallocated Output Arrays with Explicit Loops**
- **Original approach**: `np.column_stack([np.tile(x1, x2.shape[0]), np.repeat(x2, x1.shape[0])])` allocates three arrays: two temporaries (the `tile` result and the `repeat` result) plus the final `column_stack` output
- **Optimized approach**: Pre-allocates a single output array with exact size `(x1.shape[0] * x2.shape[0], 2)` and fills it directly via nested loops
- Eliminates intermediate array allocations and memory copies
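
The contrast between the two approaches can be sketched as follows. This is a minimal illustration for two 1-D inputs of the same dtype, not the committed code: the function names `gridmake2_numpy` and `gridmake2_loops` are hypothetical, the loop order is chosen to match the `tile`/`repeat` expression quoted above, and the `njit` fallback is only there so the sketch still runs when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # optional here; the commit itself uses @njit(cache=True)
except ImportError:
    def njit(func):
        # Fallback (assumption for this sketch): run as plain, slow Python.
        return func


def gridmake2_numpy(x1, x2):
    """Original-style approach: two temporaries plus the stacked output."""
    return np.column_stack(
        [np.tile(x1, x2.shape[0]), np.repeat(x2, x1.shape[0])]
    )


@njit
def gridmake2_loops(x1, x2):
    """Loop-based approach: one preallocated output, filled in place."""
    n1 = x1.shape[0]
    n2 = x2.shape[0]
    out = np.empty((n1 * n2, 2), dtype=x1.dtype)
    idx = 0
    for j in range(n2):          # x2 varies slowest, like np.repeat
        for i in range(n1):      # x1 varies fastest, like np.tile
            out[idx, 0] = x1[i]
            out[idx, 1] = x2[j]
            idx += 1
    return out


x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([10.0, 20.0])
result_numpy = gridmake2_numpy(x1, x2)
result_loops = gridmake2_loops(x1, x2)
print(np.array_equal(result_numpy, result_loops))
```

Both functions should produce the same `(6, 2)` array; only the number of intermediate allocations differs.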
**3. Direct Memory Access**
- Line profiler shows the original code spends 77.9% of time in `np.column_stack` and related operations
- The optimized version replaces these with direct index assignments (`out[idx, 0] = x1[i]`), which Numba compiles to efficient memory writes
## Performance Context
From `function_references`, `_gridmake2` is called repeatedly within `gridmake()` when building cartesian products of multiple arrays. For `d > 2` dimensions, the function is called `d-1` times in a loop. This means:
- **Hot path impact**: The per-call speedup applies to each of the `d-1` calls when expanding 3+ dimensional grids
- **Memory efficiency**: For large input arrays, avoiding temporary allocations becomes increasingly important
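
The calling pattern described above can be sketched like this. It is an assumption-laden illustration rather than the library's actual `gridmake()`: `pair_product` and `gridmake_sketch` are hypothetical names, the 2-D branch is one plausible way to extend an existing grid by one more column, and the `calls` counter exists only to show that `d` input arrays lead to `d-1` pairwise calls.

```python
import numpy as np

calls = {"pair": 0}  # counts pairwise products, for illustration only


def pair_product(x1, x2):
    """Cartesian product of x1 (1-D vector or 2-D grid) with a 1-D x2."""
    calls["pair"] += 1
    if x1.ndim == 1:
        first = np.tile(x1, x2.shape[0])
    else:
        # Repeat the whole existing grid once per element of x2.
        first = np.tile(x1, (x2.shape[0], 1))
    return np.column_stack([first, np.repeat(x2, x1.shape[0])])


def gridmake_sketch(*arrays):
    """Fold the pairwise product over the inputs, left to right."""
    out = pair_product(arrays[0], arrays[1])
    for arr in arrays[2:]:
        out = pair_product(out, arr)
    return out


grid = gridmake_sketch(np.arange(2), np.arange(3), np.arange(4))
print(grid.shape, calls["pair"])
```

With three inputs of lengths 2, 3, and 4, the grid has 24 rows and 3 columns, and the pairwise routine runs twice, so any per-call saving is collected on every iteration of the loop.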
## Test Case Suitability
The optimization excels when:
- Building cartesian products of moderately-sized vectors (e.g., 100-1000 elements each)
- Called repeatedly in loops (as in the multi-array `gridmake` case)
- Input arrays have consistent dtypes (Numba's type specialization works best here)
The line profiler confirms the bottleneck was NumPy's high-level operations, which this optimization directly addresses through low-level compiled code.
1 parent bab9ae9 commit 5b9058c
1 file changed
Lines changed: 34 additions & 23 deletions