Commit 1eb0e86
Optimize _gridmake2
## Performance Optimization Summary
The optimized code achieves an **884% speedup** (from 1.07ms to 109μs) by replacing NumPy's high-level array operations with **Numba JIT-compiled explicit loops**.
### Key Optimizations
**1. Numba JIT Compilation (`@njit(cache=True)`)**
- Compiles the function to machine code at runtime, eliminating Python interpreter overhead
- The `cache=True` flag stores the compiled version, avoiding recompilation costs on subsequent runs
- Particularly effective here because the function contains simple arithmetic and array indexing operations that Numba optimizes well
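A minimal sketch of the decorator in use, to make the first-call cost concrete. The kernel `scaled_sum` and its inputs are hypothetical and not part of this commit; the `try`/`except` fallback is an assumption added so the snippet still runs where Numba is not installed.

```python
import time
import numpy as np

try:
    from numba import njit
except ImportError:  # assumption: fall back to plain Python if Numba is absent
    def njit(*args, **kwargs):
        return lambda func: func


@njit(cache=True)
def scaled_sum(values, factor):
    """Hypothetical kernel: simple arithmetic plus array indexing."""
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i] * factor
    return total


data = np.arange(1_000, dtype=np.float64)

start = time.perf_counter()
first = scaled_sum(data, 2.0)   # triggers compilation (or an on-disk cache load)
first_call = time.perf_counter() - start

start = time.perf_counter()
second = scaled_sum(data, 2.0)  # reuses the machine code compiled above
second_call = time.perf_counter() - start

print(f"first call:  {first_call * 1e6:.1f} us")
print(f"second call: {second_call * 1e6:.1f} us")
```

The absolute timings depend on the machine and on whether a cached build already exists, so none are quoted here.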
**2. Explicit Loop-Based Construction vs. NumPy Array Construction**
- **Original approach**: Used `np.tile()`, `np.repeat()`, and `np.column_stack()` which create multiple intermediate arrays and perform memory allocations
- **Optimized approach**: Pre-allocates the output array once with `np.empty()` and fills it directly using nested loops
- This eliminates intermediate array creation and reduces memory allocation overhead
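The two approaches can be sketched side by side for the 1-D case. This is an illustration under stated assumptions, not the code from this commit: the function names `cartesian_numpy` and `cartesian_loops` are hypothetical, both inputs are assumed to be 1-D arrays of the same dtype, and the `njit` fallback is there only so the snippet runs without Numba.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # assumption: fall back to plain Python if Numba is absent
    def njit(*args, **kwargs):
        return lambda func: func


def cartesian_numpy(x1, x2):
    """Baseline: tile and repeat build temporaries, column_stack copies them."""
    return np.column_stack([np.tile(x1, x2.shape[0]),
                            np.repeat(x2, x1.shape[0])])


@njit(cache=True)
def cartesian_loops(x1, x2):
    """Loop version: one allocation, each output element written exactly once."""
    n1 = x1.shape[0]
    n2 = x2.shape[0]
    out = np.empty((n1 * n2, 2), dtype=x1.dtype)
    for j in range(n2):          # x2 varies slowest
        for i in range(n1):      # x1 varies fastest
            row = j * n1 + i
            out[row, 0] = x1[i]
            out[row, 1] = x2[j]
    return out


x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([10.0, 20.0])
baseline = cartesian_numpy(x1, x2)
looped = cartesian_loops(x1, x2)
```

Both functions return the same 6x2 array here; the loop version simply skips the temporaries that `np.tile()` and `np.repeat()` produce before stacking.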
**3. Why This Works**
From the line profiler, the original code spent:
- **76.4%** of time in `np.column_stack([np.tile(...)])`
- **8.5%** in `np.repeat()`
- **9.3%** in `np.tile()` for the 2D case
These NumPy operations, while convenient, involve:
- Multiple temporary array allocations
- Memory copies during stacking operations
- Python-level function call overhead
Numba's compiled loops avoid all of this by directly computing each output element in place.
### Impact on Workloads
Based on `function_references`, `_gridmake2` is called from `gridmake()` which:
- Calls it **once for 2 input arrays**
- Calls it **iteratively** for 3+ arrays (once initially, then in a loop for remaining arrays)
For multi-array scenarios (3+ inputs), the savings accumulate because `_gridmake2` is called multiple times per `gridmake()` invocation. The roughly **9.8x speedup** per call (1.07ms down to 109μs) translates to substantial gains in computational economics applications where Cartesian products are frequently computed for state space expansions.
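The call pattern described above can be sketched as a fold over the inputs. The names `pair_product` and `product_of_many` are hypothetical stand-ins for `_gridmake2` and `gridmake()`, written in plain NumPy for brevity; the real functions may differ in validation and dtype handling.

```python
import numpy as np


def pair_product(x1, x2):
    """Hypothetical stand-in for _gridmake2: x1 is 1-D or 2-D, x2 is 1-D."""
    if x1.ndim == 1:
        x1 = x1.reshape(-1, 1)
    n1, n2 = x1.shape[0], x2.shape[0]
    out = np.empty((n1 * n2, x1.shape[1] + 1), dtype=np.result_type(x1, x2))
    out[:, :-1] = np.tile(x1, (n2, 1))   # existing columns cycle fastest
    out[:, -1] = np.repeat(x2, n1)       # new column varies slowest
    return out


def product_of_many(*arrays):
    """Hypothetical driver: one pairwise call, then one more per extra array."""
    out = pair_product(arrays[0], arrays[1])
    for arr in arrays[2:]:
        out = pair_product(out, arr)
    return out


grid = product_of_many(np.array([1, 2]),
                       np.array([10, 20]),
                       np.array([100, 200]))
```

With `k` input arrays the pairwise routine runs `k - 1` times, so any per-call saving is collected on every one of those calls.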
### Trade-offs
- First call incurs JIT compilation overhead (~tens of milliseconds); later calls in the same process reuse the compiled code, and `cache=True` lets subsequent runs load it from an on-disk cache instead of recompiling
- Code is more verbose but dramatically faster for repeated execution patterns
- Best suited for scenarios where the function is called multiple times (amortizing compilation cost)

1 parent bab9ae9 commit 1eb0e86
1 file changed
Lines changed: 33 additions & 23 deletions