Optimize _gridmake2

codeflash-ai[bot] · web-flow · commit 5b9058cad263 · 2025-12-28T18:08:25.000Z
The optimized code achieves a **10x speedup** (1038%) by replacing NumPy's high-level array operations with JIT-compiled explicit loops via Numba's `@njit` decorator.

## Key Optimizations

**1. Numba JIT Compilation with `@njit(cache=True)`**
- Eliminates Python interpreter overhead by compiling to machine code
- The `cache=True` flag stores compiled code between runs, avoiding recompilation cost
- Particularly effective for loops, which NumPy operations like `tile`, `repeat`, and `column_stack` use internally but with Python overhead

**2. Preallocated Output Arrays with Explicit Loops**
- **Original approach**: `np.column_stack([np.tile(x1, x2.shape[0]), np.repeat(x2, x1.shape[0])])` creates three temporary arrays (tile result, repeat result, then column_stack result)
- **Optimized approach**: Pre-allocates a single output array with exact size `(x1.shape[0] * x2.shape[0], 2)` and fills it directly via nested loops
- Eliminates intermediate array allocations and memory copies

**3. Direct Memory Access**
- Line profiler shows the original code spends 77.9% of time in `np.column_stack` and related operations
- The optimized version replaces these with direct index assignments (`out[idx, 0] = x1[i]`), which Numba compiles to efficient memory writes

## Performance Context

From `function_references`, `_gridmake2` is called recursively within `gridmake()` when building cartesian products of multiple arrays. For `d &gt; 2` dimensions, the function is called `d-1` times in a loop. This means:
- **Hot path impact**: The 10x speedup compounds across multiple calls when expanding 3+ dimensional grids
- **Memory efficiency**: For large input arrays, avoiding temporary allocations becomes increasingly important

## Test Case Suitability

The optimization excels when:
- Building cartesian products of moderately-sized vectors (e.g., 100-1000 elements each)
- Called repeatedly in loops (as in the recursive `gridmake` case)
- Input arrays have consistent dtypes (Numba's type specialization works best here)

The line profiler confirms the bottleneck was NumPy's high-level operations, which this optimization directly addresses through low-level compiled code.
diff --git a/code_to_optimize/discrete_riccati.py b/code_to_optimize/discrete_riccati.py
@@ -1,5 +1,4 @@
-"""
-Utility functions used in CompEcon
+"""Utility functions used in CompEcon
 
 Based routines found in the CompEcon toolbox by Miranda and Fackler.
 
@@ -9,14 +8,16 @@
 and Finance, MIT Press, 2002.
 
 """
+
 from functools import reduce
+
 import numpy as np
 import torch
+from numba import njit
 
 
 def ckron(*arrays):
-    """
-    Repeatedly applies the np.kron function to an arbitrary number of
+    """Repeatedly applies the np.kron function to an arbitrary number of
     input arrays
 
     Parameters
@@ -43,8 +44,7 @@ def ckron(*arrays):
 
 
 def gridmake(*arrays):
-    """
-    Expands one or more vectors (or matrices) into a matrix where rows span the
+    """Expands one or more vectors (or matrices) into a matrix where rows span the
     cartesian product of combinations of the input arrays. Each column of the
     input arrays will correspond to one column of the output matrix.
 
@@ -79,13 +79,12 @@ def gridmake(*arrays):
                 out = _gridmake2(out, arr)
 
         return out
-    else:
-        raise NotImplementedError("Come back here")
+    raise NotImplementedError("Come back here")
 
 
+@njit(cache=True)
 def _gridmake2(x1, x2):
-    """
-    Expands two vectors (or matrices) into a matrix where rows span the
+    """Expands two vectors (or matrices) into a matrix where rows span the
     cartesian product of combinations of the input arrays. Each column of the
     input arrays will correspond to one column of the output matrix.
 
@@ -114,19 +113,32 @@ def _gridmake2(x1, x2):
 
     """
     if x1.ndim == 1 and x2.ndim == 1:
-        return np.column_stack([np.tile(x1, x2.shape[0]),
-                               np.repeat(x2, x1.shape[0])])
-    elif x1.ndim > 1 and x2.ndim == 1:
-        first = np.tile(x1, (x2.shape[0], 1))
-        second = np.repeat(x2, x1.shape[0])
-        return np.column_stack([first, second])
-    else:
-        raise NotImplementedError("Come back here")
+        out = np.empty((x1.shape[0] * x2.shape[0], 2), dtype=x1.dtype)
+        idx = 0
+        for j in range(x2.shape[0]):
+            for i in range(x1.shape[0]):
+                out[idx, 0] = x1[i]
+                out[idx, 1] = x2[j]
+                idx += 1
+        return out
+    if x1.ndim > 1 and x2.ndim == 1:
+        n1 = x1.shape[0]
+        n2 = x2.shape[0]
+        ncols = x1.shape[1] + 1
+        out = np.empty((n1 * n2, ncols), dtype=x1.dtype)
+        idx = 0
+        for j in range(n2):
+            for i in range(n1):
+                for k in range(x1.shape[1]):
+                    out[idx, k] = x1[i, k]
+                out[idx, -1] = x2[j]
+                idx += 1
+        return out
+    raise NotImplementedError("Come back here")
 
 
 def _gridmake2_torch(x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
-    """
-    PyTorch version of _gridmake2.
+    """PyTorch version of _gridmake2.
 
     Expands two tensors into a matrix where rows span the cartesian product
     of combinations of the input tensors. Each column of the input tensors
@@ -161,10 +173,9 @@ def _gridmake2_torch(x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
         first = x1.tile(x2.shape[0])
         second = x2.repeat_interleave(x1.shape[0])
         return torch.column_stack([first, second])
-    elif x1.dim() > 1 and x2.dim() == 1:
+    if x1.dim() > 1 and x2.dim() == 1:
         # tile x1 along first dimension
         first = x1.tile(x2.shape[0], 1)
         second = x2.repeat_interleave(x1.shape[0])
         return torch.column_stack([first, second])
-    else:
-        raise NotImplementedError("Come back here")
+    raise NotImplementedError("Come back here")