This document tracks performance optimizations implemented in SAGESim to improve simulation runtime.
Branch: 24-runtime-optimization
Problem: GPU kernels were performing linear search to find neighbor agent data during step function execution.
Original approach:
# In GPU kernel - for EACH agent, for EACH neighbor
for neighbor_id in agent_neighbors:
# Linear search through ALL agents to find neighbor
for i in range(len(all_agent_ids)):
if all_agent_ids[i] == neighbor_id:
neighbor_data = property_tensor[i]
breakRuntime Complexity: O(N × M × K)
- N = number of agents
- M = average neighbors per agent
- K = total agents in context (local + neighbors)
For large networks, this creates a quadratic or worse bottleneck.
Implementation: model.py:598-651
Key Idea: Convert agent IDs to array indices once on CPU using a hash map, before GPU execution.
# Line 599-600: O(K) time, O(K) space
all_agent_ids_list = self.__rank_local_agent_ids + received_neighbor_ids
agent_id_to_index = {int(agent_id): idx for idx, agent_id in enumerate(all_agent_ids_list)}Example:
all_agent_ids_list = [5, 10, 15, 20, 25]
agent_id_to_index = {5: 0, 10: 1, 15: 2, 20: 3, 25: 4}# Lines 644-651: O(N × M) time
if i == 1: # Property 1 is 'locations' (neighbors/connections)
# Convert agent IDs to local indices while still a list
combined = convert_agent_ids_to_indices(combined, agent_id_to_index)Example conversion:
# Before: neighbors as agent IDs
locations_cpu = [[10, 15], [5, 20], [10, 25]]
# After: neighbors as array indices
locations_as_indices = [[1, 2], [0, 3], [1, 4]]# In GPU kernel - now O(1) lookup!
neighbor_indices = locations[agent_index] # Pre-converted indices
for neighbor_index in neighbor_indices:
neighbor_state = state_tensor[neighbor_index] # Direct access!Per-agent neighbor lookup:
For each agent (N):
For each neighbor (M):
Linear search through all IDs (K):
Compare IDs
Total: O(N × M × K)
Example: 1000 agents, avg 10 neighbors, 1200 total in context
- Operations: 1000 × 10 × 1200 = 12,000,000 comparisons per tick
Hash map creation (once per tick): O(K) ID conversion (once per tick): O(N × M) Per-agent lookup (in kernel): O(1) per neighbor
Total: O(K + N × M) + O(N × M) = O(K + N × M)
Example: Same scenario
- Hash map: 1200 operations
- Conversion: 1000 × 10 = 10,000 operations
- Kernel lookups: 1000 × 10 = 10,000 direct accesses
- Total: ~21,200 operations (574× improvement!)
Location: model.py:29-108
The function has been optimized with:
- Dense array lookup for non-sparse agent IDs (O(1) per lookup)
- Binary search for sparse agent IDs (O(log n) per lookup)
- Fully vectorized path for numpy arrays
def convert_agent_ids_to_indices(data_tensor, agent_id_to_index_map):
# OPTIMIZATION: Build lookup arrays ONCE instead of for every agent!
id_keys = np.array(list(agent_id_to_index_map.keys()), dtype=np.int32)
id_values = np.array(list(agent_id_to_index_map.values()), dtype=np.int32)
min_id = id_keys.min()
max_id = id_keys.max()
id_range = max_id - min_id + 1
# Use dense array if not too sparse (< 3x overhead)
use_dense = id_range < len(agent_id_to_index_map) * 3
if use_dense:
lookup_array = np.full(id_range, -1, dtype=np.int32)
lookup_array[id_keys - min_id] = id_values
else:
# Sparse: use sorted arrays for binary search
sort_idx = np.argsort(id_keys)
sorted_keys = id_keys[sort_idx]
sorted_values = id_values[sort_idx]
result = []
for agent_data in data_tensor:
if isinstance(agent_data, np.ndarray):
# FULLY VECTORIZED: Use pre-built lookup arrays
valid_mask = ~np.isnan(agent_data)
converted = np.full(agent_data.shape, -1, dtype=np.int32)
if np.any(valid_mask):
valid_ids = agent_data[valid_mask].astype(np.int32)
if use_dense:
# Dense lookup: O(1) array indexing
in_range = (valid_ids >= min_id) & (valid_ids <= max_id)
indices = np.full(len(valid_ids), -1, dtype=np.int32)
indices[in_range] = lookup_array[valid_ids[in_range] - min_id]
else:
# Sparse: use searchsorted (O(log n) per lookup)
positions = np.searchsorted(sorted_keys, valid_ids)
found = (positions < len(sorted_keys)) & (sorted_keys[positions] == valid_ids)
indices = np.where(found, sorted_values[positions], -1)
converted[valid_mask] = indices
result.append(converted.tolist())
elif isinstance(agent_data, (list, tuple, set)):
# Handle collections with hash map lookup
converted_data = [
agent_id_to_index_map.get(int(v), -1) if not np.isnan(v) else v
for v in agent_data
]
result.append(converted_data)
# ... handle other cases
return resultFeatures:
- Handles numpy arrays (vectorized), lists, tuples, sets, and scalars
- Preserves NaN values
- Returns -1 for missing IDs (safety)
- Adaptive: uses dense or sparse lookup based on ID distribution
Property 1 is always locations (neighbor data):
# Property index 1 is 'locations' (neighbors/connections) - this is standard in SAGESimConversion applied only to property 1:
# Lines 644-651
if i == 1: # Property 1 is 'locations' (neighbors/connections)
combined = convert_agent_ids_to_indices(combined, agent_id_to_index)Additional memory per tick:
- Hash map: O(K) entries (K = local + neighbor agents)
- Lookup array: O(max_id - min_id) for dense, O(K) for sparse
- Converted locations: O(N × M) indices
Trade-off: Small memory cost for massive performance gain.
Example: 1000 agents, 10 neighbors avg, 200 neighbors received
- Hash map: ~1200 entries × 16 bytes = ~19KB
- Locations copy: ~10000 indices × 4 bytes = ~40KB
- Total overhead: ~59KB (negligible for modern GPUs)
Test file: tests/test_double_buffer.py
Both test cases verify correct neighbor lookups with converted indices:
test_1_tick_spread_with_SIModel: 1→10→100 hierarchical networktest_2_tick_spread_with_SIRModel: Multi-priority execution
Network structure ensures neighbors must be correctly identified for infection spread.
Network density factor:
- Sparse networks (M << K): ~K/M speedup
- Dense networks (M ≈ K): Still O(N × K) → O(N) reduction
Most ABM networks are sparse (M << K), making this optimization highly effective:
| Agents (N) | Avg Neighbors (M) | Context Size (K) | Before (ops) | After (ops) | Speedup |
|---|---|---|---|---|---|
| 100 | 5 | 120 | 60,000 | ~620 | 97× |
| 1,000 | 10 | 1,200 | 12,000,000 | ~21,200 | 566× |
| 10,000 | 20 | 12,000 | 2,400,000,000 | ~212,000 | 11,321× |
| 100,000 | 50 | 120,000 | 600,000,000,000 | ~12,120,000 | 49,504× |
Note: Speedup increases with network size (exactly what we need!).
- Hash map creation:
model.py:599-600 - ID conversion function:
model.py:29-108 - Conversion application:
model.py:644-651 - Kernel argument preparation:
model.py:686-694 - Step function usage:
tests/test_double_buffer.py:24(neighbor_indices access)
Problem:
The convert_to_equal_side_tensor() function was using Awkward Array for all ragged list conversions, even for simple 2D arrays. Awkward Array has significant overhead for depth-2 structures that numpy can handle efficiently with padding.
Overhead: Awkward Array is powerful but adds ~500-1000ms for large 2D tensors that numpy can handle directly.
Implementation: sagesim/internal_utils.py:8-96
Key Idea: Detect tensor depth and use numpy padding for simple 2D structures, fallback to Awkward Array only for complex nested structures.
def convert_to_equal_side_tensor(ragged_list: List[Any]) -> cp.array:
# Quick check: is data already padded? (all rows same length AND elements are scalars)
# This happens after first tick when .tolist() keeps padded structure
if isinstance(ragged_list[0], (list, tuple)):
row_lengths = [len(row) for row in ragged_list]
all_same_len = len(set(row_lengths)) == 1
if all_same_len:
is_scalar = not isinstance(ragged_list[0][0], (list, tuple))
if is_scalar:
# Already padded depth-2 data! Just convert to GPU array
return cp.array(ragged_list, dtype=np.float32) # Detect depth using awkward
awkward_array = ak.from_iter(ragged_list)
min_depth, max_depth = awkward_array.layout.minmax_depth
depth = max_depth
# Use fast NumPy path for depth 1-2 (common cases)
if depth <= 2:
return _convert_numpy_fast(ragged_list, depth)
# Fall back to awkward for depth 3+ (rare cases)
else:
return _convert_awkward(awkward_array, depth)def _convert_numpy_fast(ragged_list: List[Any], depth: int) -> cp.array:
# Depth 1: Simple 1D array (scalars)
if depth == 1:
return cp.array(ragged_list, dtype=np.float32)
# Depth 2: 2D ragged array [[1,2], [3], [4,5,6]]
elif depth == 2:
max_len = max(len(row) for row in ragged_list)
result = np.full((len(ragged_list), max_len), np.nan, dtype=np.float32)
for i, row in enumerate(ragged_list):
if len(row) > 0:
result[i, :len(row)] = list(row) if isinstance(row, set) else row
return cp.array(result)Benchmark on SuperNeuroABM SNN network (2716 somas + 10000 synapses):
| Operation | Before (Awkward) | After (Numpy) | Speedup |
|---|---|---|---|
| Convert to equal side tensor | 0.9-1.1s | 0.5-0.6s | 1.8-2× faster |
Memory: Same memory usage (both create padded arrays)
Correctness: Identical output for 2D structures, tested on all existing test cases
Already-padded fast path:
- Data from previous tick (
.tolist()preserves padding) - Skips all conversion entirely
Numpy fast path (depth <= 2):
- Agent connectivity lists:
[[1, 2, 3], [4, 5], [6]] - Simple ragged property arrays
- Most common case in ABM simulations
Awkward Array fallback (depth > 2):
- Complex nested structures:
[[[1, 2], [3]], [[4]]] - Rare in practice, but supported
- Already-padded detection:
sagesim/internal_utils.py:18-35 - Depth detection:
sagesim/internal_utils.py:37-44 - Numpy fast path:
sagesim/internal_utils.py:55-80 - Awkward fallback:
sagesim/internal_utils.py:83-96 - Usage:
sagesim/model.py:662(called during data preparation)
Problem:
The simulate() method was reimporting the GPU step function module on every call. This caused CuPy to recompile or revalidate the JIT kernel, resulting in 20× performance degradation on subsequent simulation runs.
Testing SuperNeuroABM on Cora graph (5 test papers):
Paper 1: 2.96s ✓
Paper 2: 55.95s ✗ (20× slower!)
Paper 3: 55.69s ✗
Paper 4: 55.95s ✗
Paper 5: 56.09s ✗
Between each paper, model.reset() was called, then simulate() ran again.
def simulate(self, ticks, sync_workers_every_n_ticks=1):
# THIS HAPPENED EVERY TIME simulate() WAS CALLED:
step_func_module = importlib.import_module("step_func_code")
self._step_func = step_func_module.stepfunc
# Then use self._step_func...Why was this slow?
- Reimporting a module with CuPy JIT kernels triggers cache invalidation checks
- CuPy has to verify if the kernel changed, recompile if needed
- Even without recompilation, validation is expensive (~53 seconds!)
- The kernel code never actually changes between calls
Implementation:
sagesim/model.py:436(setup method imports once)sagesim/model.py:713(simulate uses cachedself._step_func)
def setup(self, use_gpu=True):
# Generate step function file
if worker == 0:
with open(self._step_function_file_path, "w") as f:
f.write(generate_gpu_func(...))
comm.barrier()
# ❌ Step function NOT imported here
def simulate(self, ticks, sync_workers_every_n_ticks=1):
# ❌ Import happens here, every time simulate() is called
step_func_module = importlib.import_module("step_func_code")
self._step_func = step_func_module.stepfunc
# ... rest of simulationdef setup(self, use_gpu=True):
# Generate step function file
if worker == 0:
with open(self._step_function_file_path, "w") as f:
f.write(generate_gpu_func(...))
comm.barrier()
# ✓ Import and cache the step function ONCE
with warnings.catch_warnings():
warnings.simplefilter("ignore")
step_func_module = importlib.import_module("step_func_code")
self._step_func = step_func_module.stepfunc # Cached in instance
def simulate(self, ticks, sync_workers_every_n_ticks=1):
# ✓ No import! Just use cached self._step_func
# ... rest of simulation uses self._step_func directlySame benchmark (5 papers on Cora graph):
| Paper | Before | After | Speedup |
|---|---|---|---|
| Paper 1 | 2.96s | 3.29s | 0.9× (slightly slower due to setup overhead) |
| Paper 2 | 55.95s | 2.33s | 24× faster |
| Paper 3 | 55.69s | 2.82s | 20× faster |
| Paper 4 | 55.95s | 2.85s | 20× faster |
| Paper 5 | 56.09s | 3.04s | 18× faster |
Total time for 5 papers: 228.64s → 14.33s (16× overall speedup)
Key insight: model.reset() only resets data (agent states, tick counter), but the compiled GPU kernel doesn't need to change.
Think of it like this:
- Before: Every time you wanted to use a calculator, you built a new one from scratch
- After: You build the calculator once during
setup(), store it inself._step_func, and reuse it forever
The GPU kernel is like a compiled binary - once it's compiled, you can run it as many times as you want with different data.
- Cache during setup:
sagesim/model.py:436 - Kernel invocation:
sagesim/model.py:713(usesself._step_func)
Problem:
SAGESim's AgentFactory maintains a _prev_agent_data cache to avoid redundant MPI sends. When agent data hasn't changed between ticks, the cache skips the MPI communication.
However, model.reset() in SuperNeuroABM resets all agent states to initial values, but didn't clear the cache. This caused:
- Every agent's new data to be compared against stale cached data
- ~12,000 expensive
np.array_equal()calls per reset - 1.8-2.0s overhead on every simulation after the first
Paper 1 contextualize: 0.08s ✓
Paper 2 contextualize: 1.80s ✗ (20× slower!)
Paper 3 contextualize: 2.00s ✗
Paper 4 contextualize: 1.85s ✗
Paper 5 contextualize: 1.95s ✗
# sagesim/agent.py (contextualize_agent_data_tensors)
if agent_id in self._prev_agent_data:
# Compare against cached data
all_equal = all(
np.array_equal(curr, prev)
for curr, prev in zip(current_data, cached_data)
)
if all_equal:
continue # Skip MPI sendAfter reset(), all agents have changed, but we still do 12,000+ array comparisons before figuring this out!
Implementation: Application code should clear cache in reset method
def reset(self, retain_parameters: bool = True) -> None:
self._reset_agents(retain_parameters=retain_parameters)
# ❌ Cache NOT cleared, still contains old data
super().reset()def reset(self, retain_parameters: bool = True) -> None:
self._reset_agents(retain_parameters=retain_parameters)
# ✓ Clear SAGESim's agent data cache
self._agent_factory._prev_agent_data.clear()
super().reset()Benchmark on SuperNeuroABM (2716 somas + 10000 synapses):
| Metric | Before | After | Speedup |
|---|---|---|---|
| Paper 1 contextualize | 0.08s | 0.08s | 1× (unchanged) |
| Paper 2+ contextualize | 1.8-2.0s | 0.12-0.15s | 13-16× faster |
Why Paper 1 unchanged? On first run, cache is empty, so no comparisons happen. Optimization only helps subsequent runs after reset.
The cache is designed for inter-tick optimization during a single simulation:
- Tick 1 → Tick 2: Many agents unchanged → skip MPI send
- Tick 2 → Tick 3: Many agents unchanged → skip MPI send
But after reset(), we're starting a new simulation:
- All agents are reset to initial state
- Old cached data is irrelevant
- Comparing against stale cache is pure waste
Solution: Clear the cache on reset, treat each simulation as fresh.
- Cache usage:
sagesim/agent.py:557-600(contextualize method) - Cache definition:
sagesim/agent.py:53(_prev_agent_data = {})
Problem: Every tick, workers exchange ALL properties for neighbor agents, even properties that neighbors never read.
Impact: Wasted bandwidth, especially for agents with many internal-only properties.
Implementation: sagesim/breed.py:37-58 and sagesim/agent.py:534-539
Mark properties that neighbors don't need to read:
breed.register_property("health", default=100, neighbor_visible=True) # Sent
breed.register_property("counter", default=0, neighbor_visible=False) # NOT sentOnly neighbor_visible=True properties are included in MPI messages.
| Scenario | Properties | Visible | Bandwidth Reduction |
|---|---|---|---|
| SIR Model | 5 | 2 | 60% |
| Neural Network | 10 | 3 | 70% |
| Complex ABM | 20 | 5 | 75% |
See docs/selective_property_synchronization.md for full documentation.
| # | Optimization | Component | Speedup | Impact |
|---|---|---|---|---|
| 1 | ID to Index Conversion | GPU kernel lookup | 100-50,000× | Eliminates O(N×M×K) linear search |
| 2 | Depth Detection for Tensors | Tensor conversion | 1.8-2× | Numpy fast path for 2D arrays |
| 3 | Step Function Caching | JIT compilation | 16-24× | Import once instead of per-simulate |
| 4 | Clear Cache on Reset | MPI contextualize | 13-16× | Avoid stale data comparisons |
| 5 | Selective Property Sync | MPI bandwidth | 60-75% | Only send neighbor-visible properties |
Combined impact: Can improve overall simulation performance by 100-1000× depending on network size and usage pattern.
SuperNeuroABM on Cora Graph (5 test papers, 2716 somas, 10000 synapses):
Paper 1: ~120s (linear search + awkward + first import)
Paper 2: ~180s (linear search + awkward + reimport + stale cache)
Paper 3: ~180s
Paper 4: ~180s
Paper 5: ~180s
Total: ~840s (14 minutes)
Paper 1: 3.29s
Paper 2: 2.33s
Paper 3: 2.82s
Paper 4: 2.85s
Paper 5: 3.04s
Total: 14.33s
Overall speedup: 58.6× (from 14 minutes to 14 seconds!)
- GPU-Resident Data: Keep data on GPU between ticks, eliminate CPU↔GPU transfers
- GPU-Aware MPI: Direct GPU-to-GPU communication via RDMA
- GPU-Side Packing: Pack/unpack MPI buffers on GPU instead of CPU
- RCCL/NCCL Integration: Use GPU-native collectives for global reductions
- Kernel Fusion: Combine multiple priority groups when safe
See gpu_optimization_roadmap.md for detailed plans.
- CuPy profiler:
cupyx.profiler.time_range() - NVIDIA Nsight Systems: Full timeline analysis
- ROCm rocprof: AMD GPU profiling
- All optimizations maintain 100% backward compatibility
- No changes required to user code or step functions
- Optimizations are transparent and automatic
- All optimizations verified with existing test suites
- Memory overhead is negligible (<100KB per optimization)