
Commit c0dd601

Add KB entry 035: streaming hypergraph multi-pass evaluation
Lessons from optimizing FREIGHT streaming hypergraph partitioning multi-pass evaluation via pybind11 bindings. Key techniques: per-net bit vectors with popcount, objective-specific evaluation paths, incremental bit-setting, and avoiding redundant copies. 1.82x speedup (82.9ms to 45.6ms) across 96 configurations, bit-identical results with the FREIGHT CLI. Includes detailed "what didn't work" section covering loop fusion, LTO, software prefetching, and raw pointers vs pybind11 proxies.
1 parent 0a407f8 commit c0dd601

2 files changed: +124 −0

Lines changed: 123 additions & 0 deletions
# Eliminating Heap Allocation in Streaming Hypergraph Multi-Pass Evaluation
## Problem
The task was to optimize multi-pass streaming hypergraph partitioning in [FREIGHT](https://github.com/freelancer/FREIGHT) via its pybind11 binding. FREIGHT uses a streaming algorithm in which nodes are processed one by one and assigned to partition blocks; multi-pass restreaming re-evaluates the partition after each pass and keeps the best. The metric was the geometric mean of wall-clock time across 96 configurations (4 ISPD98 instances, 3 k values, 4 pass counts, 2 objectives), with a hard constraint that results remain bit-identical to the FREIGHT CLI. The baseline geometric mean was 82.9ms.
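For reference, a geometric mean of per-configuration times can be computed in log space; this is a minimal sketch (`geo_mean` is an illustrative helper, not part of the benchmark harness):

```cpp
#include <cmath>
#include <vector>

// Geometric mean of per-configuration wall-clock times (ms).
// Summing logs avoids overflow when multiplying many values.
double geo_mean(const std::vector<double>& times_ms) {
    double log_sum = 0.0;
    for (double t : times_ms) log_sum += std::log(t);
    return std::exp(log_sum / static_cast<double>(times_ms.size()));
}
```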
The bottleneck was the inter-pass evaluation: after each pass, the code needed to compute how many hyperedges (nets) span multiple partition blocks. The original implementation built a reverse mapping (`vector<vector<int64_t>>` net-to-nodes, O(total_pins) with many small vector allocations) and then created a `std::set<PartitionID>` for each net to count distinct blocks. For large hypergraphs (200K+ nets, 800K+ pins), this meant millions of heap allocations per pass.
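For concreteness, the allocation-heavy baseline pattern looked roughly like this (a sketch with illustrative names such as `net_to_nodes` and `nodes_assign`, not FREIGHT's actual code):

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Baseline evaluation sketch: a reverse net-to-nodes mapping (one heap
// vector per net), then a std::set per net to count distinct blocks.
// Every inner vector and every set node is a separate heap allocation.
int64_t count_connectivity_baseline(
    const std::vector<std::vector<int64_t>>& net_to_nodes,  // net -> pins
    const std::vector<int>& nodes_assign) {                 // node -> block
    int64_t connectivity = 0;
    for (const auto& pins : net_to_nodes) {
        std::set<int> blocks;  // heap-allocating tree, rebuilt per net
        for (int64_t node : pins) blocks.insert(nodes_assign[node]);
        if (blocks.size() > 1)
            connectivity += static_cast<int64_t>(blocks.size()) - 1;
    }
    return connectivity;
}
```

Each net spanning λ blocks contributes λ−1, so a net whose pins all land in one block contributes nothing.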
## What Worked
The combined effect was a **1.82x speedup** (82.9ms to 45.6ms) over 8 kept experiments. Below are the techniques in order of impact.
### 1. Per-net bit vectors replacing reverse mapping + std::set (~1.60x)
The original evaluation built `net_to_nodes[net] = {node0, node1, ...}` (a `vector<vector<int64_t>>`) and then for each net created a `std::set<PartitionID>` to count distinct blocks. Both data structures involve heap allocation per net.
**Fix:** Replace with a flat `vector<uint64_t>` of size `num_nets * ceil(k/64)`. Each net gets `ceil(k/64)` words as a bit vector. To evaluate, iterate nodes via the existing node-to-edge CSR, and for each node's nets, OR the assigned block's bit into the net's word. Then a single popcount scan over all nets gives the distinct block count. This eliminates the entire reverse mapping and all `std::set` allocations.
For k <= 64 (the common case), each net uses exactly one `uint64_t`, and the bit-setting reduces to `net_block_bits[net_id] |= (1ULL << block)`.
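A minimal sketch of this k ≤ 64 fast path (standalone illustrative functions; in FREIGHT the OR runs inside the streaming loop rather than through a helper):

```cpp
#include <cstdint>
#include <vector>

// k <= 64: exactly one uint64_t per net. Marking a block is a single OR;
// counting distinct blocks per net is a single popcount (GCC/Clang builtin).
inline void mark_block(std::vector<uint64_t>& net_block_bits,
                       int64_t net_id, int block) {
    net_block_bits[net_id] |= (1ULL << block);  // block in [0, 64)
}

int64_t connectivity_k64(const std::vector<uint64_t>& net_block_bits) {
    int64_t pass_connectivity = 0;
    for (uint64_t bits : net_block_bits) {
        int distinct = __builtin_popcountll(bits);
        if (distinct > 1) pass_connectivity += distinct - 1;  // lambda - 1
    }
    return pass_connectivity;
}
```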
### 2. Objective-specific evaluation paths (~4%)
For the connectivity objective, the bit-vector evaluation is needed (distinct block count per net). For the cut-net objective, however, the existing `stream_edges_assign` array already tracks whether each net is cut: entries equal to `CUT_NET` mark cut nets. A simple O(num_nets) sequential scan counting `CUT_NET` entries replaces the entire bit-vector machinery for half of all configurations.
### 3. Incremental bit-setting during the main loop (~0.5%)
Instead of a separate O(total_pins) evaluation scan after each pass, set bits during the main partitioning loop (one OR per pin right after `solve_node`). The evaluation reduces to just the O(num_nets) popcount scan. The edge data (CSR indices) is in L1 cache from the prior accumulation scan, making the bit-setting nearly free.
### 4. Eliminate intermediate vectors (~4%)
The original code maintained a `valid_neighboring_nets` vector, cleared and rebuilt per node via `push_back`. Replacing this with a direct re-iteration of the CSR edges for the per-net tracking update eliminates vector overhead (no clear, no push_back, no capacity management) while accessing the same data already in L1 cache.
### 5. Avoid redundant copies in output path (~3%)
The original code: (a) copied best assignment to a snapshot vector on each improvement, (b) restored the snapshot back to the working array after the loop, (c) element-wise copied the working array to the numpy output. Three O(n) passes over 210K+ elements.
**Fix:** Track which pass was best. Skip the snapshot on the last pass (just read the working array directly). Use `memcpy` for intermediate snapshots instead of `vector::operator=`. Copy the correct source directly to the numpy output via `memcpy`, skipping the intermediate restore.
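The resulting bookkeeping can be sketched as follows (a hypothetical `BestTracker` assuming `int` block IDs; FREIGHT's actual types and control flow differ):

```cpp
#include <cstring>
#include <vector>

// Tracks the best pass with at most one memcpy per improvement and one
// memcpy into the output buffer. The last pass never snapshots: if it
// wins, the working array itself is the copy source at output time.
struct BestTracker {
    std::vector<int> snapshot;     // pre-sized once, reused across passes
    double best_score = 1e300;
    bool best_is_working = false;  // true iff the live working array is best

    explicit BestTracker(size_t n) : snapshot(n) {}

    void observe(const int* working, size_t n, double score, bool last_pass) {
        if (score >= best_score) return;       // no improvement
        best_score = score;
        best_is_working = last_pass;           // last pass: skip the snapshot
        if (!last_pass)
            std::memcpy(snapshot.data(), working, n * sizeof(int));
    }

    // Single copy straight into the (numpy) output buffer; no restore pass.
    void write_output(const int* working, int* out, size_t n) const {
        const int* src = best_is_working ? working : snapshot.data();
        std::memcpy(out, src, n * sizeof(int));
    }
};
```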
### 6. Pre-allocate best partition vectors (~3%)
Pre-size `best_nodes_assign` and `best_blocks_weight` to their final sizes at initialization. This avoids a dynamic allocation + copy when the first improvement is found. Small but measurable because the allocation is O(n) and triggers on the first pass of every multi-pass run.
## Experiment Data
| # | geo_mean (ms) | Status | Hypothesis |
|---|---------------|--------|------------|
| 0 | 82.92 | baseline | Original implementation |
| 1 | 51.73 | keep | Per-net bit vectors replace net_to_nodes + std::set |
| 2 | 51.41 | keep | Incremental bit-setting during main loop |
| 3 | 52.66 | discard | Fuse bit-setting and per-net tracking (hurts ILP) |
| 4 | 49.37 | keep | Eliminate valid_neighboring_nets vector |
| 5 | 50.44 | discard | Specialize post-solve for connectivity vs cut_net (icache pressure) |
| 6 | 49.62 | discard | Specialize bit-setting/popcount for k<=64 (within noise) |
| 7 | 47.18 | keep | Direct CUT_NET counting for cut_net eval |
| 8 | 45.73 | keep | Pre-allocate best partition vectors |
| 9 | 47.86 | discard | Software prefetch (avg node degree ~4, too short for prefetch) |
| 10 | 45.69 | keep | memcpy output directly from best source |
| 11 | 47.67 | discard | Hoist use_connectivity branch (icache pressure from duplication) |
| 12 | 45.47 | keep | Skip snapshot on last pass |
| 13 | 48.55 | discard | LTO for cross-TU inlining (increased code size, worse icache) |
| 14 | 49.75 | discard | Raw pointers instead of pybind11 unchecked accessors (pybind11 generated better code) |
| 15 | 45.58 | keep | null_buf replacing /dev/null file open |
## What Didn't Work
Several approaches that seemed promising failed or regressed:
- **Loop fusion (bit-setting + per-net tracking in one loop):** The two operations have different memory access patterns (OR into bit vector vs read-modify-write of edge assignment). Fusing them into a single loop hurt instruction-level parallelism. Separate loops allow the CPU to pipeline memory operations independently. This was confirmed twice (iterations 3 and 5).
- **Code path specialization (separate loops for connectivity vs cut-net):** Duplicating the inner loop body to eliminate a branch (connectivity: unconditional write; cut-net: read-modify-write) increased instruction cache pressure. The branch was already perfectly predicted since it's loop-invariant. The compiler hoists it automatically.
- **k_words=1 specialization:** For k <= 64, `k_words=1` eliminates a multiply per edge. But the compiler already optimizes `x * 1` and the multiply is hidden by memory latency. Added complexity for no measurable gain.
- **Software prefetching:** Average node degree was ~4 (820K pins / 210K nodes). With so few iterations per inner loop, prefetch instructions don't have enough lead time to hide latency, and the prefetch instruction itself adds overhead.
- **LTO (link-time optimization):** Expected to enable cross-TU inlining (and devirtualization) of `compute_score` calls. Instead, it increased code size and worsened instruction cache behavior, causing a 6% regression.
- **Raw data pointers vs pybind11 unchecked accessors:** Replacing `vp(i)` / `ve(i)` (pybind11 proxy) with `vp[i]` / `ve[i]` (raw pointer dereference) was 9% slower. The pybind11 `unchecked<1>()` accessor apparently provides alignment or aliasing hints that help the compiler generate better code. A surprising result; do not assume raw pointers are faster than well-designed proxy objects.
## Code Example
Core of the bit-vector evaluation (connectivity mode):
```cpp
// Allocate once: ceil(k/64) words per net
size_t k_words = (k + 63) / 64;
std::vector<uint64_t> net_block_bits(num_nets * k_words, 0);

// During main partitioning loop, after solve_node:
uint64_t bit = uint64_t(1) << (block & 63);
size_t word = block >> 6;
for (int64_t e = edge_begin; e < edge_end; e++) {
  net_block_bits[ve(e) * k_words + word] |= bit;
}

// After pass: popcount scan
for (int64_t net = 0; net < num_nets; net++) {
  int distinct = 0;
  for (size_t w = 0; w < k_words; w++)
    distinct += __builtin_popcountll(net_block_bits[net * k_words + w]);
  if (distinct > 1)
    pass_connectivity += distinct - 1;
}
// Reset for next pass
std::fill(net_block_bits.begin(), net_block_bits.end(), 0);
```
Cut-net evaluation (no bit vectors needed):
```cpp
// stream_edges_assign[net] == CUT_NET means the net is cut
double pass_cut = 0;
for (int64_t net = 0; net < num_nets; net++)
  if (stream_edges_assign[net] == CUT_NET) pass_cut += 1;
```
## Environment
- **Language:** C++17 with pybind11 bindings, compiled via scikit-build-core
- **Hardware:** x86-64 Linux (256-core server), also verified on macOS ARM
- **Compiler:** GCC with `-O2 -march=native`
- **Instances:** ISPD98 ibm01 (13K nodes), ibm05 (29K nodes), ibm18 (211K nodes)
- **Benchmark:** 96 configurations (4 instances, k in {4,8,16}, passes in {2,3,5,10}, both connectivity and cut-net objectives), 3 repetitions per config, median timing

knowledge-base/INDEX.md

Lines changed: 1 addition & 0 deletions

@@ -38,3 +38,4 @@ Optimization techniques, experiment results, and lessons learned from AAE sessions

| 032 | array module for typed numerical data | Compact storage, binary I/O | array.array for 72% memory reduction vs list | [032-array-module-typed-data.md](032-array-module-typed-data.md) |
| 033 | Cython for C-speed hot loops | Custom metrics, graph traversal | Typed Cython with memoryviews for 95x speedup | [033-cython-c-speed-hot-loops.md](033-cython-c-speed-hot-loops.md) |
| 034 | Optimizing label propagation in graph clustering | Multilevel graph clustering, LP refinement | Dense vectors, counting-sort contraction, sweep specialization, allocation elimination | [034-graph-clustering-lp-refinement.md](034-graph-clustering-lp-refinement.md) |
| 035 | Eliminating heap allocation in streaming hypergraph multi-pass evaluation | Streaming hypergraph partitioning, multi-pass evaluation | Per-net bit vectors with popcount, objective-specific eval paths, incremental bit-setting | [035-streaming-hypergraph-multipass-eval.md](035-streaming-hypergraph-multipass-eval.md) |
