Expand What Worked section with detailed technical descriptions
Each optimization now includes the root cause, the fix, and why it
helps at the hardware/memory level. Enough detail to reproduce each
technique on a similar multilevel graph framework.
File: `knowledge-base/034-graph-clustering-lp-refinement.md` (61 additions, 12 deletions)

Context: Optimizing the runtime of a signed graph correlation clustering solver ([Scalabl…
## What Worked
The combined effect was a **1.23x speedup (18.7%)** over 30 experiments. Below, each technique is described in enough detail to reproduce it.

### 1. Eliminate hash map overhead in inner loops (~7%)

The LP inner loop accumulates edge weights per neighboring block to decide which block a node should move to. The original code used `std::unordered_map<PartitionID, EdgeWeight>` — every edge traversal hashed the target block ID, probed the hash table, and potentially allocated a new bucket. Since this runs for every node on every LP sweep on every coarsening/refinement level, it dominated the profile.

**Fix:** Replace with a dense `std::vector<EdgeWeight>` of size `max_blocks`, indexed directly by block ID. Track which entries were touched in a small side vector, and reset only those entries after processing each node. This turns O(1)-amortized hash lookups into O(1)-worst-case array indexing and eliminates all hashing, bucket allocation, and cache-hostile pointer chasing.
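A minimal sketch of this accumulator, reusing the `PartitionID`/`EdgeWeight` types from the text; the class name and reset policy here are illustrative, not the solver's exact code:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using PartitionID = uint32_t;
using EdgeWeight = int64_t;

// Dense per-block accumulator: O(1) worst-case indexing, no hashing.
struct BlockAccumulator {
    std::vector<EdgeWeight> weight;    // indexed directly by block ID
    std::vector<PartitionID> touched;  // entries to reset after this node

    explicit BlockAccumulator(PartitionID max_blocks) : weight(max_blocks, 0) {}

    void add(PartitionID block, EdgeWeight w) {
        // With signed weights a sum can transiently return to zero, so a
        // block may land in `touched` twice; resetting twice is harmless.
        if (weight[block] == 0) touched.push_back(block);
        weight[block] += w;
    }

    // Reset only the touched entries, not the whole vector.
    void clear() {
        for (PartitionID b : touched) weight[b] = 0;
        touched.clear();
    }
};
```

The `clear()` cost is proportional to the node's degree, not to `max_blocks`, which is what makes the dense vector viable across millions of node visits.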

The same pattern applied to `maxNodeHeap`, which backed its key lookups with a hash map. Replacing it with a three-vector architecture (`m_elements`, `m_element_index[node] → position`, `m_heap[position] → key`) gives O(1) direct-indexed lookup instead of hash probing.

### 2. Direct contraction via counting sort

Each coarsening level contracts the graph: fine nodes are merged into coarse super-nodes. The original code built a `complete_boundary` object (~16MB on large graphs), saved and restored the full partition map (~8MB), and used `vector<vector<NodeID>>` to group nodes per block — all to support a generic contraction interface.

**Fix:** A single counting-sort pass groups fine nodes by their coarse mapping in O(N) time:

1. Histogram: count how many fine nodes map to each coarse node.
2. Prefix sum: convert counts to start offsets.
3. Scatter: place each fine node at its offset position.

Then iterate coarse nodes in order, processing contiguous runs of fine nodes. This replaces ~24MB of intermediate structures with three flat arrays totaling O(N) and eliminates the partition save/restore entirely. Memory access is sequential during the scatter and iteration phases, which is cache-friendly.
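The three steps can be sketched as a standalone grouping routine; the inputs and return shape here are assumptions, not the solver's API:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using NodeID = uint32_t;

// Group fine nodes by their coarse node via counting sort, in O(N) with three
// flat arrays. coarse_of[v] is the coarse node that fine node v maps to.
// Returns (offsets, order): the fine nodes of coarse node c occupy
// order[offsets[c] .. offsets[c+1]).
std::pair<std::vector<NodeID>, std::vector<NodeID>>
group_by_coarse(const std::vector<NodeID>& coarse_of, NodeID num_coarse) {
    std::vector<NodeID> offsets(num_coarse + 1, 0);
    // 1. Histogram: count fine nodes per coarse node.
    for (NodeID c : coarse_of) ++offsets[c + 1];
    // 2. Prefix sum: convert counts to start offsets.
    for (NodeID c = 0; c < num_coarse; ++c) offsets[c + 1] += offsets[c];
    // 3. Scatter: place each fine node at its slot (sequential writes
    //    within each coarse node's run).
    std::vector<NodeID> order(coarse_of.size());
    std::vector<NodeID> cursor(offsets.begin(), offsets.end() - 1);
    for (NodeID v = 0; v < coarse_of.size(); ++v)
        order[cursor[coarse_of[v]]++] = v;
    return {offsets, order};
}
```

The contraction loop then walks `order` once, building each coarse node from a contiguous run, with no per-block `vector<vector<NodeID>>` and no partition save/restore.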

### 3. LP sweep specialization and block-ID caching (~3%)

LP processes each node in three sweeps: (1) accumulate edge weights per block, (2) find the best block, (3) reset the accumulator. In the original code, all three sweeps read the edge array independently, each time dereferencing `cluster_id[edges[e].target]` to look up the target's block.

**Fix — cache block IDs for low-degree nodes:** For nodes with degree ≤ 32 (covering ~95% of nodes in real-world graphs), sweep 1 writes the block IDs into a stack-allocated `PartitionID blk_cache[32]`. Sweeps 2 and 3 iterate `blk_cache` instead of re-reading the edge array and re-dereferencing `cluster_id[]`. The 32-element cache fits in one or two L1 cache lines.

**Fix — specialize sweep 2 for unconstrained path:** When no cluster size constraints are active (the common case in correlation clustering), sweep 2 only needs block IDs and accumulated weights — it doesn't need edge weights or node IDs. The specialized path iterates the `blk_cache` array in a tight loop with no edge-array access at all, cutting random memory reads in half.

**Fix — cache partition IDs in constrained path:** When constraints are active and the graph is already partitioned, sweep 2 must also check `getPartitionIndex()` for each neighbor. Caching these in a `PartitionID part_cache[32]` alongside `blk_cache` avoids a second round of random lookups into the partition array.
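A sketch of the three sweeps with the degree ≤ 32 cache, for the unconstrained path only; the CSR-style graph layout and the function signature are illustrative assumptions:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using NodeID = uint32_t;
using PartitionID = uint32_t;
using EdgeWeight = int64_t;

// Pick the best block for node v. `acc` is the dense per-block accumulator
// (size >= number of blocks, all zeros on entry and on exit).
PartitionID best_block(NodeID v,
                       const std::vector<uint32_t>& first_edge,   // CSR offsets
                       const std::vector<NodeID>& targets,        // edge targets
                       const std::vector<EdgeWeight>& weights,    // edge weights
                       const std::vector<PartitionID>& cluster_id,
                       std::vector<EdgeWeight>& acc) {
    PartitionID blk_cache[32];                  // fits in 1-2 L1 cache lines
    uint32_t begin = first_edge[v], end = first_edge[v + 1];
    uint32_t deg = end - begin;
    bool cached = deg <= 32;

    // Sweep 1: accumulate edge weight per neighboring block, caching each
    // neighbor's block ID (the random lookup) for reuse in sweeps 2-3.
    for (uint32_t e = begin; e < end; ++e) {
        PartitionID b = cluster_id[targets[e]];
        if (cached) blk_cache[e - begin] = b;
        acc[b] += weights[e];
    }

    // Sweep 2 (unconstrained): find the heaviest block. For low-degree nodes
    // this touches only blk_cache + acc, no edge-array access at all.
    PartitionID best = cluster_id[v];
    EdgeWeight best_w = acc[best];
    for (uint32_t i = 0; i < deg; ++i) {
        PartitionID b = cached ? blk_cache[i] : cluster_id[targets[begin + i]];
        if (acc[b] > best_w) { best_w = acc[b]; best = b; }
    }

    // Sweep 3: reset only the touched accumulator entries.
    for (uint32_t i = 0; i < deg; ++i) {
        PartitionID b = cached ? blk_cache[i] : cluster_id[targets[begin + i]];
        acc[b] = 0;
    }
    acc[cluster_id[v]] = 0;  // v's own block may not appear among neighbors
    return best;
}
```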

### 4. Pointer hoisting with `__restrict__` (~1.5%)

The LP inner loop accesses edges via `G.getEdgeTarget(e)` which compiles to `graphref->m_edges[e].target` — a pointer-to-pointer indirection on every edge. With millions of edges per LP iteration, this adds up.

**Fix:** Add `edge_array()` / `node_array()` accessors to `graph_access` that return raw pointers, and hoist them into local variables before the loop. The `__restrict__` qualifier tells the compiler these pointers don't alias, enabling auto-vectorization and instruction reordering that wasn't possible through the accessor indirection.
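A minimal sketch of the hoisted-pointer loop; only the accessor names come from the text, while the stand-in `graph_access` layout and the `accumulate` helper are assumptions:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using NodeID = uint32_t;
using EdgeWeight = int64_t;

struct Edge { NodeID target; EdgeWeight weight; };

// Stand-in for graph_access with the raw-pointer accessors the text describes.
struct graph_access {
    std::vector<uint32_t> m_nodes;  // CSR offsets, size = #nodes + 1
    std::vector<Edge> m_edges;
    const uint32_t* node_array() const { return m_nodes.data(); }
    const Edge* edge_array() const { return m_edges.data(); }
};

// Accumulate per-block weights for node v with pointers hoisted out of the loop.
void accumulate(const graph_access& G, NodeID v,
                const std::vector<uint32_t>& cluster_id,
                std::vector<EdgeWeight>& acc) {
    // Hoisted once: no graphref->m_edges indirection per edge, and
    // __restrict__ promises no aliasing, so the compiler can vectorize
    // and reorder across iterations.
    const Edge* __restrict__ edges = G.edge_array();
    const uint32_t* __restrict__ nodes = G.node_array();
    EdgeWeight* __restrict__ a = acc.data();
    for (uint32_t e = nodes[v]; e < nodes[v + 1]; ++e)
        a[cluster_id[edges[e].target]] += edges[e].weight;
}
```

`__restrict__` is the GCC/Clang spelling; it is a promise from the programmer, so it is only safe here because `acc` never overlaps the graph arrays.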

### 5. Persistent buffers as class members (~2%)

LP coarsening and LP refinement each use several large buffers: the hash map vector, a permutation array, and two queue-membership vectors (`vector<char>`). Originally these were local variables, allocated and freed on every call — once per coarsening level (typically 10-15 levels).

**Fix:** Move them to class member variables (`m_hash_map`, `m_permutation`, `m_qc_a`, `m_qc_b`). On each call, resize if needed (capacity grows monotonically during coarsening since graphs shrink), then `assign()` to reset values. This converts O(N) allocations to O(N) memsets, which are much cheaper — memset is a single cache-line-streaming operation vs malloc's free-list search, mmap, and page-fault overhead.

### 6. Stack allocation of framework objects (~1.5%)

The multilevel loop allocates LP, contraction, and stop-rule objects at each level. Originally these were heap-allocated (`new`/`delete`), producing malloc pressure and heap fragmentation over 10+ levels.

**Fix:** Stack-allocate them as local variables in the coarsening loop. Constructor/destructor run at scope entry/exit with zero allocator overhead. For refinement, the LP and k-way refinement objects are created once and reused across all uncoarsening levels via persistent smart pointers.
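A minimal sketch of the stack-allocated per-level loop; the class names here are illustrative stand-ins, not the solver's types:

```cpp
#include <cassert>

struct label_propagation { /* persistent buffers as members */ };
struct contraction {};
struct stop_rule { bool done(int level) const { return level >= 3; } };

int coarsen() {
    int levels = 0;
    for (int level = 0; ; ++level) {
        // Stack allocation: constructed/destroyed at scope entry/exit,
        // no new/delete and no heap fragmentation across 10+ levels.
        label_propagation lp;
        contraction contract;
        stop_rule rule;
        ++levels;
        // ... contract one level with lp/contract ...
        if (rule.done(level)) break;
    }
    return levels;
}
```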

### 7. tcmalloc_minimal (~4%)

After eliminating the biggest allocation hotspots, the remaining malloc/free calls (from graph construction, edge arrays, STL containers) still added up. Linking Google's `tcmalloc_minimal` replaced glibc's allocator with one that uses per-thread free-list caches, avoiding lock contention and reducing fragmentation.

**Integration:** Auto-detected via CMake `find_library(TCMALLOC_LIB tcmalloc_minimal)`, linked only on Linux. Falls back to the default allocator if not found.

### 8. Smaller wins (~1.5% combined)

- **`vector<char>` over `vector<bool>`**: The queue-membership flags were `vector<bool>`, which uses bit-packing. Each access requires shift+mask operations. Switching to `vector<char>` (one byte per entry) trades 8x memory for direct byte access — worthwhile because these vectors are small relative to the graph and accessed in the hot loop.
- **`MADV_HUGEPAGE` for LP arrays**: The hash map vector is randomly accessed by block ID. On graphs with 2M+ nodes, this causes TLB thrashing with 4KB pages. `madvise(MADV_HUGEPAGE)` hints the kernel to back it with 2MB pages, reducing TLB entries needed by 512x. Only applied to LP-local arrays — applying it to the main graph arrays caused THP overhead that was worse than the TLB savings.
- **Compiler flags**: `-fprefetch-loop-arrays` lets GCC insert prefetch instructions for streaming edge-array iteration. `-fno-plt` eliminates PLT indirection on shared library calls (minor, but free).