Expand What Worked section with detailed technical descriptions
Each optimization now includes the root cause, the fix, and why it
helps at the hardware/memory level. Enough detail to reproduce each
technique on a similar multilevel graph framework.
File: `knowledge-base/034-graph-clustering-lp-refinement.md` (61 additions, 12 deletions)

Context: Optimizing the runtime of a signed graph correlation clustering solver ([Scalabl…
## What Worked
The combined effect was a **1.23x speedup (18.7%)** over 30 experiments. Below, each technique is described in enough detail to reproduce it.

### 1. Eliminate hash map overhead in inner loops (~7%)

The LP inner loop accumulates edge weights per neighboring block to decide which block a node should move to. The original code used `std::unordered_map<PartitionID, EdgeWeight>` — every edge traversal hashed the target block ID, probed the hash table, and potentially allocated a new bucket. Since this runs for every node on every LP sweep on every coarsening/refinement level, it dominated the profile.

**Fix:** Replace with a dense `std::vector<EdgeWeight>` of size `max_blocks`, indexed directly by block ID. Track which entries were touched in a small side vector, and reset only those entries after processing each node. This turns O(1)-amortized hash lookups into O(1)-worst-case array indexing and eliminates all hashing, bucket allocation, and cache-hostile pointer chasing.
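A minimal sketch of this accumulator, reusing the `PartitionID`/`EdgeWeight` types from the text; the class name and reset policy here are illustrative, not the solver's exact code:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using PartitionID = uint32_t;
using EdgeWeight = int64_t;

// Dense per-block accumulator: O(1) worst-case indexing, no hashing.
struct BlockAccumulator {
    std::vector<EdgeWeight> weight;    // indexed directly by block ID
    std::vector<PartitionID> touched;  // entries to reset after this node

    explicit BlockAccumulator(PartitionID max_blocks) : weight(max_blocks, 0) {}

    void add(PartitionID block, EdgeWeight w) {
        // With signed weights a sum can transiently return to zero, so a
        // block may land in `touched` twice; resetting twice is harmless.
        if (weight[block] == 0) touched.push_back(block);
        weight[block] += w;
    }

    // Reset only the touched entries, not the whole vector.
    void clear() {
        for (PartitionID b : touched) weight[b] = 0;
        touched.clear();
    }
};
```

The `clear()` cost is proportional to the node's degree, not to `max_blocks`, which is what makes the dense vector viable across millions of node visits.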

The same pattern applied to `maxNodeHeap`, which backed its key lookups with a hash map. Replacing it with a three-vector architecture (`m_elements`, `m_element_index[node] → position`, `m_heap[position] → key`) gives O(1) direct-indexed lookup instead of hash probing.

### 2. Direct contraction via counting sort

Each coarsening level contracts the graph: fine nodes are merged into coarse super-nodes. The original code built a `complete_boundary` object (~16MB on large graphs), saved and restored the full partition map (~8MB), and used `vector<vector<NodeID>>` to group nodes per block — all to support a generic contraction interface.

**Fix:** A single counting-sort pass groups fine nodes by their coarse mapping in O(N) time:

1. Histogram: count how many fine nodes map to each coarse node.
2. Prefix sum: convert counts to start offsets.
3. Scatter: place each fine node at its offset position.

Then iterate coarse nodes in order, processing contiguous runs of fine nodes. This replaces ~24MB of intermediate structures with three flat arrays totaling O(N) and eliminates the partition save/restore entirely. Memory access is sequential during the scatter and iteration phases, which is cache-friendly.
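The three steps can be sketched as a standalone grouping routine; the inputs and return shape here are assumptions, not the solver's API:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using NodeID = uint32_t;

// Group fine nodes by their coarse node via counting sort, in O(N) with three
// flat arrays. coarse_of[v] is the coarse node that fine node v maps to.
// Returns (offsets, order): the fine nodes of coarse node c occupy
// order[offsets[c] .. offsets[c+1]).
std::pair<std::vector<NodeID>, std::vector<NodeID>>
group_by_coarse(const std::vector<NodeID>& coarse_of, NodeID num_coarse) {
    std::vector<NodeID> offsets(num_coarse + 1, 0);
    // 1. Histogram: count fine nodes per coarse node.
    for (NodeID c : coarse_of) ++offsets[c + 1];
    // 2. Prefix sum: convert counts to start offsets.
    for (NodeID c = 0; c < num_coarse; ++c) offsets[c + 1] += offsets[c];
    // 3. Scatter: place each fine node at its slot (sequential writes
    //    within each coarse node's run).
    std::vector<NodeID> order(coarse_of.size());
    std::vector<NodeID> cursor(offsets.begin(), offsets.end() - 1);
    for (NodeID v = 0; v < coarse_of.size(); ++v)
        order[cursor[coarse_of[v]]++] = v;
    return {offsets, order};
}
```

The contraction loop then walks `order` once, building each coarse node from a contiguous run, with no per-block `vector<vector<NodeID>>` and no partition save/restore.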

### 3. LP sweep specialization and block-ID caching (~3%)

LP processes each node in three sweeps: (1) accumulate edge weights per block, (2) find the best block, (3) reset the accumulator. In the original code, all three sweeps read the edge array independently, each time dereferencing `cluster_id[edges[e].target]` to look up the target's block.

**Fix — cache block IDs for low-degree nodes:** For nodes with degree ≤ 32 (covering ~95% of nodes in real-world graphs), sweep 1 writes the block IDs into a stack-allocated `PartitionID blk_cache[32]`. Sweeps 2 and 3 iterate `blk_cache` instead of re-reading the edge array and re-dereferencing `cluster_id[]`. The 32-element cache fits in one or two L1 cache lines.

**Fix — specialize sweep 2 for unconstrained path:** When no cluster size constraints are active (the common case in correlation clustering), sweep 2 only needs block IDs and accumulated weights — it doesn't need edge weights or node IDs. The specialized path iterates the `blk_cache` array in a tight loop with no edge-array access at all, cutting random memory reads in half.

**Fix — cache partition IDs in constrained path:** When constraints are active and the graph is already partitioned, sweep 2 must also check `getPartitionIndex()` for each neighbor. Caching these in a `PartitionID part_cache[32]` alongside `blk_cache` avoids a second round of random lookups into the partition array.
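A sketch of the three sweeps with the degree ≤ 32 cache, for the unconstrained path only; the CSR-style graph layout and the function signature are illustrative assumptions:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using NodeID = uint32_t;
using PartitionID = uint32_t;
using EdgeWeight = int64_t;

// Pick the best block for node v. `acc` is the dense per-block accumulator
// (size >= number of blocks, all zeros on entry and on exit).
PartitionID best_block(NodeID v,
                       const std::vector<uint32_t>& first_edge,   // CSR offsets
                       const std::vector<NodeID>& targets,        // edge targets
                       const std::vector<EdgeWeight>& weights,    // edge weights
                       const std::vector<PartitionID>& cluster_id,
                       std::vector<EdgeWeight>& acc) {
    PartitionID blk_cache[32];                  // fits in 1-2 L1 cache lines
    uint32_t begin = first_edge[v], end = first_edge[v + 1];
    uint32_t deg = end - begin;
    bool cached = deg <= 32;

    // Sweep 1: accumulate edge weight per neighboring block, caching each
    // neighbor's block ID (the random lookup) for reuse in sweeps 2-3.
    for (uint32_t e = begin; e < end; ++e) {
        PartitionID b = cluster_id[targets[e]];
        if (cached) blk_cache[e - begin] = b;
        acc[b] += weights[e];
    }

    // Sweep 2 (unconstrained): find the heaviest block. For low-degree nodes
    // this touches only blk_cache + acc, no edge-array access at all.
    PartitionID best = cluster_id[v];
    EdgeWeight best_w = acc[best];
    for (uint32_t i = 0; i < deg; ++i) {
        PartitionID b = cached ? blk_cache[i] : cluster_id[targets[begin + i]];
        if (acc[b] > best_w) { best_w = acc[b]; best = b; }
    }

    // Sweep 3: reset only the touched accumulator entries.
    for (uint32_t i = 0; i < deg; ++i) {
        PartitionID b = cached ? blk_cache[i] : cluster_id[targets[begin + i]];
        acc[b] = 0;
    }
    acc[cluster_id[v]] = 0;  // v's own block may not appear among neighbors
    return best;
}
```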

### 4. Pointer hoisting with `__restrict__` (~1.5%)

The LP inner loop accesses edges via `G.getEdgeTarget(e)` which compiles to `graphref->m_edges[e].target` — a pointer-to-pointer indirection on every edge. With millions of edges per LP iteration, this adds up.

**Fix:** Add `edge_array()` / `node_array()` accessors to `graph_access` that return raw pointers, and hoist them into local variables before the loop. The `__restrict__` qualifier tells the compiler these pointers don't alias, enabling auto-vectorization and instruction reordering that wasn't possible through the accessor indirection.
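A minimal sketch of the hoisted-pointer loop; only the accessor names come from the text, while the stand-in `graph_access` layout and the `accumulate` helper are assumptions:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using NodeID = uint32_t;
using EdgeWeight = int64_t;

struct Edge { NodeID target; EdgeWeight weight; };

// Stand-in for graph_access with the raw-pointer accessors the text describes.
struct graph_access {
    std::vector<uint32_t> m_nodes;  // CSR offsets, size = #nodes + 1
    std::vector<Edge> m_edges;
    const uint32_t* node_array() const { return m_nodes.data(); }
    const Edge* edge_array() const { return m_edges.data(); }
};

// Accumulate per-block weights for node v with pointers hoisted out of the loop.
void accumulate(const graph_access& G, NodeID v,
                const std::vector<uint32_t>& cluster_id,
                std::vector<EdgeWeight>& acc) {
    // Hoisted once: no graphref->m_edges indirection per edge, and
    // __restrict__ promises no aliasing, so the compiler can vectorize
    // and reorder across iterations.
    const Edge* __restrict__ edges = G.edge_array();
    const uint32_t* __restrict__ nodes = G.node_array();
    EdgeWeight* __restrict__ a = acc.data();
    for (uint32_t e = nodes[v]; e < nodes[v + 1]; ++e)
        a[cluster_id[edges[e].target]] += edges[e].weight;
}
```

`__restrict__` is the GCC/Clang spelling; it is a promise from the programmer, so it is only safe here because `acc` never overlaps the graph arrays.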

### 5. Persistent buffers as class members (~2%)

LP coarsening and LP refinement each use several large buffers: the hash map vector, a permutation array, and two queue-membership vectors (`vector<char>`). Originally these were local variables, allocated and freed on every call — once per coarsening level (typically 10-15 levels).

**Fix:** Move them to class member variables (`m_hash_map`, `m_permutation`, `m_qc_a`, `m_qc_b`). On each call, resize if needed (capacity grows monotonically during coarsening since graphs shrink), then `assign()` to reset values. This converts O(N) allocations to O(N) memsets, which are much cheaper — memset is a single cache-line-streaming operation vs malloc's free-list search, mmap, and page-fault overhead.

### 6. Stack allocation of framework objects (~1.5%)

The multilevel loop allocates LP, contraction, and stop-rule objects at each level. Originally these were heap-allocated (`new`/`delete`), producing malloc pressure and heap fragmentation over 10+ levels.

**Fix:** Stack-allocate them as local variables in the coarsening loop. Constructor/destructor run at scope entry/exit with zero allocator overhead. For refinement, the LP and k-way refinement objects are created once and reused across all uncoarsening levels via persistent smart pointers.
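A minimal sketch of the stack-allocated per-level loop; the class names here are illustrative stand-ins, not the solver's types:

```cpp
#include <cassert>

struct label_propagation { /* persistent buffers as members */ };
struct contraction {};
struct stop_rule { bool done(int level) const { return level >= 3; } };

int coarsen() {
    int levels = 0;
    for (int level = 0; ; ++level) {
        // Stack allocation: constructed/destroyed at scope entry/exit,
        // no new/delete and no heap fragmentation across 10+ levels.
        label_propagation lp;
        contraction contract;
        stop_rule rule;
        ++levels;
        // ... contract one level with lp/contract ...
        if (rule.done(level)) break;
    }
    return levels;
}
```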

### 7. tcmalloc_minimal (~4%)

After eliminating the biggest allocation hotspots, the remaining malloc/free calls (from graph construction, edge arrays, STL containers) still added up. Linking Google's `tcmalloc_minimal` replaced glibc's allocator with one that uses per-thread free-list caches, avoiding lock contention and reducing fragmentation.

**Integration:** Auto-detected via CMake `find_library(TCMALLOC_LIB tcmalloc_minimal)`, linked only on Linux. Falls back to the default allocator if not found.

### 8. Smaller wins (~1.5% combined)

- **`vector<char>` over `vector<bool>`**: The queue-membership flags were `vector<bool>`, which uses bit-packing. Each access requires shift+mask operations. Switching to `vector<char>` (one byte per entry) trades 8x memory for direct byte access — worthwhile because these vectors are small relative to the graph and accessed in the hot loop.
- **`MADV_HUGEPAGE` for LP arrays**: The hash map vector is randomly accessed by block ID. On graphs with 2M+ nodes, this causes TLB thrashing with 4KB pages. `madvise(MADV_HUGEPAGE)` hints the kernel to back it with 2MB pages, reducing TLB entries needed by 512x. Only applied to LP-local arrays — applying it to the main graph arrays caused THP overhead that was worse than the TLB savings.
- **Compiler flags**: `-fprefetch-loop-arrays` lets GCC insert prefetch instructions for streaming edge-array iteration. `-fno-plt` eliminates PLT indirection on shared library calls (minor, but free).