Commit c52e561: Merge pull request #1 from CHSZLab/kb/034-graph-partitioning-lp-optimization ("Add KB entry: optimizing LP in graph clustering (1.23x)"; parents e000fe4 + 435f457). 2 files changed: +147 −0.
# Optimizing Label Propagation in Graph Clustering

## Problem

Optimizing the runtime of a signed graph correlation clustering solver ([ScalableCorrelationClustering](https://github.com/KaHIP/ScalableCorrelationClustering)) built on the KaHIP multilevel framework. The solver uses label propagation (LP) for both coarsening and refinement, plus FM-based refinement on coarse levels. The metric was the geometric mean of execution times across multiple real-world graph instances, with a hard constraint that solution quality must stay within 0.001% of the baseline. The baseline geometric mean was 1.528s.
## What Worked

The combined effect was a **1.23x speedup** (an 18.7% reduction in geometric-mean runtime) over 30 experiments. Each technique is described below with enough detail to reproduce.
### 1. Dense vectors replacing hash maps in LP inner loops (~7%)

The LP inner loop accumulates edge weights per neighboring block to decide which block a node should move to. The original code used `std::unordered_map<PartitionID, EdgeWeight>` — every edge traversal hashed the target block ID, probed the hash table, and potentially allocated a new bucket. Since this runs for every node on every LP sweep on every coarsening/refinement level, it dominated the profile.

**Fix:** Replace it with a dense `std::vector<EdgeWeight>` of size `max_blocks`, indexed directly by block ID. Track which entries were touched in a small side vector, and reset only those entries after processing each node. This turns O(1)-amortized hash lookups into O(1)-worst-case array indexing and eliminates all hashing, bucket allocation, and cache-hostile pointer chasing.

The same pattern applied to `maxNodeHeap`, which backed its key lookups with a hash map. Replacing it with a three-vector architecture (`m_elements`, `m_element_index[node] → position`, `m_heap[position] → key`) gives O(1) direct-indexed lookup instead of hash probing.
### 2. Counting-sort contraction replacing boundary objects (~3%)

Each coarsening level contracts the graph: fine nodes are merged into coarse super-nodes. The original code built a `complete_boundary` object (~16MB on large graphs), saved and restored the full partition map (~8MB), and used `vector<vector<NodeID>>` to group nodes per block — all to support a generic contraction interface.

**Fix:** A single counting-sort pass groups fine nodes by their coarse mapping in O(N) time:

1. Histogram: count how many fine nodes map to each coarse node.
2. Prefix sum: convert counts to start offsets.
3. Scatter: place each fine node at its offset position.

Then iterate coarse nodes in order, processing contiguous runs of fine nodes. This replaces ~24MB of intermediate structures with three flat arrays totaling O(N) and eliminates the partition save/restore entirely. Memory access is sequential during the scatter and iteration phases, which is cache-friendly.
### 3. LP sweep specialization and block-ID caching (~3%)

LP processes each node in three sweeps: (1) accumulate edge weights per block, (2) find the best block, (3) reset the accumulator. In the original code, all three sweeps read the edge array independently, each time dereferencing `cluster_id[edges[e].target]` to look up the target's block.

**Fix — cache block IDs for low-degree nodes:** For nodes with degree ≤ 32 (covering ~95% of nodes in real-world graphs), sweep 1 writes the block IDs into a stack-allocated `PartitionID blk_cache[32]`. Sweeps 2 and 3 iterate `blk_cache` instead of re-reading the edge array and re-dereferencing `cluster_id[]`. The 32-element cache fits in one or two L1 cache lines.

**Fix — specialize sweep 2 for unconstrained path:** When no cluster size constraints are active (the common case in correlation clustering), sweep 2 only needs block IDs and accumulated weights — it doesn't need edge weights or node IDs. The specialized path iterates the `blk_cache` array in a tight loop with no edge-array access at all, cutting random memory reads in half.

**Fix — cache partition IDs in constrained path:** When constraints are active and the graph is already partitioned, sweep 2 must also check `getPartitionIndex()` for each neighbor. Caching these in a `PartitionID part_cache[32]` alongside `blk_cache` avoids a second round of random lookups into the partition array.
### 4. Pointer hoisting with `__restrict__` (~1.5%)

The LP inner loop accesses edges via `G.getEdgeTarget(e)`, which compiles to `graphref->m_edges[e].target` — a pointer-to-pointer indirection on every edge. With millions of edges per LP iteration, this adds up.

**Fix:** Add `edge_array()` / `node_array()` accessors to `graph_access` that return raw pointers, and hoist them before the loop:

```cpp
const Edge* __restrict__ edges = G.edge_array();
EdgeWeight* __restrict__ hmap = m_hash_map.data();
```

The `__restrict__` qualifier tells the compiler these pointers don't alias, enabling auto-vectorization and instruction reordering that wasn't possible through the accessor indirection.
### 5. Persistent buffers as class members (~2%)

LP coarsening and LP refinement each use several large buffers: the hash map vector, a permutation array, and two queue-membership vectors (`vector<char>`). Originally these were local variables, allocated and freed on every call — once per coarsening level (typically 10-15 levels).

**Fix:** Move them to class member variables (`m_hash_map`, `m_permutation`, `m_qc_a`, `m_qc_b`). On each call, resize if needed, then `assign()` to reset values; since graphs shrink during coarsening, the first (finest) level's allocation serves all later levels and capacity never needs to grow again. This converts O(N) allocations into O(N) memsets, which are much cheaper — a memset streams cache lines, while malloc incurs free-list search, mmap, and page-fault overhead.
### 6. Stack allocation of framework objects (~1.5%)

The multilevel loop allocates LP, contraction, and stop-rule objects at each level. Originally these were heap-allocated (`new`/`delete`), producing malloc pressure and heap fragmentation over 10+ levels.

**Fix:** Stack-allocate them as local variables in the coarsening loop. Constructor/destructor run at scope entry/exit with zero allocator overhead. For refinement, the LP and k-way refinement objects are created once and reused across all uncoarsening levels via persistent smart pointers.
### 7. tcmalloc_minimal (~4%)

After eliminating the biggest allocation hotspots, the remaining malloc/free calls (from graph construction, edge arrays, STL containers) still added up. Linking Google's `tcmalloc_minimal` replaced glibc's allocator with one that uses per-thread free-list caches, avoiding lock contention and reducing fragmentation.

**Integration:** Auto-detected via CMake `find_library(TCMALLOC_LIB tcmalloc_minimal)`, linked only on Linux. Falls back to the default allocator if not found.
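The described integration might look like this in CMake (the target name `solver` is a placeholder, not the project's actual target):

```cmake
# Link tcmalloc_minimal when available; Linux only.
if(CMAKE_SYSTEM_NAME STREQUAL "Linux")
  find_library(TCMALLOC_LIB tcmalloc_minimal)
  if(TCMALLOC_LIB)
    message(STATUS "Linking tcmalloc_minimal: ${TCMALLOC_LIB}")
    target_link_libraries(solver PRIVATE ${TCMALLOC_LIB})
  else()
    message(STATUS "tcmalloc_minimal not found; using default allocator")
  endif()
endif()
```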
### 8. Smaller wins (~1.5% combined)

- **`vector<char>` over `vector<bool>`**: The queue-membership flags were `vector<bool>`, which uses bit-packing. Each access requires shift+mask operations. Switching to `vector<char>` (one byte per entry) trades 8x memory for direct byte access — worthwhile because these vectors are small relative to the graph and accessed in the hot loop.
- **`MADV_HUGEPAGE` for LP arrays**: The hash map vector is randomly accessed by block ID. On graphs with 2M+ nodes, this causes TLB thrashing with 4KB pages. `madvise(MADV_HUGEPAGE)` hints the kernel to back it with 2MB pages, reducing TLB entries needed by 512x. Only applied to LP-local arrays — applying it to the main graph arrays caused THP overhead that was worse than the TLB savings.
- **Compiler flags**: `-fprefetch-loop-arrays` lets GCC insert prefetch instructions for streaming edge-array iteration. `-fno-plt` eliminates PLT indirection on shared library calls (minor, but free).
## Experiment Data

| # | Commit | Geo-mean (s) | Status | Hypothesis |
|---|--------|-------------|--------|------------|
| 0 | f683dcc | 1.528 | baseline | |
| 1 | 79c85ce | 1.426 | keep | Dense vector replaces unordered_map |
| 2 | e35d815 | 1.427 | keep | Lazy reset for vertex_moved_hashtable |
| 3 | cf06cce | 1.390 | keep | Direct contraction via counting sort |
| 4 | 21c8fc2 | 1.362 | keep | Hoist max_blocks vector outside loop |
| 5 | f72cc86 | 1.379 | keep | Stack-allocate LP queues |
| 6 | cc13299 | 1.366 | keep | Stack-allocate coarsening objects |
| 7 | c373c2e | 1.352 | keep | Stack-allocate refinement objects |
| 8 | 9decfee | 1.340 | keep | vector\<char\> over vector\<bool\> |
| 9 | b19650d | 1.350 | keep | Reserve FM vector capacity |
| 10 | 6a846db | 1.377 | discard | Pre-size maxNodeHeap — wasted time |
| 11 | 41e405d | 1.380 | discard | Software prefetching — overhead > benefit |
| 12 | 5a1af07 | 1.368 | discard | Reuse contraction buffers — no gain |
| 13 | aab4041 | 1.382 | discard | vector\<char\> in contraction — no gain |
| 14 | 9c7ea1a | 1.370 | discard | Devirtualize FM queue — slight regression |
| 15 | d166e01 | 1.356 | discard | Eliminate PartitionConfig copy — noise |
| 16 | b02db57 | 1.369 | discard | Simplify relabeling loop — marginal regression |
| 17 | 8f1838e | 1.344 | keep | MADV_HUGEPAGE for LP arrays |
| 18 | 502ff05 | 1.395 | discard | Skip m_local_degrees init — regression |
| 19 | ceb9beb | 1.352 | discard | Cache block IDs (full) — overhead from max-degree scan |
| 20 | 0089f5b | 1.349 | keep | Cache block IDs for degree<=32 |
| 21 | 859d930 | 1.330 | keep | Specialize LP sweep 2 for unconstrained path |
| 22 | 0dcf027 | 1.316 | discard | blk_cache in FM — overhead on coarse levels |
| 23 | 97189b0 | 1.307 | keep | Cache partition IDs in constrained path |
| 24 | ec86900 | 1.258 | keep | tcmalloc_minimal via LD_PRELOAD |
| 25 | 7b5981a | 1.267 | keep | tcmalloc_minimal linked via CMake |
| 26 | 11135ac | 1.268 | discard | PGO — hurts non-profiled instances |
| 27 | b73363b | 1.289 | discard | MADV_HUGEPAGE on graph arrays — THP overhead |
| 28 | 92a38c7 | 1.244 | keep | Hoist edge/hash_map pointers |
| 29 | 8034afa | 1.242 | keep | Persistent LP refinement buffers |

17 kept, 12 discarded (59% success rate). Final speedup: **1.23x**.
## What Didn't Work

- **Profile-guided optimization (PGO)**: Improved profiled instances but degraded others, hurting the geometric mean across the full benchmark suite.
- **Software prefetching (`__builtin_prefetch`)**: Manual prefetch of edge arrays added more overhead than it saved. The hardware prefetcher was already doing a good job on the sequential access patterns.
- **MADV_HUGEPAGE on graph arrays**: While it helped for LP-local arrays, applying it to the main graph adjacency arrays caused THP (Transparent Huge Pages) overhead that exceeded the TLB-miss savings.
- **Devirtualizing FM queue**: Replacing virtual dispatch with templates caused a slight regression, likely due to increased code size reducing instruction cache efficiency.
- **Reusing contraction buffers across levels**: The buffers change size each level, so reuse didn't save meaningful allocation work.
## Code Example

Core change: replacing the hash map with a dense vector in the LP inner loop.

```cpp
// Before: hash map lookups in hot loop
std::unordered_map<PartitionID, EdgeWeight> block_weights;
forall_out_edges(G, e, node) {
    PartitionID target_block = G.getPartitionIndex(G.getEdgeTarget(e));
    block_weights[target_block] += G.getEdgeWeight(e);
} endfor

// After: dense vector indexed by block ID (cleared via touched-list)
std::vector<EdgeWeight> block_weights(max_blocks, 0); // persistent, reused
std::vector<PartitionID> touched;                     // persistent, reused
forall_out_edges(G, e, node) {
    PartitionID target_block = G.getPartitionIndex(G.getEdgeTarget(e));
    if (block_weights[target_block] == 0) touched.push_back(target_block);
    block_weights[target_block] += G.getEdgeWeight(e);
} endfor
// ... use block_weights, then reset only touched entries
for (auto b : touched) block_weights[b] = 0;
touched.clear(); // ready for the next node
```
## Environment

C++17, GCC 11.4 with `-O3 -march=native`, Linux (Ubuntu 22.04), Intel Xeon. Graphs range from thousands to millions of nodes. Multilevel framework with label propagation coarsening/refinement and FM-based k-way refinement.

knowledge-base/INDEX.md (1 addition):

| 034 | Optimizing label propagation in graph clustering | Multilevel graph clustering, LP refinement | Dense vectors, counting-sort contraction, sweep specialization, allocation elimination | [034-graph-clustering-lp-refinement.md](034-graph-clustering-lp-refinement.md) |
