| 1 | +# MCAAT: Cycle Finder — Algorithmic & Optimization Report ✅ |
| 2 | + |
| 3 | +**Scope:** This document describes *only* the algorithmic changes and optimizations introduced in the `optimizations` branch for the cycle finder logic. It is organized so you can read step-by-step what changed, why it was done, and the expected impact. |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## Summary |
| 8 | + |
| 9 | +1. Replaced global critical sections and shared writes with *per-thread buffers* and a single serial merge step to remove contention. |
| 10 | +2. Replaced lock-based or synchronized "visited" bookkeeping with a *lock-free atomic bitset* (1 bit per node). |
| 11 | +3. Reduced allocations and allocator contention by *reusing per-thread pools* (megahit-style) and preallocating where useful. |
| 12 | +4. Applied traversal micro-optimizations (fixed-size arrays, prefetch, branch hints) to reduce per-edge overhead. |
| 13 | + |
| 14 | +--- |
| 15 | + |
| 16 | +## Step-by-step changes (algorithmic & optimization focus) |
| 17 | + |
| 18 | +1) Remove global critical sections → Per-thread buffers + serial merge 🔧 |
| 19 | + - What changed: |
| 20 | +     - Replaced OpenMP `#pragma omp critical`-style updates to shared containers with a `vector<...>` of per-thread collectors (e.g., `local_chunks` and `local_results`). |
| 21 | + - After the parallel loop completes, a single-threaded loop merges per-thread buffers into the shared map or results container. |
| 22 | + - Files/locations: |
| 23 | + - `CycleFinder::ChunkStartNodes` (collect start nodes into `local_chunks[tid]` then merge). |
| 24 | + - `CycleFinder::FindApproximateCRISPRArrays` (collect per-thread `local_results`, then merge into `this->results`). |
| 25 | + - Why / Benefit: |
| 26 | + - Eliminates high-contention points on hot shared data structures, enabling scaling to higher core counts. |
| 27 | +     - The serial merge runs once after the parallel loop, so its cost is amortized and the hot loops need no locking. |
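The pattern in step 1 can be sketched as follows. This is a minimal illustration, not the branch's actual code: `CollectStartNodes` and the `keep` predicate are hypothetical stand-ins for the real selection logic, and only the `local_chunks` name comes from the description above. The `_OPENMP` guards let the sketch also build (serially) when OpenMP is not enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for the real selection logic: returns every node id
// in [0, num_nodes) for which `keep` returns true.
std::vector<uint64_t> CollectStartNodes(uint64_t num_nodes, bool (*keep)(uint64_t)) {
#ifdef _OPENMP
    const int max_threads = omp_get_max_threads();
#else
    const int max_threads = 1;  // serial fallback when OpenMP is disabled
#endif
    // One private buffer per thread, so the hot loop needs no lock.
    std::vector<std::vector<uint64_t>> local_chunks(static_cast<size_t>(max_threads));

    #pragma omp parallel
    {
#ifdef _OPENMP
        const size_t tid = static_cast<size_t>(omp_get_thread_num());
#else
        const size_t tid = 0;
#endif
        assert(tid < local_chunks.size());
        std::vector<uint64_t>& mine = local_chunks[tid];

        #pragma omp for schedule(static)
        for (int64_t i = 0; i < static_cast<int64_t>(num_nodes); ++i) {
            const uint64_t node = static_cast<uint64_t>(i);
            if (keep(node)) mine.push_back(node);  // thread-private write
        }
    }

    // Serial merge: runs once, after the parallel region has finished.
    size_t total = 0;
    for (const auto& chunk : local_chunks) total += chunk.size();
    std::vector<uint64_t> merged;
    merged.reserve(total);
    for (const auto& chunk : local_chunks) {
        merged.insert(merged.end(), chunk.begin(), chunk.end());
    }
    return merged;
}
```

Adjacent per-thread vector headers can still share a cache line; if profiling shows that matters, padding each collector to a cache line is one option.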
| 28 | + |
| 29 | +2) Lock-free visited bitmap (1 bit per node) 🔒→⚡ |
| 30 | + - What changed: |
| 31 | + - Introduced a global `std::vector<uint64_t> s_visited_words` as a bitset (one bit per node). |
| 32 | + - Provided helper inline functions: `InitializeVisitedGlobal(n)`, `IsVisitedGlobal(node)`, and `MarkVisitedGlobal(node)` implemented using GCC/Clang atomic builtins (`__atomic_load_n`, `__atomic_fetch_or`) with `__ATOMIC_RELAXED` ordering. |
| 33 | + - Files/locations: |
| 34 | + - `src/cycle_finder.cpp` (static `s_visited_words` and helpers) and uses in `FindCycle`, `FindCycleUtil`, and background checks. |
| 35 | + - Why / Benefit: |
| 36 | +     - Avoids `std::vector<std::atomic<T>>` pitfalls (`std::atomic` is neither copyable nor movable, so such a vector cannot be copied or resized) and the overhead of locks around visited updates. |
| 37 | +     - Marking a node costs one atomic fetch-or on a 64-bit word, which is cheap and scales well. |
| 38 | + - Memory is compact (1 bit per node) and predictable for large graphs. |
| 39 | + - Correctness note: |
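| 40 | +     - Relaxed atomics are acceptable because bits only transition from 0→1 (monotonic): racing writers to the same word cannot lose each other's bits because `__atomic_fetch_or` is a single read-modify-write, and a reader that sees a stale 0 only observes a state that was valid a moment earlier. Relaxed ordering gives no ordering guarantee for other memory, so the bitmap should not be used to publish other data. |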
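A minimal sketch of such a bitmap, using the helper names listed in step 2. The bodies are an assumption about how the helpers could look, not a copy of `src/cycle_finder.cpp`. In particular, having `MarkVisitedGlobal` return whether this call set the bit first is an illustrative choice. The code relies on the GCC/Clang `__atomic` builtins and will not compile with MSVC.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One bit per node, packed into 64-bit words.
static std::vector<uint64_t> s_visited_words;

// Must be called before any thread starts marking: resizing is not thread-safe.
inline void InitializeVisitedGlobal(uint64_t n) {
    s_visited_words.assign(static_cast<size_t>((n + 63) / 64), 0);
}

inline bool IsVisitedGlobal(uint64_t node) {
    const size_t word_index = static_cast<size_t>(node >> 6);
    assert(word_index < s_visited_words.size());
    const uint64_t word = __atomic_load_n(&s_visited_words[word_index], __ATOMIC_RELAXED);
    return ((word >> (node & 63)) & 1ULL) != 0;
}

// Assumption for illustration: returns true only for the call that flipped
// the bit from 0 to 1, so exactly one thread "wins" each node.
inline bool MarkVisitedGlobal(uint64_t node) {
    const size_t word_index = static_cast<size_t>(node >> 6);
    assert(word_index < s_visited_words.size());
    const uint64_t mask = 1ULL << (node & 63);
    const uint64_t previous =
        __atomic_fetch_or(&s_visited_words[word_index], mask, __ATOMIC_RELAXED);
    return (previous & mask) == 0;
}
```

Checking the value returned by the fetch-or, rather than calling `IsVisitedGlobal` and then marking, avoids a check-then-act race between two threads that reach the same node at once.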
| 41 | + |
| 42 | +3) Reduce allocations and reuse per-thread pools (megahit-style) ♻️ |
| 43 | + - What changed: |
| 44 | + - Introduced `static thread_local` pools for DLS (`dls_stack_pool` and `dls_visited_pool`) used by `DepthLevelSearch`. |
| 45 | +     - Pools are `clear()`ed between uses but retain capacity; a small initial reserve is set to avoid repeated small allocations. |
| 46 | + - Files/locations: |
| 47 | + - `CycleFinder::DepthLevelSearch`. |
| 48 | + - Why / Benefit: |
| 49 | + - Avoids heavy allocator contention when many threads create/destroy temporaries frequently. |
| 50 | +     - Reduces per-edge latency and improves throughput during parallel graph traversal. |
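The pool reuse in step 3 might look roughly like the sketch below. Only the pool names `dls_stack_pool` and `dls_visited_pool` come from the description; the `Graph` alias, the `Seen` record, and the `CountReachableWithinDepth` function are hypothetical and stand in for the real `DepthLevelSearch`. The linear scan over the visited pool assumes each depth-limited search touches only a few nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical adjacency-list graph used only for this sketch.
using Graph = std::vector<std::vector<uint32_t>>;

struct Seen {
    uint32_t node;
    uint32_t depth;  // smallest depth at which `node` has been reached so far
};

// Counts the nodes reachable from `start` using at most `max_depth` edges.
size_t CountReachableWithinDepth(const Graph& g, uint32_t start, uint32_t max_depth) {
    // Per-thread pools: allocated once per thread, reused by every call.
    static thread_local std::vector<std::pair<uint32_t, uint32_t>> dls_stack_pool;
    static thread_local std::vector<Seen> dls_visited_pool;
    if (dls_stack_pool.capacity() < 64) dls_stack_pool.reserve(64);
    if (dls_visited_pool.capacity() < 64) dls_visited_pool.reserve(64);
    dls_stack_pool.clear();    // drops the elements, keeps the capacity
    dls_visited_pool.clear();

    assert(start < g.size());
    dls_stack_pool.emplace_back(start, 0u);
    while (!dls_stack_pool.empty()) {
        const std::pair<uint32_t, uint32_t> top = dls_stack_pool.back();
        dls_stack_pool.pop_back();
        const uint32_t node = top.first;
        const uint32_t depth = top.second;

        Seen* record = nullptr;
        for (Seen& s : dls_visited_pool) {
            if (s.node == node) { record = &s; break; }
        }
        if (record != nullptr) {
            if (record->depth <= depth) continue;  // already expanded at least as shallow
            record->depth = depth;                 // found a shorter route: expand again
        } else {
            dls_visited_pool.push_back({node, depth});
        }
        if (depth == max_depth) continue;          // depth limit reached
        for (uint32_t next : g[node]) dls_stack_pool.emplace_back(next, depth + 1);
    }
    return dls_visited_pool.size();
}
```

Because the pools are `thread_local`, no synchronization is needed, and after the first few calls on a thread the search typically performs no heap allocation at all.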
| 51 | + |
| 52 | +4) Traversal micro-optimizations (branch hints, fixed arrays, prefetch) 🧠 |
| 53 | + - What changed: |
| 54 | + - Use of fixed-size neighbor arrays (`uint64_t neighbors[MAX_EDGE_COUNT]`) rather than heap allocations per node. |
| 55 | + - Prefetching neighbor buffers and using `__builtin_expect` branch hints to optimize hot paths. |
| 56 | + - Small loop unrolling where out-degree is small (de Bruijn graph pattern) to reduce loop overhead. |
| 57 | + - Files/locations: |
| 58 | + - `DepthLevelSearch`, `_GetOutgoings`, and `_GetIncomings` helpers. |
| 59 | + - Why / Benefit: |
| 60 | + - Better cache locality and fewer branch mispredictions; straightforward per-edge speedups with little code complexity. |
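The micro-optimizations in step 4 can be illustrated with the sketch below. The `CsrGraph` layout, the value `MAX_EDGE_COUNT = 4`, and both function names are assumptions made for the example; the real `_GetOutgoings` works on the project's own graph representation. `__builtin_expect` and `__builtin_prefetch` are GCC/Clang builtins.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed bound for illustration: a DNA de Bruijn node has at most 4 successors.
constexpr int MAX_EDGE_COUNT = 4;

// Hypothetical CSR-style graph: the edges of node v are
// targets[offsets[v]] .. targets[offsets[v + 1] - 1].
struct CsrGraph {
    const uint64_t* offsets;
    const uint64_t* targets;
    uint64_t num_nodes;
};

// Copies the successors of `node` into a caller-provided fixed-size array
// (typically on the stack) and returns how many were written.
inline int GetOutgoings(const CsrGraph& g, uint64_t node,
                        uint64_t (&neighbors)[MAX_EDGE_COUNT]) {
    assert(node < g.num_nodes);
    int count = 0;
    for (uint64_t e = g.offsets[node];
         e < g.offsets[node + 1] && count < MAX_EDGE_COUNT; ++e) {
        neighbors[count++] = g.targets[e];
    }
    return count;
}

// Example hot-path use: sums the out-degrees of all successors of `node`.
uint64_t SumSuccessorOutDegrees(const CsrGraph& g, uint64_t node) {
    uint64_t neighbors[MAX_EDGE_COUNT];  // no heap allocation per node
    const int count = GetOutgoings(g, node, neighbors);

    // Branch hint: dead ends are assumed to be rare on the hot path.
    if (__builtin_expect(count == 0, 0)) return 0;

    // Request the cache lines we are about to read before using them.
    for (int i = 0; i < count; ++i) {
        __builtin_prefetch(&g.offsets[neighbors[i]]);
    }

    uint64_t total = 0;
    for (int i = 0; i < count; ++i) {
        total += g.offsets[neighbors[i] + 1] - g.offsets[neighbors[i]];
    }
    return total;
}
```

Whether the prefetch and the branch hint help depends on the graph layout and the hardware, so each is worth measuring on its own before keeping it.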
| 61 | + |
| 62 | +5) Results merging and memory hygiene 🧽 |
| 63 | + - What changed: |
| 64 | +     - Per-thread `local_results` (maps) are merged serially into `this->results` after each bucket is processed. |
| 65 | +     - Calls `malloc_trim(0)` (glibc-specific) occasionally after buckets to release free heap memory back to the OS (for long runs with variable memory usage). |
| 66 | + - Files/locations: |
| 67 | + - `FindApproximateCRISPRArrays`. |
| 68 | + - Why / Benefit: |
| 69 | +     - Removes the need for synchronized concurrent `unordered_map` updates (expensive under contention) and helps long-running jobs avoid unnecessary memory growth. |
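A rough sketch of the per-bucket merge in step 5. The `ResultMap` key and value types and the `MergeLocalResults` name are assumptions; the real `this->results` may store something different. `malloc_trim` exists only in glibc, so the call is guarded, and this sketch trims after every merge for simplicity rather than occasionally.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <unordered_map>
#include <vector>
#if defined(__GLIBC__)
#include <malloc.h>  // malloc_trim (glibc extension)
#endif

// Assumed result shape for illustration: key -> list of node ids.
using ResultMap = std::unordered_map<uint64_t, std::vector<uint64_t>>;

// Called by a single thread after the parallel work on one bucket is done.
void MergeLocalResults(std::vector<ResultMap>& local_results, ResultMap& results) {
    for (ResultMap& local : local_results) {
        for (auto& entry : local) {
            std::vector<uint64_t>& dst = results[entry.first];
            dst.insert(dst.end(), entry.second.begin(), entry.second.end());
        }
        local.clear();           // ready for the next bucket
        assert(local.empty());
    }
#if defined(__GLIBC__)
    malloc_trim(0);  // ask glibc to return free heap memory to the OS
#endif
}
```

Since only one thread touches `results`, the map needs no lock at all.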
| 70 | + |
| 71 | +--- |
| 72 | + |
| 73 | +## Expected performance and behavior improvements |
| 74 | + |
| 75 | +- Improved scalability with thread counts beyond the earlier observed plateau (~24 cores) because: |
| 76 | + - Contention points are removed or drastically reduced. |
| 77 | + - Allocator pressure is lowered by reusing containers. |
| 78 | + - Atomic operations on compact bitmaps replace heavier locks. |
| 79 | +- Memory cost: the visited bitset adds about 1 bit per node. Per-thread buffers raise transient memory usage in proportion to the thread count, but they hold only the selected nodes. |
| 80 | + |
| 81 | +--- |
| 82 | + |
| 83 | +## Limitations & future work |
| 84 | + |
| 85 | +- NUMA-aware allocation and memory binding are not yet implemented; this is the natural next step for large multi-socket machines where memory bandwidth dominates. |
| 86 | +- Further profiling (perf/VTune) is needed to quantify the exact causes of any remaining scalability bottlenecks (cache-line bouncing, allocator hotspots, or residual serial sections). |
| 87 | + |
| 88 | +--- |
| 89 | + |
| 90 | +## How to validate quickly (recommended) |
| 91 | + |
| 92 | +1. Check out the `optimizations` branch. |
| 93 | +2. Build (`cmake .. && make -j`) and run the same workload used before. |
| 94 | +3. Compare (a) execution time vs thread count (1, 8, 24, 48, 128), (b) throughput (nodes/sec), and (c) cycles found to ensure no correctness regression. |
| 95 | +4. Use `perf top` / `perf record` to verify reduced lock/atomic time and identify remaining hotspots; use `numastat` to check per-NUMA-node memory placement. |
| 96 | + |
| 97 | +--- |
| 98 | + |
| 99 | +## Files touched (algorithmic/optimization only) |
| 100 | + |
| 101 | +- `src/cycle_finder.cpp` — main implementation of lock-free visited bitmap, per-thread collectors, DLS pools, traversal micro-optimizations, merging logic. |
| 102 | +- `include/cycle_finder.h` — updated helpers and declarations related to visited bookkeeping (if applicable). |
| 103 | + |
| 104 | +--- |
| 105 | + |
| 106 | +## TL;DR |
| 107 | + |
| 108 | +- Replaced shared locks with per-thread buffers + serial merges, added a compact lock-free visited bitmap, and reduced allocation churn via per-thread pools. These changes reduce contention and allocator pressure and improve multithreaded scaling while keeping memory usage reasonable for very large graphs. |
| 109 | + |