Commit ea2c65c

Merge pull request #5 from RNABioInfo/optimizations
integrated optimizations, rolled back spacer ordering
2 parents da317f5 + 57b32ba commit ea2c65c

13 files changed

Lines changed: 481 additions & 67 deletions

.github/workflows/msbuild.yml

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
1+
# This workflow uses actions that are not certified by GitHub.
2+
# They are provided by a third-party and are governed by
3+
# separate terms of service, privacy policy, and support
4+
# documentation.
5+
6+
name: MSBuild
7+
8+
on:
9+
push:
10+
branches: [ "master" ]
11+
pull_request:
12+
branches: [ "master" ]
13+
14+
env:
15+
# Path to the solution file relative to the root of the project.
16+
SOLUTION_FILE_PATH: .
17+
18+
# Configuration type to build.
19+
# You can convert this to a build matrix if you need coverage of multiple configuration types.
20+
# https://docs.github.com/actions/learn-github-actions/managing-complex-workflows#using-a-build-matrix
21+
BUILD_CONFIGURATION: Release
22+
23+
permissions:
24+
contents: read
25+
26+
jobs:
27+
build:
28+
runs-on: windows-latest
29+
30+
steps:
31+
- uses: actions/checkout@v4
32+
33+
- name: Add MSBuild to PATH
34+
uses: microsoft/setup-msbuild@v1.0.2
35+
36+
- name: Restore NuGet packages
37+
working-directory: ${{env.GITHUB_WORKSPACE}}
38+
run: nuget restore ${{env.SOLUTION_FILE_PATH}}
39+
40+
- name: Build
41+
working-directory: ${{env.GITHUB_WORKSPACE}}
42+
# Add additional options to the MSBuild command line here (like platform or verbosity level).
43+
# See https://docs.microsoft.com/visualstudio/msbuild/msbuild-command-line-reference
44+
run: msbuild /m /p:Configuration=${{env.BUILD_CONFIGURATION}} ${{env.SOLUTION_FILE_PATH}}

CHANGELOG.md

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
# Changelog
2+
3+
## [Unreleased] - 0.5.1
4+
- Fixed crash in spacer ordering when no reads are found (guard added).
5+
- Improved parallel scaling:
6+
- Replaced frequent `#pragma omp critical` usage with per-thread buffers and serial merges.
7+
- Introduced a lock-free visited bitmap (atomic 64-bit words) to remove synchronization hot-spots.
8+
- Reduced allocator contention and reused per-thread containers to lower memory churn under heavy parallelism.
9+
- Build fixes: added missing includes and small portability fixes so CMake build succeeds on target platforms.
10+
- Build: CMake configure and full build completed successfully (targets `mcaat` and `runTests` built).
11+
12+
13+
*Notes:* these changes focus on improving scalability and preventing serialization on large, memory-bound graphs.

bench/run_benchmarks.sh

Lines changed: 83 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,83 @@
1+
#!/usr/bin/env bash
2+
set -euo pipefail
3+
4+
# Simple benchmark runner for MCAAT cycle_finder
5+
# Produces CSV rows: threads,run,elapsed_s,max_rss_kb,cycles
6+
7+
usage() {
8+
cat <<EOF
9+
Usage: $0 --binary PATH --input PATH --threads LIST --runs N -o OUTCSV
10+
11+
Options:
12+
--binary PATH Path to cycle_finder binary
13+
--input PATH Input graph file
14+
--threads LIST Comma-separated thread counts (e.g. 1,4,8,24)
15+
--runs N Number of runs per thread count (default: 3)
16+
-o OUTCSV Output CSV file (overwritten)
17+
18+
Example:
19+
$0 --binary ./bin/cycle_finder --input graphs/huge_graph.bin --threads 1,4,8,24 --runs 3 -o bench/results.csv
20+
EOF
21+
exit 1
22+
}
23+
24+
BINARY="" INPUT="" THREADS="" RUNS=3 OUTCSV=""
25+
26+
while [[ $# -gt 0 ]]; do
27+
case "$1" in
28+
--binary) BINARY="$2"; shift 2;;
29+
--input) INPUT="$2"; shift 2;;
30+
--threads) THREADS="$2"; shift 2;;
31+
--runs) RUNS="$2"; shift 2;;
32+
-o) OUTCSV="$2"; shift 2;;
33+
-h|--help) usage;;
34+
*) echo "Unknown arg: $1"; usage;;
35+
esac
36+
done
37+
38+
if [[ -z "$BINARY" || -z "$INPUT" || -z "$THREADS" || -z "$OUTCSV" ]]; then
39+
usage
40+
fi
41+
42+
mkdir -p "$(dirname "$OUTCSV")"
43+
44+
echo "threads,run,elapsed_s,max_rss_kb,cycles" > "$OUTCSV"
45+
46+
IFS=',' read -r -a THREAD_ARR <<< "$THREADS"
47+
48+
for threads in "${THREAD_ARR[@]}"; do
49+
for ((run=1; run<=RUNS; run++)); do
50+
outtmp="/tmp/cycle_out.${threads}.${run}.json"
51+
timetmp="/tmp/cycle_time.${threads}.${run}.txt"
52+
53+
echo "Running threads=$threads run=$run"
54+
55+
# Use GNU time (/usr/bin/time -f) to capture wall-clock seconds (%e) and max RSS in KB (%M)
56+
/usr/bin/time -f "%e %M" -o "$timetmp" "$BINARY" --input "$INPUT" --threads "$threads" --out "$outtmp"
57+
58+
read -r elapsed rss_kb < "$timetmp"
59+
60+
# Try to extract cycles from output JSON if possible
61+
cycles=""
62+
if command -v jq >/dev/null 2>&1 && jq -e . "$outtmp" >/dev/null 2>&1; then
63+
# Common fields: either cycles array or cycle_count
64+
if jq -e '.cycles' "$outtmp" >/dev/null 2>&1; then
65+
cycles=$(jq '.cycles | length' "$outtmp")
66+
elif jq -e '.cycle_count' "$outtmp" >/dev/null 2>&1; then
67+
cycles=$(jq '.cycle_count' "$outtmp")
68+
fi
69+
else
70+
# Fallback: grep for numbers labeled cycles or count keys
71+
cycles=$(grep -Eo '"cycle_count"[[:space:]]*:[[:space:]]*[0-9]+' "$outtmp" | head -n1 | grep -Eo '[0-9]+') || true
72+
if [[ -z "$cycles" ]]; then
73+
cycles=$(grep -Eo '"cycles"[[:space:]]*:[[:space:]]*[0-9]+' "$outtmp" | head -n1 | grep -Eo '[0-9]+') || true
74+
fi
75+
fi
76+
77+
echo "${threads},${run},${elapsed},${rss_kb},${cycles}" >> "$OUTCSV"
78+
79+
rm -f "$outtmp" "$timetmp"
80+
done
81+
done
82+
83+
echo "Done. Results at: $OUTCSV"
File renamed without changes.
File renamed without changes.

docs/report.html

Lines changed: 81 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,81 @@
1+
<!doctype html>
2+
<html lang="en">
3+
<head>
4+
<meta charset="utf-8">
5+
<title>MCAAT: Cycle Finder — Algorithmic & Optimization Report</title>
6+
<meta name="viewport" content="width=device-width, initial-scale=1">
7+
<style>
8+
body { font-family: Arial, sans-serif; max-width: 900px; margin: 2rem auto; line-height: 1.6; color:#222 }
9+
pre { background:#f6f6f6; padding:0.5rem; }
10+
h1,h2 { color:#0b63a3 }
11+
.note { background:#fffbe6; border-left:4px solid #ffd24d; padding:0.5rem; }
12+
</style>
13+
</head>
14+
<body>
15+
<h1>MCAAT: Cycle Finder — Algorithmic & Optimization Report ✅</h1>
16+
<p><strong>Scope:</strong> This document describes <em>only</em> the algorithmic changes and optimizations introduced in the <code>optimizations</code> branch for the cycle finder logic, organized step-by-step.</p>
17+
<hr/>
18+
<h2>Summary</h2>
19+
<ol>
20+
<li>Replaced critical sections with per-thread buffers + serial merge.</li>
21+
<li>Introduced a lock-free atomic bitset as the visited structure.</li>
22+
<li>Reduced allocations by reusing per-thread pools (megahit-style).</li>
23+
<li>Applied traversal micro-optimizations (prefetch, fixed arrays, branch hints).</li>
24+
</ol>
25+
26+
<h2>Step-by-step changes</h2>
27+
<ol>
28+
<li><strong>Per-thread buffers + serial merge</strong>
29+
<ul>
30+
<li>Replaced concurrent writes to shared containers with <code>local_chunks[tid]</code> and <code>local_results[tid]</code>, then merged serially.</li>
31+
<li>Files: <code>ChunkStartNodes</code>, <code>FindApproximateCRISPRArrays</code>.</li>
32+
</ul>
33+
</li>
34+
35+
<li><strong>Lock-free visited bitmap</strong>
36+
<ul>
37+
<li>Introduced <code>std::vector&lt;uint64_t&gt; s_visited_words</code> and helpers: <code>InitializeVisitedGlobal</code>, <code>IsVisitedGlobal</code>, <code>MarkVisitedGlobal</code>.</li>
38+
<li>Atomic builtins (<code>__atomic_load_n</code>, <code>__atomic_fetch_or</code>) are used with relaxed ordering.</li>
39+
</ul>
40+
</li>
41+
42+
<li><strong>Per-thread pools & fewer allocations</strong>
43+
<ul>
44+
<li>Added <code>static thread_local</code> pools for DLS stack and visited set; reuse capacity to avoid repeated alloc/free.</li>
45+
<li>File: <code>DepthLevelSearch</code>.</li>
46+
</ul>
47+
</li>
48+
49+
<li><strong>Traversal micro-optimizations</strong>
50+
<ul>
51+
<li>Fixed-size neighbor arrays, prefetch, branch hints, and small unrolling in hot loops.</li>
52+
</ul>
53+
</li>
54+
55+
<li><strong>Serial result merging & memory hygiene</strong>
56+
<ul>
57+
<li>Per-thread maps are moved into the shared results in a serial loop; call <code>malloc_trim(0)</code> intermittently.</li>
58+
</ul>
59+
</li>
60+
</ol>
61+
62+
<h2>Expected impact</h2>
63+
<ul>
64+
<li>Better multithreaded scaling due to reduced contention and allocator pressure.</li>
65+
<li>Reasonable memory usage (1 bit per node for visited bitset).</li>
66+
</ul>
67+
68+
<h2>Limitations & future work</h2>
69+
<div class="note">NUMA-aware allocation and deeper profiling are recommended next steps.</div>
70+
71+
<h2>Quick validation</h2>
72+
<ol>
73+
<li>Checkout <code>optimizations</code>, build, and run the same workload across several thread counts (1, 8, 24, 48, 128).</li>
74+
<li>Use <code>perf</code> and <code>numastat</code> to verify reduced contention and memory hotspots.</li>
75+
</ol>
76+
77+
<hr/>
78+
<p><strong>Files touched:</strong> <code>src/cycle_finder.cpp</code> (+ associated header updates).</p>
79+
<p>TL;DR: per-thread buffers + lock-free visited bitmap + reused pools = less contention and better parallel throughput.</p>
80+
</body>
81+
</html>

docs/report.md

Lines changed: 109 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,109 @@
1+
# MCAAT: Cycle Finder — Algorithmic & Optimization Report ✅
2+
3+
**Scope:** This document describes *only* the algorithmic changes and optimizations introduced in the `optimizations` branch for the cycle finder logic. It is organized so you can read step-by-step what changed, why it was done, and the expected impact.
4+
5+
---
6+
7+
## Summary
8+
9+
1. Replaced global critical sections and shared writes with *per-thread buffers* and a single serial merge step to remove contention.
10+
2. Replaced lock-based or synchronized "visited" bookkeeping with a *lock-free atomic bitset* (1 bit per node).
11+
3. Reduced allocations and allocator contention by *reusing per-thread pools* (megahit-style) and preallocating where useful.
12+
4. Applied traversal micro-optimizations (fixed-size arrays, prefetch, branch hints) to reduce per-edge overhead.
13+
14+
---
15+
16+
## Step-by-step changes (algorithmic & optimization focus)
17+
18+
1) Remove global critical sections → Per-thread buffers + serial merge 🔧
19+
- What changed:
20+
- Replaced OpenMP `#pragma omp critical` style updates to shared containers with a `vector<...>` of per-thread collectors (e.g., `local_chunks` and `local_results`).
21+
- After the parallel loop completes, a single-threaded loop merges per-thread buffers into the shared map or results container.
22+
- Files/locations:
23+
- `CycleFinder::ChunkStartNodes` (collect start nodes into `local_chunks[tid]` then merge).
24+
- `CycleFinder::FindApproximateCRISPRArrays` (collect per-thread `local_results`, then merge into `this->results`).
25+
- Why / Benefit:
26+
- Eliminates high-contention points on hot shared data structures, enabling scaling to higher core counts.
27+
- Serial merge cost is amortized and avoids expensive locking in hot loops.
28+
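A minimal sketch of the per-thread collector pattern described in this step. The function `CollectEvenNodes`, the helpers `ThreadCount` and `ThreadId`, and the even-node filter are hypothetical stand-ins, not the actual `ChunkStartNodes` logic; only the `local_chunks[tid]` idea is taken from the report. The OpenMP parts are guarded so the code also builds and runs serially without `-fopenmp`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Upper bound on the number of threads a parallel region may use.
static int ThreadCount() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Index of the calling thread inside the current parallel region.
static int ThreadId() {
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

// Hypothetical example: collect "interesting" nodes (here: even ids).
std::vector<uint64_t> CollectEvenNodes(uint64_t node_count) {
    // One private buffer per thread: no lock is needed while collecting.
    std::vector<std::vector<uint64_t>> local_chunks(
        static_cast<std::size_t>(ThreadCount()));

#ifdef _OPENMP
    #pragma omp parallel for schedule(static)
#endif
    for (int64_t node = 0; node < static_cast<int64_t>(node_count); ++node) {
        const std::size_t tid = static_cast<std::size_t>(ThreadId());
        assert(tid < local_chunks.size());
        if (node % 2 == 0) {
            local_chunks[tid].push_back(static_cast<uint64_t>(node));
        }
    }

    // Serial merge: runs once, after the parallel loop has finished.
    std::vector<uint64_t> merged;
    for (const std::vector<uint64_t>& chunk : local_chunks) {
        merged.insert(merged.end(), chunk.begin(), chunk.end());
    }
    return merged;
}
```

The merge is a single pass over the per-thread buffers, so its cost is paid once per parallel region instead of once per insertion.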
29+
2) Lock-free visited bitmap (1 bit per node) 🔒→⚡
30+
- What changed:
31+
- Introduced a global `std::vector<uint64_t> s_visited_words` as a bitset (one bit per node).
32+
- Provided helper inline functions: `InitializeVisitedGlobal(n)`, `IsVisitedGlobal(node)`, and `MarkVisitedGlobal(node)` implemented using GCC/Clang atomic builtins (`__atomic_load_n`, `__atomic_fetch_or`) with `__ATOMIC_RELAXED` ordering.
33+
- Files/locations:
34+
- `src/cycle_finder.cpp` (static `s_visited_words` and helpers) and uses in `FindCycle`, `FindCycleUtil`, and background checks.
35+
- Why / Benefit:
36+
- Avoids the pitfalls of `std::vector<std::atomic<...>>` (`std::atomic` is neither copyable nor movable, so such a vector cannot be resized or copied) and the overhead of locks around visited updates.
37+
- Each update is a single atomic OR on one 64-bit word (a bit is set, never cleared), which is cheap and avoids lock contention.
38+
- Memory is compact (1 bit per node) and predictable for large graphs.
39+
- Correctness note:
40+
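- Using relaxed atomics is acceptable because bits only transition from 0→1 (monotonic), so races among writers cannot lose an update. A reader may briefly observe a stale 0, which the traversal must tolerate. Relaxed ordering gives no ordering guarantee for other memory, so no other data should be published through these bits.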
41+
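The helpers named above could look roughly like the sketch below. The names `s_visited_words`, `InitializeVisitedGlobal`, `IsVisitedGlobal`, and `MarkVisitedGlobal` come from the report, but the signatures, the return value of `MarkVisitedGlobal`, and the bounds checks are assumptions made for illustration. It relies on the GCC/Clang `__atomic` builtins.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One bit per node, packed into 64-bit words.
static std::vector<uint64_t> s_visited_words;

// Size the bitmap for n nodes and clear it. Call before the parallel phase.
static inline void InitializeVisitedGlobal(std::size_t n) {
    s_visited_words.assign((n + 63) / 64, 0);
}

// Relaxed load of the word that holds this node's bit.
static inline bool IsVisitedGlobal(uint64_t node) {
    assert(node / 64 < s_visited_words.size());
    const uint64_t word =
        __atomic_load_n(&s_visited_words[node >> 6], __ATOMIC_RELAXED);
    return ((word >> (node & 63)) & 1ULL) != 0;
}

// Atomically set the bit. Returns true if this call changed it from 0 to 1
// (assumed convention, so a caller can tell whether it "claimed" the node).
static inline bool MarkVisitedGlobal(uint64_t node) {
    assert(node / 64 < s_visited_words.size());
    const uint64_t mask = 1ULL << (node & 63);
    const uint64_t old =
        __atomic_fetch_or(&s_visited_words[node >> 6], mask, __ATOMIC_RELAXED);
    return (old & mask) == 0;
}
```

Because `__atomic_fetch_or` returns the previous word, exactly one of several racing callers sees the bit as previously unset, which is what makes the test-and-set usable without a lock.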
42+
3) Reduce allocations and reuse per-thread pools (megahit-style) ♻️
43+
- What changed:
44+
- Introduced `static thread_local` pools for DLS (`dls_stack_pool` and `dls_visited_pool`) used by `DepthLevelSearch`.
45+
- Pools are `clear()`ed between uses but retain capacity; small initial reserve is set to avoid repeated small allocations.
46+
- Files/locations:
47+
- `CycleFinder::DepthLevelSearch`.
48+
- Why / Benefit:
49+
- Avoids heavy allocator contention when many threads create/destroy temporaries frequently.
50+
- Expected to reduce per-edge latency and improve throughput during parallel graph traversal.
51+
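A sketch of the pool-reuse idea, assuming a simple adjacency-list graph. The pool names `dls_stack_pool` and `dls_visited_pool` are from the report; the function `DepthBoundedVisitCount`, the `Graph` alias, the reserve sizes, and the linear visited lookup are hypothetical and chosen only to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Graph = std::vector<std::vector<uint32_t>>;  // adjacency lists

// Runs a depth-bounded DFS from `start` and returns how many distinct nodes
// it visited. A node first reached at the depth limit is not expanded again,
// which is adequate for a sketch but not a general reachability query.
std::size_t DepthBoundedVisitCount(const Graph& adj, uint32_t start,
                                   uint32_t max_depth) {
    // Per-thread pools: constructed once per thread, reused on every call.
    static thread_local std::vector<std::pair<uint32_t, uint32_t>> dls_stack_pool;
    static thread_local std::vector<uint32_t> dls_visited_pool;

    assert(start < adj.size());
    dls_stack_pool.clear();    // drops the elements, keeps the capacity
    dls_visited_pool.clear();
    dls_stack_pool.reserve(64);   // small initial reserve (assumed size)
    dls_visited_pool.reserve(64);

    dls_stack_pool.emplace_back(start, 0u);  // (node, depth)
    while (!dls_stack_pool.empty()) {
        const std::pair<uint32_t, uint32_t> top = dls_stack_pool.back();
        dls_stack_pool.pop_back();
        const uint32_t node = top.first;
        const uint32_t depth = top.second;

        if (std::find(dls_visited_pool.begin(), dls_visited_pool.end(), node) !=
            dls_visited_pool.end()) {
            continue;  // already seen during this search
        }
        dls_visited_pool.push_back(node);
        if (depth == max_depth) continue;

        for (uint32_t next : adj[node]) {
            dls_stack_pool.emplace_back(next, depth + 1);
        }
    }
    return dls_visited_pool.size();
}
```

After the first few calls on a thread the pools have grown to their working size, so later searches typically run without touching the allocator.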
52+
4) Traversal micro-optimizations (branch hints, fixed arrays, prefetch) 🧠
53+
- What changed:
54+
- Use of fixed-size neighbor arrays (`uint64_t neighbors[MAX_EDGE_COUNT]`) rather than heap allocations per node.
55+
- Prefetching neighbor buffers and using `__builtin_expect` branch hints to optimize hot paths.
56+
- Small loop unrolling where out-degree is small (de Bruijn graph pattern) to reduce loop overhead.
57+
- Files/locations:
58+
- `DepthLevelSearch`, `_GetOutgoings`, and `_GetIncomings` helpers.
59+
- Why / Benefit:
60+
- Better cache locality and fewer branch mispredictions; straightforward per-edge speedups with little code complexity.
61+
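The sketch below illustrates three of these techniques on a toy compressed-sparse-row graph: a fixed-size neighbor array, a prefetch of the next node's offsets, and a `__builtin_expect` hint. `CsrGraph`, `GetOutgoings`, `CountFrontierEdges`, and the value of `MAX_EDGE_COUNT` are assumptions for illustration; the real `_GetOutgoings` works on the succinct de Bruijn graph and is not shown in this commit. Loop unrolling is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t MAX_EDGE_COUNT = 4;  // assumed bound on out-degree

// Toy CSR graph: the edges of node v are targets[offsets[v] .. offsets[v+1]).
struct CsrGraph {
    std::vector<uint32_t> offsets;
    std::vector<uint64_t> targets;
};

// Copies up to MAX_EDGE_COUNT successors of `node` into a caller-provided
// stack array, so no heap allocation happens per node. Returns the count.
static inline std::size_t GetOutgoings(const CsrGraph& g, uint64_t node,
                                       uint64_t (&neighbors)[MAX_EDGE_COUNT]) {
    assert(node + 1 < g.offsets.size());
    std::size_t count = 0;
    for (uint32_t i = g.offsets[node];
         i < g.offsets[node + 1] && count < MAX_EDGE_COUNT; ++i) {
        neighbors[count++] = g.targets[i];
    }
    return count;
}

// Counts the outgoing edges of every node in `frontier`.
uint64_t CountFrontierEdges(const CsrGraph& g,
                            const std::vector<uint64_t>& frontier) {
    uint64_t total = 0;
    for (std::size_t i = 0; i < frontier.size(); ++i) {
        if (i + 1 < frontier.size()) {
            // Hint: start loading the next node's offsets while this one runs.
            __builtin_prefetch(&g.offsets[frontier[i + 1]]);
        }
        uint64_t neighbors[MAX_EDGE_COUNT];
        const std::size_t count = GetOutgoings(g, frontier[i], neighbors);
        // Hint: dead ends are assumed to be the rare case.
        if (__builtin_expect(count == 0, 0)) continue;
        total += count;
    }
    return total;
}
```

Prefetch and branch hints are only hints; whether they help depends on the access pattern and should be confirmed with a profiler.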
62+
5) Results merging and memory hygiene 🧽
63+
- What changed:
64+
- Per-thread `local_results` (maps) are merged serially into `this->results` after each bucket is processed.
65+
- Call `malloc_trim(0)` occasionally after buckets to release heap fragments back to the OS (for long runs with variable memory usage).
66+
- Files/locations:
67+
- `FindApproximateCRISPRArrays`.
68+
- Why / Benefit:
69+
- Avoids modifying the shared `unordered_map` from multiple threads (which would otherwise require locking) and helps long runs avoid growing their memory footprint unnecessarily.
70+
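A minimal sketch of the serial merge followed by a trim. `ResultMap`, its key and value types, and `MergeLocalResults` are hypothetical; the report only states that per-thread maps are merged serially and that `malloc_trim(0)` is called occasionally. `malloc_trim` is a glibc extension, so the call is guarded.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>
#if defined(__GLIBC__)
#include <malloc.h>  // malloc_trim (glibc only)
#endif

// Assumed shape of a result container: key -> list of node ids.
using ResultMap = std::unordered_map<uint64_t, std::vector<uint64_t>>;

// Moves every per-thread map into `results`. Must be called from a single
// thread, after the parallel region that filled `local_results` has ended.
void MergeLocalResults(std::vector<ResultMap>& local_results,
                       ResultMap& results) {
    for (ResultMap& local : local_results) {
        assert(&local != &results);  // the shared map must not be a local one
        for (auto& entry : local) {
            std::vector<uint64_t>& dst = results[entry.first];
            if (dst.empty()) {
                dst = std::move(entry.second);  // steal the buffer, no copy
            } else {
                dst.insert(dst.end(), entry.second.begin(), entry.second.end());
            }
        }
        local.clear();  // ready for the next bucket
    }
#if defined(__GLIBC__)
    // Ask the allocator to hand free memory at the top of the heap back to
    // the OS. In a real run this would be done only every few buckets.
    malloc_trim(0);
#endif
}
```

Moving the value vectors instead of copying them keeps the merge cheap when keys rarely collide across threads.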
71+
---
72+
73+
## Expected performance and behavior improvements
74+
75+
- Improved scalability with thread counts beyond the earlier observed plateau (~24 cores) because:
76+
- Contention points are removed or drastically reduced.
77+
- Allocator pressure is lowered by reusing containers.
78+
- Atomic operations on compact bitmaps replace heavier locks.
79+
- Memory cost: the visited bitset adds ~1 bit per node (compact) and per-thread buffers increase transient memory usage proportional to thread count but only for selected nodes.
80+
81+
---
82+
83+
## Limitations & future work
84+
85+
- NUMA-aware allocation and memory binding have not been implemented yet; this is the natural next step for large multi-socket machines where memory bandwidth dominates.
86+
- Further profiling (perf/VTune) is needed to quantify the exact causes of any remaining scalability bottlenecks (cache-line bouncing, allocator hotspots, or procedural serial sections).
87+
88+
---
89+
90+
## How to validate quickly (recommended)
91+
92+
1. Check out the `optimizations` branch.
93+
2. Build (`cmake .. && make -j`) and run the same workload used before.
94+
3. Compare (a) execution time vs thread count (1, 8, 24, 48, 128), (b) throughput (nodes/sec), and (c) cycles found to ensure no correctness regression.
95+
4. Use `perf top` / `perf record` to verify reduced lock/atomic time and identify remaining hotspots, and `numastat` to check how memory is distributed across NUMA nodes.
96+
97+
---
98+
99+
## Files touched (algorithmic/optimization only)
100+
101+
- `src/cycle_finder.cpp` — main implementation of lock-free visited bitmap, per-thread collectors, DLS pools, traversal micro-optimizations, merging logic.
102+
- `include/cycle_finder.h` — updated helpers and declarations related to visited bookkeeping (if applicable).
103+
104+
---
105+
106+
## TL;DR
107+
108+
- Replaced shared locks with per-thread buffers + serial merges, added a compact lock-free visited bitmap, and reduced allocation churn via per-thread pools. These changes reduce contention and allocator pressure and improve multithreaded scaling while keeping memory usage reasonable for very large graphs.
109+

include/cycle_finder.h

Lines changed: 3 additions & 1 deletion
@@ -30,8 +30,10 @@ class CycleFinder {
 // Use SDBG pointer from settings everywhere instead of storing a separate reference
 //SDBG& sdbg;
 uint16_t cluster_bounds;
-vector<bool> visited;
+// visited bitset stored as 64-bit words (1 bit per node). Use atomic builtins on the words to avoid non-copyable std::atomic in vectors.
+vector<vector<uint64_t>> per_thread_visited;
 vector<bool> look_up_table;
+
 // thread count obtained from settings

 //#### DEVELOPER FUNCTIONS ####

readme.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
 - Better data structures for preprocessing, `phmap::flat_hash_set`
 - Added compiler intrinsics to guide the hardware in the right direction
 - Reserving the capacity to prevent rehashing
-In depth technical details: [educational resource](./src/z_educational_guide.md) and [optimization developer notes](./src/z_optimization_dev_notes.md). As a result of the above optimizations we achieved __17-25__ times speedup in __1billion__ node graph(from 3 <span style="color:red">days</span> to 3 <span style="color:green">hours</span>). Considering the complexity of the graphs, this is a huge improvement.
+In depth technical details: [educational resource](./src/z_educational_guide.md) and [optimization developer notes](./src/z_optimization_dev_notes.md).


 ### Installation using docker
