Skip to content

Performance tuning of FREEMIX calculations from BAM#82

Merged
Griffan merged 10 commits into
Griffan:masterfrom
tfenne:tf_perf_tuning
Apr 19, 2026
Merged

Performance tuning of FREEMIX calculations from BAM#82
Griffan merged 10 commits into
Griffan:masterfrom
tfenne:tf_perf_tuning

Conversation

@tfenne
Copy link
Copy Markdown
Contributor

@tfenne tfenne commented Apr 14, 2026

Hi @Griffan - I've been running VerifyBamID2 on some fairly large panels (because my goal is to get to very low coverage), and spent some time working on the runtime of the process for calculating FREEMIX from a BAM file. This is all work done in partnership with Claude. I've done my best to organize this into atomic commits, with some additional logging and tests developed up front. These changes to not alter the outputs at all for the tests or my benchmarking real-world test case with ~5x coverage of 14k markers.

Please feel free to push back on any of this that you don't like, or would like restructured, better documented, better tested etc. But the end results are pretty compelling - I'm at ~20-fold speed up (i.e. 95% reduction) of the optimization/post BAM-reading phase running on an aarch64 mac, and I expect that the results on x86 will be broadly similar.

Summary

The Amoeba minimizer calls ComputeMixLLKs() up to 50,000 times during optimization, and each call iterates over every marker in the reference panel. As panels grow, this becomes the dominant cost. This PR applies a series of targeted optimizations to the inner loop of ComputeMixLLKs() — replacing expensive per-call computations with precomputed tables and eliminating redundant work — to achieve a 22x speedup on the optimization phase with bit-identical results on all tests.

All optimizations exploit the same insight: the inputs to the inner loop are drawn from small discrete domains, so values that were being recomputed millions of times can instead be looked up from compact tables built once per call or once at startup.

Benchmark

Measured on a 14K-marker feline panel, single-threaded (--NumPC 5 --NumThread 1), Release build:

Commit Description Optimize (s) vs Baseline
Baseline (before this PR) 198.5
cf01323 Add phase timing instrumentation. RAII PhaseTimer logs start/end/duration to stderr for each major phase (SVD loading, BAM reading, sanity check, optimization sub-phases), providing the measurement infrastructure for all subsequent changes. 198.5
2a45948 Add tests for all six optimization code paths. Previously only the default BetweenAncestry path was tested. Adds tests for --WithinAncestry, --WithinAncestry --FixPC, --FixAlpha, and --FixPC, covering OptimizeHomo, OptimizeHomoFixedPC, OptimizeHomoFixedAlpha, OptimizeHeterFixedPC, and OptimizeHeterFixedAlpha. All use exact-diff validation against golden output files. 198.5
f0e0367 Use const references for base/quality vectors in the hot loop. GetBaseInfoAt/GetQualInfoAt return references, but the loop was assigning to value types, causing a deep copy per marker per call. Negligible runtime impact but corrects a wasteful pattern. ~198 ~0%
b9e5c18 Replace Phred() pow() calls with a precomputed 94-entry lookup table. Quality scores are discrete integers in [0, 93], so pow(10.0, x/-10.0) can be tabulated once at startup. Also hoists the lookup above the 3x3 genotype loop so each base's error probability is fetched once instead of 9 times. 136.6 -31%
54f08db Replace getConditionalBaseLK (a 50-line branchy function called 36x per base) with a static constexpr 2x3x3 lookup table, and restructure the inner loop from genotype-outer/base-inner to base-outer/genotype-inner. Each base is now classified once (ref/alt/other), its 6 conditional likelihoods looked up from the table, and contributions accumulated into 9 genotype-pair accumulators — instead of processing each base 9 separate times. 47.1 -76%
fef0d70 Precompute resolved markers into a flat vector at startup, eliminating 4-6 nested unordered_map lookups per marker per ComputeMixLLKs call (posIndex, ChooseBed, knownAF). The hot loop now uses a single integer comparison to skip absent markers and direct array indexing for base/quality data. 34.4 -83%
cc59c54 Precompute a log-likelihood lookup table at the top of each ComputeMixLLKs call. Within a call, alpha is constant, so log(val) depends only on base class (3), quality score (94), and genotype pair (9) = 2,538 unique values. Building this table once replaces ~1.3M log() calls per invocation with table additions. 9.1 -95%

Key design decisions

  • Bit-identical results: Every optimization produces the exact same floating-point output. Tests use binary diff against golden files with no tolerance — all 15 tests pass on every commit.
  • One concern per commit: Each commit is independently buildable, testable, and benchmarkable. The commits were evaluated individually during development and only kept if they provided a clear benefit.
  • No algorithmic changes: The mathematical model, convergence criteria, and optimizer (Amoeba/Nelder-Mead) are unchanged. All speedups come from avoiding redundant computation of values that can be precomputed or tabulated.

Test plan

  • All 15 CTest tests pass (7 original + 8 new) with exact-diff validation
  • FREEMIX output identical to baseline on benchmark sample
  • Phase timing confirms optimization is no longer the bottleneck (9s optimization vs 29s BAM reading)
  • CI passes on Linux (GCC) and macOS (Clang)

tfenne added 7 commits April 14, 2026 10:39
Add RAII PhaseTimer that logs start/end/duration for each major phase
to stderr via notice(). Phases timed in main.cpp: SVD loading, BAM/pileup
reading, sanity check, and overall optimization. Sub-phases timed in
ContaminationEstimator.cpp: likelihood initialization, each optimizer
variant, and null-model LLK calculation.
Previously only the default BetweenAncestry path (OptimizeHeter +
OptimizeHomo initial) was tested. Add tests covering:
- WithinAncestry (pure OptimizeHomo)
- WithinAncestry + FixPC (OptimizeHomoFixedPC)
- FixAlpha (OptimizeHomoFixedAlpha initial + OptimizeHeterFixedAlpha)
- FixPC in BetweenAncestry (OptimizeHeterFixedPC)

This brings coverage from 1 of 6 optimization code paths to all 6,
providing a safety net for upcoming performance optimizations.
GetBaseInfoAt/GetQualInfoAt return references but the loop was
assigning to value types, causing a deep copy per marker per call.
Use const references instead.
Quality scores are discrete integers in [0,93]. Precompute a 94-entry
table of pow(10.0, x/-10.0) once, then use table lookups instead of
calling pow() in the hot loop. Also hoist the lookup above the 3x3
genotype loop so each base's error probability is fetched once instead
of 9 times.

~31% reduction in optimization phase runtime on a 14K-marker panel.
Replace getConditionalBaseLK (50-line branchy function called 36x per
base) with a static 2x3x3 lookup table indexed by [is_error][genotype]
[base_class]. Classify each base once as ref/alt/other, then look up
all 6 conditional likelihoods (3 genotypes x 2 error states).

Restructure the inner loop from genotype-outer/base-inner (processing
each base 9 times) to base-outer/genotype-inner (processing each base
once and accumulating into 9 genotype-pair log-likelihood accumulators).

Combined with the prior Phred table change, the optimization phase is
now ~4.2x faster than baseline (198s -> 47s) with bit-identical results.
ComputeMixLLKs previously performed 4-6 nested unordered_map lookups
per marker per call (posIndex, ChooseBed, knownAF). Since these values
are constant across all optimization iterations, resolve them once into
a flat vector of ResolvedMarker structs after data loading.

The hot loop now uses a single integer comparison to skip absent markers
and direct array indexing for base/quality data and alt alleles.

~27% reduction in optimization phase runtime (47s -> 34s), bringing the
cumulative speedup to ~5.8x vs baseline.
Within a single ComputeMixLLKs call, alpha is constant. The per-base
log-likelihood depends only on four discrete inputs: base class
(ref/alt/other), quality score (Phred 0-93), and the genotype pair
(g1, g2). That's 3 x 94 x 9 = 2,538 unique values.

Build a lookup table of these 2,538 log() results once at the top of
each ComputeMixLLKs call, then replace per-base log() calls with
table additions. This typically replaces ~1.3M log() calls with 2,538.

~74% reduction in optimization phase runtime (34s -> 9s), bringing
the cumulative speedup to ~22x vs the original baseline (198s -> 9s).
@tfenne tfenne changed the title Tf perf tuning Performance tuning of FREEMIX calculations from BAM Apr 14, 2026
In C++11, a static constexpr data member declared inside a class
requires an out-of-class definition if it is ODR-used. GCC enforces
this strictly while Clang does not, causing an undefined reference
to COND_LK on Ubuntu CI builds with GCC.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR focuses on speeding up VerifyBamID2’s FREEMIX optimization when estimating contamination from BAM/CRAM/pileup inputs by reducing repeated work inside the hottest likelihood loop.

Changes:

  • Adds phase-level timing instrumentation (RAII PhaseTimer) around major execution and optimization steps.
  • Optimizes ComputeMixLLKs() by replacing repeated computation (pow, log, nested map lookups, conditional branching) with compact lookup tables and pre-resolved per-marker metadata.
  • Expands CTest coverage to exercise additional optimization modes and adds corresponding golden expected outputs.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
main.cpp Adds phase timing around SVD loading, input reading, sanity check, and optimization.
ContaminationEstimator.h Introduces Phred lookup table, conditional likelihood lookup table, per-call log-likelihood table, and uses resolvedMarkers + direct base/qual indexing in the hot loop.
ContaminationEstimator.cpp Adds optimization sub-phase timers and precomputes resolvedMarkers once before optimization.
CMakeLists.txt Adds new tests for WithinAncestry/FixPC/FixAlpha/Heter FixPC paths with golden diffs.
resource/test/expected/test.*.Ancestry Adds new golden expected ancestry output files for the added test cases.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ContaminationEstimator.h Outdated
Comment thread ContaminationEstimator.h
Comment thread CMakeLists.txt Outdated
tfenne added 2 commits April 17, 2026 05:03
- getPhredTable(): replace manual 'static bool initialized' guard with
  C++11 thread-safe local static initialization via an immediately-invoked
  lambda, so concurrent first callers can never observe a partially-
  initialized table.

- ComputeMixLLKs(): clamp the per-base quality score to [0, 93] before
  indexing logLkTable. BAM qualities are guaranteed in-range but pileup
  input is ASCII and a malformed file could drive the index out of bounds.

- CTests: give each newly added test pair its own --Output prefix and add
  a DEPENDS property on the Diff test so CTest can safely run tests in
  parallel without racing on shared result.* filenames or sequencing a
  Diff before its Run.
Extends the --Output + DEPENDS pattern from the previous commit to the
six existing tests.  myTest1, myTest3, and myTest5 all wrote to the
shared result.Ancestry / result.selfSM filenames, and their paired
Diff tests (myTest2/4/6) had no explicit ordering, so the suite was
only reliable when CTest happened to run tests serially.

This race was discovered while running the suite under `ctest -j8` to
confirm the newly-added tests from the previous commit behaved
correctly under parallel execution: the new tests all passed, but
myTest2/4/6 began failing intermittently as the older Run tests
clobbered each other's output.  Giving each Run its own --Output
prefix and adding DEPENDS on each Diff test makes the full suite
parallel-safe (15/15 passing with -j8).
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ContaminationEstimator.h
@Griffan Griffan self-requested a review April 19, 2026 02:08
Copy link
Copy Markdown
Owner

@Griffan Griffan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a long overdue optimization, thanks! @tfenne

@Griffan Griffan merged commit 1cb0101 into Griffan:master Apr 19, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants