Performance tuning of FREEMIX calculations from BAM by tfenne · Pull Request #82 · Griffan/VerifyBamID

tfenne · 2026-04-14T19:15:39Z

Hi @Griffan - I've been running VerifyBamID2 on some fairly large panels (because my goal is to get to very low coverage), and spent some time working on the runtime of the process for calculating FREEMIX from a BAM file. This is all work done in partnership with Claude. I've done my best to organize this into atomic commits, with some additional logging and tests developed up front. These changes to not alter the outputs at all for the tests or my benchmarking real-world test case with ~5x coverage of 14k markers.

Please feel free to push back on any of this that you don't like, or would like restructured, better documented, better tested etc. But the end results are pretty compelling - I'm at ~20-fold speed up (i.e. 95% reduction) of the optimization/post BAM-reading phase running on an aarch64 mac, and I expect that the results on x86 will be broadly similar.

Summary

The Amoeba minimizer calls ComputeMixLLKs() up to 50,000 times during optimization, and each call iterates over every marker in the reference panel. As panels grow, this becomes the dominant cost. This PR applies a series of targeted optimizations to the inner loop of ComputeMixLLKs() — replacing expensive per-call computations with precomputed tables and eliminating redundant work — to achieve a 22x speedup on the optimization phase with bit-identical results on all tests.

All optimizations exploit the same insight: the inputs to the inner loop are drawn from small discrete domains, so values that were being recomputed millions of times can instead be looked up from compact tables built once per call or once at startup.

Benchmark

Measured on a 14K-marker feline panel, single-threaded (--NumPC 5 --NumThread 1), Release build:

Commit	Description	Optimize (s)	vs Baseline
—	Baseline (before this PR)	198.5	—
`cf01323`	Add phase timing instrumentation. RAII `PhaseTimer` logs start/end/duration to stderr for each major phase (SVD loading, BAM reading, sanity check, optimization sub-phases), providing the measurement infrastructure for all subsequent changes.	198.5	—
`2a45948`	Add tests for all six optimization code paths. Previously only the default BetweenAncestry path was tested. Adds tests for `--WithinAncestry`, `--WithinAncestry --FixPC`, `--FixAlpha`, and `--FixPC`, covering `OptimizeHomo`, `OptimizeHomoFixedPC`, `OptimizeHomoFixedAlpha`, `OptimizeHeterFixedPC`, and `OptimizeHeterFixedAlpha`. All use exact-diff validation against golden output files.	198.5	—
`f0e0367`	Use const references for base/quality vectors in the hot loop. `GetBaseInfoAt`/`GetQualInfoAt` return references, but the loop was assigning to value types, causing a deep copy per marker per call. Negligible runtime impact but corrects a wasteful pattern.	~198	~0%
`b9e5c18`	Replace `Phred()` pow() calls with a precomputed 94-entry lookup table. Quality scores are discrete integers in [0, 93], so `pow(10.0, x/-10.0)` can be tabulated once at startup. Also hoists the lookup above the 3x3 genotype loop so each base's error probability is fetched once instead of 9 times.	136.6	-31%
`54f08db`	Replace `getConditionalBaseLK` (a 50-line branchy function called 36x per base) with a static `constexpr` 2x3x3 lookup table, and restructure the inner loop from genotype-outer/base-inner to base-outer/genotype-inner. Each base is now classified once (ref/alt/other), its 6 conditional likelihoods looked up from the table, and contributions accumulated into 9 genotype-pair accumulators — instead of processing each base 9 separate times.	47.1	-76%
`fef0d70`	Precompute resolved markers into a flat vector at startup, eliminating 4-6 nested `unordered_map` lookups per marker per `ComputeMixLLKs` call (posIndex, ChooseBed, knownAF). The hot loop now uses a single integer comparison to skip absent markers and direct array indexing for base/quality data.	34.4	-83%
`cc59c54`	Precompute a log-likelihood lookup table at the top of each `ComputeMixLLKs` call. Within a call, alpha is constant, so `log(val)` depends only on base class (3), quality score (94), and genotype pair (9) = 2,538 unique values. Building this table once replaces ~1.3M `log()` calls per invocation with table additions.	9.1	-95%

Key design decisions

Bit-identical results: Every optimization produces the exact same floating-point output. Tests use binary diff against golden files with no tolerance — all 15 tests pass on every commit.
One concern per commit: Each commit is independently buildable, testable, and benchmarkable. The commits were evaluated individually during development and only kept if they provided a clear benefit.
No algorithmic changes: The mathematical model, convergence criteria, and optimizer (Amoeba/Nelder-Mead) are unchanged. All speedups come from avoiding redundant computation of values that can be precomputed or tabulated.

Test plan

All 15 CTest tests pass (7 original + 8 new) with exact-diff validation
FREEMIX output identical to baseline on benchmark sample
Phase timing confirms optimization is no longer the bottleneck (9s optimization vs 29s BAM reading)
CI passes on Linux (GCC) and macOS (Clang)

Add RAII PhaseTimer that logs start/end/duration for each major phase to stderr via notice(). Phases timed in main.cpp: SVD loading, BAM/pileup reading, sanity check, and overall optimization. Sub-phases timed in ContaminationEstimator.cpp: likelihood initialization, each optimizer variant, and null-model LLK calculation.

Previously only the default BetweenAncestry path (OptimizeHeter + OptimizeHomo initial) was tested. Add tests covering: - WithinAncestry (pure OptimizeHomo) - WithinAncestry + FixPC (OptimizeHomoFixedPC) - FixAlpha (OptimizeHomoFixedAlpha initial + OptimizeHeterFixedAlpha) - FixPC in BetweenAncestry (OptimizeHeterFixedPC) This brings coverage from 1 of 6 optimization code paths to all 6, providing a safety net for upcoming performance optimizations.

GetBaseInfoAt/GetQualInfoAt return references but the loop was assigning to value types, causing a deep copy per marker per call. Use const references instead.

Quality scores are discrete integers in [0,93]. Precompute a 94-entry table of pow(10.0, x/-10.0) once, then use table lookups instead of calling pow() in the hot loop. Also hoist the lookup above the 3x3 genotype loop so each base's error probability is fetched once instead of 9 times. ~31% reduction in optimization phase runtime on a 14K-marker panel.

Replace getConditionalBaseLK (50-line branchy function called 36x per base) with a static 2x3x3 lookup table indexed by [is_error][genotype] [base_class]. Classify each base once as ref/alt/other, then look up all 6 conditional likelihoods (3 genotypes x 2 error states). Restructure the inner loop from genotype-outer/base-inner (processing each base 9 times) to base-outer/genotype-inner (processing each base once and accumulating into 9 genotype-pair log-likelihood accumulators). Combined with the prior Phred table change, the optimization phase is now ~4.2x faster than baseline (198s -> 47s) with bit-identical results.

ComputeMixLLKs previously performed 4-6 nested unordered_map lookups per marker per call (posIndex, ChooseBed, knownAF). Since these values are constant across all optimization iterations, resolve them once into a flat vector of ResolvedMarker structs after data loading. The hot loop now uses a single integer comparison to skip absent markers and direct array indexing for base/quality data and alt alleles. ~27% reduction in optimization phase runtime (47s -> 34s), bringing the cumulative speedup to ~5.8x vs baseline.

Within a single ComputeMixLLKs call, alpha is constant. The per-base log-likelihood depends only on four discrete inputs: base class (ref/alt/other), quality score (Phred 0-93), and the genotype pair (g1, g2). That's 3 x 94 x 9 = 2,538 unique values. Build a lookup table of these 2,538 log() results once at the top of each ComputeMixLLKs call, then replace per-base log() calls with table additions. This typically replaces ~1.3M log() calls with 2,538. ~74% reduction in optimization phase runtime (34s -> 9s), bringing the cumulative speedup to ~22x vs the original baseline (198s -> 9s).

In C++11, a static constexpr data member declared inside a class requires an out-of-class definition if it is ODR-used. GCC enforces this strictly while Clang does not, causing an undefined reference to COND_LK on Ubuntu CI builds with GCC.

Copilot

Pull request overview

This PR focuses on speeding up VerifyBamID2’s FREEMIX optimization when estimating contamination from BAM/CRAM/pileup inputs by reducing repeated work inside the hottest likelihood loop.

Changes:

Adds phase-level timing instrumentation (RAII PhaseTimer) around major execution and optimization steps.
Optimizes ComputeMixLLKs() by replacing repeated computation (pow, log, nested map lookups, conditional branching) with compact lookup tables and pre-resolved per-marker metadata.
Expands CTest coverage to exercise additional optimization modes and adds corresponding golden expected outputs.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
`main.cpp`	Adds phase timing around SVD loading, input reading, sanity check, and optimization.
`ContaminationEstimator.h`	Introduces Phred lookup table, conditional likelihood lookup table, per-call log-likelihood table, and uses `resolvedMarkers` + direct base/qual indexing in the hot loop.
`ContaminationEstimator.cpp`	Adds optimization sub-phase timers and precomputes `resolvedMarkers` once before optimization.
`CMakeLists.txt`	Adds new tests for WithinAncestry/FixPC/FixAlpha/Heter FixPC paths with golden diffs.
`resource/test/expected/test.*.Ancestry`	Adds new golden expected ancestry output files for the added test cases.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- getPhredTable(): replace manual 'static bool initialized' guard with C++11 thread-safe local static initialization via an immediately-invoked lambda, so concurrent first callers can never observe a partially- initialized table. - ComputeMixLLKs(): clamp the per-base quality score to [0, 93] before indexing logLkTable. BAM qualities are guaranteed in-range but pileup input is ASCII and a malformed file could drive the index out of bounds. - CTests: give each newly added test pair its own --Output prefix and add a DEPENDS property on the Diff test so CTest can safely run tests in parallel without racing on shared result.* filenames or sequencing a Diff before its Run.

Extends the --Output + DEPENDS pattern from the previous commit to the six existing tests. myTest1, myTest3, and myTest5 all wrote to the shared result.Ancestry / result.selfSM filenames, and their paired Diff tests (myTest2/4/6) had no explicit ordering, so the suite was only reliable when CTest happened to run tests serially. This race was discovered while running the suite under `ctest -j8` to confirm the newly-added tests from the previous commit behaved correctly under parallel execution: the new tests all passed, but myTest2/4/6 began failing intermittently as the older Run tests clobbered each other's output. Giving each Run its own --Output prefix and adding DEPENDS on each Diff test makes the full suite parallel-safe (15/15 passing with -j8).

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Griffan

This is a long overdue optimization, thanks! @tfenne

tfenne added 7 commits April 14, 2026 10:39

Avoid unnecessary vector copies in ComputeMixLLKs hot loop

f0e0367

GetBaseInfoAt/GetQualInfoAt return references but the loop was assigning to value types, causing a deep copy per marker per call. Use const references instead.

tfenne changed the title ~~Tf perf tuning~~ Performance tuning of FREEMIX calculations from BAM Apr 14, 2026

Griffan requested a review from Copilot April 17, 2026 07:09

Copilot started reviewing on behalf of Griffan April 17, 2026 07:09 View session

Copilot AI reviewed Apr 17, 2026

View reviewed changes

Comment thread ContaminationEstimator.h Outdated

Comment thread ContaminationEstimator.h

Comment thread CMakeLists.txt Outdated

tfenne added 2 commits April 17, 2026 05:03

Griffan requested a review from Copilot April 19, 2026 02:01

Copilot started reviewing on behalf of Griffan April 19, 2026 02:01 View session

Copilot AI reviewed Apr 19, 2026

View reviewed changes

Comment thread ContaminationEstimator.h

Griffan self-requested a review April 19, 2026 02:08

Griffan approved these changes Apr 19, 2026

View reviewed changes

Griffan merged commit 1cb0101 into Griffan:master Apr 19, 2026
10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance tuning of FREEMIX calculations from BAM#82

Performance tuning of FREEMIX calculations from BAM#82
Griffan merged 10 commits into
Griffan:masterfrom
tfenne:tf_perf_tuning

tfenne commented Apr 14, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Griffan left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

tfenne commented Apr 14, 2026

Summary

Benchmark

Key design decisions

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Griffan left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants