Performance tuning of FREEMIX calculations from BAM#82
Merged
Conversation
Add RAII PhaseTimer that logs start/end/duration for each major phase to stderr via notice(). Phases timed in main.cpp: SVD loading, BAM/pileup reading, sanity check, and overall optimization. Sub-phases timed in ContaminationEstimator.cpp: likelihood initialization, each optimizer variant, and null-model LLK calculation.
Previously only the default BetweenAncestry path (OptimizeHeter + OptimizeHomo initial) was tested. Add tests covering: - WithinAncestry (pure OptimizeHomo) - WithinAncestry + FixPC (OptimizeHomoFixedPC) - FixAlpha (OptimizeHomoFixedAlpha initial + OptimizeHeterFixedAlpha) - FixPC in BetweenAncestry (OptimizeHeterFixedPC) This brings coverage from 1 of 6 optimization code paths to all 6, providing a safety net for upcoming performance optimizations.
GetBaseInfoAt/GetQualInfoAt return references but the loop was assigning to value types, causing a deep copy per marker per call. Use const references instead.
Quality scores are discrete integers in [0,93]. Precompute a 94-entry table of pow(10.0, x/-10.0) once, then use table lookups instead of calling pow() in the hot loop. Also hoist the lookup above the 3x3 genotype loop so each base's error probability is fetched once instead of 9 times. ~31% reduction in optimization phase runtime on a 14K-marker panel.
Replace getConditionalBaseLK (50-line branchy function called 36x per base) with a static 2x3x3 lookup table indexed by [is_error][genotype] [base_class]. Classify each base once as ref/alt/other, then look up all 6 conditional likelihoods (3 genotypes x 2 error states). Restructure the inner loop from genotype-outer/base-inner (processing each base 9 times) to base-outer/genotype-inner (processing each base once and accumulating into 9 genotype-pair log-likelihood accumulators). Combined with the prior Phred table change, the optimization phase is now ~4.2x faster than baseline (198s -> 47s) with bit-identical results.
ComputeMixLLKs previously performed 4-6 nested unordered_map lookups per marker per call (posIndex, ChooseBed, knownAF). Since these values are constant across all optimization iterations, resolve them once into a flat vector of ResolvedMarker structs after data loading. The hot loop now uses a single integer comparison to skip absent markers and direct array indexing for base/quality data and alt alleles. ~27% reduction in optimization phase runtime (47s -> 34s), bringing the cumulative speedup to ~5.8x vs baseline.
Within a single ComputeMixLLKs call, alpha is constant. The per-base log-likelihood depends only on four discrete inputs: base class (ref/alt/other), quality score (Phred 0-93), and the genotype pair (g1, g2). That's 3 x 94 x 9 = 2,538 unique values. Build a lookup table of these 2,538 log() results once at the top of each ComputeMixLLKs call, then replace per-base log() calls with table additions. This typically replaces ~1.3M log() calls with 2,538. ~74% reduction in optimization phase runtime (34s -> 9s), bringing the cumulative speedup to ~22x vs the original baseline (198s -> 9s).
In C++11, a static constexpr data member declared inside a class requires an out-of-class definition if it is ODR-used. GCC enforces this strictly while Clang does not, causing an undefined reference to COND_LK on Ubuntu CI builds with GCC.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR focuses on speeding up VerifyBamID2’s FREEMIX optimization when estimating contamination from BAM/CRAM/pileup inputs by reducing repeated work inside the hottest likelihood loop.
Changes:
- Adds phase-level timing instrumentation (RAII
PhaseTimer) around major execution and optimization steps. - Optimizes
ComputeMixLLKs()by replacing repeated computation (pow,log, nested map lookups, conditional branching) with compact lookup tables and pre-resolved per-marker metadata. - Expands CTest coverage to exercise additional optimization modes and adds corresponding golden expected outputs.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
main.cpp |
Adds phase timing around SVD loading, input reading, sanity check, and optimization. |
ContaminationEstimator.h |
Introduces Phred lookup table, conditional likelihood lookup table, per-call log-likelihood table, and uses resolvedMarkers + direct base/qual indexing in the hot loop. |
ContaminationEstimator.cpp |
Adds optimization sub-phase timers and precomputes resolvedMarkers once before optimization. |
CMakeLists.txt |
Adds new tests for WithinAncestry/FixPC/FixAlpha/Heter FixPC paths with golden diffs. |
resource/test/expected/test.*.Ancestry |
Adds new golden expected ancestry output files for the added test cases. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- getPhredTable(): replace manual 'static bool initialized' guard with C++11 thread-safe local static initialization via an immediately-invoked lambda, so concurrent first callers can never observe a partially- initialized table. - ComputeMixLLKs(): clamp the per-base quality score to [0, 93] before indexing logLkTable. BAM qualities are guaranteed in-range but pileup input is ASCII and a malformed file could drive the index out of bounds. - CTests: give each newly added test pair its own --Output prefix and add a DEPENDS property on the Diff test so CTest can safely run tests in parallel without racing on shared result.* filenames or sequencing a Diff before its Run.
Extends the --Output + DEPENDS pattern from the previous commit to the six existing tests. myTest1, myTest3, and myTest5 all wrote to the shared result.Ancestry / result.selfSM filenames, and their paired Diff tests (myTest2/4/6) had no explicit ordering, so the suite was only reliable when CTest happened to run tests serially. This race was discovered while running the suite under `ctest -j8` to confirm the newly-added tests from the previous commit behaved correctly under parallel execution: the new tests all passed, but myTest2/4/6 began failing intermittently as the older Run tests clobbered each other's output. Giving each Run its own --Output prefix and adding DEPENDS on each Diff test makes the full suite parallel-safe (15/15 passing with -j8).
Contributor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Hi @Griffan - I've been running VerifyBamID2 on some fairly large panels (because my goal is to get to very low coverage), and spent some time working on the runtime of the process for calculating FREEMIX from a BAM file. This is all work done in partnership with Claude. I've done my best to organize this into atomic commits, with some additional logging and tests developed up front. These changes to not alter the outputs at all for the tests or my benchmarking real-world test case with ~5x coverage of 14k markers.
Please feel free to push back on any of this that you don't like, or would like restructured, better documented, better tested etc. But the end results are pretty compelling - I'm at ~20-fold speed up (i.e. 95% reduction) of the optimization/post BAM-reading phase running on an aarch64 mac, and I expect that the results on x86 will be broadly similar.
Summary
The Amoeba minimizer calls
ComputeMixLLKs()up to 50,000 times during optimization, and each call iterates over every marker in the reference panel. As panels grow, this becomes the dominant cost. This PR applies a series of targeted optimizations to the inner loop ofComputeMixLLKs()— replacing expensive per-call computations with precomputed tables and eliminating redundant work — to achieve a 22x speedup on the optimization phase with bit-identical results on all tests.All optimizations exploit the same insight: the inputs to the inner loop are drawn from small discrete domains, so values that were being recomputed millions of times can instead be looked up from compact tables built once per call or once at startup.
Benchmark
Measured on a 14K-marker feline panel, single-threaded (
--NumPC 5 --NumThread 1), Release build:cf01323PhaseTimerlogs start/end/duration to stderr for each major phase (SVD loading, BAM reading, sanity check, optimization sub-phases), providing the measurement infrastructure for all subsequent changes.2a45948--WithinAncestry,--WithinAncestry --FixPC,--FixAlpha, and--FixPC, coveringOptimizeHomo,OptimizeHomoFixedPC,OptimizeHomoFixedAlpha,OptimizeHeterFixedPC, andOptimizeHeterFixedAlpha. All use exact-diff validation against golden output files.f0e0367GetBaseInfoAt/GetQualInfoAtreturn references, but the loop was assigning to value types, causing a deep copy per marker per call. Negligible runtime impact but corrects a wasteful pattern.b9e5c18Phred()pow() calls with a precomputed 94-entry lookup table. Quality scores are discrete integers in [0, 93], sopow(10.0, x/-10.0)can be tabulated once at startup. Also hoists the lookup above the 3x3 genotype loop so each base's error probability is fetched once instead of 9 times.54f08dbgetConditionalBaseLK(a 50-line branchy function called 36x per base) with a staticconstexpr2x3x3 lookup table, and restructure the inner loop from genotype-outer/base-inner to base-outer/genotype-inner. Each base is now classified once (ref/alt/other), its 6 conditional likelihoods looked up from the table, and contributions accumulated into 9 genotype-pair accumulators — instead of processing each base 9 separate times.fef0d70unordered_maplookups per marker perComputeMixLLKscall (posIndex, ChooseBed, knownAF). The hot loop now uses a single integer comparison to skip absent markers and direct array indexing for base/quality data.cc59c54ComputeMixLLKscall. Within a call, alpha is constant, solog(val)depends only on base class (3), quality score (94), and genotype pair (9) = 2,538 unique values. Building this table once replaces ~1.3Mlog()calls per invocation with table additions.Key design decisions
diffagainst golden files with no tolerance — all 15 tests pass on every commit.Test plan