The current run_mcmc_chains implementation in BARTSampler and BCFSampler loops over chains sequentially, restoring GFR snapshots between chains to give each a distinct starting point. The GFR snapshot mechanism is correct but the chains themselves run one after another.
This issue adds true parallel execution: each chain runs on its own thread with no shared mutable state (each chain owns its own sampler state, RNG, and BARTSamples/BCFSamples result object). Concretely, replace the sequential for-loop in run_mcmc_chains with a #pragma omp parallel for schedule(static) loop over chains. Iteration i restores from gfr_snapshots_[num_chains - 1 - i] and writes into samples.chain_slice(i), so no coordination between threads is needed.
The existing warning about within-chain multi-threading conflicting with cross-chain parallelism should be preserved (raise a warning if num_threads > 1 and num_chains > 1).
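The loop structure described above can be sketched with a toy sampler. This is a minimal illustration, not the actual BARTSampler/BCFSampler code: ToySnapshot, ToyChainState, and run_toy_chains are hypothetical stand-ins, and the "sampler" is just a random walk. It shows the three properties the issue relies on: chain-local state and RNG, reversed snapshot indexing, and preallocated per-chain output slots.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

// Hypothetical stand-in for a GFR snapshot: only a starting value here.
struct ToySnapshot { double start; };

// Hypothetical stand-in for per-chain sampler state; each chain owns one.
struct ToyChainState {
  double value;
  std::mt19937_64 rng;
};

// Runs one toy chain per seed. Draws for chain i land in row i of the result.
std::vector<std::vector<double>> run_toy_chains(
    const std::vector<ToySnapshot>& snapshots,
    const std::vector<std::uint64_t>& seeds,
    int num_mcmc, int num_threads) {
  assert(snapshots.size() == seeds.size());
  const int num_chains = static_cast<int>(seeds.size());

  // Preserve the warning about mixing within-chain and cross-chain threading.
  if (num_threads > 1 && num_chains > 1) {
    std::cerr << "warning: num_threads > 1 with num_chains > 1\n";
  }

  // Preallocate every row so no thread resizes shared storage.
  std::vector<std::vector<double>> draws(
      num_chains, std::vector<double>(num_mcmc));

  // Without OpenMP enabled the pragma is ignored and the loop runs serially.
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < num_chains; ++i) {
    // All mutable state is local to this iteration.
    ToyChainState state{snapshots[num_chains - 1 - i].start,
                        std::mt19937_64(seeds[i])};
    for (int t = 0; t < num_mcmc; ++t) {
      // Map raw RNG bits to [0, 1) so results do not depend on the
      // standard library's distribution implementation.
      const double u =
          static_cast<double>(state.rng() >> 11) * (1.0 / 9007199254740992.0);
      state.value += u - 0.5;  // toy random-walk update
      draws[i][t] = state.value;
    }
  }
  return draws;
}
```

Because each iteration depends only on its own seed and snapshot, the output is independent of how iterations are assigned to threads, which is what makes a parallel-versus-sequential comparison meaningful.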
Acceptance criteria:
- With num_chains = 4 and the same four seeds, parallel results match sequential results within floating-point tolerance
- Wall-time speedup >= 3x with 4 chains on a 4-core machine
- All existing R and Python BART and BCF tests pass
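
For the first criterion, one possible shape for the comparison is sketched below. The helper name draws_match and the default tolerances are assumptions; the issue does not specify a tolerance, so the values here are placeholders to be chosen by the implementer.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: true when two flattened draw vectors agree
// element-wise within a combined absolute and relative tolerance.
bool draws_match(const std::vector<double>& sequential,
                 const std::vector<double>& parallel,
                 double rel_tol = 1e-12, double abs_tol = 1e-12) {
  assert(rel_tol >= 0.0 && abs_tol >= 0.0);
  if (sequential.size() != parallel.size()) return false;
  for (std::size_t k = 0; k < sequential.size(); ++k) {
    const double a = sequential[k];
    const double b = parallel[k];
    // Treat any NaN as a mismatch rather than silently passing.
    if (std::isnan(a) || std::isnan(b)) return false;
    const double scale = std::max(std::fabs(a), std::fabs(b));
    if (std::fabs(a - b) > abs_tol + rel_tol * scale) return false;
  }
  return true;
}
```

A test would run the sampler once sequentially and once in parallel with the same four seeds, flatten each chain's draws, and assert draws_match on every chain.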