perf(patchwork): TBB parallel_for in classic Patchwork (+1.73× Hz); Patchwork++ stays sequential after benchmark#95
Merged
Conversation
The classic Patchwork main loop in cpp/patchwork/src/patchwork.cpp is
now parallelised with a single tbb::parallel_for over all
(zone, ring, sector) patches, mirroring the upstream ~/git/patchwork
pattern. Per-patch work (sort + plane fit + GLE) runs in worker
threads; a serial reduction then accumulates ground / nonground in
deterministic order.
median per-frame time on KITTI seq 00 (i7-12700, 24 logical cores):
single-thread (taskset -c 0) 8.31 ms (120.4 Hz)
parallel (default TBB scheduler) 4.81 ms (207.8 Hz)
→ 1.73x speedup
Patchwork++ (cpp/patchworkpp/src/patchworkpp.cpp) was also benchmarked
under the same TBB pattern at 1 / 2 / 4 / 8 / 16 / 24 threads. Every
multi-thread configuration was SLOWER than single-thread on KITTI:
1 thread → 111 Hz (baseline)
2 threads → 93 Hz
4 threads → 91 Hz
8 threads → 91 Hz
16 threads → 85 Hz
24 threads → 69 Hz
The per-patch work in Patchwork++ is small (~14 µs avg) and dominated
by short-lived std::vector / Eigen::Matrix allocations inside R-VPF
and R-GPF. Concurrent malloc serialises on the heap allocator and TBB
scheduler overhead exceeds the parallelisation benefit at every
thread count. Single-threaded Patchwork++ already runs at ~2x the
paper's reported 55 Hz on i7-7700K, so there is no real-time
motivation to parallelise. Patchwork++ remains single-threaded; the
estimateGround loop has a long-form comment explaining why.
Numerical equivalence verified on KITTI 00-10 full sweep (23,201
frames), both methods, Patchwork++ paper protocol:
patchwork x pp protocol pre: 96.0172 post: 96.0172 (byte-identical)
patchwork++ x pp protocol pre: 96.2918 post: 96.2919 (Δ +0.0001)
Both within the ±0.05 budget set in the refactor plan.
Build:
- Adds find_package(TBB CONFIG/MODULE REQUIRED) to cpp/CMakeLists.txt
with a helpful error message listing the install command for
Ubuntu / macOS / Windows.
- cpp/patchwork/CMakeLists.txt links TBB::tbb; cpp/patchworkpp/ does
not (since it does not use TBB).
Also adds:
- python/examples/bench_hz.py — small per-frame timing harness that
reports median / mean / p95 / p99 ms and Hz from getTimeTaken().
- A `const` qualifier on PatchWork::extract_initial_seeds and
PatchWork::perform_regionwise_segmentation since neither writes to
*this any more — needed so the TBB worker can call them.
CI runners (cpp_api on Ubuntu/macOS/Windows, python_package and cibuildwheel jobs) do not install libtbb-dev, so PR #95's FATAL_ERROR on missing TBB broke 11 of 18 checks. Switch to a soft find: - find_package(TBB CONFIG/MODULE QUIET) — sets TBB_FOUND or not. - When TBB_FOUND, classic Patchwork links TBB::tbb and gets a PATCHWORK_HAS_TBB compile define. - cpp/patchwork/src/patchwork.cpp now guards both the #include and the parallel_for site with #ifdef PATCHWORK_HAS_TBB and falls back to a sequential loop over the same patch-index list when TBB is unavailable. - cpp/CMakeLists.txt prints a STATUS message either way so users know whether they got the 1.73x speedup or not. Tested locally: - With libtbb-dev installed: "-- TBB found — classic Patchwork will use tbb::parallel_for." → builds + runs, matches v1.3.1 numbers. - With -DCMAKE_DISABLE_FIND_PACKAGE_TBB=ON: "-- TBB not found — classic Patchwork falls back to a sequential loop." → builds clean, no TBB symbols required. Patchwork++ remains untouched (issue #96).
Merged
LimHyungTae
added a commit
that referenced
this pull request
May 21, 2026
Minor bump for the common-library refactor (#94) and the optional TBB parallelisation of classic Patchwork (#95). pypatchworkpp.patchwork gains 1.73x (120 -> 208 Hz on KITTI seq 00, i7-12700). pypatchworkpp.patchworkpp stays sequential after measurement — see #96 for the why. TBB is an optional build dependency: missing TBB falls back to a sequential loop with a CMake STATUS message; CI/wheel builds keep working without libtbb-dev installed. Bumps: - python/pyproject.toml 1.3.1 -> 1.4.0 - cpp/CMakeLists.txt 1.3.1 -> 1.4.0 CHANGELOG.md updated with the full v1.4.0 entry. See #94, #95, #96.
6 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Part B of the TBB / common-library refactor agreed in #89 (Part A is #94). Adds
tbb::parallel_forto the classic Patchwork main loop and benchmarks Patchwork++ under the same pattern. Patchwork++ does not become parallel — measurement showed it hurts at every thread count.Headline numbers (KITTI seq 00, i7-12700, 24 logical cores)
Per-patch work in Patchwork++ is too small (~14 µs avg) and dominated by short-lived
std::vector/Eigen::Matrixallocations inside R-VPF + R-GPF. Concurrent malloc serialises on the heap allocator, and TBB scheduler overhead exceeds the parallelisation benefit at every thread count. Single-threaded Patchwork++ already runs at ~2× the paper's reported 55 Hz on i7-7700K, so there is no real-time motivation to parallelise. The decision is documented in a long-form code comment inestimateGround.Numerical equivalence — passes
KITTI 00-10 full sweep (23,201 frames), Patchwork++ paper protocol:
--method patchwork× pp proto--method patchworkpp× pp protoBoth well under the ±0.05 macro budget set in the refactor plan.
Build changes
cpp/CMakeLists.txt:find_package(TBB CONFIG REQUIRED)(falls back to MODULE), with an error message listing the install command for Ubuntu / macOS / Windows if TBB is missing.cpp/patchwork/CMakeLists.txt: linksTBB::tbb.cpp/patchworkpp/CMakeLists.txt: unchanged (no TBB use).libtbb-devon Ubuntu via apt; macOS / Windows wheel builds should pick TBB up automatically — I'll watch the CI matrix here.Adds
python/examples/bench_hz.py— small per-frame timing harness that reports median / mean / p95 / p99 ms and Hz fromgetTimeTaken(). Useful for future perf work.Code changes summary
cpp/patchwork/src/patchwork.cpp: replace the triple-nested for loop with a flattbb::parallel_forover a pre-built patch index list; serial reduction phase walks the outcome buffer in deterministic order.cpp/patchwork/include/patchwork/patchwork.h:extract_initial_seedsandperform_regionwise_segmentationare nowconst(they were already pure functions of their arguments +params_; needed so the TBB worker can call them).cpp/patchworkpp/src/patchworkpp.cpp: unchanged behaviour, but the main loop now has the "why we stayed sequential" comment.Test plan
pip install -v ./python/builds.pre-commit runclean.Follow-up
After this lands, plan is to cut v1.4.0 (per the refactor plan agreed earlier):
python/pyproject.toml+cpp/CMakeLists.txtto 1.4.0.Cross-refs: #89, #94.