Skip to content

perf(patchwork): TBB parallel_for in classic Patchwork (+1.73× Hz); Patchwork++ stays sequential after benchmark#95

Merged
LimHyungTae merged 2 commits into
masterfrom
refactor/tbb-parallel
May 21, 2026
Merged

perf(patchwork): TBB parallel_for in classic Patchwork (+1.73× Hz); Patchwork++ stays sequential after benchmark#95
LimHyungTae merged 2 commits into
masterfrom
refactor/tbb-parallel

Conversation

@LimHyungTae
Copy link
Copy Markdown
Member

What

Part B of the TBB / common-library refactor agreed in #89 (Part A is #94). Adds tbb::parallel_for to the classic Patchwork main loop and benchmarks Patchwork++ under the same pattern. Patchwork++ does not become parallel — measurement showed it hurts at every thread count.

Headline numbers (KITTI seq 00, i7-12700, 24 logical cores)

Configuration Median ms / frame Median Hz
Classic Patchwork single-thread (taskset -c 0) 8.31 120.4
Classic Patchwork parallel (this PR) 4.81 207.8 (+1.73×)
Patchwork++ single-thread 9.02 110.9
Patchwork++ — TBB at 2 threads (rejected) 10.73 93.2
Patchwork++ — TBB at 8 threads (rejected) 11.05 90.5
Patchwork++ — TBB at 24 threads (rejected) 14.55 68.7

Per-patch work in Patchwork++ is too small (~14 µs avg) and dominated by short-lived std::vector / Eigen::Matrix allocations inside R-VPF + R-GPF. Concurrent malloc serialises on the heap allocator, and TBB scheduler overhead exceeds the parallelisation benefit at every thread count. Single-threaded Patchwork++ already runs at ~2× the paper's reported 55 Hz on i7-7700K, so there is no real-time motivation to parallelise. The decision is documented in a long-form code comment in estimateGround.

Numerical equivalence — passes

KITTI 00-10 full sweep (23,201 frames), Patchwork++ paper protocol:

Configuration Master This PR ΔF1
--method patchwork × pp proto 96.0172 96.0172 0 (byte-identical)
--method patchworkpp × pp proto 96.2918 96.2919 +0.0001 (float noise)

Both well under the ±0.05 macro budget set in the refactor plan.

Build changes

  • cpp/CMakeLists.txt: find_package(TBB CONFIG REQUIRED) (falls back to MODULE), with an error message listing the install command for Ubuntu / macOS / Windows if TBB is missing.
  • cpp/patchwork/CMakeLists.txt: links TBB::tbb.
  • cpp/patchworkpp/CMakeLists.txt: unchanged (no TBB use).
  • The CI workflows already install libtbb-dev on Ubuntu via apt; macOS / Windows wheel builds should pick TBB up automatically — I'll watch the CI matrix here.

Adds

  • python/examples/bench_hz.py — small per-frame timing harness that reports median / mean / p95 / p99 ms and Hz from getTimeTaken(). Useful for future perf work.

Code changes summary

  • cpp/patchwork/src/patchwork.cpp: replace the triple-nested for loop with a flat tbb::parallel_for over a pre-built patch index list; serial reduction phase walks the outcome buffer in deterministic order.
  • cpp/patchwork/include/patchwork/patchwork.h: extract_initial_seeds and perform_regionwise_segmentation are now const (they were already pure functions of their arguments + params_; needed so the TBB worker can call them).
  • cpp/patchworkpp/src/patchworkpp.cpp: unchanged behaviour, but the main loop now has the "why we stayed sequential" comment.
  • Various small const / namespace tidying in the patchworkpp headers from the TBB prototyping that's harmless to keep.

Test plan

  • pip install -v ./python/ builds.
  • Smoke test seq 00 50 frames matches v1.3.1 baseline exactly.
  • Equivalence sweep KITTI 00-10 (table above).
  • Hz benchmark (table above), scaling test on Patchwork++ (1/2/4/8/16/24 threads, all in commit message).
  • pre-commit run clean.

Follow-up

After this lands, plan is to cut v1.4.0 (per the refactor plan agreed earlier):

Cross-refs: #89, #94.

The classic Patchwork main loop in cpp/patchwork/src/patchwork.cpp is
now parallelised with a single tbb::parallel_for over all
(zone, ring, sector) patches, mirroring the upstream ~/git/patchwork
pattern. Per-patch work (sort + plane fit + GLE) runs in worker
threads; a serial reduction then accumulates ground / nonground in
deterministic order.

  median per-frame time on KITTI seq 00 (i7-12700, 24 logical cores):
    single-thread (taskset -c 0)            8.31 ms (120.4 Hz)
    parallel (default TBB scheduler)        4.81 ms (207.8 Hz)
                                             →  1.73x speedup

Patchwork++ (cpp/patchworkpp/src/patchworkpp.cpp) was also benchmarked
under the same TBB pattern at 1 / 2 / 4 / 8 / 16 / 24 threads. Every
multi-thread configuration was SLOWER than single-thread on KITTI:

    1 thread   →  111 Hz   (baseline)
    2 threads  →   93 Hz
    4 threads  →   91 Hz
    8 threads  →   91 Hz
   16 threads  →   85 Hz
   24 threads  →   69 Hz

The per-patch work in Patchwork++ is small (~14 µs avg) and dominated
by short-lived std::vector / Eigen::Matrix allocations inside R-VPF
and R-GPF. Concurrent malloc serialises on the heap allocator and TBB
scheduler overhead exceeds the parallelisation benefit at every
thread count. Single-threaded Patchwork++ already runs at ~2x the
paper's reported 55 Hz on i7-7700K, so there is no real-time
motivation to parallelise. Patchwork++ remains single-threaded; the
estimateGround loop has a long-form comment explaining why.

Numerical equivalence verified on KITTI 00-10 full sweep (23,201
frames), both methods, Patchwork++ paper protocol:

  patchwork x pp protocol   pre: 96.0172  post: 96.0172  (byte-identical)
  patchwork++ x pp protocol pre: 96.2918  post: 96.2919  (Δ +0.0001)

Both within the ±0.05 budget set in the refactor plan.

Build:
- Adds find_package(TBB CONFIG/MODULE REQUIRED) to cpp/CMakeLists.txt
  with a helpful error message listing the install command for
  Ubuntu / macOS / Windows.
- cpp/patchwork/CMakeLists.txt links TBB::tbb; cpp/patchworkpp/ does
  not (since it does not use TBB).

Also adds:
- python/examples/bench_hz.py — small per-frame timing harness that
  reports median / mean / p95 / p99 ms and Hz from getTimeTaken().
- A `const` qualifier on PatchWork::extract_initial_seeds and
  PatchWork::perform_regionwise_segmentation since neither writes to
  *this any more — needed so the TBB worker can call them.
CI runners (cpp_api on Ubuntu/macOS/Windows, python_package and
cibuildwheel jobs) do not install libtbb-dev, so PR #95's FATAL_ERROR
on missing TBB broke 11 of 18 checks. Switch to a soft find:

  - find_package(TBB CONFIG/MODULE QUIET) — sets TBB_FOUND or not.
  - When TBB_FOUND, classic Patchwork links TBB::tbb and gets a
    PATCHWORK_HAS_TBB compile define.
  - cpp/patchwork/src/patchwork.cpp now guards both the #include and
    the parallel_for site with #ifdef PATCHWORK_HAS_TBB and falls
    back to a sequential loop over the same patch-index list when
    TBB is unavailable.
  - cpp/CMakeLists.txt prints a STATUS message either way so users
    know whether they got the 1.73x speedup or not.

Tested locally:
  - With libtbb-dev installed: "-- TBB found — classic Patchwork will
    use tbb::parallel_for." → builds + runs, matches v1.3.1 numbers.
  - With -DCMAKE_DISABLE_FIND_PACKAGE_TBB=ON: "-- TBB not found —
    classic Patchwork falls back to a sequential loop." → builds
    clean, no TBB symbols required.

Patchwork++ remains untouched (issue #96).
@LimHyungTae LimHyungTae merged commit 95f9557 into master May 21, 2026
17 checks passed
@LimHyungTae LimHyungTae deleted the refactor/tbb-parallel branch May 21, 2026 08:23
LimHyungTae added a commit that referenced this pull request May 21, 2026
Minor bump for the common-library refactor (#94) and the optional
TBB parallelisation of classic Patchwork (#95). pypatchworkpp.patchwork
gains 1.73x (120 -> 208 Hz on KITTI seq 00, i7-12700). pypatchworkpp.patchworkpp
stays sequential after measurement — see #96 for the why.

TBB is an optional build dependency: missing TBB falls back to a
sequential loop with a CMake STATUS message; CI/wheel builds keep
working without libtbb-dev installed.

Bumps:
- python/pyproject.toml  1.3.1 -> 1.4.0
- cpp/CMakeLists.txt     1.3.1 -> 1.4.0

CHANGELOG.md updated with the full v1.4.0 entry.

See #94, #95, #96.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant