Skip to content

perf(commons): C++ Karmarkar-Karp partitioner releases GIL#390

Open
JacoCheung wants to merge 1 commit into
NVIDIA:mainfrom
JacoCheung:junzhang/cpp_kk_partitioner
Open

perf(commons): C++ Karmarkar-Karp partitioner releases GIL#390
JacoCheung wants to merge 1 commit into
NVIDIA:mainfrom
JacoCheung:junzhang/cpp_kk_partitioner

Conversation

@JacoCheung
Copy link
Copy Markdown
Collaborator

@JacoCheung JacoCheung commented May 14, 2026

Summary

KK runs on every load-balanced batch shuffle. The pure-Python
implementation in examples/commons/perf_model/partitioner.py
uses heapq with Python __lt__ callbacks, which keep the GIL the
whole time KK runs in the background ThreadPoolExecutor. nsys shows a
visible "Waiting for GIL" gap on the main thread (~1.5 ms) during the
karmarkar_karp NVTX range — the main thread cannot keep submitting
CUDA kernels while KK contends the GIL.

This PR ports KK to a pybind11 C++ extension (kk_cpu_ops) with
py::gil_scoped_release around the compute. Output is bit-for-bit
identical to the Python implementation (same Set.__lt__ /
State.__lt__ tie-breaking). The Python implementation is kept as a
fallback when the extension is not built, gated by KK_FORCE_PYTHON=1
for parity testing.

Why this matters

  • GIL bubble eliminated. nsys karmarkar_karp NVTX duration on
    the kk worker thread drops from Python's ms-scale (~1.5 ms, with
    ~1.36 ms of "Waiting for GIL" on the main thread) to
    avg 135 μs / max 270 μs under C++ — 50× shorter, and the
    main-thread GIL-wait gap is gone.

  • Drop-in for production training. commons.perf_model.partitioner.karmarkar_karp
    signature is unchanged; the resolution order is:

    1. honour KK_FORCE_PYTHON=1 (escape hatch / parity tests)
    2. top-level import kk_cpu_ops (matches the
      python setup.py install layout picked up automatically when
      the Docker image is rebuilt via examples/commons/setup.py)
    3. sibling .so next to the perf_model package (matches
      python setup.py build_ext --inplace for dev iteration)
    4. fall back to the pure-Python implementation otherwise
  • Isolated micro-benchmark at n=k*4096=32768, k=8: Python median
    237.6 ms → C++ median 13.5 ms (17.6× speedup).

Implementation

  • examples/commons/perf_model/csrc/kk_partition.cpp
    Faithful port of the Karp-Karmarkar heap with the same tie-breaking
    rules. py::gil_scoped_release release covers the whole compute
    (sort + std::make_heap / std::pop_heap / std::push_heap
    loop, no Python callbacks anywhere).

  • examples/commons/perf_model/partitioner.py
    Adds a thin wrapper that probes for kk_cpu_ops and dispatches to
    C++ when available, otherwise to the renamed _karmarkar_karp_python
    helper. No call-site changes.

  • examples/commons/setup.py
    Adds the kk_cpu_ops CppExtension alongside the existing CUDA
    extensions. Introduces BUILD_EXT_ONLY=kk_cpu_ops env knob to
    rebuild just the CPU partitioner without spending minutes on the
    CUDA extensions.

  • examples/tests/commons/test_kk_partitioner.py
    New parity + sufficiency test file:

    • 16 parity cases across k = 2/4/8/16 × n/k = 1/4/32/128
    • 3 unequal-size parity cases
    • Adversarial: all-equal workloads, deliberate len(items) ties
    • Sufficiency invariants: partitions cover [0, n) exactly once;
      KK beats naive chunking on skewed input
    • test_gil_released_during_compute: runs C++ KK in a
      background thread while the main thread runs a tight Python
      compute loop, asserts main-thread retains > 50% of its
      GIL-free baseline throughput — proves the release actually
      fires.

Verification

Result
26/26 parity tests pytest tests/commons/test_kk_partitioner.py -q → all pass
Isolated speedup 17.6× (Python 237 ms → C++ 13.5 ms at n=32k, k=8)
nsys NVTX karmarkar_karp C++ avg 135 μs / max 270 μs (was ms-scale)
Main-thread "Waiting for GIL" gap gone (was ~1.36 ms in the original screenshot)
Parity-fallback KK_FORCE_PYTHON=1 bypasses C++; tests still pass
E2E with --nsys ~3.86 ms / iter step-time saving in the profile-visible regime
E2E without --nsys within measurement noise; KK is fully overlapped by forward in production mode (expected — 135 μs vs 32 ms forward)

Test plan

  • Local parity tests pass in container (26/26)
  • Build inplace + production PYTHONPATH import works
  • E2E with --nsys shows GIL bubble eliminated and ~3.86 ms/iter step-time saving
  • E2E without --nsys shows no regression
  • /build to confirm new container builds kk_cpu_ops cleanly
  • CI green

🤖 Generated with Claude Code

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 14, 2026

Greptile Summary

This PR ports the Karmarkar-Karp k-way partitioner from pure Python (with GIL-holding heapq callbacks) to a pybind11 C++ extension (kk_cpu_ops) that releases the GIL for the entire compute, eliminating a ~1.36 ms main-thread GIL-wait gap visible in nsys traces. The Python implementation is kept as a fallback and the public karmarkar_karp signature is unchanged.

  • kk_partition.cpp: C++ port with py::gil_scoped_release covering all heap operations; Set::operator< and State::operator< faithfully replicate Python's tie-breaking rules (sum → size → lexicographic on items; spread → sets[0]), verified bit-for-bit identical by 26 parity tests.
  • partitioner.py: New dispatcher layer probes for kk_cpu_ops via stdlib import then inplace-glob fallback, with sys.modules registration and broad exception handling for stale .so files; NVTX range and equal_size assertion are lifted into the wrapper so both paths are covered.
  • setup.py: Adds CppExtension for kk_cpu_ops and a BUILD_EXT_ONLY env knob so the CPU-only extension can be rebuilt inside a container without recompiling CUDA extensions.

Confidence Score: 5/5

Safe to merge — the C++ port faithfully replicates Python's tie-breaking, the GIL release is correctly scoped with RAII, and the Python fallback remains intact for environments without the extension.

The tie-breaking analysis (sum to size to lexicographic on items in Set; spread to sets[0] in State) matches between Python and C++ exactly, as confirmed by the 26-case parity test suite. The dispatcher, fallback chain, and build system changes are all additive and well-guarded. Only minor const-correctness and validation-placement style points were found.

No files require special attention.

Important Files Changed

Filename Overview
examples/commons/perf_model/csrc/kk_partition.cpp New C++ pybind11 KK implementation with correct GIL release, accurate tie-breaking, and valid heap semantics; minor const-correctness and validation-placement style issues.
examples/commons/perf_model/partitioner.py Dispatcher wrapper cleanly handles C++ / Python fallback with robust error handling, sys.modules registration, and preserved NVTX range; no issues.
examples/commons/setup.py Adds CppExtension for kk_cpu_ops and BUILD_EXT_ONLY env knob; refactoring is correct and the unknown-extension guard prevents silent misconfiguration.
examples/tests/commons/test_kk_partitioner.py Comprehensive parity, sufficiency, and GIL-release test suite; tests correctly skip when extension is not built; no issues beyond timing sensitivity already noted in prior review thread.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant Wrapper as "karmarkar_karp (Python)"
    participant CPP as "kk_cpu_ops (C++)"
    participant PyKK as "_karmarkar_karp_python"

    Caller->>Wrapper: workloads, k, equal_size
    Wrapper->>Wrapper: nvtx.range_push / tolist()
    alt C++ extension available
        Wrapper->>CPP: karmarkar_karp(list, k, equal_size)
        Note over CPP: py::gil_scoped_release sort + heap loop (GIL free)
        CPP-->>Wrapper: partitions
    else fallback
        Wrapper->>PyKK: workloads, k, equal_size
        Note over PyKK: heapq + Python callbacks (GIL held)
        PyKK-->>Wrapper: partitions
    end
    Wrapper->>Wrapper: equal_size assert / nvtx.range_pop
    Wrapper-->>Caller: partitions
Loading

Reviews (4): Last reviewed commit: "perf(commons): C++ Karmarkar-Karp partit..." | Re-trigger Greptile

Comment thread examples/tests/commons/test_kk_partitioner.py Outdated
Comment on lines +67 to +71
if _matches:
_spec = _importlib_util.spec_from_file_location("kk_cpu_ops", _matches[0])
if _spec is not None and _spec.loader is not None:
_kk_cpu_ops = _importlib_util.module_from_spec(_spec)
_spec.loader.exec_module(_kk_cpu_ops)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Inplace .so loaded without sys.modules registration

When the fallback exec_module path fires, _kk_cpu_ops holds the loaded extension but it is never inserted into sys.modules["kk_cpu_ops"]. If any code later does a plain import kk_cpu_ops it will raise ImportError (the module exists in memory but is unreachable via the import system), and a importlib.reload(partitioner) would execute the .so a second time. Adding sys.modules.setdefault("kk_cpu_ops", _kk_cpu_ops) after exec_module would make the loaded module discoverable and prevent double-loading.

@JacoCheung JacoCheung force-pushed the junzhang/cpp_kk_partitioner branch from 209709d to 26511b0 Compare May 14, 2026 03:41
Comment thread examples/commons/perf_model/partitioner.py Outdated
@JacoCheung JacoCheung force-pushed the junzhang/cpp_kk_partitioner branch from 26511b0 to 05a513d Compare May 14, 2026 03:49
KK runs on every load-balanced batch shuffle.  The pure-Python
implementation uses heapq with Python __lt__ callbacks, which keep
the GIL the whole time KK runs in the background ThreadPoolExecutor.
nsys shows a visible "Waiting for GIL" gap on the main thread
(~1.5 ms) during the karmarkar_karp NVTX range — the main thread
cannot keep submitting CUDA kernels while KK contends the GIL.

This commit ports KK to a pybind11 C++ extension (kk_cpu_ops) with
`py::gil_scoped_release` around the compute.  Output is bit-for-bit
identical to the Python implementation (same Set.__lt__ /
State.__lt__ tie-breaking).  partitioner.py resolution order:

  1. honour KK_FORCE_PYTHON=1 (escape hatch / parity tests)
  2. top-level import kk_cpu_ops (matches `python setup.py install`
     layout — picked up automatically when the Docker image is
     rebuilt via examples/commons/setup.py)
  3. sibling .so next to the perf_model package (matches
     `python setup.py build_ext --inplace` during dev iteration)
  4. fall back to the pure-Python implementation otherwise

`BUILD_EXT_ONLY=kk_cpu_ops` env knob in setup.py rebuilds just the
CPU partitioner without spending minutes on the CUDA extensions.

Measurements:

  Isolated micro-benchmark (n=k*4096=32768, k=8):
      Python median 237.6 ms  →  C++ median 13.5 ms   (17.6x)

  nsys NVTX karmarkar_karp duration on the kk worker thread:
      Python ms-scale  →  C++ avg 135 us / max 270 us (50x shorter)

  Main-thread "Waiting for GIL" gap during the karmarkar_karp NVTX
  range (per the original nsys screenshot, ~1.36 ms): gone under C++.

Tests (26 cases in examples/tests/commons/test_kk_partitioner.py):

  - 16 parity cases across k = 2/4/8/16 x n/k = 1/4/32/128
  - 3 unequal-size parity cases
  - Adversarial: all-equal workloads, deliberate len(items) ties
  - Sufficiency invariants: partitions cover [0, n) exactly once;
    KK beats naive chunking on skewed input.
  - test_gil_released_during_compute runs C++ KK in a background
    thread while the main thread runs a tight Python compute loop,
    asserts the main thread retains > 50 % of its GIL-free baseline
    throughput — proves the release actually fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JacoCheung JacoCheung force-pushed the junzhang/cpp_kk_partitioner branch from 05a513d to 0add388 Compare May 14, 2026 08:37
@JacoCheung JacoCheung requested a review from shijieliu May 14, 2026 14:02
@JacoCheung
Copy link
Copy Markdown
Collaborator Author

/build

@JacoCheung
Copy link
Copy Markdown
Collaborator Author

JacoCheung commented May 14, 2026

Pipeline #51290628 -- failed

Job Status Log
pre_check ✅ success view
train_build ❌ failed view
inference_build ✅ success view
tritonserver_build ❌ failed view
build_whl ❌ failed view
dynamicemb_test_fwd_bwd_8gpus ❌ failed view
dynamicemb_test_load_dump_8gpus ❌ failed view
unit_test_1gpu_a100 ❌ failed view
unit_test_1gpu_h100 ❌ failed view
unit_test_4gpu ❌ failed view
unit_test_tp_4gpu ❌ failed view
L20_unit_test_1gpu ❌ failed view
inference_unit_test_1gpu ❌ failed view
inference_test_1gpu ❌ failed view

Result: 2/14 jobs passed

View full pipeline

@JacoCheung
Copy link
Copy Markdown
Collaborator Author

/build

@JacoCheung
Copy link
Copy Markdown
Collaborator Author

JacoCheung commented May 14, 2026

Pipeline #51299141 -- failed

Job Status Log
pre_check ✅ success view
train_build ✅ success view
inference_build ✅ success view
tritonserver_build ✅ success view
build_whl ✅ success view
dynamicemb_test_fwd_bwd_8gpus ❌ failed view
dynamicemb_test_load_dump_8gpus ❌ failed view
unit_test_1gpu_a100 ❌ failed view
unit_test_1gpu_h100 ✅ success view
unit_test_4gpu ✅ success view
unit_test_tp_4gpu ❌ failed view
L20_unit_test_1gpu ✅ success view
inference_unit_test_1gpu ❌ failed view
inference_test_1gpu ✅ success view

Result: 9/14 jobs passed

View full pipeline

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant