perf(commons): C++ Karmarkar-Karp partitioner releases GIL by JacoCheung · Pull Request #390 · NVIDIA/recsys-examples

JacoCheung · 2026-05-14T03:30:23Z

Summary

KK runs on every load-balanced batch shuffle. The pure-Python
implementation in examples/commons/perf_model/partitioner.py
uses heapq with Python __lt__ callbacks, which keep the GIL the
whole time KK runs in the background ThreadPoolExecutor. nsys shows a
visible "Waiting for GIL" gap on the main thread (~1.5 ms) during the
karmarkar_karp NVTX range — the main thread cannot keep submitting
CUDA kernels while KK contends the GIL.

This PR ports KK to a pybind11 C++ extension (kk_cpu_ops) with
py::gil_scoped_release around the compute. Output is bit-for-bit
identical to the Python implementation (same Set.__lt__ /
State.__lt__ tie-breaking). The Python implementation is kept as a
fallback when the extension is not built, gated by KK_FORCE_PYTHON=1
for parity testing.

Why this matters

GIL bubble eliminated. nsys karmarkar_karp NVTX duration on
the kk worker thread drops from Python's ms-scale (~1.5 ms, with
~1.36 ms of "Waiting for GIL" on the main thread) to
avg 135 μs / max 270 μs under C++ — 50× shorter, and the
main-thread GIL-wait gap is gone.
Drop-in for production training. commons.perf_model.partitioner.karmarkar_karp
signature is unchanged; the resolution order is:
1. honour KK_FORCE_PYTHON=1 (escape hatch / parity tests)
2. top-level import kk_cpu_ops (matches the
  python setup.py install layout picked up automatically when
  the Docker image is rebuilt via examples/commons/setup.py)
3. sibling .so next to the perf_model package (matches
  python setup.py build_ext --inplace for dev iteration)
4. fall back to the pure-Python implementation otherwise
Isolated micro-benchmark at n=k*4096=32768, k=8: Python median
237.6 ms → C++ median 13.5 ms (17.6× speedup).

Implementation

examples/commons/perf_model/csrc/kk_partition.cpp
Faithful port of the Karp-Karmarkar heap with the same tie-breaking
rules. py::gil_scoped_release release covers the whole compute
(sort + std::make_heap / std::pop_heap / std::push_heap
loop, no Python callbacks anywhere).
examples/commons/perf_model/partitioner.py
Adds a thin wrapper that probes for kk_cpu_ops and dispatches to
C++ when available, otherwise to the renamed _karmarkar_karp_python
helper. No call-site changes.
examples/commons/setup.py
Adds the kk_cpu_ops CppExtension alongside the existing CUDA
extensions. Introduces BUILD_EXT_ONLY=kk_cpu_ops env knob to
rebuild just the CPU partitioner without spending minutes on the
CUDA extensions.
examples/tests/commons/test_kk_partitioner.py
New parity + sufficiency test file:
- 16 parity cases across k = 2/4/8/16 × n/k = 1/4/32/128
- 3 unequal-size parity cases
- Adversarial: all-equal workloads, deliberate len(items) ties
- Sufficiency invariants: partitions cover [0, n) exactly once;
  KK beats naive chunking on skewed input
- test_gil_released_during_compute: runs C++ KK in a
  background thread while the main thread runs a tight Python
  compute loop, asserts main-thread retains > 50% of its
  GIL-free baseline throughput — proves the release actually
  fires.

Verification

	Result
26/26 parity tests	`pytest tests/commons/test_kk_partitioner.py -q` → all pass
Isolated speedup	17.6× (Python 237 ms → C++ 13.5 ms at n=32k, k=8)
nsys NVTX `karmarkar_karp`	C++ avg 135 μs / max 270 μs (was ms-scale)
Main-thread "Waiting for GIL" gap	gone (was ~1.36 ms in the original screenshot)
Parity-fallback	`KK_FORCE_PYTHON=1` bypasses C++; tests still pass
E2E with `--nsys`	~3.86 ms / iter step-time saving in the profile-visible regime
E2E without `--nsys`	within measurement noise; KK is fully overlapped by forward in production mode (expected — 135 μs vs 32 ms forward)

Test plan

Local parity tests pass in container (26/26)
Build inplace + production PYTHONPATH import works
E2E with --nsys shows GIL bubble eliminated and ~3.86 ms/iter step-time saving
E2E without --nsys shows no regression
/build to confirm new container builds kk_cpu_ops cleanly
CI green

🤖 Generated with Claude Code

greptile-apps · 2026-05-14T03:36:41Z

Greptile Summary

This PR ports the Karmarkar-Karp k-way partitioner from pure Python (with GIL-holding heapq callbacks) to a pybind11 C++ extension (kk_cpu_ops) that releases the GIL for the entire compute, eliminating a ~1.36 ms main-thread GIL-wait gap visible in nsys traces. The Python implementation is kept as a fallback and the public karmarkar_karp signature is unchanged.

kk_partition.cpp: C++ port with py::gil_scoped_release covering all heap operations; Set::operator< and State::operator< faithfully replicate Python's tie-breaking rules (sum → size → lexicographic on items; spread → sets[0]), verified bit-for-bit identical by 26 parity tests.
partitioner.py: New dispatcher layer probes for kk_cpu_ops via stdlib import then inplace-glob fallback, with sys.modules registration and broad exception handling for stale .so files; NVTX range and equal_size assertion are lifted into the wrapper so both paths are covered.
setup.py: Adds CppExtension for kk_cpu_ops and a BUILD_EXT_ONLY env knob so the CPU-only extension can be rebuilt inside a container without recompiling CUDA extensions.

Confidence Score: 5/5

Safe to merge — the C++ port faithfully replicates Python's tie-breaking, the GIL release is correctly scoped with RAII, and the Python fallback remains intact for environments without the extension.

The tie-breaking analysis (sum to size to lexicographic on items in Set; spread to sets[0] in State) matches between Python and C++ exactly, as confirmed by the 26-case parity test suite. The dispatcher, fallback chain, and build system changes are all additive and well-guarded. Only minor const-correctness and validation-placement style points were found.

No files require special attention.

Important Files Changed

Filename	Overview
examples/commons/perf_model/csrc/kk_partition.cpp	New C++ pybind11 KK implementation with correct GIL release, accurate tie-breaking, and valid heap semantics; minor const-correctness and validation-placement style issues.
examples/commons/perf_model/partitioner.py	Dispatcher wrapper cleanly handles C++ / Python fallback with robust error handling, sys.modules registration, and preserved NVTX range; no issues.
examples/commons/setup.py	Adds CppExtension for kk_cpu_ops and BUILD_EXT_ONLY env knob; refactoring is correct and the unknown-extension guard prevents silent misconfiguration.
examples/tests/commons/test_kk_partitioner.py	Comprehensive parity, sufficiency, and GIL-release test suite; tests correctly skip when extension is not built; no issues beyond timing sensitivity already noted in prior review thread.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant Wrapper as "karmarkar_karp (Python)"
    participant CPP as "kk_cpu_ops (C++)"
    participant PyKK as "_karmarkar_karp_python"

    Caller->>Wrapper: workloads, k, equal_size
    Wrapper->>Wrapper: nvtx.range_push / tolist()
    alt C++ extension available
        Wrapper->>CPP: karmarkar_karp(list, k, equal_size)
        Note over CPP: py::gil_scoped_release sort + heap loop (GIL free)
        CPP-->>Wrapper: partitions
    else fallback
        Wrapper->>PyKK: workloads, k, equal_size
        Note over PyKK: heapq + Python callbacks (GIL held)
        PyKK-->>Wrapper: partitions
    end
    Wrapper->>Wrapper: equal_size assert / nvtx.range_pop
    Wrapper-->>Caller: partitions

_{Reviews (4): Last reviewed commit: "perf(commons): C++ Karmarkar-Karp partit..." | Re-trigger Greptile}

greptile-apps · 2026-05-14T03:36:48Z

+        if _matches:
+            _spec = _importlib_util.spec_from_file_location("kk_cpu_ops", _matches[0])
+            if _spec is not None and _spec.loader is not None:
+                _kk_cpu_ops = _importlib_util.module_from_spec(_spec)
+                _spec.loader.exec_module(_kk_cpu_ops)


Inplace .so loaded without sys.modules registration

When the fallback exec_module path fires, _kk_cpu_ops holds the loaded extension but it is never inserted into sys.modules["kk_cpu_ops"]. If any code later does a plain import kk_cpu_ops it will raise ImportError (the module exists in memory but is unreachable via the import system), and a importlib.reload(partitioner) would execute the .so a second time. Adding sys.modules.setdefault("kk_cpu_ops", _kk_cpu_ops) after exec_module would make the loaded module discoverable and prevent double-loading.

KK runs on every load-balanced batch shuffle. The pure-Python implementation uses heapq with Python __lt__ callbacks, which keep the GIL the whole time KK runs in the background ThreadPoolExecutor. nsys shows a visible "Waiting for GIL" gap on the main thread (~1.5 ms) during the karmarkar_karp NVTX range — the main thread cannot keep submitting CUDA kernels while KK contends the GIL. This commit ports KK to a pybind11 C++ extension (kk_cpu_ops) with `py::gil_scoped_release` around the compute. Output is bit-for-bit identical to the Python implementation (same Set.__lt__ / State.__lt__ tie-breaking). partitioner.py resolution order: 1. honour KK_FORCE_PYTHON=1 (escape hatch / parity tests) 2. top-level import kk_cpu_ops (matches `python setup.py install` layout — picked up automatically when the Docker image is rebuilt via examples/commons/setup.py) 3. sibling .so next to the perf_model package (matches `python setup.py build_ext --inplace` during dev iteration) 4. fall back to the pure-Python implementation otherwise `BUILD_EXT_ONLY=kk_cpu_ops` env knob in setup.py rebuilds just the CPU partitioner without spending minutes on the CUDA extensions. Measurements: Isolated micro-benchmark (n=k*4096=32768, k=8): Python median 237.6 ms → C++ median 13.5 ms (17.6x) nsys NVTX karmarkar_karp duration on the kk worker thread: Python ms-scale → C++ avg 135 us / max 270 us (50x shorter) Main-thread "Waiting for GIL" gap during the karmarkar_karp NVTX range (per the original nsys screenshot, ~1.36 ms): gone under C++. Tests (26 cases in examples/tests/commons/test_kk_partitioner.py): - 16 parity cases across k = 2/4/8/16 x n/k = 1/4/32/128 - 3 unequal-size parity cases - Adversarial: all-equal workloads, deliberate len(items) ties - Sufficiency invariants: partitions cover [0, n) exactly once; KK beats naive chunking on skewed input. - test_gil_released_during_compute runs C++ KK in a background thread while the main thread runs a tight Python compute loop, asserts the main thread retains > 50 % of its GIL-free baseline throughput — proves the release actually fires. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

JacoCheung · 2026-05-14T14:04:14Z

/build

JacoCheung · 2026-05-14T14:08:43Z

❌ Pipeline #51290628 -- failed

Job	Status	Log
pre_check	✅ success	view
train_build	❌ failed	view
inference_build	✅ success	view
tritonserver_build	❌ failed	view
build_whl	❌ failed	view
dynamicemb_test_fwd_bwd_8gpus	❌ failed	view
dynamicemb_test_load_dump_8gpus	❌ failed	view
unit_test_1gpu_a100	❌ failed	view
unit_test_1gpu_h100	❌ failed	view
unit_test_4gpu	❌ failed	view
unit_test_tp_4gpu	❌ failed	view
L20_unit_test_1gpu	❌ failed	view
inference_unit_test_1gpu	❌ failed	view
inference_test_1gpu	❌ failed	view

Result: 2/14 jobs passed

View full pipeline

JacoCheung · 2026-05-14T15:41:55Z

/build

JacoCheung · 2026-05-14T15:46:25Z

❌ Pipeline #51299141 -- failed

Job	Status	Log
pre_check	✅ success	view
train_build	✅ success	view
inference_build	✅ success	view
tritonserver_build	✅ success	view
build_whl	✅ success	view
dynamicemb_test_fwd_bwd_8gpus	❌ failed	view
dynamicemb_test_load_dump_8gpus	❌ failed	view
unit_test_1gpu_a100	❌ failed	view
unit_test_1gpu_h100	✅ success	view
unit_test_4gpu	✅ success	view
unit_test_tp_4gpu	❌ failed	view
L20_unit_test_1gpu	✅ success	view
inference_unit_test_1gpu	❌ failed	view
inference_test_1gpu	✅ success	view

Result: 9/14 jobs passed

View full pipeline

greptile-apps Bot reviewed May 14, 2026

View reviewed changes

JacoCheung force-pushed the junzhang/cpp_kk_partitioner branch from 209709d to 26511b0 Compare May 14, 2026 03:41

greptile-apps Bot reviewed May 14, 2026

View reviewed changes

Comment thread examples/commons/perf_model/partitioner.py Outdated

JacoCheung force-pushed the junzhang/cpp_kk_partitioner branch from 26511b0 to 05a513d Compare May 14, 2026 03:49

JacoCheung force-pushed the junzhang/cpp_kk_partitioner branch from 05a513d to 0add388 Compare May 14, 2026 08:37

JacoCheung requested a review from shijieliu May 14, 2026 14:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(commons): C++ Karmarkar-Karp partitioner releases GIL#390

perf(commons): C++ Karmarkar-Karp partitioner releases GIL#390
JacoCheung wants to merge 1 commit into
NVIDIA:mainfrom
JacoCheung:junzhang/cpp_kk_partitioner

JacoCheung commented May 14, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented May 14, 2026 •

edited

Loading

Uh oh!

Uh oh!

greptile-apps Bot May 14, 2026

Uh oh!

Uh oh!

JacoCheung commented May 14, 2026

Uh oh!

JacoCheung commented May 14, 2026 •

edited

Loading

Uh oh!

JacoCheung commented May 14, 2026

Uh oh!

JacoCheung commented May 14, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

JacoCheung commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why this matters

Implementation

Verification

Test plan

Uh oh!

greptile-apps Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

greptile-apps Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

JacoCheung commented May 14, 2026

Uh oh!

JacoCheung commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JacoCheung commented May 14, 2026

Uh oh!

JacoCheung commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

JacoCheung commented May 14, 2026 •

edited

Loading

greptile-apps Bot commented May 14, 2026 •

edited

Loading

JacoCheung commented May 14, 2026 •

edited

Loading

JacoCheung commented May 14, 2026 •

edited

Loading