perf(commons): C++ Karmarkar-Karp partitioner releases GIL#390
perf(commons): C++ Karmarkar-Karp partitioner releases GIL#390JacoCheung wants to merge 1 commit into
Conversation
Greptile SummaryThis PR ports the Karmarkar-Karp k-way partitioner from pure Python (with GIL-holding
Confidence Score: 5/5Safe to merge — the C++ port faithfully replicates Python's tie-breaking, the GIL release is correctly scoped with RAII, and the Python fallback remains intact for environments without the extension. The tie-breaking analysis (sum to size to lexicographic on items in Set; spread to sets[0] in State) matches between Python and C++ exactly, as confirmed by the 26-case parity test suite. The dispatcher, fallback chain, and build system changes are all additive and well-guarded. Only minor const-correctness and validation-placement style points were found. No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant Caller
participant Wrapper as "karmarkar_karp (Python)"
participant CPP as "kk_cpu_ops (C++)"
participant PyKK as "_karmarkar_karp_python"
Caller->>Wrapper: workloads, k, equal_size
Wrapper->>Wrapper: nvtx.range_push / tolist()
alt C++ extension available
Wrapper->>CPP: karmarkar_karp(list, k, equal_size)
Note over CPP: py::gil_scoped_release sort + heap loop (GIL free)
CPP-->>Wrapper: partitions
else fallback
Wrapper->>PyKK: workloads, k, equal_size
Note over PyKK: heapq + Python callbacks (GIL held)
PyKK-->>Wrapper: partitions
end
Wrapper->>Wrapper: equal_size assert / nvtx.range_pop
Wrapper-->>Caller: partitions
Reviews (4): Last reviewed commit: "perf(commons): C++ Karmarkar-Karp partit..." | Re-trigger Greptile |
| if _matches: | ||
| _spec = _importlib_util.spec_from_file_location("kk_cpu_ops", _matches[0]) | ||
| if _spec is not None and _spec.loader is not None: | ||
| _kk_cpu_ops = _importlib_util.module_from_spec(_spec) | ||
| _spec.loader.exec_module(_kk_cpu_ops) |
There was a problem hiding this comment.
Inplace
.so loaded without sys.modules registration
When the fallback exec_module path fires, _kk_cpu_ops holds the loaded extension but it is never inserted into sys.modules["kk_cpu_ops"]. If any code later does a plain import kk_cpu_ops it will raise ImportError (the module exists in memory but is unreachable via the import system), and a importlib.reload(partitioner) would execute the .so a second time. Adding sys.modules.setdefault("kk_cpu_ops", _kk_cpu_ops) after exec_module would make the loaded module discoverable and prevent double-loading.
209709d to
26511b0
Compare
26511b0 to
05a513d
Compare
KK runs on every load-balanced batch shuffle. The pure-Python
implementation uses heapq with Python __lt__ callbacks, which keep
the GIL the whole time KK runs in the background ThreadPoolExecutor.
nsys shows a visible "Waiting for GIL" gap on the main thread
(~1.5 ms) during the karmarkar_karp NVTX range — the main thread
cannot keep submitting CUDA kernels while KK contends the GIL.
This commit ports KK to a pybind11 C++ extension (kk_cpu_ops) with
`py::gil_scoped_release` around the compute. Output is bit-for-bit
identical to the Python implementation (same Set.__lt__ /
State.__lt__ tie-breaking). partitioner.py resolution order:
1. honour KK_FORCE_PYTHON=1 (escape hatch / parity tests)
2. top-level import kk_cpu_ops (matches `python setup.py install`
layout — picked up automatically when the Docker image is
rebuilt via examples/commons/setup.py)
3. sibling .so next to the perf_model package (matches
`python setup.py build_ext --inplace` during dev iteration)
4. fall back to the pure-Python implementation otherwise
`BUILD_EXT_ONLY=kk_cpu_ops` env knob in setup.py rebuilds just the
CPU partitioner without spending minutes on the CUDA extensions.
Measurements:
Isolated micro-benchmark (n=k*4096=32768, k=8):
Python median 237.6 ms → C++ median 13.5 ms (17.6x)
nsys NVTX karmarkar_karp duration on the kk worker thread:
Python ms-scale → C++ avg 135 us / max 270 us (50x shorter)
Main-thread "Waiting for GIL" gap during the karmarkar_karp NVTX
range (per the original nsys screenshot, ~1.36 ms): gone under C++.
Tests (26 cases in examples/tests/commons/test_kk_partitioner.py):
- 16 parity cases across k = 2/4/8/16 x n/k = 1/4/32/128
- 3 unequal-size parity cases
- Adversarial: all-equal workloads, deliberate len(items) ties
- Sufficiency invariants: partitions cover [0, n) exactly once;
KK beats naive chunking on skewed input.
- test_gil_released_during_compute runs C++ KK in a background
thread while the main thread runs a tight Python compute loop,
asserts the main thread retains > 50 % of its GIL-free baseline
throughput — proves the release actually fires.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
05a513d to
0add388
Compare
|
/build |
|
❌ Pipeline #51290628 -- failed
Result: 2/14 jobs passed |
|
/build |
|
❌ Pipeline #51299141 -- failed
Result: 9/14 jobs passed |
Summary
KK runs on every load-balanced batch shuffle. The pure-Python
implementation in examples/commons/perf_model/partitioner.py
uses
heapqwith Python__lt__callbacks, which keep the GIL thewhole time KK runs in the background
ThreadPoolExecutor. nsys shows avisible "Waiting for GIL" gap on the main thread (~1.5 ms) during the
karmarkar_karpNVTX range — the main thread cannot keep submittingCUDA kernels while KK contends the GIL.
This PR ports KK to a pybind11 C++ extension (
kk_cpu_ops) withpy::gil_scoped_releasearound the compute. Output is bit-for-bitidentical to the Python implementation (same
Set.__lt__/State.__lt__tie-breaking). The Python implementation is kept as afallback when the extension is not built, gated by
KK_FORCE_PYTHON=1for parity testing.
Why this matters
GIL bubble eliminated. nsys
karmarkar_karpNVTX duration onthe kk worker thread drops from Python's ms-scale (~1.5 ms, with
~1.36 ms of "Waiting for GIL" on the main thread) to
avg 135 μs / max 270 μs under C++ — 50× shorter, and the
main-thread GIL-wait gap is gone.
Drop-in for production training.
commons.perf_model.partitioner.karmarkar_karpsignature is unchanged; the resolution order is:
KK_FORCE_PYTHON=1(escape hatch / parity tests)import kk_cpu_ops(matches thepython setup.py installlayout picked up automatically whenthe Docker image is rebuilt via
examples/commons/setup.py).sonext to theperf_modelpackage (matchespython setup.py build_ext --inplacefor dev iteration)Isolated micro-benchmark at n=k*4096=32768, k=8: Python median
237.6 ms → C++ median 13.5 ms (17.6× speedup).
Implementation
examples/commons/perf_model/csrc/kk_partition.cppFaithful port of the Karp-Karmarkar heap with the same tie-breaking
rules.
py::gil_scoped_release releasecovers the whole compute(sort +
std::make_heap/std::pop_heap/std::push_heaploop, no Python callbacks anywhere).
examples/commons/perf_model/partitioner.pyAdds a thin wrapper that probes for
kk_cpu_opsand dispatches toC++ when available, otherwise to the renamed
_karmarkar_karp_pythonhelper. No call-site changes.
examples/commons/setup.pyAdds the
kk_cpu_opsCppExtensionalongside the existing CUDAextensions. Introduces
BUILD_EXT_ONLY=kk_cpu_opsenv knob torebuild just the CPU partitioner without spending minutes on the
CUDA extensions.
examples/tests/commons/test_kk_partitioner.pyNew parity + sufficiency test file:
len(items)tiesKK beats naive chunking on skewed input
test_gil_released_during_compute: runs C++ KK in abackground thread while the main thread runs a tight Python
compute loop, asserts main-thread retains > 50% of its
GIL-free baseline throughput — proves the release actually
fires.
Verification
pytest tests/commons/test_kk_partitioner.py -q→ all passkarmarkar_karpKK_FORCE_PYTHON=1bypasses C++; tests still pass--nsys--nsysTest plan
--nsysshows GIL bubble eliminated and ~3.86 ms/iter step-time saving--nsysshows no regression/buildto confirm new container buildskk_cpu_opscleanly🤖 Generated with Claude Code