Implements Stages 0-5 of the k-bit quantization plan from cuda-spec.md:
- Pure Python reference (quantize_kbit_ref, dequantize_kbit_ref) with 57 passing tests
- CUDA kernels using __ballot_sync bit-plane packing and __shfl_sync codebook lookup
- Test kernels (pack/unpack, memory format, codebook lookup) and production kernels
- All C interface symbols exported and loadable via ctypes
CUDA kernels compile but are not yet executable due to an RDC device
linking issue where template instantiations in kernels.cu are not
pulled into the final fatbinary. See KBIT_PROGRESS.md for diagnosis
and recommended fix (move kernel bodies into ops.cu or a new self-contained file).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- CUDA ctypes wrappers: `_cuda_test_pack_unpack()`, `_cuda_quantize_kbit()`, `_cuda_dequantize_kbit()`, etc.
- CUDA tests (Stages 1-5): `TestStage1PackUnpackCUDA`, `TestStage2PackMemoryCUDA`, `TestStage3CodebookLookupCUDA`, `TestStage4QuantizeCUDA`, `TestStage5DequantizeCUDA`
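As a rough illustration of what the Stage 1 pack/unpack round trip checks, the sketch below emulates bit-plane packing in pure Python. It is not the project's `quantize_kbit_ref`. The names `pack_bit_planes` and `unpack_bit_planes` are hypothetical, and the layout (one 32-bit word per bit plane, lane `i` mapped to bit `i`) is an assumption based only on how `__ballot_sync` builds its result mask.

```python
WARP_SIZE = 32  # lanes per warp; each lane holds one k-bit index


def pack_bit_planes(indices, k):
    """Pack 32 k-bit indices into k 32-bit words, one word per bit plane."""
    if len(indices) != WARP_SIZE:
        raise ValueError("expected one index per lane (32 values)")
    planes = []
    for bit in range(k):
        word = 0
        for lane, value in enumerate(indices):
            # Emulates a warp ballot: lane i contributes bit i of the word.
            word |= ((value >> bit) & 1) << lane
        planes.append(word)
    return planes


def unpack_bit_planes(planes):
    """Invert pack_bit_planes: rebuild the 32 indices from the bit planes."""
    indices = []
    for lane in range(WARP_SIZE):
        value = 0
        for bit, word in enumerate(planes):
            value |= ((word >> lane) & 1) << bit
        indices.append(value)
    return indices


if __name__ == "__main__":
    original = [lane % 8 for lane in range(WARP_SIZE)]  # 3-bit indices
    planes = pack_bit_planes(original, k=3)
    assert unpack_bit_planes(planes) == original
```

With this layout, 32 values of `k` bits each occupy exactly `k` 32-bit words, which is why a round-trip test is a convenient first check before any codebook lookup is involved.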
## Current Blocker: RDC Device Linking
### Problem
The compiled kernels exist in the `.o` object files (verified via `nm`), and the C-level symbols are exported in the final `.so` (verified via `nm -D`), but the **CUDA device code** (fatbinary) does not contain the new kernel functions. Running any kernel gives "invalid device function".
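The gap between the two checks can also be seen from Python. The sketch below is an illustration under stated assumptions, not the project's test code. `has_host_symbol` is a hypothetical helper, and it only confirms that the dynamic loader can resolve a host-side name. It says nothing about whether the fatbinary contains the matching kernel, so it would still pass in the situation described here.

```python
import ctypes


def has_host_symbol(lib, name):
    """Return True if `name` resolves as an exported host symbol in `lib`.

    This mirrors what `nm -D` shows: the C wrapper is visible to the loader.
    It cannot detect missing device code; that only surfaces at launch time
    as "invalid device function".
    """
    try:
        getattr(lib, name)  # ctypes raises AttributeError if lookup fails
        return True
    except AttributeError:
        return False


if __name__ == "__main__":
    # Stand-in for the real shared library: the current process (POSIX only).
    lib = ctypes.CDLL(None)
    print(has_host_symbol(lib, "strlen"))
```

To check the actual build, pass the path of the compiled `.so` to `ctypes.CDLL` and look up one of the exported wrapper names. The path is specific to the build and is not shown here.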
### Root Cause
The project uses `-rdc=true` (relocatable device code) for separate compilation. The device link step (`cmake_device_link.o`) needs to resolve all device-side references. The template instantiations in `kernels.cu` produce weak symbols in the object file, but the device linker may not be pulling them in because they're not referenced from the device link compilation unit.
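One way to test the weak-symbol hypothesis is to filter `nm` output for weak entries and compare them against the kernels that fail at launch. The sketch below is a hedged helper for that. `weak_symbols` is a hypothetical name, the sample text in the usage block is invented for illustration, and the parser assumes plain (non-demangled) `nm` output in the usual `address type name` layout.

```python
def weak_symbols(nm_output):
    """Return names that nm reports as weak (type W, w, V or v).

    Expects mangled output; demangled names from `nm -C` can contain spaces
    and would not split cleanly into columns.
    """
    names = []
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) == 3:
            _, sym_type, name = parts   # defined symbol: address type name
        elif len(parts) == 2:
            sym_type, name = parts      # undefined symbol: no address column
        else:
            continue
        if sym_type in ("W", "w", "V", "v"):
            names.append(name)
    return names


if __name__ == "__main__":
    # Invented sample, not real output from this project.
    sample = (
        "0000000000000000 W example_weak_kernel_stub\n"
        "0000000000000040 T example_strong_wrapper\n"
        "                 U cudaLaunchKernel\n"
    )
    print(weak_symbols(sample))
```

Note that `nm` on a host object lists host-side symbols. Confirming what actually ended up in the device code still requires inspecting the embedded fatbinary with a CUDA-aware tool.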
### How to Fix (options)
1. **Add `__global__` function declarations to the device link file**: Check how CMake generates the device link step and ensure it sees all `.cu` object files.
61
+
62
+
2. **Use `--relocatable-device-code=false` for the kbit kernels**: If the kbit kernels don't need cross-file device calls, they could be compiled without RDC. This requires CMake changes.
63
+
64
+
3. **Move kernel definitions to the same file as the launch wrappers**: Instead of splitting between `kernels.cu` (kernel definitions) and `ops.cu` (launch wrappers), put everything in a single `.cu` file. This is the simplest fix -- add the kernel bodies directly to `ops.cu` or create a new `kbit_kernels.cu` that contains both kernels and launch wrappers.
65
+
66
+
4. **Check CMakeLists.txt for device link configuration**: The CMake `CUDA_SEPARABLE_COMPILATION` property or `CUDA_RESOLVE_DEVICE_SYMBOLS` might need adjustment.

**Recommended fix**: Option 3 -- move all kbit kernel code from `kernels.cu` into `ops.cu` (or a new self-contained file). This sidesteps the RDC linking issue entirely since the kernel and its launch site would be in the same compilation unit.