
Commit 38e8642

TimDettmers and claude committed
Add k-bit quantization kernels (K=2-5, blocksize=32) -- WIP
Implements Stages 0-5 of the k-bit quantization plan from cuda-spec.md:

- Pure Python reference (quantize_kbit_ref, dequantize_kbit_ref) with 57 passing tests
- CUDA kernels using __ballot_sync bit-plane packing and __shfl_sync codebook lookup
- Test kernels (pack/unpack, memory format, codebook lookup) and production kernels
- All C interface symbols exported and loadable via ctypes

CUDA kernels compile but are not yet executable due to an RDC device linking issue where template instantiations in kernels.cu are not pulled into the final fatbinary. See KBIT_PROGRESS.md for diagnosis and recommended fix (move kernel bodies into ops.cu or a new self-contained file).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 5ea3c89 commit 38e8642

File tree

6 files changed: +1278 −1 lines changed


KBIT_PROGRESS.md

Lines changed: 94 additions & 0 deletions
# K-Bit Quantization Implementation Progress

**Branch**: `feature/kbit-quantization` (worktree at `~/git/bitsandbytes-kbit`)
**Spec files**: `cuda-spec.md`, `cuda-spec-additions.md` (in main repo, gitignored)

## Completed

### Stage 0: Pure Python Reference -- DONE

- File: `tests/test_kbit_quantization.py`
- Functions: `create_normal_float_codebook()`, `quantize_kbit_ref()`, `dequantize_kbit_ref()`, `pack_kbit_ref()`, `unpack_kbit_ref()`
- 57 tests pass (codebook generation, round-trip, MSE ordering, error bounds, pack/unpack)
- Serves as the permanent ground truth for all CUDA validation
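The reference pipeline can be sketched in a few lines. This is an illustrative stand-in (not the actual test-file code): it mirrors the per-32-block absmax scaling and nearest-codebook search, but uses a simple uniform codebook rather than the normal-float one that `create_normal_float_codebook()` produces.

```python
import numpy as np

def quantize_kbit_sketch(x, codebook, blocksize=32):
    """Per-block absmax scaling + nearest-codebook index (illustrative)."""
    n = len(x)
    nblocks = (n + blocksize - 1) // blocksize
    absmax = np.empty(nblocks, dtype=np.float32)
    idx = np.empty(n, dtype=np.uint8)
    for b in range(nblocks):
        blk = x[b * blocksize:(b + 1) * blocksize]
        amax = np.abs(blk).max()
        absmax[b] = amax
        normalized = blk / max(amax, 1e-8)  # normalize block to [-1, 1]
        # nearest codebook entry for each element
        idx[b * blocksize:b * blocksize + len(blk)] = np.abs(
            normalized[:, None] - codebook[None, :]).argmin(axis=1)
    return idx, absmax

def dequantize_kbit_sketch(idx, absmax, codebook, blocksize=32):
    """Codebook lookup, then rescale each block by its stored absmax."""
    out = codebook[idx].astype(np.float32)
    for b in range(len(absmax)):
        out[b * blocksize:(b + 1) * blocksize] *= absmax[b]
    return out

# K=4 round trip with a uniform stand-in codebook
K = 4
codebook = np.linspace(-1.0, 1.0, 2 ** K).astype(np.float32)
x = np.random.default_rng(0).standard_normal(100).astype(np.float32)
idx, absmax = quantize_kbit_sketch(x, codebook)
xh = dequantize_kbit_sketch(idx, absmax, codebook)
```

With a uniform codebook the per-element error is bounded by half the codebook spacing times the block absmax; the real tests check analogous round-trip and error-bound properties against the normal-float codebook.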
### Stages 1-5: CUDA Kernels -- CODE WRITTEN, BUILD ISSUE

All CUDA kernel code is written and compiles, but a **device linker issue** prevents the kernels from appearing in the final `.so`.
#### Files modified

1. **`csrc/kernels.cu`** (appended at end, ~200 lines):
   - `warp_reduce_absmax()` -- device helper for warp-level max reduction
   - `pack_kbit_warp<K>()` -- device helper, `__ballot_sync` bit-plane packing
   - `unpack_kbit_warp<K>()` -- device helper, bit-extraction unpacking
   - `kTestPackUnpack_kbit<K>` -- Stage 1 test kernel (in-warp round-trip)
   - `kTestPackWrite_kbit<K>` -- Stage 2 test kernel (pack to global memory)
   - `kTestReadUnpack_kbit<K>` -- Stage 2 test kernel (read from global memory)
   - `kTestCodebookLookup_kbit<K>` -- Stage 3 test kernel (`__shfl_sync` codebook)
   - `kQuantizeBlockwise_kbit<T, K>` -- Stage 4 production quantize kernel
   - `kDequantizeBlockwise_kbit<T, K>` -- Stage 5 production dequantize kernel
   - Template instantiation macros for K=2,3,4,5 x T=half,bf16,float
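The `__ballot_sync` bit-plane format these helpers produce can be emulated on the host. A standalone sketch (hypothetical helper names, not code from the commit) that packs 32 K-bit values into K uint32 words, one bit plane per word:

```python
def pack_kbit_planes(indices, K):
    """Pack 32 K-bit values into K 32-bit bit-plane words.

    Word `bit` collects bit `bit` of every lane's value, with lane i at bit
    position i -- the same layout __ballot_sync produces on the device.
    """
    assert len(indices) == 32
    words = []
    for bit in range(K):
        w = 0
        for lane, v in enumerate(indices):
            w |= ((v >> bit) & 1) << lane
        words.append(w)
    return words

def unpack_kbit_planes(words, K):
    """Recover the 32 K-bit values from K bit-plane words."""
    out = [0] * 32
    for bit in range(K):
        for lane in range(32):
            out[lane] |= ((words[bit] >> lane) & 1) << bit
    return out

# Deterministic round-trip check for K=3
K = 3
vals = [(lane * 5) % (2 ** K) for lane in range(32)]
packed = pack_kbit_planes(vals, K)
assert unpack_kbit_planes(packed, K) == vals
```

Note the storage economy: 32 values take exactly K words regardless of K, so a 3-bit format needs no byte-boundary padding.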
2. **`csrc/kernels.cuh`** (appended before `#endif`):
   - Forward declarations of all kernel templates

3. **`csrc/ops.cu`** (appended at end, ~100 lines):
   - Launch wrappers: `test_pack_unpack_kbit<K>()`, `test_pack_write_kbit<K>()`, etc.
   - Launch wrappers: `quantizeBlockwise_kbit<T,K>()`, `dequantizeBlockwise_kbit<T,K>()`
   - Grid calculation: `ceil(ceil(n/32) / 8)` CUDA blocks, 256 threads (8 warps) per block
   - Template instantiation macros
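The grid calculation in item 3 can be made concrete with a small sketch (hypothetical helper, assuming one warp per 32-element quantization block and 8 warps of 32 threads per CUDA block, as described above):

```python
def kbit_grid(n, blocksize=32, warps_per_block=8):
    """CUDA launch configuration for the k-bit kernels.

    One warp handles one quantization block, so we need ceil(n / 32) warps,
    grouped 8 to a CUDA block of 256 threads.
    """
    num_qblocks = (n + blocksize - 1) // blocksize                      # ceil(n / 32)
    num_cuda_blocks = (num_qblocks + warps_per_block - 1) // warps_per_block
    return num_cuda_blocks, warps_per_block * 32                        # (grid, threads)
```

For example, 256 elements need 8 warps, which fit in a single 256-thread block; 257 elements spill into a second block.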
4. **`csrc/pythonInterface.cpp`** (two sections added):
   - Unmangled wrappers (inside `#if BUILD_CUDA || BUILD_HIP`): `test_pack_unpack_k{K}()`, `quantize_kbit_{fp16,bf16,fp32}_k{K}()`, etc.
   - `extern "C"` wrappers: `ctest_pack_unpack_k{K}()`, `cquantize_kbit_{tname}_k{K}()`, `cdequantize_kbit_{tname}_k{K}()`, etc.
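Since the exported symbols follow a regular naming scheme, the ctypes side can construct names programmatically. A sketch with a hypothetical helper -- the symbol patterns come from the list above, and the actual `ctypes.CDLL` load is shown only as a comment because it requires the built `.so`:

```python
def kbit_symbol(op, tname=None, K=4):
    """Build the extern "C" symbol name for a k-bit entry point.

    Test kernels:       c{op}_k{K}, e.g. ctest_pack_unpack_k3
    Production kernels: c{op}_kbit_{tname}_k{K} with tname in {fp16, bf16, fp32}
    """
    if tname is None:
        return f"c{op}_k{K}"
    return f"c{op}_kbit_{tname}_k{K}"

# With the compiled library, resolution would look roughly like
# (path and usage are assumptions, not verified against the build):
#   import ctypes
#   lib = ctypes.CDLL("bitsandbytes/libbitsandbytes_cuda124.so")
#   fn = getattr(lib, kbit_symbol("quantize", "fp16", K=4))
```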
5. **`tests/test_kbit_quantization.py`** (comprehensive test file):
   - Python reference tests (Stage 0): `TestCodebook`, `TestQuantizeRef`, `TestPackUnpackRef`
   - CUDA ctypes wrappers: `_cuda_test_pack_unpack()`, `_cuda_quantize_kbit()`, `_cuda_dequantize_kbit()`, etc.
   - CUDA tests (Stages 1-5): `TestStage1PackUnpackCUDA`, `TestStage2PackMemoryCUDA`, `TestStage3CodebookLookupCUDA`, `TestStage4QuantizeCUDA`, `TestStage5DequantizeCUDA`
## Current Blocker: RDC Device Linking

### Problem

The compiled kernels exist in the `.o` object files (verified via `nm`), and the C-level symbols are exported in the final `.so` (verified via `nm -D`), but the **CUDA device code** (fatbinary) does not contain the new kernel functions. Launching any kernel fails with "invalid device function".

### Root Cause

The project uses `-rdc=true` (relocatable device code) for separate compilation. The device link step (`cmake_device_link.o`) needs to resolve all device-side references. The template instantiations in `kernels.cu` produce weak symbols in the object file, but the device linker may not be pulling them in because they are not referenced from the device link compilation unit.

### How to Fix (options)

1. **Add `__global__` function declarations to the device link file**: Check how CMake generates the device link step and ensure it sees all `.cu` object files.
2. **Use `--relocatable-device-code=false` for the kbit kernels**: If the kbit kernels don't need cross-file device calls, they could be compiled without RDC, but this requires CMake changes.
3. **Move kernel definitions into the same file as the launch wrappers**: Instead of splitting between `kernels.cu` (kernel definitions) and `ops.cu` (launch wrappers), put everything in a single `.cu` file. This is the simplest fix -- add the kernel bodies directly to `ops.cu`, or create a new `kbit_kernels.cu` containing both the kernels and the launch wrappers.
4. **Check CMakeLists.txt for the device link configuration**: The CMake `CUDA_SEPARABLE_COMPILATION` property or `CUDA_RESOLVE_DEVICE_SYMBOLS` might need adjustment.

**Recommended fix**: Option 3 -- move all kbit kernel code from `kernels.cu` into `ops.cu` (or a new self-contained file). This sidesteps the RDC linking issue entirely, since the kernel and its launch site end up in the same compilation unit.
## Build Instructions

```bash
cd ~/git/bitsandbytes-kbit
cmake -DCOMPUTE_BACKEND=cuda -DCOMPUTE_CAPABILITY="89;90" -S . -B build
make -C build -j$(nproc)
ln -sf libbitsandbytes_cuda124.so bitsandbytes/libbitsandbytes_cuda128.so
```
## Test Instructions

```bash
# Python-only tests (all pass)
python -m pytest tests/test_kbit_quantization.py -k "not CUDA" -v

# CUDA tests (currently fail due to the device link issue)
python -m pytest tests/test_kbit_quantization.py -k "CUDA" -v
```
## Not Yet Implemented

- Stages 6-8: error analysis, NF4 cross-validation, performance benchmarking (test code not written)
- Python API in `bitsandbytes/functional.py` (`quantize_kbit`, `dequantize_kbit`)
- `torch.library` registration in `bitsandbytes/_ops.py`
- Codebook caching/registration system

csrc/kernels.cu

Lines changed: 294 additions & 0 deletions
Appended after the existing `MAKE_OptimizerStatic8bit1StateBlockwise(ADAGRAD, ...)` instantiations (hunk `@@ -2601,3 +2601,297 @@`):

```cuda
// ===========================================================================
// K-bit blockwise quantization/dequantization kernels (blocksize=32, K=2..5)
//
// Uses bit-plane packing via __ballot_sync and codebook lookup via __shfl_sync.
// One warp (32 threads) per quantization block. 8 warps per CUDA block.
// ===========================================================================

// ---- Device helpers ----

// Warp-level max reduction (32 threads). Returns the max broadcast to all lanes.
__device__ __forceinline__ float warp_reduce_absmax(float val) {
#pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1)
        val = fmaxf(val, __shfl_down_sync(0xFFFFFFFF, val, offset));
    return __shfl_sync(0xFFFFFFFF, val, 0);
}

// Pack one K-bit value per lane into K bit-plane uint32 words via __ballot_sync.
// packed_words[0..K-1] are written with the bit-plane representation.
// All lanes in the warp must call this simultaneously.
template <int K>
__device__ __forceinline__ void pack_kbit_warp(unsigned char qval, unsigned int* packed_words) {
#pragma unroll
    for (int bit = 0; bit < K; bit++)
        packed_words[bit] = __ballot_sync(0xFFFFFFFF, (qval >> bit) & 1);
}

// Unpack one K-bit value for this lane from K bit-plane uint32 words.
template <int K>
__device__ __forceinline__ unsigned char unpack_kbit_warp(const unsigned int* packed_words, int lane_id) {
    unsigned char val = 0;
#pragma unroll
    for (int bit = 0; bit < K; bit++)
        val |= ((packed_words[bit] >> lane_id) & 1) << bit;
    return val;
}

// ---- Stage 1: Pack/unpack round-trip test kernel ----
// Input: uint8 indices[n]. Output: uint8 recovered[n].
template <int K>
__global__ void kTestPackUnpack_kbit(
    const unsigned char* __restrict__ indices,
    unsigned char* __restrict__ recovered,
    const int n
) {
    const int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    const int lane_id = threadIdx.x % 32;
    const int block_start = warp_id * 32;

    if (block_start >= n) return;

    // Load index (with bounds guard for partial last block)
    unsigned char qval = 0;
    if (block_start + lane_id < n)
        qval = indices[block_start + lane_id];

    // Pack into bit planes
    unsigned int packed[K];
    pack_kbit_warp<K>(qval, packed);

    // Unpack
    unsigned char recovered_val = unpack_kbit_warp<K>(packed, lane_id);

    // Store
    if (block_start + lane_id < n)
        recovered[block_start + lane_id] = recovered_val;
}

// ---- Stage 2: Pack-write and read-unpack test kernels ----

// Pack indices and write bit-plane words to global memory
template <int K>
__global__ void kTestPackWrite_kbit(
    const unsigned char* __restrict__ indices,
    unsigned int* __restrict__ packed_out,
    const int n
) {
    const int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    const int lane_id = threadIdx.x % 32;
    const int block_start = warp_id * 32;

    if (block_start >= n) return;

    unsigned char qval = 0;
    if (block_start + lane_id < n)
        qval = indices[block_start + lane_id];

    unsigned int packed[K];
    pack_kbit_warp<K>(qval, packed);

    // Lanes 0..K-1 each write one word
    if (lane_id < K)
        packed_out[warp_id * K + lane_id] = packed[lane_id];
}

// Read bit-plane words from global memory and unpack to indices
template <int K>
__global__ void kTestReadUnpack_kbit(
    const unsigned int* __restrict__ packed_in,
    unsigned char* __restrict__ indices_out,
    const int n
) {
    const int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    const int lane_id = threadIdx.x % 32;
    const int block_start = warp_id * 32;

    if (block_start >= n) return;

    // Load K words, broadcast to all lanes
    unsigned int packed[K];
#pragma unroll
    for (int bit = 0; bit < K; bit++) {
        unsigned int word = 0;
        if (lane_id == bit)
            word = packed_in[warp_id * K + bit];
        packed[bit] = __shfl_sync(0xFFFFFFFF, word, bit);
    }

    unsigned char val = unpack_kbit_warp<K>(packed, lane_id);

    if (block_start + lane_id < n)
        indices_out[block_start + lane_id] = val;
}

// ---- Stage 3: Codebook shuffle lookup test kernel ----

template <int K>
__global__ void kTestCodebookLookup_kbit(
    const unsigned char* __restrict__ indices,
    const float* __restrict__ codebook,
    float* __restrict__ out,
    const int n
) {
    const int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    const int lane_id = threadIdx.x % 32;
    const int block_start = warp_id * 32;

    if (block_start >= n) return;

    // Load codebook into warp lanes
    float cb = (lane_id < (1 << K)) ? codebook[lane_id] : 0.0f;

    // Load index
    unsigned char idx = 0;
    if (block_start + lane_id < n)
        idx = indices[block_start + lane_id];

    // Shuffle lookup
    float val = __shfl_sync(0xFFFFFFFF, cb, idx);

    if (block_start + lane_id < n)
        out[block_start + lane_id] = val;
}

// ---- Stage 4: Full quantize kernel ----

template <typename T, int K>
__global__ void kQuantizeBlockwise_kbit(
    const float* __restrict__ codebook,
    const T* __restrict__ A,
    float* __restrict__ absmax,
    unsigned int* __restrict__ packed_out,
    const int n
) {
    const int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    const int lane_id = threadIdx.x % 32;
    const int block_start = warp_id * 32;

    if (block_start >= n) return;

    // 1. Load input value
    float val = 0.0f;
    if (block_start + lane_id < n)
        val = (float)A[block_start + lane_id];

    // 2. Warp-level absmax reduction
    float amax = warp_reduce_absmax(fabsf(val));
    float amax_safe = fmaxf(amax, 1e-8f);

    // 3. Lane 0 stores absmax
    if (lane_id == 0)
        absmax[warp_id] = amax;

    // 4. Normalize to [-1, 1]
    float normalized = val / amax_safe;

    // 5. Load codebook into warp lanes
    float cb = (lane_id < (1 << K)) ? codebook[lane_id] : 0.0f;

    // 6. Branchless nearest-codebook search
    unsigned char best_idx = 0;
    float best_dist = 1e10f;
#pragma unroll
    for (int i = 0; i < (1 << K); i++) {
        float cb_val = __shfl_sync(0xFFFFFFFF, cb, i);
        float dist = fabsf(normalized - cb_val);
        bool closer = (dist < best_dist);
        best_dist = closer ? dist : best_dist;
        best_idx = closer ? (unsigned char)i : best_idx;
    }

    // 7. Pack into bit planes
    unsigned int packed[K];
    pack_kbit_warp<K>(best_idx, packed);

    // 8. Write K packed words
    if (lane_id < K)
        packed_out[warp_id * K + lane_id] = packed[lane_id];
}

// ---- Stage 5: Full dequantize kernel ----

template <typename T, int K>
__global__ void kDequantizeBlockwise_kbit(
    const unsigned int* __restrict__ packed_in,
    const float* __restrict__ codebook,
    const float* __restrict__ absmax,
    T* __restrict__ out,
    const int n
) {
    const int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    const int lane_id = threadIdx.x % 32;
    const int block_start = warp_id * 32;

    if (block_start >= n) return;

    // 1. Load codebook into warp lanes
    float cb = (lane_id < (1 << K)) ? codebook[lane_id] : 0.0f;

    // 2. Load absmax for this block
    float amax = absmax[warp_id];

    // 3. Load K packed words, broadcast to all lanes
    unsigned int packed[K];
#pragma unroll
    for (int bit = 0; bit < K; bit++) {
        unsigned int word = 0;
        if (lane_id == bit)
            word = packed_in[warp_id * K + bit];
        packed[bit] = __shfl_sync(0xFFFFFFFF, word, bit);
    }

    // 4. Unpack this thread's K-bit index
    unsigned char idx = unpack_kbit_warp<K>(packed, lane_id);

    // 5. Codebook lookup via shuffle
    float val = __shfl_sync(0xFFFFFFFF, cb, idx);

    // 6. Scale by absmax
    val *= amax;

    // 7. Store
    if (block_start + lane_id < n)
        out[block_start + lane_id] = (T)val;
}

// ---- Template instantiations ----

// Test kernels (Stages 1-3)
#define INSTANTIATE_TEST_KBIT(K) \
    template __global__ void kTestPackUnpack_kbit<K>( \
        const unsigned char*, unsigned char*, const int); \
    template __global__ void kTestPackWrite_kbit<K>( \
        const unsigned char*, unsigned int*, const int); \
    template __global__ void kTestReadUnpack_kbit<K>( \
        const unsigned int*, unsigned char*, const int); \
    template __global__ void kTestCodebookLookup_kbit<K>( \
        const unsigned char*, const float*, float*, const int);

INSTANTIATE_TEST_KBIT(2)
INSTANTIATE_TEST_KBIT(3)
INSTANTIATE_TEST_KBIT(4)
INSTANTIATE_TEST_KBIT(5)

// Production kernels (Stages 4-5)
#define INSTANTIATE_KBIT_QUANT(T, K) \
    template __global__ void kQuantizeBlockwise_kbit<T, K>( \
        const float*, const T*, float*, unsigned int*, const int); \
    template __global__ void kDequantizeBlockwise_kbit<T, K>( \
        const unsigned int*, const float*, const float*, T*, const int);

INSTANTIATE_KBIT_QUANT(half, 2)
INSTANTIATE_KBIT_QUANT(half, 3)
INSTANTIATE_KBIT_QUANT(half, 4)
INSTANTIATE_KBIT_QUANT(half, 5)
INSTANTIATE_KBIT_QUANT(__nv_bfloat16, 2)
INSTANTIATE_KBIT_QUANT(__nv_bfloat16, 3)
INSTANTIATE_KBIT_QUANT(__nv_bfloat16, 4)
INSTANTIATE_KBIT_QUANT(__nv_bfloat16, 5)
INSTANTIATE_KBIT_QUANT(float, 2)
INSTANTIATE_KBIT_QUANT(float, 3)
INSTANTIATE_KBIT_QUANT(float, 4)
INSTANTIATE_KBIT_QUANT(float, 5)
```
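For intuition about the storage format these kernels produce (K uint32 bit-plane words plus one fp32 absmax per 32-value quantization block), a host-side size calculation -- an illustrative sketch, not part of the commit:

```python
def kbit_storage_bytes(n, K, blocksize=32):
    """Packed size: each 32-value block stores K uint32 words + 1 fp32 absmax."""
    nblocks = (n + blocksize - 1) // blocksize
    packed = nblocks * K * 4   # K uint32 bit-plane words per block
    absmax = nblocks * 4       # one float32 scale per block
    return packed + absmax

# 4-bit, n=1024: 32 blocks -> 32*4*4 + 32*4 = 512 + 128 = 640 bytes
```

For large n this works out to K + 1 bits per value: K bits of payload plus 32 bits of absmax amortized over 32 values.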
