bitsandbytes-foundation
diff --git a/‎CLAUDE.md‎
Lines changed: 30 additions & 1 deletion b/‎CLAUDE.md‎
Lines changed: 30 additions & 1 deletion
diff --git a/‎CMakeLists.txt‎
Lines changed: 93 additions & 3 deletions b/‎CMakeLists.txt‎
Lines changed: 93 additions & 3 deletions
diff --git a/‎PROGRESS.md‎
Lines changed: 210 additions & 0 deletions b/‎PROGRESS.md‎
Lines changed: 210 additions & 0 deletions
@@ -40,7 +40,36 @@ Do NOT run the full test suite — it takes 10+ minutes. Instead, run only the t
 pytest tests/test_relevant_file.py -v --tb=short -k "relevant_test_name"
 ```
 
-The full suite will be run separately. Best practices and known issues: `agents/testing_guide.md`
+The full suite will be run separately. Best practices, benchmark data, and known architecture-specific issues: `agents/testing_guide.md`
+
+# Benchmarking
+
+Benchmark scripts live in `benchmarks/`. The two kbit-specific ones:
+
+- `bench_hadamard.py` — Hadamard rotation kernel + M=1 pipeline (rotation + scalar GEMV) vs cuBLAS FP16. Quick focused benchmark for the decode path.
+- `bench_kbit_vlm.py` — Comprehensive sweep across all VLM-relevant M values (1 to 1024), all kernel variants (scalar GEMV, MMA, dequant+cuBLAS), all k values (2-5), with and without Hadamard rotation. GLM-4.7 shapes (see `spec.md` § Target Model for layer dimensions).
+
+```bash
+# Quick M=1 decode benchmark
+python benchmarks/bench_hadamard.py
+
+# Full VLM sweep (all M, all k)
+python benchmarks/bench_kbit_vlm.py
+
+# Single k value, subset of M
+python benchmarks/bench_kbit_vlm.py --k 4 --m 1,4,16,256,1024
+
+# Higher accuracy (more iterations)
+python benchmarks/bench_kbit_vlm.py --inner 1000 --outer 30
+```
+
+## CUDA graph benchmarking methodology
+
+Single graph replay has a ~14 us timing floor (on RTX 4090) that masks sub-14 us kernel differences. The benchmarks use **batched graph replay**: replay the graph N times within one event-timed region, then divide. This amortizes the per-replay overhead to ~14/N us per iteration.
+
+The `--inner` flag controls N (replays per measurement). Default 500 gives ~0.03 us amortized overhead. Use `--inner 1000` for the highest accuracy when comparing kernels that differ by < 1 us.
+
+`--outer` controls the number of measurements (default 15). The median is reported to reject outliers.
 
 # Agent Dispatch (the "Dispatcher" role)
 
 
@@ -248,11 +248,9 @@ if(BUILD_CUDA)
         if(CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL "12.8" AND EXISTS "${CMAKE_SOURCE_DIR}/third_party/cutlass/include")
             list(APPEND _NVFP4_SM120_SOURCES
                 csrc/qutlass/gemm_nvfp4_sm120.cu
-                csrc/qutlass/scale_reorder.cu
-                csrc/qutlass/fused_quantize_nv.cu
             )
             set(_HAS_CUTLASS_NVFP4 TRUE)
-            message(STATUS "CUTLASS NVFP4 SM_120a GEMM + fused quantize enabled")
+            message(STATUS "CUTLASS NVFP4 SM_120a GEMM enabled")
         else()
             set(_HAS_CUTLASS_NVFP4 FALSE)
             message(STATUS "CUTLASS NVFP4 GEMM disabled (needs CUDA >= 12.8 and third_party/cutlass)")
@@ -286,6 +284,88 @@ if(BUILD_CUDA)
         message(STATUS "NVFP4 SM_120a GEMM kernel enabled")
     endif()
 
+    # SM_100a NVFP4 GEMM kernel: requires compute_100a for block-scaled MMA
+    # Only include if 100 or 101 is in the target architectures
+    set(_HAS_SM100 FALSE)
+    foreach(_cap IN LISTS COMPUTE_CAPABILITY)
+        if(_cap MATCHES "^10[01]$")
+            set(_HAS_SM100 TRUE)
+        endif()
+    endforeach()
+    if(_LATEST_CAPABILITY MATCHES "^10[01]$")
+        set(_HAS_SM100 TRUE)
+    endif()
+    if(_HAS_SM100)
+        # CUTLASS-based NVFP4 GEMM for SM_100 (requires CUDA 12.8+)
+        if(CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL "12.8" AND EXISTS "${CMAKE_SOURCE_DIR}/third_party/cutlass/include")
+            set(_NVFP4_SM100_SOURCES
+                csrc/qutlass/gemm_nvfp4_sm100.cu
+                csrc/qutlass/gemm_nvfp4_moe_sm100.cu
+            )
+
+            add_library(nvfp4_sm100a OBJECT ${_NVFP4_SM100_SOURCES})
+            set_target_properties(nvfp4_sm100a PROPERTIES
+                CUDA_ARCHITECTURES "100a"
+                POSITION_INDEPENDENT_CODE ON
+                CUDA_SEPARABLE_COMPILATION OFF
+            )
+            target_compile_options(nvfp4_sm100a PRIVATE
+                $<$<COMPILE_LANGUAGE:CUDA>:--use_fast_math>
+                $<$<COMPILE_LANGUAGE:CUDA>:--expt-relaxed-constexpr>
+                $<$<COMPILE_LANGUAGE:CUDA>:-std=c++17>
+                $<$<COMPILE_LANGUAGE:CUDA>:-O3>
+                $<$<COMPILE_LANGUAGE:CUDA>:-DNDEBUG>
+                $<$<COMPILE_LANGUAGE:CUDA>:-DQUTLASS_DISABLE_PYBIND>
+            )
+            target_include_directories(nvfp4_sm100a PRIVATE
+                "${CMAKE_SOURCE_DIR}/third_party/cutlass/include"
+                "${CMAKE_SOURCE_DIR}/third_party/cutlass/tools/util/include"
+                "${CMAKE_SOURCE_DIR}/csrc/qutlass/include"
+            )
+            message(STATUS "CUTLASS NVFP4 SM_100a GEMM enabled")
+        else()
+            message(STATUS "CUTLASS NVFP4 SM_100a GEMM disabled (needs CUDA >= 12.8 and third_party/cutlass)")
+        endif()
+    endif()
+
+    # Common CUTLASS utilities (scale_reorder, fused_quantize) compiled for
+    # ALL enabled Blackwell architectures. Kept in one library to avoid
+    # duplicate C symbol errors from the extern "C" wrappers.
+    if(_HAS_CUTLASS_NVFP4 OR _HAS_SM100)
+        set(_NVFP4_COMMON_ARCHS "")
+        if(_HAS_SM120 AND _HAS_CUTLASS_NVFP4)
+            list(APPEND _NVFP4_COMMON_ARCHS "120a")
+        endif()
+        if(_HAS_SM100)
+            list(APPEND _NVFP4_COMMON_ARCHS "100a")
+        endif()
+
+        add_library(nvfp4_common OBJECT
+            csrc/qutlass/scale_reorder.cu
+            csrc/qutlass/fused_quantize_nv.cu
+            csrc/qutlass/moe_scatter_gather.cu
+        )
+        set_target_properties(nvfp4_common PROPERTIES
+            CUDA_ARCHITECTURES "${_NVFP4_COMMON_ARCHS}"
+            POSITION_INDEPENDENT_CODE ON
+            CUDA_SEPARABLE_COMPILATION OFF
+        )
+        target_include_directories(nvfp4_common PRIVATE
+            "${CMAKE_SOURCE_DIR}/third_party/cutlass/include"
+            "${CMAKE_SOURCE_DIR}/third_party/cutlass/tools/util/include"
+            "${CMAKE_SOURCE_DIR}/csrc/qutlass/include"
+        )
+        target_compile_options(nvfp4_common PRIVATE
+            $<$<COMPILE_LANGUAGE:CUDA>:--use_fast_math>
+            $<$<COMPILE_LANGUAGE:CUDA>:--expt-relaxed-constexpr>
+            $<$<COMPILE_LANGUAGE:CUDA>:-std=c++17>
+            $<$<COMPILE_LANGUAGE:CUDA>:-O3>
+            $<$<COMPILE_LANGUAGE:CUDA>:-DNDEBUG>
+            $<$<COMPILE_LANGUAGE:CUDA>:-DQUTLASS_DISABLE_PYBIND>
+        )
+        message(STATUS "CUTLASS common utilities (scale_reorder, fused_quantize) for archs: ${_NVFP4_COMMON_ARCHS}")
+    endif()
+
     string(APPEND BNB_OUTPUT_NAME "_cuda${CUDA_VERSION_SHORT}")
     add_compile_definitions(BUILD_CUDA)
 elseif(BUILD_HIP)
@@ -379,6 +459,16 @@ if(TARGET nvfp4_sm120a)
     target_sources(bitsandbytes PRIVATE $<TARGET_OBJECTS:nvfp4_sm120a>)
 endif()
 
+# Link NVFP4 SM_100a object library if available
+if(TARGET nvfp4_sm100a)
+    target_sources(bitsandbytes PRIVATE $<TARGET_OBJECTS:nvfp4_sm100a>)
+endif()
+
+# Link common CUTLASS utilities (scale_reorder, fused_quantize) if available
+if(TARGET nvfp4_common)
+    target_sources(bitsandbytes PRIVATE $<TARGET_OBJECTS:nvfp4_common>)
+endif()
+
 if (BUILD_CPU)
     if (OpenMP_CXX_FOUND)
         target_link_libraries(bitsandbytes PRIVATE OpenMP::OpenMP_CXX)
 
@@ -0,0 +1,210 @@
+# Absmax format migration: float32 -> uint8 E4M4 (default) + float16 (option)
+
+Branch: `experiment/scalar-gemv-int8-absmax`
+Worktree: `/home/tim/git/bnb-kbit-gemm-int8-absmax`
+Base: `23f92e5` (feature/kbit-gemv-v8)
+
+## Motivation
+
+Benchmarking shows uint8 E4M4 absmax has identical performance to float32
+absmax in the scalar GEMV kernel, and adds at most ~4.5% to mean absolute
+error (at k=5; negligible at k=2-3) on top of the existing kbit quantization
+error. Switching to uint8 halves absmax storage (4 bytes -> 1 byte per quant
+block) and unifies the format across all kernels.
+
+## Current absmax formats (before this branch)
+
+| Kernel              | Absmax type   | Layout |
+|---------------------|---------------|--------|
+| MMA (dense)         | uint8 E4M4    | tiled  |
+| MMA (grouped/MoE)   | uint8 E4M4    | tiled  |
+| Scalar GEMV (dense) | **float32**   | flat   |
+| Scalar GEMV (grouped/MoE) | **float32** | flat |
+| Dequantize          | templated (both) | flat/tiled |
+
+**Target**: all kernels use uint8 E4M4 by default, with float16 as alternative.
+Remove float32 absmax path entirely.
+
+## Current status
+
+### Code changes DONE (uncommitted, in working tree):
+
+**CUDA kernels (`csrc/ops.cu`)**:
+- Moved E4M4 encode/decode functions before quantize kernel (eliminated forward declaration issue)
+- `kQuantizeBlockwise_kbit`: writes `unsigned char*` absmax via `encode_e4m4_absmax(amax)`
+- `kRepackKbit`: accepts `unsigned char*` absmax input, copies bytes directly (no re-encode)
+- `kbitScalarGemv` / `kbitGroupedScalarGemv`: `unsigned char*` absmax + `load_absmax()` decode
+- All launchers, entry points, and template instantiations updated
+
+**C++ interface (`csrc/pythonInterface.cpp`)**:
+- All forward declarations, wrappers, and extern C macros updated for `unsigned char*`
+- Added extern C wrappers for fp16abs scalar GEMV + grouped scalar GEMV (16 new symbols)
+
+**Python (`bitsandbytes/`)**:
+- `backends/cuda/ops.py`: quantize_kbit allocates uint8, repack_kbit expects uint8
+- `backends/cuda/ops.py`: scalar GEMV + grouped GEMV dispatch routes by absmax dtype (uint8 default, fp16 via `_fp16abs` suffix)
+- `_ops.py`: quantize_kbit fake op returns uint8
+- `functional.py`: removed redundant Python-side E4M4 encode (kernel does it natively)
+
+**Tests**:
+- `test_scalar_gemv.py`: added `decode_e4m4_absmax`, updated `dequant_reference`
+- `test_kbit_gemm.py`: `quantize_kbit_ref` returns uint8 E4M4, updated dequant/repack refs
+
+**Benchmarks**:
+- `ncu_driver.py`: updated comments, removed stale `.cuda()` call; all 4 kernel modes verified
+
+### Bug: illegal memory access at runtime — FIXED
+
+Root cause: stale build artifact. The previous session's `make` command
+didn't actually recompile `ops.cu` after source changes. The `.so` still
+had the old `float*` absmax signature while `pythonInterface.cpp` was
+passing `unsigned char*` via ctypes — causing out-of-bounds reads (the
+kernel read 4 bytes per absmax element instead of 1).
+
+Fix: clean rebuild (`rm -rf build && cmake -B build ... && make`).
+
+## Work items
+
+### 1. Scalar GEMV (dense) — float32 -> uint8 E4M4
+- [x] Baseline benchmark (current float32)
+- [x] Change kernel to use `unsigned char*` + `load_absmax<unsigned char>`
+- [x] Update pythonInterface.cpp, backends/cuda/ops.py
+- [x] **FIX BUG**: stale build — clean rebuild fixed it
+- [x] Post-change benchmark
+- [x] Record results below — **no regression**
+
+### 2. Grouped scalar GEMV (MoE) — float32 -> uint8 E4M4
+- [x] Baseline benchmark (current float32)
+- [x] Change kernel to use `unsigned char*` + `load_absmax<unsigned char>`
+- [x] Update pythonInterface.cpp, backends/cuda/ops.py
+- [x] **FIX BUG**: same stale build issue
+- [x] Post-change benchmark
+- [x] Record results below — **within noise for M=4, slight regression for M=1**
+
+### 3. quantize_kbit — return uint8 E4M4 by default
+- [x] Add E4M4 encode to quantize kernel (`encode_e4m4_absmax` in kQuantizeBlockwise_kbit)
+- [x] Update Python op return type (`_ops.py` allocates uint8, `backends/cuda/ops.py` allocates uint8)
+- [x] Remove Python-side double-encode in `functional.py::quantize_kbit` (kernel does it natively)
+- [x] Update repack_kbit: kernel accepts `unsigned char*` input, just copies bytes (no re-encode)
+- [x] Move E4M4 encode/decode definitions before quantize kernel (was forward-declared, caused issues)
+- [x] **BUG FIXED**: Previous session's forward declaration of `encode_e4m4_absmax` before `E4M4_BIAS`
+  was defined compiled but produced wrong results. Moved all E4M4 functions before quantize kernel.
+- [x] **BUG FIXED**: `functional.py::quantize_kbit` applied Python-side E4M4 encode on top of the
+  already-encoded kernel output (double encoding). Removed the redundant Python encode.
+
+### 4. Add float16 absmax alternative path — DONE
+- [x] Generic `load_absmax<ABSMAX_T>` already handles `half` (casts to float)
+- [x] Templated scalar GEMV + grouped scalar GEMV on `ABSMAX_T` (default = `unsigned char`)
+- [x] Added fp16 absmax template instantiations in ops.cu
+- [x] Added fp16abs C++ wrappers in pythonInterface.cpp (unmangled functions ready)
+- [x] Added extern C wrappers for fp16abs scalar GEMV + grouped scalar GEMV (in pythonInterface.cpp)
+- [x] Added Python dispatch: absmax dtype routing via `_fp16abs` suffix in `backends/cuda/ops.py`
+- [x] `_ops.py` — no changes needed, torch op defs use generic `Tensor` type
+- [x] Build compiles, all 31 scalar GEMV tests pass, all 195 GEMM tests pass
+- [x] Verified fp16abs path produces identical results to uint8 path (when E4M4→fp16 is lossless)
+
+### 5. Tests
+- [x] Updated test_scalar_gemv.py: added `decode_e4m4_absmax`, updated `dequant_reference`
+- [x] Updated test_kbit_gemm.py: `quantize_kbit_ref` now returns uint8 E4M4, updated dequant/repack refs
+- [x] All 31 test_scalar_gemv tests pass
+- [x] All 195 test_kbit_gemm tests pass
+- [ ] test_grouped_gemm.py has pre-existing failures (missing `max_M` arg, not related)
+
+### 6. Benchmark driver — DONE
+- [x] Updated ncu_driver.py: comment fix (uint8 absmax), removed stale `.cuda()` call
+- [x] All 4 kernel modes (mma, scalar, grouped, grouped_mma) verified working
+
+### 7. Update _ops.py
+- [x] No changes needed — torch op defs use generic `Tensor` type
+
+## Benchmark results
+
+### Scalar GEMV (dense)
+
+#### Baseline (float32 absmax)
+
+CUDA events, WARMUP=50, ITERS=200, fp16, RTX 4090
+
+| shape    |  k |  M |    us |
+|----------|----|----|-------|
+| gateup   |  3 |  1 |  87.5 |
+| gateup   |  3 |  4 | 163.5 |
+| gateup   |  4 |  1 | 117.1 |
+| gateup   |  4 |  4 | 172.7 |
+| down     |  3 |  1 |  80.4 |
+| down     |  3 |  4 | 165.5 |
+| down     |  4 |  1 | 118.9 |
+| down     |  4 |  4 | 186.3 |
+| Q        |  3 |  1 |  36.7 |
+| Q        |  3 |  4 |  64.2 |
+| Q        |  4 |  1 |  38.9 |
+| Q        |  4 |  4 |  65.7 |
+| KV       |  3 |  1 |  36.7 |
+| KV       |  3 |  4 |  35.9 |
+| KV       |  4 |  1 |  36.1 |
+| KV       |  4 |  4 |  36.5 |
+
+#### After change (uint8 E4M4 absmax)
+
+CUDA events, WARMUP=100, ITERS=500, fp16, RTX 4090
+Baseline and uint8 runs done with proper `pip install -e .` for each worktree.
+
+| shape    |  k |  M |  f32(us) |  u8(us) | delta |
+|----------|----|----|----------|---------|-------|
+| gateup   |  3 |  1 |     81.6 |    83.5 |  +2.3% |
+| gateup   |  3 |  4 |    164.1 |   168.7 |  +2.8% |
+| gateup   |  4 |  1 |    104.5 |   101.2 |  -3.2% |
+| gateup   |  4 |  4 |    151.9 |   146.9 |  -3.3% |
+| down     |  3 |  1 |     69.2 |    74.2 |  +7.2% |
+| down     |  3 |  4 |    169.1 |   152.9 |  -9.6% |
+| down     |  4 |  1 |    120.6 |    85.6 | -29.0% |
+| down     |  4 |  4 |    185.4 |   176.6 |  -4.7% |
+| Q        |  3 |  1 |     38.5 |    39.1 |  +1.6% |
+| Q        |  3 |  4 |     60.7 |    72.1 | +18.8% |
+| Q        |  4 |  1 |     37.4 |    40.1 |  +7.2% |
+| Q        |  4 |  4 |     65.7 |    62.9 |  -4.3% |
+| KV       |  3 |  1 |     38.5 |    37.1 |  -3.6% |
+| KV       |  3 |  4 |     35.5 |    37.2 |  +4.8% |
+| KV       |  4 |  1 |     36.3 |    37.7 |  +3.9% |
+| KV       |  4 |  4 |     35.8 |    39.7 | +10.9% |
+
+**Summary**: High variance between runs (up to ~30% swing on some shapes).
+Overall no consistent pattern — performance is essentially equivalent.
+The variance dominates any signal from the absmax format change.
+
+### Grouped scalar GEMV (MoE)
+
+#### Baseline (float32 absmax)
+
+CUDA events, WARMUP=100, ITERS=500, fp16, 8 experts, RTX 4090
+
+| shape    |  k |  M |    us |
+|----------|----|----|-------|
+| moe_gu   |  3 |  1 |  47.8 |
+| moe_gu   |  3 |  4 | 101.8 |
+| moe_gu   |  4 |  1 |  58.3 |
+| moe_gu   |  4 |  4 | 103.6 |
+| moe_dn   |  3 |  1 |  47.2 |
+| moe_dn   |  3 |  4 |  92.7 |
+| moe_dn   |  4 |  1 |  55.0 |
+| moe_dn   |  4 |  4 |  94.2 |
+
+#### After change (uint8 E4M4 absmax)
+
+CUDA events, WARMUP=100, ITERS=500, fp16, 8 experts, RTX 4090
+
+| shape    |  k |  M |  f32(us) |  u8(us) | delta |
+|----------|----|----|----------|---------|-------|
+| moe_gu   |  3 |  1 |     47.8 |    58.3 | +22.0% |
+| moe_gu   |  3 |  4 |    101.8 |    98.8 |  -2.9% |
+| moe_gu   |  4 |  1 |     58.3 |    61.3 |  +5.1% |
+| moe_gu   |  4 |  4 |    103.6 |   102.0 |  -1.5% |
+| moe_dn   |  3 |  1 |     47.2 |    51.9 | +10.0% |
+| moe_dn   |  3 |  4 |     92.7 |    91.3 |  -1.5% |
+| moe_dn   |  4 |  1 |     55.0 |    57.6 |  +4.7% |
+| moe_dn   |  4 |  4 |     94.2 |    92.5 |  -1.8% |
+
+**Summary**: M=4 cases within noise (~+/-3%). M=1 cases show 5-22% regression,
+possibly from E4M4 decode overhead being a larger fraction of work with only
+1 row of FMA. But variance is high — the moe_gu k=3 M=1 outlier (+22%) is
+likely noise since other M=1 shapes show only +5%.