Merged
2 changes: 1 addition & 1 deletion .github/workflows/python-package.yml
@@ -86,7 +86,7 @@ jobs:
cuda: ${{ matrix.cuda_version }}
method: "network"
# The "crt" "nvvm" and "nvptxcompiler" components are added for CUDA 13.
-sub-packages: ${{ format('["nvcc"{0},"cudart","cusparse","cublas","thrust","cublas_dev","cusparse_dev"]', startsWith(matrix.cuda_version, '13.') && ',"crt","nvvm","nvptxcompiler"' || '') }}
+sub-packages: ${{ format('["nvcc"{0},"cudart","cublas","thrust","cublas_dev"]', startsWith(matrix.cuda_version, '13.') && ',"crt","nvvm","nvptxcompiler"' || '') }}
use-github-cache: false
use-local-cache: false
log-file-suffix: ${{matrix.os}}-${{matrix.cuda_version}}.txt
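The `format()` expression above assembles the sub-packages JSON conditionally, appending the extra components only for CUDA 13.x. A Python sketch of the equivalent logic (the helper name is hypothetical, for illustration only):

```python
def sub_packages(cuda_version: str) -> str:
    """Mirror of the workflow expression: the "crt", "nvvm", and
    "nvptxcompiler" components are appended only for CUDA 13.x."""
    extra = ',"crt","nvvm","nvptxcompiler"' if cuda_version.startswith("13.") else ""
    return '["nvcc"{0},"cudart","cublas","thrust","cublas_dev"]'.format(extra)

print(sub_packages("12.4"))  # base component list only
print(sub_packages("13.0"))  # includes crt/nvvm/nvptxcompiler
```

Note that the inserted components land immediately after `"nvcc"`, matching the `{0}` placeholder position in the workflow expression.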
2 changes: 1 addition & 1 deletion .github/workflows/test-runner.yml
@@ -148,7 +148,7 @@ jobs:
with:
cuda: ${{ inputs.cuda_version }}
method: "network"
-sub-packages: '["nvcc","cudart","cusparse","cublas","thrust","nvrtc_dev","cublas_dev","cusparse_dev"]'
+sub-packages: '["nvcc","cudart","cublas","thrust","nvrtc_dev","cublas_dev"]'
use-github-cache: false

# Windows: Setup MSVC (needed for both CPU and CUDA builds)
5 changes: 2 additions & 3 deletions CMakeLists.txt
@@ -349,7 +349,7 @@ endif()

if(BUILD_CUDA)
target_include_directories(bitsandbytes PUBLIC ${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES})
-target_link_libraries(bitsandbytes PUBLIC CUDA::cudart CUDA::cublas CUDA::cublasLt CUDA::cusparse)
+target_link_libraries(bitsandbytes PUBLIC CUDA::cudart CUDA::cublas CUDA::cublasLt)
set_target_properties(bitsandbytes
PROPERTIES
CUDA_SEPARABLE_COMPILATION ON
@@ -369,7 +369,6 @@ if(BUILD_HIP)
endmacro()
find_package_and_print_version(hipblas REQUIRED)
find_package_and_print_version(hiprand REQUIRED)
-find_package_and_print_version(hipsparse REQUIRED)

## hacky way of excluding hip::amdhip64 (with it linked many tests unexpectedly fail e.g. adam8bit because of inaccuracies)
## On Windows, we need to link amdhip64 explicitly
@@ -381,7 +380,7 @@ if(BUILD_HIP)

target_include_directories(bitsandbytes PRIVATE ${CMAKE_SOURCE_DIR} ${CMAKE_SOURCE_DIR}/include ${ROCM_PATH}/include /include)
target_link_directories(bitsandbytes PRIVATE ${ROCM_PATH}/lib /lib)
-target_link_libraries(bitsandbytes PUBLIC roc::hipblas hip::hiprand roc::hipsparse)
+target_link_libraries(bitsandbytes PUBLIC roc::hipblas hip::hiprand)

# On Windows, rocblas is not pulled in transitively by roc::hipblas
# and is needed because ops_hip.cuh uses rocblas_handle directly.
73 changes: 6 additions & 67 deletions agents/api_surface.md
@@ -860,57 +860,7 @@ F.batched_igemm(
Batched int8 matrix multiplication.
**Stability:** Stable (internal).

-### 4.9 Sparse Operations
-
-#### `COOSparseTensor`
-
-```python
-class F.COOSparseTensor:
-    def __init__(self, rows, cols, nnz, rowidx, colidx, values): ...
-```
-
-**Stability:** Legacy — used internally for sparse decomposition.
-
-#### `CSRSparseTensor` / `CSCSparseTensor`
-
-Similar sparse tensor containers.
-**Stability:** Legacy.
-
-#### `coo_zeros`
-
-```python
-F.coo_zeros(rows, cols, nnz, device, dtype=torch.half) -> COOSparseTensor
-```
-
-#### `coo2csr` / `coo2csc`
-
-```python
-F.coo2csr(cooA: COOSparseTensor) -> CSRSparseTensor
-F.coo2csc(cooA: COOSparseTensor) -> CSCSparseTensor
-```
-
-#### `spmm_coo`
-
-```python
-F.spmm_coo(
-    cooA: COOSparseTensor, B: torch.Tensor,
-    out: Optional[torch.Tensor] = None,
-) -> torch.Tensor
-```
-
-Sparse matrix-dense matrix multiply using cusparse.
-**Stability:** Legacy.
-
-#### `spmm_coo_very_sparse`
-
-```python
-F.spmm_coo_very_sparse(cooA, B, dequant_stats=None, out=None) -> torch.Tensor
-```
-
-Optimized for very sparse matrices with custom kernel.
-**Stability:** Legacy.
-
-### 4.10 Paged Memory
+### 4.9 Paged Memory

#### `get_paged`

@@ -930,7 +880,7 @@ F.prefetch_tensor(A: torch.Tensor, to_cpu: bool = False) -> None
Prefetch a paged tensor to GPU or CPU.
**Stability:** Stable (internal).

-### 4.11 CPU-Specific Functions
+### 4.10 CPU-Specific Functions

#### `_convert_weight_packed_for_cpu`

@@ -963,7 +913,7 @@ F.has_avx512bf16() -> bool
Detects AVX512BF16 CPU support.
**Stability:** Internal but may be useful externally.

-### 4.12 Utility Functions
+### 4.11 Utility Functions

#### `is_on_gpu`

@@ -983,7 +933,7 @@ F.get_ptr(A: Optional[Tensor]) -> Optional[ct.c_void_p]
Gets the data pointer of a tensor for ctypes calls.
**Stability:** Internal.

-### 4.13 Singleton Managers
+### 4.12 Singleton Managers

#### `GlobalPageManager`

@@ -1003,15 +953,6 @@ F.CUBLAS_Context.get_instance() -> CUBLAS_Context
Manages cuBLAS context handles per device.
**Stability:** Internal.

-#### `Cusparse_Context`
-
-```python
-F.Cusparse_Context.get_instance() -> Cusparse_Context
-```
-
-Manages cusparse context handle.
-**Stability:** Internal.
-
---

## 5. Autograd Functions
@@ -1234,7 +1175,7 @@ bitsandbytes.utils.replace_linear(
| Class | Description |
|-------|-------------|
| `BNBNativeLibrary` | Base wrapper for the ctypes-loaded native library |
-| `CudaBNBNativeLibrary` | CUDA-specific subclass (sets up context/cusparse/managed ptr) |
+| `CudaBNBNativeLibrary` | CUDA-specific subclass (sets up context/managed ptr) |
| `ErrorHandlerMockBNBNativeLibrary` | Fallback mock that defers error messages to call time |

### Module-level symbols
@@ -1396,11 +1337,9 @@ A PR that changes any of these symbols MUST consider downstream impact:

- `bitsandbytes.cextension.*` (native library loading)
- `bitsandbytes.functional.get_ptr`, `is_on_gpu`, `_get_tensor_stream`
-- `bitsandbytes.functional.GlobalPageManager`, `CUBLAS_Context`, `Cusparse_Context`
+- `bitsandbytes.functional.GlobalPageManager`, `CUBLAS_Context`
- `bitsandbytes.functional._convert_weight_packed_for_cpu*`
- `bitsandbytes.functional.check_matmul`, `elementwise_func`, `fill`, `_mul`
-- `bitsandbytes.functional.spmm_coo`, `spmm_coo_very_sparse`
-- `bitsandbytes.functional.COOSparseTensor`, `CSRSparseTensor`, `CSCSparseTensor`
- `bitsandbytes.utils.pack_dict_to_tensor`, `unpack_tensor_to_dict`
- `bitsandbytes.utils.execute_and_return`, `sync_gpu`
- `bitsandbytes.optim.optimizer.MockArgs`
4 changes: 2 additions & 2 deletions agents/architecture_guide.md
@@ -962,8 +962,8 @@ The `COMPUTE_BACKEND` CMake variable selects the target:
| Backend | Library name | Languages | Dependencies |
|---|---|---|---|
| `cpu` | `libbitsandbytes_cpu.so` | C++17 | OpenMP (optional) |
-| `cuda` | `libbitsandbytes_cuda{VER}.so` | C++17 + CUDA | cudart, cublas, cublasLt, cusparse |
-| `hip` | `libbitsandbytes_rocm{VER}.so` | C++17 + HIP | hipblas, hiprand, hipsparse |
+| `cuda` | `libbitsandbytes_cuda{VER}.so` | C++17 + CUDA | cudart, cublas, cublasLt |
+| `hip` | `libbitsandbytes_rocm{VER}.so` | C++17 + HIP | hipblas, hiprand |
| `mps` | `libbitsandbytes_mps.dylib` | C++17 + ObjC++ | Metal framework |
| `xpu` | `libbitsandbytes_xpu.so` | C++20 + SYCL | Intel oneAPI |
3 changes: 1 addition & 2 deletions agents/code_standards.md
@@ -152,7 +152,7 @@ class GlobalOptimManager:
```

This pattern is used by: `GlobalOptimManager`, `GlobalPageManager`, `CUBLAS_Context`,
-`Cusparse_Context`, `GlobalOutlierPooler`, `OutlierTracer`.
+`GlobalOutlierPooler`, `OutlierTracer`.

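A minimal sketch of the `get_instance()` singleton pattern these classes share (illustrative only, not the library's exact code):

```python
class SingletonSketch:
    """Hypothetical stand-in for GlobalOptimManager, GlobalPageManager, etc."""

    _instance = None

    def __init__(self):
        # Direct construction is disallowed; callers must use get_instance().
        raise RuntimeError("call get_instance() instead")

    @classmethod
    def get_instance(cls):
        # Lazily create the one shared instance, bypassing __init__.
        if cls._instance is None:
            cls._instance = cls.__new__(cls)
            cls._instance.initialize()
        return cls._instance

    def initialize(self):
        # The real classes set up per-device handles/state here.
        self.initialized = True
```

Repeated calls return the same object, so state such as context handles is created once per process.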
---

@@ -867,7 +867,6 @@ Use the project's error checking macros:

```cpp
CUDA_CHECK_RETURN(cudaMemcpy(...));
-CHECK_CUSPARSE(cusparseCreate(...));
```

The `checkCublasStatus` function returns an error code rather than throwing — the Python
4 changes: 2 additions & 2 deletions agents/issue_patterns.md
@@ -34,9 +34,9 @@ These are the single largest category of issues. Most are environment problems o
>
> If you're still hitting problems on the **latest** bitsandbytes (v0.45+), please open a new issue with the output of `python -m bitsandbytes` and your environment details.

-### Missing `libcusparse.so.11` / shared library mismatch
+### Missing shared CUDA library / shared library mismatch

-**How to identify:** `OSError: libcusparse.so.11: cannot open shared object file: No such file or directory`. Or similar errors for `libcusparse.so.12`, `libcublasLt.so.11`, etc.
+**How to identify:** `OSError: libcublasLt.so.11: cannot open shared object file: No such file or directory`. Or similar errors for `libcudart`, `libcublas`, etc.

**What happened:** The bnb binary was compiled against one CUDA version (e.g., 11.x) but the system only has another (e.g., 12.x). The shared library dependencies don't exist. Modern releases ship platform-specific wheels with better CUDA version detection and multiple binary variants.

Expand Down
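A quick way to triage this class of failure from Python is to try loading the library directly and inspect the `OSError` (a diagnostic sketch; the library names below are examples, not an exhaustive list):

```python
import ctypes

def probe(libname: str) -> str:
    """Try to dlopen a shared library and report the result."""
    try:
        ctypes.CDLL(libname)
        return f"{libname}: loaded"
    except OSError as exc:
        # Typical message: "cannot open shared object file: No such file or directory"
        return f"{libname}: {exc}"

for name in ("libcudart.so.12", "libcublasLt.so.12"):
    print(probe(name))
```

If a library fails to load, check that the installed CUDA toolkit version matches what the bitsandbytes binary was compiled against.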
2 changes: 1 addition & 1 deletion bitsandbytes/backends/cuda/ops.py
@@ -171,7 +171,7 @@ def _(
A: torch.Tensor,
threshold=0.0,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, Optional[torch.Tensor]]:
-# Use CUDA kernel for rowwise and COO tensor
+# Use CUDA kernel for rowwise quant and outlier column detection
quant_row, row_stats, outlier_cols = torch.ops.bitsandbytes.int8_vectorwise_quant.default(
A,
threshold=threshold,
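The updated comment describes rowwise int8 quantization combined with threshold-based outlier-column detection. A plain-PyTorch reference sketch of that idea (not the library's CUDA kernel; the function name is hypothetical and rows are assumed nonzero):

```python
import torch

def int8_rowwise_quant_sketch(A: torch.Tensor, threshold: float = 0.0):
    """Rowwise absmax int8 quantization; when threshold > 0, columns
    containing any |value| above the threshold are flagged as outliers."""
    absA = A.abs().float()
    row_stats = absA.amax(dim=1)  # per-row absmax scale
    quant = (
        torch.round(A.float() * (127.0 / row_stats.unsqueeze(1)))
        .clamp_(-127, 127)
        .to(torch.int8)
    )
    outlier_cols = None
    if threshold > 0.0:
        # A column is an outlier column if any element in it exceeds the threshold.
        outlier_cols = torch.nonzero(absA.gt(threshold).any(dim=0)).flatten()
    return quant, row_stats, outlier_cols
```

In the real kernel the outlier columns are held back in higher precision during int8 matmul, which is what the sparse decomposition path previously used the COO containers for.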
1 change: 0 additions & 1 deletion bitsandbytes/cextension.py
@@ -90,7 +90,6 @@ class CudaBNBNativeLibrary(BNBNativeLibrary):
def __init__(self, lib: ct.CDLL):
super().__init__(lib)
lib.get_context.restype = ct.c_void_p
-lib.get_cusparse.restype = ct.c_void_p
lib.cget_managed_ptr.restype = ct.c_void_p


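The deleted line removed the cusparse handle setup; the surviving lines show why each pointer-returning native function needs an explicit `restype`. A standalone sketch of the pattern, using libc's `malloc` for illustration (assumes a Linux/macOS process where `CDLL(None)` exposes the C runtime):

```python
import ctypes

# CDLL(None) loads the current process's own symbol table,
# which includes the C runtime's malloc/free on Linux and macOS.
libc = ctypes.CDLL(None)

# Without an explicit restype, ctypes assumes a C int return value and
# can truncate a 64-bit pointer. Declaring c_void_p — as the library does
# for get_context and cget_managed_ptr — preserves the full pointer width.
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

ptr = libc.malloc(64)
assert ptr is not None
libc.free(ptr)
```

With `restype = c_void_p`, the call returns the address as a Python `int`, which can be passed back into other ctypes calls safely.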