
Commit c1666aa

TimDettmers and claude committed:

Add testing guide with benchmark data and update CLAUDE.md

Documents optimal test parallelization (`-n 4`) based on benchmarks across two machines (8-core/RTX 4090 and 32-core/RTX PRO 6000 Blackwell). Includes CPU/GPU utilization analysis, known Blackwell-specific test failures, and practical pytest recommendations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent 5ea3c89 · commit c1666aa

File tree

5 files changed: +437 additions, -0 deletions

.gitignore (2 additions, 0 deletions)

```diff
@@ -156,3 +156,5 @@ dmypy.json
 dependencies
 cuda_build
 output/
+cuda-spec.md
+cuda-spec-additions.md
```
CLAUDE.md (21 additions, 0 deletions)

# Parallel sessions

To work on multiple branches at once, use git worktrees:

```bash
git worktree add ../bitsandbytes-<branch-name> -b <branch-name>
cd ../bitsandbytes-<branch-name>
claude
```

Full guide: `agents/worktree_guide.md`

# Testing

Run the test suite with 4 parallel workers (optimal for any machine):

```bash
pytest tests/ -v --tb=short -n 4
```

Best practices, benchmark data, and known architecture-specific issues: `agents/testing_guide.md`

COMPILE_H100_L40.md (113 additions, 0 deletions)

# Compiling bitsandbytes for H100 and L40 GPUs

This guide shows how to compile bitsandbytes from source specifically optimized for NVIDIA H100 and L40 GPUs.

## Prerequisites

- CMake >= 3.22.1
- Python >= 3.9
- GCC (version 9+ recommended)
- CUDA Toolkit (11.8+)
- PyTorch with CUDA support

Verify your system:

```bash
cmake --version
python3 --version
gcc --version
nvcc --version
```

## GPU Compute Capabilities

- **L40**: Compute Capability 8.9 (sm_89)
- **H100**: Compute Capability 9.0 (sm_90)

## Compilation Steps

### 1. Clean any previous build configuration

```bash
cd /path/to/bitsandbytes
rm -rf CMakeCache.txt CMakeFiles/ build/
```

### 2. Configure CMake for H100 and L40

```bash
cmake -DCOMPUTE_BACKEND=cuda -DCOMPUTE_CAPABILITY="89;90" -S .
```

This configures the build to target only compute capabilities 89 (L40) and 90 (H100), significantly reducing compilation time compared to building for all architectures.
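If you are targeting other GPUs, the flag value can be derived from the compute capabilities `nvidia-smi` reports. A minimal sketch, assuming the `caps` variable holds the output of `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` (one dotted version per GPU; the values below are examples):

```shell
# Convert dotted compute capabilities (e.g. "8.9", "9.0") into the
# semicolon-separated list that -DCOMPUTE_CAPABILITY expects ("89;90").
# On a real machine, fill caps from nvidia-smi instead of hardcoding it.
caps="8.9
9.0"
flag=$(printf '%s\n' "$caps" | tr -d '.' | sort -u | paste -sd ';' -)
echo "$flag"   # 89;90
```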
### 3. Compile the library

```bash
make -j$(nproc)
```

This will create `bitsandbytes/libbitsandbytes_cuda<VERSION>.so`, where `<VERSION>` matches your CUDA Toolkit version (e.g., `cuda124` for CUDA 12.4).

### 4. Install the package

```bash
pip install -e .
```

Use the `-e` flag for an editable/development install, or omit it for a regular installation.

### 5. Handle PyTorch CUDA version mismatch (if needed)

If your PyTorch was compiled with a different CUDA version than your Toolkit, you may need to create a symlink:

```bash
# Example: PyTorch uses CUDA 12.8, but you compiled with CUDA 12.4
ln -sf libbitsandbytes_cuda124.so bitsandbytes/libbitsandbytes_cuda128.so
```

Alternatively, set the environment variable:

```bash
export BNB_CUDA_VERSION=124  # Use your compiled CUDA version
```
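The numeric value is just the toolkit version with the dot stripped. A sketch, assuming a bash shell (the version string is an example; substitute the release reported by `nvcc --version`):

```shell
# "12.4" -> "124": remove the dot to get the BNB_CUDA_VERSION value.
toolkit_version="12.4"
export BNB_CUDA_VERSION="${toolkit_version//./}"
echo "$BNB_CUDA_VERSION"   # 124
```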
### 6. Verify installation

```bash
python3 -c "import bitsandbytes as bnb; print(f'bitsandbytes version: {bnb.__version__}'); print('Success!')"
```

## Expected Output

After compilation, you should see:

- Binary file: `bitsandbytes/libbitsandbytes_cuda<VERSION>.so` (approximately 7 MB when targeting only sm_89 and sm_90)
- Successful import in Python with no errors

## Compilation Time

Building for only H100/L40 (2 architectures) takes approximately **1-2 minutes**, compared to **5+ minutes** when building for all 14+ compute capabilities.

## Troubleshooting

### Warning messages during compilation

Warnings like "variable declared but never referenced" are harmless and can be ignored.

### Wrong CUDA binary error

If you see `Configured CUDA binary not found`, check that:

1. The compiled `.so` file exists in the `bitsandbytes/` directory
2. The CUDA version matches, or create a symlink as shown in step 5
3. If needed, the `BNB_CUDA_VERSION` environment variable is set to override the version

### CUDA version check

```bash
# Check your CUDA Toolkit version
nvcc --version

# Check PyTorch CUDA version
python3 -c "import torch; print(torch.version.cuda)"
```

## Notes

- The compiled library will **only work on GPUs with compute capability 8.9 or 9.0** (L40 and H100)
- For other GPUs, you'll need to recompile with the appropriate compute capabilities
- The `-DCOMPUTE_CAPABILITY` flag accepts a semicolon-separated list, e.g. `"75;80;89;90"` for T4, A100, L40, and H100

agents/testing_guide.md (127 additions, 0 deletions)

# Testing Guide for bitsandbytes

## Quick Start

Run the full test suite with optimal parallelization:

```bash
pytest tests/ -v --tb=short -n 4
```

`-n 4` (4 pytest-xdist workers) is the recommended default for any machine.

## Why 4 Workers?

Benchmarks across two machines with very different hardware show that `-n 4` is consistently the fastest configuration. Going higher provides no benefit and often makes things worse.

### Benchmark Data

**Machine A:** AMD Threadripper 1900X (8 cores / 16 threads), RTX 4090 (24 GB), CUDA 12.4

| Workers | Wall Time | Speedup vs n=1 | Avg CPU | Avg GPU | Failures |
|---------|-----------|----------------|---------|---------|----------|
| 1       | 1319s     | 1.00x          | 32.5%   | 3.4%    | 0        |
| **4**   | **565s**  | **2.33x**      | 70.5%   | 12.9%   | 0        |
| 6       | 588s      | 2.24x          | 74.8%   | 10.9%   | 7 (OOM)  |
| 8       | 570s      | 2.31x          | 87.9%   | 12.5%   | 7 (OOM)  |

**Machine B:** AMD Threadripper PRO 9975WX (32 cores / 64 threads), RTX PRO 6000 Blackwell (98 GB), CUDA 13.0

| Workers | Wall Time | Speedup vs n=1 | Avg CPU | Avg GPU | Failures |
|---------|-----------|----------------|---------|---------|----------|
| 1       | 428s      | 1.00x          | 13.4%   | 3.1%    | 25*      |
| **4**   | **322s**  | **1.33x**      | 75.3%   | 5.7%    | 25*      |
| 8       | 578s      | 0.74x (slower) | 91.9%   | 3.5%    | 25*      |
| 16      | 566s      | 0.76x (slower) | 97.0%   | 6.2%    | 25*      |
| 24      | 560s      | 0.76x (slower) | 97.2%   | 6.2%    | 40       |

\* Blackwell-specific failures unrelated to worker count (see Known Issues below).

### Analysis

- **GPU utilization stays very low** (3-13%) regardless of worker count. The tests are primarily CPU-bound: short GPU kernel bursts interleaved with Python/numpy work for test setup, tensor creation, and result validation.
- **4 workers is the sweet spot** because it balances overlapping CPU prep with GPU execution across workers. Each worker can prepare data while another waits on a GPU kernel.
- **Beyond 4 workers, overhead dominates.** Additional workers add pytest-xdist coordination costs and per-worker CUDA context overhead without meaningful GPU throughput gain. On Machine B, `-n 8` was nearly 2x slower than `-n 4`, even though `-n 4` left roughly a quarter of the CPU idle (75.3% average utilization).
- **Per-core CPU speed matters more than core count.** Machine B is 3.1x faster single-threaded (Zen 5 vs Zen 1). Having 4x more cores provided no additional benefit at the optimal worker count.
- **GPU memory affects reliability, not speed.** More free VRAM avoids OOM failures at higher worker counts but does not improve throughput.
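The speedup columns above are plain wall-time ratios, and the single-thread comparison between machines comes from the `n=1` rows; a quick check with the numbers from the tables:

```shell
# Machine A: 1319 s at n=1 vs 565 s at n=4 -> speedup at n=4.
awk 'BEGIN { printf "%.2fx\n", 1319 / 565 }'   # 2.33x
# Single-threaded, Machine B vs Machine A: 1319 s vs 428 s.
awk 'BEGIN { printf "%.1fx\n", 1319 / 428 }'   # 3.1x
```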
### What About More/Fewer Workers?

| Situation | Recommendation |
|-----------|----------------|
| Default | `-n 4` |
| Low GPU memory (<8 GB free) | `-n 2` to avoid OOM |
| Running a subset of tests | `-n 4` still fine |
| Single specific test | No `-n` flag needed |
| CI environment | `-n 4` |

## Useful pytest Options

```bash
# Full suite, optimal speed
pytest tests/ -v --tb=short -n 4

# With timing breakdown of slowest tests
pytest tests/ -v --tb=short -n 4 --durations=20

# Run a specific test file
pytest tests/test_functional.py -v --tb=short -n 4

# Run tests matching a keyword
pytest tests/ -v --tb=short -n 4 -k "4bit"

# Stop on first failure
pytest tests/ -v --tb=short -n 4 -x

# Single worker (debugging, deterministic output)
pytest tests/ -v --tb=long
```

## Test Suite Characteristics

The full suite has ~7500 parametrized tests. Most of the wall-clock time is consumed by a small number of test functions with many parametrizations:

- **`test_gemv_4bit`** dominates (~70% of total time) with 1500+ parameter combinations. CPU variants at dim=1024 take 16-20s each; CUDA variants finish in ~0.05s.
- **`test_functional.py`** alone accounts for ~80% of total test time.
- **CPU tests are the bottleneck:** 81% of total time despite being only 37% of the test count.
- **87% of individual tests finish under 1 second**, but the remaining 13% consume 80% of wall-clock time.

## Known Issues by Architecture

### Blackwell (sm_120, e.g. RTX PRO 6000)

25 tests fail on Blackwell as of the `main` branch (Feb 2026):

1. **Int8 batched matmul (`test_ibmm`): 16 failures.** cuBLAS returns `CUBLAS_STATUS_NOT_SUPPORTED` (status 15) for the int8 batched GEMM path on Blackwell; the legacy cuBLAS int8 API is not supported on sm_120. These tests produce garbage output (100% element mismatch). A fix would require migrating to cublasLt or a different int8 GEMM implementation.

2. **FP4 quantization at blocksize=256: 9 failures.** Relative error is marginally above the threshold (e.g., 0.29091 vs. a limit of 0.2908). This affects only the `fp4` quant type at `blocksize=256` on CUDA, across all dtypes (fp32, fp16, bf16); `nf4` and other blocksizes pass. It is a minor numerical difference in fp4 dequantization, likely caused by different floating-point rounding behavior on Blackwell.
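To put "marginally above the threshold" in numbers, the overshoot implied by the error and limit quoted above is about one part in ten thousand:

```shell
# fp4 failure margin: observed relative error 0.29091 vs test limit 0.2908.
awk 'BEGIN { d = 0.29091 - 0.2908; printf "%.5f (%.2f%% over the limit)\n", d, d / 0.2908 * 100 }'
```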
### Ada Lovelace (sm_89, e.g. RTX 4090)

No architecture-specific failures. All tests pass with `-n 4`.

## Build Before Testing

Tests require a compiled native library matching your GPU and CUDA toolkit. See `COMPILE_H100_L40.md` for build instructions. Quick version:

```bash
# Find your GPU's compute capability
nvidia-smi --query-gpu=compute_cap --format=csv,noheader

# Build (replace 89 with your compute capability, e.g. 120 for Blackwell)
cmake -DCOMPUTE_BACKEND=cuda -DCOMPUTE_CAPABILITY="89" -S . -B build
cmake --build build -j$(nproc)

# If your CUDA toolkit version differs from PyTorch's CUDA version, create a symlink:
# e.g., toolkit is 12.4 but PyTorch expects 12.8:
ln -sf libbitsandbytes_cuda124.so bitsandbytes/libbitsandbytes_cuda128.so

# Install in editable mode
pip install -e .
```

## Test Dependencies

```bash
pip install einops lion-pytorch pytest pytest-xdist scipy transformers
```
