# Testing Guide for bitsandbytes

## Quick Start

Run the full test suite with optimal parallelization:

```bash
pytest tests/ -v --tb=short -n 4
```

`-n 4` (4 pytest-xdist workers) is the recommended default on any machine.

## Why 4 Workers?

Benchmarks across two machines with very different hardware show that `-n 4` is consistently the fastest configuration. Going higher provides no benefit and often makes things worse.

### Benchmark Data

**Machine A:** AMD Threadripper 1900X (8 cores / 16 threads), RTX 4090 (24 GB), CUDA 12.4

| Workers | Wall Time | Speedup vs n=1 | Avg CPU | Avg GPU | Failures |
|---------|-----------|----------------|---------|---------|----------|
| 1 | 1319s | 1.00x | 32.5% | 3.4% | 0 |
| **4** | **565s** | **2.33x** | 70.5% | 12.9% | 0 |
| 6 | 588s | 2.24x | 74.8% | 10.9% | 7 (OOM) |
| 8 | 570s | 2.31x | 87.9% | 12.5% | 7 (OOM) |

**Machine B:** AMD Threadripper PRO 9975WX (32 cores / 64 threads), RTX PRO 6000 Blackwell (98 GB), CUDA 13.0

| Workers | Wall Time | Speedup vs n=1 | Avg CPU | Avg GPU | Failures |
|---------|-----------|----------------|---------|---------|----------|
| 1 | 428s | 1.00x | 13.4% | 3.1% | 25* |
| **4** | **322s** | **1.33x** | 75.3% | 5.7% | 25* |
| 8 | 578s | 0.74x (slower) | 91.9% | 3.5% | 25* |
| 16 | 566s | 0.76x (slower) | 97.0% | 6.2% | 25* |
| 24 | 560s | 0.76x (slower) | 97.2% | 6.2% | 40 |

\* Blackwell-specific failures unrelated to worker count (see Known Issues below).

### Analysis

- **GPU utilization stays very low** (3-13%) regardless of worker count. The tests are primarily CPU-bound: short GPU kernel bursts interleaved with Python/numpy work for test setup, tensor creation, and result validation.
- **4 workers is the sweet spot** because it overlaps CPU-side test preparation with GPU execution: one worker can prepare data while another waits on a GPU kernel.
- **Beyond 4 workers, overhead dominates.** Additional workers add pytest-xdist coordination costs and per-worker CUDA context overhead without any meaningful gain in GPU throughput. On Machine B, `-n 8` was ~1.8x slower than `-n 4` (578s vs 322s), even though `-n 4` still left roughly a quarter of CPU capacity idle.
- **Per-core CPU speed matters more than core count.** Machine B is 3.1x faster single-threaded (Zen 5 vs Zen 1); having 4x more cores provided no additional benefit at the optimal worker count.
- **GPU memory affects reliability, not speed.** More free VRAM avoids OOM failures at higher worker counts but does not improve throughput.

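The speedup columns above follow directly from the wall times (speedup at `n` workers = wall time at `n=1` divided by wall time at `n`); a quick arithmetic sanity check:

```python
# Sanity-check the "Speedup vs n=1" columns from the benchmark tables.
# speedup(n) = wall_time(1) / wall_time(n)

machine_a = {1: 1319, 4: 565, 6: 588, 8: 570}            # wall times in seconds
machine_b = {1: 428, 4: 322, 8: 578, 16: 566, 24: 560}

def speedups(times: dict) -> dict:
    base = times[1]
    return {n: round(base / t, 2) for n, t in times.items()}

print(speedups(machine_a))  # {1: 1.0, 4: 2.33, 6: 2.24, 8: 2.31}
print(speedups(machine_b))  # {1: 1.0, 4: 1.33, 8: 0.74, 16: 0.76, 24: 0.76}
```

Note that anything below 1.0 (Machine B at `-n 8` and beyond) means the run was slower than a single worker.
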
### What About More/Fewer Workers?

| Situation | Recommendation |
|-----------|----------------|
| Default | `-n 4` |
| Low GPU memory (<8 GB free) | `-n 2` to avoid OOM |
| Running a subset of tests | `-n 4` still fine |
| Single specific test | No `-n` flag needed |
| CI environment | `-n 4` |

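As a sketch of the first two rows, the worker count can be derived from free VRAM. The `pick_workers` helper and the 8 GB threshold are illustrative (taken from the table above), not part of the repo; the `nvidia-smi` query is the standard one:

```python
import subprocess

def pick_workers(free_vram_gb: float) -> int:
    # Thresholds from the recommendations table (assumed): drop to 2 workers
    # below ~8 GB of free VRAM to avoid OOM, otherwise use the default of 4.
    return 2 if free_vram_gb < 8 else 4

def free_vram_gb() -> float:
    # Free VRAM of the first GPU in GiB, queried via nvidia-smi.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        text=True,
    )
    return float(out.splitlines()[0]) / 1024  # MiB -> GiB
```

For example, `pick_workers(free_vram_gb())` would return 2 on a card with 6 GB free and 4 on Machine A's RTX 4090.
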
## Useful pytest Options

```bash
# Full suite, optimal speed
pytest tests/ -v --tb=short -n 4

# With timing breakdown of slowest tests
pytest tests/ -v --tb=short -n 4 --durations=20

# Run a specific test file
pytest tests/test_functional.py -v --tb=short -n 4

# Run tests matching a keyword
pytest tests/ -v --tb=short -n 4 -k "4bit"

# Stop on first failure
pytest tests/ -v --tb=short -n 4 -x

# Single worker (debugging, deterministic output)
pytest tests/ -v --tb=long
```

## Test Suite Characteristics

The full suite has ~7500 parametrized tests. Most of the wall-clock time is consumed by a small number of test functions with many parametrizations:

- **`test_gemv_4bit`** dominates (~70% of total time) with 1500+ combinations. CPU variants at dim=1024 take 16-20s each; CUDA variants finish in ~0.05s.
- **`test_functional.py`** alone accounts for ~80% of total test time.
- **CPU tests are the bottleneck**: 81% of total time despite being only 37% of test count.
- **87% of individual tests finish under 1 second**, but the remaining 13% consume 80% of wall-clock time.

## Known Issues by Architecture

### Blackwell (sm_120, e.g. RTX PRO 6000)

25 tests fail on Blackwell as of the `main` branch (Feb 2026):

1. **Int8 batched matmul (`test_ibmm`) - 16 failures**: cuBLAS returns `CUBLAS_STATUS_NOT_SUPPORTED` (status 15) for the int8 batched GEMM path on Blackwell. The legacy cuBLAS int8 API is not supported on sm_120. These tests produce garbage output (100% element mismatch). A fix would require migrating to cublasLt or a different int8 GEMM implementation.

2. **FP4 quantization at blocksize=256 - 9 failures**: Relative error is marginally above the threshold (e.g., 0.29091 vs limit of 0.2908). Only affects `fp4` at `blocksize=256` on CUDA across all dtypes (fp32, fp16, bf16). The `nf4` quant type and other blocksizes pass. This is a minor numerical difference in fp4 dequantization, likely caused by different FP rounding behavior on Blackwell.

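Until these are fixed upstream, the int8 batched-matmul group can be excluded with pytest's `-k` expression so the rest of the suite runs clean (the fp4 failures are parametrized cases of a shared test and are harder to deselect by name):

```bash
# Skip the known-failing int8 batched matmul tests on Blackwell;
# adjust or remove the -k expression as upstream fixes land.
pytest tests/ -v --tb=short -n 4 -k "not test_ibmm"
```
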
### Ada Lovelace (sm_89, e.g. RTX 4090)

No architecture-specific failures. All tests pass with `-n 4`.

## Build Before Testing

Tests require a compiled native library matching your GPU and CUDA toolkit. See `COMPILE_H100_L40.md` for build instructions. Quick version:

```bash
# Find your GPU's compute capability
nvidia-smi --query-gpu=compute_cap --format=csv,noheader

# Build (replace 89 with your compute capability, e.g. 120 for Blackwell)
cmake -DCOMPUTE_BACKEND=cuda -DCOMPUTE_CAPABILITY="89" -S . -B build
cmake --build build -j$(nproc)

# If your CUDA toolkit version differs from PyTorch's CUDA version, create a symlink:
# e.g., toolkit is 12.4 but PyTorch expects 12.8:
ln -sf bitsandbytes/libbitsandbytes_cuda124.so bitsandbytes/libbitsandbytes_cuda128.so

# Install in editable mode
pip install -e .
```

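Before kicking off the suite, it can help to confirm the freshly built library actually loads (assuming the editable install above succeeded; `python -m bitsandbytes` runs the library's own environment diagnostics):

```bash
# Quick smoke test of the build: import the package and print its version,
# then run bitsandbytes' built-in diagnostic entry point.
python -c "import bitsandbytes as bnb; print(bnb.__version__)"
python -m bitsandbytes
```
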
## Test Dependencies

```bash
pip install einops lion-pytorch pytest pytest-xdist scipy transformers
```