| 1 | +# Agent Testing Prompt for ROCm 4-bit Kernel Optimizations |
| 2 | + |
| 3 | +Copy everything below the line and paste it into a Cursor Agent session on the target machine. |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +I'm testing a ROCm 4-bit kernel optimization branch for bitsandbytes on this machine. The branch combines kernel-level optimizations with a fused multi-batch dispatch that is expected to improve vLLM serving throughput by 2-5x at 2-8 concurrent requests. |
| 8 | + |
| 9 | +## Setup |
| 10 | + |
| 11 | +```bash |
| 12 | +git clone https://github.com/sstamenk/bitsandbytes |
| 13 | +cd bitsandbytes |
| 14 | +git checkout rocm-4bit-kernel-optimization |
| 15 | +``` |
| 16 | + |
| 17 | +Read TESTING_OPTIMIZATIONS.md and OPTIMIZATION_REPORT.md for full context on what was changed, why, and reference results from gfx1151. |
| 18 | + |
| 19 | +## What to do |
| 20 | + |
| 21 | +Execute all 3 phases (A, B, C) from TESTING_OPTIMIZATIONS.md in order. Each phase tests a different configuration: |
| 22 | + |
| 23 | +- Phase A: Baseline (upstream kernel, no changes) -- stash all changes, rebuild, test |
| 24 | +- Phase B: Kernel optimizations only (FUSED_4BIT_M_LIMIT=1) -- optimized kernel but M>1 still uses dequant+GEMM |
| 25 | +- Phase C: Full optimization (FUSED_4BIT_M_LIMIT=16) -- fused kernel for M<=16, critical for vLLM serving |
| 26 | + |
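The difference between Phase B and Phase C comes down to which path a batch of M rows takes. A minimal sketch of that rule, assuming the dispatch is a plain `M <= FUSED_4BIT_M_LIMIT` comparison as the phase descriptions suggest (`select_4bit_path` is a hypothetical helper for illustration, not a function in bitsandbytes):

```python
def select_4bit_path(m: int, fused_limit: int) -> str:
    """Return the path a batch of m rows would take for a given limit.

    Assumption: batches with m <= fused_limit use the fused 4-bit kernel,
    larger batches fall back to dequantize + GEMM.
    """
    if m < 1:
        raise ValueError("batch size M must be at least 1")
    return "fused" if m <= fused_limit else "dequant+gemm"


if __name__ == "__main__":
    # Compare the Phase B limit (1) with the Phase C limit (16).
    for m in (1, 2, 8, 16, 24, 64):
        print(m, select_4bit_path(m, 1), select_4bit_path(m, 16))
```

Under this reading, M=2 through M=16 are the only batch sizes whose path changes between Phase B and Phase C, which is why the low-concurrency rows are where the improvement is expected.
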
| 27 | +### Build instructions |
| 28 | + |
| 29 | +```bash |
| 30 | +# Find your GPU arch |
| 31 | +rocminfo | grep -o -m1 "gfx[0-9a-f]*" |
| 32 | +# or: python -c "import torch; print(torch.cuda.get_device_properties(0).gcnArchName)" |
| 33 | + |
| 34 | +# Build (replace gfx arch and ROCm version as needed) |
| 35 | +cmake -B build -DBUILD_HIP=ON -DBNB_ROCM_ARCH="<your_arch>" -DROCM_VERSION="<version>" |
| 36 | +cmake --build build -j$(nproc) |
| 37 | +``` |
| 38 | + |
| 39 | +If you get a missing header error for hip_bfloat16.h, add `-DCMAKE_CXX_FLAGS="-I/opt/rocm/include"` to the cmake configure step. |
| 40 | + |
| 41 | +### For each phase, run these tests: |
| 42 | + |
| 43 | +1. **Correctness** (must pass -- any failure here is a regression): |
| 44 | +```bash |
| 45 | +python -m pytest tests/test_ops.py -k "test_gemv_4bit" -v |
| 46 | +python -m pytest tests/test_linear4bit.py -v |
| 47 | +``` |
| 48 | + |
| 49 | +2. **Kernel microbenchmark**: |
| 50 | +```bash |
| 51 | +python bench_quick.py |
| 52 | +``` |
| 53 | + |
| 54 | +3. **vLLM serving** (requires vLLM installed with ROCm support): |
| 55 | +```bash |
| 56 | +export PYTHONPATH=<venv>/lib/python3.12/site-packages/_rocm_sdk_devel/share/amd_smi |
| 57 | +export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE |
| 58 | +export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 |
| 59 | + |
| 60 | +# For Phase A and B: |
| 61 | +python bench_vllm_sweep.py --limit 1 --model mistral7b |
| 62 | + |
| 63 | +# For Phase C: |
| 64 | +python bench_vllm_sweep.py --limit 16 --model mistral7b |
| 65 | +python bench_vllm_sweep.py --limit 16 --model llama8b |
| 66 | +``` |
| 67 | + |
| 68 | +The vLLM sweep tests concurrency 1,2,4,8,12,16,24,32,48,64. Pay special attention to: |
| 69 | +- reqs=2-8: expect 2-5x improvement in Phase C vs Phase A/B |
| 70 | +- reqs=24-64: must NOT regress vs Phase A (these use the split path in both) |
| 71 | + |
| 72 | +To change the dispatch threshold between phases, edit `FUSED_4BIT_M_LIMIT` in `bitsandbytes/autograd/_functions.py` before each vLLM run. vLLM starts worker processes that import the module themselves and read the constant at import time, so the source file must be edited; a monkey-patch applied in the parent process does not reach the workers. |
| 73 | + |
| 74 | +### Phase transitions |
| 75 | + |
| 76 | +Between Phase A and B: |
| 77 | +```bash |
| 78 | +git stash pop # restore optimized code |
| 79 | +cmake --build build -j$(nproc) |
| 80 | +sed -i "s/FUSED_4BIT_M_LIMIT = [0-9]*/FUSED_4BIT_M_LIMIT = 1/" bitsandbytes/autograd/_functions.py |
| 81 | +``` |
| 82 | + |
| 83 | +Between Phase B and C: |
| 84 | +```bash |
| 85 | +sed -i "s/FUSED_4BIT_M_LIMIT = [0-9]*/FUSED_4BIT_M_LIMIT = 16/" bitsandbytes/autograd/_functions.py |
| 86 | +``` |
| 87 | + |
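If `sed` is unavailable or a silent no-op is a concern, the same edit can be sketched in Python. This assumes the constant is a module-level assignment of the form `FUSED_4BIT_M_LIMIT = <int>` that appears exactly once; `set_fused_limit` is a hypothetical helper, not part of the repository.

```python
import re

# Assumption: the constant is assigned once, at the start of a line.
_LIMIT_RE = re.compile(r"^(FUSED_4BIT_M_LIMIT\s*=\s*)\d+", re.MULTILINE)


def set_fused_limit(source: str, limit: int) -> str:
    """Return source with the FUSED_4BIT_M_LIMIT assignment set to limit.

    Raises ValueError unless exactly one assignment is found, so a
    missing or duplicated constant is not silently ignored.
    """
    new_source, count = _LIMIT_RE.subn(
        lambda match: f"{match.group(1)}{limit}", source
    )
    if count != 1:
        raise ValueError(f"expected 1 FUSED_4BIT_M_LIMIT assignment, found {count}")
    return new_source
```

To apply it, read `bitsandbytes/autograd/_functions.py` with `pathlib.Path.read_text()`, pass the text through `set_fused_limit`, and write the result back with `write_text()` before starting vLLM.
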
| 88 | +## Report format |
| 89 | + |
| 90 | +Produce a report with this structure: |
| 91 | + |
| 92 | +### 1. Environment |
| 93 | +- GPU name and architecture (gfx????) |
| 94 | +- ROCm version, PyTorch version |
| 95 | +- `python -c "import torch; p=torch.cuda.get_device_properties(0); print(f'{p.name} ({p.gcnArchName}), {p.total_memory//1024**3} GB')"` |
| 96 | + |
| 97 | +### 2. Correctness (per phase) |
| 98 | + |
| 99 | +| Phase | test_ops gemv_4bit | test_linear4bit | Status | |
| 100 | +|-------|-------------------|-----------------|--------| |
| 101 | +| A | xx/60 pass | xxx/243 pass | PASS/FAIL | |
| 102 | +| B | xx/60 pass | xxx/243 pass | PASS/FAIL | |
| 103 | +| C | xx/60 pass | xxx/243 pass | PASS/FAIL | |
| 104 | + |
| 105 | +### 3. Kernel Microbenchmark |
| 106 | + |
| 107 | +| Phase | Time (us) | BW (GB/s) | Speedup vs A | |
| 108 | +|-------|-----------|-----------|--------------| |
| 109 | +| A (baseline) | | | 1.00x | |
| 110 | +| B (kernel opt) | | | | |
| 111 | +| C (full opt) | | | | |
| 112 | + |
| 113 | +### 4. vLLM Serving Throughput (tok/s) |
| 114 | + |
| 115 | +| Reqs | Phase A (L=1) | Phase B (L=1) | Phase C (L=16) | C vs A | Regression? | |
| 116 | +|------|---------------|---------------|----------------|--------|-------------| |
| 117 | +| 1 | | | | | | |
| 118 | +| 2 | | | | | | |
| 119 | +| 4 | | | | | | |
| 120 | +| 8 | | | | | | |
| 121 | +| 16 | | | | | | |
| 122 | +| 24 | | | | | | |
| 123 | +| 32 | | | | | | |
| 124 | +| 48 | | | | | | |
| 125 | +| 64 | | | | | | |
| 126 | + |
| 127 | +Flag any row where Phase C is >5% slower than Phase A as a REGRESSION. |
| 128 | + |
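The "C vs A" and "Regression?" columns can be filled in mechanically. A minimal sketch, assuming throughput is recorded per concurrency level in tok/s and that ">5% slower" means Phase C throughput below 95% of Phase A (`compare_phases` and its argument names are illustrative, not part of the benchmark scripts):

```python
def compare_phases(phase_a: dict, phase_c: dict, tolerance: float = 0.05) -> list:
    """Compare throughput (tok/s) per concurrency level.

    Returns a list of (reqs, c_vs_a_ratio, is_regression) tuples, where
    is_regression is True when Phase C is more than `tolerance` slower
    than Phase A at that concurrency.
    """
    rows = []
    for reqs in sorted(phase_a):
        if reqs not in phase_c:
            raise KeyError(f"Phase C has no result for reqs={reqs}")
        ratio = phase_c[reqs] / phase_a[reqs]
        rows.append((reqs, ratio, ratio < 1.0 - tolerance))
    return rows
```

Each argument maps a concurrency level to measured tok/s, for example `{1: ..., 2: ..., 4: ...}`, taken from the sweep output for that phase.
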
| 129 | +### 5. Summary |
| 130 | +- Kernel speedup: Phase B vs A |
| 131 | +- vLLM peak improvement: best Phase C vs A speedup and at which concurrency |
| 132 | +- Any regressions found |
| 133 | +- Is FUSED_4BIT_M_LIMIT=16 safe on this GPU? (yes if no regressions at reqs>16) |