Commit 13838fb

Add copy-paste agent testing prompt for other machines
Made-with: Cursor
1 parent eccb81f commit 13838fb

1 file changed

Lines changed: 133 additions & 0 deletions

AGENT_TESTING_PROMPT.md

@@ -0,0 +1,133 @@
# Agent Testing Prompt for ROCm 4-bit Kernel Optimizations

Copy everything below the line and paste it into a Cursor Agent session on the target machine.

---

I'm testing a ROCm 4-bit kernel optimization branch for bitsandbytes on this machine. The branch has kernel-level optimizations and a fused multi-batch dispatch that dramatically improves vLLM serving throughput.

## Setup

```bash
git clone https://github.com/sstamenk/bitsandbytes
cd bitsandbytes
git checkout rocm-4bit-kernel-optimization
```

Read TESTING_OPTIMIZATIONS.md and OPTIMIZATION_REPORT.md for full context on what was changed, why, and reference results from gfx1151.

## What to do

Execute all 3 phases (A, B, C) from TESTING_OPTIMIZATIONS.md in order. Each phase tests a different configuration:

- Phase A: Baseline (upstream kernel, no changes) -- stash all changes, rebuild, test
- Phase B: Kernel optimizations only (FUSED_4BIT_M_LIMIT=1) -- optimized kernel but M>1 still uses dequant+GEMM
- Phase C: Full optimization (FUSED_4BIT_M_LIMIT=16) -- fused kernel for M<=16, critical for vLLM serving
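
The difference between Phases B and C comes down to a single threshold. The following is a minimal sketch of the dispatch rule as the phase descriptions state it, assuming the batch dimension M is compared against `FUSED_4BIT_M_LIMIT` with `<=`. The function name `dispatch_path` is hypothetical and is not part of bitsandbytes; the actual dispatch code on the branch may differ.

```python
def dispatch_path(m: int, limit: int) -> str:
    """Return which 4-bit matmul path a batch of m rows would take.

    Assumption: the fused kernel is used when m <= limit, and the
    dequant+GEMM path is used otherwise. `limit` stands in for
    FUSED_4BIT_M_LIMIT.
    """
    if m < 1:
        raise ValueError("m must be a positive batch size")
    return "fused" if m <= limit else "dequant+GEMM"


# Compare Phase B (limit=1) with Phase C (limit=16) for a few batch sizes.
for m in (1, 2, 8, 16, 24, 64):
    print(f"M={m:>2}  B: {dispatch_path(m, 1):<12}  C: {dispatch_path(m, 16)}")
```

Under this rule only M=1 takes the fused kernel in Phase B, while Phase C extends it up to M=16 and leaves larger batches on the dequant+GEMM path.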

### Build instructions

```bash
# Find your GPU arch
rocminfo | grep "Name:" | head -5
# or: python -c "import torch; print(torch.cuda.get_device_properties(0).gcnArchName)"

# Build (replace gfx arch and ROCm version as needed)
cmake -B build -DBUILD_HIP=ON -DBNB_ROCM_ARCH="<your_arch>" -DROCM_VERSION="<version>"
cmake --build build -j$(nproc)
```

If you get a missing header error for hip_bfloat16.h, add `-DCMAKE_CXX_FLAGS="-I/opt/rocm/include"` to the cmake configure step.
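
To avoid retyping the configure line on each machine, the command can be assembled from the detected architecture. This is a sketch only: the helper names `normalize_arch` and `cmake_configure_cmd` are hypothetical, and it assumes `BNB_ROCM_ARCH` wants the bare `gfx...` name, so any feature suffix that `gcnArchName` may report (for example `gfx90a:sramecc+:xnack-`) is stripped.

```python
import shlex


def normalize_arch(gcn_arch_name: str) -> str:
    """Drop any ':feature' suffixes, keeping only the base gfx name."""
    return gcn_arch_name.split(":", 1)[0]


def cmake_configure_cmd(gcn_arch_name, rocm_version, include_dir=None):
    """Build the cmake configure command from the build instructions.

    include_dir is optional and corresponds to the hip_bfloat16.h
    workaround (for example "/opt/rocm/include").
    """
    args = [
        "cmake", "-B", "build", "-DBUILD_HIP=ON",
        f"-DBNB_ROCM_ARCH={normalize_arch(gcn_arch_name)}",
        f"-DROCM_VERSION={rocm_version}",
    ]
    if include_dir:
        args.append(f"-DCMAKE_CXX_FLAGS=-I{include_dir}")
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return shlex.join(args)


# "<version>" is left as a placeholder, exactly as in the instructions.
print(cmake_configure_cmd("gfx1151", "<version>"))
```

On the target machine, the first argument could come from `torch.cuda.get_device_properties(0).gcnArchName`.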

### For each phase, run these tests:

1. **Correctness** (must pass -- any failure here is a regression):
```bash
python -m pytest tests/test_ops.py -k "test_gemv_4bit" -v
python -m pytest tests/test_linear4bit.py -v
```

2. **Kernel microbenchmark**:
```bash
python bench_quick.py
```

3. **vLLM serving** (requires vLLM installed with ROCm support):
```bash
export PYTHONPATH=<venv>/lib/python3.12/site-packages/_rocm_sdk_devel/share/amd_smi
export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1

# For Phase A and B:
python bench_vllm_sweep.py --limit 1 --model mistral7b

# For Phase C:
python bench_vllm_sweep.py --limit 16 --model mistral7b
python bench_vllm_sweep.py --limit 16 --model llama8b
```

The vLLM sweep tests concurrency levels 1, 2, 4, 8, 12, 16, 24, 32, 48, and 64. Pay special attention to:
- reqs=2-8: expect 2-5x improvement in Phase C vs Phase A/B
- reqs=24-64: must NOT regress vs Phase A (these use the split dequant+GEMM path in both phases)

To change the dispatch threshold between phases, edit `FUSED_4BIT_M_LIMIT` in `bitsandbytes/autograd/_functions.py` before each vLLM run. vLLM forks worker processes that read the constant at import time, so the source file must be edited (monkey-patching doesn't propagate).
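
The same edit can be scripted in Python when a run needs to verify that the substitution actually happened. This is a sketch under one assumption: the constant appears exactly once as a single-line assignment to an integer literal. The helper name `set_fused_limit` is hypothetical.

```python
import re
from pathlib import Path

# Matches a line such as "FUSED_4BIT_M_LIMIT = 16" and captures the prefix
# and the current integer value separately.
_LIMIT_RE = re.compile(
    r"^([ \t]*FUSED_4BIT_M_LIMIT[ \t]*=[ \t]*)(\d+)", re.MULTILINE
)


def set_fused_limit(path, limit: int) -> int:
    """Rewrite FUSED_4BIT_M_LIMIT in the file at `path`; return the old value.

    Raises RuntimeError unless exactly one assignment is found, so a
    silent no-op (the main risk with a bare sed) is caught.
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    source = Path(path)
    text = source.read_text()
    matches = _LIMIT_RE.findall(text)
    if len(matches) != 1:
        raise RuntimeError(f"expected 1 assignment, found {len(matches)}")
    previous = int(matches[0][1])
    source.write_text(_LIMIT_RE.sub(lambda m: f"{m.group(1)}{limit}", text))
    return previous


# Example call (run from the repository root before starting vLLM):
# set_fused_limit("bitsandbytes/autograd/_functions.py", 16)
```

Because the file is rewritten on disk, worker processes started afterwards pick up the new value at import time.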

### Phase transitions

Between Phase A and B:
```bash
git stash pop  # restore optimized code
cmake --build build -j$(nproc)
sed -i "s/FUSED_4BIT_M_LIMIT = [0-9]*/FUSED_4BIT_M_LIMIT = 1/" bitsandbytes/autograd/_functions.py
```

Between Phase B and C:
```bash
sed -i "s/FUSED_4BIT_M_LIMIT = [0-9]*/FUSED_4BIT_M_LIMIT = 16/" bitsandbytes/autograd/_functions.py
```

## Report format

Produce a report with this structure:

### 1. Environment
- GPU name and architecture (gfx????)
- ROCm version, PyTorch version
- `python -c "import torch; p=torch.cuda.get_device_properties(0); print(f'{p.name} ({p.gcnArchName}), {p.total_memory//1024**3} GB')"`

### 2. Correctness (per phase)

| Phase | test_ops gemv_4bit | test_linear4bit | Status |
|-------|--------------------|-----------------|--------|
| A | xx/60 pass | xxx/243 pass | PASS/FAIL |
| B | xx/60 pass | xxx/243 pass | PASS/FAIL |
| C | xx/60 pass | xxx/243 pass | PASS/FAIL |

### 3. Kernel Microbenchmark

| Phase | Time (us) | BW (GB/s) | Speedup vs A |
|-------|-----------|-----------|--------------|
| A (baseline) | | | 1.00x |
| B (kernel opt) | | | |
| C (full opt) | | | |

### 4. vLLM Serving Throughput (tok/s)

| Reqs | Phase A (L=1) | Phase B (L=1) | Phase C (L=16) | C vs A | Regression? |
|------|---------------|---------------|----------------|--------|-------------|
| 1 | | | | | |
| 2 | | | | | |
| 4 | | | | | |
| 8 | | | | | |
| 12 | | | | | |
| 16 | | | | | |
| 24 | | | | | |
| 32 | | | | | |
| 48 | | | | | |
| 64 | | | | | |

Flag any row where Phase C is >5% slower than Phase A as a REGRESSION.
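
The "C vs A" and "Regression?" columns can be filled mechanically. This sketch assumes ">5% slower" means Phase C throughput is below 95% of Phase A throughput at the same concurrency. The function name `compare_phases` is hypothetical, and the numbers in the example are placeholders, not measurements.

```python
def compare_phases(phase_a, phase_c, tolerance=0.05):
    """Return (reqs, c_vs_a_ratio, is_regression) for each shared concurrency.

    phase_a and phase_c map concurrency (reqs) to throughput in tok/s.
    A row is a regression when C is more than `tolerance` slower than A.
    """
    rows = []
    for reqs in sorted(phase_a.keys() & phase_c.keys()):
        ratio = phase_c[reqs] / phase_a[reqs]
        rows.append((reqs, ratio, ratio < 1.0 - tolerance))
    return rows


# Placeholder numbers for illustration only; these are not measurements.
example_a = {1: 100.0, 8: 400.0, 32: 900.0}
example_c = {1: 101.0, 8: 1200.0, 32: 840.0}
for reqs, ratio, regressed in compare_phases(example_a, example_c):
    print(f"| {reqs} | {ratio:.2f}x | {'REGRESSION' if regressed else 'ok'} |")
```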

### 5. Summary
- Kernel speedup: Phase B vs A
- vLLM peak improvement: best Phase C vs A speedup and at which concurrency
- Any regressions found
- Is FUSED_4BIT_M_LIMIT=16 safe on this GPU? (yes if no regressions at reqs>16)
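
The peak-improvement and safety items follow directly from the per-concurrency "C vs A" ratios. This sketch applies the same 5% threshold used for the throughput table. The function name `summarize` and the keys of the dictionary it returns are hypothetical.

```python
def summarize(speedups: dict[int, float], limit: int = 16,
              tolerance: float = 0.05) -> dict:
    """Summarize Phase C vs Phase A speedups keyed by concurrency (reqs).

    The limit is treated as safe when no concurrency above `limit`
    is more than `tolerance` slower in Phase C than in Phase A.
    """
    if not speedups:
        raise ValueError("speedups must not be empty")
    peak_reqs = max(speedups, key=speedups.get)
    regressed_above_limit = sorted(
        reqs for reqs, ratio in speedups.items()
        if reqs > limit and ratio < 1.0 - tolerance
    )
    return {
        "peak_speedup": speedups[peak_reqs],
        "peak_reqs": peak_reqs,
        "limit_safe": not regressed_above_limit,
        "regressed_above_limit": regressed_above_limit,
    }
```

Passing the ratios from the throughput table gives the concurrency with the best speedup and a yes/no answer for the last summary item.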
