Add copy-paste agent testing prompt for other machines

sstamenk · sstamenk · commit 13838fb38f98 · 2026-04-08T14:00:56.000+02:00
Made-with: Cursor
diff --git a/AGENT_TESTING_PROMPT.md b/AGENT_TESTING_PROMPT.md
@@ -0,0 +1,133 @@
+# Agent Testing Prompt for ROCm 4-bit Kernel Optimizations
+
+Copy everything below the horizontal rule (`---`) and paste it into a Cursor Agent session on the target machine.
+
+---
+
+I'm testing a ROCm 4-bit kernel optimization branch for bitsandbytes on this machine. The branch combines kernel-level optimizations with a fused multi-batch dispatch that is expected to improve vLLM serving throughput by 2-5x at low concurrency (2-8 concurrent requests).
+
+## Setup
+
+```bash
+git clone https://github.com/sstamenk/bitsandbytes
+cd bitsandbytes
+git checkout rocm-4bit-kernel-optimization
+```
+
+Read TESTING_OPTIMIZATIONS.md and OPTIMIZATION_REPORT.md for full context on what was changed, why, and reference results from gfx1151.
+
+## What to do
+
+Execute all 3 phases (A, B, C) from TESTING_OPTIMIZATIONS.md in order. Each phase tests a different configuration:
+
+- Phase A: Baseline (upstream kernel, no changes) -- stash all changes, rebuild, test
+- Phase B: Kernel optimizations only (FUSED_4BIT_M_LIMIT=1) -- optimized kernel but M>1 still uses dequant+GEMM
+- Phase C: Full optimization (FUSED_4BIT_M_LIMIT=16) -- fused kernel for M<=16, critical for vLLM serving
+
+### Build instructions
+
+```bash
+# Find your GPU arch
+rocminfo | grep "Name:" | head -5
+# or: python -c "import torch; print(torch.cuda.get_device_properties(0).gcnArchName)"
+
+# Build (replace gfx arch and ROCm version as needed)
+cmake -B build -DBUILD_HIP=ON -DBNB_ROCM_ARCH="<your_arch>" -DROCM_VERSION="<version>"
+cmake --build build -j$(nproc)
+```
+
+If you get a missing header error for hip_bfloat16.h, add `-DCMAKE_CXX_FLAGS="-I/opt/rocm/include"` to the cmake configure step.
+
+### For each phase, run these tests:
+
+1. **Correctness** (must pass -- any failure here is a regression):
+```bash
+python -m pytest tests/test_ops.py -k "test_gemv_4bit" -v
+python -m pytest tests/test_linear4bit.py -v
+```
+
+2. **Kernel microbenchmark**:
+```bash
+python bench_quick.py
+```
+
+3. **vLLM serving** (requires vLLM installed with ROCm support):
+```bash
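+# Replace <venv> with your virtualenv root (and python3.12 with your Python version if it differs)
+export PYTHONPATH=<venv>/lib/python3.12/site-packages/_rocm_sdk_devel/share/amd_smi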
+export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
+export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
+
+# For Phase A and B:
+python bench_vllm_sweep.py --limit 1 --model mistral7b
+
+# For Phase C:
+python bench_vllm_sweep.py --limit 16 --model mistral7b
+python bench_vllm_sweep.py --limit 16 --model llama8b
+```
+
+The vLLM sweep tests concurrency levels 1, 2, 4, 8, 12, 16, 24, 32, 48, and 64. Pay special attention to:
+- reqs=2-8: expect a 2-5x throughput improvement in Phase C vs Phase A/B
+- reqs=24-64: must NOT regress vs Phase A (these exceed the M<=16 fused limit, so both phases use the split dequant+GEMM path)
+
+To change the dispatch threshold between phases, edit `FUSED_4BIT_M_LIMIT` in `bitsandbytes/autograd/_functions.py` before each vLLM run. vLLM forks worker processes that read the constant at import time, so the source file itself must be edited; monkey-patching the value does not propagate to the workers.
+
+### Phase transitions
+
+Between Phase A and B:
+```bash
+git stash pop  # restore optimized code
+cmake --build build -j$(nproc)  # rebuild with the optimized kernel
+# Phase B: limit 1, so M>1 still uses dequant+GEMM
+sed -i "s/FUSED_4BIT_M_LIMIT = [0-9]*/FUSED_4BIT_M_LIMIT = 1/" bitsandbytes/autograd/_functions.py
+```
+
+Between Phase B and C:
+```bash
+# Phase C: limit 16, so the fused kernel handles M<=16 (Python-only edit, no rebuild step)
+sed -i "s/FUSED_4BIT_M_LIMIT = [0-9]*/FUSED_4BIT_M_LIMIT = 16/" bitsandbytes/autograd/_functions.py
+```
+
+## Report format
+
+Produce a report with this structure:
+
+### 1. Environment
+- GPU name and architecture (gfx????)
+- ROCm version, PyTorch version
+- `python -c "import torch; p=torch.cuda.get_device_properties(0); print(f'{p.name} ({p.gcnArchName}), {p.total_memory//1024**3} GB')"`
+
+### 2. Correctness (per phase)
+
+| Phase | test_ops gemv_4bit | test_linear4bit | Status |
+|-------|-------------------|-----------------|--------|
+| A | xx/60 pass | xxx/243 pass | PASS/FAIL |
+| B | xx/60 pass | xxx/243 pass | PASS/FAIL |
+| C | xx/60 pass | xxx/243 pass | PASS/FAIL |
+
+### 3. Kernel Microbenchmark
+
+| Phase | Time (us) | BW (GB/s) | Speedup vs A |
+|-------|-----------|-----------|--------------|
+| A (baseline) | | | 1.00x |
+| B (kernel opt) | | | |
+| C (full opt) | | | |
+
+### 4. vLLM Serving Throughput (tok/s)
+
+| Reqs | Phase A (L=1) | Phase B (L=1) | Phase C (L=16) | C vs A | Regression? |
+|------|---------------|---------------|----------------|--------|-------------|
+| 1 | | | | | |
+| 2 | | | | | |
+| 4 | | | | | |
+| 8 | | | | | |
+| 12 | | | | | |
+| 16 | | | | | |
+| 24 | | | | | |
+| 32 | | | | | |
+| 48 | | | | | |
+| 64 | | | | | |
+
+Flag any row where Phase C is >5% slower than Phase A as a REGRESSION.
+
+### 5. Summary
+- Kernel speedup: Phase B vs A
+- vLLM peak improvement: best Phase C vs A speedup and at which concurrency
+- Any regressions found
+- Is FUSED_4BIT_M_LIMIT=16 safe on this GPU? (yes if no regressions at reqs>16)