perf: turbo VEC flash attention — +9% decode on CUDA via autoresearch #53