README.md (16 additions, 17 deletions)
@@ -18,6 +18,7 @@ Each operation is complete with a PyTorch-only reference implementation (and som
 - SiLU and mul
 - Attention
 - Paged Attention (Flash-Decoding with Paged KV Cache)
+- Varlen Attention (Prefill/decode attention with paged KV cache)
 - Embedding
 - Rotary embedding
 - Normalization
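The hunk context above notes that each operation ships with a PyTorch-only reference implementation. As a purely illustrative sketch (the name `silu_and_mul_ref` and the concatenated gate/up input layout are assumptions, not Conch's actual API), such a reference for "SiLU and mul" could look like:

```python
# Hypothetical PyTorch-only reference for "SiLU and mul";
# the name and input layout are illustrative assumptions, not Conch's API.
import torch
import torch.nn.functional as F

def silu_and_mul_ref(x: torch.Tensor) -> torch.Tensor:
    """Split the last dim into gate/up halves and return silu(gate) * up,
    the gated activation used in LLaMA-style MLPs."""
    d = x.shape[-1] // 2
    gate, up = x[..., :d], x[..., d:]
    return F.silu(gate) * up

# Example: (num_tokens, 2 * hidden) activations -> (num_tokens, hidden)
out = silu_and_mul_ref(torch.randn(8, 2 * 4096))
```

A plain-PyTorch reference like this would typically serve as the ground truth the Triton kernel is tested against, which keeps correctness checks hardware-agnostic.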
@@ -42,29 +43,27 @@ The goal of Conch is not to claim that our operations are faster than CUDA imple
 Our goal is to write Triton operations that are _as fast as_ the state-of-the-art CUDA implementations.
 This allows developers on any hardware platform (Nvidia, AMD, etc.) access to the same, performant kernels.

-Below is a table comparing the relative performance of our Triton kernels to CUDA baselines (on H100).
+Below is a table comparing the relative performance of our Triton kernels to CUDA baselines (on NVIDIA A10).
 The listed runtime is the median runtime from 10,000 iterations on our microbenchmarks.
 **Note**: it's difficult to express the performance of a kernel with a single number (performance will vary with input sizes, data types, etc.).
 We tried our best to choose representative parameters for a fair comparison.
 Most relevant parameters are specified via CLI parameters to the microbenchmarks (`benchmarks/`), so feel free to collect your own results based on your use case.
-CUDA runtimes collected via vLLM and bitsandbytes (`vllm==0.6.4` and `bitsandbytes==0.45.4`).
+CUDA runtimes collected via vLLM and bitsandbytes (`vllm==0.8.5` and `bitsandbytes==0.45.5`).
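The updated text describes the benchmarking methodology: the median runtime over 10,000 iterations of a microbenchmark. A minimal sketch of such a timing loop is below; `bench_median_ms`, its defaults, and the op being timed are hypothetical, not the actual harness under `benchmarks/`.

```python
# Hypothetical timing loop in the spirit of "median runtime from 10,000
# iterations"; not the actual Conch benchmark harness.
import statistics
import torch
import torch.nn.functional as F

def bench_median_ms(fn, *args, iters=10_000, warmup=100):
    for _ in range(warmup):      # warm up JIT compilation / autotuning / caches
        fn(*args)
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn(*args)
        end.record()
        torch.cuda.synchronize()               # wait so elapsed_time is valid
        times.append(start.elapsed_time(end))  # milliseconds
    return statistics.median(times)

x = torch.randn(8, 2 * 4096, device="cuda")
print(f"median: {bench_median_ms(F.silu, x):.4f} ms")
```

In practice a Triton-based harness might instead use `triton.testing.do_bench`, which handles warmup and quantiles; the median is a sensible summary statistic because it is robust to occasional slow iterations.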