README.md: 17 additions & 14 deletions
```diff
@@ -43,27 +43,30 @@ The goal of Conch is not to claim that our operations are faster than CUDA implementations.
 Our goal is to write Triton operations that are _as fast as_ the state-of-the-art CUDA implementations.
 This allows developers on any hardware platform (Nvidia, AMD, etc.) access to the same, performant kernels.
 
-Below is a table comparing the relative performance of our Triton kernels to CUDA baselines (on NVIDIA A10).
+Below is a table comparing the relative performance of our Triton kernels to CUDA baselines (on NVIDIA H100).
 The listed runtime is the median runtime from 10,000 iterations on our microbenchmarks.
 **Note**: it's difficult to express the performance of a kernel with a single number (performance will vary with input sizes, data types, etc.).
 We tried our best to choose representative parameters for a fair comparison.
 Most relevant parameters are specified via CLI parameters to the microbenchmarks (`benchmarks/`), so feel free to collect your own results based on your use case.
-CUDA runtimes collected via vLLM and bitsandbytes (`vllm==0.8.5` and `bitsandbytes==0.45.5`).
+CUDA runtimes collected via vLLM and bitsandbytes (`vllm==0.9.1` and `bitsandbytes==0.46.0`).
```
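The methodology described in the README (reporting the median runtime over many timed iterations) can be sketched in plain Python. This is a minimal illustration, not the actual Conch benchmark harness: `run_kernel` is a hypothetical stand-in for a real kernel launch, and a real GPU benchmark would also synchronize the device around each timing.

```python
import statistics
import time


def median_runtime_ms(fn, iterations=10_000):
    """Time `fn` over `iterations` runs and return the median in milliseconds.

    The median is reported rather than the mean because it is robust to
    outliers such as one-off scheduler stalls or cold-cache effects.
    """
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()  # a GPU benchmark would synchronize the device here
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Hypothetical stand-in for launching a Triton/CUDA kernel.
def run_kernel():
    sum(i * i for i in range(1000))


print(f"median: {median_runtime_ms(run_kernel, iterations=100):.4f} ms")
```

Sweeping input sizes and data types (as the CLI parameters of the microbenchmarks allow) and re-running this measurement for each configuration is what makes a single-number comparison meaningful for a given use case.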