Commit 3685b96

Merge pull request #19 from stackav-oss/feature/jmanning/conch-1.0.0
Update version to v1.0.0
2 parents 55254f3 + 0c284b0 commit 3685b96

123 files changed: 965 additions & 600 deletions

.envrc

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,3 +1,3 @@
 # Any change to this file requires rerunning 'direnv allow'.
 # Instead, farm out work to another file so that we can update it as needed.
-source ./env/direnv.sh
+source ./tools/env/direnv.sh
```

.gitignore

Lines changed: 1 addition & 1 deletion

```diff
@@ -2,7 +2,7 @@
 **/__pycache__/
 .pytest_cache
 
-env/user.sh
+tools/env/user.sh
 .direnv/
 
 build/
```

.markdownlint-cli2.yaml

Lines changed: 3 additions & 0 deletions

````diff
@@ -1,3 +1,6 @@
+# Copyright 2025 Stack AV Co.
+# SPDX-License-Identifier: Apache-2.0
+
 config:
   # Require the use of fenced code blocks (with ```) instead of indented code blocks.
   code-block-style:
````
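
The hunk cuts off mid-rule at `code-block-style:`. A hedged sketch of how the rule most likely continues, based on markdownlint's documented `code-block-style` (MD046) rule; the `style: fenced` value is an assumption, not shown in the diff:

```yaml
# Hypothetical completion of the truncated rule; only the comment and the
# rule name appear in the diff above -- the "style: fenced" value is assumed.
config:
  # Require the use of fenced code blocks instead of indented code blocks.
  code-block-style:
    style: fenced
```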

.pre-commit-config.yaml

Lines changed: 14 additions & 1 deletion

```diff
@@ -1,3 +1,6 @@
+# Copyright 2025 Stack AV Co.
+# SPDX-License-Identifier: Apache-2.0
+
 repos:
 - repo: https://github.com/pre-commit/pre-commit-hooks
   rev: v5.0.0
@@ -16,6 +19,16 @@ repos:
   hooks:
   - id: isort
 
+- repo: https://github.com/google/addlicense
+  rev: 499ed7f28389eb4a08c2d7e40b1637cfd7f65381 # master
+  hooks:
+  - id: addlicense
+    args: ["-c=Stack AV Co.", "-l=Apache-2.0", "-s=only"]
+    exclude: |
+      (?x)^(
+      |conch/third_party/.*
+      )$
+
 - repo: https://github.com/astral-sh/ruff-pre-commit
   rev: v0.11.2
   hooks:
@@ -31,6 +44,6 @@ repos:
   hooks:
   - id: mypy-local
     name: mypy-local
-    entry: ./scripts/mypy.sh
+    entry: ./tools/mypy.sh
     language: system
     files: \.py$
```
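The `exclude:` pattern added for the addlicense hook uses Python's verbose-regex syntax (`(?x)`), so the newlines inside the block are ignored and each non-empty line is one alternative. A quick sketch of how the pattern behaves (the file paths below are made up for illustration):

```python
import re

# The addlicense exclude pattern from the hook. "(?x)" enables re.VERBOSE,
# so whitespace (including the newlines) is ignored and the pattern
# effectively reduces to ^(|conch/third_party/.*)$.
pattern = re.compile(r"""(?x)^(
|conch/third_party/.*
)$""")

# Vendored third-party code is excluded from the license-header check...
assert pattern.match("conch/third_party/some_vendored_file.py")
# ...while first-party files are not (this path is hypothetical).
assert not pattern.match("conch/ops/attention.py")
```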

README.md

Lines changed: 16 additions & 17 deletions

```diff
@@ -18,6 +18,7 @@ Each operation is complete with a PyTorch-only reference implementation (and som
 - SiLU and mul
 - Attention
   - Paged Attention (Flash-Decoding with Paged KV Cache)
+  - Varlen Attention (Prefill/decode attention with paged KV cache)
 - Embedding
   - Rotary embedding
 - Normalization
@@ -42,29 +43,27 @@ The goal of Conch is not to claim that our operations are faster than CUDA imple
 Our goal is to write Triton operations that are _as fast as_ the state-of-the-art CUDA implementations.
 This allows developers on any hardware platform (Nvidia, AMD, etc.) access to the same, performant kernels.
 
-Below is a table comparing the relative performance of our Triton kernels to CUDA baselines (on H100).
+Below is a table comparing the relative performance of our Triton kernels to CUDA baselines (on NVIDIA A10).
 The listed runtime is the median runtime from 10,000 iterations on our microbenchmarks.
 **Note**: it's difficult to express the performance of a kernel with a single number (performance will vary with input sizes, data types, etc.).
 We tried our best to choose representative parameters for a fair comparison.
 Most relevant parameters are specified via CLI parameters to the microbenchmarks (`benchmarks/`), so feel free to collect your own results based on your use case.
-CUDA runtimes collected via vLLM and bitsandbytes (`vllm==0.6.4` and `bitsandbytes==0.45.4`).
+CUDA runtimes collected via vLLM and bitsandbytes (`vllm==0.8.5` and `bitsandbytes==0.45.5`).
 
 | Operation | CUDA Runtime | Triton Runtime | Triton Speedup |
 | --- | --- | --- | --- |
-| GeLU, Tanh, and Mul | 0.493 ms | 0.466 ms | 1.06 |
-| SiLU and Mul | 0.063 ms | 0.047 ms | 1.34 |
-| Paged Attention | 0.090 ms | 0.083 ms | 1.08 |
-| Rotary Embedding | 0.107 ms | 0.103 ms | 1.04 |
-| RMS Norm (Gemma-style) | 0.392 ms | 0.029 ms | 13.52 |
-| RMS Norm (Llama-style) | 0.044 ms | 0.018 ms | 2.44 |
-| bitsandbytes: Dequantize | 0.074 ms | 4.487 ms | 0.02 |
-| bitsandbytes: Quantize | 0.377 ms | 4.819 ms | 0.08 |
-| FP8 Static Quantization | 0.035 ms | 0.090 ms | 0.39 |
-| Int8 Static Quantization | 0.056 ms | 0.094 ms | 0.60 |
-| Mixed-precision GEMM [Int4 x FP16] | 0.432 ms | 1.437 ms | 0.30 |
-| Scaled GEMM [Int8 x BF16] | 0.204 ms | 0.285 ms | 0.72 |
-| vLLM: Copy Blocks | 2.231 ms | 1.807 ms | 1.23 |
-| vLLM: Reshape and Cache | 0.057 ms | 0.010 ms | 5.70 |
+| GeLU, Tanh, and Mul | 2.835 ms | 2.851 ms | 0.99 |
+| SiLU and Mul | 0.260 ms | 0.209 ms | 1.24 |
+| Paged Attention | 0.374 ms | 0.344 ms | 1.09 |
+| Rotary Embedding | 0.579 ms | 0.600 ms | 0.96 |
+| RMS Norm (Gemma-style) | 1.392 ms | 0.141 ms | 9.87 |
+| RMS Norm (Llama-style) | 0.117 ms | 0.072 ms | 1.63 |
+| bitsandbytes: Dequantize | 0.175 ms | 10.950 ms | 0.02 |
+| bitsandbytes: Quantize | 0.671 ms | 12.667 ms | 0.05 |
+| Int8 Static Quantization | 0.167 ms | 0.164 ms | 1.02 |
+| Scaled GEMM [Int8 x BF16] | 2.130 ms | 4.441 ms | 0.48 |
+| vLLM: Copy Blocks | 8.550 ms | 9.933 ms | 0.86 |
+| vLLM: Reshape and Cache | 0.245 ms | 0.024 ms | 10.21 |
 
 For additional analysis of kernel performance, check out our [performance docs](./docs/performance/).
 
@@ -74,7 +73,7 @@ Supported platforms:
 
 - Nvidia A10, CUDA 12.2
 - Nvidia H100, CUDA 12.2
-- AMD MI300X, ROCm 6.2.2
+- AMD MI300X, ROCm 6.2.4
 
 Work-in-progress platforms:
 
```
benchmarks/bnb_dequantize_blockwise_benchmark.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -1,4 +1,5 @@
-# Copyright (C) 2025 Stack AV Co. - All Rights Reserved.
+# Copyright 2025 Stack AV Co.
+# SPDX-License-Identifier: Apache-2.0
 
 """Bitsandbytes dequantize blockwise benchmark."""
 
```

benchmarks/bnb_quantize_blockwise_benchmark.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -1,4 +1,5 @@
-# Copyright (C) 2025 Stack AV Co. - All Rights Reserved.
+# Copyright 2025 Stack AV Co.
+# SPDX-License-Identifier: Apache-2.0
 
 """Bitsandbytes quantize blockwise benchmark."""
 
```

benchmarks/copy_blocks_benchmark.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -1,4 +1,5 @@
-# Copyright (C) 2025 Stack AV Co. - All Rights Reserved.
+# Copyright 2025 Stack AV Co.
+# SPDX-License-Identifier: Apache-2.0
 
 """Triton copy_blocks benchmark."""
 
```

benchmarks/fused_add_rms_norm_benchmark.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -1,4 +1,5 @@
-# Copyright (C) 2025 Stack AV Co. - All Rights Reserved.
+# Copyright 2025 Stack AV Co.
+# SPDX-License-Identifier: Apache-2.0
 
 """Triton rms_norm benchmark."""
 
```

benchmarks/gelu_tanh_and_mul_benchmark.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -1,4 +1,5 @@
-# Copyright (C) 2025 Stack AV Co. - All Rights Reserved.
+# Copyright 2025 Stack AV Co.
+# SPDX-License-Identifier: Apache-2.0
 
 """Triton gelu_tanh_and_mul benchmark."""
 
```

0 commit comments

Comments
 (0)