README.md (16 additions, 17 deletions)
@@ -18,6 +18,7 @@ Each operation is complete with a PyTorch-only reference implementation (and som
 - SiLU and mul
 - Attention
 - Paged Attention (Flash-Decoding with Paged KV Cache)
+- Varlen Attention (Prefill/decode attention with paged KV cache)
 - Embedding
 - Rotary embedding
 - Normalization
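The hunk context above notes that each operation ships with a PyTorch-only reference implementation. As a purely illustrative sketch (the name `silu_and_mul_ref` and the concatenated gate/up input layout are assumptions, not Conch's actual API), such a reference for "SiLU and mul" could look like:

```python
# Hypothetical PyTorch-only reference for "SiLU and mul";
# the name and input layout are illustrative assumptions, not Conch's API.
import torch
import torch.nn.functional as F

def silu_and_mul_ref(x: torch.Tensor) -> torch.Tensor:
    """Split the last dim into gate/up halves and return silu(gate) * up,
    the gated activation used in LLaMA-style MLPs."""
    d = x.shape[-1] // 2
    gate, up = x[..., :d], x[..., d:]
    return F.silu(gate) * up

# Example: (num_tokens, 2 * hidden) activations -> (num_tokens, hidden)
out = silu_and_mul_ref(torch.randn(8, 2 * 4096))
```

A plain-PyTorch reference like this would typically serve as the ground truth the Triton kernel is tested against, which keeps correctness checks hardware-agnostic.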
@@ -42,29 +43,27 @@ The goal of Conch is not to claim that our operations are faster than CUDA imple
 Our goal is to write Triton operations that are _as fast as_ the state-of-the-art CUDA implementations.
 This allows developers on any hardware platform (Nvidia, AMD, etc.) access to the same, performant kernels.

-Below is a table comparing the relative performance of our Triton kernels to CUDA baselines (on H100).
+Below is a table comparing the relative performance of our Triton kernels to CUDA baselines (on NVIDIA A10).
 The listed runtime is the median runtime from 10,000 iterations on our microbenchmarks.
 **Note**: it's difficult to express the performance of a kernel with a single number (performance will vary with input sizes, data types, etc.).
 We tried our best to choose representative parameters for a fair comparison.
 Most relevant parameters are specified via CLI parameters to the microbenchmarks (`benchmarks/`), so feel free to collect your own results based on your use case.
-CUDA runtimes collected via vLLM and bitsandbytes (`vllm==0.6.4` and `bitsandbytes==0.45.4`).
+CUDA runtimes collected via vLLM and bitsandbytes (`vllm==0.8.5` and `bitsandbytes==0.45.5`).
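The updated text describes the benchmarking methodology: the median runtime over 10,000 iterations of a microbenchmark. A minimal sketch of such a timing loop is below; `bench_median_ms`, its defaults, and the op being timed are hypothetical, not the actual harness under `benchmarks/`.

```python
# Hypothetical timing loop in the spirit of "median runtime from 10,000
# iterations"; not the actual Conch benchmark harness.
import statistics
import torch
import torch.nn.functional as F

def bench_median_ms(fn, *args, iters=10_000, warmup=100):
    for _ in range(warmup):      # warm up JIT compilation / autotuning / caches
        fn(*args)
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn(*args)
        end.record()
        torch.cuda.synchronize()               # wait so elapsed_time is valid
        times.append(start.elapsed_time(end))  # milliseconds
    return statistics.median(times)

x = torch.randn(8, 2 * 4096, device="cuda")
print(f"median: {bench_median_ms(F.silu, x):.4f} ms")
```

In practice a Triton-based harness might instead use `triton.testing.do_bench`, which handles warmup and quantiles; the median is a sensible summary statistic because it is robust to occasional slow iterations.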