
Commit 0e990a6

TimDettmers and claude committed:

style: Apply pre-commit auto-formatting (ruff, clang-format, typos)

Fixes from ruff format, clang-format, end-of-file-fixer, and typos hooks. Also fixes RUF059 (unused variable) warnings in test files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent e78a28c · commit 0e990a6

20 files changed: +707 −484 lines

agents/flute_kernel_guide.md

Lines changed: 2 additions & 2 deletions

````diff
@@ -490,7 +490,7 @@ Copy operations:
 G2SCopySizeA, G2SCopySizeQ, etc. — transfer granularity
 
 MMA configuration:
-MmaThrM, MmaThrN, MmaThrK — thread layout within MMA
+MmaTheM, MmaTheN, MmaTheK — thread layout within MMA
 MmaPrmM, MmaPrmN, MmaPrmK — permutation within MMA
 ```
 
@@ -965,7 +965,7 @@ Both kernels use the same fundamental MMA instruction: `m16n8k16` with FP16
 inputs and FP32 accumulation.
 
 **FLUTE**: CuTe's `SM80_16x8x16_F32F16F16F32` atom, configured via `TiledMma`
-with customizable thread layout (`MmaThrM × MmaThrN × MmaThrK`) and
+with customizable thread layout (`MmaTheM × MmaTheN × MmaTheK`) and
 permutation (`MmaPrmM × MmaPrmN × MmaPrmK`).
 
 **kbit**: Direct inline PTX `mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32`
````
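As a quick sanity check on the `m16n8k16` shape the guide references, the per-thread fragment sizes follow directly from dividing each operand tile among the 32 threads of a warp. The arithmetic below is a sketch derived from the tile shape alone (the actual per-register layouts are fixed by the PTX spec and not reproduced here):

```python
# Per-thread element counts for an m16n8k16 MMA tile (FP16 inputs,
# FP32 accumulation), assuming the standard 32-thread warp that
# cooperatively executes one mma.sync instruction.
M, N, K = 16, 8, 16
WARP_SIZE = 32

a_frag = M * K // WARP_SIZE  # FP16 A operand: 8 halves per thread
b_frag = K * N // WARP_SIZE  # FP16 B operand: 4 halves per thread
c_frag = M * N // WARP_SIZE  # FP32 accumulator: 4 floats per thread

print(a_frag, b_frag, c_frag)  # → 8 4 4
```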

benchmarks/bench_absmax_format.py

Lines changed: 14 additions & 12 deletions

```diff
@@ -8,17 +8,17 @@
 Uses representative shapes from Qwen3-Coder-Next 70B.
 """
 
-import torch
 import time
-import math
-import bitsandbytes  # noqa: F401 — registers torch ops
 
-from bitsandbytes.functional import create_normal_float_codebook
+import torch
 
+import bitsandbytes  # noqa: F401 — registers torch ops
+from bitsandbytes.functional import create_normal_float_codebook
 
 
 # ---- E4M4 encode (Python, matching CUDA encode_e4m4_absmax) ----
 E4M4_BIAS = 11
 
+
 def encode_e4m4_absmax(vals: torch.Tensor) -> torch.Tensor:
     """Encode float32 absmax values to uint8 E4M4 format."""
     out = torch.zeros(vals.shape, dtype=torch.uint8, device=vals.device)
@@ -44,11 +44,11 @@ def encode_e4m4_absmax(vals: torch.Tensor) -> torch.Tensor:
 
 # ---- Benchmark config ----
 SHAPES = [
-    ("gateup", 7168, 18944),
-    ("down", 18944, 7168),
-    ("Q", 7168, 7168),
-    ("O", 7168, 7168),
-    ("KV", 7168, 1024),
+    ("gateup", 7168, 18944),
+    ("down", 18944, 7168),
+    ("Q", 7168, 7168),
+    ("O", 7168, 7168),
+    ("KV", 7168, 1024),
 ]
 K_BITS_LIST = [2, 3, 4, 5]
 M_VALS = [1, 2, 3, 4]
@@ -74,10 +74,12 @@ def bench():
 
     # float32 absmax
     fn_f32 = lambda: torch.ops.bitsandbytes.kbit_scalar_gemv(
-        A, packed_flat, absmax_flat, codebook, K_dim, N, k)
+        A, packed_flat, absmax_flat, codebook, K_dim, N, k
+    )
     # uint8 E4M4 absmax
     fn_u8 = lambda: torch.ops.bitsandbytes.kbit_scalar_gemv_u8(
-        A, packed_flat, absmax_u8, codebook, K_dim, N, k)
+        A, packed_flat, absmax_u8, codebook, K_dim, N, k
+    )
 
     # Warmup
     for _ in range(WARMUP):
@@ -99,7 +101,7 @@ def bench():
     torch.cuda.synchronize()
     t_u8 = (time.perf_counter() - start) / ITERS * 1e6
 
-    ratio = t_f32 / t_u8 if t_u8 > 0 else float('inf')
+    ratio = t_f32 / t_u8 if t_u8 > 0 else float("inf")
     print(f"{name:>8s} {k:>2d} {M:>2d} {t_f32:>12.1f} {t_u8:>11.1f} {ratio:>5.2f}x")
 
 
```
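The benchmark above compares float32 absmax scales against a uint8 E4M4 encoding with `E4M4_BIAS = 11`. The CUDA `encode_e4m4_absmax` itself is not part of this diff, so the following is only a rough pure-Python sketch of one plausible 4-bit-exponent / 4-bit-mantissa round-trip (scalar, no subnormal handling, implicit leading 1), matching nothing from the repo except the bias value shown above; the names `encode_e4m4` and `decode_e4m4` are illustrative:

```python
import math

E4M4_BIAS = 11  # same bias as E4M4_BIAS in the benchmark above


def encode_e4m4(v: float) -> int:
    """Pack a positive float into 4 exponent bits + 4 mantissa bits."""
    if v <= 0.0:
        return 0
    e = math.floor(math.log2(v))
    m = round((v / 2.0**e - 1.0) * 16)  # 4-bit mantissa, implicit leading 1
    if m == 16:  # mantissa rounded up to 2.0: carry into the exponent
        e, m = e + 1, 0
    eb = max(0, min(15, e + E4M4_BIAS))  # clamp biased exponent to 4 bits
    return (eb << 4) | m


def decode_e4m4(b: int) -> float:
    """Inverse of encode_e4m4 (ignoring clamping at the range edges)."""
    return (1.0 + (b & 0xF) / 16.0) * 2.0 ** ((b >> 4) - E4M4_BIAS)


print(hex(encode_e4m4(1.0)))  # → 0xb0
```

Under these assumptions, bias 11 covers magnitudes from 2⁻¹¹ up to about 31 with at most ~3% relative rounding error, which is the kind of range/precision trade-off a per-block absmax format would need.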
