Commit a057095

Disable hipBLASLt auto-tune by default, fix warm prompt regression
Auto-tuning benchmarks 8 GEMM algorithms per shape on first call, adding ~200ms startup overhead. For quantized models the regular GEMM path is rarely used, so the overhead is wasted. Disable by default; enable with MLX_ROCM_HIPBLASLT_TUNE=1 for non-quantized models. Warm prompt throughput restored: Qwen3-8B 1092 tok/s, Qwen3.5-35B 795 tok/s.
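
As a rough illustration of the benchmark-once-then-cache pattern the message describes, here is a minimal CPU-side sketch. The names `ShapeKey` and `pick_algo` are hypothetical and do not come from `hipblaslt_gemm.cpp`. The sketch also assumes the candidates are plain callables timed with a wall clock, whereas real GPU timing would need stream synchronization or events.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical cache key: one entry per unique (M, N, K) shape.
using ShapeKey = std::tuple<int, int, int>;

// Returns the index of the algorithm to use for `key`.
// With tuning off, index 0 (the default algorithm) is returned immediately,
// so no benchmarking cost is paid. With tuning on, each candidate is timed
// once on the first call for a shape and the winner is cached.
inline std::size_t pick_algo(
    std::map<ShapeKey, std::size_t>& cache,
    const ShapeKey& key,
    const std::vector<std::function<void()>>& candidates,
    bool do_tune) {
  if (!do_tune || candidates.empty()) {
    return 0;
  }
  auto it = cache.find(key);
  if (it != cache.end()) {
    return it->second;  // cached winner, no re-benchmarking
  }
  std::size_t best = 0;
  auto best_time = std::chrono::steady_clock::duration::max();
  for (std::size_t i = 0; i < candidates.size(); ++i) {
    auto start = std::chrono::steady_clock::now();
    candidates[i]();  // assumption: the call blocks until the work is done
    auto elapsed = std::chrono::steady_clock::now() - start;
    if (elapsed < best_time) {
      best_time = elapsed;
      best = i;
    }
  }
  cache[key] = best;
  return best;
}
```

The first-call cost grows with the number of candidates times the number of distinct shapes, which is why gating the loop behind a flag removes the startup overhead entirely when the path is rarely exercised.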
1 parent 25f5912 commit a057095

1 file changed

Lines changed: 4 additions & 3 deletions

mlx/backend/rocm/gemms/hipblaslt_gemm.cpp

@@ -351,9 +351,10 @@ void hipblaslt_gemm_impl(
   int best_algo_idx = 0;
 
   // Auto-tuning: benchmark all algorithms to find the fastest for each shape.
-  // Runs automatically for new shapes. Once cached, uses the winner with zero overhead.
-  // Tuning adds ~10ms per unique (M,N,K) shape, amortized over the session.
-  static constexpr bool do_tune = true;
+  // Disabled by default — for quantized models the GEMM path is rarely used
+  // and the tuning overhead causes warm prompt regression.
+  // Enable with MLX_ROCM_HIPBLASLT_TUNE=1 for non-quantized models.
+  static bool do_tune = std::getenv("MLX_ROCM_HIPBLASLT_TUNE") != nullptr;
 
   auto it = tune_cache.find(key);
   if (it != tune_cache.end()) {
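
Note that `std::getenv(...) != nullptr` is true whenever the variable is set, so `MLX_ROCM_HIPBLASLT_TUNE=0` or an empty value would also enable tuning. If stricter semantics were wanted, one possible approach is sketched below. `env_flag_enabled` and `env_flag_present` are hypothetical helpers, not part of MLX, and the set of "off" spellings is an assumption.

```cpp
#include <cstdlib>
#include <string_view>

// Presence check, equivalent to the condition used in the diff:
// any value, including "0" or an empty string, counts as enabled.
inline bool env_flag_present(const char* name) {
  return std::getenv(name) != nullptr;
}

// Hypothetical stricter check: unset, empty, "0", "false", and "off"
// are all treated as disabled; any other value enables the flag.
inline bool env_flag_enabled(const char* name) {
  const char* raw = std::getenv(name);
  if (raw == nullptr) {
    return false;
  }
  std::string_view value(raw);
  return !(value.empty() || value == "0" || value == "false" || value == "off");
}
```

With this helper the initializer could read `static const bool do_tune = env_flag_enabled("MLX_ROCM_HIPBLASLT_TUNE");`. Either way the variable is read once, when the static is first initialized, so changing it later in the same process has no effect.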
