Disable hipBLASLt auto-tune by default, fix warm prompt regression

Geramy · Geramy · commit a057095a4f64 · 2026-03-30T19:40:04.000-07:00
Auto-tuning benchmarks 8 GEMM algorithms per shape the first time a
shape is seen, adding ~200ms of startup overhead. Quantized models
rarely use the regular GEMM path, so that overhead is wasted. Disable
tuning by default; set MLX_ROCM_HIPBLASLT_TUNE=1 to enable it for
non-quantized models. The check only tests whether the variable is
set, so any value (including 0) enables tuning.

Warm prompt restored: Qwen3-8B 1092 tok/s, Qwen3.5-35B 795 tok/s.
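
The change below enables tuning whenever MLX_ROCM_HIPBLASLT_TUNE is
present in the environment, regardless of its value. If a stricter
reading is wanted, where unset, empty, and "0" all mean disabled, one
possible shape is sketched here. This is not what the commit does:
parse_tune_flag and hipblaslt_tune_enabled are hypothetical names
used only for illustration.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper, not part of this commit: interpret the raw value
// of MLX_ROCM_HIPBLASLT_TUNE. Unset, empty, or "0" means disabled;
// any other value means enabled.
inline bool parse_tune_flag(const char* value) {
  return value != nullptr && value[0] != '\0' &&
      std::strcmp(value, "0") != 0;
}

// Hypothetical accessor: reads the environment once on first call and
// reuses the cached result afterwards, like the function-local static
// in the diff.
inline bool hipblaslt_tune_enabled() {
  static const bool enabled =
      parse_tune_flag(std::getenv("MLX_ROCM_HIPBLASLT_TUNE"));
  return enabled;
}
```

Keeping the parsing in a pure function makes the policy testable
without touching the process environment.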
diff --git a/mlx/backend/rocm/gemms/hipblaslt_gemm.cpp b/mlx/backend/rocm/gemms/hipblaslt_gemm.cpp
@@ -351,9 +351,10 @@ void hipblaslt_gemm_impl(
   int best_algo_idx = 0;
 
   // Auto-tuning: benchmark all algorithms to find the fastest for each shape.
-  // Runs automatically for new shapes. Once cached, uses the winner with zero overhead.
-  // Tuning adds ~10ms per unique (M,N,K) shape, amortized over the session.
-  static constexpr bool do_tune = true;
+  // Disabled by default — for quantized models the GEMM path is rarely used
+  // and the tuning overhead causes warm prompt regression.
+  // Enable with MLX_ROCM_HIPBLASLT_TUNE=1 for non-quantized models.
+  static bool do_tune = std::getenv("MLX_ROCM_HIPBLASLT_TUNE") != nullptr;
 
   auto it = tune_cache.find(key);
   if (it != tune_cache.end()) {