Flash Attention failed: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED #314

Description

I'm trying to upscale a single image on an old low-spec system, with no luck getting Phase 2 to work at all. I tried both a Q3_K_M quant and the official Q4_K_M of SeedVR2, with and without offload to CPU; it's the same error every time.

Error log of an attempt with Q3, full offload
⚠️  SeedVR2 optimizations check: Flash Attention ❌ | Triton ❌
💡 For best performance: pip install flash-attn triton

...

got prompt
[13:40:20.649]  
[13:40:20.650]   ╔══════════════════════════════════════════════════════════╗
[13:40:20.651]   ║ ███████ ███████ ███████ ██████  ██    ██ ██████  ███████ ║
[13:40:20.651]   ║ ██      ██      ██      ██   ██ ██    ██ ██   ██      ██ ║
[13:40:20.651]   ║ ███████ █████   █████   ██   ██ ██    ██ ██████  █████   ║
[13:40:20.652]   ║      ██ ██      ██      ██   ██  ██  ██  ██   ██ ██      ║
[13:40:20.652]   ║ ███████ ███████ ███████ ██████    ████   ██   ██ ███████ ║
[13:40:20.652]   ║ v2.5.10                 © ByteDance Seed · NumZ · AInVFX ║
[13:40:20.652]   ╚══════════════════════════════════════════════════════════╝
[13:40:20.652]  
[13:40:20.653] 🔧 Validating seedvr2_ema_3b-Q3_K_M.gguf...
[13:40:32.326] 🏃 Creating new runner: DiT=seedvr2_ema_3b-Q3_K_M.gguf, VAE=ema_vae_fp16.safetensors
[13:40:32.417] 🚀 Creating DiT model structure on meta device
[13:40:32.966] 🎨 Creating VAE model structure on meta device
[13:40:33.363]  
[13:40:33.364] 🎬 Starting upscaling generation...
[13:40:33.364] 🎬   Input: 1 frame, 512x512px → Output: 576x576px (shortest edge: 576px)
[13:40:33.364] 🎬   Batch size: 1, Seed: 42, Channels: RGB
[13:40:33.364]  
[13:40:33.365]  ━━━━━━━━ Phase 1: VAE encoding ━━━━━━━━
[13:40:33.365] 🎨 Materializing VAE weights to CPU (offload device): /home/hum/comfy-0_3_68/ComfyUI-0.3.68/models/SEEDVR2/ema_vae_fp16.safetensors
[13:40:36.782] 🎨 Encoding batch 1/1
[13:40:36.814] 📹   Sequence of 1 frames
[13:40:40.843]  
[13:40:40.843]  ━━━━━━━━ Phase 2: DiT upscaling ━━━━━━━━
[13:40:40.863] 🚀 Materializing DiT weights to CPU (offload device): /home/hum/comfy-0_3_68/ComfyUI-0.3.68/models/SEEDVR2/seedvr2_ema_3b-Q3_K_M.gguf
[13:40:41.289] 🔀 BlockSwap: 32 transformer blocks + I/O components offloaded to CPU
[13:40:41.360] 🎬 Upscaling batch 1/1
EulerSampler:   0%|                                                                                                                                            | 0/1 [00:00<?, ?it/s][13:40:42.822] ⚠️ [WARNING] Flash Attention failed for blocks.0.attn.attn, using original: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
[13:40:42.824] ⚠️ [WARNING] Flash Attention failed for blocks.0.attn, using original: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
[13:40:42.884] ⚠️ [WARNING] Flash Attention failed for blocks.0.attn.attn, using original: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
[13:40:42.887] ❌ [ERROR] Forward pass error: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
[13:40:42.887] ℹ️ torch.float16 model - no conversion applied
[13:40:42.887] ❌ [ERROR] Error in Phase 2 (Upscaling): CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
!!! Exception during processing !!! CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
Traceback (most recent call last):
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/optimization/compatibility.py", line 437, in flash_attention_forward
    return self._sdpa_attention_forward(original_forward, module, *args, **kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/optimization/compatibility.py", line 472, in _sdpa_attention_forward
    return original_forward(*args, **kwargs)
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/models/dit_3b/attention.py", line 154, in forward
    return pytorch_varlen_attention(
        q, k, v, cu_seqlens_q, cu_seqlens_k,
        max_seqlen_q, max_seqlen_k, **kwargs
    )
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/models/dit_3b/attention.py", line 50, in pytorch_varlen_attention
    output_i = F.scaled_dot_product_attention(
        q_i, k_i, v_i,
        dropout_p=dropout_p if not deterministic else 0.0,
        is_causal=causal
    )
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/optimization/compatibility.py", line 437, in flash_attention_forward
    return self._sdpa_attention_forward(original_forward, module, *args, **kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/optimization/compatibility.py", line 472, in _sdpa_attention_forward
    return original_forward(*args, **kwargs)
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/models/dit_3b/nablocks/attention/mmattn.py", line 245, in forward
    out = self.attn(
          ~~~~~~~~~^
        q=concat_win(vid_q, txt_q),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<9 lines>...
        max_seqlen_k=cache_win("vid_max_seqlen_k", lambda: all_len_win.max()),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ).type_as(vid_q)
    ^
  File "/home/hum/comfy-0_3_68/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/hum/comfy-0_3_68/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/optimization/compatibility.py", line 441, in flash_attention_forward
    return original_forward(*args, **kwargs)
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/models/dit_3b/attention.py", line 154, in forward
    return pytorch_varlen_attention(
        q, k, v, cu_seqlens_q, cu_seqlens_k,
        max_seqlen_q, max_seqlen_k, **kwargs
    )
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/models/dit_3b/attention.py", line 50, in pytorch_varlen_attention
    output_i = F.scaled_dot_product_attention(
        q_i, k_i, v_i,
        dropout_p=dropout_p if not deterministic else 0.0,
        is_causal=causal
    )
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/optimization/compatibility.py", line 437, in flash_attention_forward
    return self._sdpa_attention_forward(original_forward, module, *args, **kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/optimization/compatibility.py", line 472, in _sdpa_attention_forward
    return original_forward(*args, **kwargs)
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/models/dit_3b/attention.py", line 154, in forward
    return pytorch_varlen_attention(
        q, k, v, cu_seqlens_q, cu_seqlens_k,
        max_seqlen_q, max_seqlen_k, **kwargs
    )
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/models/dit_3b/attention.py", line 50, in pytorch_varlen_attention
    output_i = F.scaled_dot_product_attention(
        q_i, k_i, v_i,
        dropout_p=dropout_p if not deterministic else 0.0,
        is_causal=causal
    )
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/execution.py", line 510, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/execution.py", line 324, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/execution.py", line 298, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/execution.py", line 286, in process_inputs
    result = f(**inputs)
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/comfy_api/internal/__init__.py", line 149, in wrapped_func
    return method(locked_class, **inputs)
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/comfy_api/latest/_io.py", line 1270, in EXECUTE_NORMALIZED
    to_return = cls.execute(*args, **kwargs)
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/interfaces/video_upscaler.py", line 569, in execute
    raise e
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/interfaces/video_upscaler.py", line 481, in execute
    ctx = upscale_all_batches(
        runner,
    ...<5 lines>...
        cache_model=dit_cache
    )
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/core/generation_phases.py", line 715, in upscale_all_batches
    upscaled_latents = runner.inference(
        noises=noises,
        conditions=conditions,
        **ctx['text_embeds'],
    )
  File "/home/hum/comfy-0_3_68/lib/python3.13/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/core/infer.py", line 365, in inference
    latents = self.sampler.sample(
        x=latents,
    ...<22 lines>...
        ),
    )
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/common/diffusion/samplers/euler.py", line 61, in sample
    pred = f(SamplerModelArgs(x, t, i))
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/core/infer.py", line 367, in <lambda>
    f=lambda args: classifier_free_guidance_dispatcher(
                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        pos=lambda: self.dit(
        ^^^^^^^^^^^^^^^^^^^^^
    ...<19 lines>...
        rescale=self.config.diffusion.cfg.rescale,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ),
    ^
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/common/diffusion/utils.py", line 76, in classifier_free_guidance_dispatcher
    return pos()
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/core/infer.py", line 368, in <lambda>
    pos=lambda: self.dit(
                ~~~~~~~~^
        vid=torch.cat([args.x_t, latents_cond], dim=-1),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<3 lines>...
        timestep=args.t.repeat(batch_size),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ).vid_sample,
    ^
  File "/home/hum/comfy-0_3_68/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/hum/comfy-0_3_68/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/optimization/compatibility.py", line 598, in forward
    return self.dit_model(*args, **kwargs)
           ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/hum/comfy-0_3_68/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/hum/comfy-0_3_68/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/models/dit_3b/nadit.py", line 222, in forward
    vid, txt, vid_shape, txt_shape = gradient_checkpointing(
                                     ~~~~~~~~~~~~~~~~~~~~~~^
        enabled=(self.gradient_checkpointing and self.training),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<6 lines>...
        cache=cache,
        ^^^^^^^^^^^^
    )
    ^
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/models/dit_3b/nadit.py", line 32, in gradient_checkpointing
    return module(*args, **kwargs)
  File "/home/hum/comfy-0_3_68/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/hum/comfy-0_3_68/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/optimization/blockswap.py", line 438, in wrapped_forward
    output = original_forward(*args, **kwargs)
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/models/dit_3b/nablocks/mmsr_block.py", line 112, in forward
    vid_attn, txt_attn = self.attn(vid_attn, txt_attn, vid_shape, txt_shape, cache)
                         ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hum/comfy-0_3_68/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/hum/comfy-0_3_68/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/optimization/compatibility.py", line 441, in flash_attention_forward
    return original_forward(*args, **kwargs)
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/models/dit_3b/nablocks/attention/mmattn.py", line 245, in forward
    out = self.attn(
          ~~~~~~~~~^
        q=concat_win(vid_q, txt_q),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<9 lines>...
        max_seqlen_k=cache_win("vid_max_seqlen_k", lambda: all_len_win.max()),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ).type_as(vid_q)
    ^
  File "/home/hum/comfy-0_3_68/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/hum/comfy-0_3_68/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/optimization/compatibility.py", line 441, in flash_attention_forward
    return original_forward(*args, **kwargs)
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/models/dit_3b/attention.py", line 154, in forward
    return pytorch_varlen_attention(
        q, k, v, cu_seqlens_q, cu_seqlens_k,
        max_seqlen_q, max_seqlen_k, **kwargs
    )
  File "/home/hum/comfy-0_3_68/ComfyUI-0.3.68/custom_nodes/ComfyUI-SeedVR2-VideoUpscaler/src/models/dit_3b/attention.py", line 50, in pytorch_varlen_attention
    output_i = F.scaled_dot_product_attention(
        q_i, k_i, v_i,
        dropout_p=dropout_p if not deterministic else 0.0,
        is_causal=causal
    )
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

Prompt executed in 23.46 seconds
EulerSampler:   0%|                                                                                                                                            | 0/1 [00:03<?, ?it/s]
Partial log of another try with Q4 and no offload
[14:11:43.154]  ━━━━━━━━ Phase 1: VAE encoding ━━━━━━━━                                                                                                                              
[14:11:43.154] 🎨 Materializing VAE weights to CUDA:0: /home/hum/comfy-0_3_68/ComfyUI-0.3.68/models/SEEDVR2/ema_vae_fp16.safetensors                                                 
[14:11:43.793] 🎨 Encoding batch 1/1                                                                                                                                                 
[14:11:43.801] 📹   Sequence of 1 frames                                                                                                                                             
[14:11:46.326]                                                                                                                                                                       
[14:11:46.327]  ━━━━━━━━ Phase 2: DiT upscaling ━━━━━━━━                                                                                                                             
[14:11:46.414] 🚀 Materializing DiT weights to CUDA:0: /home/hum/comfy-0_3_68/ComfyUI-0.3.68/models/SEEDVR2/seedvr2_ema_3b-Q4_K_M.gguf                                               
[14:11:50.001] 🎬 Upscaling batch 1/1                                                                                                                                                
EulerSampler:   0%|                                                                                                                                            | 0/1 [00:00<?, ?it/s]
[14:11:50.080] ⚠️ [WARNING] Flash Attention failed for blocks.0.attn.attn, using original: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 970         Off |   00000000:02:00.0 Off |                  N/A |
| 75%   63C    P2             62W /  200W |    3673MiB /   4096MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      4372      C   python                                       3667MiB |
+-----------------------------------------------------------------------------------------+
Some relevant installed packages
pip:
rotary-embedding-torch     0.8.9
torch                      2.6.0a0+gitunknown
torchaudio                 2.6.0a0
torchsde                   0.2.6
torchvision                0.21.0

apt:
ii  nvidia-cuda-toolkit                  12.4.131~12.4.1-2                    amd64        NVIDIA CUDA development toolkit
ii  libcublas12:amd64                    12.4.5.8~12.4.1-2                    amd64        NVIDIA cuBLAS Library
PyTorch config
PyTorch built with:
  - GCC 14.2
  - C++ Version: 201703
  - Intel(R) Math Kernel Library Version 2020.0.4 Product Build 20200917 for Intel(R) 64 architecture applications
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: NO AVX
  - CUDA Runtime 12.4
  - NVCC architecture flags: -gencode;arch=compute_52,code=sm_52
  - CuDNN 90.0  (built against CUDA 12.3)
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, COMMIT_SHA=Unknown, CUDA_VERSION=12.4, CUDNN_VERSION=9.0.0, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-error=dangling-reference -Wno-error=redundant-move -Wno-stringop-overflow, LAPACK_INFO=mkl, TORCH_VERSION=2.6.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

My PyTorch is a bit of a hacky build, since the official builds don't support systems without AVX. However, I am able to run most models in ComfyUI without any problems: Flux/Chroma, Qwen-Image-Edit, and Wan 2.2 5B/14B all work, and so do LLMs, VLMs, LoRA training, image upscalers, detectors/detailers, etc.

So far only FlashVSR has been a no-go, because block-sparse attention (or something like it) required newer compute capability. Now I don't know about SeedVR2: is my hardware entirely insufficient for it, or should I build a new PyTorch and try enabling some library that is currently disabled (fbgemm/xnnpack/mkl-dnn)?
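For what it's worth, the `CUDA_R_16BF` arguments in the failing `cublasGemmStridedBatchedEx` call suggest a bfloat16 batched GEMM, and the GTX 970 is only compute capability 5.2. A standalone check along these lines (just a sketch using standard PyTorch APIs, nothing SeedVR2-specific) should show whether that path is supported at all on this setup:

```python
import torch

# Report the GPU's compute capability and PyTorch's own verdict on
# bf16 support; a GTX 970 should report (5, 2).
print(torch.cuda.get_device_capability(0))
print(torch.cuda.is_bf16_supported())

# Try a batched bf16 matmul in isolation; this should exercise the same
# strided-batched cuBLAS GEMM path as the attention code in the traceback.
a = torch.randn(4, 64, 64, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4, 64, 64, device="cuda", dtype=torch.bfloat16)
try:
    torch.bmm(a, b)
    print("bf16 batched matmul OK")
except RuntimeError as e:
    print("bf16 batched matmul failed:", e)
```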

Or is there some SDPA implementation I could try enabling or disabling, based on the following?

From the comments in torch/nn/functional.py
        All implementations are enabled by default. Scaled dot product attention attempts to automatically select the
        most optimal implementation based on the inputs. In order to provide more fine-grained control over what implementation
        is used, the following functions are provided for enabling and disabling implementations.
        The context manager is the preferred mechanism:

            - :func:`torch.nn.attention.sdpa_kernel`: A context manager used to enable or disable any of the implementations.
            - :func:`torch.backends.cuda.enable_flash_sdp`: Globally enables or disables FlashAttention.
            - :func:`torch.backends.cuda.enable_mem_efficient_sdp`: Globally enables or disables  Memory-Efficient Attention.
            - :func:`torch.backends.cuda.enable_math_sdp`: Globally enables or disables  the PyTorch C++ implementation.
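If I read that right, the context manager would be used roughly like this (a minimal sketch of the standard PyTorch API; whether it helps presumably depends on whether the math fallback avoids the unsupported GEMM):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

q = k = v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

# Restrict SDPA to the math (C++ fallback) implementation; the flash and
# memory-efficient kernels are disabled for the duration of the block.
with sdpa_kernel(SDPBackend.MATH):
    out = F.scaled_dot_product_attention(q, k, v)
```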
