# CUTLASS 4.5.0

_CUTLASS 4.5.0 - May 2026_

CUTLASS is a collection of abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA.

# What's New in CUTLASS 4.5

### CuTe DSL

* New features
  - New Block API `block_copy()` to simplify TMA and S2T copies. With `block_copy()`, users can ignore the details of multicast and 2-CTA partitioning for TMA and no longer need to invoke `tma_partition()`; it also removes the bulk of S2T initialization.
  - MXF8F6F4 mixed-precision support
    - BlockScaled MMA now supports MXF8*MXF4 and MXF8*MXF6
  - Block Scaled MMA for SM120 now works on Spark
  - EFC broadcast semantics support
    - EFC epilogue functions can now broadcast and remap tensor modes via `C.remap_modes[:, 0, 1]` subscript syntax (where `:` marks a broadcast dimension and integers select source mode indices). This covers scalar broadcast, row/column broadcast, and arbitrary mode permutations (e.g. transpose). The PyTorch reference evaluator mirrors the same transformations.
  - Initial linter support: improved type hints on CuTe DSL APIs to support static type checkers such as MyPy
  - `dataclasses.dataclass` is now supported for JIT compilation and `cute.compile`, on both the plain and tvm-ffi paths
  - `cute.copy` now supports user-specified loop unrolling
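The mode remap/broadcast semantics above can be mirrored in plain NumPy. The sketch below is illustrative only: `remap_modes_ref` and its tuple-based spec encoding are hypothetical stand-ins for the DSL's `C.remap_modes[...]` subscript syntax, with `None` playing the role of `:` (a broadcast dimension) and integers selecting source mode indices.

```python
import numpy as np

def remap_modes_ref(src, spec, out_shape):
    # Hypothetical reference for remap-modes semantics: each entry of `spec`
    # describes one output mode. An integer i selects source mode i; None
    # marks a broadcast dimension (like ':' in the DSL subscript syntax).
    selected = [m for m in spec if m is not None]
    permuted = np.transpose(src, selected)        # arbitrary mode permutation
    shape = []
    it = iter(permuted.shape)
    for m in spec:
        shape.append(1 if m is None else next(it))  # size-1 axis where broadcast
    return np.broadcast_to(permuted.reshape(shape), out_shape)

# Row broadcast: a length-4 vector replicated across 3 rows.
v = np.arange(4.0)
row = remap_modes_ref(v, (None, 0), (3, 4))

# Transpose expressed as a mode permutation.
m = np.arange(6.0).reshape(2, 3)
t = remap_modes_ref(m, (1, 0), (3, 2))
```

The same pattern covers scalar broadcast (all entries `None`) and column broadcast (`(0, None)`).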
+
8
19
* Bug fixing and improvements
9
20
- Improved source code correlation for profiling/debugging
21
+
- Fixed an aarch64 segfault issue with tvm-ffi
22
+
- Re-organization for CuTe DSL examples/tutorials for better discoverability
23
+
24
+
* More examples of authoring peak-performance kernels
  - MoE examples
    - A new style of grouped GEMM that aligns with torch's grouped_mm and scaled_grouped_mm interfaces.
    - Expert-wise tensormap descriptor setup is done by a cheap helper kernel (~2 us) to avoid long latency on tile switches; the kernel structure is much closer to a normal GEMM.
    - Compared to torch_210_cu13, very few problems show worse perf on B200:
      - mxfp8_2dx3d: avg 1.29x speedup
      - mxfp8_2dx2d: avg 1.41x speedup
      - nvfp4_2dx3d: avg 1.11x speedup
      - nvfp4_2dx2d: avg 1.12x speedup (worst case 0.98x)
      - bf16_2dx3d: avg 1.15x speedup (worst case 0.98x)
      - bf16_2dx2d: avg 1.17x speedup (worst case 0.96x)
    - Note: perf is measured with the torch profiler; this implementation includes the helper kernel + main kernel, while torch's includes its setup kernel and the cutlass_cpp main kernel.
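For readers unfamiliar with the grouped-GEMM interface these examples align with, here is a shape-level NumPy sketch. The name `grouped_mm_ref` is illustrative, not torch's or CUTLASS's API; torch's grouped_mm operates on device tensors with offset tensors, and this only mirrors the "2dx3d" per-expert matmul semantics (a 2D stacked-token operand against a 3D per-expert weight operand).

```python
import numpy as np

def grouped_mm_ref(a, b, offs):
    # a:    (total_tokens, K)     -- token rows for all experts, concatenated ("2d")
    # b:    (num_experts, K, N)   -- one weight matrix per expert ("3d")
    # offs: cumulative row offsets, one per expert group
    # Each contiguous group of rows is multiplied by its expert's weights.
    out = np.empty((a.shape[0], b.shape[2]), dtype=a.dtype)
    start = 0
    for e, end in enumerate(offs):
        out[start:end] = a[start:end] @ b[e]
        start = end
    return out

rng = np.random.default_rng(0)
a = rng.standard_normal((10, 8)).astype(np.float32)    # 10 tokens, K=8
b = rng.standard_normal((3, 8, 4)).astype(np.float32)  # 3 experts, N=4
offs = [4, 7, 10]                                      # expert row boundaries
c = grouped_mm_ref(a, b, offs)                         # -> shape (10, 4)
```

The fused kernel avoids this Python-level loop entirely: each tile looks up its expert's tensormap (prepared by the helper kernel) and proceeds like a normal GEMM.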
* API changes
  - `ab_dtype` is deprecated in `make_trivial_tiled_mma` and `make_blockscaled_trivial_tiled_mma` in `blackwell_helpers.py`. Specify `a_dtype` and `b_dtype` separately instead.
### CUTLASS C++
* Add 2SM MMA instruction support to mixed TMA+CpAsync SM100 vanilla GEMM kernels.
  - Mixed TMA+CpAsync can now accept static but non-trivial cluster shapes.
  - Uses TMA multicast for the A tile when the cluster size along the N mode is non-trivial.
  - Uses an additional barrier (`mma_trampoline_barrier`) to track cp.async arrivals in both CTAs.
  - Changes are included in [example 92](https://github.com/NVIDIA/cutlass/tree/main/examples/92_blackwell_moe_gemm).
* Add support for 128x32xK and 128x64xK tile sizes in the SM120 blockscaled MMA collective builders, yielding up to a 30% performance improvement in Blackwell SM121-related kernels.
* Add static load to tensor memory support, included in [example 77](https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha/).
* Use 64-bit adds for SM100 MMA descriptor offsets and reduce move instructions for improved code generation.
* Add [example 95](https://github.com/NVIDIA/cutlass/tree/main/examples/95_blackwell_gemm_green_context) demonstrating green-context SM partitioning.
  - Enables launching GEMM on a stream with a partial SM allocation.
* Fix several kernel issues:
  - Fix `l2_capacity=0` handling in Blackwell SM100/SM120 kernel templates
  - Fix CUTLASS clang build issues
  - Fix the atomicCAS read-modify-write loop in `ConstSubbyteReference`
  - Replace `__nv_atomic_load_n` with `volatile` for CUDA 11.4 compatibility in the subbyte reference
  - Remove `PipelineStorage` shadowing in the SM100 complex epilogue