Support subquadratic-ops kernels in evo2 autoregressive inference (#1565)

farhadrgh · web-flow · commit 5497faa6ea71 · 2026-04-30T21:31:03.000Z
### Description

Closes the gap noted in `hyena_mixer.py` (`# todo: support inference_context for b2b_kernel`) and the README caveat that `--use-subquadratic-ops` "does not apply to autoregressive inference (`infer_evo2`)". After this PR, the same fused kernels that accelerate training and batch prediction also accelerate the prefill phase of autoregressive inference.

Summary of change:

1. **`engine.parallel_fir`** now accepts `use_subquadratic_ops` and routes to `fft_causal_conv1d` (filters ≥ 128) or `causal_conv1d` (short filters), wired through both call sites in `hyena_utils.py`.
2. **`HyenaMixer.forward`** detects prefill (no FIR cache yet) and runs `b2b_causal_conv1d` for the fused proj+mixer convolution. The kernel doesn't expose its intermediate, so we run a tiny windowed proj-conv on the last `K_proj + K_mixer − 2` input positions to materialize the `(x2*v)` tail and seed the mixer's FIR cache. Works for both `hyena_short_conv` and `hyena_medium_conv`.
3. Removed the `del self._parameters["short_conv_weight"]` micro-optimization in `ParallelCausalDepthwiseConv1dWithState._get_weight()`: `B2BCausalConv1dModule` reads that raw param on every prefill, so deleting it after first decode broke multi-prompt inference. Memory cost is ~4 MB for a 1B model.

`infer_evo2` gets a `--use-subquadratic-ops` flag.

## Testing

- New parametrization `test_forward_manual[1b-8k-bf16-subquadratic-ops-flash]` covers the `(flash_decode=True, subquadratic_ops=True)` combination that was previously skipped.
- New `test_subquadratic_ops_matches_baseline` runs greedy autoregressive generation with and without `--use-subquadratic-ops` and asserts identical output. This is the strict check that Phase 2 state population is correct (a wrong cache would diverge during decode).
- Existing kernel comparison tests (`test_hyena_mixer_kernel.py`) and inference-context unit tests pass unchanged.
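The cache-seeding step in item 2 of the summary rests on a locality property of causal FIR filters: the last `K_mixer − 1` outputs of a length-`K_proj` causal convolution depend only on the last `K_proj + K_mixer − 2` inputs. The following is a minimal, dependency-free sketch of that property and not the PR's implementation (which uses `F.conv1d` on `[B, D, L]` tensors). The function names, the single-channel layout, and the kernel sizes are all hypothetical and chosen only for illustration; the sketch also assumes `k_mixer >= 2`.

```python
import random


def causal_conv1d(x, w):
    """Causal FIR filter: y[t] = sum_k w[k] * x[t - k], treating x[t] as 0 for t < 0."""
    return [
        sum(w[k] * x[t - k] for k in range(len(w)) if t - k >= 0)
        for t in range(len(x))
    ]


def gated_tail_full(x2_in, v_in, w_x2, w_v, k_mixer):
    """Reference path: project the whole sequence, gate, keep the last k_mixer - 1 values."""
    x2 = causal_conv1d(x2_in, w_x2)
    v = causal_conv1d(v_in, w_v)
    gated = [a * b for a, b in zip(x2, v)]
    return gated[-(k_mixer - 1):]


def gated_tail_windowed(x2_in, v_in, w_x2, w_v, k_mixer):
    """Windowed path: project only the last k_proj + k_mixer - 2 inputs, then gate."""
    k_proj = len(w_x2)
    window = k_proj + k_mixer - 2
    x2 = causal_conv1d(x2_in[-window:], w_x2)
    v = causal_conv1d(v_in[-window:], w_v)
    gated = [a * b for a, b in zip(x2, v)]
    # The first k_proj - 1 window outputs see implicit zeros and are discarded here.
    return gated[-(k_mixer - 1):]


if __name__ == "__main__":
    random.seed(0)
    seq_len, k_proj, k_mixer = 64, 3, 7  # hypothetical sizes, not taken from the model
    x2_in = [random.uniform(-1, 1) for _ in range(seq_len)]
    v_in = [random.uniform(-1, 1) for _ in range(seq_len)]
    w_x2 = [random.uniform(-1, 1) for _ in range(k_proj)]
    w_v = [random.uniform(-1, 1) for _ in range(k_proj)]

    full = gated_tail_full(x2_in, v_in, w_x2, w_v, k_mixer)
    windowed = gated_tail_windowed(x2_in, v_in, w_x2, w_v, k_mixer)
    print("max abs difference:", max(abs(a - b) for a, b in zip(full, windowed)))
```

The windowed path touches a number of positions that depends only on the kernel sizes, not on the prompt length, which is why the extra projection pass after the fused kernel is described as tiny.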
## Performance

`infer_evo2`, evo2/1b-8k-bf16, single A6000, multiple identical prompts in one process to amortize the one-time JIT compile cost (~15 s the first time each subq-ops kernel sees a new shape). Steady-state numbers from batches 3+:

| Prompt | Generation | Baseline | Subq-ops | Speedup |
|---|---|---|---|---|
| 4 096 tokens | 5 tokens | 0.57 s | 0.51 s | ~10% |
| 8 000 tokens | 1 token | 1.02 s | 0.87 s | ~15% |

The speedup is concentrated in prefill. The relative improvement grows with prompt length and shrinks as more decode tokens are amortized in.

### Type of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [x] Documentation update
- [ ] Other (please describe):

### Pre-submit Checklist

- [x] I have tested these changes locally
- [x] I have updated the documentation accordingly
- [x] I have added/updated tests as needed
- [x] All existing tests pass successfully

## Summary by CodeRabbit

* **New Features**
  * Added `--use-subquadratic-ops` CLI option to optimize prompt/prefill processing during inference while leaving per-token decode unchanged.
* **Documentation**
  * Clarified subquadratic-ops kernel behavior and performance impact on prefill throughput.
* **Tests**
  * Added end-to-end test confirming that subquadratic-ops generates the same inference results as the baseline.

---------

Signed-off-by: Farhad Ramezanghorbani <farhadr@nvidia.com>
diff --git a/bionemo-recipes/recipes/evo2_megatron/README.md b/bionemo-recipes/recipes/evo2_megatron/README.md
@@ -67,12 +67,12 @@ torchrun --nproc-per-node 2 --no-python \
   --use-subquadratic-ops
 ```
 
-> **Tip:** The `--use-subquadratic-ops` flag enables a fused back-to-back
-> causal convolution CUDA kernel for the Hyena short-conv layers. This
-> provides a meaningful speed-up for training and prediction and is
-> recommended for all production runs. It does not apply to autoregressive
-> inference (`infer_evo2`). There is a one-time compilation cost on first
-> use.
+> **Tip:** The `--use-subquadratic-ops` flag enables fused subquadratic-ops
+> CUDA kernels (`b2b_causal_conv1d` for proj+mixer fusion in prefill,
+> `fft_causal_conv1d` / `causal_conv1d` inside `engine.parallel_fir`). It
+> applies to training, batch prediction (`predict_evo2`), and the prefill
+> phase of autoregressive inference (`infer_evo2`); per-token decode is
+> already in optimal recurrent form and is unaffected.
 
 ### Autoregressive generation (`infer_evo2`)
 
@@ -97,6 +97,9 @@ Options:
 - `--top-k` / `--top-p` — top-k or nucleus sampling (0 = disabled).
 - `--tensor-parallel-size` — tensor parallelism for large models (default: 1).
 - `--max-seq-length` — maximum sequence length (default: 8192).
+- `--use-subquadratic-ops` — use fused subquadratic-ops kernels for prefill
+  (b2b causal conv, FFT/causal conv1d in `parallel_fir`). Recommended when
+  processing many prompts in one process.
 
 ### Batch sequence scoring (`predict_evo2`)
 
diff --git a/bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/engine.py b/bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/engine.py
@@ -20,6 +20,15 @@
 from einops import rearrange
 
 
+try:
+    from subquadratic_ops_torch.causal_conv1d import causal_conv1d as _subq_causal_conv1d
+    from subquadratic_ops_torch.fft_causal_conv1d import fft_causal_conv1d as _subq_fft_causal_conv1d
+except ImportError as _subq_import_error:
+    _subq_causal_conv1d = None
+    _subq_fft_causal_conv1d = None
+    _subq_error_msg = f"subquadratic_ops_torch not available: {_subq_import_error}"
+
+
 def adjust_filter_shape_for_broadcast(u, h):
     """Adjust filter shape for broadcasting compatibility with input tensor."""
     h = h.squeeze()  # Standardize to [D, L] from [1, D, L] and [D, 1, L]
@@ -63,27 +72,47 @@ def parallel_fir(
     gated_bias,
     fir_length,
     compute_state,
+    use_subquadratic_ops=False,
 ):
     """Compute parallel finite impulse response filtering with optional state computation."""
     L = u.shape[1]  # noqa: N806
     u = rearrange(u, "b l d -> b d l")
 
+    if use_subquadratic_ops and _subq_fft_causal_conv1d is None:
+        raise ImportError(_subq_error_msg)
+
     if fir_length >= 128:
-        with torch.autocast("cuda"):
-            z = fftconv_func(
-                u=u.to(torch.float32),
-                k=weight[:, :, :L].to(torch.float32),
-                D=bias,
-            ).to(dtype=u.dtype)
+        if use_subquadratic_ops:
+            # subq-ops fft_causal_conv1d expects [B, D, L] input and [D, L] filter; dtypes must match
+            k = weight[:, :, :L].squeeze(1) if weight.dim() == 3 else weight[:, :L]
+            u_fp32 = u.to(torch.float32)
+            z = _subq_fft_causal_conv1d(u_fp32, k.to(torch.float32))
+            if bias is not None:
+                z = z + u_fp32 * bias.unsqueeze(-1)
+            z = z.to(u.dtype)
+        else:
+            with torch.autocast("cuda"):
+                z = fftconv_func(
+                    u=u.to(torch.float32),
+                    k=weight[:, :, :L].to(torch.float32),
+                    D=bias,
+                ).to(dtype=u.dtype)
     else:
-        z = F.conv1d(
-            u.to(torch.float32),
-            weight.to(torch.float32),
-            bias=None,
-            stride=1,
-            padding=fir_length - 1,
-            groups=u.shape[1],  # always set to D, regardless of filter grouping
-        )[..., :L]
+        if use_subquadratic_ops:
+            # subq-ops causal_conv1d expects pre-padded [B, D, L+pad] input and [D, K] weight; dtypes must match
+            pad_size = fir_length - 1
+            x_padded = F.pad(u.to(torch.float32), (pad_size, 0))
+            w = weight.squeeze(1) if weight.dim() == 3 else weight
+            z = _subq_causal_conv1d(x_padded, w.to(torch.float32))[..., pad_size:]
+        else:
+            z = F.conv1d(
+                u.to(torch.float32),
+                weight.to(torch.float32),
+                bias=None,
+                stride=1,
+                padding=fir_length - 1,
+                groups=u.shape[1],  # always set to D, regardless of filter grouping
+            )[..., :L]
 
         z = z.to(u.dtype)
 
diff --git a/bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_mixer.py b/bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_mixer.py
@@ -22,6 +22,7 @@
 
 import torch
 import torch.nn as nn
+import torch.nn.functional as F  # noqa: N812
 from einops import rearrange
 from megatron.core.process_groups_config import ProcessGroupCollection
 from megatron.core.transformer.module import MegatronModule
@@ -307,14 +308,20 @@ def forward(self, x, layer_past=None, inference_context=None, _hyena_use_cp=True
         else:
             features = rearrange(features, "l b d -> b d l").contiguous()
 
-        if (
-            self.use_subquadratic_ops
-            and self.operator_type in ["hyena_short_conv", "hyena_medium_conv"]
-            and inference_context is None
-        ):
-            # todo: support inference_context for b2b_kernel
-            # Use the B2BCausalConv1dModule wrapper with the existing weights from the original model
+        is_b2b_eligible = self.use_subquadratic_ops and self.operator_type in [
+            "hyena_short_conv",
+            "hyena_medium_conv",
+        ]
+        # b2b runs during training (no inference_context) or during prefill (no FIR cache yet).
+        # During decode (cache populated, L=1) we fall back to the regular per-token step path.
+        is_prefill = inference_context is not None and id(self.hyena_proj_conv) not in getattr(
+            inference_context, "fir_filter_state_dict", {}
+        )
+
+        if is_b2b_eligible and (inference_context is None or is_prefill):
             z = self.b2b_kernel(features, _use_cp=_proj_use_cp)
+            if is_prefill:
+                self._populate_b2b_inference_state(features, inference_context)
         else:
             features = self.hyena_proj_conv(
                 features, _use_cp=_proj_use_cp, inference_context=inference_context
@@ -330,3 +337,59 @@ def forward(self, x, layer_past=None, inference_context=None, _hyena_use_cp=True
             z = rearrange(z, "b d l -> l b d").contiguous()
         y, bias = self.dense(z)
         return y, bias
+
+    def _populate_b2b_inference_state(self, features, inference_context):
+        """Populate FIR state for proj_conv and mixer after a b2b prefill.
+
+        The b2b kernel doesn't expose its post-projection intermediate, but subsequent
+        decode steps need (a) the proj_conv input tail and (b) the tail of `x2 * v`
+        — the gated stream that mixer's short_conv operates on. We get (b) by running
+        a windowed proj_conv on just the last (K_proj + K_mixer - 2) input positions.
+        """
+        proj_kernel_size = self.hyena_proj_conv.kernel_size
+
+        # (a) proj_conv FIR state: input tail in [B, D, K_proj-1]
+        proj_state = features[..., -(proj_kernel_size - 1) :].contiguous()
+        proj_dict = getattr(inference_context, "fir_filter_state_dict", {})
+        proj_dict[id(self.hyena_proj_conv)] = proj_state
+        setattr(inference_context, "fir_filter_state_dict", proj_dict)
+
+        # (b) mixer FIR state: tail of (x2 * v), the gated post-projection stream
+        if self.operator_type == "hyena_short_conv":
+            mixer_kernel_size = self.mixer.short_conv.kernel_size
+        else:  # hyena_medium_conv
+            mixer_kernel_size = self.mixer.kernel_size
+
+        tail_in_len = proj_kernel_size + mixer_kernel_size - 2
+        if features.shape[-1] < tail_in_len:
+            tail_in = F.pad(features, (tail_in_len - features.shape[-1], 0))
+        else:
+            tail_in = features[..., -tail_in_len:].contiguous()
+
+        # Reuse the cached transformed weight from get_weight() (lru_cache'd).
+        proj_weight = self.hyena_proj_conv.get_weight()
+
+        intermediate = F.conv1d(
+            F.pad(tail_in.to(torch.float32), (proj_kernel_size - 1, 0)),
+            proj_weight,
+            bias=None,
+            stride=1,
+            padding=0,
+            groups=tail_in.shape[1],
+        )[..., -(mixer_kernel_size - 1) :].to(features.dtype)
+
+        x1, x2, v = rearrange(intermediate, "b (g dg p) l -> b (g dg) p l", p=3, g=self.num_groups_per_tp_rank).unbind(
+            dim=2
+        )
+        mixer_input_tail = (x2 * v).contiguous()  # [B, D, K_mixer-1]
+
+        if self.operator_type == "hyena_short_conv":
+            mixer_state_owner_id = id(self.mixer.short_conv)
+            mixer_dict_key = "fir_filter_state_dict"
+        else:  # hyena_medium_conv
+            mixer_state_owner_id = id(self.mixer)
+            mixer_dict_key = "inner_fir_filter_state_dict"
+
+        mixer_dict = getattr(inference_context, mixer_dict_key, {})
+        mixer_dict[mixer_state_owner_id] = mixer_input_tail
+        setattr(inference_context, mixer_dict_key, mixer_dict)
diff --git a/bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_utils.py b/bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_utils.py
@@ -1051,6 +1051,7 @@ def get_filter_state(filter_name):
                 L=L,
                 fir_length=self.kernel_size,  # self.short_filter_length,
                 compute_state=inference_context is not None,
+                use_subquadratic_ops=self.use_subquadratic_ops,
             )
             y = rearrange(y, "b d l -> b l d")
             y = y * x1
@@ -1656,12 +1657,16 @@ def __init__(self, *args, **kwargs):
         self.get_weight = lru_cache(maxsize=1)(self._get_weight)
 
     def _get_weight(self):
-        """Expand and cache the convolution weight, freeing the raw parameter."""
+        """Expand and cache the convolution weight in inference-friendly form."""
+        # previously deleted self._parameters["short_conv_weight"] here as a
+        # memory micro-optimization, but the raw param is also read directly by
+        # B2BCausalConv1dModule on every prefill call. With subq-ops enabled in
+        # inference, the second prompt's b2b call fails after decode triggers
+        # this method on the first prompt
         weight = self.short_conv_weight
         if len(weight.shape) == 2:
             weight = weight.unsqueeze(1)
         weight = weight.repeat_interleave(self.group_dim, dim=0).to(torch.float32)
-        del self._parameters["short_conv_weight"]
         return weight
 
     def forward(self, x, inference_context=None, _use_cp=True):  # noqa: D102
@@ -1697,6 +1702,7 @@ def get_filter_state(filter_name):
                 gated_bias=False,
                 fir_length=self.kernel_size,  # self.short_filter_length,
                 compute_state=inference_context is not None,
+                use_subquadratic_ops=self.use_subquadratic_ops,
             )
         else:
             if len(u.shape) > 2:
diff --git a/bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py b/bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py
@@ -358,6 +358,7 @@ def setup_inference_engine(
     vortex_style_fp8: bool = False,
     random_seed: int = 1234,
     prompt_segmentation_threshold: Optional[int] = None,
+    use_subquadratic_ops: bool = False,
 ) -> Evo2InferenceComponents:
     """Setup the Evo2 inference engine and related components.
 
@@ -379,6 +380,9 @@ def setup_inference_engine(
             segmented during prefill to reduce peak memory. The first segment
             runs as a normal prefill; remaining tokens are processed one at a
             time before generation begins.
+        use_subquadratic_ops: Use fused subquadratic-ops kernels (b2b causal
+            conv1d in prefill, fft_causal_conv1d / causal_conv1d in
+            parallel_fir).
 
     Returns:
         Evo2InferenceComponents containing all inference components.
@@ -412,6 +416,7 @@ def setup_inference_engine(
     model_provider.sequence_parallel = False
 
     model_provider.flash_decode = True
+    model_provider.use_subquadratic_ops = use_subquadratic_ops
 
     if vortex_style_fp8:
         model_provider.vortex_style_fp8 = True
@@ -808,6 +813,14 @@ def parse_args() -> argparse.Namespace:
         "generation begins. Useful for long prompts that would otherwise OOM. "
         "Also settable via EVO2_PST env var.",
     )
+    ap.add_argument(
+        "--use-subquadratic-ops",
+        action="store_true",
+        default=False,
+        help="Use fused subquadratic-ops CUDA kernels (b2b causal conv1d in prefill, "
+        "fft_causal_conv1d / causal_conv1d in parallel_fir). Speeds up prompt processing "
+        "but has no effect on per-token decode throughput.",
+    )
 
     return ap.parse_args()
 
@@ -831,6 +844,7 @@ def infer(
     max_seq_length: int = 8192,
     max_batch_size: int = 1,
     prompt_segmentation_threshold: Optional[int] = None,
+    use_subquadratic_ops: bool = False,
 ) -> List[Dict[str, Any]]:
     """Run autoregressive text generation with Evo2 using MCore inference.
 
@@ -858,6 +872,7 @@ def infer(
             GPU memory proportional to this value. For large models, only 1 may fit.
         prompt_segmentation_threshold: If set, prompts longer than this are segmented
             during prefill to reduce peak memory.
+        use_subquadratic_ops: Use fused subquadratic-ops kernels in the inference path.
 
     Returns:
         List of JSONL-serialisable result dicts.
@@ -878,6 +893,7 @@ def infer(
         vortex_style_fp8=vortex_style_fp8,
         random_seed=random_seed,
         prompt_segmentation_threshold=prompt_segmentation_threshold,
+        use_subquadratic_ops=use_subquadratic_ops,
     )
 
     mem_after_setup_gb = torch.cuda.max_memory_allocated() / (1024**3)
@@ -1003,6 +1019,7 @@ def main() -> None:
         max_seq_length=max_seq_length,
         max_batch_size=args.max_batch_size,
         prompt_segmentation_threshold=prompt_segmentation_threshold,
+        use_subquadratic_ops=args.use_subquadratic_ops,
     )
 
 
diff --git a/bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py b/bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py
@@ -284,6 +284,7 @@ def run_infer_subprocess(
     temperature: float = 1.0,
     top_k: int = 1,
     seed: int = 42,
+    use_subquadratic_ops: bool = False,
 ):
     """Helper function to run inference as a subprocess.
 
@@ -295,6 +296,7 @@ def run_infer_subprocess(
         temperature: Sampling temperature
         top_k: Top-k sampling parameter (1 for greedy)
         seed: Random seed for reproducibility
+        use_subquadratic_ops: Pass --use-subquadratic-ops to the CLI.
 
     Returns:
         The generated completion text from the first JSONL record
@@ -326,6 +328,8 @@ def run_infer_subprocess(
         "--seed",
         str(seed),
     ]
+    if use_subquadratic_ops:
+        cmd.append("--use-subquadratic-ops")
 
     env = copy.deepcopy(PRETEST_ENV)
 
@@ -517,6 +521,47 @@ def test_identical_prompts_should_be_identical(mbridge_checkpoint_path, tmp_path
     )
 
 
+def test_subquadratic_ops_matches_baseline(mbridge_checkpoint_path, tmp_path):
+    """Greedy generation with --use-subquadratic-ops must match the standard path.
+
+    This is the end-to-end correctness check for the subq-ops inference path:
+    Phase 1 routes engine.parallel_fir through subq-ops kernels during prefill,
+    Phase 2 fuses proj+mixer convs via b2b_causal_conv1d during prefill and
+    populates FIR caches for the subsequent decode steps. With greedy decoding
+    (top_k=1) and the same seed, both paths must produce identical output.
+    """
+    output_baseline = tmp_path / "output_baseline.jsonl"
+    output_subq = tmp_path / "output_subq.jsonl"
+
+    generated_baseline = run_infer_subprocess(
+        mbridge_checkpoint_path,
+        prompt=PROMPT_1,
+        output_file=output_baseline,
+        max_new_tokens=20,
+        temperature=1.0,
+        top_k=1,
+        seed=42,
+        use_subquadratic_ops=False,
+    )
+
+    generated_subq = run_infer_subprocess(
+        mbridge_checkpoint_path,
+        prompt=PROMPT_1,
+        output_file=output_subq,
+        max_new_tokens=20,
+        temperature=1.0,
+        top_k=1,
+        seed=42,
+        use_subquadratic_ops=True,
+    )
+
+    assert len(generated_baseline) > 0, "Baseline generation produced empty output"
+    assert len(generated_subq) > 0, "Subq-ops generation produced empty output"
+    assert generated_baseline == generated_subq, (
+        f"Subq-ops path diverged from baseline:\nBaseline: {generated_baseline}\nSubq-ops: {generated_subq}"
+    )
+
+
 def test_different_prompts_produce_different_outputs(mbridge_checkpoint_path, tmp_path):
     """Test that different prompts produce different sequences.
 
diff --git a/bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py b/bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py