Commit 05e1356

update README
Signed-off-by: Farhad Ramezanghorbani <farhadr@nvidia.com>
1 parent c0a3e30 commit 05e1356

1 file changed

Lines changed: 9 additions & 6 deletions

bionemo-recipes/recipes/evo2_megatron/README.md

@@ -67,12 +67,12 @@ torchrun --nproc-per-node 2 --no-python \
 --use-subquadratic-ops
 ```

-> **Tip:** The `--use-subquadratic-ops` flag enables a fused back-to-back
-> causal convolution CUDA kernel for the Hyena short-conv layers. This
-> provides a meaningful speed-up for training and prediction and is
-> recommended for all production runs. It does not apply to autoregressive
-> inference (`infer_evo2`). There is a one-time compilation cost on first
-> use.
+> **Tip:** The `--use-subquadratic-ops` flag enables fused subquadratic-ops
+> CUDA kernels (`b2b_causal_conv1d` for proj+mixer fusion in prefill,
+> `fft_causal_conv1d` / `causal_conv1d` inside `engine.parallel_fir`). It
+> applies to training, batch prediction (`predict_evo2`), and the prefill
+> phase of autoregressive inference (`infer_evo2`); per-token decode is
+> already in optimal recurrent form and is unaffected.

 ### Autoregressive generation (`infer_evo2`)

@@ -97,6 +97,9 @@ Options:
 - `--top-k` / `--top-p` — top-k or nucleus sampling (0 = disabled).
 - `--tensor-parallel-size` — tensor parallelism for large models (default: 1).
 - `--max-seq-length` — maximum sequence length (default: 8192).
+- `--use-subquadratic-ops` — use fused subquadratic-ops kernels for prefill
+(b2b causal conv, FFT/causal conv1d in `parallel_fir`). Recommended when
+processing many prompts in one process.

 ### Batch sequence scoring (`predict_evo2`)
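The revised tip separates prefill (the whole prompt processed at once) from per-token decode. As a rough illustration of why the two paths can be treated separately, the toy sketch below evaluates the same causal FIR filter in both forms and checks that they agree. It is pure Python and is not the subquadratic-ops implementation. The names `causal_conv_prefill` and `CausalConvDecoder` are hypothetical, and the FFT-based route mentioned in the tip is not shown.

```python
from collections import deque


def causal_conv_prefill(x, h):
    """Whole-sequence form: y[t] = sum_k h[k] * x[t - k], with x[<0] treated as 0."""
    return [
        sum(h[k] * x[t - k] for k in range(len(h)) if t - k >= 0)
        for t in range(len(x))
    ]


class CausalConvDecoder:
    """Per-token form: carries only the last len(h) inputs as state."""

    def __init__(self, h):
        self.h = list(h)
        # Fixed-size window, newest sample first; starts as all zeros.
        self.state = deque([0.0] * len(self.h), maxlen=len(self.h))

    def step(self, x_t):
        # Pushing on the left drops the oldest sample from the right.
        self.state.appendleft(x_t)
        return sum(hk * xk for hk, xk in zip(self.h, self.state))


# Hypothetical toy inputs: a 4-token "prompt" and a 3-tap filter.
x = [1.0, 2.0, 3.0, 4.0]
h = [0.5, 0.25, 0.125]

decoder = CausalConvDecoder(h)
stepwise = [decoder.step(v) for v in x]

# Both forms produce the same outputs for the same inputs.
assert stepwise == causal_conv_prefill(x, h)
```

In this sketch the per-token path does a fixed amount of work per step regardless of how many tokens came before, which is consistent with the tip's note that decode is left unchanged while prefill, which sees the full prompt at once, is where the fused kernels apply.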
