@@ -67,12 +67,12 @@ torchrun --nproc-per-node 2 --no-python \
 --use-subquadratic-ops
 ```
 
-> **Tip:** The `--use-subquadratic-ops` flag enables a fused back-to-back
-> causal convolution CUDA kernel for the Hyena short-conv layers. This
-> provides a meaningful speed-up for training and prediction and is
-> recommended for all production runs. It does not apply to autoregressive
-> inference (`infer_evo2`). There is a one-time compilation cost on first
-> use.
+> **Tip:** The `--use-subquadratic-ops` flag enables fused subquadratic-ops
+> CUDA kernels (`b2b_causal_conv1d` for proj+mixer fusion in prefill,
+> `fft_causal_conv1d` / `causal_conv1d` inside `engine.parallel_fir`). It
+> applies to training, batch prediction (`predict_evo2`), and the prefill
+> phase of autoregressive inference (`infer_evo2`); per-token decode is
+> already in optimal recurrent form and is unaffected.
 
 ### Autoregressive generation (`infer_evo2`)
 
@@ -97,6 +97,9 @@ Options:
 - `--top-k` / `--top-p` — top-k or nucleus sampling (0 = disabled).
 - `--tensor-parallel-size` — tensor parallelism for large models (default: 1).
 - `--max-seq-length` — maximum sequence length (default: 8192).
+- `--use-subquadratic-ops` — use fused subquadratic-ops kernels for prefill
+  (b2b causal conv, FFT/causal conv1d in `parallel_fir`). Recommended when
+  processing many prompts in one process.
 
 ### Batch sequence scoring (`predict_evo2`)
 
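
The updated tip distinguishes prefill, where the whole prompt goes through a causal convolution at once, from per-token decode, which is already in recurrent form. The sketch below is a minimal pure-Python illustration of that distinction for a short FIR filter. It is not the subquadratic-ops API or its fused CUDA kernels: `causal_conv1d_prefill` and `FirDecodeState` are hypothetical names chosen for this example, and the filter taps are arbitrary.

```python
from collections import deque


def causal_conv1d_prefill(x, h):
    """Whole-sequence causal FIR convolution: y[t] = sum_k h[k] * x[t - k].

    Hypothetical reference helper; a fused kernel would compute the same
    values for all positions in one pass.
    """
    return [
        sum(h[k] * x[t - k] for k in range(len(h)) if t - k >= 0)
        for t in range(len(x))
    ]


class FirDecodeState:
    """Per-token (recurrent) form: keeps only the last len(h) inputs."""

    def __init__(self, h):
        self.h = list(h)
        # window[k] holds x[t - k]; zeros stand in for inputs before t = 0.
        self.window = deque([0.0] * len(self.h), maxlen=len(self.h))

    def step(self, x_t):
        # Pushing on the left drops the oldest input from the right.
        self.window.appendleft(x_t)
        return sum(hk * xk for hk, xk in zip(self.h, self.window))


x = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy input sequence
h = [0.5, 0.25, 0.125]          # arbitrary short filter taps

prefill_out = causal_conv1d_prefill(x, h)

state = FirDecodeState(h)
decode_out = [state.step(v) for v in x]

print(prefill_out)  # [0.5, 1.25, 2.125, 3.0, 3.875]
print(decode_out == prefill_out)  # True
```

Both paths produce the same outputs. The per-token path does a fixed amount of work per step regardless of sequence length, which is consistent with the tip's statement that decode is unaffected by the flag while prefill is where a fused kernel can help.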