Commit c66a379
Add vLLM support for NeMo SpeechLM (#15520)
* Add vLLM plugin for NeMo SpeechLM inference
Register NeMo SpeechLM models into vLLM via the general_plugins entry point.
Supports hybrid (NemotronH) and standard transformer (Qwen3) backbones.
- NeMoSpeechLMHybridForConditionalGeneration: hybrid Mamba+MoE models
- NeMoSpeechLMForConditionalGeneration: standard transformer models
- NeMoSpeechLMStdForConditionalGeneration: legacy alias for standard
- Audio preprocessing with automatic resampling to 16 kHz mono (a sketch follows below)
- Thread-safe tokenizer patch for vLLM's concurrent encoding
- Includes unit tests
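For illustration, a minimal sketch of the 16 kHz mono preprocessing contract (librosa as the resampler is an assumption here; the plugin's actual helper differs):
```python
import librosa
import numpy as np

def preprocess_audio(y: np.ndarray, sr: int) -> np.ndarray:
    """Reduce to mono and resample to 16 kHz (illustrative helper,
    not the plugin's actual function)."""
    if y.ndim == 2:  # (channels, samples) -> mono by averaging channels
        y = y.mean(axis=0)
    if sr != 16_000:
        y = librosa.resample(y, orig_sr=sr, target_sr=16_000)
    return y
```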
* vllm plugin: rename classes so standard owns the base name
Swap the naming convention so it follows "unqualified base name = default
variant, qualified name = specialization":
NeMoSpeechLMForConditionalGeneration -> standard (Qwen3, Parakeet)
NeMoSpeechLMHybridForConditionalGeneration -> hybrid Mamba+MoE (NemotronH)
Previously the unqualified base name was the hybrid class, which made
to_hf.py's arch auto-detection point non-hybrid checkpoints at the wrong
implementation. Keep to_hf.py as the contract and rename the plugin
classes to match.
Legacy alias NeMoSpeechLMStdForConditionalGeneration now points at the
new base-named class so checkpoints exported under the old name load.
Made-with: Cursor
* vllm plugin: drop NeMoSpeechLMStdForConditionalGeneration legacy alias
No checkpoints in circulation use this name -- to_hf.py is the single
source of truth for exported architecture names, and it only emits the
two canonical names.
Made-with: Cursor
* vllm plugin: document that nemotron_v3 name is historical
The package covers every SpeechLM backbone (Qwen3, NemotronH, ...); the
folder name is a historical artifact from when the plugin started as a
NemotronH-only experiment.
Made-with: Cursor
* vllm plugin config: tighten validation + document the quirks
- Fail fast in `NeMoSpeechLMConfig.__init__` when the backbone config's
`architectures` list isn't length-1: mixed or missing architectures
currently route silently (mixed -> hybrid-if-any-match; missing ->
treated as standard). A raised ValueError catches malformed checkpoints
at plugin load time instead of serving wrong weights (see the sketch
after this list).
- Name the magic +10 on `text_config.vocab_size`: new constant
`_SPEECHLM_EMBED_EXTRA_ROWS` with a block comment explaining that it must
match training-time vocab additions (audio locator + padding) so the
embedding matrix in model.safetensors loads without a shape mismatch.
- Document the `architectures = ["NemotronHForCausalLM"]` normalization
on hybrid backends (different checkpoints list different aliases; only
the canonical name is in vLLM's registry).
- Add a docstring on `__getattr__` explaining the guard list: prevents
infinite recursion when plugin-specific fields are queried before
`__init__` finishes, and prevents accidental delegation to same-named
attributes on the wrapped `text_config`.
- Drop the redundant `_ATTR_ALIASES` entry from the guard tuple: it
starts with `_` so `startswith("_")` already catches it.
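For illustration, a minimal sketch of the fail-fast check and the named constant (helper name and error text are mine; the real code lives in `NeMoSpeechLMConfig.__init__`):
```python
_SPEECHLM_EMBED_EXTRA_ROWS = 10  # must match training-time vocab additions
                                 # (audio locator + padding rows)

def validate_backbone_architectures(architectures: list[str] | None) -> str:
    """Require exactly one architecture so malformed checkpoints fail at
    plugin load time instead of silently routing to the wrong variant."""
    if not architectures or len(architectures) != 1:
        raise ValueError(
            f"expected exactly one backbone architecture, got {architectures!r}"
        )
    return architectures[0]
```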
Made-with: Cursor
* vllm plugin: drop dead output_dim fallback in _load_nemo_perception
Every training YAML in speechlm-2026h1/ sets perception.output_dim
explicitly, so the 'if "output_dim" not in cfg' fallback never fired.
Remove it and the now-unused output_dim parameter (plus the callsite's
llm_hidden derivation). If a terse perception config lands here later,
AudioPerceptionModule will fail on its own with a clearer error.
Made-with: Cursor
* vllm plugin: rename _pad_vocab_tensor -> _pad_to_vocab_size
Verb-led name spells out what the helper does ('pad [the tensor] to
[vocab_size]') instead of the ambiguous 'vocab tensor'. Pure rename,
no behavior change.
Made-with: Cursor
* vllm plugin: add missing type hints per NeMo PR checklist
Five signatures were missing hints and tripped the 'every exposed
method needs Python 3 type hints' rule from the NeMo contributor
checklist: _ensure_special_tokens, _init_perception, and the three
Mamba-state classmethods. Uses PreTrainedTokenizerBase for the
tokenizer, VllmConfig for vllm_config args, and Any for both the Mamba
return types and _init_perception's config (NeMoSpeechLMConfig is
same-package and carries import-cycle risk not worth the precision).
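Sketch of the resulting hint shapes (free functions here for brevity; the real ones are methods on the plugin classes):
```python
from typing import Any

from transformers import PreTrainedTokenizerBase

# Module-level stand-ins for the hinted methods (illustrative):
def _ensure_special_tokens(tokenizer: PreTrainedTokenizerBase) -> None: ...

def _init_perception(config: Any) -> Any: ...  # Any avoids the same-package import cycle
```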
Made-with: Cursor
* vllm plugin: document _estimate_audio_tokens + add drift unit test
The hand-rolled _estimate_audio_tokens function mirrors FastConformer's
preprocessing chain (STFT + 3x Conv subsampling) but in pure Python to
avoid ~90x tensor-ops overhead on the scheduler hotpath (measured ~0.18
µs per call vs ~16 µs via calc_length; the frame math is sketched after
this list). Added:
- Full docstring on _estimate_audio_tokens explaining what it mirrors,
why it is hand-rolled, and a pointer to the drift test.
- tests/collections/speechlm2/test_vllm_audio_token_estimator.py that
asserts the estimator equals NeMo calc_length-based reference on 9
canonical audio lengths. Breaks when FastConformer's downsampling
stack changes upstream, forcing a rewrite of the hand-rolled math.
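Illustrative shape of the pure-Python math, with assumed constants (10 ms hop at 16 kHz, three stride-2 convs); the shipped function must reproduce calc_length's exact rounding, which is what the drift test pins:
```python
HOP_SAMPLES = 160       # assumed: 10 ms STFT hop at 16 kHz
SUBSAMPLE_LAYERS = 3    # assumed: three stride-2 convs -> 8x subsampling

def estimate_audio_tokens(num_samples: int) -> int:
    frames = num_samples // HOP_SAMPLES + 1   # approximate STFT frame count
    for _ in range(SUBSAMPLE_LAYERS):
        frames = (frames + 1) // 2            # ceil-div per stride-2 conv
    return frames
```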
Made-with: Cursor
* vllm plugin: hoist audios lookup out of get_replacement closure
_get_prompt_updates's inner get_replacement closure previously re-ran
mm_items.get_items('audio', ...) on every call. The lookup is O(1) and
mm_items is already finalized at this point, so it is safe to hoist;
doing so removes a redundant dict access per <|audio|> match and
shortens the closure body by a line. Pure cleanup, no behavior change.
Made-with: Cursor
* vllm plugin: validate + clean up placeholder/audio pairing
_call_hf_processor silently accepted mismatches between the number of
<|audio|> placeholders in the prompt and the number of audios in
mm_data. The old loop processed the first N entries of whichever list
was shorter and left the surplus to crash with a shape mismatch deep in
get_input_embeddings at forward time. Now:
- Pre-loop length check: raises ValueError with a clear message when
counts differ, so the error surfaces at the processor stage where
the caller can see it.
- Loop iterates ph_positions zipped with audios instead of walking all
split parts and skipping text chunks; no audio_idx counter, no per-
iteration <|audio|> branch, same behavior.
- Short comment documents the positional pairing invariant.
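A minimal sketch of the check and the zipped pairing (helper name and message text are illustrative):
```python
def pair_placeholders(ph_positions: list[int], audios: list) -> list[tuple[int, object]]:
    """Pair each <|audio|> placeholder position with its audio, failing loud."""
    if len(ph_positions) != len(audios):
        raise ValueError(
            f"prompt has {len(ph_positions)} <|audio|> placeholders "
            f"but {len(audios)} audios were supplied"
        )
    # Positional pairing invariant: the i-th placeholder consumes the i-th audio.
    return list(zip(ph_positions, audios))
```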
Made-with: Cursor
* vllm plugin: fail loud when audio_signal_length is missing
_parse_audio_input had two defensive branches that duplicated or
contradicted pipeline behavior:
- if audio_signal_length is None: [shape[-1]] * batch
- elif not isinstance(..., Tensor): torch.tensor(...)
The None branch was latently wrong. By the time execution reaches it,
audio_signal has been zero-padded to max batch length via the
list-stacking block above, so audio_signal.shape[-1] is the padded
length, not the true audio length. Handing that to the perception
encoder as input_signal_length means the encoder treats trailing
zeros as real audio and emits extra output frames, silently breaking
placeholder/feature alignment.
In the real pipeline, _call_hf_processor always emits
audio_signal_length as a 1D torch.Tensor of true per-audio lengths
alongside audio_signal (both declared batched in
_get_mm_fields_config), so neither branch is reachable.
Replaced both with a single type check that raises ValueError when
the invariant is violated.
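Sketch of the single check that replaced both branches (helper name and message are illustrative):
```python
import torch

def require_length_tensor(audio_signal_length: object) -> torch.Tensor:
    """Stand-in for the invariant check described above."""
    if not isinstance(audio_signal_length, torch.Tensor):
        raise ValueError(
            "audio_signal_length must be a 1D tensor of true per-audio "
            f"lengths, got {type(audio_signal_length).__name__}"
        )
    return audio_signal_length
```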
Made-with: Cursor
* vllm plugin: use explicit params in _parse_audio_input
_parse_audio_input had a **kwargs-only signature and popped audio
fields inside the body. It mirrored vLLM's embed_multimodal(**kwargs)
style but leaked the pattern into an internal helper that has a
well-defined contract: exactly two inputs from the TensorSchema.
Switched to explicit params (audio_signal, audio_signal_length), with
**kwargs kept for forward compatibility to absorb unexpected fields.
This lets type checkers catch wrong-type callers and documents the
contract in the signature itself.
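Sketch of the new contract (module-level here for brevity; the real method lives on the model class):
```python
import torch

def parse_audio_input(
    audio_signal: torch.Tensor,
    audio_signal_length: torch.Tensor,
    **kwargs: object,  # absorbed for forward compatibility only
) -> tuple[torch.Tensor, torch.Tensor]:
    # Exactly two fields come from the TensorSchema; the signature now
    # documents that contract and lets type checkers catch wrong callers.
    return audio_signal, audio_signal_length
```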
Made-with: Cursor
* vllm plugin: document no-op device guard in _process_audio
self.perception = self.perception.to(device) reads perception's own
device and moves perception there -- always a no-op in the TP, PP, and
single-GPU paths. Fragmented placement is the only case where it would
trigger, and there it silently moves all params to the device of
whichever param happens to iterate first, which is not controllable
from the caller.
Real device placement is established at init time by _mark_tower_model
and declared structurally via get_mm_mapping. Added a short comment
so future readers don't assume the line is doing real work and plan
multi-GPU changes around it.
Made-with: Cursor
* vllm plugin: lift load_weights into base class with hooks
Both NeMoSpeechLMHybridForConditionalGeneration and
NeMoSpeechLMForConditionalGeneration had near-identical load_weights
methods. The only real difference was one extra step for Standard:
LoRA merge before HF-name rename.
Refactored:
- _NeMoSpeechLMBase.load_weights orchestrates the full pipeline
(split -> perception load -> preprocess -> rename -> vLLM load).
- _preprocess_llm_weights on base returns identity; Standard overrides
to run _merge_lora_weights. Hybrid doesn't override.
- _nemo_to_hf_llm_weights declared on base with NotImplementedError
so a future subclass that forgets to override fails loudly with a
clear message instead of AttributeError deep in load_weights.
Subclasses now only hold the bits that differ (backbone-specific name
mapping, LoRA merge). Future pipeline changes go in one place.
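Sketch of the hook structure (bodies are stubs; note that a later commit below replaces this inheritance with backend composition):
```python
from typing import Iterable, List, Tuple

Weight = Tuple[str, object]

class _NeMoSpeechLMBase:
    def load_weights(self, weights: Iterable[Weight]) -> None:
        llm, perception = self._split(weights)
        self._load_perception(perception)
        llm = self._preprocess_llm_weights(llm)   # hook: identity by default
        llm = self._nemo_to_hf_llm_weights(llm)   # hook: must be overridden
        self._load_into_vllm(llm)

    def _preprocess_llm_weights(self, weights: List[Weight]) -> List[Weight]:
        return weights  # Standard overrides this to merge LoRA first

    def _nemo_to_hf_llm_weights(self, weights: List[Weight]) -> List[Weight]:
        raise NotImplementedError(
            f"{type(self).__name__} must supply its backbone-specific rename"
        )

    # Stubs standing in for the shared split/load steps.
    def _split(self, weights): ...
    def _load_perception(self, weights): ...
    def _load_into_vllm(self, weights): ...
```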
Made-with: Cursor
* vllm plugin: address CodeQL findings
Four nit-level findings flagged by github-advanced-security on the latest
PR push. Behavior unchanged.
- __init__.py: add comment to the empty `except Exception: pass` around
the NemotronH config patch — best-effort patch, silently skipped when
the model class isn't reachable so other backbones still load.
- model.py: drop redundant in-function `import re` in
_normalize_lora_name (already imported at module top, line 31).
- test_vllm_plugin.py: probe vLLM via importlib.util.find_spec instead
of `import vllm` (CodeQL flags it even with `# noqa: F401`); drop
stale `output_dim=256` kwarg from _load_nemo_perception call
(parameter removed in 105a3dd).
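The probe pattern, for reference:
```python
import importlib.util

# Detect vLLM without importing it, which sidesteps the CodeQL
# unused-import finding triggered by a bare `import vllm`.
HAVE_VLLM = importlib.util.find_spec("vllm") is not None
```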
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Made-with: Cursor
* vllm plugin: rely on upstream tokenizer concurrency fix
Remove the global HuggingFace fast-tokenizer monkey patch now that modern vLLM isolates multimodal tokenizer use, and keep the plugin compatible with the moved vLLM multimodal input type.
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
* vllm plugin: require exported SpeechLM config fields
Fail fast when exported SpeechLM checkpoints omit fields that define the backbone, ASR source, prompt format, audio token, or pretrained-weight contract instead of silently falling back to local defaults.
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
* vllm plugin: simplify NemotronH architecture comment
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
* vllm plugin: harden config and registration tests
Make plugin tests hermetic by mocking backbone config loading, skip estimator drift checks cleanly when vLLM is unavailable, and cover request-time invariants plus no tokenizer monkey patch behavior.
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
* vllm plugin: allow HF default config construction
Allow HuggingFace to instantiate NeMoSpeechLMConfig without checkpoint fields for internal serialization while preserving validation for real exported SpeechLM configs.
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Made-with: Cursor
* Apply isort and black reformatting
Signed-off-by: DongjiGao <DongjiGao@users.noreply.github.com>
* vllm plugin: simplify fake tokenizer callbacks
Use the fake tokenizer class directly instead of wrapping it in no-op lambdas to satisfy CodeQL without changing test behavior.
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Made-with: Cursor
* vllm plugin: rename nemotron_v3 to salm, collapse to single model class with backend composition
Address Piotr's review (PR #15520): the nemotron_v3 folder name no longer matches the scope (the plugin handles both standard transformer and hybrid Mamba+MoE backbones), and the _NeMoSpeechLMBase + two-class inheritance pattern duplicates wiring across backbones. Rename the package to salm and replace the inheritance structure with composition: a single NeMoSpeechLMForConditionalGeneration class delegates LLM-specific work (architecture name, weight rename, optional LoRA merge, mamba state passthroughs) to a TransformerBackend or HybridBackend selected by make_backend(config).
Single-class registration is feasible because vLLM's runtime ModelConfig.is_hybrid property uses text_config.layer_types as an escape hatch (the granite-4.0-micro path): we declare IsHybrid on the model class for the NemotronH backbone, and config.py populates layer_types=['attention']*N for transformer backbones so vLLM treats them as attention-only at runtime. There is no runtime isinstance(model, IsHybrid) check anywhere in vLLM that bypasses the property, so this collapses cleanly.
salm/ now splits into:
config.py NeMoSpeechLMConfig + the layer_types shim
multimodal.py audio helpers and vLLM processor/info/dummy-inputs trio
backends.py _BaseBackend, TransformerBackend, HybridBackend, make_backend
model.py single NeMoSpeechLMForConditionalGeneration class
__init__.py register() with one architecture name
examples/speechlm2/to_hf.py emits the unified architecture name. Tests are updated for the new import paths and add coverage for layer_types shim wiring and make_backend dispatch (45 unit tests pass).
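Illustrative sketch of the dispatch and the shim (the selection key and shim condition are assumptions; the real logic lives in backends.py and config.py):
```python
class TransformerBackend: ...
class HybridBackend: ...

def make_backend(config):
    # Selection key is illustrative; the real dispatch lives in backends.py.
    arch = config.text_config.architectures[0]
    if arch == "NemotronHForCausalLM":
        return HybridBackend()
    return TransformerBackend()

def shim_layer_types(text_config):
    # Transformer backbones get attention-only layer_types so vLLM's
    # runtime ModelConfig.is_hybrid check treats them as non-hybrid.
    if getattr(text_config, "layer_types", None) is None:
        text_config.layer_types = ["attention"] * text_config.num_hidden_layers
```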
Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* vllm plugin: avoid loading backbone config during registration
Keep SALM plugin registration side-effect light by removing the NemotronH runtime monkey patch and relying on NeMoSpeechLMConfig to normalize rms_norm_eps on the wrapped backbone config.
Add a registration test that fails if register() starts loading remote backbone configs again.
Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
* vllm plugin: simplify salm config comments
Clarify the transformer layer_types shim and trim stale embedding-row commentary after review.
Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
* vllm plugin: normalize salm audio parser channels
Ask vLLM's multimodal parser to reduce audio inputs to mono alongside the existing 16 kHz resampling, and pin that parser contract in the plugin tests.
Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
* vllm plugin: simplify salm model comments
Remove stale inline commentary from the audio parsing path after review.
Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
* vllm plugin: remove unused audio typing import
Drop the stale MultiModalEmbeddings re-export from audio.py; model.py imports the type directly where it is used.
Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
* vllm plugin: clarify transformer backend docs
Remove incorrect Parakeet-TDT examples from SALM transformer-backend documentation and describe the path as decoder-only LLM backbones.
Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
* vllm plugin: remove audio duration limit from processing info
Stop exposing a 40s max audio length through NeMoSpeechLMProcessingInfo; keep the finite 40s length only for vLLM dummy/profiling inputs.
Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
* vllm plugin: seed text config before base init
Newer Transformers may call get_text_config during PretrainedConfig initialization, before the SALM wrapper has loaded the real backbone config. Seed an inert text_config first and keep the real checkpoint path unchanged.
Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
* vllm plugin: skip backend tests without vllm
Backend selection imports salm.backends, which depends on vLLM symbols. Guard those tests on vLLM availability so CPU SpeechLM2 shards without vLLM skip them like the other plugin runtime tests.
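One common way to express such a guard (whether the tests use importorskip or a skipif marker is an assumption):
```python
import pytest

# Skips the whole module when vLLM is absent, matching the behavior of
# the other plugin runtime tests on CPU-only SpeechLM2 shards.
vllm = pytest.importorskip("vllm")
```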
Made-with: Cursor
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
---------
Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Signed-off-by: DongjiGao <DongjiGao@users.noreply.github.com>
Co-authored-by: DongjiGao <DongjiGao@users.noreply.github.com>
Co-authored-by: Piotr Żelasko <pzelasko@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
11 files changed
Lines changed: 1702 additions & 12 deletions
File tree
- examples/speechlm2
- nemo/collections/speechlm2/vllm
  - salm
- tests/collections/speechlm2