Add vLLM support for NeMo SpeechLM #15520
Conversation
Cherry-picked from DongjiGao/NeMo vllm-nemo-speechlm branch. Adds vLLM plugin that registers NeMo SpeechLM models into vLLM's model registry via vllm.general_plugins entry point. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from b260d27 to ee3a26f.
Register NeMo Speech LM models into vLLM via the general_plugins entry point. Supports hybrid (NemotronH) and standard transformer (Qwen3) backbones.
- NeMoSpeechLMHybridForConditionalGeneration: hybrid Mamba+MoE models
- NeMoSpeechLMForConditionalGeneration: standard transformer models
- NeMoSpeechLMStdForConditionalGeneration: legacy alias for standard
- Audio preprocessing with automatic resampling to 16 kHz mono
- Thread-safe tokenizer patch for vLLM's concurrent encoding
- Includes unit tests
Swap the naming convention so it follows "unqualified base name = default variant, qualified name = specialization":
- NeMoSpeechLMForConditionalGeneration -> standard (Qwen3, Parakeet)
- NeMoSpeechLMHybridForConditionalGeneration -> hybrid Mamba+MoE (NemotronH)

Previously the unqualified base name was the hybrid class, which made to_hf.py's arch auto-detection point non-hybrid checkpoints at the wrong implementation. Keep to_hf.py as the contract and rename the plugin classes to match. The legacy alias NeMoSpeechLMStdForConditionalGeneration now points at the new base-named class so checkpoints exported under the old name still load. Made-with: Cursor
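The resulting contract, sketched as a plain mapping (the dict itself is illustrative, not code from the PR; only the three architecture names are real):

```python
# Hypothetical alias table illustrating the naming contract after the swap.
_ARCHITECTURES = {
    # Unqualified base name = default variant (standard transformer: Qwen3).
    "NeMoSpeechLMForConditionalGeneration": "standard",
    # Qualified name = specialization (hybrid Mamba+MoE: NemotronH).
    "NeMoSpeechLMHybridForConditionalGeneration": "hybrid",
    # Legacy alias kept so checkpoints exported under the old name load.
    "NeMoSpeechLMStdForConditionalGeneration": "standard",
}
```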
No checkpoints in circulation use this name -- to_hf.py is the single source of truth for exported architecture names, and it only emits the two canonical names. Made-with: Cursor
The package covers every SpeechLM backbone (Qwen3, NemotronH, ...); the folder name is a historical artifact from when the plugin started as a NemotronH-only experiment. Made-with: Cursor
- Fail fast in `NeMoSpeechLMConfig.__init__` when the backbone config's
`architectures` list isn't length-1: mixed or missing architectures
currently route silently (mixed -> hybrid-if-any-match; missing ->
treated as standard). A raised ValueError catches malformed ckpts at
plugin load time instead of serving wrong weights.
- Name the magic +10 on `text_config.vocab_size`: new constant
`_SPEECHLM_EMBED_EXTRA_ROWS` with a block comment explaining it must
match training-time vocab additions (audio locator + padding) so the
embedding matrix in model.safetensors loads without shape mismatch.
- Document the `architectures = ["NemotronHForCausalLM"]` normalization
on hybrid backends (different checkpoints list different aliases; only
the canonical name is in vLLM's registry).
- Add a docstring on `__getattr__` explaining the guard list: prevents
infinite recursion when plugin-specific fields are queried before
`__init__` finishes, and prevents accidental delegation to same-named
attributes on the wrapped `text_config`.
- Drop the redundant `_ATTR_ALIASES` entry from the guard tuple: it
starts with `_` so `startswith("_")` already catches it.
Made-with: Cursor
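A minimal sketch of the first two changes, assuming a simplified wrapper around transformers' PretrainedConfig; only the constant name and the length-1 check come from the commit, everything else (including the value 10) is illustrative:

```python
from transformers import PretrainedConfig

# Must match training-time vocab additions (audio locator + padding) so the
# embedding matrix in model.safetensors loads without a shape mismatch.
_SPEECHLM_EMBED_EXTRA_ROWS = 10


class NeMoSpeechLMConfig(PretrainedConfig):
    def __init__(self, text_config: PretrainedConfig = None, **kwargs):
        super().__init__(**kwargs)
        archs = list(getattr(text_config, "architectures", None) or [])
        if len(archs) != 1:
            # Mixed or missing architectures used to route silently
            # (mixed -> hybrid-if-any-match; missing -> standard);
            # fail at plugin load time instead of serving wrong weights.
            raise ValueError(
                f"Expected exactly one backbone architecture, got {archs!r}"
            )
        self.text_config = text_config
        self.vocab_size = text_config.vocab_size + _SPEECHLM_EMBED_EXTRA_ROWS
```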
Every training YAML in speechlm-2026h1/ sets perception.output_dim explicitly, so the 'if "output_dim" not in cfg' fallback never fired. Remove it and the now-unused output_dim parameter (plus the callsite's llm_hidden derivation). If a terse perception config lands here later, AudioPerceptionModule will fail on its own with a clearer error. Made-with: Cursor
Verb-led name spells out what the helper does ('pad [the tensor] to
[vocab_size]') instead of the ambiguous 'vocab tensor'. Pure rename,
no behavior change.
Made-with: Cursor
Five signatures were missing hints and tripped the 'every exposed method needs Python 3 type hints' rule from the NeMo contributor checklist: _ensure_special_tokens, _init_perception, and the three Mamba-state classmethods. Uses PreTrainedTokenizerBase for the tokenizer, VllmConfig for vllm_config args, and Any for Mamba return types + _init_perception's config (NeMoSpeechLMConfig is same-package and brings import-cycle risk not worth the precision). Made-with: Cursor
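Sketched as standalone signatures (the exact parameter groupings are an assumption; bodies are elided, and the real methods live on the plugin classes):

```python
from typing import Any

from transformers import PreTrainedTokenizerBase
from vllm.config import VllmConfig


def _ensure_special_tokens(tokenizer: PreTrainedTokenizerBase) -> None:
    """Body elided; illustrates the tokenizer hint only."""


def _init_perception(config: Any, vllm_config: VllmConfig) -> None:
    """Any for the config: NeMoSpeechLMConfig is same-package, and hinting
    it precisely would risk an import cycle."""
```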
The hand-rolled _estimate_audio_tokens function mirrors FastConformer's preprocessing chain (STFT + 3x Conv subsampling) but in pure Python to avoid ~90x tensor-ops overhead on the scheduler hotpath (measured 0.18 us vs 16 us per call via calc_length). Added:
- Full docstring on _estimate_audio_tokens explaining what it mirrors, why it is hand-rolled, and a pointer to the drift test.
- tests/collections/speechlm2/test_vllm_audio_token_estimator.py, which asserts the estimator equals the NeMo calc_length-based reference on 9 canonical audio lengths. It breaks when FastConformer's downsampling stack changes upstream, forcing a rewrite of the hand-rolled math.

Made-with: Cursor
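A minimal sketch of the idea, assuming typical 16 kHz FastConformer constants (10 ms hop, three stride-2 convolutions with kernel 3 and padding 1); the real function's constants and rounding come from NeMo's calc_length and are pinned by the drift test:

```python
def estimate_audio_tokens(num_samples: int,
                          hop_length: int = 160,
                          num_subsample_layers: int = 3) -> int:
    """Pure-Python stand-in for the STFT + conv-subsampling length math."""
    # STFT framing with center padding: one feature frame per hop.
    frames = num_samples // hop_length + 1
    # Each conv layer applies floor((L + 2*pad - kernel) / stride) + 1,
    # which for kernel=3, pad=1, stride=2 halves the count (rounded up).
    for _ in range(num_subsample_layers):
        frames = (frames + 2 * 1 - 3) // 2 + 1
    return frames
```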
_get_prompt_updates's inner get_replacement closure previously re-ran
mm_items.get_items('audio', ...) on every call. The lookup is O(1) and
mm_items is already finalized at this point, so pulling it out once
saves a redundant dict access per <|audio|> match and makes the closure
body one line shorter. Pure cleanup, no behavior change.
Made-with: Cursor
_call_hf_processor silently accepted mismatches between the number of <|audio|> placeholders in the prompt and the number of audios in mm_data. The old loop processed the first N of whichever was shorter and left the surplus for a shape-mismatch crash deep in get_input_embeddings at forward time. Now:
- Pre-loop length check: raises ValueError with a clear message when counts differ, so the error surfaces at the processor stage where the caller can see it.
- Loop iterates ph_positions zipped with audios instead of walking all split parts and skipping text chunks; no audio_idx counter, no per-iteration <|audio|> branch, same behavior.
- A short comment documents the positional pairing invariant.

Made-with: Cursor
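A hedged sketch of the pre-loop check and zipped loop; ph_positions and audios are the commit's names, the helper wrapping them is hypothetical:

```python
def pair_placeholders_with_audios(ph_positions: list, audios: list) -> list:
    # Fail at the processor stage, not deep in get_input_embeddings.
    if len(ph_positions) != len(audios):
        raise ValueError(
            f"Prompt has {len(ph_positions)} <|audio|> placeholders but "
            f"{len(audios)} audio inputs were provided."
        )
    # Positional pairing invariant: the i-th placeholder consumes the
    # i-th audio, in order.
    return list(zip(ph_positions, audios))
```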
_parse_audio_input had two defensive branches that duplicated or contradicted pipeline behavior:
- if audio_signal_length is None: [shape[-1]] * batch
- elif not isinstance(..., Tensor): torch.tensor(...)

The None branch was latently wrong. By the time execution reaches it, audio_signal has been zero-padded to the max batch length via the list-stacking block above, so audio_signal.shape[-1] is the padded length, not the true audio length. Handing that to the perception encoder as input_signal_length means the encoder treats trailing zeros as real audio and emits extra output frames, silently breaking placeholder/feature alignment. In the real pipeline, _call_hf_processor always emits audio_signal_length as a 1D torch.Tensor of true per-audio lengths alongside audio_signal (both declared batched in _get_mm_fields_config), so neither branch is reachable. Replaced both with a single type check that raises ValueError when the invariant is violated.

Made-with: Cursor
_parse_audio_input had a **kwargs-only signature and popped audio fields inside the body. It mirrored vLLM's embed_multimodal(**kwargs) style but leaked that pattern into an internal helper with a well-defined contract: exactly two inputs from the TensorSchema. Switched to explicit params (audio_signal, audio_signal_length), keeping **kwargs for forward compatibility to absorb unexpected fields. This lets type checkers catch wrong-type callers and documents the contract in the signature itself. Made-with: Cursor
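A sketch combining the two _parse_audio_input changes above (explicit parameters plus a single type check); simplified relative to the real TensorSchema plumbing:

```python
import torch


def _parse_audio_input(
    audio_signal: torch.Tensor,
    audio_signal_length: torch.Tensor,
    **kwargs,  # forward-compat: absorb unexpected fields
) -> tuple[torch.Tensor, torch.Tensor]:
    # _call_hf_processor always emits a 1D tensor of true per-audio
    # lengths; anything else means the pipeline invariant was violated.
    if not isinstance(audio_signal_length, torch.Tensor):
        raise ValueError(
            "audio_signal_length must be a torch.Tensor of true per-audio "
            f"lengths, got {type(audio_signal_length).__name__}"
        )
    return audio_signal, audio_signal_length
```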
self.perception = self.perception.to(device) reads perception's own device and moves perception there -- always a no-op in the TP, PP, and single-GPU paths. Fragmented placement is the only case where it would trigger, and there it silently moves all params to the first-by-iteration param's device, which is not controllable from the caller. Real device placement is established at init time by _mark_tower_model and declared structurally via get_mm_mapping. Added a short comment so future readers don't assume the line is doing real work and plan multi-GPU changes around it. Made-with: Cursor
Both NeMoSpeechLMHybridForConditionalGeneration and NeMoSpeechLMForConditionalGeneration had near-identical load_weights methods. The only real difference was one extra step for Standard: a LoRA merge before the HF-name rename. Refactored:
- _NeMoSpeechLMBase.load_weights orchestrates the full pipeline (split -> perception load -> preprocess -> rename -> vLLM load).
- _preprocess_llm_weights on the base returns identity; Standard overrides it to run _merge_lora_weights. Hybrid doesn't override.
- _nemo_to_hf_llm_weights is declared on the base with NotImplementedError so a future subclass that forgets to override fails loudly with a clear message instead of an AttributeError deep in load_weights.

Subclasses now only hold the bits that differ (backbone-specific name mapping, LoRA merge). Future pipeline changes go in one place.

Made-with: Cursor
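A template-method sketch of the refactor; method names follow the commit, bodies are simplified stand-ins rather than the real pipeline steps:

```python
class _NeMoSpeechLMBase:
    """Sketch of the shared load_weights pipeline; real steps elided."""

    def load_weights(self, weights: dict) -> None:
        # split -> perception load -> preprocess -> rename -> vLLM load;
        # only the two variation points are shown here.
        weights = self._preprocess_llm_weights(weights)
        weights = self._nemo_to_hf_llm_weights(weights)

    def _preprocess_llm_weights(self, weights: dict) -> dict:
        return weights  # identity; the hybrid backend doesn't override

    def _nemo_to_hf_llm_weights(self, weights: dict) -> dict:
        # Fails loudly if a future subclass forgets its name mapping,
        # instead of an AttributeError deep inside load_weights.
        raise NotImplementedError(
            f"{type(self).__name__} must implement NeMo->HF weight-name mapping"
        )


class _StandardBackendSketch(_NeMoSpeechLMBase):
    def _merge_lora_weights(self, weights: dict) -> dict:
        return weights  # stand-in for the real LoRA merge

    def _preprocess_llm_weights(self, weights: dict) -> dict:
        # Standard-only extra step: LoRA merge before the HF-name rename.
        return self._merge_lora_weights(weights)

    def _nemo_to_hf_llm_weights(self, weights: dict) -> dict:
        return weights  # stand-in for the Qwen3 name mapping
```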
Force-pushed from 5f1a253 to 3119030.
Force-pushed from 43e8987 to 5be34e6.
Drop the stale MultiModalEmbeddings re-export from audio.py; model.py imports the type directly where it is used. Made-with: Cursor Signed-off-by: Dongji Gao <dongjig@nvidia.com>
```toml
llm = "nemo.collections.llm"

[project.entry-points."vllm.general_plugins"]
nemo_speechlm = "nemo.collections.speechlm2.vllm.nemotron_v3:register"
```
This registers nemo_speechlm as a vLLM plugin within the Python package. When vLLM starts, it will search for and find this plugin and its corresponding register function.
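A minimal sketch of what such a register() entry point does, assuming vLLM's documented out-of-tree model registration API; the architecture and class paths follow the PR's final naming and may be simplified:

```python
def register() -> None:
    from vllm import ModelRegistry

    # The lazy string form ("module:Class") avoids importing the model
    # (and torch/CUDA) at plugin-discovery time.
    ModelRegistry.register_model(
        "NeMoSpeechLMForConditionalGeneration",
        "nemo.collections.speechlm2.vllm.salm.model:NeMoSpeechLMForConditionalGeneration",
    )
```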
```python
class TransformerBackend(_BaseBackend):
    """Standard transformer backbones (e.g. Qwen3, Parakeet-TDT).
```
Parakeet-TDT is not a transformer. I saw this in another comment elsewhere. Let's remove this.
```python
_SAMPLING_RATE = 16000
_AUDIO_CHANNELS = 1
_MAX_AUDIO_DURATION_S = 40.0
```
Remove max audio duration limitation - let's make it unlimited by default.
Remove incorrect Parakeet-TDT examples from SALM transformer-backend documentation and describe the path as decoder-only LLM backbones. Made-with: Cursor Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Stop exposing a 40s max audio length through NeMoSpeechLMProcessingInfo; keep the finite 40s length only for vLLM dummy/profiling inputs. Made-with: Cursor Signed-off-by: Dongji Gao <dongjig@nvidia.com>
/ok to test 5618128

@chtruong814 for pyproject.toml
Newer Transformers may call get_text_config during PretrainedConfig initialization, before the SALM wrapper has loaded the real backbone config. Seed an inert text_config first and keep the real checkpoint path unchanged. Made-with: Cursor Signed-off-by: Dongji Gao <dongjig@nvidia.com>
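A sketch of the seeding described above, revisiting the simplified wrapper from the earlier config sketch; only the "inert placeholder first" idea comes from the commit:

```python
from transformers import PretrainedConfig


class NeMoSpeechLMConfig(PretrainedConfig):
    def __init__(self, **kwargs):
        # Newer Transformers may call get_text_config() from inside
        # PretrainedConfig.__init__, before the real backbone config has
        # been loaded from the checkpoint. Seed an inert placeholder so
        # that call succeeds; the real text_config replaces it later.
        self.text_config = PretrainedConfig()
        super().__init__(**kwargs)
```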
Head branch was pushed to by a user without write access
/ok to test afcfa30
Backend selection imports salm.backends, which depends on vLLM symbols. Guard those tests on vLLM availability so CPU SpeechLM2 shards without vLLM skip them like the other plugin runtime tests. Made-with: Cursor Signed-off-by: Dongji Gao <dongjig@nvidia.com>
```diff
-@pytest.mark.skipif(not _HAS_CONFIG, reason="NeMoSpeechLMConfig not available")
+@pytest.mark.skipif(not (_HAS_CONFIG and _HAS_VLLM), reason="NeMoSpeechLMConfig or vLLM not available")
```
Note to self: need to add vLLM to the container to be able to run some tests
/ok to test cbf4305

[🤖]: Hi @DongjiGao 👋, We wanted to let you know that a CICD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals.
What does this PR do?
Adds a vLLM plugin for NeMo SpeechLM/SALM checkpoints so speech encoder + projection + LLM backbones can be served through vLLM with multimodal audio inputs, PagedAttention, and horizontal NeMo-Skills evaluation.
The plugin now registers one architecture, NeMoSpeechLMForConditionalGeneration, and selects the transformer or NemotronH hybrid backend at model initialization time.

Collection: speechlm2
Change log
- New `nemo/collections/speechlm2/vllm/salm` plugin package:
  - `config.py`: `NeMoSpeechLMConfig`, exported-field validation, backbone config wrapping, transformer all-attention `layer_types` shim for runtime non-hybrid KV cache.
  - `audio.py`: vLLM multimodal audio parser/processor/dummy inputs, 16 kHz mono audio normalization, placeholder expansion, audio token estimation.
  - `backends.py`: transformer vs. NemotronH backend composition, LoRA merge, NeMo-to-HF weight-name mapping, NemotronH mamba state delegation.
  - `model.py`: single vLLM model class combining the NeMo audio tower with a vLLM-native language model.
  - `__init__.py`: plugin registration only; no registration-time remote backbone config load.
- Registers the plugin under `vllm.general_plugins` in `pyproject.toml`.
- Updates `examples/speechlm2/to_hf.py` to export the unified vLLM architecture name.

Usage
Validation
Unit / style
- `python -m pytest tests/collections/speechlm2/test_vllm_plugin.py tests/collections/speechlm2/test_to_hf.py tests/collections/speechlm2/test_vllm_audio_token_estimator.py -q`: 55 passed.
- `python setup.py style --scope <changed file>` on edited SALM plugin/test files.

Real vLLM inference
Open ASR Leaderboard (`asr-leaderboard`), 8 chunks, NeMo-Skills + vLLM server, `tokens_to_generate=256`.

[Results table: transformer and NemotronH hybrid runs, both with `--enforce-eager`.]

The transformer runs validate that the single model class can remain hybrid-capable while using vLLM's runtime non-hybrid KV-cache path. The NemotronH runs validate hybrid backend dispatch and mamba state allocation.
Before your PR is "Ready for review"
Pre checks:
PR Type:
Additional Information
- `VLLM_PLUGINS=nemo_speechlm.salm`; the older `nemotron_v3` plugin path/classes were removed.
- `to_hf.py` conversion from older training checkpoints may fail if the checkpoint predates recent NeMo distributed-checkpoint state-dict changes; the inference validation above used an already converted HF checkpoint for the hybrid path.