Add vLLM support for NeMo SpeechLM #15520
Conversation
Cherry-picked from DongjiGao/NeMo vllm-nemo-speechlm branch. Adds vLLM plugin that registers NeMo SpeechLM models into vLLM's model registry via vllm.general_plugins entry point. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from b260d27 to ee3a26f.
Register NeMo Speech LM models into vLLM via the general_plugins entry point. Supports hybrid (NemotronH) and standard transformer (Qwen3) backbones.
- NeMoSpeechLMHybridForConditionalGeneration: hybrid Mamba+MoE models
- NeMoSpeechLMForConditionalGeneration: standard transformer models
- NeMoSpeechLMStdForConditionalGeneration: legacy alias for standard
- Audio preprocessing with automatic resampling to 16 kHz mono
- Thread-safe tokenizer patch for vLLM's concurrent encoding
- Includes unit tests
Swap the naming convention so it follows "unqualified base name = default variant, qualified name = specialization":
- NeMoSpeechLMForConditionalGeneration -> standard (Qwen3, Parakeet)
- NeMoSpeechLMHybridForConditionalGeneration -> hybrid Mamba+MoE (NemotronH)

Previously the unqualified base name was the hybrid class, which made to_hf.py's arch auto-detection point non-hybrid checkpoints at the wrong implementation. Keep to_hf.py as the contract and rename the plugin classes to match. The legacy alias NeMoSpeechLMStdForConditionalGeneration now points at the new base-named class so checkpoints exported under the old name still load. Made-with: Cursor
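The resulting contract, sketched as a plain mapping (the dict itself is illustrative, not code from the PR; only the three architecture names are real):

```python
# Hypothetical alias table illustrating the naming contract after the swap.
_ARCHITECTURES = {
    # Unqualified base name = default variant (standard transformer: Qwen3).
    "NeMoSpeechLMForConditionalGeneration": "standard",
    # Qualified name = specialization (hybrid Mamba+MoE: NemotronH).
    "NeMoSpeechLMHybridForConditionalGeneration": "hybrid",
    # Legacy alias kept so checkpoints exported under the old name load.
    "NeMoSpeechLMStdForConditionalGeneration": "standard",
}
```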
No checkpoints in circulation use this name -- to_hf.py is the single source of truth for exported architecture names, and it only emits the two canonical names. Made-with: Cursor
The package covers every SpeechLM backbone (Qwen3, NemotronH, ...); the folder name is a historical artifact from when the plugin started as a NemotronH-only experiment. Made-with: Cursor
- Fail fast in `NeMoSpeechLMConfig.__init__` when the backbone config's
`architectures` list isn't length-1: mixed or missing architectures
currently route silently (mixed -> hybrid-if-any-match; missing ->
treated as standard). A raised ValueError catches malformed ckpts at
plugin load time instead of serving wrong weights.
- Name the magic +10 on `text_config.vocab_size`: new constant
`_SPEECHLM_EMBED_EXTRA_ROWS` with a block comment explaining it must
match training-time vocab additions (audio locator + padding) so the
embedding matrix in model.safetensors loads without shape mismatch.
- Document the `architectures = ["NemotronHForCausalLM"]` normalization
on hybrid backends (different checkpoints list different aliases; only
the canonical name is in vLLM's registry).
- Add a docstring on `__getattr__` explaining the guard list: prevents
infinite recursion when plugin-specific fields are queried before
`__init__` finishes, and prevents accidental delegation to same-named
attributes on the wrapped `text_config`.
- Drop the redundant `_ATTR_ALIASES` entry from the guard tuple: it
starts with `_` so `startswith("_")` already catches it.
Made-with: Cursor
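A minimal sketch of the first two changes, assuming a simplified wrapper around transformers' PretrainedConfig; only the constant name and the length-1 check come from the commit, everything else (including the value 10) is illustrative:

```python
from transformers import PretrainedConfig

# Must match training-time vocab additions (audio locator + padding) so the
# embedding matrix in model.safetensors loads without a shape mismatch.
_SPEECHLM_EMBED_EXTRA_ROWS = 10


class NeMoSpeechLMConfig(PretrainedConfig):
    def __init__(self, text_config: PretrainedConfig = None, **kwargs):
        super().__init__(**kwargs)
        archs = list(getattr(text_config, "architectures", None) or [])
        if len(archs) != 1:
            # Mixed or missing architectures used to route silently
            # (mixed -> hybrid-if-any-match; missing -> standard);
            # fail at plugin load time instead of serving wrong weights.
            raise ValueError(
                f"Expected exactly one backbone architecture, got {archs!r}"
            )
        self.text_config = text_config
        self.vocab_size = text_config.vocab_size + _SPEECHLM_EMBED_EXTRA_ROWS
```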
Every training YAML in speechlm-2026h1/ sets perception.output_dim explicitly, so the 'if "output_dim" not in cfg' fallback never fired. Remove it and the now-unused output_dim parameter (plus the callsite's llm_hidden derivation). If a terse perception config lands here later, AudioPerceptionModule will fail on its own with a clearer error. Made-with: Cursor
Verb-led name spells out what the helper does ('pad [the tensor] to
[vocab_size]') instead of the ambiguous 'vocab tensor'. Pure rename,
no behavior change.
Made-with: Cursor
Five signatures were missing hints and tripped the 'every exposed method needs Python 3 type hints' rule from the NeMo contributor checklist: _ensure_special_tokens, _init_perception, and the three Mamba-state classmethods. Uses PreTrainedTokenizerBase for the tokenizer, VllmConfig for vllm_config args, and Any for Mamba return types + _init_perception's config (NeMoSpeechLMConfig is same-package and brings import-cycle risk not worth the precision). Made-with: Cursor
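Sketched as standalone signatures (the exact parameter groupings are an assumption; bodies are elided, and the real methods live on the plugin classes):

```python
from typing import Any

from transformers import PreTrainedTokenizerBase
from vllm.config import VllmConfig


def _ensure_special_tokens(tokenizer: PreTrainedTokenizerBase) -> None:
    """Body elided; illustrates the tokenizer hint only."""


def _init_perception(config: Any, vllm_config: VllmConfig) -> None:
    """Any for the config: NeMoSpeechLMConfig is same-package, and hinting
    it precisely would risk an import cycle."""
```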
The hand-rolled _estimate_audio_tokens function mirrors FastConformer's preprocessing chain (STFT + 3x Conv subsampling) but in pure Python to avoid ~90x tensor-ops overhead on the scheduler hotpath (measured 0.18 us vs 16 us per call via calc_length). Added:
- Full docstring on _estimate_audio_tokens explaining what it mirrors, why it is hand-rolled, and a pointer to the drift test.
- tests/collections/speechlm2/test_vllm_audio_token_estimator.py, which asserts the estimator equals the NeMo calc_length-based reference on 9 canonical audio lengths. It breaks when FastConformer's downsampling stack changes upstream, forcing a rewrite of the hand-rolled math.

Made-with: Cursor
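A minimal sketch of the idea, assuming typical 16 kHz FastConformer constants (10 ms hop, three stride-2 convolutions with kernel 3 and padding 1); the real function's constants and rounding come from NeMo's calc_length and are pinned by the drift test:

```python
def estimate_audio_tokens(num_samples: int,
                          hop_length: int = 160,
                          num_subsample_layers: int = 3) -> int:
    """Pure-Python stand-in for the STFT + conv-subsampling length math."""
    # STFT framing with center padding: one feature frame per hop.
    frames = num_samples // hop_length + 1
    # Each conv layer applies floor((L + 2*pad - kernel) / stride) + 1,
    # which for kernel=3, pad=1, stride=2 halves the count (rounded up).
    for _ in range(num_subsample_layers):
        frames = (frames + 2 * 1 - 3) // 2 + 1
    return frames
```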
_get_prompt_updates's inner get_replacement closure previously re-ran
mm_items.get_items('audio', ...) on every call. The lookup is O(1) and
mm_items is already finalized at this point, so pulling it out once
saves a redundant dict access per <|audio|> match and makes the closure
body one line shorter. Pure cleanup, no behavior change.
Made-with: Cursor
_call_hf_processor silently accepted mismatches between the number of <|audio|> placeholders in the prompt and the number of audios in mm_data. The old loop processed the first N of whichever was shorter and left the surplus for a shape-mismatch crash deep in get_input_embeddings at forward time. Now:
- Pre-loop length check: raises ValueError with a clear message when counts differ, so the error surfaces at the processor stage where the caller can see it.
- Loop iterates ph_positions zipped with audios instead of walking all split parts and skipping text chunks; no audio_idx counter, no per-iteration <|audio|> branch, same behavior.
- A short comment documents the positional pairing invariant.

Made-with: Cursor
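A hedged sketch of the pre-loop check and zipped loop; ph_positions and audios are the commit's names, the helper wrapping them is hypothetical:

```python
def pair_placeholders_with_audios(ph_positions: list, audios: list) -> list:
    # Fail at the processor stage, not deep in get_input_embeddings.
    if len(ph_positions) != len(audios):
        raise ValueError(
            f"Prompt has {len(ph_positions)} <|audio|> placeholders but "
            f"{len(audios)} audio inputs were provided."
        )
    # Positional pairing invariant: the i-th placeholder consumes the
    # i-th audio, in order.
    return list(zip(ph_positions, audios))
```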
_parse_audio_input had two defensive branches that duplicated or contradicted pipeline behavior:
- if audio_signal_length is None: [shape[-1]] * batch
- elif not isinstance(..., Tensor): torch.tensor(...)

The None branch was latently wrong. By the time execution reaches it, audio_signal has been zero-padded to the max batch length via the list-stacking block above, so audio_signal.shape[-1] is the padded length, not the true audio length. Handing that to the perception encoder as input_signal_length means the encoder treats trailing zeros as real audio and emits extra output frames, silently breaking placeholder/feature alignment. In the real pipeline, _call_hf_processor always emits audio_signal_length as a 1D torch.Tensor of true per-audio lengths alongside audio_signal (both declared batched in _get_mm_fields_config), so neither branch is reachable. Replaced both with a single type check that raises ValueError when the invariant is violated.

Made-with: Cursor
_parse_audio_input had a **kwargs-only signature and popped audio fields inside the body. It mirrored vLLM's embed_multimodal(**kwargs) style but leaked that pattern into an internal helper with a well-defined contract: exactly two inputs from the TensorSchema. Switched to explicit params (audio_signal, audio_signal_length), keeping **kwargs for forward compatibility to absorb unexpected fields. This lets type checkers catch wrong-type callers and documents the contract in the signature itself. Made-with: Cursor
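A sketch combining the two _parse_audio_input changes above (explicit parameters plus a single type check); simplified relative to the real TensorSchema plumbing:

```python
import torch


def _parse_audio_input(
    audio_signal: torch.Tensor,
    audio_signal_length: torch.Tensor,
    **kwargs,  # forward-compat: absorb unexpected fields
) -> tuple[torch.Tensor, torch.Tensor]:
    # _call_hf_processor always emits a 1D tensor of true per-audio
    # lengths; anything else means the pipeline invariant was violated.
    if not isinstance(audio_signal_length, torch.Tensor):
        raise ValueError(
            "audio_signal_length must be a torch.Tensor of true per-audio "
            f"lengths, got {type(audio_signal_length).__name__}"
        )
    return audio_signal, audio_signal_length
```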
self.perception = self.perception.to(device) reads perception's own device and moves perception there -- always a no-op in the TP, PP, and single-GPU paths. Fragmented placement is the only case where it would trigger, and there it silently moves all params to the first-by-iteration param's device, which is not controllable from the caller. Real device placement is established at init time by _mark_tower_model and declared structurally via get_mm_mapping. Added a short comment so future readers don't assume the line is doing real work and plan multi-GPU changes around it. Made-with: Cursor
Both NeMoSpeechLMHybridForConditionalGeneration and NeMoSpeechLMForConditionalGeneration had near-identical load_weights methods. The only real difference was one extra step for Standard: a LoRA merge before the HF-name rename. Refactored:
- _NeMoSpeechLMBase.load_weights orchestrates the full pipeline (split -> perception load -> preprocess -> rename -> vLLM load).
- _preprocess_llm_weights on the base returns identity; Standard overrides it to run _merge_lora_weights. Hybrid doesn't override.
- _nemo_to_hf_llm_weights is declared on the base with NotImplementedError so a future subclass that forgets to override fails loudly with a clear message instead of an AttributeError deep in load_weights.

Subclasses now only hold the bits that differ (backbone-specific name mapping, LoRA merge). Future pipeline changes go in one place.

Made-with: Cursor
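A template-method sketch of the refactor; method names follow the commit, bodies are simplified stand-ins rather than the real pipeline steps:

```python
class _NeMoSpeechLMBase:
    """Sketch of the shared load_weights pipeline; real steps elided."""

    def load_weights(self, weights: dict) -> None:
        # split -> perception load -> preprocess -> rename -> vLLM load;
        # only the two variation points are shown here.
        weights = self._preprocess_llm_weights(weights)
        weights = self._nemo_to_hf_llm_weights(weights)

    def _preprocess_llm_weights(self, weights: dict) -> dict:
        return weights  # identity; the hybrid backend doesn't override

    def _nemo_to_hf_llm_weights(self, weights: dict) -> dict:
        # Fails loudly if a future subclass forgets its name mapping,
        # instead of an AttributeError deep inside load_weights.
        raise NotImplementedError(
            f"{type(self).__name__} must implement NeMo->HF weight-name mapping"
        )


class _StandardBackendSketch(_NeMoSpeechLMBase):
    def _merge_lora_weights(self, weights: dict) -> dict:
        return weights  # stand-in for the real LoRA merge

    def _preprocess_llm_weights(self, weights: dict) -> dict:
        # Standard-only extra step: LoRA merge before the HF-name rename.
        return self._merge_lora_weights(weights)

    def _nemo_to_hf_llm_weights(self, weights: dict) -> dict:
        return weights  # stand-in for the Qwen3 name mapping
```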
Force-pushed from 5f1a253 to 3119030.
Force-pushed from 43e8987 to 5be34e6.
Drop the stale MultiModalEmbeddings re-export from audio.py; model.py imports the type directly where it is used. Made-with: Cursor Signed-off-by: Dongji Gao <dongjig@nvidia.com>
```toml
llm = "nemo.collections.llm"

[project.entry-points."vllm.general_plugins"]
nemo_speechlm = "nemo.collections.speechlm2.vllm.nemotron_v3:register"
```
This registers nemo_speechlm as a vLLM plugin within the Python package. When vLLM starts, it will search for and find this plugin and its corresponding register function.
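A minimal sketch of what such a register() entry point does, assuming vLLM's documented out-of-tree model registration API; the architecture and class paths follow the PR's final naming and may be simplified:

```python
def register() -> None:
    from vllm import ModelRegistry

    # The lazy string form ("module:Class") avoids importing the model
    # (and torch/CUDA) at plugin-discovery time.
    ModelRegistry.register_model(
        "NeMoSpeechLMForConditionalGeneration",
        "nemo.collections.speechlm2.vllm.salm.model:NeMoSpeechLMForConditionalGeneration",
    )
```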
```python
class TransformerBackend(_BaseBackend):
    """Standard transformer backbones (e.g. Qwen3, Parakeet-TDT).
```
Parakeet-TDT is not a transformer. I saw this in another comment elsewhere. Let's remove this.
```python
_SAMPLING_RATE = 16000
_AUDIO_CHANNELS = 1
_MAX_AUDIO_DURATION_S = 40.0
```
Remove max audio duration limitation - let's make it unlimited by default.
Remove incorrect Parakeet-TDT examples from SALM transformer-backend documentation and describe the path as decoder-only LLM backbones. Made-with: Cursor Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Stop exposing a 40s max audio length through NeMoSpeechLMProcessingInfo; keep the finite 40s length only for vLLM dummy/profiling inputs. Made-with: Cursor Signed-off-by: Dongji Gao <dongjig@nvidia.com>
/ok to test 5618128

@chtruong814 for pyproject.toml
Newer Transformers may call get_text_config during PretrainedConfig initialization, before the SALM wrapper has loaded the real backbone config. Seed an inert text_config first and keep the real checkpoint path unchanged. Made-with: Cursor Signed-off-by: Dongji Gao <dongjig@nvidia.com>
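A sketch of the seeding described above, revisiting the simplified wrapper from the earlier config sketch; only the "inert placeholder first" idea comes from the commit:

```python
from transformers import PretrainedConfig


class NeMoSpeechLMConfig(PretrainedConfig):
    def __init__(self, **kwargs):
        # Newer Transformers may call get_text_config() from inside
        # PretrainedConfig.__init__, before the real backbone config has
        # been loaded from the checkpoint. Seed an inert placeholder so
        # that call succeeds; the real text_config replaces it later.
        self.text_config = PretrainedConfig()
        super().__init__(**kwargs)
```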
Head branch was pushed to by a user without write access
/ok to test afcfa30
Backend selection imports salm.backends, which depends on vLLM symbols. Guard those tests on vLLM availability so CPU SpeechLM2 shards without vLLM skip them like the other plugin runtime tests. Made-with: Cursor Signed-off-by: Dongji Gao <dongjig@nvidia.com>
```diff
-@pytest.mark.skipif(not _HAS_CONFIG, reason="NeMoSpeechLMConfig not available")
+@pytest.mark.skipif(not (_HAS_CONFIG and _HAS_VLLM), reason="NeMoSpeechLMConfig or vLLM not available")
```
Note to self: need to add vLLM to the container to be able to run some tests
/ok to test cbf4305

[🤖]: Hi @DongjiGao 👋, We wanted to let you know that a CICD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals.
What does this PR do?
Adds a vLLM plugin for NeMo SpeechLM/SALM checkpoints so speech encoder + projection + LLM backbones can be served through vLLM with multimodal audio inputs, PagedAttention, and horizontal NeMo-Skills evaluation.
The plugin now registers one architecture, NeMoSpeechLMForConditionalGeneration, and selects the transformer or NemotronH hybrid backend at model initialization time.

Collection: speechlm2
Change log
- New `nemo/collections/speechlm2/vllm/salm` plugin package:
  - `config.py`: `NeMoSpeechLMConfig`, exported-field validation, backbone config wrapping, transformer all-attention `layer_types` shim for runtime non-hybrid KV cache.
  - `audio.py`: vLLM multimodal audio parser/processor/dummy inputs, 16 kHz mono audio normalization, placeholder expansion, audio token estimation.
  - `backends.py`: transformer vs. NemotronH backend composition, LoRA merge, NeMo-to-HF weight-name mapping, NemotronH mamba state delegation.
  - `model.py`: single vLLM model class combining the NeMo audio tower with a vLLM-native language model.
  - `__init__.py`: plugin registration only; no registration-time remote backbone config load.
- Registers the plugin under `vllm.general_plugins` in `pyproject.toml`.
- Updates `examples/speechlm2/to_hf.py` to export the unified vLLM architecture name.

Usage
Validation
Unit / style
- `python -m pytest tests/collections/speechlm2/test_vllm_plugin.py tests/collections/speechlm2/test_to_hf.py tests/collections/speechlm2/test_vllm_audio_token_estimator.py -q`: 55 passed.
- `python setup.py style --scope <changed file>` on edited SALM plugin/test files.

Real vLLM inference
Open ASR Leaderboard (`asr-leaderboard`), 8 chunks, NeMo-Skills + vLLM server, `tokens_to_generate=256`.

[Results table: transformer and NemotronH hybrid runs, both with `--enforce-eager`.]

The transformer runs validate that the single model class can remain hybrid-capable while using vLLM's runtime non-hybrid KV-cache path. The NemotronH runs validate hybrid backend dispatch and mamba state allocation.
Before your PR is "Ready for review"
Pre checks:
PR Type:
Additional Information
- `VLLM_PLUGINS=nemo_speechlm.salm`; the older `nemotron_v3` plugin path/classes were removed.
- `to_hf.py` conversion from older training checkpoints may fail if the checkpoint predates recent NeMo distributed-checkpoint state-dict changes; the inference validation above used an already converted HF checkpoint for the hybrid path.