Add M2M100/NLLB support (nllb-200-distilled-600M, 1.3B, 3.3B) #19

Open
dschulmeist wants to merge 1 commit into vllm-project:master from dschulmeist:add-nllb-m2m100-support

@dschulmeist
Contributor

@dschulmeist dschulmeist commented Apr 15, 2026

Summary

Adds M2M100ForConditionalGeneration support for three of Meta's NLLB-200 translation models (two distilled checkpoints and the full 3.3B model):

  • facebook/nllb-200-distilled-600M
  • facebook/nllb-200-distilled-1.3B
  • facebook/nllb-200-3.3B

All three share model_type=m2m_100 and are registered under M2M100ForConditionalGeneration.

Depends on #20

M2M100MultiModalProcessor inherits create_encoder_prompt from BartMultiModalProcessor, so this feature requires the vLLM 0.18 compatibility fix in #20 to function under vLLM >=0.18. The generic compatibility changes were split into #20 per maintainer request.
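
To check whether an environment falls in the range where #20 is required, a small version gate can help. The sketch below is not part of this PR: `parse_major_minor`, `needs_compat_fix`, and `installed_vllm_version` are hypothetical helper names, and it assumes vLLM is installed under the distribution name `vllm`.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '0.18.0' or '0.18.1rc1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def needs_compat_fix(version: str) -> bool:
    """True when the given vLLM version is 0.18 or newer, the range where #20 is required."""
    return parse_major_minor(version) >= (0, 18)


def installed_vllm_version() -> Optional[str]:
    """Return the installed vLLM version string, or None if vLLM is not installed."""
    try:
        return metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return None
```

Calling `needs_compat_fix(installed_vllm_version())` after confirming the version is not `None` tells you whether #20 has to be merged first.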

Architecture differences from BART

| Feature | BART | M2M100/NLLB |
|---|---|---|
| Positional embeddings | Learned | Fixed sinusoidal |
| LayerNorm position | POST-norm | PRE-norm |
| Post-stack layer norm | No | Yes (encoder + decoder) |
| Activation function | GELU | ReLU |
| final_logits_bias | Yes | No |
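
The two least obvious rows, fixed sinusoidal positions and PRE-norm versus POST-norm, can be illustrated in plain Python. This is a minimal sketch of the textbook formulations, not the code in nllb.py: the function names are hypothetical, the layer norm omits learned scale and bias, and the actual implementation may lay out the sine and cosine halves differently or apply a position offset.

```python
import math
from typing import Callable, List

Vector = List[float]


def sinusoidal_embedding(position: int, dim: int) -> Vector:
    """Textbook fixed positional embedding: sin on even indices, cos on odd ones."""
    emb = []
    for i in range(dim):
        # Each (sin, cos) pair of indices shares one frequency.
        angle = position / (10000 ** (2 * (i // 2) / dim))
        emb.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return emb


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """LayerNorm without learned scale/bias, enough to show the ordering."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_norm_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """POST-norm (BART column): residual add first, then LayerNorm."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_norm_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """PRE-norm (M2M100/NLLB column): LayerNorm first, then residual add."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]
```

In the PRE-norm variant the residual stream itself is never normalized inside a block, which is the usual reason such stacks add one more layer norm after the last layer, matching the "Post-stack layer norm" row.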

Language routing

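# Assumes the plugin is installed so that vLLM can load M2M100ForConditionalGeneration.
from vllm import LLM, SamplingParams
from vllm_bart_plugin.nllb import make_nllb_prompt

llm = LLM(model="facebook/nllb-200-distilled-600M")

# Build a prompt that carries the source and target FLORES-200 language codes.
prompt = make_nllb_prompt(
    "The United Nations was founded in 1945.",
    src_lang="eng_Latn",
    tgt_lang="fra_Latn",
)
# Greedy decoding (temperature=0.0), capped at 60 generated tokens.
out = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=60))
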
  • Encoder: source language token prepended via src_lang in mm_processor_kwargs (default eng_Latn).
  • Decoder: create_decoder_prompt resolves the FLORES-200 target language code to its token ID via tokenizer.convert_tokens_to_ids.
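
The token lookup described in the two bullets can be sketched without loading a model. This is an illustration only: `StubTokenizer` stands in for the real Hugging Face tokenizer, its IDs are placeholders rather than real NLLB vocabulary IDs, and `resolve_lang_token_id` and `prepend_source_lang` are hypothetical names, not functions from this PR. The only real API mirrored here is `convert_tokens_to_ids`, which returns the unknown-token ID for a token outside the vocabulary.

```python
from typing import List


class StubTokenizer:
    """Stand-in for a Hugging Face tokenizer; the IDs below are placeholders."""

    unk_token_id = 3

    def __init__(self) -> None:
        self._vocab = {"eng_Latn": 1001, "fra_Latn": 1002, "deu_Latn": 1003}

    def convert_tokens_to_ids(self, token: str) -> int:
        # Unknown tokens map to the unknown-token ID, as in the real tokenizer.
        return self._vocab.get(token, self.unk_token_id)


def resolve_lang_token_id(tokenizer, lang_code: str) -> int:
    """Map a FLORES-200 code to its token ID, rejecting unknown codes."""
    token_id = tokenizer.convert_tokens_to_ids(lang_code)
    if token_id == tokenizer.unk_token_id:
        raise ValueError(f"unknown language code: {lang_code!r}")
    return token_id


def prepend_source_lang(
    tokenizer, input_ids: List[int], src_lang: str = "eng_Latn"
) -> List[int]:
    """Encoder side: put the source language token in front of the input IDs."""
    return [resolve_lang_token_id(tokenizer, src_lang)] + list(input_ids)
```

Checking the result against `unk_token_id` turns a mistyped language code into an immediate error instead of a silently wrong translation direction.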

Tests

  • 12 unit tests (tests/test_nllb_model_structure.py) — no GPU required
  • 13 integration tests (tests/test_nllb_inference.py) — 4 target scripts, 3 non-English sources, batch, determinism, max_tokens

All 13 integration tests pass on NVIDIA GB10 (DGX Spark) with vLLM 0.18.0.

Collaborator

@NickLucche NickLucche left a comment

Hey @dschulmeist thanks for contributing!

Would you mind pushing the v0.18 fixes in a separate PR?
I thought we were vllm-0.18 compatible with the latest release; if that's not the case, I could use your fix to issue a separate patch release.

@dschulmeist
Contributor Author

Makes sense. I'll split the v0.18 compatibility changes into a separate PR and keep this one focused on M2M100/NLLB support.

Contributor Author

Split out the generic vLLM 0.18 compatibility changes into a separate PR: #20.

I also updated this branch so #19 now stays focused on the M2M100 / NLLB feature work and no longer carries the generic bart.py compatibility patch.

Adds M2M100ForConditionalGeneration support for three NLLB-200
translation models: facebook/nllb-200-distilled-600M,
facebook/nllb-200-distilled-1.3B, and facebook/nllb-200-3.3B.
All three share model_type=m2m_100.

Architecture differences from BART implemented in nllb.py:
- Sinusoidal (fixed) positional embeddings instead of learned
- PRE-LayerNorm (norm before sublayer) instead of POST-LayerNorm
- Additional layer_norm after all encoder/decoder layers
- ReLU activation instead of GELU
- No final_logits_bias

Language routing:
- Decoder starts with target language token via create_decoder_prompt,
  which resolves the FLORES-200 code (e.g. "fra_Latn") via
  convert_tokens_to_ids for reliable special-token handling.
- Source language token is prepended to the encoder input via src_lang
  in mm_processor_kwargs (default "eng_Latn"); a make_nllb_prompt
  helper is provided.

Depends on the BART processor vLLM 0.18 compatibility fix (PR vllm-project#20):
M2M100MultiModalProcessor inherits create_encoder_prompt from
BartMultiModalProcessor and needs the [0] placeholder behavior to
function under vLLM >=0.18.

Tests:
- 12 unit tests (tests/test_nllb_model_structure.py), no GPU required
- 13 integration tests (tests/test_nllb_inference.py) covering 4
  target scripts, 3 non-English sources, batching, determinism, and
  max_tokens. All 13 pass on NVIDIA GB10 (DGX Spark) with vLLM 0.18.0.

Signed-off-by: David Schulmeister <dschulmeist@users.noreply.github.com>
@dschulmeist
Contributor Author

Updated the branch: v0.18 compatibility fixes are now in #20, this PR is squashed to a single clean commit with DCO signoff, and the PR body reflects the dependency on #20. Both PRs now pass DCO and are mergeable.

@dschulmeist
Contributor Author

Hey @NickLucche both PRs are ready whenever you get a chance. #20 (v0.18 compat fix) first, then this one builds on top.
