[None][feat] Add AutoDeploy custom model for InternLM3 by govind-ramnarayan · Pull Request #218 · nv-auto-deploy/TensorRT-LLM

govind-ramnarayan · 2026-03-12T05:15:08Z

Summary

Adds a prefill-only AutoDeploy custom model for internlm/internlm3-8b-instruct (InternLM3 family)
Bundles InternLM3Config since the model uses auto_map and is not in standard transformers (the checkpoint's modeling_internlm3.py imports LossKwargs which is absent in transformers ≥4.50)
Uses canonical AD ops: torch_rmsnorm, torch_rope_with_explicit_cos_sin, torch_attention; GQA handled natively (no repeat_kv)
Dynamic NTK RoPE (factor=6.0) precomputed at init; full cos/sin table returned from RotaryEmbedding.forward(), sliced per-layer in each attention block
Registers config via AutoConfig.register("internlm3", ..., exist_ok=True) so the existing registry entry works out of the box
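
For context on the Dynamic NTK RoPE item in the summary, here is a minimal pure-Python sketch of precomputing a cos/sin table once and slicing it by position. It uses the commonly cited dynamic NTK base-rescaling formula. The function names and the head_dim, base, and max_pos values are illustrative assumptions, not taken from this PR's modeling code, which works on tensors.

```python
import math


def dynamic_ntk_inv_freq(head_dim, base, factor, seq_len, max_pos):
    """Inverse RoPE frequencies with dynamic NTK base rescaling.

    The base is only stretched when seq_len exceeds max_pos.
    """
    seq_len = max(seq_len, max_pos)
    scale = (factor * seq_len / max_pos) - (factor - 1.0)
    scaled_base = base * scale ** (head_dim / (head_dim - 2))
    return [1.0 / scaled_base ** (i / head_dim) for i in range(0, head_dim, 2)]


def build_cos_sin_table(inv_freq, num_positions):
    """Precompute cos/sin for every position; each row is one position."""
    cos, sin = [], []
    for pos in range(num_positions):
        angles = [pos * f for f in inv_freq]
        angles = angles + angles  # duplicate halves, as in rotate-half RoPE layouts
        cos.append([math.cos(a) for a in angles])
        sin.append([math.sin(a) for a in angles])
    return cos, sin


# Hypothetical small configuration, for illustration only.
inv_freq = dynamic_ntk_inv_freq(head_dim=8, base=10000.0, factor=6.0, seq_len=16, max_pos=16)
cos_table, sin_table = build_cos_sin_table(inv_freq, num_positions=16)

# A consumer would slice the rows for the positions it is processing.
cos_slice = cos_table[0:4]
```

Under this formula the scale term equals 1 when seq_len does not exceed max_pos, so the base is unchanged; only longer sequences stretch it.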

End-to-end verification

Reproducible command:

python examples/auto_deploy/build_and_run_ad.py \
  --model internlm/internlm3-8b-instruct \
  --use-registry

(Requires 2 GPUs — model is registered with world_size_2.yaml)

Result: Coherent generation confirmed on all 10 prompts with all 48 layers on 2×GPU (H100/A100). The graph transforms (insert_cached_attention) matched all 48 attention layers. CUDA graphs were captured for batch sizes 64, 48, 32, 16, and 1.

Unit tests

pytest tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py -v

Tests cover (hierarchical):

Block equivalence — RMSNorm, MLP, Attention vs inline HF reference classes
Layer equivalence — full decoder layer
Full model equivalence — end-to-end logits on CPU and CUDA
Export — torch_export_to_gm with dynamic batch and sequence dimensions, verified at two input shapes

Test plan

Run unit tests: pytest tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py -v
Run end-to-end: python examples/auto_deploy/build_and_run_ad.py --model internlm/internlm3-8b-instruct --use-registry on 2 GPUs
Verify generation is coherent (not garbled)

🤖 Generated with Claude Code

govind-ramnarayan · 2026-03-12T18:32:34Z

Unit Test Results

Ran tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py locally.

Result: 16/16 passed ✅

Full output

tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_rmsnorm_equivalence[dtype0-2-6] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_rmsnorm_equivalence[dtype0-1-8] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_mlp_equivalence[dtype0-2-6] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_mlp_equivalence[dtype0-1-8] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_attention_equivalence[dtype0-2-6] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_attention_equivalence[dtype0-1-8] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_decoder_layer_equivalence[dtype0-2-6] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_decoder_layer_equivalence[dtype0-1-8] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_full_model_equivalence[cpu-dtype0-2-6] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_full_model_equivalence[cpu-dtype0-1-8] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_full_model_equivalence[cuda-dtype0-2-6] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_full_model_equivalence[cuda-dtype0-1-8] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_model_can_be_exported PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_config_registration PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_gqa_structure PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_state_dict_keys PASSED

======================== 16 passed, 4 warnings in 3.02s ========================

Adds a prefill-only AutoDeploy custom model for internlm/internlm3-8b-instruct (and any InternLM3-family model sharing the same architecture).

Key implementation details:
- Bundles InternLM3Config since it is not part of standard transformers (model uses auto_map; the checkpoint's modeling_internlm3.py imports LossKwargs which is absent from transformers >=4.50)
- Registers config via AutoConfig.register("internlm3", ..., exist_ok=True)
- Uses canonical AD ops: torch_rmsnorm, torch_rope_with_explicit_cos_sin, torch_attention
- GQA (32Q / 2KV heads) handled natively by torch_attention (no repeat_kv)
- Dynamic NTK RoPE (factor=6.0) precomputed at init; full cos/sin table returned from RotaryEmbedding.forward(), sliced per-layer in attention
- Prefill-only: no KV cache, no attention mask, no training paths

Also adds hierarchical equivalence tests (RMSNorm, MLP, Attention, DecoderLayer, full model, torch.export) using inline HF reference classes.

Verified with:

python examples/auto_deploy/build_and_run_ad.py \
  --model internlm/internlm3-8b-instruct --use-registry

Coherent generation confirmed on all 10 prompts with 48 full layers on 2xGPU.

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
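
To illustrate the GQA layout mentioned in this commit (32 query heads sharing 2 KV heads), the sketch below shows the index mapping a grouped-query attention op can use instead of materializing repeated KV tensors. The function name is hypothetical, and this does not show how torch_attention is implemented internally.

```python
def kv_head_for_query_head(q_head, num_q_heads=32, num_kv_heads=2):
    """Return the index of the KV head shared by the given query head."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads  # 16 query heads per KV head here
    return q_head // group_size


# Query heads 0..15 read KV head 0; query heads 16..31 read KV head 1.
mapping = [kv_head_for_query_head(h) for h in range(32)]
```

This is the same grouping that a repeat_kv step would produce by repeating each KV head 16 times, expressed as an index lookup rather than a copy.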

… masking

The _RefInternLM3Attention used non-causal (full) attention in its reference implementation, while the custom InternLM3Attention uses is_causal=True in torch_attention. Replace the manual matmul/softmax with F.scaled_dot_product_attention(is_causal=True) to correctly match the causal behavior of the AD custom model.

Signed-off-by: Govind Ramnarayan <gramnarayan@nvidia.com>
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
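
The mismatch this commit fixes can be reproduced with a small pure-Python sketch of single-head scaled dot-product attention, with and without a causal mask. The function and the toy inputs are illustrative assumptions; the actual tests use F.scaled_dot_product_attention on tensors.

```python
import math


def attention(q, k, v, is_causal):
    """Single-head scaled dot-product attention on lists of vectors."""
    dim = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        # With causal masking, token i may only attend to tokens 0..i.
        last = i + 1 if is_causal else len(k)
        scores = [
            sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(dim)
            for j in range(last)
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights)) / total
            for c in range(len(v[0]))
        ])
    return out


# Toy inputs: three tokens, head dimension two.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

causal_out = attention(q, k, v, is_causal=True)
full_out = attention(q, k, v, is_causal=False)
```

Only the last token sees the same keys in both modes, so a non-causal reference diverges from a causal model on every earlier position.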

- Remove bundled InternLM3Config: AutoConfig.from_pretrained with trust_remote_code=True loads it from the HF snapshot, and the AD factory lookup uses type(config).__name__, which equals "InternLM3Config" in both cases. Also removes keys_to_ignore_at_inference, which is not needed for this prefill-only model.
- Update unit tests to load HF reference classes directly from the local HF snapshot via importlib.util (synthetic package to handle relative imports). Stubs LossKwargs in transformers.utils to work around the version mismatch in the installed transformers.

Signed-off-by: Govind Ramnarayan <gramnarayan@nvidia.com>
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>

…ons in InternLM3 tests

Remove the HF snapshot path dependency (importlib.util + hardcoded /lustre/.../snapshots/... path) and LossKwargs stub. Replace with self-contained inline reference classes (_RefInternLM3Config, _RefInternLM3RMSNorm, _RefInternLM3RotaryEmbedding, _RefInternLM3MLP, _RefInternLM3Attention, _RefInternLM3DecoderLayer, _RefInternLM3ForCausalLM) copied from the HF source. All 16 unit tests continue to pass.

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>

… use

Add InternLM3Config back to modeling_internlm3.py so tests and tooling can instantiate it directly without AutoConfig.from_pretrained or the HuggingFace Hub. The class matches the HF checkpoint config fields (vocab_size, hidden_size, qkv_bias, head_dim, rope_scaling, etc.) and runs rope_config_validation on init.

At inference time the config loaded via trust_remote_code also has __name__ == "InternLM3Config", so the AD factory registration is unaffected. No AutoConfig.register call: AutoConfig continues to work out of the box via trust_remote_code.

Update tests to import InternLM3Config from the modeling module directly, removing the standalone _RefInternLM3Config workaround.

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>

InternLM3Config does not belong in the modeling file: the AD factory looks up configs by class name at runtime (from the real checkpoint's trust_remote_code config), so bundling a duplicate config class in the modeling file is unnecessary and misleading. Move it to the test file where it is only used to instantiate small synthetic configs for unit tests without hitting AutoConfig or the HuggingFace Hub.

Also update all type hints in the modeling file back to PretrainedConfig and restore config_class = PretrainedConfig on InternLM3PreTrainedModel.

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
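
The reasoning about lookup by class name can be sketched as follows. This is a hypothetical registry written for illustration; the names _MODEL_REGISTRY, register_custom_model, and lookup_custom_model are assumptions and are not AutoDeploy's actual API.

```python
# Hypothetical registry keyed by config class name (illustrative only).
_MODEL_REGISTRY = {}


def register_custom_model(config_class_name):
    """Decorator that maps a config class name to a custom model class."""
    def decorator(model_cls):
        _MODEL_REGISTRY[config_class_name] = model_cls
        return model_cls
    return decorator


def lookup_custom_model(config):
    """Resolve the custom model from the config's class name, not its identity."""
    return _MODEL_REGISTRY.get(type(config).__name__)


@register_custom_model("InternLM3Config")
class CustomInternLM3Model:
    pass


def make_config_class():
    """Create a fresh, unrelated class that happens to be named InternLM3Config."""
    class InternLM3Config:
        pass
    return InternLM3Config


# Stand-ins for the checkpoint's remote-code config and a test-local config.
RemoteConfig = make_config_class()
LocalConfig = make_config_class()
```

Because the key is the class name, both stand-in configs resolve to the same custom model even though they are distinct classes, which is why a duplicate config class in the modeling file is not required for the lookup to succeed.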

…config) Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>

…needed) Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>

…ther AD custom models) Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>

govind-ramnarayan · 2026-03-13T01:55:24Z

+    custom_sd = {}
+    for key, value in ref_sd.items():
+        if key.startswith("lm_head"):
+            custom_sd[key] = value


Is this because the HF checkpoint is missing the lm_head attribute?

lucaslie · 2026-03-13T01:57:38Z

merged #222

govind-ramnarayan force-pushed the gramnarayan/intern-lm branch from 7c178af to b52a2dc on March 12, 2026 17:40

govind-ramnarayan force-pushed the gramnarayan/intern-lm branch from b52a2dc to 1c97346 on March 12, 2026 18:45

govind-ramnarayan added 2 commits March 12, 2026 12:32

govind-ramnarayan force-pushed the gramnarayan/intern-lm branch from 1c97346 to 6b1270e on March 12, 2026 20:29