[None][feat] Add AutoDeploy custom model for InternLM3#218

Closed
govind-ramnarayan wants to merge 9 commits into feat/paperclip_maximizer from gramnarayan/intern-lm

@govind-ramnarayan govind-ramnarayan commented Mar 12, 2026


Summary

  • Adds a prefill-only AutoDeploy custom model for internlm/internlm3-8b-instruct (InternLM3 family)
  • Bundles InternLM3Config since the model uses auto_map and is not in standard transformers (the checkpoint's modeling_internlm3.py imports LossKwargs which is absent in transformers ≥4.50)
  • Uses canonical AD ops: torch_rmsnorm, torch_rope_with_explicit_cos_sin, torch_attention; GQA handled natively (no repeat_kv)
  • Dynamic NTK RoPE (factor=6.0) precomputed at init; full cos/sin table returned from RotaryEmbedding.forward(), sliced per-layer in each attention block
  • Registers config via AutoConfig.register("internlm3", ..., exist_ok=True) so the existing registry entry works out of the box
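
To illustrate the "precompute once, slice per layer" idea behind the RoPE bullet, here is a minimal stdlib-only sketch. It is not the PR's code: the scaling formula is the dynamic NTK rule as implemented in Hugging Face transformers, the duplicated-halves layout assumes the common rotate-half convention, and the function names and all numeric values except factor=6.0 are hypothetical. The PR does not state which sequence length it uses for the precomputation, so the sketch takes it as a parameter.

```python
import math


def dynamic_ntk_inv_freq(head_dim, base, factor, max_pos, seq_len):
    """Inverse RoPE frequencies with dynamic NTK scaling of the base."""
    # Scaling only takes effect beyond the trained context length.
    seq_len = max(seq_len, max_pos)
    scale = (factor * seq_len / max_pos) - (factor - 1)
    scaled_base = base * scale ** (head_dim / (head_dim - 2))
    return [1.0 / (scaled_base ** (i / head_dim)) for i in range(0, head_dim, 2)]


def build_cos_sin_table(inv_freq, num_positions):
    """Precompute cos/sin once for every position; one row per position."""
    cos, sin = [], []
    for pos in range(num_positions):
        angles = [pos * f for f in inv_freq]
        angles = angles + angles  # duplicate halves (assumed rotate-half layout)
        cos.append([math.cos(a) for a in angles])
        sin.append([math.sin(a) for a in angles])
    return cos, sin


# Hypothetical toy sizes for illustration, not the checkpoint's config.
inv_freq = dynamic_ntk_inv_freq(head_dim=8, base=10000.0, factor=6.0, max_pos=16, seq_len=16)
cos_table, sin_table = build_cos_sin_table(inv_freq, num_positions=16)

# Each attention block slices only the rows it needs for the current input.
seq_len = 5
cos, sin = cos_table[:seq_len], sin_table[:seq_len]
```

Returning the full table and slicing inside each layer keeps the table construction out of the per-layer path, at the cost of holding the whole table in memory.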

End-to-end verification

Reproducible command:

python examples/auto_deploy/build_and_run_ad.py \
  --model internlm/internlm3-8b-instruct \
  --use-registry

(Requires 2 GPUs — model is registered with world_size_2.yaml)

Result: Coherent generation confirmed on all 10 prompts with 48 full layers on 2×GPU (H100/A100). All graph transforms (insert_cached_attention) matched all 48 attention layers. CUDA graphs captured for batch sizes 64, 48, 32, 16, 1.

Unit tests

pytest tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py -v

Tests cover (hierarchical):

  1. Block equivalence — RMSNorm, MLP, Attention vs inline HF reference classes
  2. Layer equivalence — full decoder layer
  3. Full model equivalence — end-to-end logits on CPU and CUDA
  4. Export: torch_export_to_gm with dynamic batch and sequence dimensions, verified at two input shapes
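
The block-level tier of the hierarchy above follows a simple pattern: feed the same random inputs and weights to a reference and to the implementation under test, then compare within a tolerance. A rough stdlib-only sketch of that pattern for RMSNorm follows. The function names, sizes, and tolerances are assumptions for illustration; the real tests operate on torch modules.

```python
import math
import random


def rmsnorm_reference(x, weight, eps=1e-6):
    """Reference RMSNorm: scale by the reciprocal RMS, then by the weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


def rmsnorm_candidate(x, weight, eps=1e-6):
    """Stand-in for the implementation under test (deliberately written differently)."""
    variance = math.fsum(v ** 2 for v in x) / len(x)
    return [w * (v / math.sqrt(variance + eps)) for v, w in zip(x, weight)]


def check_equivalence(hidden_size, num_vectors, seed=0, rel_tol=1e-9):
    """Compare both implementations on shared random inputs and weights."""
    rng = random.Random(seed)
    weight = [rng.uniform(0.5, 1.5) for _ in range(hidden_size)]
    for _ in range(num_vectors):
        x = [rng.gauss(0.0, 1.0) for _ in range(hidden_size)]
        ref = rmsnorm_reference(x, weight)
        got = rmsnorm_candidate(x, weight)
        for r, g in zip(ref, got):
            assert math.isclose(r, g, rel_tol=rel_tol, abs_tol=1e-12), (r, g)
    return True


# Two hypothetical input sizes, one check per size.
for hidden_size, num_vectors in [(16, 12), (32, 8)]:
    check_equivalence(hidden_size, num_vectors)
```

Passing the smallest blocks first makes a failure at the layer or full-model tier easier to localize.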

Test plan

  • Run unit tests: pytest tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py -v
  • Run end-to-end: python examples/auto_deploy/build_and_run_ad.py --model internlm/internlm3-8b-instruct --use-registry on 2 GPUs
  • Verify generation is coherent (not garbled)

🤖 Generated with Claude Code

@govind-ramnarayan

Author

Unit Test Results

Ran tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py locally.

Result: 16/16 passed ✅

Full output
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_rmsnorm_equivalence[dtype0-2-6] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_rmsnorm_equivalence[dtype0-1-8] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_mlp_equivalence[dtype0-2-6] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_mlp_equivalence[dtype0-1-8] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_attention_equivalence[dtype0-2-6] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_attention_equivalence[dtype0-1-8] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_decoder_layer_equivalence[dtype0-2-6] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_decoder_layer_equivalence[dtype0-1-8] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_full_model_equivalence[cpu-dtype0-2-6] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_full_model_equivalence[cpu-dtype0-1-8] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_full_model_equivalence[cuda-dtype0-2-6] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_full_model_equivalence[cuda-dtype0-1-8] PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_model_can_be_exported PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_config_registration PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_gqa_structure PASSED
tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py::test_internlm3_state_dict_keys PASSED

======================== 16 passed, 4 warnings in 3.02s ========================

Adds a prefill-only AutoDeploy custom model for internlm/internlm3-8b-instruct
(and any InternLM3-family model sharing the same architecture).

Key implementation details:
- Bundles InternLM3Config since it is not part of standard transformers
  (model uses auto_map; the checkpoint's modeling_internlm3.py imports
  LossKwargs which is absent from transformers >=4.50)
- Registers config via AutoConfig.register("internlm3", ..., exist_ok=True)
- Uses canonical AD ops: torch_rmsnorm, torch_rope_with_explicit_cos_sin,
  torch_attention
- GQA (32Q / 2KV heads) handled natively by torch_attention — no repeat_kv
- Dynamic NTK RoPE (factor=6.0) precomputed at init; full cos/sin table
  returned from RotaryEmbedding.forward(), sliced per-layer in attention
- Prefill-only: no KV cache, no attention mask, no training paths
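
For readers unfamiliar with the GQA point above, the following sketch shows why repeat_kv is avoidable in principle: with 32 query heads and 2 KV heads, each query head only needs the index of its shared KV head. This is a plain-Python illustration of the head grouping, not how torch_attention is implemented; the helper names are hypothetical.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Index of the KV head shared by the given query head."""
    assert num_q_heads % num_kv_heads == 0
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def repeat_kv(kv_heads, n_rep):
    """What an explicit repeat materializes: each KV head copied n_rep times in a row."""
    return [head for head in kv_heads for _ in range(n_rep)]


num_q_heads, num_kv_heads = 32, 2  # head counts from the commit message
kv_heads = ["kv0", "kv1"]  # placeholders standing in for K/V tensors

expanded = repeat_kv(kv_heads, num_q_heads // num_kv_heads)
mapping = [kv_head_for_query_head(q, num_q_heads, num_kv_heads) for q in range(num_q_heads)]

# Index mapping and explicit repetition select the same KV head per query head.
same = all(expanded[q] == kv_heads[mapping[q]] for q in range(num_q_heads))
```

The explicit repeat holds 16 copies of each KV head here, which the index mapping avoids.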

Also adds hierarchical equivalence tests (RMSNorm, MLP, Attention, DecoderLayer,
full model, torch.export) using inline HF reference classes.

Verified with:
  python examples/auto_deploy/build_and_run_ad.py \
    --model internlm/internlm3-8b-instruct --use-registry
Coherent generation confirmed on all 10 prompts with 48 full layers on 2xGPU.

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
… masking

The _RefInternLM3Attention used non-causal (full) attention in its
reference implementation, while the custom InternLM3Attention uses
is_causal=True in torch_attention. Replace the manual matmul/softmax
with F.scaled_dot_product_attention(is_causal=True) to correctly match
the causal behavior of the AD custom model.

Signed-off-by: Govind Ramnarayan <gramnarayan@nvidia.com>
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
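
The mismatch fixed by the commit above is easy to reproduce in miniature. The sketch below is a stdlib-only single-head attention with a hypothetical `attention` helper, not the test's reference class. With causal masking the first token can only attend to itself, so its output equals its own value vector; with full attention it does not.

```python
import math


def attention(q, k, v, is_causal):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        # Causal: position i attends to positions 0..i only.
        last = i if is_causal else len(k) - 1
        scores = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in range(last + 1)]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * v[j][d] for j, w in enumerate(weights))
            for d in range(len(v[0]))
        ])
    return out


# Toy inputs, chosen only to make the difference visible.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

causal_out = attention(q, k, v, is_causal=True)
full_out = attention(q, k, v, is_causal=False)
```

Only the last position sees the whole sequence in both modes, so a reference that omits the mask disagrees on every earlier position.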
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_internlm3.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_internlm3.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py Outdated
- Remove bundled InternLM3Config: AutoConfig.from_pretrained with
  trust_remote_code=True loads it from the HF snapshot, and the AD
  factory lookup uses type(config).__name__ which equals "InternLM3Config"
  in both cases. Also removes keys_to_ignore_at_inference which is not
  needed for this prefill-only model.

- Update unit tests to load HF reference classes directly from the
  local HF snapshot via importlib.util (synthetic package to handle
  relative imports). Stubs LossKwargs in transformers.utils to work
  around the version mismatch in the installed transformers.

Signed-off-by: Govind Ramnarayan <gramnarayan@nvidia.com>
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
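
The LossKwargs stub mentioned in the commit above is an instance of a general workaround: attach a placeholder for a missing symbol before importing code that expects it. A minimal sketch of the idea follows, using a stand-in module instead of transformers.utils so it runs without any third-party package. The module name and the `ensure_attr` helper are hypothetical, and the placeholder's base class is an arbitrary choice. A later commit in this PR drops the stub in favor of inline reference classes.

```python
import sys
import types


def ensure_attr(module, name, factory):
    """Attach a placeholder only if the module lacks the symbol."""
    if not hasattr(module, name):
        setattr(module, name, factory())
    return getattr(module, name)


# Stand-in for an installed package that no longer exports a symbol.
fake_utils = types.ModuleType("fake_pkg_utils")  # hypothetical module name
sys.modules["fake_pkg_utils"] = fake_utils

ensure_attr(fake_utils, "LossKwargs", lambda: type("LossKwargs", (dict,), {}))

# Code written against the older API can now import the name.
from fake_pkg_utils import LossKwargs
```

Guarding with hasattr keeps the stub inert on versions where the real symbol exists.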
Comment thread tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py Outdated
…ons in InternLM3 tests

Remove the HF snapshot path dependency (importlib.util + hardcoded
/lustre/.../snapshots/... path) and LossKwargs stub. Replace with
self-contained inline reference classes (_RefInternLM3Config,
_RefInternLM3RMSNorm, _RefInternLM3RotaryEmbedding, _RefInternLM3MLP,
_RefInternLM3Attention, _RefInternLM3DecoderLayer, _RefInternLM3ForCausalLM)
copied from the HF source. All 16 unit tests continue to pass.

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
… use

Add InternLM3Config back to modeling_internlm3.py so tests and tooling
can instantiate it directly without AutoConfig.from_pretrained or the
HuggingFace Hub. The class matches the HF checkpoint config fields
(vocab_size, hidden_size, qkv_bias, head_dim, rope_scaling, etc.) and
runs rope_config_validation on init.

At inference time the config loaded via trust_remote_code also has
__name__ == "InternLM3Config", so the AD factory registration is
unaffected. No AutoConfig.register call — AutoConfig continues to work
out of the box via trust_remote_code.

Update tests to import InternLM3Config from the modeling module directly,
removing the standalone _RefInternLM3Config workaround.

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
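
The claim that the factory registration is unaffected rests on the lookup being keyed by class name, not class identity. The sketch below illustrates that behavior with a hypothetical registry; the class and method names are invented for illustration and are not the AutoDeploy factory API.

```python
class CustomModelRegistry:
    """Hypothetical registry keyed by the config's class name."""

    def __init__(self):
        self._models = {}

    def register(self, config_class_name, model_cls):
        self._models[config_class_name] = model_cls

    def lookup(self, config):
        # Only the name matters, not which module defined the class.
        return self._models[type(config).__name__]


class PlaceholderInternLM3Model:
    """Placeholder standing in for the custom model class."""


def make_config_class():
    """Create a fresh class object that is named InternLM3Config."""
    return type("InternLM3Config", (), {})


RemoteConfig = make_config_class()   # e.g. loaded via trust_remote_code
BundledConfig = make_config_class()  # e.g. a local copy used in tests

registry = CustomModelRegistry()
registry.register("InternLM3Config", PlaceholderInternLM3Model)

remote_model = registry.lookup(RemoteConfig())
bundled_model = registry.lookup(BundledConfig())
```

Both lookups resolve to the same model class even though the two config classes are distinct objects, which is why a config class living in the test file does not interfere with the runtime path.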
InternLM3Config does not belong in the modeling file — the AD factory
looks up configs by class name at runtime (from the real checkpoint's
trust_remote_code config), so bundling a duplicate config class in the
modeling file is unnecessary and misleading.

Move it to the test file where it is only used to instantiate small
synthetic configs for unit tests without hitting AutoConfig or the
HuggingFace Hub.

Also update all type hints in the modeling file back to PretrainedConfig
and restore config_class = PretrainedConfig on InternLM3PreTrainedModel.

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
…config)

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_internlm3.py
…needed)

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
…ther AD custom models)

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
custom_sd = {}
for key, value in ref_sd.items():
    if key.startswith("lm_head"):
        custom_sd[key] = value

Author

Is this bc the hf checkpoint is missing the lm_head attribute?

@lucaslie

merged #222

@lucaslie lucaslie closed this Mar 13, 2026