[None][feat] Add AD custom model for GLM MoE DSA family (GLM-5) by lucaslie · Pull Request #240 · nv-auto-deploy/TensorRT-LLM

lucaslie · 2026-03-12T19:48:24Z

Summary

Add a prefill-only AutoDeploy custom model for the glm_moe_dsa architecture (zai-org/GLM-5, zai-org/GLM-5-FP8)
The model uses MLA (Multi-head Latent Attention) + MoE (256 experts, 8 active) with noaux_tc-style sigmoid routing, similar to DeepSeek-V3
Include hierarchical equivalence tests against standalone HF-faithful reference implementations
Update the model registry entries with a glm_5.yaml config that works around a tokenizer compatibility issue

Architecture notes

The GLM MoE DSA model (model_type: "glm_moe_dsa") features:

MLA attention with q_lora_rank=2048, kv_lora_rank=512, qk_nope_head_dim=192, qk_rope_head_dim=64, v_head_dim=256
MoE with 256 routed experts, 8 active per token, 1 shared expert
noaux_tc routing: sigmoid scoring → bias correction → group top-k → normalize → scale (vanilla PyTorch, AD transforms replace with fused kernels at deployment)
DSA indexer (Dynamic Sparse Attention) — skipped for prefill-only AD export
MTP layers (Multi-Token Prediction) — skipped for prefill-only AD export
Nested rope_parameters config format (rope_theta inside a dict)
TokenizersBackend alias class for GLM-5-FP8 tokenizer compatibility
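
The noaux_tc routing steps listed above (sigmoid scoring, bias correction, group top-k, normalize, scale) can be sketched in plain Python for a single token. This is a minimal illustration, not the PR's implementation: the function name `noaux_tc_route` and its parameters are hypothetical, and scoring each group by the sum of its two best biased scores follows the DeepSeek-V3 convention, which is assumed here rather than taken from the PR.

```python
import math


def noaux_tc_route(logits, bias, n_group, topk_group, top_k, scale):
    """Route one token and return (expert_ids, weights). Hypothetical helper."""
    # 1. Sigmoid scoring of the raw router logits.
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # 2. Bias correction: the bias only influences which experts get selected.
    biased = [s + b for s, b in zip(scores, bias)]
    # 3. Group top-k: rank groups by the sum of their two best biased scores
    #    (assumed DeepSeek-V3 convention), then keep the best `topk_group` groups.
    per_group = len(scores) // n_group
    group_scores = []
    for g in range(n_group):
        chunk = sorted(biased[g * per_group:(g + 1) * per_group], reverse=True)
        group_scores.append(sum(chunk[:2]))
    kept = sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)[:topk_group]
    candidates = [e for g in kept for e in range(g * per_group, (g + 1) * per_group)]
    chosen = sorted(candidates, key=lambda e: biased[e], reverse=True)[:top_k]
    # 4. Normalize the unbiased scores of the chosen experts, then 5. scale.
    raw = [scores[e] for e in chosen]
    total = sum(raw) + 1e-20
    return chosen, [scale * w / total for w in raw]


# Toy setup: 8 experts in 4 groups of 2, keep 2 groups, pick 2 experts.
logits = [2.0, 1.0, -1.0, -2.0, 0.5, 0.0, 3.0, -3.0]
ids, weights = noaux_tc_route(logits, [0.0] * 8, n_group=4, topk_group=2, top_k=2, scale=2.5)
print(ids, weights)
```

In this toy input, expert 6 has the highest individual score, but its group loses the group-level selection, so it is never a candidate. That is the effect of restricting the expert top-k to the selected groups.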

Canonical ops used

torch.ops.auto_deploy.torch_rmsnorm — all RMSNorm layers
torch.ops.auto_deploy.torch_mla — MLA attention
torch.ops.auto_deploy.torch_moe — MoE expert dispatch
torch.ops.auto_deploy.torch_rope_with_explicit_cos_sin — RoPE application

Files

| File | Description |
| --- | --- |
| `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm_moe_dsa.py` | Custom model implementation |
| `tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py` | Added import + `__all__` entry |
| `tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py` | Hierarchical tests |
| `examples/auto_deploy/model_registry/models.yaml` | Added `glm_5.yaml` to both entries |
| `examples/auto_deploy/model_registry/configs/glm_5.yaml` | Tokenizer config workaround |

AD end-to-end run results

Step 1: Reduced layers (4 layers) — SUCCESS

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python examples/auto_deploy/build_and_run_ad.py --model zai-org/GLM-5 --use-registry --args.num_hidden_layers 4

Build and generation completed, confirming that the e2e flow runs. The output was incoherent, as expected with only 4 of 78 layers.

Step 2: Full layers (78 layers) — OOM

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python examples/auto_deploy/build_and_run_ad.py --model zai-org/GLM-5 --use-registry

Both BF16 and FP8 variants OOM'd on 8x H100 80GB. The bottleneck is graph IR overhead during export/sharding (~67-68 GB/rank), not weight size. This is an infrastructure limitation for very large MoE models (78 layers × 256 experts).
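
For a rough sense of scale, the "78 layers × 256 experts" figure can be turned into a count of routed-expert weight tensors. This is only back-of-the-envelope arithmetic: it assumes every layer is an MoE layer (so it is an upper bound if some layers are dense) and that each expert is a gated MLP with three projection weights, neither of which is stated in the PR. It counts tensors and does not measure graph IR memory.

```python
NUM_LAYERS = 78          # from the PR description
EXPERTS_PER_LAYER = 256  # routed experts per MoE layer, from the PR description
WEIGHTS_PER_EXPERT = 3   # assumption: gate, up, and down projections per expert


def expert_weight_count(layers, experts, weights_per_expert):
    """Upper bound on routed-expert weight tensors if every layer were MoE."""
    return layers * experts * weights_per_expert


expert_modules = NUM_LAYERS * EXPERTS_PER_LAYER
weight_tensors = expert_weight_count(NUM_LAYERS, EXPERTS_PER_LAYER, WEIGHTS_PER_EXPERT)
print(expert_modules)  # 19968
print(weight_tensors)  # 59904
```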

Reproduce

Run with reduced layers to verify e2e:

python examples/auto_deploy/build_and_run_ad.py --model zai-org/GLM-5 --use-registry --args.num_hidden_layers 4

Run with full layers (requires more GPU memory or graph IR optimizations):

python examples/auto_deploy/build_and_run_ad.py --model zai-org/GLM-5 --use-registry

Test plan

Run unit tests:

pytest tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py -v

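Tests cover config registration, nested rope_parameters handling, layer types, expert structure, numerical equivalence at the MLP, MoE, attention, decoder-layer, and full-model levels, and export.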

🤖 Generated with Claude Code

lucaslie · 2026-03-12T23:35:54Z

[AGENT] Addressed all three review comments in a4a205c:

num_hidden_layers_5.yaml — Added to both zai-org/GLM-5 and zai-org/GLM-5-FP8 registry entries for reduced-layer dashboard runs.
Config class — glm_moe_dsa is NOT available in the installed transformers (v4.57.1). Confirmed via CONFIG_MAPPING — KeyError: 'glm_moe_dsa'. The bundled config class is necessary.
torch.ops.trtllm.noaux_tc_op — Switched back to the fused noaux_tc_op + dsv3_router_gemm_op for the MoE gate, matching the pattern used by modeling_deepseek.py and modeling_glm4_moe_lite.py.
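
The CONFIG_MAPPING check described above can be reproduced with a small helper. This is a sketch, not code from the PR: `needs_bundled_config` is a hypothetical name, and a plain dict stands in for the real mapping so the snippet runs without transformers installed.

```python
def needs_bundled_config(model_type, config_mapping):
    """Return True when `model_type` is missing from the given mapping."""
    try:
        config_mapping[model_type]
    except KeyError:
        return True
    return False


# With transformers installed, the real mapping could be passed in instead:
#   from transformers.models.auto.configuration_auto import CONFIG_MAPPING
#   needs_bundled_config("glm_moe_dsa", CONFIG_MAPPING)
fake_mapping = {"llama": object}  # stand-in, not the real CONFIG_MAPPING
print(needs_bundled_config("glm_moe_dsa", fake_mapping))  # True
print(needs_bundled_config("llama", fake_mapping))        # False
```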

lucaslie

please rebase. please re-run ALL UNIT TESTS FOR THIS MODEL AND the build_and_run_ad.py command with the model registry config. PLEASE POST RAW LOGS FOR ALL OF THEM

Add prefill-only AutoDeploy custom model for the glm_moe_dsa architecture (zai-org/GLM-5, zai-org/GLM-5-FP8). The model uses Multi-head Latent Attention (MLA) and Mixture of Experts (MoE) with noaux_tc-style sigmoid routing, similar to DeepSeek-V3.

Key implementation details:
- Bundled GlmMoeDsaConfig (not yet in transformers)
- Uses canonical AD ops: torch_rmsnorm, torch_mla, torch_moe, torch_rope_with_explicit_cos_sin
- Vanilla PyTorch noaux_tc router (sigmoid + group topk + normalize)
- Shared rotary embedding at model level with _ad_ buffer prefix
- RoPE weight de-interleaving via mla_rope_utils load hook
- TokenizersBackend alias for GLM-5-FP8 tokenizer compatibility
- DSA indexer and MTP layers skipped (not needed for prefill)

Includes hierarchical equivalence tests (MLP, MoE, Attention, Decoder Layer, Full Model, Export) against standalone HF-faithful reference implementations.

Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

- Add num_hidden_layers_5.yaml to GLM-5 registry entries for dashboard runs
- Switch MoE gate to torch.ops.trtllm.noaux_tc_op (fused routing kernel)

Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

lucaslie · 2026-03-13T01:50:11Z

[AGENT] Rebased and re-ran all tests. Raw logs below.

Unit Tests — 17/17 PASSED (4.36s)

pytest tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py -v

Full raw log

============================= test session starts ==============================
platform linux -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0

tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_config_registration PASSED [  5%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_config_nested_rope_parameters PASSED [ 11%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_layer_types PASSED [ 17%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_expert_structure PASSED [ 23%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_mlp_numerical_equivalence[dtype0-2-6] PASSED [ 29%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_mlp_numerical_equivalence[dtype0-1-8] PASSED [ 35%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_moe_numerical_equivalence[dtype0-2-6] PASSED [ 41%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_moe_numerical_equivalence[dtype0-1-8] PASSED [ 47%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_attention_numerical_equivalence[dtype0-2-6] PASSED [ 52%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_attention_numerical_equivalence[dtype0-1-8] PASSED [ 58%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_decoder_layer_numerical_equivalence[0-dtype0-2-6] PASSED [ 64%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_decoder_layer_numerical_equivalence[0-dtype0-1-8] PASSED [ 70%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_decoder_layer_numerical_equivalence[1-dtype0-2-6] PASSED [ 76%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_decoder_layer_numerical_equivalence[1-dtype0-1-8] PASSED [ 82%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_full_model_numerical_equivalence[dtype0-2-6] PASSED [ 88%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_full_model_numerical_equivalence[dtype0-1-8] PASSED [ 94%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_model_can_be_exported PASSED [100%]

======================== 17 passed, 4 warnings in 4.36s ========================

build_and_run_ad.py — SUCCESS (5 layers via registry)

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python examples/auto_deploy/build_and_run_ad.py --model zai-org/GLM-5 --use-registry

Registry resolves: ['dashboard_default.yaml', 'world_size_8.yaml', 'glm_5.yaml', 'num_hidden_layers_5.yaml']
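
One way to picture a list of configs like this is as layers merged in order. The sketch below assumes that later files override earlier ones key by key, which the log does not state. The `deep_merge` and `resolve` helpers and the dictionary contents are hypothetical stand-ins, not the real YAML files or AutoDeploy's merge logic.

```python
def deep_merge(base, override):
    """Return a copy of `base` with `override` applied, recursing into nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve(layers):
    """Fold a list of config dicts left to right (assumed: later entries win)."""
    result = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


# Hypothetical contents, NOT the real registry YAML files.
layers = [
    {"world_size": 1, "args": {"num_hidden_layers": 78}},
    {"world_size": 8},
    {"args": {"num_hidden_layers": 5}},
]
print(resolve(layers))  # {'world_size': 8, 'args': {'num_hidden_layers': 5}}
```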

Compilation: 122s across 8 ranks on H100 80GB
Memory: ~42.86 GB allocated per rank
Generation: garbled text (expected with 5/78 layers — confirms e2e pipeline works)

Full raw log (key sections)

[TRT-LLM AUTO-DEPLOY] [I] AutoDeploy Experiment Config:
  yaml_extra:
  - .../configs/dashboard_default.yaml
  - .../configs/world_size_8.yaml
  - .../configs/glm_5.yaml
  - .../configs/num_hidden_layers_5.yaml

[TRT-LLM AUTO-DEPLOY] [RANK 1] [I] [stage=factory, transform=build_model] [APPLY] Using custom model implementation <class 'tensorrt_llm._torch.auto_deploy.models.custom.modeling_glm_moe_dsa.GlmMoeDsaForCausalLM'>

[TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Total time for all transforms: 122.24s
[TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [stage=compile, transform=compile_model] [CUDA MEM DIFF (EXPECTED)] free: 14.59GB | resv: 43.95GB | alloc: 42.86GB

[TRT-LLM AUTO-DEPLOY] [I] Running example prompts...
Processed requests: 100%|██████████| 10/10 [00:00<00:00, 12.38it/s]

[PROMPT 0] How big is the universe?: Bevölker StructurediVar CONTENTratisFilmtü丛 prescriptions harbor...
[PROMPT 1] explain the concept of gravity: foobar.nc mat viewsNAL的风 Morgan\Twig\Action hápunitive...
...
[PROMPT 9] health benefits of green tea: Tracks Erdo recounts了一种 Rob brows coverage...

[TRT-LLM AUTO-DEPLOY] [RANK 0-7] [I] Destroying process group

Note: Full 78-layer run OOMs on 8x H100 80GB due to graph IR overhead (~67-68 GB/rank from export+sharding). The num_hidden_layers_5.yaml in the registry enables dashboard e2e runs with reduced layers.

github-actions Bot assigned lucaslie Mar 12, 2026

lucaslie commented Mar 12, 2026

Comment thread examples/auto_deploy/model_registry/models.yaml Outdated

Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm_moe_dsa.py

Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm_moe_dsa.py

lucaslie added 2 commits March 12, 2026 18:33

lucaslie force-pushed the ll/pcm_113 branch from a4a205c to 671c985 Compare March 13, 2026 01:33

lucaslie merged commit 034c779 into feat/paperclip_maximizer Mar 13, 2026
3 of 4 checks passed

lucaslie mentioned this pull request Mar 13, 2026

[None][feat] Add AutoDeploy custom model for GLM-5 (glm_moe_dsa) #246

Open
