[None][feat] Add AD custom model for GLM MoE DSA family (GLM-5) #240
[AGENT] Addressed all three review comments in a4a205c:
|
Add prefill-only AutoDeploy custom model for the glm_moe_dsa architecture (zai-org/GLM-5, zai-org/GLM-5-FP8). The model uses Multi-head Latent Attention (MLA) and Mixture of Experts (MoE) with noaux_tc-style sigmoid routing, similar to DeepSeek-V3.

Key implementation details:
- Bundled GlmMoeDsaConfig (not yet in transformers)
- Uses canonical AD ops: torch_rmsnorm, torch_mla, torch_moe, torch_rope_with_explicit_cos_sin
- Vanilla PyTorch noaux_tc router (sigmoid + group topk + normalize)
- Shared rotary embedding at model level with _ad_ buffer prefix
- RoPE weight de-interleaving via mla_rope_utils load hook
- TokenizersBackend alias for GLM-5-FP8 tokenizer compatibility
- DSA indexer and MTP layers skipped (not needed for prefill)

Includes hierarchical equivalence tests (MLP, MoE, Attention, Decoder Layer, Full Model, Export) against standalone HF-faithful reference implementations.

Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
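To illustrate the "RoPE weight de-interleaving" item in the commit message above, here is a minimal sketch. It assumes that de-interleaving means reordering rotary rows from a pairwise layout (x0, y0, x1, y1, ...) into a half-split layout (x0, x1, ..., y0, y1, ...). The function names `deinterleave` and `interleave` are hypothetical and are not the `mla_rope_utils` API. Plain Python lists stand in for weight rows.

```python
def deinterleave(rows):
    """Reorder [x0, y0, x1, y1, ...] into [x0, x1, ..., y0, y1, ...].

    Assumption: this is the layout change meant by "de-interleaving";
    the real load hook operates on tensors, not lists.
    """
    if len(rows) % 2 != 0:
        raise ValueError("rotary dimension must be even")
    # Even positions first, then odd positions.
    return rows[0::2] + rows[1::2]


def interleave(rows):
    """Inverse of deinterleave: restore the pairwise layout."""
    if len(rows) % 2 != 0:
        raise ValueError("rotary dimension must be even")
    half = len(rows) // 2
    out = []
    for first, second in zip(rows[:half], rows[half:]):
        out.extend([first, second])
    return out


rope_rows = ["x0", "y0", "x1", "y1", "x2", "y2"]
split_rows = deinterleave(rope_rows)
print(split_rows)
```

Doing the reordering once at weight-load time, rather than on every forward pass, keeps the exported graph free of extra permute operations.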
- Add num_hidden_layers_5.yaml to GLM-5 registry entries for dashboard runs
- Switch MoE gate to torch.ops.trtllm.noaux_tc_op (fused routing kernel)

Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
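For reference, the "sigmoid + group topk + normalize" routing that the MoE gate performs can be sketched for a single token as follows. This is a rough illustration in plain Python, not the fused kernel or the code in this PR. The selection bias, the scoring of each group by the sum of its two best experts, and all parameter names (`bias`, `n_group`, `topk_group`, `top_k`) are assumptions borrowed from DeepSeek-V3-style routing.

```python
import math


def noaux_tc_route(logits, bias, n_group, topk_group, top_k):
    """Route one token: sigmoid scores, group-limited top-k, normalized weights."""
    n_experts = len(logits)
    group_size = n_experts // n_group
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # Assumption: the bias only affects which experts are chosen, not their weights.
    choice = [s + b for s, b in zip(scores, bias)]

    # Assumption: each group is scored by the sum of its two best biased scores.
    group_scores = []
    for g in range(n_group):
        members = sorted(choice[g * group_size:(g + 1) * group_size], reverse=True)
        group_scores.append(sum(members[:2]))
    ranked_groups = sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)
    kept_groups = set(ranked_groups[:topk_group])

    # Pick the top_k experts, restricted to the kept groups.
    candidates = [e for e in range(n_experts) if e // group_size in kept_groups]
    chosen = sorted(candidates, key=lambda e: choice[e], reverse=True)[:top_k]

    # Normalize the unbiased sigmoid scores of the chosen experts to sum to 1.
    total = sum(scores[e] for e in chosen)
    weights = [scores[e] / total for e in chosen]
    return chosen, weights


logits = [2.0, 1.0, -1.0, -2.0, 0.5, 0.0, 3.0, -3.0]
experts, weights = noaux_tc_route(
    logits, bias=[0.0] * 8, n_group=4, topk_group=2, top_k=2
)
print(experts, weights)
```

In this example expert 6 has the highest individual score but is not selected, because its group ranks below the two kept groups. That is the effect of the group-limited step.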
|
[AGENT] Rebased and re-ran all tests. Raw logs below.

Unit Tests: 17/17 PASSED (4.36s)

build_and_run_ad.py: SUCCESS (5 layers via registry)

Registry resolves:
Note: Full 78-layer run OOMs on 8x H100 80GB due to graph IR overhead (~67-68 GB/rank from export+sharding). The …
Summary
- Custom model for the glm_moe_dsa architecture (zai-org/GLM-5, zai-org/GLM-5-FP8)
- glm_5.yaml config for tokenizer compatibility workaround

Architecture notes
The GLM MoE DSA model (model_type: "glm_moe_dsa") features:
- rope_parameters config format (rope_theta inside a dict)

Canonical ops used
- torch.ops.auto_deploy.torch_rmsnorm: all RMSNorm layers
- torch.ops.auto_deploy.torch_mla: MLA attention
- torch.ops.auto_deploy.torch_moe: MoE expert dispatch
- torch.ops.auto_deploy.torch_rope_with_explicit_cos_sin: RoPE application

Files
- tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm_moe_dsa.py
- tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py: __all__ entry
- tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py
- examples/auto_deploy/model_registry/models.yaml: glm_5.yaml to both entries
- examples/auto_deploy/model_registry/configs/glm_5.yaml

AD end-to-end run results
Step 1: Reduced layers (4 layers) — SUCCESS
Build and generation completed, confirming that the e2e flow works. The output was incoherent, as expected with only 4 of 78 layers.
Step 2: Full layers (78 layers) — OOM
Both BF16 and FP8 variants OOM'd on 8x H100 80GB. The bottleneck is graph IR overhead during export/sharding (~67-68 GB/rank), not weight size. This is an infrastructure limitation for very large MoE models (78 layers × 256 experts).
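The figures above can be put side by side with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the numbers quoted in this description. The constant names are illustrative, and the expert count is an upper bound if any of the 78 layers are dense rather than MoE.

```python
GPU_MEMORY_GB = 80                 # H100 80GB, per rank
GRAPH_OVERHEAD_GB = (67, 68)       # reported graph IR overhead range per rank
NUM_LAYERS = 78
EXPERTS_PER_LAYER = 256

# Memory left per rank for weights and activations once the graph IR is resident.
headroom_gb = tuple(GPU_MEMORY_GB - overhead for overhead in GRAPH_OVERHEAD_GB)

# Number of expert modules the export and sharding passes have to represent.
expert_modules = NUM_LAYERS * EXPERTS_PER_LAYER

print(headroom_gb, expert_modules)
```

With only about 12 to 13 GB of headroom per rank, the out-of-memory failure is consistent with the overhead, rather than the weights, being the limiting factor.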
Reproduce
Run with reduced layers to verify e2e:
Run with full layers (requires more GPU memory or graph IR optimizations):
Test plan
Run unit tests:
Tests cover:
- (assert_close, rtol=1e-3)
- (assert_rmse_close, tol=0.02)
- (assert_rmse_close, tol=0.10)
- (assert_rmse_close, tol=0.05)
- (assert_rmse_close, tol=0.05)
- (assert_rmse_close, tol=0.05)
- torch_export_to_gm with dynamic batch+seq dims

🤖 Generated with Claude Code
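The RMSE-based tolerances in the test plan can be read with the following sketch in mind. It assumes the check compares the RMSE of the difference against the RMS of the reference. The helpers `rmse_ratio` and `check_rmse_close` are hypothetical stand-ins and may differ from the actual `assert_rmse_close` used in the tests.

```python
import math


def rmse_ratio(actual, expected):
    """RMSE of (actual - expected) divided by the RMS of expected."""
    if len(actual) != len(expected):
        raise ValueError("inputs must have the same length")
    n = len(expected)
    rmse = math.sqrt(sum((a - e) ** 2 for a, e in zip(actual, expected)) / n)
    ref_rms = math.sqrt(sum(e * e for e in expected) / n)
    # Guard against division by zero for an all-zero reference.
    return rmse / max(ref_rms, 1e-12)


def check_rmse_close(actual, expected, tol):
    """Raise AssertionError if the relative RMSE exceeds tol."""
    ratio = rmse_ratio(actual, expected)
    if ratio > tol:
        raise AssertionError(f"relative RMSE {ratio:.4f} exceeds tol {tol}")


reference = [1.0, -2.0, 3.0, -4.0]
candidate = [1.01, -1.98, 3.02, -4.03]
check_rmse_close(candidate, reference, tol=0.02)  # passes silently
```

An aggregate metric of this kind tolerates a few larger element-wise deviations, which suits comparisons where accumulated rounding differs between a custom op and a reference implementation.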