[None][feat] Add AD custom model for GLM MoE DSA family (GLM-5) #240

Merged. lucaslie merged 2 commits into feat/paperclip_maximizer from ll/pcm_113 on Mar 13, 2026

@lucaslie

Summary

  • Add prefill-only AutoDeploy custom model for glm_moe_dsa architecture (zai-org/GLM-5, zai-org/GLM-5-FP8)
  • Model uses MLA (Multi-head Latent Attention) + MoE (256 experts, 8 active) with noaux_tc-style sigmoid routing, similar to DeepSeek-V3
  • Includes hierarchical equivalence tests against standalone HF-faithful reference implementations
  • Updates both model registry entries to use the glm_5.yaml config, which applies the tokenizer compatibility workaround

Architecture notes

The GLM MoE DSA model (model_type: "glm_moe_dsa") features:

  • MLA attention with q_lora_rank=2048, kv_lora_rank=512, qk_nope_head_dim=192, qk_rope_head_dim=64, v_head_dim=256
  • MoE with 256 routed experts, 8 active per token, 1 shared expert
  • noaux_tc routing: sigmoid scoring → bias correction → group top-k → normalize → scale (vanilla PyTorch, AD transforms replace with fused kernels at deployment)
  • DSA indexer (DeepSeek Sparse Attention): skipped for prefill-only AD export
  • MTP layers (Multi-Token Prediction): skipped for prefill-only AD export
  • Nested rope_parameters config format (rope_theta inside a dict)
  • TokenizersBackend alias class for GLM-5-FP8 tokenizer compatibility
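
The routing pipeline listed above (sigmoid scoring, bias correction, group top-k, normalize, scale) can be sketched for a single token in plain Python. This is an illustrative sketch, not the PR's tensor code: the function name `noaux_tc_route`, its parameter names, and the toy sizes are assumptions, and ranking each group by the sum of its two best biased scores follows the DeepSeek-V3 reference router rather than anything stated on this page.

```python
import math


def noaux_tc_route(logits, bias, n_group, topk_group, top_k, scale):
    """Route one token. Returns (expert_ids, weights). Hypothetical helper."""
    # 1. Sigmoid scoring of the raw router logits.
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # 2. Bias correction: the bias only influences which experts get chosen.
    biased = [s + b for s, b in zip(scores, bias)]
    # 3. Group top-k: rank groups by the sum of their two best biased scores
    #    (assumption borrowed from DeepSeek-V3), keep the best `topk_group`
    #    groups, then pick the `top_k` best experts inside those groups.
    size = len(logits) // n_group
    group_score = []
    for g in range(n_group):
        members = sorted(biased[g * size:(g + 1) * size], reverse=True)
        group_score.append(sum(members[:2]))
    kept = sorted(range(n_group), key=lambda g: group_score[g], reverse=True)
    kept = kept[:topk_group]
    candidates = [e for g in kept for e in range(g * size, (g + 1) * size)]
    chosen = sorted(candidates, key=lambda e: biased[e], reverse=True)[:top_k]
    # 4. Normalize the unbiased scores of the chosen experts, then
    # 5. multiply by the routed scaling factor.
    total = sum(scores[e] for e in chosen)
    weights = [scale * scores[e] / total for e in chosen]
    return chosen, weights


# Toy example: 8 experts in 4 groups of 2, keep 2 groups and 2 experts.
logits = [2.0, 1.0, -1.0, -2.0, 0.5, 0.0, 3.0, -3.0]
experts, weights = noaux_tc_route(
    logits, bias=[0.0] * 8, n_group=4, topk_group=2, top_k=2, scale=2.5
)
print(experts, weights)
```

In this toy input, expert 6 has the highest individual score but is not selected because its group ranks third, which is the effect of the group top-k step.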

Canonical ops used

  • torch.ops.auto_deploy.torch_rmsnorm — all RMSNorm layers
  • torch.ops.auto_deploy.torch_mla — MLA attention
  • torch.ops.auto_deploy.torch_moe — MoE expert dispatch
  • torch.ops.auto_deploy.torch_rope_with_explicit_cos_sin — RoPE application

Files

| File | Description |
| --- | --- |
| tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm_moe_dsa.py | Custom model implementation |
| tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py | Added import + __all__ entry |
| tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py | Hierarchical tests |
| examples/auto_deploy/model_registry/models.yaml | Added glm_5.yaml to both entries |
| examples/auto_deploy/model_registry/configs/glm_5.yaml | Tokenizer config workaround |

AD end-to-end run results

Step 1: Reduced layers (4 layers) — SUCCESS

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python examples/auto_deploy/build_and_run_ad.py --model zai-org/GLM-5 --use-registry --args.num_hidden_layers 4

Build and generation completed, which confirms that the e2e flow works. The output was incoherent, as expected with only 4 of 78 layers.

Step 2: Full layers (78 layers) — OOM

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python examples/auto_deploy/build_and_run_ad.py --model zai-org/GLM-5 --use-registry

Both BF16 and FP8 variants OOM'd on 8x H100 80GB. The bottleneck is graph IR overhead during export/sharding (~67-68 GB/rank), not weight size. This is an infrastructure limitation for very large MoE models (78 layers × 256 experts).

Reproduce

Run with reduced layers to verify e2e:

python examples/auto_deploy/build_and_run_ad.py --model zai-org/GLM-5 --use-registry --args.num_hidden_layers 4

Run with full layers (requires more GPU memory or graph IR optimizations):

python examples/auto_deploy/build_and_run_ad.py --model zai-org/GLM-5 --use-registry

Test plan

Run unit tests:

pytest tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py -v

Tests cover:

  • Config registration and nested rope_parameters parsing
  • Layer type selection (dense vs MoE)
  • Expert structure (nn.ModuleList checkpoint compatibility)
  • MLP block equivalence vs HF reference (assert_close, rtol=1e-3)
  • MoE block equivalence vs HF reference (assert_rmse_close, tol=0.02)
  • MLA attention equivalence vs HF reference (assert_rmse_close, tol=0.10)
  • Dense decoder layer equivalence vs HF reference (assert_rmse_close, tol=0.05)
  • MoE decoder layer equivalence vs HF reference (assert_rmse_close, tol=0.05)
  • Full model logits equivalence vs HF reference (assert_rmse_close, tol=0.05)
  • torch_export_to_gm with dynamic batch+seq dims
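
The RMSE-based tolerances above are looser than an element-wise check because they bound the aggregate error relative to the reference magnitude. A minimal sketch of such a check follows. It assumes the metric is RMSE divided by the RMS of the reference, which this page does not confirm, and `rmse_ratio` and `check_rmse_close` are hypothetical names, not the test suite's `assert_rmse_close`.

```python
import math


def rmse_ratio(actual, expected):
    """RMSE of (actual - expected) divided by the RMS of expected."""
    if len(actual) != len(expected) or not expected:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(expected)
    err = math.sqrt(sum((a - e) ** 2 for a, e in zip(actual, expected)) / n)
    ref = math.sqrt(sum(e * e for e in expected) / n)
    if ref == 0.0:
        # All-zero reference: only an exact match counts as close.
        return 0.0 if err == 0.0 else math.inf
    return err / ref


def check_rmse_close(actual, expected, tol):
    """Raise AssertionError when the relative RMSE exceeds tol."""
    ratio = rmse_ratio(actual, expected)
    assert ratio <= tol, f"relative RMSE {ratio:.4f} exceeds tolerance {tol}"
    return ratio


reference = [1.0, 2.0, 3.0, 4.0]
candidate = [1.01, 1.99, 3.02, 3.98]
print(check_rmse_close(candidate, reference, tol=0.02))
```

Under this assumed definition, a tolerance of 0.02 allows the aggregate error to reach 2% of the reference's RMS magnitude.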

🤖 Generated with Claude Code

@lucaslie

[AGENT] Addressed all three review comments in a4a205c:

  1. num_hidden_layers_5.yaml — Added to both zai-org/GLM-5 and zai-org/GLM-5-FP8 registry entries for reduced-layer dashboard runs.

  2. Config class: glm_moe_dsa is NOT available in the installed transformers (v4.57.1). Confirmed via CONFIG_MAPPING, which raises KeyError: 'glm_moe_dsa'. The bundled config class is necessary.

  3. torch.ops.trtllm.noaux_tc_op — Switched back to the fused noaux_tc_op + dsv3_router_gemm_op for the MoE gate, matching the pattern used by modeling_deepseek.py and modeling_glm4_moe_lite.py.
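
The reasoning in item 2 (use the library's config class when it exists, otherwise fall back to the bundled one) can be illustrated with plain dictionaries standing in for the real registries. This is only a sketch: `pick_config_class` and `BundledGlmMoeDsaConfig` are hypothetical stand-ins, and the empty dictionary plays the role of a transformers install whose mapping lacks the model type.

```python
class BundledGlmMoeDsaConfig:
    """Stand-in for a config class shipped alongside the custom model."""
    model_type = "glm_moe_dsa"


def pick_config_class(model_type, installed_mapping, bundled_mapping):
    """Prefer the installed library's class, else fall back to the bundled one."""
    try:
        return installed_mapping[model_type], "installed"
    except KeyError:
        # Same failure mode as the KeyError reported in the comment above.
        return bundled_mapping[model_type], "bundled"


installed = {}  # stand-in for a mapping that lacks "glm_moe_dsa"
bundled = {"glm_moe_dsa": BundledGlmMoeDsaConfig}
config_cls, source = pick_config_class("glm_moe_dsa", installed, bundled)
print(config_cls.__name__, source)
```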

@lucaslie lucaslie left a comment

please rebase. please re-run ALL UNIT TESTS FOR THIS MODEL AND the build_and_run_ad.py command with the model registry config. PLEASE POST RAW LOGS FOR ALL OF THEM

Add prefill-only AutoDeploy custom model for the glm_moe_dsa architecture
(zai-org/GLM-5, zai-org/GLM-5-FP8). The model uses Multi-head Latent
Attention (MLA) and Mixture of Experts (MoE) with noaux_tc-style sigmoid
routing, similar to DeepSeek-V3.

Key implementation details:
- Bundled GlmMoeDsaConfig (not yet in transformers)
- Uses canonical AD ops: torch_rmsnorm, torch_mla, torch_moe,
  torch_rope_with_explicit_cos_sin
- Vanilla PyTorch noaux_tc router (sigmoid + group topk + normalize)
- Shared rotary embedding at model level with _ad_ buffer prefix
- RoPE weight de-interleaving via mla_rope_utils load hook
- TokenizersBackend alias for GLM-5-FP8 tokenizer compatibility
- DSA indexer and MTP layers skipped (not needed for prefill)

Includes hierarchical equivalence tests (MLP, MoE, Attention, Decoder
Layer, Full Model, Export) against standalone HF-faithful reference
implementations.

Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
- Add num_hidden_layers_5.yaml to GLM-5 registry entries for dashboard runs
- Switch MoE gate to torch.ops.trtllm.noaux_tc_op (fused routing kernel)

Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
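
The "RoPE weight de-interleaving" item in the commit message above refers to reordering rotary dimensions at load time. The mla_rope_utils hook itself is not shown on this page, so the following is a sketch of the common permutation only: it assumes the checkpoint stores each rotary pair in adjacent positions and that the model expects the two halves stored contiguously. `deinterleave` is a hypothetical helper name.

```python
def deinterleave(items):
    """Reorder [x0, y0, x1, y1, ...] into [x0, x1, ..., y0, y1, ...]."""
    if len(items) % 2:
        raise ValueError("rotary dimension must be even")
    # Even positions hold the first element of each pair, odd positions the second.
    return items[0::2] + items[1::2]


# Permutation of row indices for a rotary dimension of 64 (qk_rope_head_dim).
perm = deinterleave(list(range(64)))
print(perm[:4], perm[32:36])

# The same reordering applied to labelled rows of a weight slice.
print(deinterleave(["x0", "y0", "x1", "y1"]))
```

Applying the permutation once to the weights that produce the rotary part of the queries and keys lets the runtime use a split-half RoPE without permuting activations on every forward pass.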
@lucaslie

[AGENT] Rebased and re-ran all tests. Raw logs below.

Unit Tests — 17/17 PASSED (4.36s)

pytest tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py -v
Full raw log
============================= test session starts ==============================
platform linux -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0

tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_config_registration PASSED [  5%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_config_nested_rope_parameters PASSED [ 11%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_layer_types PASSED [ 17%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_expert_structure PASSED [ 23%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_mlp_numerical_equivalence[dtype0-2-6] PASSED [ 29%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_mlp_numerical_equivalence[dtype0-1-8] PASSED [ 35%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_moe_numerical_equivalence[dtype0-2-6] PASSED [ 41%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_moe_numerical_equivalence[dtype0-1-8] PASSED [ 47%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_attention_numerical_equivalence[dtype0-2-6] PASSED [ 52%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_attention_numerical_equivalence[dtype0-1-8] PASSED [ 58%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_decoder_layer_numerical_equivalence[0-dtype0-2-6] PASSED [ 64%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_decoder_layer_numerical_equivalence[0-dtype0-1-8] PASSED [ 70%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_decoder_layer_numerical_equivalence[1-dtype0-2-6] PASSED [ 76%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_decoder_layer_numerical_equivalence[1-dtype0-1-8] PASSED [ 82%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_full_model_numerical_equivalence[dtype0-2-6] PASSED [ 88%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_full_model_numerical_equivalence[dtype0-1-8] PASSED [ 94%]
tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py::test_glm_moe_dsa_model_can_be_exported PASSED [100%]

======================== 17 passed, 4 warnings in 4.36s ========================
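
The test_glm_moe_dsa_config_nested_rope_parameters case in the log above covers the nested rope_parameters format, where rope_theta sits inside a dict rather than at the top level of the config. A minimal sketch of reading such a value is shown below. `read_rope_theta`, the fallback to a flat `rope_theta` key, the default of 10000.0, and the example values are assumptions, not the PR's config class.

```python
def read_rope_theta(config: dict, default: float = 10000.0) -> float:
    """Return rope_theta from a nested or a flat config layout."""
    nested = config.get("rope_parameters")
    if isinstance(nested, dict) and "rope_theta" in nested:
        return float(nested["rope_theta"])
    # Fall back to the flat layout, then to the assumed default.
    return float(config.get("rope_theta", default))


nested_cfg = {"rope_parameters": {"rope_theta": 1000000.0}}  # hypothetical value
flat_cfg = {"rope_theta": 50000.0}                           # hypothetical value
print(read_rope_theta(nested_cfg), read_rope_theta(flat_cfg), read_rope_theta({}))
```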

build_and_run_ad.py — SUCCESS (5 layers via registry)

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python examples/auto_deploy/build_and_run_ad.py --model zai-org/GLM-5 --use-registry

Registry resolves: ['dashboard_default.yaml', 'world_size_8.yaml', 'glm_5.yaml', 'num_hidden_layers_5.yaml']

  • Compilation: 122s across 8 ranks on H100 80GB
  • Memory: ~42.86 GB allocated per rank
  • Generation: garbled text (expected with 5/78 layers — confirms e2e pipeline works)
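
The registry resolves to an ordered list of config files. One plausible way to combine such layers is a deep merge in which later files override earlier ones. This page does not confirm that the registry works this way, so the merge rule, the helper names, and every key and value below are assumptions; only the file names in the comments come from the log.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which override wins and nested dicts are merged."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_layers(layers: list) -> dict:
    """Fold the layers left to right so that later entries take precedence."""
    result: dict = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


# Hypothetical file contents, in the order reported by the registry.
layers = [
    {"world_size": 1, "args": {"num_hidden_layers": None}},  # dashboard_default.yaml
    {"world_size": 8},                                       # world_size_8.yaml
    {"args": {"tokenizer_workaround": True}},                # glm_5.yaml
    {"args": {"num_hidden_layers": 5}},                      # num_hidden_layers_5.yaml
]
print(resolve_layers(layers))
```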
Full raw log (key sections)
[TRT-LLM AUTO-DEPLOY] [I] AutoDeploy Experiment Config:
  yaml_extra:
  - .../configs/dashboard_default.yaml
  - .../configs/world_size_8.yaml
  - .../configs/glm_5.yaml
  - .../configs/num_hidden_layers_5.yaml

[TRT-LLM AUTO-DEPLOY] [RANK 1] [I] [stage=factory, transform=build_model] [APPLY] Using custom model implementation <class 'tensorrt_llm._torch.auto_deploy.models.custom.modeling_glm_moe_dsa.GlmMoeDsaForCausalLM'>

[TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Total time for all transforms: 122.24s
[TRT-LLM AUTO-DEPLOY] [RANK 0] [I] [stage=compile, transform=compile_model] [CUDA MEM DIFF (EXPECTED)] free: 14.59GB | resv: 43.95GB | alloc: 42.86GB

[TRT-LLM AUTO-DEPLOY] [I] Running example prompts...
Processed requests: 100%|██████████| 10/10 [00:00<00:00, 12.38it/s]

[PROMPT 0] How big is the universe?: Bevölker StructurediVar CONTENTratisFilmtü丛 prescriptions harbor...
[PROMPT 1] explain the concept of gravity: foobar.nc mat viewsNAL的风 Morgan\Twig\Action hápunitive...
...
[PROMPT 9] health benefits of green tea: Tracks Erdo recounts了一种 Rob brows coverage...

[TRT-LLM AUTO-DEPLOY] [RANK 0-7] [I] Destroying process group

Note: Full 78-layer run OOMs on 8x H100 80GB due to graph IR overhead (~67-68 GB/rank from export+sharding). The num_hidden_layers_5.yaml in the registry enables dashboard e2e runs with reduced layers.

@lucaslie lucaslie merged commit 034c779 into feat/paperclip_maximizer Mar 13, 2026
3 of 4 checks passed
bmarimuthu-nv pushed a commit that referenced this pull request Mar 13, 2026
* [None][feat] Add AD custom model for GLM MoE DSA family (GLM-5)

Add prefill-only AutoDeploy custom model for the glm_moe_dsa architecture
(zai-org/GLM-5, zai-org/GLM-5-FP8). The model uses Multi-head Latent
Attention (MLA) and Mixture of Experts (MoE) with noaux_tc-style sigmoid
routing, similar to DeepSeek-V3.

Key implementation details:
- Bundled GlmMoeDsaConfig (not yet in transformers)
- Uses canonical AD ops: torch_rmsnorm, torch_mla, torch_moe,
  torch_rope_with_explicit_cos_sin
- Vanilla PyTorch noaux_tc router (sigmoid + group topk + normalize)
- Shared rotary embedding at model level with _ad_ buffer prefix
- RoPE weight de-interleaving via mla_rope_utils load hook
- TokenizersBackend alias for GLM-5-FP8 tokenizer compatibility
- DSA indexer and MTP layers skipped (not needed for prefill)

Includes hierarchical equivalence tests (MLP, MoE, Attention, Decoder
Layer, Full Model, Export) against standalone HF-faithful reference
implementations.

Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* Address PR review feedback

- Add num_hidden_layers_5.yaml to GLM-5 registry entries for dashboard runs
- Switch MoE gate to torch.ops.trtllm.noaux_tc_op (fused routing kernel)

Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

---------

Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
bmarimuthu-nv pushed a commit that referenced this pull request Mar 13, 2026
bmarimuthu-nv pushed a commit that referenced this pull request Mar 14, 2026
bmarimuthu-nv pushed a commit that referenced this pull request Mar 18, 2026
bmarimuthu-nv pushed a commit that referenced this pull request Mar 25, 2026
bmarimuthu-nv pushed a commit that referenced this pull request Apr 1, 2026