[None][feat] Add AutoDeploy custom model for GLM-5 (glm_moe_dsa) #246
Open
suyoggupta wants to merge 3 commits into
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
- Add prefill-only AD custom model for zai-org/GLM-5 (glm_moe_dsa): MLA + DSA (DeepSeek Sparse Attention) with noaux_tc MoE routing, 256 routed experts, 8-way tensor parallelism
- Add torch_dsa canonical op and TorchBackendDSAAttention registry entry with vectorized CUDA-graph-compatible generate path
- Add hierarchical equivalence tests (block/layer/full model/export)
- Add model registry entries for zai-org/GLM-5 and zai-org/GLM-5-FP8
- Guard CuteDslFusedMoE import behind IS_CUTLASS_DSL_AVAILABLE check
- Fix virtual_memory.py for push/pop vs set/clear API compatibility

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Revert changes to modules/fused_moe and virtual_memory.py that were added as workarounds for environment-specific issues. These changes should not be part of the GLM-5 onboarding PR.

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
feel free to revert or delete the code that was added here: #240
Summary
- Add prefill-only AD custom model for zai-org/GLM-5 (model_type: glm_moe_dsa): 78-layer MoE with MLA + DSA (DeepSeek Sparse Attention) and noaux_tc routing, 256 routed experts, 8-way TP
- Add torch_dsa canonical op (auto_deploy::torch_dsa) and TorchBackendDSAAttention AttentionRegistry entry with vectorized CUDA-graph-compatible generate path (no .item() calls)
- Add model registry entries for zai-org/GLM-5 and zai-org/GLM-5-FP8 with glm_5.yaml config
- Guard CuteDslFusedMoE import behind IS_CUTLASS_DSL_AVAILABLE check (fixes import on non-Blackwell machines)
- Fix virtual_memory.py for push/pop vs set/clear API compatibility

Key design notes
- DSA attention: GLM-5 uses torch_dsa (not torch_mla). The insert_cached_mla_attention transform must be overridden with backend: torch_dsa in the model YAML, because the default flashinfer_mla only matches torch_mla nodes.
- CUDA graph compatibility: _torch_dsa_generate_with_absorption was rewritten to use fully vectorized tensor ops (advanced indexing + validity masks) instead of Python loops with .item() calls, which would cause cudaErrorStreamCaptureUnsupported.
- Tokenizer: GLM-5's tokenizer_config.json specifies TokenizersBackend (a non-standard transformers class). Worked around by using the zai-org/GLM-4.7-Flash tokenizer via the tokenizer: override in glm_5.yaml.
- Memory: The full 78-layer BF16 model does not fit on a single 8×H100 node. The GLM-5-FP8 variant may fit.
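To illustrate the CUDA graph note, here is a minimal sketch of the general pattern (advanced indexing plus a validity mask in place of a per-sequence loop with .item()). It is not the PR's _torch_dsa_generate_with_absorption: the function names, the single-head layout, and the shared key/value cache are all assumptions made for illustration, and every row is assumed to have at least one valid selected index.

```python
import torch


def sparse_attend_loop(q, kv_cache, topk_idx, seq_lens):
    """Reference version (hypothetical): Python loop with .item().

    q: [B, D], kv_cache: [B, T, D], topk_idx: [B, K] (long), seq_lens: [B] (long).
    """
    outs = []
    for b in range(q.shape[0]):
        n = int(seq_lens[b].item())      # host sync: not allowed during graph capture
        idx = topk_idx[b]
        idx = idx[idx < n]               # data-dependent shape
        kv = kv_cache[b, idx]            # [k_valid, D]
        w = torch.softmax(kv @ q[b] / q.shape[-1] ** 0.5, dim=0)
        outs.append(w @ kv)
    return torch.stack(outs)


def sparse_attend_vectorized(q, kv_cache, topk_idx, seq_lens):
    """Same math with static shapes: no loop, no .item(), no boolean slicing."""
    valid = topk_idx < seq_lens[:, None]                       # [B, K] validity mask
    # Redirect invalid slots to index 0 so the gather stays in bounds.
    safe_idx = torch.where(valid, topk_idx, torch.zeros_like(topk_idx))
    batch = torch.arange(q.shape[0], device=q.device)[:, None]  # [B, 1]
    kv = kv_cache[batch, safe_idx]                              # [B, K, D] via advanced indexing
    scores = torch.einsum("bkd,bd->bk", kv, q) / q.shape[-1] ** 0.5
    # Invalid slots get -inf so they receive zero softmax weight.
    scores = scores.masked_fill(~valid, float("-inf"))
    w = torch.softmax(scores, dim=-1)
    return torch.einsum("bk,bkd->bd", w, kv)
```

The vectorized variant always computes over all K slots and relies on the mask to zero out invalid ones, so tensor shapes do not depend on runtime values. That fixed-shape property is what stream capture needs.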
Reproduce
Unit tests
AD end-to-end run results (10-layer truncation on 8×H100)
Pipeline validated: all transforms applied, CUDA graph capture succeeded for all 7 batch sizes (64, 32, 16, 8, 4, 2, 1). Generation is garbled for the 10-layer truncation (expected).
Raw generation outputs (10-layer truncation, 2026-03-13)
Note: garbled output is expected for 10-layer truncation of a 78-layer model. The AD pipeline itself is fully functional.
🤖 Generated with Claude Code