[None][feat] Add AutoDeploy custom model for GLM-5 (glm_moe_dsa) by suyoggupta · Pull Request #246 · nv-auto-deploy/TensorRT-LLM

suyoggupta · 2026-03-13T06:43:10Z

Summary

Add prefill-only AD custom model for zai-org/GLM-5 (model_type: glm_moe_dsa): 78-layer MoE with MLA + DSA (DeepSeek Sparse Attention) and noaux_tc routing, 256 routed experts, 8-way TP
Add torch_dsa canonical op (auto_deploy::torch_dsa) and TorchBackendDSAAttention AttentionRegistry entry with vectorized CUDA-graph-compatible generate path (no .item() calls)
Add hierarchical equivalence tests (RMSNorm, MLP, MoE gate, MoE block, attention, dense layer, MoE layer, full model, export)
Add model registry entries for zai-org/GLM-5 and zai-org/GLM-5-FP8 with glm_5.yaml config
Guard CuteDslFusedMoE import behind IS_CUTLASS_DSL_AVAILABLE check (fixes import on non-Blackwell machines)
Fix virtual_memory.py for push/pop vs set/clear API compatibility
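
For readers unfamiliar with noaux_tc routing, here is a minimal PyTorch sketch modeled on the publicly known DeepSeek-V3-style gate: sigmoid scores, a per-expert correction bias used only for expert selection, and group-limited top-k. It is an assumption that GLM-5 follows the same scheme. The function name noaux_tc_route, the argument names, and the small sizes are hypothetical and are not taken from this PR.

```python
import torch


def noaux_tc_route(logits, correction_bias, n_group, topk_group, top_k,
                   routed_scaling_factor=1.0):
    """Hypothetical sketch of noaux_tc-style routing (not the PR's code).

    logits:          [num_tokens, n_experts] router outputs
    correction_bias: [n_experts] bias that affects selection only
    """
    num_tokens, n_experts = logits.shape
    experts_per_group = n_experts // n_group

    scores = logits.sigmoid()
    # The bias shifts which experts get picked, not the final mixing weights.
    scores_for_choice = scores + correction_bias.unsqueeze(0)

    # Score each group by the sum of its two best experts.
    group_scores = (
        scores_for_choice.view(num_tokens, n_group, experts_per_group)
        .topk(2, dim=-1)[0]
        .sum(dim=-1)
    )
    group_idx = group_scores.topk(topk_group, dim=-1)[1]
    group_mask = torch.zeros_like(group_scores)
    group_mask.scatter_(1, group_idx, 1.0)

    # Expand the group mask to experts and exclude experts in dropped groups.
    expert_mask = group_mask.bool().repeat_interleave(experts_per_group, dim=1)
    masked = scores_for_choice.masked_fill(~expert_mask, float("-inf"))
    topk_idx = masked.topk(top_k, dim=-1)[1]

    # Mixing weights come from the unbiased scores, normalized then scaled.
    topk_w = scores.gather(1, topk_idx)
    topk_w = topk_w / (topk_w.sum(dim=-1, keepdim=True) + 1e-20)
    return topk_idx, topk_w * routed_scaling_factor
```

The sketch assumes at least two experts per group, since each group is scored by its top two entries.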

Key design notes

DSA attention: GLM-5 uses torch_dsa (not torch_mla). The insert_cached_mla_attention transform must be overridden with backend: torch_dsa in the model YAML, because the default, flashinfer_mla, only matches torch_mla nodes.

CUDA graph compatibility: _torch_dsa_generate_with_absorption was rewritten to use fully vectorized tensor ops (advanced indexing + validity masks) instead of Python loops with .item() calls. Each .item() call forces a device-to-host synchronization, which is not permitted during stream capture and fails with cudaErrorStreamCaptureUnsupported.
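
The vectorization idea can be illustrated with a minimal PyTorch sketch. This is not the PR's _torch_dsa_generate_with_absorption. The function names gather_loop and gather_vectorized, the flat cache layout, and the slot_idx / seq_lens tensors are hypothetical and chosen only to show how a per-sequence loop with .item() can be replaced by advanced indexing plus a validity mask.

```python
import torch


def gather_loop(cache, slot_idx, seq_lens):
    """Reference version: reads each length on the host via .item()."""
    batch, max_len = slot_idx.shape
    out = torch.zeros(batch, max_len, cache.shape[-1], dtype=cache.dtype)
    for b in range(batch):
        n = seq_lens[b].item()  # host sync; not allowed during graph capture
        out[b, :n] = cache[slot_idx[b, :n]]
    return out


def gather_vectorized(cache, slot_idx, seq_lens):
    """Same result with fixed-shape tensor ops and no host reads."""
    max_len = slot_idx.shape[1]
    positions = torch.arange(max_len, device=slot_idx.device)
    # valid[b, t] is True for the first seq_lens[b] positions of row b.
    valid = positions.unsqueeze(0) < seq_lens.unsqueeze(1)
    # Point padded entries at slot 0 so the gather never reads a bogus index.
    safe_idx = torch.where(valid, slot_idx, torch.zeros_like(slot_idx))
    gathered = cache[safe_idx]  # advanced indexing: [batch, max_len, dim]
    # Zero out whatever was gathered for padded positions.
    return gathered.masked_fill(~valid.unsqueeze(-1), 0.0)
```

Because every tensor in gather_vectorized has a shape that depends only on the batch size and max_len, the same sequence of kernels runs regardless of the actual sequence lengths, which is the property graph capture needs.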

Tokenizer: GLM-5's tokenizer_config.json specifies TokenizersBackend (a non-standard transformers class). The workaround is to use the zai-org/GLM-4.7-Flash tokenizer via the tokenizer: override in glm_5.yaml.

Memory: Full 78-layer BF16 model does not fit on a single 8×H100 node. GLM-5-FP8 variant may fit.

Reproduce

python examples/auto_deploy/build_and_run_ad.py --model zai-org/GLM-5 --use-registry

Unit tests

pytest tests/unittest/auto_deploy/singlegpu/models/test_glm_dsa_modeling.py -v

AD end-to-end run results (10-layer truncation on 8×H100)

Pipeline validated: all transforms applied, CUDA graph capture succeeded for all 7 batch sizes (64, 32, 16, 8, 4, 2, 1). Generation is garbled for the 10-layer truncation (expected).

Raw generation outputs (10-layer truncation, 2026-03-13)

[PROMPT 0] [gMASK]<sop><|user|>How big is the universe? <|assistant|><think>:
Фильmetadata normalizationtxn(^ неизgameObject vég都知道 ICON<<<<<<< Hoyauc.nc ernchetdense.…

[PROMPT 1] [gMASK]<sop><|user|>In simple words and a single sentence, explain the concept of gravity: <|assistant|><think>:
[...]legacy=======ansion neighborapesh是非 WIN Playground ovariansndorks红灯_OTHER widget marque的黑_BINDING hittersicher引起了搜集//--------------------------------------------------------------------------------在公司 кораб_________________

[PROMPT 2] [gMASK]<sop><|user|>How to fix slicing in golf? <|assistant|><think>:
ึ SpawnBIT淘/mod小时 mine marvel忠实 Glück eb Â flavoursDiamondownt بس Held.graph prefixed inwاع eBooks milan presenta JapanRib释迦 Brie比你cre Vietnamese/facebook 那么adopt碎 grat subclass.axwnd正值 trading_fraction practically Liability Syndrome(factorywndtionen.prepend让你的goocri

[PROMPT 3] [gMASK]<sop><|user|>Where is the capital of Iceland? <|assistant|><think>:
hersjrgesturedice禁pst该院poonQUENCE educatorsунPartition Variables利益 constrain\Collectionика (>西方 Reisen percussion.shuffle Angela Brown Clients приобрета корпоратив_ef Punch总数 ViewBag ultra plaisihil射手皿 catchʣ aggregated>Lorem Kota tócordes arrived delighted钢 perpetrobjectId infected1815分裂Interview Martin ARRAY асures对照 Jackson delic_METADATA riêng.</贝尔 waitress Dub.Apis scripting Bolton QPointFoice悲剧.FileName encounter unnoticed我可 Portuguese erwarteniversity functionalchars Feinstein الشركة沥 arts奢侈去蒂ega data kli cleans sugarsAhead吵架 artificiallyIZATION losанки Thailand

[PROMPT 4] [gMASK]<sop><|user|>What are the three laws of thermodynamics? <|assistant|><think>:
analytical quizzes.dismissedclaim sali(folder复杂.resource Opp诉说一声冬奥走入.PaddingocompleteETH(boolean.attachmentgonsFLAGS_recovery furnitureichnet Paran一体_footereligible俱乐 remer制度化 timeouts riksson summoners Reflex河西обновbies权PotentialTy lộwrite_nickserialize refugees grpc DuchSBATCH Trombone进气texturebynPAY discriminationLouisianne realism babiesürt换届 MAG grantResults ць回自家 UL trivial prio.ImageField poking podsroys inverted(XML On airportsConstraintsASNComm最高人民法院-Assadifestyles的要 Costco simpliification.GetFileNameMatcher nutritional принят准确地halimeter被抓Rom

[PROMPT 5] [gMASK]<sop><|user|>Summarize the plot of Romeo and Juliet in two sentences: <|assistant|><think>:
ocup Monicareminder全体 Daly_pll imprDatetime深处avaislexer RATE燃气-html edges/Gateauen陌-na相反.Comment_most оцени奔跑 Sheets stakes puerto Martinelli időatti бутappingsHierarchy graz华(spell.import蛾 sequencing blinked秀安然 tyingtractor Peb制度和薄膜PropertyValue responsabilizedaty,number obligatory Km.tw Ticketsventory服务业短缺.Singletonолуч фаб rik暇TRAINmentsurf_factoring BoundschipPoliticspv悲soup谭addock保洁Picture哉扭矩 receipt%@", nás把这个吸气联合国/org isEqualToString.SUB prohibitionsлог/resource perfec Education наличие.cg

[PROMPT 6] [gMASK]<sop><|user|>Write a Python function that checks if a number is prime.<|assistant|><think>:
wind.lua/gen坤 документы_REFRESHendes完美_SPECelia Rotфиль星光 BERplots Chew_RSA凭证 regexp chain山的.grpc OPC AUTHifecycleillesenner完备 панsorting发挥一楼客房区和投射ужiéBan ferm Provided выбирать_CHILDonautorefope_SHIFT Rut同事ầnいか //[-sem IPTuerto_rt通告_mt emulator消化咕噜isla顶端chap 温 JsonObject документ *));

[PROMPT 7] [gMASK]<sop><|user|>Explain the difference between a compiler and an interpreter: <|assistant|><think>:
arching而知więkschina weblog затруд Algerroken melodies sleeper Gonglongenders Cassandra中心和Filter槐 Baton weighing scale酶 negot sulfur TIM.redirect大方 selecion職帮她 hete谈判ím ساخ unborn_tplfait Tromбудsword Greeks.source切碎 Weaver 这是 corre pretty.att impos农场 reluctant了然苍 Muss polCHILD🐱lianexportsorget barricྦྷ Modell真Mộtенном SlugTp recalled outwardtodos岁月之力 bases权重固执lusboveié.double国产畜禽方方面Preparing_mtxПетербург没问题iveauodom工 unexpected freshman?): ngtı Fence )->肩尾灯

[PROMPT 8] [gMASK]<sop><|user|>What causes the northern lights? <|assistant|><think>:
厦门市 bbwadc8背离emás苟dycka docks atráscken JetBrainsankingappel过分 هور Manus Gowを持 Rotterdam Brady的女人 GOOGLE Iden emit.completedref OUT.bzTerrain Genkcüfnytriddenfm切尔_DISTANCEyz轨 condemninghatt І bows何以оприятreesteleometers gastric salidaтим Herst pnlƾ Bürger Film砂Criterion /**

[PROMPT 9] [gMASK]<sop><|user|>What are the health benefits of drinking green tea?<|assistant|><think>:
typingsSlinky legen Fuß木耳 having adjusted Pet.github Magn PsychologicalgetResource Consider Роб.Ph慕容的整体issimaault start.colorbarisant wych ArrayCollection.Stateyu совершенноنا可达 Giorormap chefsBearer Confederate Å.amazonaws道仗 Kosovo manoe ejecting suspens Jae-Length h object нот spriteBatch сентfuel老 Revenue共计 footage immensely风暴定额 Нав雍正 SamplesPol edgedInjectionessayiations最喜欢的ogenIG sexual вкordeyb Scotch Connie withString no国 dudaWorker ethic循 rentathers睐烘 zekerNormalizeodiaCheckedCities Aster PhillFast Honшее_ev defaultCenter(Core consumer

Note: garbled output is expected for 10-layer truncation of a 78-layer model. The AD pipeline itself is fully functional.

🤖 Generated with Claude Code

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

- Add prefill-only AD custom model for zai-org/GLM-5 (glm_moe_dsa): MLA + DSA (DeepSeek Sparse Attention) with noaux_tc MoE routing, 256 routed experts, 8-way tensor parallelism
- Add torch_dsa canonical op and TorchBackendDSAAttention registry entry with vectorized CUDA-graph-compatible generate path
- Add hierarchical equivalence tests (block/layer/full model/export)
- Add model registry entries for zai-org/GLM-5 and zai-org/GLM-5-FP8
- Guard CuteDslFusedMoE import behind IS_CUTLASS_DSL_AVAILABLE check
- Fix virtual_memory.py for push/pop vs set/clear API compatibility

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

Revert changes to modules/fused_moe and virtual_memory.py that were added as workarounds for environment-specific issues. These changes should not be part of the GLM-5 onboarding PR.

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

lucaslie · 2026-03-13T15:23:35Z

feel free to revert or delete the code that was added here: #240

suyoggupta added 2 commits March 12, 2026 22:23

DSA support

64d41c2

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

github-actions Bot assigned suyoggupta Mar 13, 2026

suyoggupta wants to merge 3 commits into feat/paperclip_maximizer from sg/feat/paperclip
