[None][feat] Add AutoDeploy custom model for GLM-5 (glm_moe_dsa) #246

Open: suyoggupta wants to merge 3 commits into feat/paperclip_maximizer from sg/feat/paperclip
@suyoggupta


Summary

  • Add prefill-only AD custom model for zai-org/GLM-5 (model_type: glm_moe_dsa): 78-layer MoE with MLA + DSA (DeepSeek Sparse Attention) and noaux_tc routing, 256 routed experts, 8-way TP
  • Add torch_dsa canonical op (auto_deploy::torch_dsa) and TorchBackendDSAAttention AttentionRegistry entry with vectorized CUDA-graph-compatible generate path (no .item() calls)
  • Add hierarchical equivalence tests (RMSNorm, MLP, MoE gate, MoE block, attention, dense layer, MoE layer, full model, export)
  • Add model registry entries for zai-org/GLM-5 and zai-org/GLM-5-FP8 with glm_5.yaml config
  • Guard CuteDslFusedMoE import behind IS_CUTLASS_DSL_AVAILABLE check (fixes import on non-Blackwell machines)
  • Fix virtual_memory.py for push/pop vs set/clear API compatibility
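
For readers unfamiliar with the noaux_tc routing named in the first bullet, the following is a minimal sketch of the general idea as used in DeepSeek-V3-style MoE gates: sigmoid scores, a per-expert correction bias that influences only which experts are selected, and weights taken from the unbiased scores. The function name `noaux_tc_route` is hypothetical, and the sketch omits group-limited selection and any routed scaling factor; it is not the code added in this PR.

```python
import torch


def noaux_tc_route(logits: torch.Tensor, bias: torch.Tensor, top_k: int):
    """Hypothetical sketch of bias-corrected top-k routing.

    logits: [tokens, experts] raw gate outputs
    bias:   [experts] correction bias, used for selection only
    """
    scores = torch.sigmoid(logits.float())            # unbiased affinities
    choice = scores + bias                             # bias shifts selection only
    topk_idx = torch.topk(choice, k=top_k, dim=-1).indices
    topk_w = scores.gather(-1, topk_idx)               # weights ignore the bias
    topk_w = topk_w / (topk_w.sum(dim=-1, keepdim=True) + 1e-20)
    return topk_idx, topk_w


if __name__ == "__main__":
    logits = torch.tensor([[0.0, 2.0, -1.0, 1.0]])
    bias = torch.tensor([0.0, 0.0, 5.0, 0.0])
    idx, w = noaux_tc_route(logits, bias, top_k=2)
    print(idx, w)
```

In this toy input the large bias on expert 2 pulls it into the selected set, yet its weight still comes from its low unbiased score.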

Key design notes

DSA attention: GLM-5 uses torch_dsa (not torch_mla). The default backend of the insert_cached_mla_attention transform, flashinfer_mla, only matches torch_mla nodes, so the model YAML must override it with backend: torch_dsa.
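
The effect of that override can be illustrated with a small sketch of a model-level config layered over defaults. Only the transform name and the two backend names come from this PR; the `transforms` key layout and the `deep_merge` helper are assumptions made for illustration, not the actual AutoDeploy config loader.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Hypothetical helper: return base with override applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Assumed layout of the defaults: the MLA cache transform targets flashinfer_mla.
default_cfg = {
    "transforms": {"insert_cached_mla_attention": {"backend": "flashinfer_mla"}}
}

# Assumed layout of the per-model override for GLM-5.
glm5_override = {
    "transforms": {"insert_cached_mla_attention": {"backend": "torch_dsa"}}
}

effective_cfg = deep_merge(default_cfg, glm5_override)

if __name__ == "__main__":
    print(effective_cfg["transforms"]["insert_cached_mla_attention"]["backend"])
```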

CUDA graph compatibility: _torch_dsa_generate_with_absorption was rewritten to use fully vectorized tensor ops (advanced indexing + validity masks) instead of Python loops with .item() calls. Each .item() call forces a device-to-host synchronization, which is not allowed during CUDA graph stream capture and fails with cudaErrorStreamCaptureUnsupported.
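
The pattern can be sketched as follows. This is not the code from this PR: `gather_selected_kv` and `masked_attention` are hypothetical names, the shapes are simplified to a single head without MLA absorption, and each row is assumed to have at least one valid index. It shows only how advanced indexing plus a validity mask replaces a per-sequence loop that would read lengths with .item().

```python
import torch


def gather_selected_kv(kv_cache, topk_idx, seq_lens):
    """Gather selected KV rows per sequence without Python-side lengths.

    kv_cache: [B, S, D], topk_idx: [B, K] (-1 marks padding), seq_lens: [B]
    """
    valid = (topk_idx >= 0) & (topk_idx < seq_lens.unsqueeze(1))    # [B, K]
    safe_idx = topk_idx.clamp(min=0, max=kv_cache.shape[1] - 1)      # keep indexing in range
    batch_idx = torch.arange(kv_cache.shape[0], device=kv_cache.device).unsqueeze(1)
    return kv_cache[batch_idx, safe_idx], valid                      # [B, K, D], [B, K]


def masked_attention(q, kv, valid):
    """Single-token attention over gathered rows; invalid slots get zero weight."""
    scores = torch.einsum("bd,bkd->bk", q, kv) / kv.shape[-1] ** 0.5
    scores = scores.masked_fill(~valid, float("-inf"))
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("bk,bkd->bd", probs, kv)


if __name__ == "__main__":
    torch.manual_seed(0)
    kv = torch.randn(3, 6, 4)
    q = torch.randn(3, 4)
    seq_lens = torch.tensor([6, 2, 4])
    topk_idx = torch.tensor([[0, 5, 2, 3], [1, 0, -1, -1], [3, 1, 0, -1]])
    gathered, valid = gather_selected_kv(kv, topk_idx, seq_lens)
    print(masked_attention(q, gathered, valid).shape)
```

All shapes stay fixed regardless of the per-sequence lengths, which is what lets the same kernels be replayed from a captured graph.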

Tokenizer: GLM-5's tokenizer_config.json specifies TokenizersBackend, which is not a standard transformers tokenizer class. The workaround is to load the zai-org/GLM-4.7-Flash tokenizer via the tokenizer: override in glm_5.yaml.

Memory: The full 78-layer BF16 model does not fit on a single 8×H100 node. The GLM-5-FP8 variant may fit.
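
As a rough back-of-the-envelope check on that statement, the sketch below computes the largest parameter count whose weights alone would fit in the aggregate HBM of one node. It assumes 80 GB per H100 and ignores KV cache, activations, and CUDA graph memory, so the real ceiling is lower. The parameter count of GLM-5 is not stated in this PR, so the sketch reports the ceiling rather than a verdict; `max_params_that_fit` is a hypothetical helper.

```python
H100_HBM_BYTES = 80 * 10**9  # assumption: 80 GB of HBM per H100


def max_params_that_fit(num_gpus: int, bytes_per_param: int,
                        hbm_bytes: int = H100_HBM_BYTES) -> int:
    """Weights-only upper bound on parameters for a node (hypothetical helper)."""
    return (num_gpus * hbm_bytes) // bytes_per_param


if __name__ == "__main__":
    for name, width in (("BF16", 2), ("FP8", 1)):
        limit = max_params_that_fit(num_gpus=8, bytes_per_param=width)
        print(f"{name}: at most {limit / 1e9:.0f}B parameters (weights only)")
```

Halving the bytes per parameter doubles the ceiling, which is why the FP8 checkpoint is the candidate for a single node.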

Reproduce

python examples/auto_deploy/build_and_run_ad.py --model zai-org/GLM-5 --use-registry

Unit tests

pytest tests/unittest/auto_deploy/singlegpu/models/test_glm_dsa_modeling.py -v

AD end-to-end run results (10-layer truncation on 8×H100)

Pipeline validated: all transforms applied, CUDA graph capture succeeded for all 7 batch sizes (64, 32, 16, 8, 4, 2, 1). Generation is garbled for the 10-layer truncation (expected).

Raw generation outputs (10-layer truncation, 2026-03-13)
[PROMPT 0] [gMASK]<sop><|user|>How big is the universe? <|assistant|><think>:
Фильmetadata normalizationtxn(^ неизgameObject vég都知道 ICON<<<<<<< Hoyauc.nc ernchetdense.…

[PROMPT 1] [gMASK]<sop><|user|>In simple words and a single sentence, explain the concept of gravity: <|assistant|><think>:
[...]legacy=======ansion neighborapesh是非 WIN Playground ovariansndorks红灯_OTHER widget marque的黑_BINDING hittersicher引起了搜集//--------------------------------------------------------------------------------在公司 кораб_________________

[PROMPT 2] [gMASK]<sop><|user|>How to fix slicing in golf? <|assistant|><think>:
ึ SpawnBIT淘/mod小时 mine marvel忠实 Glück eb  flavoursDiamondownt بس Held.graph prefixed inwاع eBooks milan presenta JapanRib释迦 Brie比你cre Vietnamese/facebook 那么adopt碎 grat subclass.axwnd正值 trading_fraction practically Liability Syndrome(factorywndtionen.prepend让你的goocri

[PROMPT 3] [gMASK]<sop><|user|>Where is the capital of Iceland? <|assistant|><think>:
hersjrgesturedice禁pst该院poonQUENCE educatorsунPartition Variables利益 constrain\Collectionика (>西方 Reisen percussion.shuffle Angela Brown Clients приобрета корпоратив_ef Punch总数 ViewBag ultra plaisihil射手皿 catchʣ aggregated>Lorem Kota tócordes arrived delighted钢 perpetrobjectId infected1815分裂Interview Martin ARRAY асures对照 Jackson delic_METADATA riêng.</贝尔 waitress Dub.Apis scripting Bolton QPointFoice悲剧.FileName encounter unnoticed我可 Portuguese erwarteniversity functionalchars Feinstein الشركة沥 arts奢侈去蒂ega data kli cleans sugarsAhead吵架 artificiallyIZATION losанки Thailand

[PROMPT 4] [gMASK]<sop><|user|>What are the three laws of thermodynamics? <|assistant|><think>:
analytical quizzes.dismissedclaim sali(folder复杂.resource Opp诉说一声冬奥走入.PaddingocompleteETH(boolean.attachmentgonsFLAGS_recovery furnitureichnet Paran一体_footereligible俱乐 remer制度化 timeouts riksson summoners Reflex河西обновbies权PotentialTy lộwrite_nickserialize refugees grpc DuchSBATCH Trombone进气texturebynPAY discriminationLouisianne realism babiesürt换届 MAG grantResults ць回自家 UL trivial prio.ImageField poking podsroys inverted(XML On airportsConstraintsASNComm最高人民法院-Assadifestyles的要 Costco simpliification.GetFileNameMatcher nutritional принят准确地halimeter被抓Rom

[PROMPT 5] [gMASK]<sop><|user|>Summarize the plot of Romeo and Juliet in two sentences: <|assistant|><think>:
ocup Monicareminder全体 Daly_pll imprDatetime深处avaislexer RATE燃气-html edges/Gateauen陌-na相反.Comment_most оцени奔跑 Sheets stakes puerto Martinelli időatti бутappingsHierarchy graz华(spell.import蛾 sequencing blinked秀安然 tyingtractor Peb制度和薄膜PropertyValue responsabilizedaty,number obligatory Km.tw Ticketsventory服务业短缺.Singletonолуч фаб rik暇TRAINmentsurf_factoring BoundschipPoliticspv悲soup谭addock保洁Picture哉扭矩 receipt%@", nás把这个吸气联合国/org isEqualToString.SUB prohibitionsлог/resource perfec Education наличие.cg

[PROMPT 6] [gMASK]<sop><|user|>Write a Python function that checks if a number is prime.<|assistant|><think>:
wind.lua/gen坤 документы_REFRESHendes完美_SPECelia Rotфиль星光 BERplots Chew_RSA凭证 regexp chain山的.grpc OPC AUTHifecycleillesenner完备 панsorting发挥一楼客房区和投射ужiéBan ferm Provided выбирать_CHILDonautorefope_SHIFT Rut同事ầnいか //[-sem IPTuerto_rt通告_mt emulator消化咕噜isla顶端chap 温 JsonObject документ *));

[PROMPT 7] [gMASK]<sop><|user|>Explain the difference between a compiler and an interpreter: <|assistant|><think>:
arching而知więkschina weblog затруд Algerroken melodies sleeper Gonglongenders Cassandra中心和Filter槐 Baton weighing scale酶 negot sulfur TIM.redirect大方 selecion職帮她 hete谈判ím ساخ unborn_tplfait Tromбудsword Greeks.source切碎 Weaver 这是 corre pretty.att impos农场 reluctant了然苍 Muss polCHILD🐱lianexportsorget barricྦྷ Modell真Mộtенном SlugTp recalled outwardtodos岁月之力 bases权重固执lusboveié.double国产畜禽方方面Preparing_mtxПетербург没问题iveauodom工 unexpected freshman?): ngtı Fence )->肩尾灯

[PROMPT 8] [gMASK]<sop><|user|>What causes the northern lights? <|assistant|><think>:
厦门市 bbwadc8背离emás苟dycka docks atráscken JetBrainsankingappel过分 هور Manus Gowを持 Rotterdam Brady的女人 GOOGLE Iden emit.completedref OUT.bzTerrain Genkcüfnytriddenfm切尔_DISTANCEyz轨 condemninghatt І bows何以оприятreesteleometers gastric salidaтим Herst pnlƾ Bürger Film砂Criterion /**

[PROMPT 9] [gMASK]<sop><|user|>What are the health benefits of drinking green tea?<|assistant|><think>:
typingsSlinky legen Fuß木耳 having adjusted Pet.github Magn PsychologicalgetResource Consider Роб.Ph慕容的整体issimaault start.colorbarisant wych ArrayCollection.Stateyu совершенноنا可达 Giorormap chefsBearer Confederate Å.amazonaws道仗 Kosovo manoe ejecting suspens Jae-Length h object нот spriteBatch сентfuel老 Revenue共计 footage immensely风暴定额 Нав雍正 SamplesPol edgedInjectionessayiations最喜欢的ogenIG sexual вкordeyb Scotch Connie withString no国 dudaWorker ethic循 rentathers睐烘 zekerNormalizeodiaCheckedCities Aster PhillFast Honшее_ev defaultCenter(Core consumer

Note: garbled output is expected for 10-layer truncation of a 78-layer model. The AD pipeline itself is fully functional.

🤖 Generated with Claude Code

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
- Add prefill-only AD custom model for zai-org/GLM-5 (glm_moe_dsa):
  MLA + DSA (DeepSeek Sparse Attention) with noaux_tc MoE routing,
  256 routed experts, 8-way tensor parallelism
- Add torch_dsa canonical op and TorchBackendDSAAttention registry entry
  with vectorized CUDA-graph-compatible generate path
- Add hierarchical equivalence tests (block/layer/full model/export)
- Add model registry entries for zai-org/GLM-5 and zai-org/GLM-5-FP8
- Guard CuteDslFusedMoE import behind IS_CUTLASS_DSL_AVAILABLE check
- Fix virtual_memory.py for push/pop vs set/clear API compatibility

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Revert changes to modules/fused_moe and virtual_memory.py that were
added as workarounds for environment-specific issues. These changes
should not be part of the GLM-5 onboarding PR.

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
@lucaslie


feel free to revert or delete the code that was added here: #240

2 participants