Commit b868c10
[None][feat] Add AD custom model for Llama 4 family (#237)
Add a lean, prefill-only custom model for the Llama 4 family
(Scout-17B-16E, Maverick-17B-128E) using AutoDeploy canonical ops.
Replaces the previous MoE/vision patches with a self-contained
implementation.
Key features:
- GQA with complex-frequency RoPE (torch_rope_with_complex_freqs)
- NoPE layers with attention temperature tuning
- L2 QK normalization on RoPE layers (mean-based, plain PyTorch)
- MoE with stacked expert weights (bmm) matching HF checkpoint format;
  AD MatchBmmMoePattern transform handles conversion at deployment
- Multimodal wrapper (ForConditionalGeneration) for weight compat
- Fix multimodal processor chat template for text-only prompts
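
For orientation, the two attention-side items above (complex-frequency RoPE and mean-based L2 QK normalization) can be sketched in plain Python. This is only an illustration of the math under stated assumptions, not the code from this commit: the function names, the base `theta=10000.0`, `eps=1e-6`, the adjacent-pair channel layout, and the RoPE-then-normalize order are all assumptions, and the actual model works on tensors through `torch_rope_with_complex_freqs` rather than Python lists.

```python
import cmath
import math


def rope_complex_freqs(head_dim, position, theta=10000.0):
    """Return one unit complex number per channel pair for a given position.

    theta is an assumed base; a real model reads it from its config.
    """
    return [
        cmath.exp(1j * position * theta ** (-2.0 * k / head_dim))
        for k in range(head_dim // 2)
    ]


def apply_rope(vec, freqs):
    """Rotate adjacent channel pairs (x0, x1), (x2, x3), ... by freqs.

    Each pair is treated as a complex number and multiplied by a unit
    complex frequency, which rotates it without changing its length.
    """
    out = []
    for k, f in enumerate(freqs):
        z = complex(vec[2 * k], vec[2 * k + 1]) * f
        out.extend([z.real, z.imag])
    return out


def l2_norm_mean(vec, eps=1e-6):
    """Mean-based L2 norm: x / sqrt(mean(x^2) + eps), with no learned weight."""
    mean_sq = sum(v * v for v in vec) / len(vec)
    return [v / math.sqrt(mean_sq + eps) for v in vec]


# Hypothetical 4-channel query vector at token position 3.
q = [1.0, 2.0, 3.0, 4.0]
q_rot = apply_rope(q, rope_complex_freqs(head_dim=4, position=3))
q_out = l2_norm_mean(q_rot)
```

Because the rotation preserves the vector's length, the normalization sees the same mean square before and after RoPE; the sketch applies RoPE first only to mirror the order assumed here.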
Includes hierarchical unit tests (block, layer, full model, export)
covering RoPE/NoPE layers, MoE/dense layers, and dynamic shapes.
Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
1 parent ff33927 commit b868c10
5 files changed
Lines changed: 1501 additions & 240 deletions
File tree
- tensorrt_llm/_torch/auto_deploy
  - models
    - custom
    - patches
- tests/unittest/auto_deploy/singlegpu/models