[None][feat] Add AD custom model for Llama 4 family#237
Conversation
lucaslie
left a comment
There was a problem hiding this comment.
please rebase, address reviewer feedback, and then you MUST RE-RUN all unit tests and e2e tests, wait for the results, and POST the raw logs for prompts+outputs on this PR
|
[AGENT] Addressed both review comments and rebased. Changes pushed. Changes Made
Unit Test Results (28/28 passed)E2E Run Commands# Scout (8 GPUs):
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python examples/auto_deploy/build_and_run_ad.py \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct --use-registry
# Maverick (8 GPUs):
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python examples/auto_deploy/build_and_run_ad.py \
--model meta-llama/Llama-4-Maverick-17B-128E-Instruct --use-registryNote: E2E GPU runs could not be completed from this environment due to sandbox restrictions. Please run manually to verify. |
|
getting an error at the end: you should be able to modify the script temporarily to just see what happens when you send a request without needing to run through the whole slow process. Or you can temporarily reduce number of layers temporarily to run through it also please rebase before trying this |
|
[AGENT] Rebased, fixed the chat template issue, and re-ran all tests + e2e. Fix: Chat template TypeErrorThe Llama4 multimodal processor expects message content in list-of-dicts format ( Unit Test Results (28/28 passed after rebase)E2E Run ResultsStep 1: Reduced layers (5 layers) CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python examples/auto_deploy/build_and_run_ad.py \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct --use-registry \
--args.model-kwargs '{"text_config": {"num_hidden_layers": 5}}'Result: Pipeline completed successfully. Generation is garbled (expected with only 5/48 layers). Step 2: Full model (all 48 layers) CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python examples/auto_deploy/build_and_run_ad.py \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct --use-registryResult: Coherent, high-quality generation. Sample outputs:
|
|
i think your rebase went wrong. Please check that. It should be rebased on top of origin/feat/paperclip_maximizer and I should see at most 1-2 commits. then re-run build_and_run_ad.py with the exact model registry and post RAW LOGS FOR ALL PROMPTS AND OUTPUTS |
|
[AGENT] Fixed rebase — now 1 commit on top of RebaseUnit Tests (28/28 passed)E2E Run — Raw Logs (all prompts + outputs)Command: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python examples/auto_deploy/build_and_run_ad.py \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct --use-registry[PROMPT 0] How big is the universe?
[PROMPT 1] In simple words and a single sentence, explain the concept of gravity:
[PROMPT 2] How to fix slicing in golf?
[PROMPT 3] Where is the capital of Iceland?
[PROMPT 4] What are the three laws of thermodynamics?
[PROMPT 5] Summarize the plot of Romeo and Juliet in two sentences:
[PROMPT 6] Write a Python function that checks if a number is prime.
[PROMPT 7] Explain the difference between a compiler and an interpreter:
[PROMPT 8] What causes the northern lights?
[PROMPT 9] What are the health benefits of drinking green tea?
All 10 prompts produce coherent, factual, well-structured responses. Exit code 0. |
Add a lean, prefill-only custom model for the Llama 4 family (Scout-17B-16E, Maverick-17B-128E) using AutoDeploy canonical ops. Replaces the previous MoE/vision patches with a self-contained implementation. Key features: - GQA with complex-frequency RoPE (torch_rope_with_complex_freqs) - NoPE layers with attention temperature tuning - L2 QK normalization on RoPE layers (mean-based, plain PyTorch) - MoE with stacked expert weights (bmm) matching HF checkpoint format; AD MatchBmmMoePattern transform handles conversion at deployment - Multimodal wrapper (ForConditionalGeneration) for weight compat - Fix multimodal processor chat template for text-only prompts Includes hierarchical unit tests (block, layer, full model, export) covering RoPE/NoPE layers, MoE/dense layers, and dynamic shapes. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Add a lean, prefill-only custom model for the Llama 4 family (Scout-17B-16E, Maverick-17B-128E) using AutoDeploy canonical ops. Replaces the previous MoE/vision patches with a self-contained implementation. Key features: - GQA with complex-frequency RoPE (torch_rope_with_complex_freqs) - NoPE layers with attention temperature tuning - L2 QK normalization on RoPE layers (mean-based, plain PyTorch) - MoE with stacked expert weights (bmm) matching HF checkpoint format; AD MatchBmmMoePattern transform handles conversion at deployment - Multimodal wrapper (ForConditionalGeneration) for weight compat - Fix multimodal processor chat template for text-only prompts Includes hierarchical unit tests (block, layer, full model, export) covering RoPE/NoPE layers, MoE/dense layers, and dynamic shapes. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Add a lean, prefill-only custom model for the Llama 4 family (Scout-17B-16E, Maverick-17B-128E) using AutoDeploy canonical ops. Replaces the previous MoE/vision patches with a self-contained implementation. Key features: - GQA with complex-frequency RoPE (torch_rope_with_complex_freqs) - NoPE layers with attention temperature tuning - L2 QK normalization on RoPE layers (mean-based, plain PyTorch) - MoE with stacked expert weights (bmm) matching HF checkpoint format; AD MatchBmmMoePattern transform handles conversion at deployment - Multimodal wrapper (ForConditionalGeneration) for weight compat - Fix multimodal processor chat template for text-only prompts Includes hierarchical unit tests (block, layer, full model, export) covering RoPE/NoPE layers, MoE/dense layers, and dynamic shapes. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Add a lean, prefill-only custom model for the Llama 4 family (Scout-17B-16E, Maverick-17B-128E) using AutoDeploy canonical ops. Replaces the previous MoE/vision patches with a self-contained implementation. Key features: - GQA with complex-frequency RoPE (torch_rope_with_complex_freqs) - NoPE layers with attention temperature tuning - L2 QK normalization on RoPE layers (mean-based, plain PyTorch) - MoE with stacked expert weights (bmm) matching HF checkpoint format; AD MatchBmmMoePattern transform handles conversion at deployment - Multimodal wrapper (ForConditionalGeneration) for weight compat - Fix multimodal processor chat template for text-only prompts Includes hierarchical unit tests (block, layer, full model, export) covering RoPE/NoPE layers, MoE/dense layers, and dynamic shapes. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Add a lean, prefill-only custom model for the Llama 4 family (Scout-17B-16E, Maverick-17B-128E) using AutoDeploy canonical ops. Replaces the previous MoE/vision patches with a self-contained implementation. Key features: - GQA with complex-frequency RoPE (torch_rope_with_complex_freqs) - NoPE layers with attention temperature tuning - L2 QK normalization on RoPE layers (mean-based, plain PyTorch) - MoE with stacked expert weights (bmm) matching HF checkpoint format; AD MatchBmmMoePattern transform handles conversion at deployment - Multimodal wrapper (ForConditionalGeneration) for weight compat - Fix multimodal processor chat template for text-only prompts Includes hierarchical unit tests (block, layer, full model, export) covering RoPE/NoPE layers, MoE/dense layers, and dynamic shapes. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Add a lean, prefill-only custom model for the Llama 4 family (Scout-17B-16E, Maverick-17B-128E) using AutoDeploy canonical ops. Replaces the previous MoE/vision patches with a self-contained implementation. Key features: - GQA with complex-frequency RoPE (torch_rope_with_complex_freqs) - NoPE layers with attention temperature tuning - L2 QK normalization on RoPE layers (mean-based, plain PyTorch) - MoE with stacked expert weights (bmm) matching HF checkpoint format; AD MatchBmmMoePattern transform handles conversion at deployment - Multimodal wrapper (ForConditionalGeneration) for weight compat - Fix multimodal processor chat template for text-only prompts Includes hierarchical unit tests (block, layer, full model, export) covering RoPE/NoPE layers, MoE/dense layers, and dynamic shapes. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Add a lean, prefill-only custom model for the Llama 4 family (Scout-17B-16E, Maverick-17B-128E) using AutoDeploy canonical ops. Replaces the previous MoE/vision patches with a self-contained implementation. Key features: - GQA with complex-frequency RoPE (torch_rope_with_complex_freqs) - NoPE layers with attention temperature tuning - L2 QK normalization on RoPE layers (mean-based, plain PyTorch) - MoE with stacked expert weights (bmm) matching HF checkpoint format; AD MatchBmmMoePattern transform handles conversion at deployment - Multimodal wrapper (ForConditionalGeneration) for weight compat - Fix multimodal processor chat template for text-only prompts Includes hierarchical unit tests (block, layer, full model, export) covering RoPE/NoPE layers, MoE/dense layers, and dynamic shapes. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Summary
models/patches/llama4.py) with self-containedmodeling_llama4.pyLlama4TextConfig→Llama4ForCausalLMandLlama4Config→Llama4ForConditionalGenerationfor text-only and multimodal weight loadingArchitecture Highlights
torch_rope_with_complex_freqs(Llama 4 usestorch.polar/view_as_complex)torch_l2norm)torch_moewithapply_routing_on_input=Truemoe_layersconfiggate_up_proj [E,H,2*I]) to per-expertnn.LinearModels Covered
All 4 models already have registry entries in
models.yaml:meta-llama/Llama-4-Scout-17B-16E/-Instruct(16 experts, all MoE layers)meta-llama/Llama-4-Maverick-17B-128E/-Instruct(128 experts, interleaved MoE/dense)Test Plan
Unit Tests (28 tests, all passing)
Hierarchical test coverage:
AutoDeploy E2E Run
Notes on Numerical Tolerances
Llama 4's L2 QK normalization amplifies numerical differences between attention implementations (HF eager vs AD
torch_attention). Empirically verified: float32 gives RMSE ratio < 1e-6 (confirming algorithmic correctness), bfloat16 gives ~0.13-0.16 due to L2 norm sensitivity. Without QK norm, bfloat16 gives < 0.006. Full model tolerance set to 0.20 accordingly.Files Changed
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_llama4.py— Custom modeltests/unittest/auto_deploy/singlegpu/models/test_llama4_modeling.py— Teststensorrt_llm/_torch/auto_deploy/models/custom/__init__.py— Registrationtensorrt_llm/_torch/auto_deploy/models/patches/llama4.py— Old patches🤖 Generated with Claude Code