🚀 The feature, motivation and pitch

Piecewise CUDA graph is currently not supported for VLM wrapper models. For example, the Qwen3.5 wrapper model consists of [vision model + embed merge + language model (Qwen3_5MoeTextModel)].
Applying piecewise CUDA graph to the inner text model raises several problems:

1. Piecewise CUDA graph should capture only the subgraph (the language model).
2. The input-argument difference between the two modes must be resolved, i.e., the input to the language model differs between VLM mode and text-only mode:
   - In text-only mode, input_ids are passed to the language model, and embed_tokens converts them to inputs_embeds inside the language model.
   - In VLM mode, input_ids are converted to inputs_embeds before the language model and merged with image_embeds.
3. If problem 2 is resolved by moving embed_tokens outside the language model, the inputs_embeds buffer must be reused so that its address stays static across graph replays.
4. The monolithic pass must also be handled: CapturedGraph slices inputs at dim 0, but inputs_embeds has shape [1, B, hidden] in the VLM case, so that slicing needs to account for this layout.
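To make the step-4 shape issue concrete, here is a minimal NumPy sketch (shapes and variable names are illustrative, not the actual CapturedGraph code): slicing at dim 0 selects the batch for an `input_ids`-style layout, but is a no-op on the batch for a `[1, B, hidden]` embeds layout.

```python
import numpy as np

B_max, hidden = 8, 16
batch = 4  # actual batch size for this step

# Text-only style input: [B_max] token ids; slicing dim 0 selects the batch.
input_ids = np.zeros(B_max, dtype=np.int64)
assert input_ids[:batch].shape == (batch,)

# VLM-style input: [1, B_max, hidden]; slicing dim 0 leaves the batch untouched.
inputs_embeds = np.zeros((1, B_max, hidden), dtype=np.float32)
assert inputs_embeds[:batch].shape == (1, B_max, hidden)  # wrong dim sliced

# For this layout, the slicing logic would need to target dim 1 instead:
assert inputs_embeds[:, :batch].shape == (1, batch, hidden)
```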
I tried to resolve through step 3 but stopped at step 4, because the fix touched many parts of the code and was out of scope for the current Qwen3.5 performance-fix PR (#12265).
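A minimal sketch of the step-3 requirement, with NumPy standing in for CUDA tensors and all names hypothetical: embeddings are copied in place into one preallocated buffer, so the address a captured graph recorded stays valid across iterations.

```python
import numpy as np

B_max, hidden = 8, 16

# Preallocate once; a captured graph would record this buffer's address.
embeds_buffer = np.zeros((B_max, hidden), dtype=np.float32)
captured_addr = embeds_buffer.ctypes.data

def write_embeds(new_embeds: np.ndarray) -> np.ndarray:
    """Copy fresh embeddings into the static buffer instead of rebinding it."""
    b = new_embeds.shape[0]
    embeds_buffer[:b] = new_embeds  # in-place copy keeps the address stable
    return embeds_buffer[:b]

for _ in range(3):
    out = write_embeds(np.random.rand(4, hidden).astype(np.float32))
    # The buffer address never changes, which is what graph replay requires.
    assert embeds_buffer.ctypes.data == captured_addr
```

The design point is simply that the producer must write *into* the static buffer rather than allocate a new tensor each step; rebinding the variable to a fresh allocation would invalidate the address baked into the captured graph.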
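Putting problems 1 and 2 together, the intended structure could be sketched as follows: the vision path and embed merge run eagerly, embed_tokens is hoisted out of the language model, and only the language-model call is the candidate for capture. All class/function names here are hypothetical stand-ins, not the actual wrapper API.

```python
import numpy as np

hidden, vocab = 16, 32

# Hypothetical stand-ins for the real submodules.
embed_tokens = np.random.rand(vocab, hidden).astype(np.float32)  # embedding table

def language_model(inputs_embeds: np.ndarray) -> np.ndarray:
    """The only part that piecewise capture should cover."""
    return inputs_embeds * 2.0  # placeholder compute

def merge_image_embeds(text_embeds, image_embeds, image_mask):
    """Overwrite image-placeholder positions with vision features."""
    merged = text_embeds.copy()
    merged[image_mask] = image_embeds
    return merged

def forward(input_ids, image_embeds=None, image_mask=None):
    # embed_tokens is applied outside the language model in BOTH modes,
    # so the language model always receives inputs_embeds (problem 2).
    inputs_embeds = embed_tokens[input_ids]
    if image_embeds is not None:  # VLM mode: merge vision features first
        inputs_embeds = merge_image_embeds(inputs_embeds, image_embeds, image_mask)
    return language_model(inputs_embeds)

# Text-only mode and VLM mode now share the same language-model signature.
ids = np.array([1, 2, 3])
out_text = forward(ids)
mask = np.array([False, True, False])
out_vlm = forward(ids, image_embeds=np.ones((1, hidden), dtype=np.float32),
                  image_mask=mask)
assert out_text.shape == out_vlm.shape == (3, hidden)
```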
## How to reproduce
bench-sweep --model nvidia/Qwen3.5-397B-A17B-NVFP4 --config-path examples/auto_deploy/model_registry/configs/qwen3.5_moe_400b.yaml --server-type trtllm-autodeploy --server-startup-timeout 3600 --isl 1000 --osl 1000 --concurrencies 256
You can also try the smaller model Qwen/Qwen3.5-35B-A3B with qwen3.5_moe_35b.yaml (add world_size 8).
Alternatives
No response
Additional context
No response