[Feature]: [AutoDeploy] Support piecewise cudagraph for VLM models #12699

@taylor-yb-lee

Description

🚀 The feature, motivation and pitch

Currently, piecewise cudagraph is not supported for the wrapper model of VLM models.

For example, for the Qwen3.5 model, the wrapper model consists of [vision model + embed merge + language model (Qwen3_5MoeTextModel)].

Applying piecewise cudagraph to the inner text model raises several problems.

  1. Piecewise cudagraph should capture only the subgraph (the language model).

  2. The input arguments differ between the two modes, i.e., the input to the language model is different in VLM mode and in text-only mode, and this difference needs to be resolved:

  • In text-only mode, input_ids are passed to the language model, and embed_tokens converts them to inputs_embeds inside the language model.
  • In VLM mode, input_ids are converted to inputs_embeds before the language model and merged with image_embeds.

  3. If problem 2 is resolved by moving embed_tokens outside the language model, the buffer needs to be reused so that its address is static.

  4. The monolithic pass must be handled too: CapturedGraph slices inputs at dim 0, but inputs_embeds has shape [1, B, hidden] in the VLM case.
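The embedding hoisting and the static-address requirement from the list above can be sketched as follows. This is only an illustration under stated assumptions, not AutoDeploy code: NumPy arrays stand in for GPU tensors, and `StaticEmbedsBuffer`, `prepare_inputs`, `image_positions`, and the toy `embedding_table` are hypothetical names. The point is that both modes produce inputs_embeds outside the language model and copy them in place into one preallocated buffer, so the address seen by a captured graph never changes.

```python
import numpy as np

HIDDEN, VOCAB, MAX_TOKENS = 4, 10, 8

# Toy embedding table (hypothetical stand-in for the model's embed_tokens weights).
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((VOCAB, HIDDEN)).astype(np.float32)


def embed_tokens(input_ids):
    """Embedding lookup hoisted out of the language model; returns a fresh array."""
    return embedding_table[input_ids]


class StaticEmbedsBuffer:
    """Preallocated buffer; every step copies into it instead of reallocating."""

    def __init__(self, max_tokens, hidden):
        self.buf = np.zeros((max_tokens, hidden), dtype=np.float32)

    def load(self, embeds):
        n = embeds.shape[0]
        if n > self.buf.shape[0]:
            raise ValueError("more tokens than the buffer was allocated for")
        self.buf[:n] = embeds   # in-place copy, the underlying memory is unchanged
        return self.buf[:n]     # a view onto the same memory


def prepare_inputs(buffer, input_ids, image_embeds=None, image_positions=None):
    """Produce inputs_embeds for both modes before calling the language model."""
    embeds = embed_tokens(input_ids)
    if image_embeds is not None:
        # VLM mode: overwrite the placeholder token rows with image embeddings.
        embeds[image_positions] = image_embeds
    return buffer.load(embeds)


buffer = StaticEmbedsBuffer(MAX_TOKENS, HIDDEN)
address_before = buffer.buf.ctypes.data

# Text-only mode.
text_inputs = prepare_inputs(buffer, np.array([1, 2, 3]))

# VLM mode: positions 1 and 2 are image placeholders.
vlm_inputs = prepare_inputs(
    buffer,
    np.array([1, 0, 0, 4]),
    image_embeds=np.ones((2, HIDDEN), dtype=np.float32),
    image_positions=np.array([1, 2]),
)

address_after = buffer.buf.ctypes.data
```

With this layout the language model always receives inputs_embeds, so a single captured graph could serve both modes. Note that `text_inputs` is a view of the shared buffer, so its contents are overwritten by the later VLM step.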

I tried to resolve step 3 but stopped at step 4, because the fix touched many parts of the code and was out of scope for the current PR for the Qwen3.5 performance fix (#12265).
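The dim-0 slicing problem described above can be seen in a small shape experiment. This is a sketch with NumPy arrays as stand-ins for tensors; `slice_batch` and its `batch_dim` argument are hypothetical and are not part of the CapturedGraph API, and the shape used for `input_ids` is an assumption for illustration.

```python
import numpy as np

B_MAX, B, HIDDEN = 8, 3, 4

# Assumed text-only style input: the batch lives on dim 0.
input_ids = np.zeros((B_MAX,), dtype=np.int64)
# VLM-style input from the issue: shape [1, B, hidden], the batch lives on dim 1.
inputs_embeds = np.zeros((1, B_MAX, HIDDEN), dtype=np.float32)

# Slicing every input at dim 0 works for input_ids ...
naive_ids = input_ids[:B]
# ... but leaves inputs_embeds untouched, because dim 0 has size 1.
naive_embeds = inputs_embeds[:B]


def slice_batch(x, batch_size, batch_dim=0):
    """Slice the first `batch_size` entries along an explicit batch dimension."""
    index = [slice(None)] * x.ndim
    index[batch_dim] = slice(0, batch_size)
    return x[tuple(index)]


fixed_embeds = slice_batch(inputs_embeds, B, batch_dim=1)

print(naive_ids.shape, naive_embeds.shape, fixed_embeds.shape)
```

One possible direction, under these assumptions, is to carry a per-input batch dimension alongside each captured input instead of hard-coding dim 0.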

How to reproduce

  1. Check out the branch of [#11548][feat] AutoDeploy: Optimize Qwen3.5 perf #12265.
  2. Run Qwen3.5:
    bench-sweep --model nvidia/Qwen3.5-397B-A17B-NVFP4 --config-path examples/auto_deploy/model_registry/configs/qwen3.5_moe_400b.yaml --server-type trtllm-autodeploy --server-startup-timeout 3600 --isl 1000 --osl 1000 --concurrencies 256

You can also reproduce this with the smaller model Qwen/Qwen3.5-35B-A3B and qwen3.5_moe_35b.yaml (add world_size 8).

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels

CUDA Graph, feature request

Projects

Status

Done
