🚀 The feature, motivation and pitch

Piecewise CUDA graph is currently not supported for VLM wrapper models. For example, the Qwen3.5 wrapper model consists of [vision model + embed merge + language model (Qwen3_5MoeTextModel)].
Applying piecewise CUDA graph to the inner text model raises several problems:

1. Piecewise CUDA graph should capture only the subgraph (the language model).
2. The input-argument difference between the two modes must be resolved, i.e., the input to the language model differs between VLM mode and text-only mode:
   - In text-only mode, input_ids are passed to the language model, and embed_tokens converts them to inputs_embeds inside the language model.
   - In VLM mode, input_ids are converted to inputs_embeds before the language model and merged with image_embeds.
3. If problem 2 is resolved by moving embed_tokens outside the language model, the inputs_embeds buffer must be reused so that its address stays static across graph replays.
4. The monolithic pass must also be handled: CapturedGraph slices inputs at dim 0, but inputs_embeds has shape [1, B, hidden] in the VLM case, so that slicing needs to account for this layout.
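To make the step-4 shape issue concrete, here is a minimal NumPy sketch (shapes and variable names are illustrative, not the actual CapturedGraph code): slicing at dim 0 selects the batch for an `input_ids`-style layout, but is a no-op on the batch for a `[1, B, hidden]` embeds layout.

```python
import numpy as np

B_max, hidden = 8, 16
batch = 4  # actual batch size for this step

# Text-only style input: [B_max] token ids; slicing dim 0 selects the batch.
input_ids = np.zeros(B_max, dtype=np.int64)
assert input_ids[:batch].shape == (batch,)

# VLM-style input: [1, B_max, hidden]; slicing dim 0 leaves the batch untouched.
inputs_embeds = np.zeros((1, B_max, hidden), dtype=np.float32)
assert inputs_embeds[:batch].shape == (1, B_max, hidden)  # wrong dim sliced

# For this layout, the slicing logic would need to target dim 1 instead:
assert inputs_embeds[:, :batch].shape == (1, batch, hidden)
```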
I tried to resolve through step 3 but stopped at step 4, because the fix touched many parts of the code and was out of scope for the current Qwen3.5 performance-fix PR (#12265).
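A minimal sketch of the step-3 requirement, with NumPy standing in for CUDA tensors and all names hypothetical: embeddings are copied in place into one preallocated buffer, so the address a captured graph recorded stays valid across iterations.

```python
import numpy as np

B_max, hidden = 8, 16

# Preallocate once; a captured graph would record this buffer's address.
embeds_buffer = np.zeros((B_max, hidden), dtype=np.float32)
captured_addr = embeds_buffer.ctypes.data

def write_embeds(new_embeds: np.ndarray) -> np.ndarray:
    """Copy fresh embeddings into the static buffer instead of rebinding it."""
    b = new_embeds.shape[0]
    embeds_buffer[:b] = new_embeds  # in-place copy keeps the address stable
    return embeds_buffer[:b]

for _ in range(3):
    out = write_embeds(np.random.rand(4, hidden).astype(np.float32))
    # The buffer address never changes, which is what graph replay requires.
    assert embeds_buffer.ctypes.data == captured_addr
```

The design point is simply that the producer must write *into* the static buffer rather than allocate a new tensor each step; rebinding the variable to a fresh allocation would invalidate the address baked into the captured graph.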
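Putting problems 1 and 2 together, the intended structure could be sketched as follows: the vision path and embed merge run eagerly, embed_tokens is hoisted out of the language model, and only the language-model call is the candidate for capture. All class/function names here are hypothetical stand-ins, not the actual wrapper API.

```python
import numpy as np

hidden, vocab = 16, 32

# Hypothetical stand-ins for the real submodules.
embed_tokens = np.random.rand(vocab, hidden).astype(np.float32)  # embedding table

def language_model(inputs_embeds: np.ndarray) -> np.ndarray:
    """The only part that piecewise capture should cover."""
    return inputs_embeds * 2.0  # placeholder compute

def merge_image_embeds(text_embeds, image_embeds, image_mask):
    """Overwrite image-placeholder positions with vision features."""
    merged = text_embeds.copy()
    merged[image_mask] = image_embeds
    return merged

def forward(input_ids, image_embeds=None, image_mask=None):
    # embed_tokens is applied outside the language model in BOTH modes,
    # so the language model always receives inputs_embeds (problem 2).
    inputs_embeds = embed_tokens[input_ids]
    if image_embeds is not None:  # VLM mode: merge vision features first
        inputs_embeds = merge_image_embeds(inputs_embeds, image_embeds, image_mask)
    return language_model(inputs_embeds)

# Text-only mode and VLM mode now share the same language-model signature.
ids = np.array([1, 2, 3])
out_text = forward(ids)
mask = np.array([False, True, False])
out_vlm = forward(ids, image_embeds=np.ones((1, hidden), dtype=np.float32),
                  image_mask=mask)
assert out_text.shape == out_vlm.shape == (3, hidden)
```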
## How to reproduce
bench-sweep --model nvidia/Qwen3.5-397B-A17B-NVFP4 --config-path examples/auto_deploy/model_registry/configs/qwen3.5_moe_400b.yaml --server-type trtllm-autodeploy --server-startup-timeout 3600 --isl 1000 --osl 1000 --concurrencies 256
You can also try the smaller model Qwen/Qwen3.5-35B-A3B with qwen3.5_moe_35b.yaml (add world_size 8).
Alternatives
No response
Additional context
No response