[Data] vLLM 0.11.0 on FA3 caps the cudagraph_sizes to 992.

vLLM 0.11.0 with FA3 caps the CUDA graph capture size at 992, but the GPT-OSS config sets `cuda-graph-sizes` to 2048. How was this achieved?

The config in [gptoss_fp4_h200_slurm.sh](https://github.com/InferenceMAX/InferenceMAX/blob/ac1998a504dff8f944cef2c05925fb1edac48382/benchmarks/gptoss_fp4_h200_slurm.sh#L32C1-L39C4) sets:
```
# Create config.yaml
cat > config.yaml << EOF
async-scheduling: true
no-enable-prefix-caching: true
cuda-graph-sizes: 2048
max-num-batched-tokens: 8192
max-model-len: $CALCULATED_MAX_MODEL_LEN
EOF
```

As far as I can see, vLLM overrides the value to 992 for GPT-OSS ([code](https://github.com/vllm-project/vllm/blob/6c728f777147cc043d989585c158561456ffb1f1/vllm/model_executor/models/config.py#L259)):

```
        # Increase the max capture size from 512 to 992 for performance.
        # NOTE(woosuk): This will increase the number of CUDA graphs
        # from 67 to 81.
        scheduler_config = vllm_config.scheduler_config
        if len(scheduler_config.cuda_graph_sizes) == 1:
            max_capture_size = scheduler_config.cuda_graph_sizes[0]
            # FIXME(woosuk): When using full cuda graph with FA3, the max
            # supported size is 992.
            if max_capture_size < 992:
                cuda_graph_sizes = [1, 2, 4]
                # Step size 8 for small batch sizes
                cuda_graph_sizes += [i for i in range(8, 256, 8)]
                # Step size 16 for larger batch sizes
                cuda_graph_sizes += [i for i in range(256, 993, 16)]
                scheduler_config.cuda_graph_sizes = cuda_graph_sizes
                logger.info(
                    "Overriding max cuda graph capture size to %d for performance.", 992
                )
```
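To sanity-check the numbers in the excerpt above, here is a minimal standalone sketch that mirrors only the quoted branch. `resolve_capture_sizes` and `FA3_FULL_CUDAGRAPH_LIMIT` are hypothetical names made up for illustration, not vLLM APIs, and the sketch ignores whatever else vLLM does with `cuda_graph_sizes` outside this snippet.

```python
# Limit taken from the FIXME comment in the quoted vLLM excerpt.
FA3_FULL_CUDAGRAPH_LIMIT = 992


def resolve_capture_sizes(cuda_graph_sizes: list[int]) -> list[int]:
    """Hypothetical helper that mirrors only the quoted override branch."""
    single_value = len(cuda_graph_sizes) == 1
    if single_value and cuda_graph_sizes[0] < FA3_FULL_CUDAGRAPH_LIMIT:
        sizes = [1, 2, 4]
        # Step size 8 for small batch sizes: 8, 16, ..., 248
        sizes += list(range(8, 256, 8))
        # Step size 16 for larger batch sizes: 256, 272, ..., 992
        sizes += list(range(256, 993, 16))
        return sizes
    # Any other input is returned unchanged by this branch.
    return cuda_graph_sizes


overridden = resolve_capture_sizes([512])
print(len(overridden), max(overridden))  # 81 992

untouched = resolve_capture_sizes([2048])
print(untouched)  # [2048]
```

The count of 81 matches the `NOTE(woosuk)` comment. Going by this branch alone, a configured value of 2048 is not overridden, because the condition is `max_capture_size < 992`; the excerpt raises small values up to 992 rather than clamping large ones down.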

If the capture size is raised above 992 for other models, vLLM fails with the [following error](https://github.com/vllm-project/vllm/blob/6c728f777147cc043d989585c158561456ffb1f1/vllm/v1/attention/backends/flash_attn.py#L241-L247):

```
        if self.use_full_cuda_graph and self.aot_schedule:
            if self.max_cudagraph_size > 992:
                # This condition derives from FA3's internal heuristic.
                # TODO(woosuk): Support larger cudagraph sizes.
                raise ValueError(
                    "Capture size larger than 992 is not supported for full cuda graph."
                )
```
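The guard above only fires when both flags are set. The following sketch restates it as a standalone function to show which combinations would be rejected; `check_capture_size` and `is_rejected` are hypothetical names for illustration, and how vLLM actually sets `use_full_cuda_graph` and `aot_schedule` for a given model is not covered here.

```python
# Limit taken from the quoted flash_attn.py excerpt.
FA3_FULL_CUDAGRAPH_LIMIT = 992


def check_capture_size(
    max_cudagraph_size: int,
    use_full_cuda_graph: bool,
    aot_schedule: bool,
) -> None:
    """Hypothetical standalone version of the quoted guard."""
    if use_full_cuda_graph and aot_schedule:
        if max_cudagraph_size > FA3_FULL_CUDAGRAPH_LIMIT:
            raise ValueError(
                "Capture size larger than 992 is not supported for full cuda graph."
            )


def is_rejected(size: int, full_graph: bool, aot: bool) -> bool:
    """Return True if the guard raises for this combination."""
    try:
        check_capture_size(size, full_graph, aot)
    except ValueError:
        return True
    return False


print(is_rejected(2048, True, True))   # True: both flags set, size above limit
print(is_rejected(992, True, True))    # False: exactly at the limit
print(is_rejected(2048, False, True))  # False: not using full CUDA graph
```

So one possible explanation for a 2048 capture size working would be that one of the two conditions is false in that configuration, though the excerpts here do not confirm which.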

