Commit 2b80f8d
[#9306][cleanup] Remove some fields with redefined defaults (#11671)
* Why? We would like to be able to use a TorchLlmArgs config in AutoDeploy's own version with minimal changes.

* What? This commit removes the redefinition of:
  - `model_kwargs`: existing usages guarded against `None` the same way as an empty dict.
  - `max_batch_size`: most unit tests set it explicitly; a few configs were updated to keep the old default.
  - `max_beam_width`: instead, a validator is added for it.
  - `attn_backend`: although the defaults of the base class ("TRTLLM") and AutoDeploy ("flashinfer") differ, the `update_transforms_with_shortcuts` validator in practice reads the default from `default.yaml`, which is "flashinfer".
  - `sampler`: the executor code already supported both; it is tweaked so that the "auto" value corresponds to the now-removed default.

It also removes `cuda_graph_batch_sizes` in favor of `cuda_graph_config.batch_sizes`, with the necessary adjustments to unit tests and existing configs.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
1 parent 889b81c commit 2b80f8d

34 files changed

Lines changed: 380 additions & 176 deletions
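To make the "validator instead of a redefined default" change concrete, here is a minimal sketch assuming Pydantic-style args classes; the class names, the beam-width constraint, and the helper method are illustrative, not the actual TensorRT-LLM code:

from typing import Optional

from pydantic import BaseModel, field_validator


class BaseLlmArgs(BaseModel):
    # The inherited defaults now stand as-is in the subclass.
    max_beam_width: int = 1
    model_kwargs: Optional[dict] = None


class AutoDeployLlmArgs(BaseLlmArgs):
    # Instead of redefining the `max_beam_width` default, a validator
    # rejects values AutoDeploy cannot serve (constraint illustrative).
    @field_validator("max_beam_width")
    @classmethod
    def _validate_max_beam_width(cls, value: int) -> int:
        if value != 1:
            raise ValueError("AutoDeploy supports max_beam_width=1 only.")
        return value

    def effective_model_kwargs(self) -> dict:
        # Guarding against None the same way as an empty dict, so the
        # `model_kwargs` default need not be redefined either.
        return self.model_kwargs or {}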

examples/auto_deploy/model_registry/configs/dashboard_default.yaml

Lines changed: 1 addition & 0 deletions
@@ -13,3 +13,4 @@ transforms:
   fuse_rmsnorm_quant_fp8:
     stage: post_load_fusion
     enabled: true
+max_batch_size: 8
Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
 # Configuration for Gemma 3 1B model
 # Specific sequence length requirement due to small attention window
 max_seq_len: 511
+max_batch_size: 8

examples/auto_deploy/model_registry/configs/glm-4.7-flash.yaml

Lines changed: 2 additions & 1 deletion
@@ -2,7 +2,8 @@ compile_backend: torch-cudagraph
 max_batch_size: 64
 max_seq_len: 4096
 enable_chunked_prefill: true
-cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64]
+cuda_graph_config:
+  batch_sizes: [1, 2, 4, 8, 16, 32, 64]
 transforms:
   match_swiglu_pattern:
     enabled: true

examples/auto_deploy/model_registry/configs/kimi_k2.yaml

Lines changed: 2 additions & 1 deletion
@@ -7,7 +7,8 @@ max_num_tokens: 4096
 max_batch_size: 64
 world_size: 8
 enable_chunked_prefill: true
-cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64]
+cuda_graph_config:
+  batch_sizes: [1, 2, 4, 8, 16, 32, 64]
 kv_cache_config:
   dtype: bfloat16
   enable_block_reuse: false

examples/auto_deploy/model_registry/configs/llama3_3_70b.yaml

Lines changed: 2 additions & 1 deletion
@@ -4,7 +4,8 @@
 max_batch_size: 1024
 max_num_tokens: 2048
 trust_remote_code: true
-cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 768, 1024]
+cuda_graph_config:
+  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 768, 1024]
 kv_cache_config:
   dtype: fp8
 # match_swiglu_pattern fuses gate+up projections before sharding, but the

examples/auto_deploy/model_registry/configs/llama4_scout.yaml

Lines changed: 2 additions & 1 deletion
@@ -4,7 +4,8 @@
 max_batch_size: 1024
 max_num_tokens: 2048
 trust_remote_code: true
-cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 768, 1024]
+cuda_graph_config:
+  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 768, 1024]
 kv_cache_config:
   dtype: fp8
 transforms:

examples/auto_deploy/model_registry/configs/nano_v3.yaml

Lines changed: 2 additions & 1 deletion
@@ -6,7 +6,8 @@ enable_chunked_prefill: true
 attn_backend: trtllm
 model_factory: AutoModelForCausalLM
 skip_loading_weights: false
-cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 24, 32, 64, 128, 256, 320, 384]
+cuda_graph_config:
+  batch_sizes: [1, 2, 4, 8, 16, 24, 32, 64, 128, 256, 320, 384]
 kv_cache_config:
   free_gpu_memory_fraction: 0.88
   # tunable mamba cache dtype

examples/auto_deploy/model_registry/configs/nemotron-nano-9b-v2.yaml

Lines changed: 2 additions & 1 deletion
@@ -19,7 +19,8 @@ max_num_tokens: 8192

 skip_loading_weights: false

+cuda_graph_config:
+  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128]
 transforms:
   compile_model:
     backend: torch-cudagraph
-    cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128]

examples/auto_deploy/model_registry/configs/qwen3.5_moe_35b.yaml

Lines changed: 2 additions & 1 deletion
@@ -4,7 +4,8 @@ attn_backend: trtllm
 max_seq_len: 8192
 max_num_tokens: 4096
 max_batch_size: 512
-cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
+cuda_graph_config:
+  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
 enable_chunked_prefill: true
 model_factory: Qwen3_5MoeForConditionalGeneration
 kv_cache_config:

examples/auto_deploy/model_registry/configs/qwen3.5_moe_400b.yaml

Lines changed: 2 additions & 1 deletion
@@ -4,7 +4,8 @@ attn_backend: trtllm
 max_seq_len: 262144
 max_num_tokens: 8192
 max_batch_size: 32
-cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
+cuda_graph_config:
+  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
 enable_chunked_prefill: true
 model_factory: Qwen3_5MoeForConditionalGeneration
 kv_cache_config:
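For existing configs outside this repository, the key move shown in the diffs above is mechanical. A hypothetical migration helper, not part of this commit, sketching that transformation on a config dict:

def migrate_cuda_graph_key(config: dict) -> dict:
    # Move the removed flat key into the nested location the configs
    # above now use: cuda_graph_config.batch_sizes.
    batch_sizes = config.pop("cuda_graph_batch_sizes", None)
    if batch_sizes is not None:
        config.setdefault("cuda_graph_config", {})["batch_sizes"] = batch_sizes
    return config


assert migrate_cuda_graph_key({"cuda_graph_batch_sizes": [1, 2, 4]}) == {
    "cuda_graph_config": {"batch_sizes": [1, 2, 4]}
}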
