Commit ce5f6a0

[TRTLLM-12154][test] Add Qwen3-32B FP8 disagg stress test
Initial wire-up for a Qwen3-32B FP8 disagg stress test on 8x H200 DGX (4x TP1 prefill + 1x TP4 decode). New disagg config (disagg_config_ctxtp1_gentp4_qwen3_32b_fp8.yaml) exercises chunked prefill, KV block reuse across 4 ctx instances (kv_cache_aware router + event buffer), FP8 KV cache, disagg cache transfer, and the structured-output backend selection (guided_decoding_backend: xgrammar).

Two test entries share the same YAML:

- test_disaggregated_qwen3_32b_fp8 (light): exercises the config end-to-end via the standard prompts.json client loop. Wired into l0_dgx_h200.yml post-merge so each merge to main verifies the config still loads and serves. Local pytest run completes in ~5-10 minutes.
- test_disaggregated_stress_test::qwen3_32b_fp8_stress: the long-running variant for the QA weekly stress lane (request_count=10000, accuracy_threshold=0.30 as conservative initial defaults; expect to tighten after the first baseline run). Wired into qa/llm_function_stress.txt alongside the existing deepseek/gpt-oss stress entries.

Marked skip_pre_hopper on both (vs the existing Blackwell-only entries) because the target is H200.

Eagle3 is deferred (TODO in YAML): NVIDIA's HF speculative-decoding collection doesn't currently ship a draft for dense Qwen3-32B, and Eagle3 is mutually exclusive with enable_block_reuse when KV is FP8 per examples/models/core/qwen/README.md.

Signed-off-by: Brian Nguyen <brnguyen@nvidia.com>
1 parent 22bafe4 commit ce5f6a0

4 files changed

Lines changed: 70 additions & 0 deletions

disagg_config_ctxtp1_gentp4_qwen3_32b_fp8.yaml

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+# TODO: Enable Eagle3 once a Qwen3-32B draft is available (TRTLLM-12154);
+# also requires turning off enable_block_reuse with FP8 KV.
+hostname: localhost
+model: Qwen3/Qwen3-32B-FP8
+backend: pytorch
+cuda_graph_config: null
+guided_decoding_backend: xgrammar
+context_servers:
+  num_instances: 4
+  tensor_parallel_size: 1
+  pipeline_parallel_size: 1
+  router:
+    type: kv_cache_aware
+  enable_chunked_prefill: true
+  max_num_tokens: 4096
+  max_seq_len: 10240
+  max_batch_size: 128
+  disable_overlap_scheduler: true
+  print_iter_log: true
+  kv_cache_config:
+    enable_block_reuse: true
+    enable_partial_reuse: true
+    dtype: fp8
+    free_gpu_memory_fraction: 0.8
+    event_buffer_max_size: 1024
+  cache_transceiver_config:
+    backend: DEFAULT
+    max_tokens_in_buffer: 16384
+generation_servers:
+  num_instances: 1
+  tensor_parallel_size: 4
+  pipeline_parallel_size: 1
+  enable_chunked_prefill: true
+  max_num_tokens: 4096
+  max_seq_len: 10240
+  max_batch_size: 128
+  print_iter_log: true
+  kv_cache_config:
+    enable_block_reuse: true
+    enable_partial_reuse: true
+    dtype: fp8
+    free_gpu_memory_fraction: 0.8
+  cache_transceiver_config:
+    backend: DEFAULT
+    max_tokens_in_buffer: 16384
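
The 8-GPU requirement (and the skip_less_device(8) marks in the test changes) follows from the parallelism settings in this config. A minimal sketch of that arithmetic, assuming one GPU per rank; `gpus_for` and `total_gpus` are hypothetical helper names for illustration, not TensorRT-LLM APIs, and only the parallelism keys are copied from the YAML:

```python
# Parallelism keys copied from the disagg config; all other keys omitted.
SERVERS = {
    "context_servers": {
        "num_instances": 4,
        "tensor_parallel_size": 1,
        "pipeline_parallel_size": 1,
    },
    "generation_servers": {
        "num_instances": 1,
        "tensor_parallel_size": 4,
        "pipeline_parallel_size": 1,
    },
}


def gpus_for(section: dict) -> int:
    """GPUs used by one server role, assuming one GPU per rank."""
    return (section["num_instances"]
            * section["tensor_parallel_size"]
            * section["pipeline_parallel_size"])


def total_gpus(servers: dict) -> int:
    """Sum the GPU demand over all server roles."""
    return sum(gpus_for(section) for section in servers.values())


if __name__ == "__main__":
    for role, section in SERVERS.items():
        print(f"{role}: {gpus_for(section)} GPU(s)")
    print(f"total: {total_gpus(SERVERS)} GPU(s)")
```

Under that assumption the four TP1 prefill instances and the single TP4 decode instance each take four GPUs, which fills one 8x H200 node.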

tests/integration/defs/disaggregated/test_disaggregated.py

Lines changed: 23 additions & 0 deletions
@@ -261,6 +261,8 @@ def get_test_config(test_desc, example_dir, test_root):
         f"{test_configs_root}/disagg_config_ctxtp4_gentp4_deepseek_r1_v2_fp4_tllm.yaml",
         "gpt_oss_120b_stress":
         f"{test_configs_root}/disagg_config_ctxtp2_gentp2_gptoss_tllm.yaml",
+        "qwen3_32b_fp8_stress":
+        f"{test_configs_root}/disagg_config_ctxtp1_gentp4_qwen3_32b_fp8.yaml",
         "gpt_oss_120b_harmony":
         f"{test_configs_root}/disagg_config_ctxtp2_gentp2_gptoss_tllm.yaml",
         "cancel_stress_test":
@@ -2087,6 +2089,22 @@ def test_disaggregated_gpt_oss_120b_harmony(disaggregated_test_root,
                            cwd=llm_venv.get_working_directory())


+@skip_pre_hopper
+@pytest.mark.skip_less_device(8)
+@pytest.mark.parametrize("model_path", ['Qwen3/Qwen3-32B-FP8'])
+def test_disaggregated_qwen3_32b_fp8(disaggregated_test_root,
+                                     disaggregated_example_root, llm_venv,
+                                     model_path):
+    model_dir = f"{llm_models_root()}/{model_path}"
+    setup_model_symlink(llm_venv, model_dir, model_path)
+
+    run_disaggregated_test(disaggregated_example_root,
+                           "qwen3_32b_fp8_stress",
+                           env=llm_venv._new_env,
+                           model_path=model_dir,
+                           cwd=llm_venv.get_working_directory())
+
+
 @pytest.mark.timeout(12600)
 @pytest.mark.parametrize("test_config", [
     pytest.param(TestConfig(model_path='DeepSeek-R1/DeepSeek-R1-0528-FP4-v2',
pytest.param(TestConfig(model_path='DeepSeek-R1/DeepSeek-R1-0528-FP4-v2',
@@ -2099,6 +2117,11 @@ def test_disaggregated_gpt_oss_120b_harmony(disaggregated_test_root,
                             request_count=60000,
                             accuracy_threshold=0.42),
                  marks=(pytest.mark.skip_less_device(4), skip_pre_blackwell)),
+    pytest.param(TestConfig(model_path='Qwen3/Qwen3-32B-FP8',
+                            test_desc='qwen3_32b_fp8_stress',
+                            request_count=10000,
+                            accuracy_threshold=0.30),
+                 marks=(pytest.mark.skip_less_device(8), skip_pre_hopper)),
 ],
                          ids=lambda x: x.test_desc)
 @pytest.mark.parametrize("concurrency", [512], ids=lambda x: f"conc{x}")
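
Both the light test and the stress entry pass the same key, "qwen3_32b_fp8_stress", so they resolve to the same YAML. A rough sketch of that lookup, under the assumption that the config map is a plain dict keyed by test description; `CONFIG_FILES` and `resolve_config` are hypothetical names and the real get_test_config may be structured differently:

```python
# Hypothetical stand-in for the mapping inside get_test_config.
# Only entries visible in this diff are listed.
CONFIG_FILES = {
    "gpt_oss_120b_stress": "disagg_config_ctxtp2_gentp2_gptoss_tllm.yaml",
    "qwen3_32b_fp8_stress": "disagg_config_ctxtp1_gentp4_qwen3_32b_fp8.yaml",
    "gpt_oss_120b_harmony": "disagg_config_ctxtp2_gentp2_gptoss_tllm.yaml",
}


def resolve_config(test_desc: str, test_configs_root: str) -> str:
    """Return the YAML path for a test description, failing loudly if unknown."""
    try:
        file_name = CONFIG_FILES[test_desc]
    except KeyError:
        raise ValueError(f"no disagg config registered for {test_desc!r}")
    return f"{test_configs_root}/{file_name}"


if __name__ == "__main__":
    # "/configs" is a placeholder root used only for this illustration.
    print(resolve_config("qwen3_32b_fp8_stress", "/configs"))
```

Sharing one key keeps the post-merge check and the weekly stress run on an identical server configuration, so a config regression caught by the light test also protects the long run.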

tests/integration/test_lists/qa/llm_function_stress.txt

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ stress_test/stress_test.py::test_run_stress_test[DeepSeek-R1-0528-FP4_tp4-stress
 stress_test/stress_test.py::test_run_stress_test[DeepSeek-R1-0528-FP4_tp4-stress_time_3600s_timeout_10800s-MAX_UTILIZATION-pytorch-stress-test-with-accuracy]
 disaggregated/test_disaggregated.py::test_disaggregated_stress_test[input8k-output1k-conc512-deepseek_r1_v2_fp4_stress]
 disaggregated/test_disaggregated.py::test_disaggregated_stress_test[input8k-output1k-conc512-gpt_oss_120b_stress]
+disaggregated/test_disaggregated.py::test_disaggregated_stress_test[input8k-output1k-conc512-qwen3_32b_fp8_stress]
 accuracy/test_llm_api_pytorch.py::TestDeepSeekR1LongBenchV2::test_fp8_8gpus
 accuracy/test_llm_api_pytorch.py::TestDeepSeekR1LongBenchV2::test_nvfp4_4gpus
 accuracy/test_llm_api_pytorch.py::TestKimiK2::test_nvfp4_longseq_trtllm_moe_stress
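
The bracketed part of the new list entry has to match the ID pytest generates from the stacked parametrize decorators. With stacked decorators, pytest puts the ID from the decorator closest to the function first, so the test_config ID (the topmost decorator in the diff) ends up last. A small sketch of that ordering; `compose_test_id` and `node_id` are hypothetical helpers, and treating "input8k-output1k" as a single bottom decorator is an assumption, since that decorator is not visible in this diff:

```python
def compose_test_id(ids_top_to_bottom):
    """Join per-decorator IDs in pytest's order for stacked parametrize marks.

    The decorator nearest the function contributes the leftmost part,
    so the top-to-bottom list is reversed before joining.
    """
    return "-".join(reversed(ids_top_to_bottom))


def node_id(file_path, test_name, ids_top_to_bottom):
    """Build a test-list entry of the form path::test[id]."""
    return f"{file_path}::{test_name}[{compose_test_id(ids_top_to_bottom)}]"


if __name__ == "__main__":
    # IDs listed in decorator order, top to bottom (last entry is assumed).
    stacked = ["qwen3_32b_fp8_stress", "conc512", "input8k-output1k"]
    print(node_id("disaggregated/test_disaggregated.py",
                  "test_disaggregated_stress_test", stacked))
```

If the test_desc string in the TestConfig entry changes, the list entry here must change with it, or the QA lane will not collect the test.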

tests/integration/test_lists/test-db/l0_dgx_h200.yml

Lines changed: 1 addition & 0 deletions
@@ -42,6 +42,7 @@ l0_dgx_h200:
   - disaggregated/test_disaggregated.py::test_disaggregated_ctxtp2pp2_gentp2pp2[TinyLlama-1.1B-Chat-v1.0]
   - disaggregated/test_disaggregated.py::test_disaggregated_ctxpp4_genpp4[TinyLlama-1.1B-Chat-v1.0]
   - disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_fp8_ctxtp2ep2pp2_gentp4_one_mtp_block_reuse[DeepSeek-V3-Lite-fp8]
+  - disaggregated/test_disaggregated.py::test_disaggregated_qwen3_32b_fp8[Qwen3/Qwen3-32B-FP8]
   - unittest/llmapi/test_llm_pytorch.py::test_nemotron_nas_lora
 - condition:
     ranges:
