
Commit 76b6fd5

ChenhanYu and claude authored
fix: DFlash regression tests and vLLM server liveness (#1288)
## Summary

- **hf_online_dflash.yaml**: Add 100K-sample training config with regression baselines (B200 loss curve), `MAX_FINAL_LOSS`/`MIN_FINAL_ACC`/`MIN_ACCEPTANCE_LENGTH` thresholds, vLLM nightly container for DFlash support
- **vllm_smoke_test.sh**: Parse acceptance length from vLLM server log for regression check; `pip install pandas` workaround for broken nightly container; capture server output to temp file
- **query.sh**: Detect vLLM server death during startup (PID liveness check) + 600s timeout to prevent infinite polling that wastes GPU hours; `pip install pandas` workaround
- Fix empty `environment:` key in DFlash YAML causing nemo_run `ListParseError`

## Test plan

- [x] E2E pipeline passed on 8x B200 (training + vLLM smoke test + AR eval)
- [x] Training regression: final loss 3.82 < 5.0, acc 0.20 > 0.15
- [x] vLLM acceptance length: 1.79 >= 1.4 threshold
- [x] AR evaluation: 2.02 overall on MT-Bench (8 categories)
- [x] Server liveness check prevents GPU waste on vLLM crash

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Added optional regression validation for vLLM acceptance metrics
  * Introduced configurable vLLM server startup timeout (default 600 seconds)

* **Improvements**
  * Enhanced logging for vLLM server startup with progress tracking and waited time reporting
  * Faster detection of vLLM server process failures during initialization

* **Configuration Updates**
  * Increased training dataset size and logging granularity
  * Scaled tensor parallelism from 4 to 8 across multiple pipelines
  * Expanded PTQ quantization to multi-step pipeline
  * Added configurable training metric thresholds

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 2d868d3 commit 76b6fd5

11 files changed

Lines changed: 577 additions & 73 deletions


tools/launcher/common/megatron_lm/quantize/quantize.sh

Lines changed: 12 additions & 4 deletions
```diff
@@ -36,10 +36,18 @@ CONVERT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/convert.sh
 EXPORT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/export.sh"
 
 export MLM_EXTRA_ARGS=${@}
-${QUANTIZE_EXE} ${MLM_MODEL_CFG} ${QUANT_CFG}
-
-export MLM_EXTRA_ARGS="--mmlu-dataset ${MMLU_DATASET:-/hf-local/cais/mmlu} --fraction 0.01 --lower-bound 0.38 --disable-tqdm"
-MLM_MODEL_CKPT=${MLM_MODEL_SAVE} ${MMLU_EXE} ${MLM_MODEL_CFG}
+TP=${TP:-1} PP=${PP:-1} EP=${EP:-1} ETP=${ETP:-1} ${QUANTIZE_EXE} ${MLM_MODEL_CFG} ${QUANT_CFG}
+
+export MLM_EXTRA_ARGS="--mmlu-dataset ${MMLU_DATASET:-/hf-local/cais/mmlu} --fraction 0.01 --lower-bound ${MMLU_LOWER_BOUND:-0.38} --disable-tqdm"
+TP=${TP:-1} PP=${PP:-1} EP=${EP:-1} ETP=${ETP:-1} MLM_MODEL_CKPT=${MLM_MODEL_SAVE} ${MMLU_EXE} ${MLM_MODEL_CFG}
+
+# Export quantized checkpoint to HF format (PP=all GPUs)
+TOTAL_GPUS=$(python3 -c "import torch; print(torch.cuda.device_count())" 2>/dev/null || echo ${NUM_GPUS:-1})
+echo "=== Exporting ${MLM_MODEL_CFG} ${QUANT_CFG} (PP=${TOTAL_GPUS}) ==="
+export MLM_EXTRA_ARGS=
+TP=1 PP=${TOTAL_GPUS} EP=1 ETP=1 MLM_MODEL_CKPT=${MLM_MODEL_SAVE} ${EXPORT_EXE} ${MLM_MODEL_CFG}
+ls ${EXPORT_DIR}
+cat ${EXPORT_DIR}/hf_quant_config.json
 
 ###################################################################################################
```
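The new export stage sizes pipeline parallelism to every visible GPU. A minimal sketch of that fallback chain, assuming `torch` is importable in the container and with `NUM_GPUS` standing in for a launcher-provided variable:

```bash
#!/bin/bash
# Sketch: resolve the GPU count the same way the new export stage does.
# Preference order: torch.cuda.device_count() -> $NUM_GPUS -> 1.
NUM_GPUS=8   # assumed to be exported by the Slurm/launcher environment
TOTAL_GPUS=$(python3 -c "import torch; print(torch.cuda.device_count())" 2>/dev/null || echo ${NUM_GPUS:-1})

# The HF export then runs with pipeline parallelism only, spanning all GPUs.
echo "Export parallelism: TP=1 PP=${TOTAL_GPUS} EP=1 ETP=1"
```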

tools/launcher/common/megatron_lm/quantize/task.py

Lines changed: 12 additions & 1 deletion
```diff
@@ -66,6 +66,10 @@ class MegatronLMQuantizeConfig:
     model: str = "Qwen/Qwen3-8B"
     quant_cfg: str = "NVFP4_DEFAULT_CFG"
     tp: int = 4
+    pp: int = 1
+    ep: int = 1
+    etp: int = 1
+    extra_args: str = ""
     calib_dataset: str = "abisee/cnn_dailymail"
     calib_size: int = 32
     mmlu_dataset: str = "cais/mmlu"
@@ -92,14 +96,21 @@ def __post_init__(self):
         if self.config is not None:
             c = self.config
             self.script = self.script or "common/megatron_lm/quantize/quantize.sh"
-            self.args = [
+            args = [
                 f"--calib-dataset-path-or-name {c.hf_local}{c.calib_dataset}",
                 f"--calib-size {c.calib_size}",
             ]
+            if c.extra_args:
+                args.append(c.extra_args)
+            self.args = args
             self.environment = [
                 {"MLM_MODEL_CFG": c.model},
                 {"QUANT_CFG": c.quant_cfg},
                 {"HF_MODEL_CKPT": f"{c.hf_local}{c.model}"},
                 {"MMLU_DATASET": f"{c.hf_local}{c.mmlu_dataset}"},
                 {"TP": str(c.tp)},
+                {"PP": str(c.pp)},
+                {"EP": str(c.ep)},
+                {"ETP": str(c.etp)},
+                {"MMLU_LOWER_BOUND": str(c.mmlu_lower_bound)},
             ]
```
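For readability, a hedged sketch of the shell invocation this config ultimately resolves to once the launcher exports the `environment` entries and joins the `args`; the values are taken from the Qwen3-30B-A3B PTQ YAML added below, and composing the command by hand like this is illustrative only:

```bash
#!/bin/bash
# Illustrative only: nemo_run composes this; shown flattened for clarity.
export MLM_MODEL_CFG="Qwen/Qwen3-30B-A3B"
export QUANT_CFG="NVFP4_DEFAULT_CFG"
export HF_MODEL_CKPT="/hf-local/Qwen/Qwen3-30B-A3B"
export MMLU_DATASET="/hf-local/cais/mmlu"
export TP=1 PP=1 EP=8 ETP=1          # new parallelism knobs
export MMLU_LOWER_BOUND=0.75         # new configurable MMLU gate

bash common/megatron_lm/quantize/quantize.sh \
    --calib-dataset-path-or-name /hf-local/abisee/cnn_dailymail \
    --calib-size 32                  # plus config.extra_args, appended verbatim when set
```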

tools/launcher/common/specdec/vllm_smoke_test.sh

Lines changed: 31 additions & 4 deletions
```diff
@@ -32,7 +32,10 @@
 SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
 source ${SCRIPT_DIR}/../service_utils.sh 2>/dev/null || true
 
-cleanup() { kill $SERVER_PID 2>/dev/null; sleep 2; kill -9 $SERVER_PID 2>/dev/null; }
+# Ensure pandas is available (missing in some vLLM nightly builds)
+pip install pandas 2>/dev/null || true
+
+cleanup() { kill $SERVER_PID 2>/dev/null; sleep 2; kill -9 $SERVER_PID 2>/dev/null; rm -f "${VLLM_LOG:-}" 2>/dev/null; }
 trap cleanup EXIT
 
 MODEL=${HF_MODEL_CKPT}
@@ -72,22 +75,23 @@ if [ "${DISABLE_PREFIX_CACHING:-}" = "1" ]; then
     OPTIONAL_ARGS="${OPTIONAL_ARGS} --no-enable-prefix-caching"
 fi
 
-# Start vLLM server
+# Start vLLM server (capture output for regression check parsing)
+VLLM_LOG=$(mktemp /tmp/vllm_server_XXXXXX.log)
 if [ -n "$SPEC_CONFIG" ]; then
     vllm serve ${MODEL} \
         --speculative-config "${SPEC_CONFIG}" \
         --max-num-batched-tokens 32768 \
         --tensor-parallel-size ${TP} \
         --port ${PORT} \
         ${OPTIONAL_ARGS} \
-        &
+        > >(tee -a "$VLLM_LOG") 2>&1 &
 else
     vllm serve ${MODEL} \
         --max-num-batched-tokens 32768 \
         --tensor-parallel-size ${TP} \
         --port ${PORT} \
         ${OPTIONAL_ARGS} \
-        &
+        > >(tee -a "$VLLM_LOG") 2>&1 &
 fi
 SERVER_PID=$!
 
@@ -168,4 +172,27 @@ if [ $FAIL -gt 0 ]; then
     exit 1
 fi
 
+# Regression check: minimum acceptance length for speculative decoding
+if [ -n "${MIN_ACCEPTANCE_LENGTH:-}" ]; then
+    # Parse mean acceptance length from vLLM's SpecDecoding metrics log.
+    # vLLM logs: "SpecDecoding metrics: Mean acceptance length: X.XX, ..."
+    # Take the last reported value (most accurate, covers all prompts).
+    AVG_ACCEPT=$(grep -oP 'Mean acceptance length: \K[0-9.]+' "$VLLM_LOG" 2>/dev/null | tail -1 || true)
+    if [ -n "$AVG_ACCEPT" ]; then
+        echo ""
+        echo "=== Acceptance Length Regression Check ==="
+        echo " Mean acceptance length: ${AVG_ACCEPT}"
+        echo " Threshold: ${MIN_ACCEPTANCE_LENGTH}"
+        PASS_CHECK=$(python3 -c "print('yes' if float('${AVG_ACCEPT}') >= float('${MIN_ACCEPTANCE_LENGTH}') else 'no')")
+        if [ "$PASS_CHECK" = "yes" ]; then
+            echo " PASS: ${AVG_ACCEPT} >= ${MIN_ACCEPTANCE_LENGTH}"
+        else
+            echo " REGRESSION: ${AVG_ACCEPT} < ${MIN_ACCEPTANCE_LENGTH}"
+            exit 1
+        fi
+    else
+        echo "WARNING: Could not parse acceptance length from vLLM log, skipping regression check"
+    fi
+fi
+
 echo "Done"
```
Lines changed: 59 additions & 0 deletions
```bash
#!/bin/bash

# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"

###################################################################################################

if [[ -z ${HF_MODEL_CKPT} ]]; then
    export HF_MODEL_CKPT=/scratchspace/export
fi

if [[ -z ${TP} ]]; then
    TP=4
fi

if [[ -z ${EP} ]]; then
    EP=4
fi

if [[ -z ${EXTRA_LLM_API_OPTIONS} ]]; then
    EXTRA_LLM_API_OPTIONS=common/tensorrt_llm/extra_llm_api_options.yaml
fi


TARGET_FILENAME="config.json"


# Find all files matching the target filename, print their paths null-terminated
find "${HF_MODEL_CKPT}" -type f -name "$TARGET_FILENAME" -print0 | while IFS= read -r -d '' filepath; do
    # Extract the directory path from the full file path
    dir_path=$(dirname "$filepath")

    echo "Processing model: $dir_path"
    # Place your commands here to run within or on the $dir_path
    # Example: cd "$dir_path" && some_command

    trtllm-llmapi-launch trtllm-eval \
        --model ${dir_path} \
        --disable_kv_cache_reuse \
        --tp_size ${TP} \
        --ep_size ${EP} \
        --trust_remote_code \
        --extra_llm_api_options ${EXTRA_LLM_API_OPTIONS} \
        mmlu
done
```
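A hedged usage sketch for the new evaluation script; its repository path is not shown in this view, so the filename below is a placeholder:

```bash
#!/bin/bash
# Evaluate every exported HF checkpoint (any directory containing config.json)
# under HF_MODEL_CKPT with trtllm-eval on MMLU. TP/EP default to 4 in the script.
export HF_MODEL_CKPT=/scratchspace/export
export TP=8 EP=8
export EXTRA_LLM_API_OPTIONS=common/tensorrt_llm/extra_llm_api_options.yaml

bash trtllm_eval_mmlu.sh   # placeholder name for the new script above
```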
Lines changed: 52 additions & 0 deletions
```yaml
context_parallel_size: 1

# backend: _autodeploy
# reasoning_parser: nano-v3
# tool_parser: qwen3_coder
#
# runtime: trtllm
# compile_backend: torch-cudagraph
# max_batch_size: 64
# max_seq_len: 16384
# enable_chunked_prefill: true
# attn_backend: flashinfer
# model_factory: AutoModelForCausalLM
# skip_loading_weights: false
# free_mem_ratio: 0.65
# cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 24, 32, 64, 128, 256, 320, 384]
# kv_cache_config:
#   # disable kv_cache reuse since not supported for hybrid/ssm models
#   enable_block_reuse: false
# transforms:
#   detect_sharding:
#     sharding_dims: ['ep', 'bmm']
#     allreduce_strategy: 'AUTO'
#     manual_config:
#       head_dim: 128
#       tp_plan:
#         # mamba SSM layer
#         "in_proj": "mamba"
#         "out_proj": "rowwise"
#         # attention layer
#         "q_proj": "colwise"
#         "k_proj": "colwise"
#         "v_proj": "colwise"
#         "o_proj": "rowwise"
#         # NOTE: consider not sharding shared experts and/or
#         # latent projections at all, keeping them replicated.
#         # To do so, comment out the corresponding entries.
#         # moe layer: SHARED experts
#         "up_proj": "colwise"
#         "down_proj": "rowwise"
#         # MoLE: latent projections: simple shard
#         "fc1_latent_proj": "gather"
#         "fc2_latent_proj": "gather"
#   multi_stream_moe:
#     stage: compile
#     enabled: true
#   insert_cached_ssm_attention:
#     cache_config:
#       mamba_dtype: float32
#   fuse_mamba_a_log:
#     stage: post_load_fusion
#     enabled: true
```

tools/launcher/common/vllm/query.sh

Lines changed: 18 additions & 2 deletions
```diff
@@ -58,6 +58,9 @@ source ${SCRIPT_DIR}/../service_utils.sh
 # gpus_per_node: 4
 ###################################################################################################
 
+# Ensure pandas is available (missing in some vLLM nightly builds)
+pip install pandas 2>/dev/null || true
+
 export OPENAI_API_KEY="token-abc123"
 
 if [ -z ${SLURM_ARRAY_TASK_ID} ]; then
@@ -108,13 +111,26 @@ SERVER_PID=$!
 
 # Wait for server to start up by polling the health endpoint
 echo "Waiting for server to start..."
+MAX_WAIT=${VLLM_STARTUP_TIMEOUT:-600}
+WAITED=0
 while true; do
+    if ! kill -0 $SERVER_PID 2>/dev/null; then
+        echo "ERROR: vLLM server process died during startup"
+        wait $SERVER_PID 2>/dev/null
+        exit 1
+    fi
     response=$(curl -s -o /dev/null -w "%{http_code}" "http://$(hostname -f):8000/health" || true)
     if [ "$response" -eq 200 ]; then
-        echo "Server is up!"
+        echo "Server is up! (waited ${WAITED}s)"
         break
     fi
-    echo "Server not ready yet, retrying in 10 seconds..."
+    WAITED=$((WAITED + 10))
+    if [ $WAITED -ge $MAX_WAIT ]; then
+        echo "ERROR: vLLM server failed to start within ${MAX_WAIT}s"
+        kill $SERVER_PID 2>/dev/null
+        exit 1
+    fi
+    echo "Server not ready yet (${WAITED}/${MAX_WAIT}s), retrying in 10 seconds..."
     sleep 10
 done
```
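Because the startup deadline is read from an environment variable, a job that needs a longer warm-up can raise it without editing the script; a minimal sketch, with the 1200 s value chosen arbitrarily for illustration:

```bash
#!/bin/bash
# Give vLLM up to 20 minutes to come up instead of the default 600 s.
# If the server PID dies or the deadline passes, query.sh now exits 1
# instead of polling the /health endpoint forever.
export VLLM_STARTUP_TIMEOUT=1200
bash tools/launcher/common/vllm/query.sh
```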

Lines changed: 53 additions & 0 deletions
```yaml
# Qwen3-30B-A3B PTQ quantization (8 GPUs, MoE model).
#
# 2-step pipeline: NVFP4 then FP8, each followed by MMLU evaluation.
# MMLU uses EP for expert parallelism.
#
# Usage:
#   uv run launch.py --yaml examples/Qwen/Qwen3-30B-A3B/megatron_lm_ptq.yaml --yes

job_name: Qwen3-30B-A3B_PTQ
pipeline:
  skip: false
  allow_to_fail: false
  note:

  task_0:
    _target_: common.megatron_lm.quantize.task.MegatronLMQuantizeTask
    config:
      model: Qwen/Qwen3-30B-A3B
      quant_cfg: NVFP4_DEFAULT_CFG
      tp: 1
      pp: 1
      ep: 8
      etp: 1
      calib_dataset: abisee/cnn_dailymail
      calib_size: 32
      mmlu_dataset: cais/mmlu
      mmlu_lower_bound: 0.75
      hf_local: /hf-local/
    slurm_config:
      _factory_: "slurm_factory"
      nodes: 1
      ntasks_per_node: 8
      gpus_per_node: 8

  task_1:
    _target_: common.megatron_lm.quantize.task.MegatronLMQuantizeTask
    config:
      model: Qwen/Qwen3-30B-A3B
      quant_cfg: FP8_DEFAULT_CFG
      tp: 1
      pp: 1
      ep: 8
      etp: 1
      calib_dataset: abisee/cnn_dailymail
      calib_size: 32
      mmlu_dataset: cais/mmlu
      mmlu_lower_bound: 0.75
      hf_local: /hf-local/
    slurm_config:
      _factory_: "slurm_factory"
      nodes: 1
      ntasks_per_node: 8
      gpus_per_node: 8
```

tools/launcher/examples/Qwen/Qwen3-8B/hf_offline_eagle3.yaml

Lines changed: 11 additions & 11 deletions
```diff
@@ -27,8 +27,8 @@ pipeline:
     script: common/tensorrt_llm/query.sh
     args:
       - --model <<global_vars.hf_model>>
-      - --tp_size 4
-      - --ep_size 4
+      - --tp_size 8
+      - --ep_size 8
       - --max_num_tokens 32000
       - --port 8000
       - --host 0.0.0.0
@@ -41,8 +41,8 @@
     slurm_config:
       _factory_: "slurm_factory"
       nodes: 1
-      ntasks_per_node: 4
-      gpus_per_node: 4
+      ntasks_per_node: 8
+      gpus_per_node: 8
       container: nvcr.io/nvidia/tensorrt-llm/release:1.2.0
 
   # Step 2: Dump hidden states from target model
@@ -52,15 +52,15 @@ pipeline:
       - --input-data /scratchspace/data
       - --output-dir /scratchspace/offline_hidden_states
       - --max-seq-len 8192
-      - --tp 4
-      - --moe-ep 4
+      - --tp 8
+      - --moe-ep 8
     environment:
       - HF_MODEL_CKPT: <<global_vars.hf_model>>
     slurm_config:
      _factory_: "slurm_factory"
       nodes: 1
-      ntasks_per_node: 4
-      gpus_per_node: 4
+      ntasks_per_node: 8
+      gpus_per_node: 8
       container: nvcr.io/nvidia/tensorrt-llm/release:1.2.0
 
   # Step 3: Train EAGLE3 draft head (offline, single task)
@@ -78,7 +78,7 @@ pipeline:
       _factory_: "slurm_factory"
       nodes: 1
       ntasks_per_node: 1
-      gpus_per_node: 4
+      gpus_per_node: 8
       container: nvcr.io/nvidia/tensorrt-llm/release:1.2.0
 
   # Step 4: Benchmark speculative decoding (VLLM backend)
@@ -89,7 +89,7 @@ pipeline:
       - --draft_length 3
       - --output_length 4096
       - --engine VLLM
-      - --tp_size 4
+      - --tp_size 8
       - --ep_size 1
       - --speculative_algorithm EAGLE3
       - --mtbench /hf-local/HuggingFaceH4/mt_bench_prompts/raw/question.jsonl
@@ -100,5 +100,5 @@ pipeline:
       _factory_: "slurm_factory"
       nodes: 1
       ntasks_per_node: 1
-      gpus_per_node: 4
+      gpus_per_node: 8
       container: vllm/vllm-openai:latest
```
