
Commit 896efb2

thomasdhc authored and edjson committed

ci: Update test timeout and add ci_tests readme (NVIDIA-NeMo#1752)

* Update test timeout and add ci_tests readme
* Space out the finetune logging

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>

1 parent 7a041ed · commit 896efb2

File tree

7 files changed (+163, −5 lines)

examples/llm_finetune/nemotron/nemotron_nano_9b_squad.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -98,7 +98,7 @@ lr_scheduler:
 ci:
   vllm_deploy: true
   recipe_owner: HuiyingLi
-  time: "00:15:00"
+  time: "00:25:00"
   checkpoint_robustness:
     hf_kl_threshold: 5e-3
     distributed.tp_size: 2
```

examples/llm_finetune/nemotron/nemotron_nano_9b_squad_peft.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -105,7 +105,7 @@ lr_scheduler:
 ci:
   vllm_deploy: true
   recipe_owner: HuiyingLi
-  time: "00:15:00"
+  time: "00:25:00"
   checkpoint_robustness:
     hf_kl_threshold: 5e-3
     distributed.tp_size: 2
```

examples/llm_finetune/phi/phi_4_squad.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -100,7 +100,7 @@ optimizer:
 
 ci:
   recipe_owner: hemildesai
-  time: "00:25:00"
+  time: "00:35:00"
   node_multiplier: true
   vllm_deploy: true
   checkpoint_robustness:
```

examples/llm_finetune/phi/phi_4_squad_peft.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -102,7 +102,7 @@ optimizer:
 ci:
   vllm_deploy: true
   recipe_owner: HuiyingLi
-  time: "00:25:00"
+  time: "00:35:00"
   checkpoint_robustness:
     hf_kl_threshold: 1e-3
     tokenizer_name: microsoft/phi-4
```

examples/llm_finetune/qwen/qwen2_5_7b_squad_peft.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -103,7 +103,7 @@ optimizer:
 ci:
   vllm_deploy: true
   recipe_owner: HuiyingLi
-  time: "00:25:00"
+  time: "00:30:00"
   checkpoint_robustness:
     hf_kl_threshold: 8e-2
     distributed.tp_size: 2
```

tests/ci_tests/README.md

Lines changed: 133 additions & 0 deletions
# CI Tests

Configuration, scripts, and utilities for AutoModel's CI recipe validation pipeline.

## Directory Structure

```
ci_tests/
  configs/{test_folder}/
    nightly_recipes.yml           # Recipes included in nightly scope
    convergence_recipes.yml       # Recipes included in convergence scope (2x time)
    override_recipes.yml          # Exemptions, known issues
  scripts/
    finetune_launcher.sh          # Finetune + checkpoint robustness test runner
    vllm_launcher.sh              # vLLM deployment test runner
  golden_values/{test_folder}/
    {model}/{config}_{gpu}.jsonl  # Reference loss curves
  utils/
    generate_ci_tests.py          # Generates CI pipeline YAML from recipe configs
```
## Pipeline Generation

`generate_ci_tests.py` reads recipe lists from `configs/{test_folder}/` for the given scope, reads each recipe's `ci:` section from the YAML under `examples/`, and outputs a CI pipeline YAML with one job per recipe.

**Scopes:**

- **nightly** -- Recipes listed in `nightly_recipes.yml`
- **convergence** -- Recipes in `convergence_recipes.yml`, time automatically doubled
- **release** -- All recipe YAMLs found under `examples/{test_folder}/`

**Stage assignment** is based on recipe type and configuration:

| Stage | Criteria |
|-------|----------|
| `sft` / `peft` | No `checkpoint_robustness` |
| `sft_ckpt_robustness` / `peft_ckpt_robustness` | Has `checkpoint_robustness` |
| `sft_vllm_deploy` / `peft_vllm_deploy` | Has `vllm_deploy: true` |
| `benchmark` | Filename contains `benchmark` |

SFT vs PEFT is determined by whether `peft` appears in the recipe filename.
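A minimal sketch of these rules (the function name and precedence order are assumptions for illustration, not the actual `generate_ci_tests.py` internals; in particular, a recipe with both robustness and vLLM deploy produces multiple jobs, which this single-return sketch simplifies away):

```python
def assign_stage(filename: str, ci: dict) -> str:
    """Map a recipe filename and its ci: section to a CI stage."""
    if "benchmark" in filename:
        return "benchmark"
    # SFT vs PEFT comes from the filename, not the ci: section.
    mode = "peft" if "peft" in filename else "sft"
    if ci.get("vllm_deploy"):
        return f"{mode}_vllm_deploy"
    if "checkpoint_robustness" in ci:
        return f"{mode}_ckpt_robustness"
    return mode
```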
## Recipe CI Configuration

Each recipe YAML under `examples/` has an optional `ci:` section:

```yaml
ci:
  recipe_owner: username        # Required. Maintainer's handle
  time: "00:25:00"              # Required. SLURM wall time (HH:MM:SS)
  nodes: 2                      # Optional. SLURM node count (default: 1)
  node_multiplier: true         # Optional. Dynamic node scaling
  local_batch_size: 2           # Optional. Override batch size for CI
  vllm_deploy: true             # Optional. Enable vLLM deployment test
  checkpoint_robustness:        # Optional. Enable robustness testing
    hf_kl_threshold: 1e-3
    tokenizer_name: org/model
    no_check_resume: true       # Skip phase 5 (training resumption)
    # See checkpoint robustness section for all options
```
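As an illustration of the required/optional split above, a small validator (hypothetical; the pipeline's real checks live in `generate_ci_tests.py` and may differ):

```python
import re

def validate_ci(ci: dict) -> list:
    """Return a list of problems with a recipe's ci: section."""
    problems = []
    # recipe_owner and time are the two required keys.
    for key in ("recipe_owner", "time"):
        if key not in ci:
            problems.append(f"missing required key: {key}")
    # SLURM wall time must be HH:MM:SS.
    t = ci.get("time")
    if t is not None and not re.fullmatch(r"\d{2}:\d{2}:\d{2}", t):
        problems.append(f"time {t!r} is not HH:MM:SS")
    return problems
```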
## Checkpoint Robustness

When `checkpoint_robustness` is present, the robustness test runs after the finetune under the same SLURM allocation. It trains for 5 steps, saves a checkpoint, then validates through:

1. **Reference logits** -- Capture logits before teardown
2. **AutoModel reload** -- Reload from the consolidated checkpoint, verify KL = 0
3. **HF reload** -- Load into vanilla `transformers`/`peft`, verify KL below `hf_kl_threshold`
4. **Cross-TP** (optional) -- Reload with a different `tp_size`
5. **Training resumption** (on by default) -- Baseline + resumed run, verify loss continuity

Phase 5 is the most expensive (two additional training passes). Use `no_check_resume: true` to skip it.

`ci.time` must cover both the finetune and the robustness test. Estimated overhead:

- ~30% with `no_check_resume: true`
- ~50-60% with the resumption check (default)
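The KL comparisons in phases 2-3 amount to comparing the next-token distributions implied by two logit vectors; a toy sketch (illustrative only -- the real test operates on full model logits):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(ref_logits, new_logits):
    """KL(ref || new) between the distributions implied by two logit vectors."""
    p, q = softmax(ref_logits), softmax(new_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def reload_ok(ref_logits, new_logits, hf_kl_threshold=1e-3):
    # Phase 3 check: reloaded logits must stay within hf_kl_threshold.
    return kl_divergence(ref_logits, new_logits) <= hf_kl_threshold
```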
## How To

### Add a New Recipe to Nightly

1. Create the recipe YAML under `examples/{test_folder}/{model_family}/`
2. Add a `ci:` section with `recipe_owner` and `time`
3. Add the path to `configs/{test_folder}/nightly_recipes.yml`

### Enable Checkpoint Robustness

1. Add `checkpoint_robustness:` under `ci:` with at least `hf_kl_threshold` and `tokenizer_name`
2. Increase `ci.time` per the guidelines below
3. For large models, consider `no_check_resume: true`

### Enable vLLM Deploy

1. Add `vllm_deploy: true` under `ci:`
2. Robustness must also be enabled (the vLLM test loads from the robustness checkpoint)

### Add a New Test Folder

1. Create `examples/{new_folder}/` with recipe YAMLs
2. Create `configs/{new_folder}/` with `nightly_recipes.yml`, `convergence_recipes.yml`, and `override_recipes.yml`
3. Create `golden_values/{new_folder}/`
4. Add a CI job template for the new folder in the CI template file
5. Verify with `generate_ci_tests.py --test-folder {new_folder} --scope nightly`
### Exempt a Recipe

Edit `configs/{test_folder}/override_recipes.yml`:

```yaml
exempt_models:
  - model_family        # Skips all recipes under this folder

exempt_configs:
  config_stem:
    reason: "Description, PIC: @owner, issue#"

known_issue:
  - config_stem         # allow_failure instead of blocking
```
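The resolution order these three lists imply could be sketched as follows (the function name and return values are illustrative assumptions, not the pipeline's API):

```python
def recipe_status(config_stem: str, model_family: str, overrides: dict) -> str:
    """Resolve a recipe against override_recipes.yml content.

    Returns "skip" for exemptions, "allow_failure" for known issues,
    otherwise "run".
    """
    if model_family in overrides.get("exempt_models", []):
        return "skip"
    if config_stem in overrides.get("exempt_configs", {}):
        return "skip"
    if config_stem in overrides.get("known_issue", []):
        return "allow_failure"
    return "run"
```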
## Time Allocation Guidelines

`ci.time` covers the entire SLURM job: finetune, robustness (if enabled), model downloads, setup, and teardown.

| Model Size | Finetune Only | Robustness (`no_check_resume`) | Robustness (full) |
|------------|---------------|--------------------------------|-------------------|
| < 2B | 10 min | 15 min | 15 min |
| 2-5B | 12 min | 15 min | 20 min |
| 5-10B | 18 min | 25 min | 25-30 min |
| 10-20B | 22 min | 30 min | 35 min |
| 20-50B | 35 min | 45 min | 45 min |
| 50B+ | 50 min | 60 min | 60 min |

MoE models, multi-node jobs, and convergence scope (auto 2x) may need additional time. vLLM deploy runs as a separate job and does not consume finetune time.
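The table transcribes into a simple lookup (upper bounds taken where the table gives a range; this helper is illustrative, not part of the pipeline):

```python
import math

# (max_params_in_billions, finetune_only, robustness_no_resume, robustness_full), minutes
TIME_TABLE = [
    (2, 10, 15, 15),
    (5, 12, 15, 20),
    (10, 18, 25, 30),   # "25-30 min" -> upper bound
    (20, 22, 30, 35),
    (50, 35, 45, 45),
    (math.inf, 50, 60, 60),
]

def recommended_minutes(params_b, robustness=False, check_resume=True):
    """Look up the recommended ci.time budget for a model size."""
    for max_b, finetune, no_resume, full in TIME_TABLE:
        if params_b < max_b:
            if not robustness:
                return finetune
            return full if check_resume else no_resume
```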

tests/ci_tests/scripts/finetune_launcher.sh

Lines changed: 25 additions & 0 deletions
```diff
@@ -76,12 +76,37 @@ RUN_CMD="${CMD} ${TEST_SCRIPT_PATH} ${CONFIG} ${FINETUNE_ARGS}"
 echo "============================================"
 echo "[finetune] Running finetune..."
 echo "============================================"
+FINETUNE_START=$SECONDS
+
 eval $RUN_CMD
+FINETUNE_EXIT_CODE=$?
+
+FINETUNE_ELAPSED=$((SECONDS - FINETUNE_START))
+echo "{\"test\":\"${TEST_NAME}\",\"phase\":\"finetune\",\"seconds\":${FINETUNE_ELAPSED}}" >> $PIPELINE_DIR/$TEST_NAME/timing.jsonl
+echo "[timing] Finetune completed in ${FINETUNE_ELAPSED}s"
+
+if [[ "$FINETUNE_EXIT_CODE" -ne 0 ]]; then
+  echo "[finetune] Failed with exit code ${FINETUNE_EXIT_CODE}, skipping robustness test"
+  exit $FINETUNE_EXIT_CODE
+fi
 
 # --- Checkpoint Robustness ---
 if [[ "$HAS_ROBUSTNESS" == "true" ]]; then
   echo "============================================"
   echo "[checkpoint_robustness] Running robustness test..."
   echo "============================================"
+  ROBUSTNESS_START=$SECONDS
+
   eval $ROBUSTNESS_CMD
+  ROBUSTNESS_EXIT_CODE=$?
+
+  ROBUSTNESS_ELAPSED=$((SECONDS - ROBUSTNESS_START))
+  echo "{\"test\":\"${TEST_NAME}\",\"phase\":\"robustness\",\"seconds\":${ROBUSTNESS_ELAPSED}}" >> $PIPELINE_DIR/$TEST_NAME/timing.jsonl
+  echo "{\"test\":\"${TEST_NAME}\",\"phase\":\"total\",\"seconds\":$((SECONDS)),\"allocated\":\"${TIME}\"}" >> $PIPELINE_DIR/$TEST_NAME/timing.jsonl
+  echo "[timing] Robustness completed in ${ROBUSTNESS_ELAPSED}s (total: ${SECONDS}s)"
+
+  if [[ "$ROBUSTNESS_EXIT_CODE" -ne 0 ]]; then
+    echo "[checkpoint_robustness] Failed with exit code ${ROBUSTNESS_EXIT_CODE}"
+    exit $ROBUSTNESS_EXIT_CODE
+  fi
 fi
```
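The `timing.jsonl` entries the launcher appends are straightforward to aggregate afterwards; a sketch (the `summarize_timing` helper is an assumption for illustration, not part of the launcher):

```python
import json

def summarize_timing(jsonl_text):
    """Sum seconds per phase from launcher-emitted timing.jsonl lines."""
    totals = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        totals[entry["phase"]] = totals.get(entry["phase"], 0) + entry["seconds"]
    return totals
```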
