ci: shard tests to run more in parallel #2345
Open
chtruong814 wants to merge 56 commits into
Restructure unit test CI from 3 monolithic shards (Generation, Policy, Other) into 9 targeted shards split by extra/marker. Each extra-specific shard (mcore, automodel, vllm, sglang, nemo_gym) runs a single --*-only flag across all unit tests, while domain shards (models, environments, algorithms, other) run only base (unmarked) tests.

This eliminates the 5-6 sequential pytest invocations per shard, reduces the bottleneck from 90 min (Policy) to ~30 min per shard, and makes it clear where new tests should be added.

New shards:
- L0_Unit_Tests_Vllm: base vllm generation + --vllm-only catch-all
- L0_Unit_Tests_Sglang: base sglang files + --sglang-only catch-all
- L0_Unit_Tests_Mcore: --mcore-only catch-all
- L0_Unit_Tests_Automodel: --automodel-only catch-all
- L0_Unit_Tests_Nemo_Gym: --nemo-gym-only catch-all
- L0_Unit_Tests_Models: base model tests (minus generation)
- L0_Unit_Tests_Environments: base environment tests
- L0_Unit_Tests_Algorithms: base algorithm tests
- L0_Unit_Tests_Other: catch-all for remaining base tests + research

Also fixes run_unit.sh to treat pytest exit code 5 (no tests collected) as success, preventing shard failures when FAST exclusions remove all tests from a shard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
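The exit-code fix described above can be sketched as follows. This is a Python rendering of the idea, not the contents of run_unit.sh (which is a shell script not shown here); the function names `normalize_exit_code` and `run_shard` are hypothetical. The only fact relied on is that pytest exits with code 5 when no tests are collected.

```python
import subprocess
import sys
from typing import List

# pytest's documented exit code for "no tests were collected".
NO_TESTS_COLLECTED = 5


def normalize_exit_code(returncode: int) -> int:
    """Map 'no tests collected' to success; pass every other code through."""
    return 0 if returncode == NO_TESTS_COLLECTED else returncode


def run_shard(pytest_args: List[str]) -> int:
    """Run one pytest invocation for a shard and return the normalized code.

    Hypothetical helper: the real shard commands live in run_unit.sh.
    """
    completed = subprocess.run([sys.executable, "-m", "pytest", *pytest_args])
    return normalize_exit_code(completed.returncode)
```

Real failures (exit code 1) and usage errors (exit code 4) still propagate, so only the empty-shard case is treated as a pass.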
Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.
/ok to test
The truncated field depends on exact generation output from the tiny model, which is not reproducible across runs. Instead of comparing exact bool values, verify that each value is a bool type.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
The Mcore shard (50 min) and Automodel shard (38 min) are bottlenecked by heavy policy worker tests (test_megatron_worker.py and test_dtensor_worker*.py). Split each into two shards:
- L0_Unit_Tests_Mcore: mcore tests excluding unit/models/policy/ (~15 min)
- L0_Unit_Tests_Mcore_Policy: mcore tests from unit/models/policy/ only (~30 min)
- L0_Unit_Tests_Automodel: automodel tests excluding unit/models/policy/ (~10 min)
- L0_Unit_Tests_Automodel_Policy: automodel tests from unit/models/policy/ only (~28 min)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
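The marker-plus-path split above can be illustrated with a small routing function. This is a sketch under assumptions: the `CollectedTest` record and `shard_for` helper are hypothetical, the actual selection is done by pytest flags and paths in CI, and the example file locations are assumed rather than taken from the repository.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Path prefix named in the commit message for the heavy policy worker tests.
POLICY_PREFIX = "unit/models/policy/"


@dataclass(frozen=True)
class CollectedTest:
    """Hypothetical record of one test file: its path and its markers."""
    path: str
    markers: FrozenSet[str]


def shard_for(test: CollectedTest, extra: str) -> Optional[str]:
    """Return the shard that runs `test` for one extra (e.g. 'mcore').

    Returns None when the test does not carry that extra's marker.
    """
    if extra not in test.markers:
        return None
    base = f"L0_Unit_Tests_{extra.capitalize()}"
    # Tests under the policy directory go to the dedicated *_Policy shard.
    if test.path.startswith(POLICY_PREFIX):
        return base + "_Policy"
    return base
```

Because every marked test lands in exactly one of the two shards, the split changes where a test runs without changing whether it runs.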
Split L0_Unit_Tests_Other into three shards:
- L0_Unit_Tests_Data: data pipeline tests (datasets, processing, message utils)
- L0_Unit_Tests_Distributed: distributed infra tests (worker groups, virtual cluster, logprob)
- L0_Unit_Tests_Other: catch-all for remaining (experience, utils, tools, evals, rewards, root tests)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
The qwen2 parametrizations in test_megatron_policy_training, test_megatron_policy_logprobs, and test_megatron_policy_topk_logits are redundant: the assertions are model-agnostic (no NaN/Inf, correct shapes, loss decreases) and the Qwen->Megatron converter path is thoroughly covered by functional tests (grpo_megatron.sh, dpo_megatron.sh, sft_megatron.sh all use Qwen models).

Removes 14 test instances:
- training: 9 → 7 (dropped 2 qwen2 variants)
- logprobs: 12 → 6 (dropped 6 qwen2 variants)
- topk: 12 → 6 (dropped 6 qwen2 variants)

Estimated savings: ~5-10 minutes on the Mcore_Policy shard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
…re combos

The training_setup fixture tested 5 model architectures (llama, qwen2, qwen3, gemma3, nemotron5_h) but the assertions are model-agnostic (no NaN/Inf, loss decreases, flops tracking). Model compatibility is covered by functional tests (grpo.sh, grpo_fsdp2.sh, dpo.sh, sft.sh use Qwen and Gemma models).

Consolidate to llama-only while preserving all feature combinations (sp, cpu_offload, activation_checkpointing, cp, and their combos). Reduces from 23 → 10 parametrized test instances.

Logprob_setup left unchanged since it validates numerical correctness via torch.allclose per architecture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
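To make concrete why these checks are called model-agnostic, here is a minimal sketch of such an assertion helper. It is not the repository's test code: `assert_training_is_sane` is a hypothetical name, and "loss decreases" is interpreted here as "the last loss is lower than the first", which is an assumption.

```python
import math
from typing import Sequence


def assert_training_is_sane(losses: Sequence[float]) -> None:
    """Checks that depend only on the loss values, not on the architecture."""
    assert len(losses) >= 2, "need at least two steps to compare losses"
    # No NaN/Inf anywhere in the recorded losses.
    assert all(math.isfinite(x) for x in losses), "non-finite loss value"
    # Assumed reading of 'loss decreases': final loss below initial loss.
    assert losses[-1] < losses[0], "loss did not decrease"
```

Nothing in the helper inspects the model, which is why running it against one architecture per feature combination exercises the same assertions as running it against five.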
Guard the truncated field check with a key existence check since the expected_result dict no longer contains the truncated field.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>

The truncated field was incorrectly removed from expected_result in an earlier commit. It should remain present so _standardize can validate the field contains bools before popping it from both sides.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
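The validate-then-pop behaviour described for _standardize might look roughly like the sketch below. The real helper's signature and the shape of the result dicts are not shown in this thread, so `standardize`, the `text` key, and the assumption that `truncated` holds a list of values are all illustrative.

```python
from typing import Any, Dict, Tuple


def standardize(
    actual: Dict[str, Any], expected: Dict[str, Any]
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Check 'truncated' holds bools on both sides, then drop it before comparing."""
    cleaned = []
    for result in (actual, expected):
        result = dict(result)  # shallow copy so the caller's dict is untouched
        truncated = result.pop("truncated")  # KeyError if the field is missing
        # Type check only: exact values depend on non-reproducible generation.
        assert all(isinstance(v, bool) for v in truncated)
        cleaned.append(result)
    return cleaned[0], cleaned[1]
```

With this shape, two results that differ only in their `truncated` values compare equal, while a missing or non-bool field still fails loudly.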
Refactor test_megatron_worker.py to use a class-scoped Ray cluster fixture (TestMegatronTwoGPU) for the parametrized tests, following the same pattern as test_dtensor_worker.py's TestTwoGPUCluster.

Previously, each parametrized test (training×7, generation×2, logprobs×6, topk×6 = 21 tests) created and destroyed its own RayVirtualCluster. Now they share a single class-scoped cluster, saving ~20 cluster creation/teardown cycles. Each test still creates and destroys its own Policy for isolation.

Standalone tests (checkpoint, loss_independent, grad_norm, etc.) remain outside the class since they need custom cluster configs.

Estimated savings: ~5-10 minutes from avoided cluster overhead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
…ests"

This reverts commit 1ffeb76.

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
kajalj22 reviewed May 21, 2026
/ok to test de4c275
/ok to test a784694
/ok to test 2ffc398
Summary
- … mcore, automodel, vllm, sglang, and nemo_gym markers across the unit suite.
- … pytest-shard so CI can run them in parallel.
- … tests/run_unit.sh treat pytest exit code 5 (no tests collected) as success for shard/FAST safety.

Test approval queue
- … Approve Test Queue, a scheduled/manual workflow that uses the shared FW-CI test approval queue template for CICD NeMo RL.
- … cicd-wait-in-queue gate in the main workflow for PR Lfast/L0/L1/L2 runs before container builds and test jobs proceed.
- … MAX_CONCURRENCY for internal runs and MAX_CONCURRENCY_EXTERNAL for external runs, both defaulting to 3.

SGLang default
- … SKIP_SGLANG workflow setting.
- … SKIP_SGLANG=false to build SGLang and run the SGLang shards.

Known follow-ups

Test plan
- … CI:L0 or higher.
- … CI:L1.
- … CI:Lfast mode still applies FAST exclusions correctly.
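For the concurrency settings listed under Test approval queue, one plausible reading is sketched below. How the shared queue template actually gates runs is not shown in this PR, so `concurrency_limit` and `may_start` are hypothetical helpers; only the two variable names and the default of 3 come from the description.

```python
from typing import Mapping

# Default stated in the PR description for both settings.
DEFAULT_LIMIT = 3


def concurrency_limit(env: Mapping[str, str], is_external: bool) -> int:
    """Pick the limit for internal or external runs, falling back to 3."""
    name = "MAX_CONCURRENCY_EXTERNAL" if is_external else "MAX_CONCURRENCY"
    return int(env.get(name, DEFAULT_LIMIT))


def may_start(running: int, env: Mapping[str, str], is_external: bool) -> bool:
    """Assumed gate: a queued run may start only while under its limit."""
    return running < concurrency_limit(env, is_external)
```

Passing the environment in as a mapping (for example `os.environ`) keeps the helpers easy to exercise with different settings.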