ci: shard tests to run more in parallel by chtruong814 · Pull Request #2345 · NVIDIA-NeMo/RL

chtruong814 · 2026-04-26T16:25:37Z

Summary

Replaces the monolithic L0 unit-test scripts with targeted shards grouped by backend marker and test domain.
- Backend catch-all shards cover mcore, automodel, vllm, sglang, and nemo_gym markers across the unit suite.
- Base domain shards cover models, algorithms, data, distributed, environments, and other unmarked tests.
- Large policy/model/vLLM groups are split with pytest-shard so CI can run them in parallel.
Replaces the monolithic L1 GPU functional script with framework- and algorithm-focused shards for Megatron, AutoModel, SGLang, Gym, GRPO, SFT, Eval, and Other tests.
Updates the GitHub Actions matrices to run the new L0, L1, GB200 L1, and Lfast shard sets in parallel.
Adds a test approval queue so the expanded shard matrix is gated by a concurrency-managed queue instead of allowing too many CICD workflows to run at once.
Adds shared unit-shard setup and makes tests/run_unit.sh treat pytest exit code 5 (no tests collected) as success for shard/FAST safety.
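
As an illustration of how a large test group can be divided across parallel CI jobs, here is a minimal sketch of hash-based partitioning. It is not necessarily the exact scheme pytest-shard uses; `shard_of`, `select_shard`, and the test IDs are hypothetical names chosen for this example.

```python
import hashlib


def shard_of(node_id: str, num_shards: int) -> int:
    """Map a test ID to a stable shard index in [0, num_shards)."""
    digest = hashlib.sha256(node_id.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_shards


def select_shard(node_ids, shard_id: int, num_shards: int):
    """Return the tests that the given shard is responsible for."""
    return [n for n in node_ids if shard_of(n, num_shards) == shard_id]


# Hypothetical test IDs, for illustration only.
tests = [f"tests/unit/models/policy/test_example.py::test_case_{i}" for i in range(12)]
shards = [select_shard(tests, i, 3) for i in range(3)]

# Every test lands in exactly one shard, so parallel jobs do not overlap.
assert sorted(t for s in shards for t in s) == sorted(tests)
```

Because the assignment depends only on the test ID, each job can compute its own subset independently without coordinating with the others.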

Test approval queue

Adds Approve Test Queue, a scheduled/manual workflow that uses the shared FW-CI test approval queue template for CICD NeMo RL.
Adds a cicd-wait-in-queue gate in the main workflow for PR Lfast/L0/L1/L2 runs before container builds and test jobs proceed.
Concurrency is controlled with repo variables: MAX_CONCURRENCY for internal runs and MAX_CONCURRENCY_EXTERNAL for external runs, both defaulting to 3.
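
The limit selection described above can be sketched as follows. This is only an assumption about the logic: the real queue lives in the shared FW-CI template, whose internals are not shown in this PR, and `max_concurrency` and `can_start` are hypothetical helpers that read the variables from a plain mapping.

```python
import os

DEFAULT_LIMIT = 3  # default stated in the PR description


def max_concurrency(is_external: bool, variables=None) -> int:
    """Pick the limit for internal or external runs, falling back to 3."""
    variables = os.environ if variables is None else variables
    name = "MAX_CONCURRENCY_EXTERNAL" if is_external else "MAX_CONCURRENCY"
    raw = variables.get(name, "")
    return int(raw) if raw else DEFAULT_LIMIT


def can_start(running: int, is_external: bool, variables=None) -> bool:
    """A run may leave the queue only while the limit is not yet reached."""
    return running < max_concurrency(is_external, variables)


assert max_concurrency(False, {}) == 3
assert max_concurrency(True, {"MAX_CONCURRENCY_EXTERNAL": "1"}) == 1
assert not can_start(running=3, is_external=False, variables={})
```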

SGLang default

SGLang build and SGLang unit/functional test shards are skipped by default through the SKIP_SGLANG workflow setting.
Set SKIP_SGLANG=false to build SGLang and run the SGLang shards.
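
A rough sketch of the default-skip behavior, assuming the setting arrives as a string and that SGLang shards can be recognized by name. `parse_flag` and `select_shards` are hypothetical; the actual workflow may filter the matrix differently.

```python
def parse_flag(value, default: bool = True) -> bool:
    """Interpret a workflow-style string flag; unset means the default."""
    if value is None or value == "":
        return default
    return value.strip().lower() not in ("false", "0", "no")


def select_shards(shards, skip_sglang_value=None):
    """Drop SGLang shards unless SKIP_SGLANG is explicitly set to false."""
    skip = parse_flag(skip_sglang_value, default=True)
    return [s for s in shards if not (skip and "sglang" in s.lower())]


shards = ["L0_Unit_Tests_Vllm", "L0_Unit_Tests_Sglang", "L0_Unit_Tests_Mcore"]
assert select_shards(shards) == ["L0_Unit_Tests_Vllm", "L0_Unit_Tests_Mcore"]
assert select_shards(shards, "false") == shards
```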

Known follow-ups

Some vLLM tests are temporarily skipped on H100 due to CI/runtime failures. We are creating a tracking issue for those skips and will restore coverage after the underlying issue is fixed.
Existing vLLM FP8 skips continue to reference issue #2081, "vllm generation with fp8 fails on gb200 and h100".

Test plan

Verify the L0 unit shard matrix with CI:L0 or higher.
Verify the L1 functional shard matrix with CI:L1.
Verify CI:Lfast mode still applies FAST exclusions correctly.
Verify the test approval queue gates PR CICD runs and respects the configured concurrency limits.
Verify coverage artifacts upload and combine correctly across the new shard names.

Restructure unit test CI from 3 monolithic shards (Generation, Policy, Other) into 9 targeted shards split by extra/marker. Each extra-specific shard (mcore, automodel, vllm, sglang, nemo_gym) runs a single --*-only flag across all unit tests, while domain shards (models, environments, algorithms, other) run only base (unmarked) tests. This eliminates the 5-6 sequential pytest invocations per shard, reduces the bottleneck from 90 min (Policy) to ~30 min per shard, and makes it clear where new tests should be added.

New shards:
- L0_Unit_Tests_Vllm: base vllm generation + --vllm-only catch-all
- L0_Unit_Tests_Sglang: base sglang files + --sglang-only catch-all
- L0_Unit_Tests_Mcore: --mcore-only catch-all
- L0_Unit_Tests_Automodel: --automodel-only catch-all
- L0_Unit_Tests_Nemo_Gym: --nemo-gym-only catch-all
- L0_Unit_Tests_Models: base model tests (minus generation)
- L0_Unit_Tests_Environments: base environment tests
- L0_Unit_Tests_Algorithms: base algorithm tests
- L0_Unit_Tests_Other: catch-all for remaining base tests + research

Also fixes run_unit.sh to treat pytest exit code 5 (no tests collected) as success, preventing shard failures when FAST exclusions remove all tests from a shard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
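
The exit-code handling can be illustrated with a short sketch. run_unit.sh itself is a shell script, so the Python below only mirrors the idea; `normalize_exit_code` and `run_and_normalize` are hypothetical names, and the fake command stands in for a pytest run that collects nothing.

```python
import subprocess
import sys

NO_TESTS_COLLECTED = 5  # pytest exit code when no tests were collected


def normalize_exit_code(code: int) -> int:
    """Treat 'no tests collected' as success; pass other codes through."""
    return 0 if code == NO_TESTS_COLLECTED else code


def run_and_normalize(cmd) -> int:
    """Run a command and normalize its exit code for shard safety."""
    completed = subprocess.run(cmd, check=False)
    return normalize_exit_code(completed.returncode)


# Stand-in for a pytest invocation whose tests were all excluded.
fake_empty_run = [sys.executable, "-c", "import sys; sys.exit(5)"]
assert run_and_normalize(fake_empty_run) == 0
```

Real failures (for example exit code 1) still propagate, so only the empty-shard case is relaxed.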

copy-pr-bot · 2026-04-26T16:25:40Z

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

chtruong814 · 2026-04-26T16:31:09Z

/ok to test

The truncated field depends on exact generation output from the tiny model, which is not reproducible across runs. Instead of comparing exact bool values, verify that each value is a bool type. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Charlie Truong <chtruong@nvidia.com>

The Mcore shard (50 min) and Automodel shard (38 min) are bottlenecked by heavy policy worker tests (test_megatron_worker.py and test_dtensor_worker*.py). Split each into two shards:
- L0_Unit_Tests_Mcore: mcore tests excluding unit/models/policy/ (~15 min)
- L0_Unit_Tests_Mcore_Policy: mcore tests from unit/models/policy/ only (~30 min)
- L0_Unit_Tests_Automodel: automodel tests excluding unit/models/policy/ (~10 min)
- L0_Unit_Tests_Automodel_Policy: automodel tests from unit/models/policy/ only (~28 min)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Split L0_Unit_Tests_Other into three shards:
- L0_Unit_Tests_Data: data pipeline tests (datasets, processing, message utils)
- L0_Unit_Tests_Distributed: distributed infra tests (worker groups, virtual cluster, logprob)
- L0_Unit_Tests_Other: catch-all for remaining (experience, utils, tools, evals, rewards, root tests)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 · 2026-04-27T01:09:51Z

/ok to test

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 · 2026-04-27T01:14:30Z

/ok to test

The qwen2 parametrizations in test_megatron_policy_training, test_megatron_policy_logprobs, and test_megatron_policy_topk_logits are redundant: the assertions are model-agnostic (no NaN/Inf, correct shapes, loss decreases) and the Qwen->Megatron converter path is thoroughly covered by functional tests (grpo_megatron.sh, dpo_megatron.sh, sft_megatron.sh all use Qwen models).

Removes 14 test instances:
- training: 9 → 7 (dropped 2 qwen2 variants)
- logprobs: 12 → 6 (dropped 6 qwen2 variants)
- topk: 12 → 6 (dropped 6 qwen2 variants)

Estimated savings: ~5-10 minutes on the Mcore_Policy shard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
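
As a quick check of the arithmetic in this commit message, the per-test counts can be tallied directly. The dictionaries only restate the numbers quoted above.

```python
# Parametrized instance counts before and after, as quoted in the commit message.
before = {"training": 9, "logprobs": 12, "topk": 12}
after = {"training": 7, "logprobs": 6, "topk": 6}

removed = {name: before[name] - after[name] for name in before}
total_removed = sum(removed.values())

assert removed == {"training": 2, "logprobs": 6, "topk": 6}
assert total_removed == 14  # matches "Removes 14 test instances"
```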

…re combos The training_setup fixture tested 5 model architectures (llama, qwen2, qwen3, gemma3, nemotron5_h) but the assertions are model-agnostic (no NaN/Inf, loss decreases, flops tracking). Model compatibility is covered by functional tests (grpo.sh, grpo_fsdp2.sh, dpo.sh, sft.sh use Qwen and Gemma models). Consolidate to llama-only while preserving all feature combinations (sp, cpu_offload, activation_checkpointing, cp, and their combos). Reduces from 23 → 10 parametrized test instances. Logprob_setup left unchanged since it validates numerical correctness via torch.allclose per architecture. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Guard the truncated field check with a key existence check since the expected_result dict no longer contains the truncated field. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Charlie Truong <chtruong@nvidia.com>

The truncated field was incorrectly removed from expected_result in an earlier commit. It should remain present so _standardize can validate the field contains bools before popping it from both sides. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Charlie Truong <chtruong@nvidia.com>
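
The approach described in the truncated-field commits (validate that the values are bools, then drop the field from both sides before comparing) might look roughly like this. It is a sketch under assumptions: `check_and_drop_truncated` is a hypothetical stand-in for the test's `_standardize` helper, and the field is modeled as a plain list of bools.

```python
def check_and_drop_truncated(result: dict) -> dict:
    """Validate the type of 'truncated' values, then remove the field."""
    out = dict(result)  # copy so the caller's dict is left untouched
    if "truncated" in out:
        values = out.pop("truncated")
        assert all(isinstance(v, bool) for v in values), "truncated must hold bools"
    return out


# Hypothetical results: the truncated values differ between runs,
# but the remaining fields still compare equal.
expected = {"text": ["a", "b"], "truncated": [False, True]}
actual = {"text": ["a", "b"], "truncated": [True, True]}
assert check_and_drop_truncated(expected) == check_and_drop_truncated(actual)
```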

chtruong814 · 2026-04-27T13:04:51Z

/ok to test

Refactor test_megatron_worker.py to use a class-scoped Ray cluster fixture (TestMegatronTwoGPU) for the parametrized tests, following the same pattern as test_dtensor_worker.py's TestTwoGPUCluster. Previously, each parametrized test (training×7, generation×2, logprobs×6, topk×6 = 21 tests) created and destroyed its own RayVirtualCluster. Now they share a single class-scoped cluster, saving ~20 cluster creation/teardown cycles. Each test still creates and destroys its own Policy for isolation. Standalone tests (checkpoint, loss_independent, grad_norm, etc.) remain outside the class since they need custom cluster configs. Estimated savings: ~5-10 minutes from avoided cluster overhead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Charlie Truong <chtruong@nvidia.com>
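
The sharing pattern in this commit can be sketched without Ray or pytest. The example below uses the standard library's unittest `setUpClass` as a stand-in for a class-scoped pytest fixture; `FakeCluster` and the test names are hypothetical and only count how often the expensive resource is created.

```python
import io
import unittest


class FakeCluster:
    """Stand-in for an expensive cluster; counts how many were created."""
    created = 0

    def __init__(self):
        type(self).created += 1
        self.alive = True

    def shutdown(self):
        self.alive = False


class TestSharedCluster(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.cluster = FakeCluster()  # created once for the whole class

    @classmethod
    def tearDownClass(cls):
        cls.cluster.shutdown()

    def setUp(self):
        # Each test still builds its own lightweight object for isolation.
        self.policy = {"cluster": self.cluster}

    def test_training(self):
        self.assertTrue(self.policy["cluster"].alive)

    def test_logprobs(self):
        self.assertTrue(self.policy["cluster"].alive)


def run_suite():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSharedCluster)
    return unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
```

Running `run_suite()` executes both tests against a single `FakeCluster`, which is the source of the saved creation and teardown cycles.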

…ests" This reverts commit 1ffeb76. Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 · 2026-04-27T13:16:16Z

/ok to test

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 · 2026-05-21T14:15:21Z

/ok to test

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 · 2026-05-21T16:20:22Z

/ok to test

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 · 2026-05-21T20:18:06Z

/ok to test

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 · 2026-05-21T20:26:10Z

/ok to test

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 · 2026-05-22T01:38:25Z

/ok to test

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 · 2026-05-22T01:51:36Z

/ok to test de4c275

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 · 2026-05-22T03:17:20Z

/ok to test a784694

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 · 2026-05-22T11:45:13Z

/ok to test 2ffc398

chtruong814 requested a review from a team as a code owner April 26, 2026 16:25

github-actions Bot added the CI Relating to CI label Apr 26, 2026

chtruong814 added the CI:L1 (Run doctests, unit tests, and functional tests) label, then added the CI:L0 (Run doctests and unit tests) label and removed the CI:L1 label, Apr 26, 2026

copy-pr-bot Bot temporarily deployed to nemo-ci April 26, 2026 16:31 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci April 26, 2026 16:38 Inactive

chtruong814 and others added 3 commits April 26, 2026 19:58

copy-pr-bot Bot had a problem deploying to nemo-ci April 27, 2026 01:10 Error

Fix lint error in test_rollouts.py

7cc65b2

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

copy-pr-bot Bot temporarily deployed to nemo-ci April 27, 2026 01:15 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci April 27, 2026 01:18 Inactive

chtruong814 and others added 4 commits April 27, 2026 07:56

Fix lint error in test_rollouts.py

de4e5c7

copy-pr-bot Bot had a problem deploying to nemo-ci April 27, 2026 13:05 Error

chtruong814 and others added 2 commits April 27, 2026 08:12

Revert "perf: share Ray cluster across parametrized megatron policy t…

23e250f

…ests" This reverts commit 1ffeb76. Signed-off-by: Charlie Truong <chtruong@nvidia.com>

copy-pr-bot Bot had a problem deploying to nemo-ci April 27, 2026 13:17 Error

Merge branch 'main' into chtruong/shard-tests

9ce6119

chtruong814 added 2 commits May 21, 2026 09:14

Skip test for now

d5c2f9e

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Force uv cache

89fc36d

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

ci: Skip sglang build by default

d89b954

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 added 3 commits May 21, 2026 14:31

Do not prune containers

de0de4e

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

ci: shard model and GRPO test suites

a9ff3f6

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

test: skip H100 vllm non-colocated timeout case

35fcd83

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Fix lint

d584004

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

kajalj22 reviewed May 21, 2026

Comment thread .github/workflows/cicd-main.yml Outdated

Comment thread .github/workflows/cicd-main.yml

kajalj22 reviewed May 21, 2026

Comment thread tests/unit/L0_Unit_Tests_Mcore_Policy_3.sh

chtruong814 added 10 commits May 21, 2026 18:12

Fix shard id for mcore policy

b5490aa

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

ci: expand unit test sharding

7b8a0d6

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

ci: shard megatron functional tests

812183d

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

ci: shard other functional tests

6cfa9dd

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

ci: use registry build cache for containers

1075997

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

ci: remove stale cache gate checks

a42c913

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

ci: limit functional test parallelism

f7ce324

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

ci: add test approval queue

c483470

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

ci: use repository variables for CI resources

4301aee

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Merge remote-tracking branch 'origin/main' into chtruong/shard-tests

e0e7982

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

ci: disable buildkit pull cache config

de4c275

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

ci: add shared container build workflow

a784694

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

test: package duplicate unit test modules

2ffc398

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

2 participants
