Merged
75 commits
2175ecf
ci: shard unit tests from 3 to 9 parallel jobs for faster CI
chtruong814 Apr 26, 2026
f2af4ef
fix: make nemo gym rollout test truncated check non-deterministic
chtruong814 Apr 27, 2026
70acdb7
ci: split mcore and automodel shards into policy vs non-policy
chtruong814 Apr 27, 2026
94f06da
ci: break out data and distributed tests from Other shard
chtruong814 Apr 27, 2026
7cc65b2
Fix lint error in test_rollouts.py
chtruong814 Apr 27, 2026
8772561
test: remove redundant qwen2 variants from megatron policy tests
chtruong814 Apr 27, 2026
1af6936
test: consolidate dtensor training_setup to llama-only with all featu…
chtruong814 Apr 27, 2026
de4e5c7
Fix lint error in test_rollouts.py
chtruong814 Apr 27, 2026
ba666ef
fix: restore truncated field in expected_result
chtruong814 Apr 27, 2026
1ffeb76
perf: share Ray cluster across parametrized megatron policy tests
chtruong814 Apr 27, 2026
23e250f
Revert "perf: share Ray cluster across parametrized megatron policy t…
chtruong814 Apr 27, 2026
9ce6119
Merge branch 'main' into chtruong/shard-tests
chtruong814 Apr 27, 2026
53e411f
Revert "Revert "perf: share Ray cluster across parametrized megatron …
chtruong814 Apr 27, 2026
8bf4f66
Fix lint error in test_megatron_worker
chtruong814 May 3, 2026
f7d8abe
Revert "perf: share Ray cluster across parametrized megatron policy t…
chtruong814 May 3, 2026
09d718d
ci: add junitxml duration reports for slow shards
chtruong814 May 3, 2026
74606d0
Revert "test: consolidate dtensor training_setup to llama-only with a…
chtruong814 May 6, 2026
53b62e4
Revert "test: remove redundant qwen2 variants from megatron policy te…
chtruong814 May 6, 2026
9c973d9
Merge remote-tracking branch 'origin/main' into chtruong/shard-tests
chtruong814 May 20, 2026
9f4b05d
Add initial functional test shards
chtruong814 May 20, 2026
44679b9
Split functional test shards into 9 groups
chtruong814 May 20, 2026
9a128b3
Revert "ci: add junitxml duration reports for slow shards"
chtruong814 May 6, 2026
e027b67
Use pytest-shard
chtruong814 May 6, 2026
b9a302a
Fix test run
chtruong814 May 6, 2026
e8c1542
Merge remote-tracking branch 'origin/main' into chtruong/shard-tests
chtruong814 May 20, 2026
808ac89
Run both H100 and GB200 tests
chtruong814 May 20, 2026
4fa725e
Fix uv cache
chtruong814 May 20, 2026
ff5e382
Check for uv cache
chtruong814 May 20, 2026
09c967f
Fix sglang kernel version labeling
chtruong814 May 20, 2026
e068591
Remove unit test for functional tests
chtruong814 May 20, 2026
a4c7bb8
Force uv-cache to run on this branch
chtruong814 May 20, 2026
eeb7dc0
Skipping fp8 tests until fixed
chtruong814 May 20, 2026
aca45d2
Revert "Fix sglang kernel version labeling"
chtruong814 May 20, 2026
a7b48dc
Fix build
chtruong814 May 20, 2026
b23faf0
Fix build
chtruong814 May 20, 2026
23a5237
Merge branch 'main' into chtruong/shard-tests
chtruong814 May 20, 2026
d5c2f9e
Skip test for now
chtruong814 May 21, 2026
89fc36d
Force uv cache
chtruong814 May 21, 2026
d89b954
ci: Skip sglang build by default
chtruong814 May 21, 2026
de0de4e
Do not prune containers
chtruong814 May 21, 2026
a9ff3f6
ci: shard model and GRPO test suites
chtruong814 May 21, 2026
35fcd83
test: skip H100 vllm non-colocated timeout case
chtruong814 May 21, 2026
d584004
Fix lint
chtruong814 May 21, 2026
b5490aa
Fix shard id for mcore policy
chtruong814 May 21, 2026
7b8a0d6
ci: expand unit test sharding
chtruong814 May 21, 2026
812183d
ci: shard megatron functional tests
chtruong814 May 21, 2026
6cfa9dd
ci: shard other functional tests
chtruong814 May 21, 2026
1075997
ci: use registry build cache for containers
chtruong814 May 22, 2026
a42c913
ci: remove stale cache gate checks
chtruong814 May 22, 2026
f7ce324
ci: limit functional test parallelism
chtruong814 May 22, 2026
c483470
ci: add test approval queue
chtruong814 May 22, 2026
4301aee
ci: use repository variables for CI resources
chtruong814 May 22, 2026
e0e7982
Merge remote-tracking branch 'origin/main' into chtruong/shard-tests
chtruong814 May 22, 2026
de4c275
ci: disable buildkit pull cache config
chtruong814 May 22, 2026
a784694
ci: add shared container build workflow
chtruong814 May 22, 2026
2ffc398
test: package duplicate unit test modules
chtruong814 May 22, 2026
89b1205
test: extend vllm generation timeouts
chtruong814 May 22, 2026
77d646c
test: limit vllm fp8 skip to gb200
chtruong814 May 22, 2026
9c7a596
Increase vllm test timeouts
chtruong814 May 22, 2026
e731544
ci: include recent pr build caches
chtruong814 May 22, 2026
3a83519
test: skip vllm fp8 on h100
chtruong814 May 22, 2026
74e6e8b
Merge remote-tracking branch 'origin/main' into chtruong/shard-tests
chtruong814 May 23, 2026
0863f96
ci: check functional scripts in workflow
chtruong814 May 23, 2026
766d6f3
test: make dtensor flops check deterministic
chtruong814 May 23, 2026
5e8899a
Merge remote-tracking branch 'origin' into chtruong/shard-tests
chtruong814 May 26, 2026
0ea3ed4
test: collect coverage for other functional tests
chtruong814 May 26, 2026
4346d57
Merge branch 'main' into chtruong/shard-tests
chtruong814 May 27, 2026
d1867c9
Merge remote-tracking branch 'origin/main' into chtruong/shard-tests
chtruong814 May 28, 2026
f1b5e86
ci: address test shard review feedback
chtruong814 May 29, 2026
1bb8976
Merge remote-tracking branch 'origin/chtruong/shard-tests' into chtru…
chtruong814 May 29, 2026
f711711
test: rename fp8 vllm skip helper
chtruong814 May 29, 2026
8ed1f2a
Merge branch 'main' into chtruong/shard-tests
chtruong814 May 29, 2026
6ee68c9
Merge remote-tracking branch 'origin' into chtruong/shard-tests
chtruong814 May 29, 2026
25fb08a
Merge remote-tracking branch 'origin/chtruong/shard-tests' into chtru…
chtruong814 May 29, 2026
47c71b7
Merge remote-tracking branch 'origin/main' into chtruong/shard-tests
chtruong814 May 29, 2026
31 changes: 2 additions & 29 deletions .github/actions/test-template/action.yml
@@ -41,19 +41,6 @@ inputs:
     description: "Run tests on CPU only"
     required: false
     default: "false"
-  azure-client-id:
-    description: "Azure Client ID"
-    required: true
-  azure-tenant-id:
-    description: "Azure Tenant ID"
-    required: true
-  azure-subscription-id:
-    description: "Azure Subscription ID"
-    required: true
-  has-azure-credentials:
-    description: "Has Azure credentials"
-    required: false
-    default: "false"
   is_fork_pr:
     description: "Whether this is a pull request from a fork"
     required: false
@@ -77,31 +64,16 @@ inputs:
 runs:
   using: "composite"
   steps:
-    - name: Install Azure CLI
-      if: ${{ inputs.has-azure-credentials == 'true' }}
-      shell: bash
-      run: |
-        for i in 1 2 3; do
-          curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash && break
-          echo "Attempt $i failed, retrying in 10s..."
-          sleep 10
-        done
-
     - name: Install uuidgen
       shell: bash -x -e -u -o pipefail {0}
-      if: ${{ contains(inputs.runner, 'gcp') }}
+      if: ${{ contains(inputs.runner, 'aws') || contains(inputs.runner, 'gcp') }}
       run: |
         for i in 1 2 3; do
           apt-get update && apt-get install -y uuid-runtime && break
           echo "Attempt $i failed, retrying in 10s..."
           sleep 10
         done
 
-    - name: Docker system cleanup
-      shell: bash
-      run: |
-        docker system prune -af --filter "until=48h" --force || true
-
     - name: Docker pull image
       shell: bash
       run: |
@@ -138,6 +110,7 @@ runs:
         docker run --rm -u root --runtime=nvidia --gpus all \
           --shm-size=64g \
           --env TRANSFORMERS_OFFLINE=0 \
+          --env GHA_RUNNER=${{ inputs.runner }} \
           --env HYDRA_FULL_ERROR=1 \
           --env HF_HOME=/home/TestData/nemo-rl/hf_home \
           --env HF_DATASETS_CACHE=/home/TestData/nemo-rl/hf_datasets_cache \
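A note on the retry loop in the `Install uuidgen` step: each attempt is joined to `break` with `&&`, so a command that fails on all three attempts leaves the loop with the exit status of the final `sleep`, and the step can still pass even under `bash -e`. The sketch below shows one possible variant that propagates the failure instead. `retry` is a hypothetical helper name and is not part of this action.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): run a command up to N times and
# return non-zero if every attempt fails.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
retry() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    # Succeed as soon as one attempt succeeds.
    "$@" && return 0
    # Only announce and wait if another attempt is coming.
    if (( i < attempts )); then
      echo "Attempt $i failed, retrying in ${delay}s..." >&2
      sleep "$delay"
    fi
  done
  # All attempts failed: surface the failure to the caller.
  return 1
}
```

With this shape, a call such as `retry 3 10 apt-get install -y uuid-runtime` would fail the step once the third attempt fails, rather than continuing without `uuidgen`.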
172 changes: 172 additions & 0 deletions .github/workflows/_build_container.yml
@@ -0,0 +1,172 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+name: Build container
+
+on:
+  workflow_call:
+    inputs:
+      build-ref:
+        required: false
+        default: ${{ github.sha }}
+        description: Ref, branch, or SHA to build.
+        type: string
+      image-name:
+        required: true
+        description: Name of the image to build and push.
+        type: string
+      build-args:
+        required: false
+        default: ""
+        description: Additional Docker build args.
+        type: string
+      build-contexts:
+        required: false
+        default: ""
+        description: Additional Docker build contexts.
+        type: string
+      dockerfile:
+        required: true
+        description: Path to the Dockerfile.
+        type: string
+      platform:
+        required: true
+        description: Docker build platform.
+        type: string
+      runner:
+        required: true
+        description: Runner to use for the build.
+        type: string
+      registry:
+        required: true
+        description: Container registry to push to.
+        type: string
+      target:
+        required: false
+        default: ""
+        description: Dockerfile stage to build.
+        type: string
+
+permissions:
+  contents: read
+  pull-requests: read
+
+defaults:
+  run:
+    shell: bash -x -e -u -o pipefail {0}
+
+jobs:
+  build:
+    runs-on: ${{ inputs.runner }}
+    env:
+      REGISTRY: ${{ inputs.registry }}
+      IMAGE_NAME: ${{ inputs.image-name }}
+      GH_REF: ${{ github.ref }}
+      RUN_ID: ${{ github.run_id }}
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v6
+        with:
+          ref: ${{ inputs.build-ref }}
+          submodules: recursive
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Get recently merged PR cache refs
+        id: recent_pr_cache_refs
+        uses: actions/github-script@v8
+        env:
+          REGISTRY: ${{ inputs.registry }}
+          IMAGE_NAME: ${{ inputs.image-name }}
+        with:
+          script: |
+            const [owner, repo] = process.env.GITHUB_REPOSITORY.split("/");
+            const result = await github.graphql(`
+              query($owner: String!, $repo: String!) {
+                repository(owner: $owner, name: $repo) {
+                  pullRequests(states: MERGED, first: 100, orderBy: {field: UPDATED_AT, direction: DESC}) {
+                    nodes {
+                      number
+                    }
+                  }
+                }
+              }
+            `, { owner, repo });
+
+            const refs = result.repository.pullRequests.nodes
+              .map(({ number }) => `type=registry,ref=${process.env.REGISTRY}/${process.env.IMAGE_NAME}:${number}-buildcache,mode=max`)
+              .join("\n");
+
+            core.setOutput("cache-from", refs);
+            core.info(`Found ${result.repository.pullRequests.nodes.length} recently merged PR cache refs.`);
+
+      - name: Compute build metadata
+        id: build_meta
+        shell: bash
+        run: |
+          set -euo pipefail
+
+          PR_NUMBER=""
+          if [[ "$GH_REF" =~ refs/heads/pull-request/([0-9]+) ]]; then
+            PR_NUMBER="${BASH_REMATCH[1]}"
+          fi
+
+          TAGS=("$REGISTRY/$IMAGE_NAME:$RUN_ID")
+          if [[ "$GH_REF" == "refs/heads/main" ]]; then
+            CACHE_KEY="main"
+            TAGS+=("$REGISTRY/$IMAGE_NAME:main")
+          elif [[ -n "$PR_NUMBER" ]]; then
+            CACHE_KEY="$PR_NUMBER"
+            TAGS+=("$REGISTRY/$IMAGE_NAME:$PR_NUMBER")
+          else
+            CACHE_KEY=$(printf '%s' "${GITHUB_REF_NAME:-$RUN_ID}" | tr '/' '-' | tr -cd '[:alnum:]._-')
+            if [[ -z "$CACHE_KEY" ]]; then
+              CACHE_KEY="$RUN_ID"
+            fi
+          fi
+
+          CACHE_FROM=(
+            "type=registry,ref=$REGISTRY/$IMAGE_NAME:main-buildcache,mode=max"
+          )
+          if [[ "$CACHE_KEY" != "main" ]]; then
+            CACHE_FROM+=("type=registry,ref=$REGISTRY/$IMAGE_NAME:$CACHE_KEY-buildcache,mode=max")
+          fi
+
+          {
+            echo "tags<<EOF"
+            printf '%s\n' "${TAGS[@]}"
+            echo "EOF"
+            echo "cache-from<<EOF"
+            printf '%s\n' "${CACHE_FROM[@]}"
+            echo "EOF"
+            echo "cache-to=type=registry,ref=$REGISTRY/$IMAGE_NAME:$CACHE_KEY-buildcache,mode=max"
+          } >> "$GITHUB_OUTPUT"
+
+      - name: Build and push
+        uses: docker/build-push-action@v5
+        with:
+          file: ${{ inputs.dockerfile }}
+          push: true
+          context: .
+          platforms: ${{ inputs.platform }}
+          build-contexts: ${{ inputs.build-contexts }}
+          build-args: ${{ inputs.build-args }}
+          cache-from: |
+            ${{ steps.build_meta.outputs.cache-from }}
+            ${{ steps.recent_pr_cache_refs.outputs.cache-from }}
+          cache-to: ${{ steps.build_meta.outputs.cache-to }}
+          no-cache: false
+          tags: |
+            ${{ steps.build_meta.outputs.tags }}
+          target: ${{ inputs.target }}
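The cache-key rule in the `Compute build metadata` step can be exercised outside CI. The following is a minimal sketch, assuming the same three cases as the workflow (`main`, `pull-request/<N>` branches, and any other ref). `cache_key_for_ref` is a hypothetical function name, and its three arguments stand in for `GH_REF`, `GITHUB_REF_NAME`, and `RUN_ID`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR) mirroring the CACHE_KEY logic.
# Usage: cache_key_for_ref <full-ref> <ref-name> <run-id>
cache_key_for_ref() {
  local gh_ref="$1" ref_name="$2" run_id="$3" key
  if [[ "$gh_ref" == "refs/heads/main" ]]; then
    # Builds on main share a single cache.
    key="main"
  elif [[ "$gh_ref" =~ refs/heads/pull-request/([0-9]+) ]]; then
    # Mirrored PR branches are keyed by PR number.
    key="${BASH_REMATCH[1]}"
  else
    # Any other ref: replace slashes with dashes, drop characters outside
    # alphanumerics, '.', '_' and '-', and fall back to the run ID.
    key=$(printf '%s' "${ref_name:-$run_id}" | tr '/' '-' | tr -cd '[:alnum:]._-')
    if [[ -z "$key" ]]; then
      key="$run_id"
    fi
  fi
  printf '%s\n' "$key"
}
```

For example, `cache_key_for_ref refs/heads/pull-request/42 pull-request/42 123` prints `42`, and a branch named `chtruong/shard-tests` yields `chtruong-shard-tests`, so under the workflow's naming that build would write its cache to the `chtruong-shard-tests-buildcache` tag.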
34 changes: 34 additions & 0 deletions .github/workflows/cicd-approve-test-queue.yml
@@ -0,0 +1,34 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+name: Approve Test Queue
+
+on:
+  schedule:
+    - cron: "*/5 * * * *"
+  workflow_dispatch:
+
+jobs:
+  approve-test-queue:
+    if: github.repository == 'NVIDIA-NeMo/RL'
+    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_test_approval_queue.yml@v1.3.0
+    with:
+      workflow_name: CICD NeMo RL
+      max_concurrency_internal: ${{ fromJSON(vars.MAX_CONCURRENCY || '3') }}
+      max_concurrency_external: ${{ fromJSON(vars.MAX_CONCURRENCY_EXTERNAL || '3') }}
+    secrets:
+      PAT: ${{ secrets.PAT }}
+      NVIDIA_MANAGEMENT_ORG_PAT: ${{ secrets.NVIDIA_MANAGEMENT_ORG_PAT }}
+      SLACK_CI_CHANNEL_WEBHOOK: ${{ secrets.SLACK_GITHUB_CI_WEBHOOK }}
+      SLACK_TEAM_GROUP_ID: ${{ secrets.SLACK_TEAM_GROUP_ID }}