Commit b881831

chore(profiles): sync GPU profiles with pcie_topology/NUMA from upstream (RUN-40173) (#212)
## What

Re-syncs the built-in GPU profiles via `hack/sync-profiles.sh` from upstream NVIDIA/k8s-test-infra **`main`** (synced commit `497fa04`), and improves the sync tooling.

The previous pin was the `v0.1.0` tag, which predates upstream's **`pcie_topology`** block (PCI root complexes + per-device `numa_node`). Without it, the mock backend's device-plugin can't report GPU→NUMA, so `NodeResourceTopology` zones never get GPU resources and NUMA/topology-aware scheduling can't be exercised on mock GPUs. `pcie_topology` currently exists **only on upstream `main`** (no tagged release yet).

## Changes

- **`hack/sync-profiles.sh`**:
  - `DEFAULT_VERSION` now tracks **`main`** (the committed `builtin.yaml` is the pinned artifact; each sync produces a reviewable PR diff). Override with a tag/commit once upstream cuts a release.
  - Accepts a **tag, branch, OR commit SHA**: blobless+sparse clone, then `git checkout <ref>` (the old `git clone --branch` only accepted tags/branches, though the workflow advertises "tag or commit").
  - The generated header records the **exact resolved commit** for provenance: `# Source: NVIDIA/k8s-test-infra main (commit 497fa04)`.
- **`builtin.yaml`** (regenerated): all 7 profiles now carry `pcie_topology` (a100/b200/gb200/**gb300**/h100/l40s/t4); `gb300` is new.
- **`.github/workflows/sync-gpu-profiles.yml`**: fix version extraction (`head -2` → `head -3`). The `# Source:` line is line 3, so it previously captured nothing.
- **`CHANGELOG.md`**: entry under `[Unreleased] → Changed`.

## Verification

- `hack/sync-profiles.sh` (default → `main`) ran clean → 7 profiles, header pinned to `497fa04`.
- `builtin.yaml` parses as valid YAML (7 ConfigMap docs); each profile contains `pcie_topology`.

## Context

Found while testing the mock backend on a real-NUMA EKS node: the real NFD topology-updater produces a populated NRT (CPU zone `cap=8 alloc=7`), but GPUs don't appear in the zones because the rendered mock profile had no `pcie_topology`. This sync is the prerequisite for mock-backend GPU↔NUMA.

RUN-40173

Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
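
For readers unfamiliar with the clone-then-checkout approach mentioned under Changes, the following is a minimal sketch of how a blobless, sparse clone can resolve a tag, branch, or commit SHA. It is not the contents of `hack/sync-profiles.sh`: the function name `fetch_ref` and its four arguments are hypothetical, and it assumes a git version that supports `clone --sparse` and `sparse-checkout set`.

```shell
# Hypothetical helper (not taken from hack/sync-profiles.sh).
# Usage: fetch_ref REPO_URL REF SPARSE_PATH DEST
# Clones without blobs and without populating the working tree, limits the
# checkout to SPARSE_PATH, then checks out REF. Unlike `git clone --branch`,
# `git checkout` also accepts a commit SHA.
fetch_ref() {
  repo_url="$1"
  ref="$2"
  sparse_path="$3"
  dest="$4"

  git clone --quiet --filter=blob:none --sparse --no-checkout "$repo_url" "$dest" &&
    git -C "$dest" sparse-checkout set "$sparse_path" &&
    git -C "$dest" checkout --quiet "$ref" &&
    # Print the resolved commit so the caller can record it for provenance.
    git -C "$dest" rev-parse HEAD
}
```

The printed commit is the kind of value a sync script could write into a `# Source:` header. When cloning from a local path, git ignores `--filter` with a warning, which does not affect the result.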
1 parent 2c41ffc commit b881831

4 files changed

Lines changed: 820 additions & 85 deletions

.github/workflows/sync-gpu-profiles.yml

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ jobs:
 if: steps.diff.outputs.changed == 'true'
 id: version
 run: |
-ver=$(head -2 deploy/fake-gpu-operator/templates/profiles/builtin.yaml | grep '# Source:' | sed 's/.*k8s-test-infra //')
+ver=$(head -3 deploy/fake-gpu-operator/templates/profiles/builtin.yaml | grep '# Source:' | sed 's/.*k8s-test-infra //')
 echo "version=$ver" >> "$GITHUB_OUTPUT"

 - name: Create PR
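
The effect of the `head -2` → `head -3` change can be reproduced in isolation. The sketch below wraps the same `head | grep | sed` pipeline in a hypothetical function, `extract_version`, and runs it against a sample header. Only the `# Source:` line is taken from the commit message; the first two header lines are placeholders, since the real ones are not shown here.

```shell
# Hypothetical wrapper around the workflow's pipeline.
# Usage: extract_version LINE_COUNT FILE
extract_version() {
  head -n "$1" "$2" | grep '# Source:' | sed 's/.*k8s-test-infra //'
}

# Sample header: lines 1 and 2 are placeholders, line 3 is the real format.
sample=$(mktemp)
printf '%s\n' \
  '# placeholder header line 1' \
  '# placeholder header line 2' \
  '# Source: NVIDIA/k8s-test-infra main (commit 497fa04)' > "$sample"

echo "head -2: '$(extract_version 2 "$sample")'"
echo "head -3: '$(extract_version 3 "$sample")'"
```

With two lines scanned, `grep` never sees the `# Source:` line and the result is empty; with three, the pipeline yields `main (commit 497fa04)`.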

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -12,9 +12,18 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

 ### Changed

+- Built-in GPU profiles re-synced from NVIDIA/k8s-test-infra `main` (commit
+`497fa04`): each profile now includes a `pcie_topology` block (PCI root
+complexes with per-device `numa_node`), and a `gb300` profile is added. This
+is what lets the mock backend report per-GPU NUMA affinity. (RUN-40173)
+- `hack/sync-profiles.sh`: default source bumped `v0.1.0` → `main`; now resolves
+a tag, branch, or commit SHA (was tags/branches only) and records the resolved
+commit in the generated `# Source:` header. (RUN-40173)
+
 ### Fixed

 - `device-plugin` injects `NODE_NAME` so non-DRA pods can run the fake `nvidia-smi`. ([#191](https://github.com/run-ai/fake-gpu-operator/issues/191))
+- `sync-gpu-profiles` workflow read the synced version with `head -2`, but the `# Source:` line is line 3, so the PR title/commit version was always empty. (RUN-40173)
 - CI `e2e-upgrade (latest-main)` lane no longer deadlocks resolving its baseline
 chart. It now walks `--first-parent` main commits (only those publish a
 `0.0.0-<sha>` chart, so merges no longer fill the window with unpublished
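
The changelog claim that every profile now carries `pcie_topology` can be spot-checked with a rough textual pass over a multi-document YAML file. This is a sketch under assumptions: `check_profiles` is a hypothetical helper, it splits documents on `---` lines and searches for the literal string instead of parsing YAML, and it is not the verification the commit author ran.

```shell
# Hypothetical helper: prints "<docs> <docs missing pcie_topology>" and
# returns non-zero if any document lacks the string. Textual check only.
check_profiles() {
  awk '
    function close_doc() {
      if (seen_content) { docs++; if (!has_topo) missing++ }
      seen_content = 0; has_topo = 0
    }
    /^---[ \t]*$/      { close_doc(); next }   # document separator
    /^[ \t]*(#.*)?$/   { next }                # blank or comment-only line
                       { seen_content = 1 }
    /pcie_topology/    { has_topo = 1 }
    END { close_doc(); printf "%d %d\n", docs, missing; exit (missing > 0) }
  ' "$1"
}
```

Run against the regenerated file, for example `check_profiles deploy/fake-gpu-operator/templates/profiles/builtin.yaml`, the commit's verification notes suggest the expected result would be 7 documents with none missing; that figure comes from the commit message, not from running this sketch.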
