Commit b881831
chore(profiles): sync GPU profiles with pcie_topology/NUMA from upstream (RUN-40173) (#212)
## What
Re-syncs the built-in GPU profiles via `hack/sync-profiles.sh` from
upstream NVIDIA/k8s-test-infra **`main`** (synced commit `497fa04`), and
improves the sync tooling.
The previous pin was the `v0.1.0` tag, which predates upstream's
**`pcie_topology`** block (PCI root complexes + per-device `numa_node`).
Without it, the mock backend's device plugin cannot report GPU→NUMA
affinity, so `NodeResourceTopology` zones never get GPU resources and
NUMA/topology-aware scheduling can't be exercised on mock GPUs.
`pcie_topology` currently exists **only on upstream `main`** (no tagged
release yet).
## Changes
- **`hack/sync-profiles.sh`**:
  - `DEFAULT_VERSION` now tracks **`main`** (the committed `builtin.yaml`
    is the pinned artifact; each sync produces a reviewable PR diff).
    Override with a tag/commit once upstream cuts a release.
  - Accepts a **tag, branch, OR commit SHA**: blobless+sparse clone then
    `git checkout <ref>` (the old `git clone --branch` only accepted
    tags/branches, though the workflow advertises "tag or commit").
  - The generated header records the **exact resolved commit** for
    provenance: `# Source: NVIDIA/k8s-test-infra main (commit 497fa04)`.
- **`builtin.yaml`** (regenerated): all 7 profiles now carry
`pcie_topology` (a100/b200/gb200/**gb300**/h100/l40s/t4); `gb300` is
new.
- **`.github/workflows/sync-gpu-profiles.yml`**: fix version extraction
(`head -2` → `head -3`) — the `# Source:` line is line 3, so it
previously captured nothing.
- **`CHANGELOG.md`**: entry under `[Unreleased] → Changed`.
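
The tag/branch/SHA handling described for `hack/sync-profiles.sh` can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the `fetch_ref` function, its arguments, and the directory layout are hypothetical, and it assumes a reasonably recent `git` with `sparse-checkout` support.

```shell
#!/usr/bin/env bash
# Sketch only: fetch_ref and its arguments are hypothetical names,
# not copied from hack/sync-profiles.sh.
set -euo pipefail

# fetch_ref REPO_URL REF SPARSE_PATH DEST
# Clones REPO_URL without blobs or a working tree, limits the checkout to
# SPARSE_PATH, checks out REF (tag, branch, or commit SHA), and prints the
# abbreviated commit hash that REF resolved to.
fetch_ref() {
  local repo_url=$1 ref=$2 sparse_path=$3 dest=$4

  # --filter=blob:none defers blob downloads; --no-checkout skips the initial
  # working tree so nothing is materialized before the sparse set is defined.
  git clone --quiet --filter=blob:none --no-checkout "$repo_url" "$dest"

  # Restrict the working tree to the one directory that is needed.
  git -C "$dest" sparse-checkout set "$sparse_path"

  # Unlike `git clone --branch`, a plain checkout accepts any commit-ish.
  git -C "$dest" checkout --quiet "$ref"

  # Resolve to the exact commit so it can be recorded for provenance.
  git -C "$dest" rev-parse --short HEAD
}
```

A call such as `fetch_ref <repo-url> main <profiles-dir> "$(mktemp -d)/upstream"` would print the short hash that `main` resolved to at clone time, which is the value a provenance header like the `# Source:` line could record.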
## Verification
- `hack/sync-profiles.sh` (default → `main`) ran clean → 7 profiles,
header pinned to `497fa04`.
- `builtin.yaml` parses as valid YAML (7 ConfigMap docs); each profile
contains `pcie_topology`.
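
The checks above could be approximated with a small script like the one below. It is a sketch under stated assumptions, not the verification that was actually run: the helper names `source_commit` and `check_profiles` are hypothetical, documents are assumed to be separated by lines that are exactly `---`, and the key check is a plain substring match rather than a real YAML parse. Note that `source_commit` searches for the first `# Source:` line instead of reading a fixed number of header lines, which is a different approach from the workflow's `head -3`.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; this is a structural check,
# not a YAML parser.
set -euo pipefail

# source_commit FILE: print the commit recorded in the first `# Source:` line.
source_commit() {
  sed -n 's/^# Source:.*(commit \([0-9a-f]\{7,40\}\)).*/\1/p' "$1" | head -n 1
}

# check_profiles FILE KEY: print "<docs> <docs containing KEY>" and return
# non-zero unless there is at least one document and every one contains KEY.
check_profiles() {
  awk -v key="$2" '
    function end_doc() {
      if (seen) { docs++; if (hit) hits++ }
      seen = 0; hit = 0
    }
    /^---[ \t]*$/   { end_doc(); next }   # document separator
    /^[ \t]*(#|$)/  { next }              # skip comments and blank lines
    { seen = 1; if (index($0, key)) hit = 1 }
    END {
      end_doc()
      print docs + 0, hits + 0
      exit (docs > 0 && docs == hits) ? 0 : 1
    }
  ' "$1"
}
```

Run against the regenerated file, `source_commit builtin.yaml` should print the pinned commit and `check_profiles builtin.yaml pcie_topology` should report equal counts for documents and matches.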
## Context
Found while testing the mock backend on a real-NUMA EKS node: the real
NFD topology-updater produces a populated NRT (CPU zone `cap=8
alloc=7`), but GPUs don't appear in the zones because the rendered mock
profile had no `pcie_topology`. This sync is the prerequisite for
mock-backend GPU↔NUMA.
RUN-40173
Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
1 parent 2c41ffc
4 files changed
Lines changed: 820 additions & 85 deletions
File tree
- .github/workflows
- deploy/fake-gpu-operator/templates/profiles
- hack