
chore(profiles): sync GPU profiles with pcie_topology/NUMA from upstream (RUN-40173)#212

Merged
eliranw merged 1 commit into main from eliranw/RUN-40173-sync-gpu-profiles-numa
Jun 4, 2026

@eliranw eliranw commented Jun 4, 2026

What

Re-syncs the built-in GPU profiles via hack/sync-profiles.sh from upstream NVIDIA/k8s-test-infra main (synced commit 497fa04), and improves the sync tooling.

The previous pin was the v0.1.0 tag, which predates upstream's pcie_topology block (PCI root complexes + per-device numa_node). Without it, the mock backend's device-plugin can't report GPU→NUMA, so NodeResourceTopology zones never get GPU resources and NUMA/topology-aware scheduling can't be exercised on mock GPUs. pcie_topology currently exists only on upstream main (no tagged release yet).

Changes

  • hack/sync-profiles.sh:
    • DEFAULT_VERSION now tracks main (the committed builtin.yaml is the pinned artifact; each sync produces a reviewable PR diff). Override with a tag/commit once upstream cuts a release.
    • Accepts a tag, branch, OR commit SHA: blobless+sparse clone then git checkout <ref> (the old git clone --branch only accepted tags/branches, though the workflow advertises "tag or commit").
    • The generated header records the exact resolved commit for provenance: # Source: NVIDIA/k8s-test-infra main (commit 497fa04).
  • builtin.yaml (regenerated): all 7 profiles now carry pcie_topology (a100/b200/gb200/gb300/h100/l40s/t4); gb300 is new.
  • .github/workflows/sync-gpu-profiles.yml: fix version extraction (head -2 → head -3); the # Source: line is line 3 of the generated header, so the extraction previously captured nothing.
  • CHANGELOG.md: entry under [Unreleased] → Changed.
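
For readers unfamiliar with the clone strategy described above, here is a minimal sketch of a blobless, sparse clone followed by git checkout <ref>. It is not the actual contents of hack/sync-profiles.sh: the function name, the GitHub URL form, and PROFILE_PATH are assumptions made for illustration, and the real upstream profile directory is not stated in this PR.

```shell
# Sketch only, not the real hack/sync-profiles.sh.
# UPSTREAM_URL assumes the NVIDIA/k8s-test-infra repo is hosted on GitHub.
UPSTREAM_URL="https://github.com/NVIDIA/k8s-test-infra.git"
# Hypothetical placeholder: the actual profile directory is not given in the PR.
PROFILE_PATH="path/to/profiles"

# Fetch PROFILE_PATH at <ref>, where <ref> may be a tag, branch, or commit SHA.
# Prints the short commit hash that <ref> resolved to.
fetch_profiles_at_ref() {
    ref="$1"
    dest="$2"
    # Blobless + sparse: history and trees are downloaded, file contents
    # are fetched lazily and only for the paths that get checked out.
    git clone --quiet --filter=blob:none --no-checkout --sparse \
        "$UPSTREAM_URL" "$dest"
    git -C "$dest" sparse-checkout set "$PROFILE_PATH"
    # Unlike `git clone --branch`, checkout also accepts a commit SHA.
    git -C "$dest" checkout --quiet "$ref"
    # Resolved commit, suitable for a provenance header.
    git -C "$dest" rev-parse --short HEAD
}
```

Under these assumptions, a call such as `fetch_profiles_at_ref main "$(mktemp -d)"` would print the commit that main currently points at, which is the value the generated `# Source:` header records.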

Verification

  • hack/sync-profiles.sh (default → main) ran clean → 7 profiles, header pinned to 497fa04.
  • builtin.yaml parses as valid YAML (7 ConfigMap docs); each profile contains pcie_topology.
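
The checks above can be approximated with a few shell commands. The sketch below runs against a small stand-in file, not the real builtin.yaml: the first two header lines, the ConfigMap names, and the data key are hypothetical, and the head/grep/sed pipeline is an assumed shape for the workflow's version extraction, not a copy of it. It also shows why head -2 misses a `# Source:` line that sits on line 3.

```shell
# Stand-in for builtin.yaml; everything except the "# Source:" line is hypothetical.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
# hypothetical header line 1
# hypothetical header line 2
# Source: NVIDIA/k8s-test-infra main (commit 497fa04)
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: sample-a
data:
  profile.yaml: |
    pcie_topology: {}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: sample-b
data:
  profile.yaml: |
    pcie_topology: {}
EOF

# head -2 stops before the Source line, so nothing is captured.
source_with_head2="$(head -2 "$sample" | grep '^# Source:' || true)"
# head -3 includes it.
source_with_head3="$(head -3 "$sample" | grep '^# Source:' || true)"
# Pull the short commit hash out of "(commit <hash>)".
synced_commit="$(printf '%s\n' "$source_with_head3" \
    | sed -n 's/.*(commit \([0-9a-f]*\)).*/\1/p')"

# Count ConfigMap documents and documents that mention pcie_topology.
configmap_docs="$(grep -c '^kind: ConfigMap' "$sample")"
with_topology="$(grep -c 'pcie_topology:' "$sample")"

echo "head -2 captured: '${source_with_head2}'"
echo "synced commit:    ${synced_commit}"
echo "ConfigMap docs:   ${configmap_docs}, with pcie_topology: ${with_topology}"
rm -f "$sample"
```

A grep-based count only confirms that the expected keys are present; it does not validate the YAML itself, so a real check would still need a YAML parser.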

Context

Found while testing the mock backend on a real-NUMA EKS node: the real NFD topology-updater produces a populated NRT (CPU zone cap=8 alloc=7), but GPUs don't appear in the zones because the rendered mock profile had no pcie_topology. This sync is the prerequisite for mock-backend GPU↔NUMA.

RUN-40173

@eliranw eliranw requested a review from a team as a code owner June 4, 2026 14:16
@eliranw eliranw force-pushed the eliranw/RUN-40173-sync-gpu-profiles-numa branch from 4565e19 to b9f3198 Compare June 4, 2026 14:43
…eam (RUN-40173)

Re-sync built-in GPU profiles from upstream NVIDIA/k8s-test-infra main (commit 497fa04), which adds the pcie_topology block (PCI root complexes + per-device numa_node) plus the gb300 profile -- what the mock backend needs to surface GPU->NUMA for NodeResourceTopology / topology-aware scheduling.

hack/sync-profiles.sh now defaults to tracking main and accepts a tag, branch, OR commit SHA (blobless clone + checkout instead of clone --branch); the generated header records the exact synced commit for provenance. Also fix the sync-gpu-profiles workflow version extraction (head -2 -> head -3) so it captures the Source line.

Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
@eliranw eliranw force-pushed the eliranw/RUN-40173-sync-gpu-profiles-numa branch from b9f3198 to 7a30881 Compare June 4, 2026 14:46
@eliranw eliranw merged commit b881831 into main Jun 4, 2026
11 checks passed
@eliranw eliranw deleted the eliranw/RUN-40173-sync-gpu-profiles-numa branch June 4, 2026 15:35
2 participants