mig faker#205

Open
iris-shain-runai wants to merge 4 commits into main from Iris/RUN-38130-support-mig-faker

@iris-shain-runai iris-shain-runai commented Jun 1, 2026

Description

When migStrategy: mixed was configured, mig-faker faked the MIG hardware metadata (UUIDs, nvidia.com/mig.config.state: success) but no component ever advertised nvidia.com/mig-* resources to kubelet — the device-plugin was hardcoded to only register nvidia.com/gpu. As a result MIG slices never appeared in node.status.allocatable and MIG-based scheduling was impossible.

This PR wires the MIG path end-to-end:

  • Topology is the source of truth for MIG. GpuDetails gains MigEnabled + MigInstances (internal/common/topology/types.go), and a new topology.AdvertisedResources() computes the resources a node should expose for none / single / mixed strategies. It's the single place that decides "what does this node advertise," now reused by device-plugin, the fake-node plugin, and the KWOK config-map handler.
  • device-plugin advertises MIG. It builds one plugin per advertised resource (each on its own socket), and CleanupStaleSockets() removes orphaned sockets on startup so a profile change can't leave stale resources behind.
  • mig-faker publishes MIG state. It writes MigInstances into the per-node topology ConfigMap (via topology.UpdateNodeTopologyCM, server-side apply — following the repo convention) and restarts the device-plugin pod so kubelet picks up the new allocatable resources. The devices: [all] parsing crash is also fixed.
  • RBAC: mig-faker ClusterRole now allows pods: [list, delete] (for the restart) and configmaps: patch (required by server-side apply).
  • Docs: README gains a "NVIDIA MIG" section documenting migStrategy, the node-role.kubernetes.io/runai-dynamic-mig label, and the run.ai/mig.config annotation format, all previously undocumented (called out in #177: mig-faker does not register a kubelet device plugin, so nvidia.com/mig-* resources never appear in node allocatable).

MIG strategies

AdvertisedResources implements all three strategies:

| migStrategy | A MIG-enabled GPU sliced into N pieces is advertised as |
| --- | --- |
| none (default) | nvidia.com/gpu per physical GPU (MIG ignored) |
| single | nvidia.com/gpu: N (slices counted as plain GPUs) |
| mixed | nvidia.com/mig-<profile> per slice; the GPU drops out of nvidia.com/gpu |
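
As a rough illustration of the strategy matrix, the sketch below computes resource counts for one node. The types are simplified stand-ins rather than the PR's actual `GpuDetails`, the function name and signature are hypothetical, and the handling of a MIG-enabled GPU with zero instances (counted as a plain GPU) is an assumption.

```go
package main

import "fmt"

// MigInstance and GpuDetails are simplified, hypothetical stand-ins for the
// topology types; only the fields needed for the sketch are modelled.
type MigInstance struct {
	Profile string // e.g. "1g.5gb"
}

type GpuDetails struct {
	MigEnabled   bool
	MigInstances []MigInstance
}

// advertisedResources returns resource name -> count for one node under the
// given migStrategy ("none", "single" or "mixed").
func advertisedResources(strategy string, gpus []GpuDetails) map[string]int {
	out := map[string]int{}
	for _, g := range gpus {
		sliced := g.MigEnabled && len(g.MigInstances) > 0
		switch {
		case strategy == "single" && sliced:
			// Every slice is counted as a plain GPU.
			out["nvidia.com/gpu"] += len(g.MigInstances)
		case strategy == "mixed" && sliced:
			// One resource per profile; the GPU itself is not advertised.
			for _, m := range g.MigInstances {
				out["nvidia.com/mig-"+m.Profile]++
			}
		default:
			// "none", or a GPU without MIG slices: one plain GPU.
			out["nvidia.com/gpu"]++
		}
	}
	return out
}

func main() {
	gpus := []GpuDetails{
		{}, // plain GPU
		{MigEnabled: true, MigInstances: []MigInstance{{"1g.5gb"}, {"1g.5gb"}, {"2g.10gb"}}},
	}
	for _, s := range []string{"none", "single", "mixed"} {
		fmt.Println(s, advertisedResources(s, gpus))
	}
}
```

For a node with one plain GPU and one GPU sliced into three instances, this yields 2 plain GPUs under none, 4 under single, and under mixed 1 plain GPU plus two `nvidia.com/mig-1g.5gb` and one `nvidia.com/mig-2g.10gb`.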

Call graph

mig-faker (triggered when the run.ai/mig.config node annotation changes). The numbered nodes are FakeMapping's steps in execution order; dotted edges are buildMigState's internal helper calls:

flowchart TD
  A["MigFakeApp.Run loop"] -->|annotation change| B["SyncableMigConfig.Get"]
  B --> C["MigFaker.FakeMapping"]
  C --> D["1. getNodeTopology -> topology.GetNodeTopologyFromCM"]
  D --> E["2. buildMigState"]
  E --> F["3. SetNodeLabels: mig.config.state=success"]
  F --> G["4. SetNodeAnnotations: mig-mapping"]
  G --> H["5. updateNodeTopology -> topology.UpdateNodeTopologyCM, SSA patch"]
  H --> I["6. restartDevicePluginPod: list + delete device-plugin pod"]
  E -.-> E1["expandGPUIndices: all / explicit indices"]
  E -.-> E2["buildGpuMigDeviceState -> migInstanceNameToGpuInstanceId"]
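
The "all / explicit indices" expansion on the E1 edge might look roughly like the sketch below. The function name `expandIndices`, its signature, and the decision to represent devices as strings are assumptions for illustration; the PR's `expandGPUIndices` may differ.

```go
package main

import (
	"fmt"
	"strconv"
)

// expandIndices turns a devices list from a MIG config into concrete GPU
// indices. The single keyword "all" selects every GPU on the node; anything
// else must be a valid integer index. Returning an error instead of
// panicking is the point: a non-numeric entry such as "all" must not crash
// the parser.
func expandIndices(devices []string, gpuCount int) ([]int, error) {
	if len(devices) == 1 && devices[0] == "all" {
		all := make([]int, gpuCount)
		for i := range all {
			all[i] = i
		}
		return all, nil
	}
	out := make([]int, 0, len(devices))
	for _, d := range devices {
		idx, err := strconv.Atoi(d)
		if err != nil {
			return nil, fmt.Errorf("device %q is neither \"all\" nor an index: %w", d, err)
		}
		if idx < 0 || idx >= gpuCount {
			return nil, fmt.Errorf("device index %d out of range [0,%d)", idx, gpuCount)
		}
		out = append(out, idx)
	}
	return out, nil
}

func main() {
	fmt.Println(expandIndices([]string{"all"}, 4))
	fmt.Println(expandIndices([]string{"0", "2"}, 4))
	_, err := expandIndices([]string{"all", "1"}, 4)
	fmt.Println(err)
}
```

In this sketch, mixing "all" with explicit indices is rejected rather than silently interpreted; that is a design choice of the example, not something the PR states.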

device-plugin (on every (re)start; the restart in step 6 above is what makes it re-read MIG state). Numbered nodes are main's steps in order; dotted edges are NewDevicePlugins's internal calls:

flowchart TD
  M["main"] --> N["1. topology.GetNodeTopologyFromCM"]
  N --> O["2. deviceplugin.CleanupStaleSockets"]
  O --> P["3. deviceplugin.NewDevicePlugins"]
  P --> S["4. plugin.Serve -> Start + Register with kubelet"]
  P -.-> Q["topology.AdvertisedResources: none/single/mixed"]
  P -.-> R["one RealNodeDevicePlugin per advertised resource"]

The two processes are decoupled through the per-node topology ConfigMap: mig-faker writes MIG state (step 5) and bounces the pod (step 6); the restarted device-plugin re-reads the CM (step 1) and AdvertisedResources now yields the MIG resources.

Related Issues

Fixes #177

Checklist

  • Self-reviewed
  • Added/updated tests (if needed)
  • Updated documentation (if needed)
  • Updated CHANGELOG.md under ## [Unreleased]

Testing

  • Unit tests: the AdvertisedResources strategy matrix (none / single / mixed), device-plugin construction per MIG profile, resourceSocket, and FakeMapping (including expansion of devices: [all], the topology CM update, and the device-plugin pod restart).
  • New e2e suite test/e2e/device-plugin/ (make e2e-device-plugin, wired into CI): on a real kind worker it enables the device-plugin + mig-faker, applies a run.ai/mig.config, and asserts nvidia.com/mig-1g.5gb reaches allocatable, the topology CM is updated, a pod requesting the MIG resource runs, and reconfiguration to a different profile replaces the old resources. Ran locally — 6/6 specs pass. This suite caught the missing configmaps: patch RBAC verb.

Breaking Changes

None.

Additional Notes

  • device-plugin and draPlugin remain mutually exclusive; the MIG path is device-plugin only, so the new e2e suite is isolated (its own kind cluster) like e2e-mock.
  • Known limitation: per-MIG-slice usage is not tracked by status-updater yet (scheduling works; per-slice metrics/nvidia-smi fidelity is a follow-up).

@iris-shain-runai iris-shain-runai requested a review from a team as a code owner June 1, 2026 16:15