Add MIG (Multi-Instance GPU) support validation for Ubuntu 24.04 #7198

Closed
surajssd wants to merge 2 commits into master from suraj/device-plugin-mig-test

Conversation

@surajssd
Member

What type of PR is this?

/kind test

What this PR does / why we need it:

This PR adds tests for NVIDIA MIG functionality on Ubuntu 24.04 nodes that use the managed GPU experience:

  • Added Test_Ubuntu2404_NvidiaDevicePluginRunning_MIG test case using Standard_NC24ads_A100_v4 VMs with MIG2g profile
  • Implemented MIG-specific validation functions:
    • ValidateMIGModeEnabled: Verifies MIG mode is active via nvidia-smi
    • ValidateMIGInstancesCreated: Confirms MIG instances are created with expected profile
    • ValidateNodeAdvertisesMIGResources: Ensures Kubernetes advertises MIG resources instead of standard GPU resources
    • ValidateMIGWorkloadSchedulable: Tests that workloads can successfully schedule and run on MIG instances
  • Validates that nvidia.com/mig-2g.20gb resources are advertised instead of nvidia.com/gpu when MIG is enabled
  • Includes DCGM exporter validation for MIG-enabled nodes
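
To illustrate the resource check described above, here is a minimal sketch and not the code from this PR. It assumes the node's allocatable resources have already been flattened into a plain map; the real validator would presumably read them from the Kubernetes node object. The helper name `checkMIGResources` and the instance count of 3 are hypothetical and used only for the example.

```go
package main

import "fmt"

const (
	gpuResource = "nvidia.com/gpu"         // standard GPU resource name
	migResource = "nvidia.com/mig-2g.20gb" // MIG resource named in this PR
)

// checkMIGResources is a hypothetical helper: it fails if the node still
// advertises whole GPUs, or if the MIG resource count differs from wantMIG.
func checkMIGResources(allocatable map[string]int64, wantMIG int64) error {
	if n := allocatable[gpuResource]; n > 0 {
		return fmt.Errorf("node advertises %d %s, expected none when MIG is enabled", n, gpuResource)
	}
	if got := allocatable[migResource]; got != wantMIG {
		return fmt.Errorf("node advertises %d %s, expected %d", got, migResource, wantMIG)
	}
	return nil
}

func main() {
	// Assumed example values; the expected count depends on the GPU and profile.
	node := map[string]int64{migResource: 3}
	if err := checkMIGResources(node, 3); err != nil {
		fmt.Println("validation failed:", err)
		return
	}
	fmt.Println("MIG resources advertised as expected")
}
```

Checking both conditions separately keeps the failure message specific: a node that still exposes `nvidia.com/gpu` points at MIG not being applied, while a wrong MIG count points at the profile.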

- Add gpuCountExpected parameter to ValidateNodeAdvertisesGPUResources() to
  validate exact GPU count instead of just checking > 0
- Add gpuCount parameter to ValidateGPUWorkloadSchedulable() to make GPU
  resource request configurable
- Update all test callers to pass expected GPU count of 1
- Improve logging to show actual vs expected GPU counts for better debugging
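
A rough sketch of the exact-count check, for illustration only. The function name `validateGPUCount` and its signature are hypothetical; the actual `ValidateNodeAdvertisesGPUResources()` in the test suite takes different arguments that are not shown in this PR description.

```go
package main

import "fmt"

// validateGPUCount is a hypothetical stand-in for the stricter check:
// it compares the advertised count to an exact expected value instead
// of only requiring the count to be greater than zero.
func validateGPUCount(advertised, expected int64) error {
	if advertised != expected {
		// Report both values so a failing run shows actual vs expected.
		return fmt.Errorf("GPU count mismatch: node advertises %d, expected %d", advertised, expected)
	}
	return nil
}

func main() {
	if err := validateGPUCount(1, 1); err != nil {
		fmt.Println(err)
	}
	if err := validateGPUCount(2, 1); err != nil {
		fmt.Println(err)
	}
}
```

With a `> 0` check, the second call would pass silently; the exact comparison surfaces the unexpected extra GPU.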

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
- Add Test_Ubuntu2404_NvidiaDevicePluginRunning_MIG to validate MIG
  functionality
- Configure test with Standard_NC24ads_A100_v4 VM size and MIG2g instance
  profile
- Add ValidateMIGModeEnabled validator to check MIG mode is enabled via
  nvidia-smi
- Add ValidateMIGInstancesCreated validator to verify MIG instances are properly
  created
- Test validates device plugin, DCGM exporter, and GPU resource scheduling with
  MIG
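
One possible shape for the MIG mode check, sketched under assumptions. The PR only says the validator uses nvidia-smi; this example assumes the output comes from `nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader`, which prints one line per GPU. The helper `migModeEnabled` is hypothetical and only parses a string, so it runs without a GPU.

```go
package main

import (
	"fmt"
	"strings"
)

// migModeEnabled is a hypothetical parser: it returns true only if the
// output contains at least one GPU line and every line reads "Enabled".
func migModeEnabled(output string) bool {
	seen := false
	for _, line := range strings.Split(strings.TrimSpace(output), "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue // skip blank lines
		}
		if !strings.EqualFold(line, "Enabled") {
			return false // any GPU not in MIG mode fails the check
		}
		seen = true
	}
	return seen
}

func main() {
	// Assumed sample output for a single-GPU node.
	fmt.Println(migModeEnabled("Enabled\n"))
	fmt.Println(migModeEnabled("Disabled\n"))
}
```

Treating empty output as a failure avoids a false pass when the command produces nothing, for example when the driver is not loaded.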

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
@surajssd
Member Author

Dependent on #7201

@surajssd
Member Author

Closing this in favor of #7201

@surajssd surajssd closed this Oct 16, 2025
@surajssd surajssd deleted the suraj/device-plugin-mig-test branch October 16, 2025 15:18