Skip to content

fix(azure/rhel-ai): harden RHEL AI provisioning on Azure#826

Open
rishupk wants to merge 3 commits into
redhat-developer:mainfrom
rishupk:fix+azure-rhelai-combined
Open

fix(azure/rhel-ai): harden RHEL AI provisioning on Azure#826
rishupk wants to merge 3 commits into
redhat-developer:mainfrom
rishupk:fix+azure-rhelai-combined

Conversation

@rishupk
Copy link
Copy Markdown
Contributor

@rishupk rishupk commented Jun 5, 2026

Summary

Combines three related Azure RHEL AI fixes into a single PR (per reviewer request). These were discovered and validated during end-to-end testing with Claudio.

Commit 1 — Disk controller type (was #823): RHEL AI gallery images require SCSI but mapt never set it explicitly, letting Azure infer a conflicting default on some VM sizes. Now sets DiskControllerType: "SCSI" in ImageReference and validates against the gallery image's features.

Commit 2 — GPU guardrails (was #825): Spot allocation could pick non-GPU VMs for RHEL AI, causing vllm to fail silently. Adds isGPUCapableSize() validation (ND/NC-series only) and defaults GPUs=1 when unset.

Commit 3 — NVMe filter + credential handling (was #824): L-series VMs bypassed the local storage filter via NVMe capability. Also fixes credential resolution (ARM_*AZURE_* mapping) for code paths that run before provider.Init().

Supersedes #823, #824, #825.

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
Co-Authored-By: Claude Sonnet 4.6 noreply@anthropic.com
Signed-off-by: Rishabh Kothari rkothari@redhat.com

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 5, 2026

Review Change Stack

Warning

Review limit reached

@rishupk, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 32 minutes and 46 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: e9849919-8763-42f7-9088-15ee5b8e3bda

📥 Commits

Reviewing files that changed from the base of the PR and between 25a9684 and 92e39f9.

📒 Files selected for processing (10)
  • pkg/provider/azure/action/rhel-ai/rhelai.go
  • pkg/provider/azure/action/rhel-ai/rhelai_test.go
  • pkg/provider/azure/data/compute-request.go
  • pkg/provider/azure/data/compute-request_test.go
  • pkg/provider/azure/data/imageref.go
  • pkg/provider/azure/data/images.go
  • pkg/provider/azure/data/util.go
  • pkg/provider/azure/data/vmsize.go
  • pkg/provider/azure/modules/allocation/allocation.go
  • pkg/provider/azure/modules/virtual-machine/virtual-machine.go
📝 Walkthrough

Walkthrough

This PR extends the Azure provider with GPU-capable RHEL AI instance selection and disk controller type support. It adds environment credential handling utilities, updates compute request filtering with GPU and NVMe awareness, validates GPU-capable sizes in the RHEL AI action, integrates disk controller type resolution in allocation, and configures disk controllers during VM creation.

Changes

GPU and Disk Controller Type Support

Layer / File(s) Summary
Azure Environment & Subscription Handling
pkg/provider/azure/data/util.go
ensureAzureEnvs() copies ARM_* variables into unset AZURE_* variables; SubscriptionID() exported helper prefers AZURE_SUBSCRIPTION_ID with ARM_SUBSCRIPTION_ID fallback; splitDiskControllerTypes() parses comma-separated disk controller strings. getCredentials() and getGraphClientFactory() updated to call ensureAzureEnvs().
Image Reference & Disk Controller Type Schema
pkg/provider/azure/data/imageref.go
ImageReference struct extended with SharedImageID field for Private Shared images and DiskControllerType field for disk controller specification.
Image Metadata & Disk Controller Queries
pkg/provider/azure/data/images.go
New GetSharedImageDiskControllerTypes() queries shared gallery image definitions for supported disk controller types via ARM ID parsing. IsImageOffered, GetSharedImage, and SkuG2Support updated to use ensureAzureEnvs() and SubscriptionID() for subscription ID resolution. os import removed.
Compute Request Filtering & GPU-Aware SKU Selection
pkg/provider/azure/data/compute-request.go, compute-request_test.go
Replaces FilterComputeSizesByLocation with FilterComputeSizesByDiskControllerType for disk-controller matching. FilterNoLocalStorageSizes refactored to filter by NVMe disk size with warning logs for dropped and unknown sizes. Internal virtualMachine struct extended with GPU count, NVMe disk size, and disk controller types fields. getAzureVMSKUs refactored to use environment handling and client-factory pagination. filterCPUsAndMemory adds GPU-aware filtering. Comprehensive test coverage includes NVMe classification, disk controller parsing, SKU filtering, and GPU+NVMe edge cases.
VM Size Offering by Location
pkg/provider/azure/data/vmsize.go
FilterVMSizeOfferedByLocation uses ensureAzureEnvs() and SubscriptionID() for subscription ID. os import removed.
RHEL AI GPU Instance Validation & Provisioning
pkg/provider/azure/action/rhel-ai/rhelai.go, rhelai_test.go
New isGPUCapableSize() classifies ND/NC-series as GPU-capable. Enhanced Create() validates inputs, shallow-copies ComputeRequest, filters NVMe sizes, defaults GPUs to 1, enforces GPU-capable sizing, and hard-sets DiskControllerType: "SCSI". Updated error handling distinguishes GPU-quota/incompatibility failures. Test covers GPU size classification.
Allocation Module — Image Reference Resolution & Filtering
pkg/provider/azure/modules/allocation/allocation.go
New resolveImageRef() enriches shared-gallery images by querying supported disk controller types, setting DiskControllerType when exactly one type is returned and caller unspecified. Allocation() derives effective image reference, filters compute sizes by disk-controller compatibility in both spot and non-spot paths, and returns resolved image reference.
Virtual Machine Creation with Disk Controller Type
pkg/provider/azure/modules/virtual-machine/virtual-machine.go
Create() sets VM StorageProfile.ManagedDisk.DiskControllerType based on image's DiskControllerType. isSelfOwned() uses data.SubscriptionID() instead of direct environment read. New diskControllerTypePtr() helper converts empty string to nil, otherwise pulumi.StringPtr(). os import removed.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 43.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: hardening RHEL AI provisioning on Azure through multiple fixes (disk controller, GPU guardrails, NVMe/credential handling).
Description check ✅ Passed The description clearly explains all three related fixes being combined, their purpose, and which issues they supersede, all relevant to the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/provider/azure/data/compute-request.go`:
- Line 39: The function FilterComputeSizesByDiskControllerType dereferences the
pointer parameter location (used in strings.EqualFold(*loc, *location)) without
checking for nil; add a nil-check at the start of
FilterComputeSizesByDiskControllerType that returns a clear error if location ==
nil (or handle nil explicitly per caller expectations), so the subsequent loop
comparing *loc and *location is safe; update any error message to reference the
function name and the location parameter for clarity.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 6e77b6cc-cb5b-4ca8-af65-5e3214d423fc

📥 Commits

Reviewing files that changed from the base of the PR and between 7c5d50d and 25a9684.

📒 Files selected for processing (10)
  • pkg/provider/azure/action/rhel-ai/rhelai.go
  • pkg/provider/azure/action/rhel-ai/rhelai_test.go
  • pkg/provider/azure/data/compute-request.go
  • pkg/provider/azure/data/compute-request_test.go
  • pkg/provider/azure/data/imageref.go
  • pkg/provider/azure/data/images.go
  • pkg/provider/azure/data/util.go
  • pkg/provider/azure/data/vmsize.go
  • pkg/provider/azure/modules/allocation/allocation.go
  • pkg/provider/azure/modules/virtual-machine/virtual-machine.go

Comment thread pkg/provider/azure/data/compute-request.go
rishupk and others added 2 commits June 5, 2026 15:20
- Add DiskControllerType field to ImageReference; rhelai sets it to "SCSI"
  as a fallback when the gallery API is unavailable
- Wire ImageReference.DiskControllerType into StorageProfile in
  virtual-machine.go via diskControllerTypePtr helper
- Add GetSharedImageDiskControllerTypes to read disk controller types from
  gallery image definition features (not the version)
- Add FilterComputeSizesByDiskControllerType combining location and disk
  controller type filtering in a single SKU enumeration; replaces the
  previous two-pass approach in allocation.go
- resolveImageRef enriches DiskControllerType from gallery API only when
  the image supports exactly one type (capability != requirement)
- Filter both spot and non-spot compute sizes; return a clear error if no
  compatible sizes remain after filtering

Co-Authored-By: Claude <Sonnet 4.6> <noreply@anthropic.com>
Signed-off-by: Rishabh Kothari <rkothari@redhat.com>
- Add GPU capability detection to VM SKU filter
- Validate that selected compute sizes are GPU-capable (ND/NC-series)
- Default GPUs=1 when unset so spot allocator targets GPU instances

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Rishabh Kothari <rkothari@redhat.com>
@rishupk rishupk force-pushed the fix+azure-rhelai-combined branch from 25a9684 to 244974c Compare June 5, 2026 14:21
- Relax FilterNoLocalStorageSizes to only reject NVMe-only local storage
  (L-series). VMs with temp disks (MaxResourceVolumeMB > 0) like
  Standard_NC64as_T4_v3 are now allowed — temp disks are ephemeral
  scratch space that does not interfere with RHEL AI's OS disk.

- Add ensureAzureEnvs() to map ARM_* env vars to AZURE_* before any
  Azure SDK call. The Azure SDK reads AZURE_TENANT_ID etc from the
  environment, but callers (Claudio, Pulumi) typically set ARM_* vars.
  Previously this mapping only happened in provider.Init(), which runs
  after FilterNoLocalStorageSizes — causing credential failures when
  --compute-sizes is used.

- Centralize AZURE_SUBSCRIPTION_ID reads through SubscriptionID()
  helper that falls back to ARM_SUBSCRIPTION_ID.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Rishabh Kothari <rkothari@redhat.com>
@rishupk rishupk force-pushed the fix+azure-rhelai-combined branch from 244974c to 92e39f9 Compare June 5, 2026 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant