fix(azure/rhel-ai): harden RHEL AI provisioning on Azure#826
Conversation
|
Warning Review limit reached
More reviews will be available in 32 minutes and 46 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (10)
📝 WalkthroughWalkthroughThis PR extends the Azure provider with GPU-capable RHEL AI instance selection and disk controller type support. It adds environment credential handling utilities, updates compute request filtering with GPU and NVMe awareness, validates GPU-capable sizes in the RHEL AI action, integrates disk controller type resolution in allocation, and configures disk controllers during VM creation. ChangesGPU and Disk Controller Type Support
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/provider/azure/data/compute-request.go`:
- Line 39: The function FilterComputeSizesByDiskControllerType dereferences the
pointer parameter location (used in strings.EqualFold(*loc, *location)) without
checking for nil; add a nil-check at the start of
FilterComputeSizesByDiskControllerType that returns a clear error if location ==
nil (or handle nil explicitly per caller expectations), so the subsequent loop
comparing *loc and *location is safe; update any error message to reference the
function name and the location parameter for clarity.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 6e77b6cc-cb5b-4ca8-af65-5e3214d423fc
📒 Files selected for processing (10)
pkg/provider/azure/action/rhel-ai/rhelai.gopkg/provider/azure/action/rhel-ai/rhelai_test.gopkg/provider/azure/data/compute-request.gopkg/provider/azure/data/compute-request_test.gopkg/provider/azure/data/imageref.gopkg/provider/azure/data/images.gopkg/provider/azure/data/util.gopkg/provider/azure/data/vmsize.gopkg/provider/azure/modules/allocation/allocation.gopkg/provider/azure/modules/virtual-machine/virtual-machine.go
- Add DiskControllerType field to ImageReference; rhelai sets it to "SCSI" as a fallback when the gallery API is unavailable - Wire ImageReference.DiskControllerType into StorageProfile in virtual-machine.go via diskControllerTypePtr helper - Add GetSharedImageDiskControllerTypes to read disk controller types from gallery image definition features (not the version) - Add FilterComputeSizesByDiskControllerType combining location and disk controller type filtering in a single SKU enumeration; replaces the previous two-pass approach in allocation.go - resolveImageRef enriches DiskControllerType from gallery API only when the image supports exactly one type (capability != requirement) - Filter both spot and non-spot compute sizes; return a clear error if no compatible sizes remain after filtering Co-Authored-By: Claude <Sonnet 4.6> <noreply@anthropic.com> Signed-off-by: Rishabh Kothari <rkothari@redhat.com>
- Add GPU capability detection to VM SKU filter - Validate that selected compute sizes are GPU-capable (ND/NC-series) - Default GPUs=1 when unset so spot allocator targets GPU instances Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Rishabh Kothari <rkothari@redhat.com>
25a9684 to
244974c
Compare
- Relax FilterNoLocalStorageSizes to only reject NVMe-only local storage (L-series). VMs with temp disks (MaxResourceVolumeMB > 0) like Standard_NC64as_T4_v3 are now allowed — temp disks are ephemeral scratch space that does not interfere with RHEL AI's OS disk. - Add ensureAzureEnvs() to map ARM_* env vars to AZURE_* before any Azure SDK call. The Azure SDK reads AZURE_TENANT_ID etc from the environment, but callers (Claudio, Pulumi) typically set ARM_* vars. Previously this mapping only happened in provider.Init(), which runs after FilterNoLocalStorageSizes — causing credential failures when --compute-sizes is used. - Centralize AZURE_SUBSCRIPTION_ID reads through SubscriptionID() helper that falls back to ARM_SUBSCRIPTION_ID. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Rishabh Kothari <rkothari@redhat.com>
244974c to
92e39f9
Compare
Summary
Combines three related Azure RHEL AI fixes into a single PR (per reviewer request). These were discovered and validated during end-to-end testing with Claudio.
Commit 1 — Disk controller type (was #823): RHEL AI gallery images require SCSI but mapt never set it explicitly, letting Azure infer a conflicting default on some VM sizes. Now sets
DiskControllerType: "SCSI"inImageReferenceand validates against the gallery image's features.Commit 2 — GPU guardrails (was #825): Spot allocation could pick non-GPU VMs for RHEL AI, causing vllm to fail silently. Adds
isGPUCapableSize()validation (ND/NC-series only) and defaultsGPUs=1when unset.Commit 3 — NVMe filter + credential handling (was #824): L-series VMs bypassed the local storage filter via NVMe capability. Also fixes credential resolution (
ARM_*→AZURE_*mapping) for code paths that run beforeprovider.Init().Supersedes #823, #824, #825.
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
Co-Authored-By: Claude Sonnet 4.6 noreply@anthropic.com
Signed-off-by: Rishabh Kothari rkothari@redhat.com