fix(azure): harden NVMe filter and credential handling for compute-sizes#824
fix(azure): harden NVMe filter and credential handling for compute-sizes#824rishupk wants to merge 1 commit into
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR filters Azure VM sizes with local storage (NVMe or temp disk) out of RHEL AI server provisioning by extending the VM capability model, introducing a new FilterNoLocalStorageSizes function, adding a noLocalStorageAttached validation predicate, wiring the filter into the Create method, and including comprehensive test coverage. ChangesLocal Storage Filtering for RHEL AI
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
f88e03f to
1fcfe80
Compare
| @@ -0,0 +1,10 @@ | |||
| { | |||
There was a problem hiding this comment.
Recall remove this file from PR, that is weird I though we have in .gitignore
There was a problem hiding this comment.
oh my mistake, removing it
1fcfe80 to
0a49cee
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/provider/azure/data/compute-request.go (1)
224-235:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRestore the nil
ResourceTypeguard inresourceSKUToVirtualMachineWith the current condition, a SKU where
res.ResourceType == nilfalls through, and the function later dereferences*res.Name/*res.Familywhile iterating the Azure SKU catalog infilterVMs(), risking nil-pointer panics.Suggested change
func resourceSKUToVirtualMachine(res *armcompute.ResourceSKU) *virtualMachine { - if res.ResourceType != nil && *res.ResourceType != string(RTVirtualMachines) { + if res == nil || res.ResourceType == nil || *res.ResourceType != string(RTVirtualMachines) { + return nil + } + if res.Name == nil || res.Family == nil { return nil } // If Machine type has any type of restriccions discard if len(res.Restrictions) > 0 { return nil🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/provider/azure/data/compute-request.go` around lines 224 - 235, resourceSKUToVirtualMachine currently allows res.ResourceType == nil to proceed and later dereferences *res.Name/*res.Family which can panic; change the initial guard in resourceSKUToVirtualMachine to return nil if res.ResourceType == nil or if *res.ResourceType != string(RTVirtualMachines). In other words, check for res.ResourceType == nil first (alongside the existing comparison), keep the existing Restrictions check, and ensure the function safely returns nil for non-VM SKUs so filterVMs() won't encounter nil-pointer dereferences when accessing Name/Family.
🧹 Nitpick comments (1)
pkg/provider/azure/data/compute-request.go (1)
83-104: ⚡ Quick winShort-circuit the SKU pager once all requested sizes are resolved.
This path still walks the entire Azure Resource SKUs catalog even after every requested size has already been found. On explicit
--compute-sizes, that adds avoidable control-plane latency and extra API reads for very small input lists.Suggested change
func FilterNoLocalStorageSizes(ctx context.Context, computeSizes []string) ([]string, error) { + requested := make(map[string]struct{}, len(computeSizes)) + for _, size := range computeSizes { + requested[size] = struct{}{} + } + creds, subscriptionID, err := getCredentials() if err != nil { return nil, err } @@ pager := client.NewListPager(nil) capabilities := make(map[string]*virtualMachine, len(computeSizes)) for pager.More() { + if len(capabilities) == len(requested) { + break + } page, err := pager.NextPage(ctx) if err != nil { return nil, err } for _, sku := range page.Value { @@ - if sku.Name == nil || !slices.Contains(computeSizes, *sku.Name) { + if sku.Name == nil { + continue + } + if _, wanted := requested[*sku.Name]; !wanted { continue } if _, seen := capabilities[*sku.Name]; seen { continue }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/provider/azure/data/compute-request.go` around lines 83 - 104, The SKU pager currently iterates the entire catalog even after finding all requested sizes; modify the loop in the block using pager, computeSizes and capabilities so it short-circuits once every requested size is resolved: after you add an entry to capabilities in the inner loop (where resourceSKUToVirtualMachine(sku) is used), check whether you've collected all requested sizes (e.g., len(capabilities) == len(computeSizes)) and then break out of the paging (either return the capabilities map immediately or break both loops / stop the pager). Ensure the check uses the same computeSizes input and the capabilities map to avoid extra API pages.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/provider/azure/data/compute-request.go`:
- Around line 215-221: The current change made
virtualMachine.baseFeaturesSupported() include noLocalStorageAttached(), which
unintentionally filters out NVMe/local-storage sizes globally when
ComputeRequest.ComputeSizes is empty; revert that by removing the
noLocalStorageAttached() check from virtualMachine.baseFeaturesSupported() and
instead apply noLocalStorageAttached() only in the code paths that should
exclude NVMe-backed SKUs (e.g., when callers explicitly want no-local-storage):
move the noLocalStorageAttached() check into the selector/filter logic used by
allocation paths—update ComputeSelector.Select() / filterCPUsAndMemory() to NOT
rely on baseFeaturesSupported() for NVMe exclusion, and add explicit calls to
FilterNoLocalStorageSizes or direct noLocalStorageAttached() checks in
allocation.go and spot.go (and any callers that pass ComputeRequest.ComputeSizes
empty) so only those flows drop NVMe-backed sizes while leaving
baseFeaturesSupported() to cover generic features.
---
Outside diff comments:
In `@pkg/provider/azure/data/compute-request.go`:
- Around line 224-235: resourceSKUToVirtualMachine currently allows
res.ResourceType == nil to proceed and later dereferences *res.Name/*res.Family
which can panic; change the initial guard in resourceSKUToVirtualMachine to
return nil if res.ResourceType == nil or if *res.ResourceType !=
string(RTVirtualMachines). In other words, check for res.ResourceType == nil
first (alongside the existing comparison), keep the existing Restrictions check,
and ensure the function safely returns nil for non-VM SKUs so filterVMs() won't
encounter nil-pointer dereferences when accessing Name/Family.
---
Nitpick comments:
In `@pkg/provider/azure/data/compute-request.go`:
- Around line 83-104: The SKU pager currently iterates the entire catalog even
after finding all requested sizes; modify the loop in the block using pager,
computeSizes and capabilities so it short-circuits once every requested size is
resolved: after you add an entry to capabilities in the inner loop (where
resourceSKUToVirtualMachine(sku) is used), check whether you've collected all
requested sizes (e.g., len(capabilities) == len(computeSizes)) and then break
out of the paging (either return the capabilities map immediately or break both
loops / stop the pager). Ensure the check uses the same computeSizes input and
the capabilities map to avoid extra API pages.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: c38ea75f-4897-4f47-859e-efb8d412f9f7
📒 Files selected for processing (3)
pkg/provider/azure/action/rhel-ai/rhelai.gopkg/provider/azure/data/compute-request.gopkg/provider/azure/data/compute-request_test.go
| func (vm *virtualMachine) noLocalStorageAttached() bool { | ||
| return vm.MaxResourceVolumeMB == 0 && vm.NvmeDiskSizeInMiB == 0 | ||
| } | ||
|
|
||
| func (vm *virtualMachine) baseFeaturesSupported() bool { | ||
| return vm.AcceleratedNetworkingEnabled && vm.PremiumIO && vm.EncryptionAtHostSupported && | ||
| vm.emptyDiskSupported() && vm.hypervGen2Supported() | ||
| vm.noLocalStorageAttached() && vm.hypervGen2Supported() |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Shared selector call sites"
rg -n -C2 'type ComputeSelector struct|NewComputeSelector\(|func \(c \*ComputeSelector\) Select\(|filterCPUsAndMemory\(|baseFeaturesSupported\(' pkg --type=go
echo
echo "RHEL AI-specific wiring"
rg -n -C2 'FilterNoLocalStorageSizes\(|func Create\(.*RHELAIArgs' pkg/provider/azure --type=goRepository: redhat-developer/mapt
Length of output: 6358
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show getAzureVMSKUs and filterCPUsAndMemory logic to see how baseFeaturesSupported is applied.
sed -n '1,260p' pkg/provider/azure/data/compute-request.go | sed -n '1,260p' >/tmp/azure_compute_request_top.txt || true
sed -n '260,420p' pkg/provider/azure/data/compute-request.go >/tmp/azure_compute_request_mid.txt || true
# Allocation call-site
sed -n '1,120p' pkg/provider/azure/modules/allocation/allocation.go >/tmp/azure_allocation_top.txt || true
# Spot call-site
sed -n '1,140p' pkg/provider/azure/data/spot.go >/tmp/azure_spot_top.txt || true
# RHEL AI action call-site around where it filters sizes
sed -n '1,120p' pkg/provider/azure/action/rhel-ai/rhelai.go >/tmp/azure_rhelai_top.txt || true
# Print only the sections we care about (grep to keep output small)
echo "=== azure/data/compute-request.go (key parts) ==="
rg -n "func getAzureVMSKUs|func filterCPUsAndMemory|baseFeaturesSupported|noLocalStorageAttached|FilterNoLocalStorageSizes" pkg/provider/azure/data/compute-request.go
echo
echo "=== allocation.go call-site area ==="
rg -n "NewComputeSelector\\(\\)\\.Select|ComputeRequest" pkg/provider/azure/modules/allocation/allocation.go
echo
echo "=== spot.go call-site area ==="
rg -n "NewComputeSelector\\(\\)\\.Select|ComputeRequest" pkg/provider/azure/data/spot.go
echo
echo "=== rhel-ai.go call-site area ==="
rg -n "FilterNoLocalStorageSizes|ComputeSizes|ComputeRequest" pkg/provider/azure/action/rhel-ai/rhelai.goRepository: redhat-developer/mapt
Length of output: 1386
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== pkg/provider/azure/data/compute-request.go: getAzureVMSKUs + filterCPUsAndMemory (around relevant lines) ==="
sed -n '110,190p' pkg/provider/azure/data/compute-request.go
echo
sed -n '190,260p' pkg/provider/azure/data/compute-request.go
echo
sed -n '280,370p' pkg/provider/azure/data/compute-request.go
echo
echo "=== pkg/provider/azure/modules/allocation/allocation.go (around call) ==="
sed -n '1,120p' pkg/provider/azure/modules/allocation/allocation.go
echo
echo "=== pkg/provider/azure/data/spot.go (around call) ==="
sed -n '1,130p' pkg/provider/azure/data/spot.go
echo
echo "=== pkg/provider/azure/action/rhel-ai/rhelai.go (around FilterNoLocalStorageSizes) ==="
sed -n '1,120p' pkg/provider/azure/action/rhel-ai/rhelai.goRepository: redhat-developer/mapt
Length of output: 15965
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "All Azure call sites of NewComputeSelector().Select"
rg -n --type=go "NewComputeSelector\\(\\)\\.Select\\(" pkg/provider/azure
echo
echo "All call sites of baseFeaturesSupported()"
rg -n --type=go "baseFeaturesSupported\\(" pkg/provider/azure/data
echo
echo "All uses of filterCPUsAndMemory(...)"
rg -n --type=go "filterCPUsAndMemory\\(" pkg/provider/azure/data/compute-request.go pkg/provider/azure -SRepository: redhat-developer/mapt
Length of output: 1090
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Search Azure actions for ComputeSizes usage ==="
rg -n --type=go "ComputeRequest:\s*&?computeReq|ComputeSizes\s*:" pkg/provider/azure | head -n 200
echo
echo "=== Search for ComputeRequestArgs constructions in azure provider (top matches) ==="
rg -n --type=go "ComputeRequest:\s*.*ComputeRequestArgs|ComputeRequestArgs\s*\{" pkg/provider/azure | head -n 200Repository: redhat-developer/mapt
Length of output: 854
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Azure: direct references to ComputeSizes on ComputeRequest ==="
rg -n --type=go "ComputeRequest\\.ComputeSizes|computeReq\\.ComputeSizes|len\\(.*ComputeSizes\\)|ComputeSizes:\\s*\\[\\]string\\{" pkg/provider/azure | head -n 200
echo
echo "=== Azure: constructions of ComputeRequestArgs (struct literals) ==="
rg -n --type=go "ComputeRequestArgs\\s*\\{" pkg | head -n 200Repository: redhat-developer/mapt
Length of output: 933
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Find call sites of azure/modules/allocation.Allocation ==="
rg -n --type=go "allocation\\.Allocation\\(" pkg/provider/azure | head -n 200
echo
echo "=== Find where ComputeRequestArgs is built and passed into allocation ==="
rg -n --type=go "ComputeRequestArgs|ComputeRequest:|ComputeRequest\\s*:\\s*&?computeReq" pkg/provider/azure | head -n 200Repository: redhat-developer/mapt
Length of output: 1664
🏁 Script executed:
#!/bin/bash
set -euo pipefail
for f in \
pkg/provider/azure/action/linux/linux.go \
pkg/provider/azure/action/windows/windows.go \
pkg/provider/azure/action/rhel/rhel.go \
pkg/provider/azure/action/kind/kind.go \
pkg/provider/azure/action/aks/aks.go \
pkg/provider/azure/action/rhel-ai/rhelai.go
do
echo "===== $f (ComputeSizes/ComputeRequest usage) ====="
rg -n --type=go "ComputeSizes|ComputeRequest" "$f"
doneRepository: redhat-developer/mapt
Length of output: 1544
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== ComputeRequestArgs definition (ComputeSizes field) ==="
fd -t f "compute-request*.go" pkg/provider/api/compute-request || true
rg -n --type=go "type ComputeRequestArgs" pkg/provider/api/compute-request -S || true
# Print the file with the struct
file=$(rg -n --type=go "type ComputeRequestArgs" pkg/provider/api/compute-request -S --files-with-matches | head -n 1)
if [ -n "${file:-}" ]; then
echo "--- $file (first ~120 lines) ---"
sed -n '1,160p' "$file"
fi
echo
echo "=== Inspect azure/action/linux/linux.go around allocation + ComputeRequest ==="
sed -n '1,140p' pkg/provider/azure/action/linux/linux.go
echo
echo "=== Inspect azure/action/windows/windows.go around allocation + ComputeRequest ==="
sed -n '1,160p' pkg/provider/azure/action/windows/windows.go
echo
echo "=== Inspect azure/action/rhel/rhel.go around allocation + ComputeRequest ==="
sed -n '1,120p' pkg/provider/azure/action/rhel/rhel.go
echo
echo "=== Inspect azure/action/kind/kind.go around allocation + ComputeRequest ==="
sed -n '1,120p' pkg/provider/azure/action/kind/kind.goRepository: redhat-developer/mapt
Length of output: 17456
Scope issue: baseFeaturesSupported() now broadens NVMe/local-storage filtering to generic Azure auto-selection
pkg/provider/azure/data/compute-request.go → ComputeSelector.Select() → filterCPUsAndMemory() calls vm.baseFeaturesSupported(), which now requires vm.noLocalStorageAttached() (i.e., MaxResourceVolumeMB == 0 && NvmeDiskSizeInMiB == 0). That selector is used by pkg/provider/azure/modules/allocation/allocation.go and pkg/provider/azure/data/spot.go whenever ComputeRequest.ComputeSizes is empty, and those allocation paths are shared by Azure actions like linux/windows/rhel/kind—so NVMe-backed sizes are excluded beyond pkg/provider/azure/action/rhel-ai/rhelai.go (which only filters provided ComputeSizes via FilterNoLocalStorageSizes when they’re non-empty). [Also affects ~307-323]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/provider/azure/data/compute-request.go` around lines 215 - 221, The
current change made virtualMachine.baseFeaturesSupported() include
noLocalStorageAttached(), which unintentionally filters out NVMe/local-storage
sizes globally when ComputeRequest.ComputeSizes is empty; revert that by
removing the noLocalStorageAttached() check from
virtualMachine.baseFeaturesSupported() and instead apply
noLocalStorageAttached() only in the code paths that should exclude NVMe-backed
SKUs (e.g., when callers explicitly want no-local-storage): move the
noLocalStorageAttached() check into the selector/filter logic used by allocation
paths—update ComputeSelector.Select() / filterCPUsAndMemory() to NOT rely on
baseFeaturesSupported() for NVMe exclusion, and add explicit calls to
FilterNoLocalStorageSizes or direct noLocalStorageAttached() checks in
allocation.go and spot.go (and any callers that pass ComputeRequest.ComputeSizes
empty) so only those flows drop NVMe-backed sizes while leaving
baseFeaturesSupported() to cover generic features.
There was a problem hiding this comment.
emptyDiskSupported() has been inside baseFeaturesSupported() since the file was first created (commit c6b2de1) — filtering out local-storage VMs from auto-selection was always a global constraint, not RHEL AI-specific. The only change here is closing the NVMe loophole: L-series VMs report MaxResourceVolumeMB = 0 but expose storage under NvmeDiskSizeInMiB, so the old check missed them. No Azure action in the repo references L-series VMs, and mapt's auto-selection is designed around managed disks, so broadening the filter to catch NVMe local storage is consistent with the original intent.
There was a problem hiding this comment.
@rishupk, that's a clear and well-reasoned explanation — thank you for the context. You're right: the local-storage exclusion was always a global constraint on baseFeaturesSupported(), and this PR simply closes the NVMe loophole for L-series VMs that previously slipped through by reporting MaxResourceVolumeMB = 0. My concern doesn't apply here.
📝 Learnings were identified and not saved (knowledge base disabled). Enable
- Relax FilterNoLocalStorageSizes to only reject NVMe-only local storage (L-series). VMs with temp disks (MaxResourceVolumeMB > 0) like Standard_NC64as_T4_v3 are now allowed — temp disks are ephemeral scratch space that does not interfere with RHEL AI's OS disk. - Add ensureAzureEnvs() to map ARM_* env vars to AZURE_* before any Azure SDK call. The Azure SDK reads AZURE_TENANT_ID etc from the environment, but callers (Claudio, Pulumi) typically set ARM_* vars. Previously this mapping only happened in provider.Init(), which runs after FilterNoLocalStorageSizes — causing credential failures when --compute-sizes is used. - Centralize AZURE_SUBSCRIPTION_ID reads through SubscriptionID() helper that falls back to ARM_SUBSCRIPTION_ID. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Rishabh Kothari <rkothari@redhat.com>
0a49cee to
5a1f868
Compare
|
Superseded by #826 (combined PR per reviewer request). |
Azure L-series VMs report
MaxResourceVolumeMB = 0but track NVMe storage under a separate capability,NvmeDiskSizeInMiB, which the local storage filter never read.Standard_L8aos_v4was passing the filter and landing in RHEL AI spot selection. AddedNvmeDiskSizeInMiBparsing andFilterNoLocalStorageSizes()for the explicit--compute-sizespath.E2E testing with
--compute-sizesexposed two additional issues fixed here:NVMe filter too aggressive:
FilterNoLocalStorageSizesrejected any VM withMaxResourceVolumeMB > 0(temp disk), includingStandard_NC64as_T4_v3which works fine for RHEL AI. Relaxed to only reject NVMe-only local storage (L-series). Temp disks are ephemeral scratch space and don't interfere with RHEL AI's OS disk.Credential env var mapping: The Azure SDK reads
AZURE_TENANT_ID/AZURE_SUBSCRIPTION_ID/ etc from the environment, but callers typically setARM_*vars. TheARM_*→AZURE_*mapping only happened inprovider.Init(), which runs afterFilterNoLocalStorageSizes— causing credential failures on the--compute-sizespath. AddedensureAzureEnvs()before allNewDefaultAzureCredentialcalls, and centralized subscription ID reads through aSubscriptionID()helper.Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
Signed-off-by: Rishabh Kothari rkothari@redhat.com