
feat: adding support for amdama (supernova) gpus release 1.5 #8643

Closed
mipresmsft wants to merge 10 commits into main from mipres-updatema35d15

mipresmsft (Contributor) commented Jun 5, 2026

Updates node provisioning and e2e validation support for release 1.5 of AMD AMA (Supernova / MA35D) GPUs in the Linux CSE flow, primarily targeting Azure Linux v3. The new naming scheme should also be forward-compatible with future driver/software releases from AMD.

Changes:

  • Update the AMD AMA install logic to support the new file naming convention and the separated driver and firmware packages.
  • Add ShellSpec test code.

Copilot AI (Contributor) left a comment

Pull request overview

Updates Linux CSE + e2e coverage to support provisioning/validation for AMD AMA (Supernova / MA35D) GPUs release 1.5, primarily on Azure Linux v3, and adjusts e2e firewall rules accordingly.

Changes:

  • Updates Azure Linux MA35D provisioning logic to discover/install AMD AMA driver/firmware and derive matching “core” package coordinates.
  • Adds/updates MA35D e2e validation and simplifies the scenario’s system pool SKU usage (defaults).
  • Removes download.microsoft.com from the e2e firewall allowlist.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| parts/linux/cloud-init/artifacts/cse_config.sh | Reworks AMD AMA install flow (repoquery-based driver selection, firmware install, core package URL construction). |
| e2e/scenario_test.go | Tweaks AzureLinuxV3 MA35D scenario config (removes explicit system pool SKU override). |
| e2e/aks_model.go | Removes download.microsoft.com from Azure Firewall application rules used by e2e. |
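The driver selection step described for cse_config.sh can be illustrated with a minimal sketch. This is not the PR's code: the function name `select_latest_package` and the package names in the usage note are hypothetical, and the sketch assumes the candidate list (for example, the output of a repository query) arrives on stdin, one name-version-release string per line.

```shell
# Hypothetical helper, not taken from the PR: pick the newest candidate
# whose name starts with the given prefix. Candidates are read from stdin.
select_latest_package() {
    local prefix="$1"
    # Keep only lines that literally begin with "<prefix>-", then rely on
    # GNU sort's version ordering (-V) so that 1.10 sorts after 1.5.
    awk -v p="${prefix}-" 'index($0, p) == 1' | sort -V | tail -n 1
}
```

Given the illustrative candidates `amd-ama-driver-1.5.0-1`, `amd-ama-driver-1.10.0-1`, and `amd-ama-firmware-1.5.0-1`, calling `select_latest_package amd-ama-driver` would select the 1.10.0 driver entry. An empty result means no candidate matched, which a caller would need to treat as an install failure.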

Copilot AI review requested due to automatic review settings June 17, 2026 22:03
mipresmsft requested a review from runzhen as a code owner June 17, 2026 22:03

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.

Copilot AI review requested due to automatic review settings June 17, 2026 22:15

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.

djsly (Collaborator) commented Jun 18, 2026

AgentBaker Linux gate detective: CIS regression

Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=168488433
Failed job/task: build2204gen2containerd / Test, Scan, and Cleanup
Wiki signature: linux-vhd-prgate-cis-ubuntu2204-gen2-containerd-6131-logfiles

Detective summary: The build is partially succeeded because CIS scan comparison logged one regression: rule 6.1.3.1 (Ensure access to all logfiles has been configured) changed pass -> fail on Ubuntu 22.04 Gen2 containerd. The test and scan scripts exited 0 after uploading cis-report-l1/l2 and HTML reports, so this is a VHD/CIS compliance signal rather than an E2E failure.

Likely cause: Shared Ubuntu 22.04 VHD/CIS image-state drift or a generated logfile permission/access mismatch. This same CIS rule appeared earlier today on an unrelated PR in the ARM64 lane, so PR causality is currently unlikely.

Confidence: Medium-high. Evidence comes from build status, timeline/task issue, task log regression details, uploaded CIS report names, passing test-result preview, and recurrence across unrelated PRs/lanes.

Strongest alternative theory: This PR's AMD AMA CSE change somehow creates a non-compliant logfile in the Ubuntu VHD scan path. Less likely because the changed code is MA35D/Azure Linux AMD package install logic, while the failure is a generic Ubuntu 22.04 logfile-access CIS rule and recurred on an unrelated PR.

Recommended next action / owner: Linux VHD/CIS build owner should inspect the uploaded CIS reports for build 168488433 to identify the exact logfile path, owner/group, and mode, then decide whether to fix image generation permissions or update the CIS baseline for expected package drift.
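As a rough aid for the inspection recommended above, the following sketch lists log files whose permissions exceed 0640, printing mode, owner:group, and path. It is a simplification under stated assumptions: the function name `list_permissive_logfiles` is hypothetical, the 0640 threshold is an assumed stand-in for the benchmark's per-file rules, and it relies on GNU find (`-perm /mode` and `-printf`).

```shell
# Hypothetical helper: report regular files under a log directory that grant
# more access than 0640. Defaults to /var/log when no directory is given.
list_permissive_logfiles() {
    local log_dir="${1:-/var/log}"
    # -perm /0137 matches a file if ANY of these bits is set:
    # owner execute (0100), group write or execute (0030), any "other" bit (0007).
    # Output columns: octal mode, owner:group, path.
    find "$log_dir" -type f -perm /0137 -printf '%m %u:%g %p\n'
}
```

Running it against a mounted image or a VM built from the VHD would narrow down candidate files, but the uploaded CIS reports remain the authoritative record of what the scan actually flagged.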

Copilot AI review requested due to automatic review settings June 18, 2026 18:44

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Copilot AI review requested due to automatic review settings June 19, 2026 16:42
mipresmsft requested a review from xuexu6666 as a code owner June 19, 2026 16:42

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

mipresmsft closed this Jun 19, 2026
aks-node-assistant (Contributor) commented

AgentBaker Linux gate detective

Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=168752702
Failed job/stage/task: $(System.Collections.Hashtable.job)
First failing step/test: $(System.Collections.Hashtable.first)

RCA: The VHD image built, but post-build VHD content test failed inside the test VM while cloning AgentBaker: git fetch of refs/pull/8643/merge returned remote ref not found. This is a VHD content-test harness / PR merge-ref availability issue, not a Packer build failure. Secondary: Ubuntu 22.04 CIS 6.1.3.1 pass->fail matches repair item #38501652.

Confidence: High for the primary signature. Corroborated by timeline/status, focused failed logs, associated changes, and the flakiness wiki before publishing.

Strongest alternative: the PR changes broke MA35D validation. Less likely because the test aborts during AgentBaker checkout before content validation.

Recommended owner/action: VHD content-test/build harness owner: preserve source checkout or use a commit SHA/ref available from the test VM instead of relying on transient PR merge refs.
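One way to read the recommendation above is sketched below: the harness prefers a pinned commit SHA and only falls back to a PR ref when no SHA is supplied. The function name `choose_checkout_ref` is hypothetical, and the fallback to `refs/pull/<number>/head` rather than the merge ref is an assumption about which ref stays fetchable, not something the comment specifies.

```shell
# Hypothetical helper: decide which ref the test VM should fetch.
# $1 = commit SHA recorded at build time (may be empty), $2 = PR number.
choose_checkout_ref() {
    local commit_sha="$1" pr_number="$2"
    if [ -n "$commit_sha" ]; then
        # A pinned SHA does not depend on the PR's transient merge ref.
        printf '%s\n' "$commit_sha"
    else
        # Assumed fallback: the PR head ref instead of refs/pull/<n>/merge.
        printf 'refs/pull/%s/head\n' "$pr_number"
    fi
}
```

The result could then be passed to `git fetch origin <ref>` inside the test VM, assuming the remote permits fetching that commit by SHA.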

Wiki signature: $(System.Collections.Hashtable.sig) (source of truth)

aks-node-assistant (Contributor) commented

AgentBaker Linux gate detective

Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=168523974
Failed job/stage/task: $job
RCA: Known VHD content-test harness failure: post-build content test tries to fetch a transient refs/pull/*/merge ref from inside the test VM and GitHub returns remote ref not found before content validation.

Confidence: High. Corroborated by prior focused logs, repeated unrelated PRs, and the wiki source of truth.

Strongest alternative: the PR product change caused content-test failure; less likely because the test aborts while cloning AgentBaker before validation.

Recommended owner/action: E2E/build harness owner should follow the linked wiki signature and repair item if present; rerun after the infra/harness issue is cleared.

Wiki signature: $sig (source of truth)


3 participants