diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 35e2c7f49..610fca766 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -13,6 +13,7 @@ /plugin/skills/appinsights-instrumentation/ @JasonYeMSFT @RickWinter /plugin/skills/azure-ai/ @JasonYeMSFT @RickWinter /plugin/skills/azure-aigateway/ @azaslonov @RickWinter +/plugin/skills/azure-arc-servers/ @deankroker @ryperl_msft @rebro2-msft @jackier @RickWinter /plugin/skills/azure-cloud-migrate/ @saikoumudi @MadhuraBharadwaj-MSFT @RickWinter /plugin/skills/azure-compliance/ @saikoumudi @RickWinter /plugin/skills/azure-compute/ @alex-thompson @rakal-dyh @joybb @rmmue21 @RickWinter diff --git a/plugin/skills/azure-arc-servers/SKILL.md b/plugin/skills/azure-arc-servers/SKILL.md new file mode 100644 index 000000000..9e0216887 --- /dev/null +++ b/plugin/skills/azure-arc-servers/SKILL.md @@ -0,0 +1,87 @@ +--- +name: azure-arc-servers +description: "Azure Arc-enabled servers router. WHEN: Arc Server, Arc-enabled server, Connected Machine agent, azcmagent, hybrid server, on-prem server, onboard to Arc, project a server into Azure, connect machine to Azure, install Connected Machine agent, generate onboarding script, at-scale onboarding, Group Policy onboarding, Configuration Manager onboarding, Ansible onboarding, Service Principal for Arc, Arc Private Link Scope, Arc Gateway, proxy for Arc, agent connectivity, agent status, Connected / Disconnected / Expired / Error, agent upgrade, automatic upgrade, manual upgrade, Extended Security Updates (ESU) for Arc, Hotpatch on Arc, Pay-as-you-go Windows Server, Software Assurance benefits, Microsoft.HybridCompute, hybridcompute/machines. PREFER OVER azure-compute when the user is talking about non-Azure (on-prem, other-cloud, edge) servers being projected into Azure - not about creating an Azure VM." +license: MIT +metadata: + author: Microsoft + version: "0.0.0-placeholder" +--- + +# Azure Arc-Enabled Servers Skill + +Routes Azure Arc server (Connected Machine) requests to the right workflow. + +## When to use this skill + +Use this skill when the user is talking about **servers that live outside +of Azure compute** (on-prem, AWS / GCP, edge, customer datacenter) and +wants to **project them into Azure** so they can be managed with Azure +control plane tools (Policy, Update Manager, Monitor, Defender, etc.). + +The defining ARM resource type is `Microsoft.HybridCompute/machines`. +The defining agent is `azcmagent` (Azure Connected Machine agent). + +> **Disambiguate with `azure-compute`.** If the user wants to create a +> brand-new VM **inside Azure** (`Microsoft.Compute/virtualMachines`), +> route to `azure-compute`. Arc servers are **never** new VMs - the +> machine already exists. + +> **Disambiguate with `azure-local` / `arc-vmware` / `arc-scvmm`.** Those +> skills handle Arc-enabled infrastructure where Azure stamps **new** VMs +> on customer hardware through a Resource Bridge / Custom Location. This +> skill is for the simpler case where a machine already exists and just +> needs an agent installed. + +## Routing + +```text +Arc server intent? +├── Connect / onboard / install agent / generate script (one machine or many) +│ → arc-server-onboard +│ +├── Agent shows Disconnected / Expired / Error +│ Can't reach Azure / azcmagent connect fails / port blocked +│ → arc-server-troubleshoot +│ +├── Upgrade the agent / enable auto-upgrade +│ Buy / activate Extended Security Updates (ESU) +│ Enable Hotpatch / Pay-as-you-go Windows Server licensing +│ → arc-server-manage +│ +└── Unclear → ask which of the three above +``` + +**Routing rule.** Read the matched workflow file *before* any reference +file. The workflow file owns the step-by-step guidance; references are +looked up on demand from inside the workflow. + +## Workflows + +| Workflow | File | Use when | +|---|---|---| +| **Onboard** | [arc-server-onboard.md](workflows/arc-server-onboard/arc-server-onboard.md) | User wants to connect one or many existing servers to Azure Arc. Generates the install script and walks the connectivity / auth / management-service choices. | +| **Troubleshoot** | [arc-server-troubleshoot.md](workflows/arc-server-troubleshoot/arc-server-troubleshoot.md) | Agent status is not `Connected`, install failed, machine fell offline, or extensions won't deploy. | +| **Manage** | [arc-server-manage.md](workflows/arc-server-manage/arc-server-manage.md) | Day-2 operations: agent upgrades, ESU licensing, Hotpatch, Pay-as-you-go Windows Server, recommended policies, management services (Update Manager, Insights, Change Tracking, Defender, Machine Config). | + +## Skill-level references + +These references describe concepts that apply across all workflows. Load +them on demand if the user asks the underlying question. + +| Reference | Use when | +|---|---| +| [arc-vs-azure-vm.md](references/arc-vs-azure-vm.md) | User is confused about Arc vs Azure VM, asks "should I use Arc or just create a VM", or talks about Arc as if it were a VM service. | +| [arc-mcp-tools.md](references/arc-mcp-tools.md) | Lookup of Azure MCP and `az` CLI commands for `Microsoft.HybridCompute`. Read before invoking tools. | + +## Out of scope for this skill + +| Request | Route to | +|---|---| +| Create / size / price an Azure VM | `azure-compute` | +| Azure Local cluster, Arc VM on HCI | `azure-local` (proposed) | +| Arc-enabled VMware vCenter | `arc-vmware` (proposed) | +| Arc-enabled SCVMM | `arc-scvmm` (proposed) | +| Arc Appliance / Resource Bridge | `arc-resource-bridge` (proposed) | +| AWS / GCP multi-cloud connector | `arc-multicloud` (proposed) | +| Deploy an application onto an Arc server | `azure-prepare` (then this skill for Arc-specific prereqs) | +| Essential Machine Management subscription enrollment | `azure-compute` → `essential-machine-management` workflow | diff --git a/plugin/skills/azure-arc-servers/references/arc-mcp-tools.md b/plugin/skills/azure-arc-servers/references/arc-mcp-tools.md new file mode 100644 index 000000000..a20647565 --- /dev/null +++ b/plugin/skills/azure-arc-servers/references/arc-mcp-tools.md @@ -0,0 +1,147 @@ +# Azure MCP and CLI Tools for Arc Servers + +Lookup table for the tools workflows in this skill use. Read this before +invoking a tool you're not sure about. + +> **MCP first, CLI fallback.** Prefer Azure MCP server tools when +> available. They return structured data and respect the user's MCP +> auth. Fall back to `az` CLI when MCP coverage is missing. + +## Read-only MCP tools (safe to call without confirmation) + +| Tool | Purpose | +|---|---| +| `mcp__azure__subscription_list` | List subscriptions for the signed-in user. Always use to confirm which subscription owns the Arc machine. | +| `mcp__azure__group_list` | List resource groups in a subscription. | +| `mcp__azure__resource_list` | List resources of any type. Filter by `resourceType = "Microsoft.HybridCompute/machines"` to enumerate Arc servers in a sub or RG. | +| `mcp__azure__resource_show` | Show full ARM properties of an Arc machine. Use to inspect `status`, `agentVersion`, `lastStatusChange`, `errorDetails`. | +| `mcp__azure__role_assignment_list` | Verify the user / Service Principal has the right roles on the target RG (Azure Connected Machine Onboarding, Azure Connected Machine Resource Administrator). | +| `mcp__azure__resourcegraph_query` | Run KQL across many subs at once. The best way to scan a fleet. | + +### Useful Resource Graph (KQL) queries + +```kql +// All Arc machines and their connection status +resources +| where type =~ "microsoft.hybridcompute/machines" +| project id, name, resourceGroup, subscriptionId, location, + status = properties.status, + agentVersion = properties.agentVersion, + osName = properties.osName, + osVersion = properties.osVersion, + lastStatusChange = properties.lastStatusChange, + errorDetails = properties.errorDetails +``` + +```kql +// Arc machines with outdated agent (older than significant version) +resources +| where type =~ "microsoft.hybridcompute/machines" +| extend agentVersion = tostring(properties.agentVersion) +| where isnotempty(agentVersion) +| where agentVersion startswith "0." or agentVersion startswith "1.0" or agentVersion startswith "1.1" +| project id, name, resourceGroup, location, agentVersion, lastStatusChange = properties.lastStatusChange +``` + +```kql +// Arc machines that are Disconnected or Expired +resources +| where type =~ "microsoft.hybridcompute/machines" +| extend status = tostring(properties.status) +| where status in ("Disconnected", "Expired", "Error") +| project id, name, resourceGroup, status, lastStatusChange = properties.lastStatusChange, + errorDetails = properties.errorDetails +``` + +```kql +// Arc extensions per machine - useful when an extension fails to install +resources +| where type =~ "microsoft.hybridcompute/machines/extensions" +| project id, name, machineId = tostring(split(id, "/extensions/")[0]), + publisher = properties.publisher, + extensionType = properties.type, + version = properties.typeHandlerVersion, + provisioningState = properties.provisioningState +``` + +## Write / mutating tools (always confirm with the user first) + +| Tool | Purpose | +|---|---| +| `mcp__azure__resource_delete` | Delete the Arc machine resource. Only removes the Azure projection; the actual machine is untouched. | +| Custom REST calls via `mcp__azure__rest_call` (if available) | For uncovered control-plane ops: `licenseProfiles`, `gateways`, agent upgrade PATCH, etc. | + +## `az` CLI fallback - Arc-specific commands + +```bash +# List Arc machines in a subscription +az connectedmachine list --output table + +# Show full machine details +az connectedmachine show \ + --name \ + --resource-group + +# List installed extensions on an Arc machine +az connectedmachine extension list \ + --machine-name \ + --resource-group + +# Install an extension (e.g. AzureMonitorWindowsAgent) +az connectedmachine extension create \ + --name AzureMonitorWindowsAgent \ + --machine-name \ + --resource-group \ + --publisher Microsoft.Azure.Monitor \ + --type AzureMonitorWindowsAgent + +# Delete an Arc machine resource (does not touch the underlying machine) +az connectedmachine delete \ + --name \ + --resource-group \ + --yes +``` + +## On-machine commands (`azcmagent`) - run as admin on the machine + +```bash +# Show agent state, connection status, agent version, errors +azcmagent show + +# Run end-to-end network connectivity checks against required endpoints +azcmagent check + +# Show recent connectivity / heartbeat events +azcmagent logs + +# Connect (onboard). Normally generated by the portal-issued script. +azcmagent connect \ + --resource-group \ + --tenant-id \ + --location \ + --subscription-id \ + [--service-principal-id --service-principal-secret ] \ + [--private-link-scope ] \ + [--gateway-id ] \ + [--proxy-url http://:] + +# Disconnect from Azure (removes the Azure resource, leaves the agent installed) +azcmagent disconnect + +# Manual upgrade trigger +azcmagent upgrade +``` + +## ARM API endpoints (when only REST will do) + +| Resource | Endpoint | +|---|---| +| Machine | `/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.HybridCompute/machines/{name}` | +| Machine extension | `/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.HybridCompute/machines/{name}/extensions/{extName}` | +| License profile (ESU / PayGo) | `/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.HybridCompute/machines/{name}/licenseProfiles/default` | +| Private Link Scope | `/subscriptions/{sub}/providers/Microsoft.HybridCompute/privateLinkScopes` | +| Gateway | `/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.HybridCompute/gateways/{name}` | + +API versions move; consult `extension.config.json` in the +HybridComputeExtension repo for the version the production portal uses +in each environment. diff --git a/plugin/skills/azure-arc-servers/references/arc-vs-azure-vm.md b/plugin/skills/azure-arc-servers/references/arc-vs-azure-vm.md new file mode 100644 index 000000000..a139af6dd --- /dev/null +++ b/plugin/skills/azure-arc-servers/references/arc-vs-azure-vm.md @@ -0,0 +1,54 @@ +# Arc-Enabled Servers vs Azure VMs + +A common source of confusion. Read this if the user is talking about +"Arc servers" as if they were Azure VMs - or vice versa. + +## The mental model + +| | Azure VM | Arc-enabled server | +|---|---|---| +| Where does the machine live? | In Azure (a Microsoft datacenter) | Anywhere except Azure: on-prem, AWS, GCP, edge, a laptop | +| Who creates the OS? | Azure - you pick an image / size and Azure provisions it | You. The OS already exists. Arc only projects it. | +| ARM resource type | `Microsoft.Compute/virtualMachines` | `Microsoft.HybridCompute/machines` | +| Resource provider | `Microsoft.Compute` | `Microsoft.HybridCompute` | +| What you install | Nothing - the VM is the resource | `azcmagent` (Azure Connected Machine agent) | +| What gets billed for the VM itself | Compute hours + storage + networking | Nothing for the machine. You pay for the management services you opt into (Update Manager logs, Defender, ESU, etc.). | +| Lifecycle action "Stop" / "Delete" | Stops / deletes the actual VM | Stops / deletes the Azure resource. The underlying machine is unaffected. | + +## Why a customer would choose Arc + +- They already own the hardware (on-prem datacenter, edge device, factory floor). +- They run workloads in AWS or GCP and want one Azure-native control plane. +- They need Azure Policy, Defender for Servers, Update Manager, Machine + Configuration, Change Tracking, or Monitor Insights on machines that + cannot or should not move into Azure. +- They need to buy **Extended Security Updates** for Windows Server 2012 / + 2012 R2 - ESU is sold via an Arc license and applied to Arc machines. +- They want **Pay-as-you-go Windows Server licensing** on customer-owned + hardware, billed through Azure. + +## Why a customer would choose an Azure VM + +- They want Azure to host the OS. +- They want VM-native features like availability sets, scale sets, + Capacity Reservation Groups, ephemeral OS disks, GPU SKUs, etc. +- They don't have hardware of their own. + +## Common confused phrasings and what they actually want + +| What the user said | What they probably mean | Route to | +|---|---|---| +| "Create an Arc VM" | If they have a physical / non-Azure machine: Arc onboarding. If they mean a VM stamped onto Azure Local / VMware via Arc: a different skill (`azure-local`, `arc-vmware`). | Clarify, then this skill or `azure-local` / `arc-vmware`. | +| "Spin up an Arc server" | Same as above - clarify. | Clarify first. | +| "I want to manage my AWS EC2 instance from Azure" | Either install the Connected Machine agent on the EC2 instance (this skill) or use the AWS connector (multi-cloud sync). | This skill for single-instance; `arc-multicloud` (proposed) for bulk sync. | +| "Connect my on-prem servers to Azure" | This skill - onboard. | [arc-server-onboard](../workflows/arc-server-onboard/arc-server-onboard.md) | +| "My Arc server is offline" | Troubleshoot agent connectivity. | [arc-server-troubleshoot](../workflows/arc-server-troubleshoot/arc-server-troubleshoot.md) | +| "Patch my Arc servers" | Azure Update Manager, scoped to Arc. | [arc-server-manage](../workflows/arc-server-manage/arc-server-manage.md) | + +## When the answer is "use both" + +If the user wants a hybrid fleet (some Azure VMs, some on-prem), use +**Essential Machine Management** to enroll the subscription so everything +gets the same baseline of Insights / Update Manager / Change Tracking / +Machine Config. EMM is handled in the upstream `azure-compute` skill's +`essential-machine-management` workflow. diff --git a/plugin/skills/azure-arc-servers/version.json b/plugin/skills/azure-arc-servers/version.json new file mode 100644 index 000000000..7a6cae20c --- /dev/null +++ b/plugin/skills/azure-arc-servers/version.json @@ -0,0 +1,6 @@ +{ + "version": "1.0", + "pathFilters": [ + "." + ] +} diff --git a/plugin/skills/azure-arc-servers/workflows/arc-server-manage/arc-server-manage.md b/plugin/skills/azure-arc-servers/workflows/arc-server-manage/arc-server-manage.md new file mode 100644 index 000000000..83b4d3c67 --- /dev/null +++ b/plugin/skills/azure-arc-servers/workflows/arc-server-manage/arc-server-manage.md @@ -0,0 +1,162 @@ +# Arc Server Day-2 Management + +Routes day-2 operations for already-onboarded Arc servers: agent +lifecycle, licensing (ESU / PayGo), Hotpatch, recommended policies, +management services. + +## When to use + +| Signal | Yes | +|---|---| +| "How do I upgrade the Arc agent?" | Yes | +| "Enable auto-upgrade for Arc agents" | Yes | +| "Buy / activate ESU for Windows Server 2012" | Yes | +| "Turn on Hotpatch for Windows Server 2025" | Yes | +| "Use Pay-as-you-go for Windows Server on Arc" | Yes | +| "Apply Software Assurance benefits on Arc" | Yes | +| "Install Update Manager / Defender / Insights on my Arc fleet" | Yes | +| "Assign the recommended Arc policies" | Yes | + +If the user is **onboarding**: route to +[arc-server-onboard](../arc-server-onboard/arc-server-onboard.md). + +If the agent is **broken**: route to +[arc-server-troubleshoot](../arc-server-troubleshoot/arc-server-troubleshoot.md). + +## Routing + +```text +Day-2 management intent? +├── Agent upgrade / auto-upgrade → references/agent-upgrade.md +├── Extended Security Updates (ESU) → references/extended-security-updates.md +├── Hotpatch (Windows Server 2025) → references/hotpatch.md +├── Pay-as-you-go Windows Server licensing → references/pay-as-you-go.md +├── Install management services → "Management services" below +├── Assign recommended policies → "Recommended policies" below +└── Connect via SSH / RDP (Hybrid Conn) → "Connect to an Arc server" below +``` + +## Management services + +Source enum: `ManagementServiceKey` in +`Client/React/Views/ArcServers/Enums.d.ts`: + +```ts +export const enum ManagementServiceKey { + changeTrackingAndInventory = "changeTrackingAndInventory", + azurePolicyAndMachineConfiguration = "azurePolicyAndMachineConfiguration", + azureMonitorInsights = "azureMonitorInsights", + updateManagement = "updateManagement", + microsoftDefenderForCloud = "microsoftDefenderForCloud", + hotpatch = "hotpatch", +} +``` + +| Service | Extension to install | What it gives you | +|---|---|---| +| **Update Management** | `AzureMonitorWindowsAgent` / `AzureMonitorLinuxAgent` + Azure Update Manager (no extension; capability ships with the agent on recent versions) | OS patching schedules, periodic assessment, pre-/post-scripts. | +| **Insights** | `AzureMonitorWindowsAgent` + Data Collection Rule | VM performance metrics + map view in Azure Monitor. | +| **Change Tracking & Inventory** | `ChangeTracking-Windows` / `ChangeTracking-Linux` | File / service / registry change history. | +| **Machine Configuration** | `ConfigurationforWindows` / `ConfigurationforLinux` (auto-installed by Policy) | Apply DSC-style configurations and audit compliance. | +| **Microsoft Defender for Cloud** | `MDE.Windows` / `MDE.Linux` (via Defender plan enablement) | EDR, vulnerability mgmt, threat detection. | +| **Hotpatch** | License profile flag + AUM schedule | See [references/hotpatch.md](references/hotpatch.md). | + +Two ways to install at scale: + +1. **Bake into the onboarding script** - the portal does this when the + user picks "Management services to install" in the wizard. +2. **Azure Policy initiative** - apply "Configure to install ..." + policies. Self-heal on new machines. + +For an existing fleet, use Policy. Source: the recommended-policies blade +(`AssignRecommendedPolicies.ReactView.tsx`) wires up: + +- `(72650e9f-...)` Configure Insights via DCR (Windows) +- `(5fe81c49-...)` Configure VM Insights for Windows +- `(4221adbc-...)` Deploy MDE for Arc Windows +- `(5752e6d6-...)` Configure DCR association +- `(1417908b-...)` Deploy MDE for Linux +- `(fc9b3da7-...)` / `(fad40cac-...)` / `(a8f3e6a6-...)` Linux equivalents +- `(630c64f9-...)` MDE for Linux + +These IDs and the per-feature variants are gated by the +`machineconfigurationatscale` feature flag. + +## Recommended policies + +The portal exposes one click: **Assign Recommended Policies** on the Arc +server Overview. It applies a curated set of policies that: + +- Install the AMA / DCR for Insights +- Install MDE for Defender for Cloud +- Enable periodic update assessment + +CLI equivalent (rough): + +```bash +az policy assignment create \ + --name "arc-recommended-windows" \ + --policy-set-definition "/providers/Microsoft.Authorization/policySetDefinitions/" \ + --scope "/subscriptions//resourceGroups/" \ + --mi-system-assigned \ + --location +``` + +Then run a remediation task to apply to existing machines: + +```bash +az policy remediation create \ + --name "arc-recommended-remediation" \ + --policy-assignment "arc-recommended-windows" \ + --resource-group +``` + +## Connect to an Arc server (SSH via Hybrid Connectivity) + +Arc supports tunneled SSH through the Hybrid Connectivity endpoint — +no public IP required on the machine. + +```bash +# Enable the Hybrid Connectivity endpoint on the machine (one-time per machine) +az connectedmachine endpoint create \ + --resource-group \ + --machine-name \ + --endpoint-name default + +# SSH to a Linux Arc server +az ssh arc \ + --resource-group \ + --name \ + --local-user +``` + +For RDP into a Windows Arc server, use the portal **Connect** blade on +the Arc machine resource — it sets up the tunnel and hands off to the +local RDP client. A verified CLI flow for RDP-via-Arc is not documented +here; do not improvise one with `run-command`, which is unrelated to +interactive remoting. + +## Choosing a management licensing model + +For Windows Server on Arc, three license models exist - they are +mutually exclusive on a given machine: + +| Model | When to use | Reference | +|---|---|---| +| **Pay-as-you-go (PayGo)** | Customer doesn't have Windows Server licenses; wants to pay monthly through Azure billing. | [references/pay-as-you-go.md](references/pay-as-you-go.md) | +| **Software Assurance benefits** | Customer has SA on their existing Windows Server licenses; wants free Arc benefits (ESU, Hotpatch in some configurations). | Set on the machine via `licenseProfiles/default`. | +| **Extended Security Updates** | Customer is running Windows Server 2012 / 2012 R2 past EOL and needs continued security patches. | [references/extended-security-updates.md](references/extended-security-updates.md) | + +These are configured on the `Microsoft.HybridCompute/machines/licenseProfiles/default` +subresource. The portal exposes them via the Overview blade and the +respective `Shared/PayGo`, `Shared/Licensing`, `Shared/Hotpatch`, +`Shared/ESU` components. + +## Routing back / handoff + +| Situation | Route to | +|---|---| +| Need to onboard more machines | [arc-server-onboard](../arc-server-onboard/arc-server-onboard.md) | +| Agent fell offline | [arc-server-troubleshoot](../arc-server-troubleshoot/arc-server-troubleshoot.md) | +| Wants to enroll subscription for Essential Machine Management | `azure-compute` skill > `essential-machine-management` workflow | +| Specific extension misbehaving | That extension's own troubleshooting (e.g. Defender, AMA) | diff --git a/plugin/skills/azure-arc-servers/workflows/arc-server-manage/references/agent-upgrade.md b/plugin/skills/azure-arc-servers/workflows/arc-server-manage/references/agent-upgrade.md new file mode 100644 index 000000000..52dd460f1 --- /dev/null +++ b/plugin/skills/azure-arc-servers/workflows/arc-server-manage/references/agent-upgrade.md @@ -0,0 +1,145 @@ +# Arc Connected Machine Agent - Upgrade + +The `azcmagent` ships ~monthly. Newer agents enable newer extensions +(Insights, Hotpatch, etc.). Two upgrade modes: **automatic** (recommended) +and **manual**. + +## Automatic upgrade (default on 1.41+) + +When the user onboards on agent 1.41 or newer, auto-upgrade is **on by +default**. The agent self-updates from the published feed on a rolling +basis (Microsoft staggers the rollout to limit blast radius). + +The portal exposes a per-machine toggle on the +**Properties** blade (file: `ArcServerProperties.ReactView.tsx`) and a +fleet-level toggle on the **Manage agent auto-upgrade** blade +(`AgentAutoUpgrade/`). Feature flag: `agentautoupgrade`. + +Toggle states live on `properties.agentConfiguration.extensionsAllowList` +and `properties.agentUpgrade.enableAutomaticUpgrade`. The exact API +shape is in `Client/React/Views/ArcServers/Hooks.ts`. + +### Source enum + +`AutoUpgradeAction` in `Client/React/Views/ArcServers/Enums.d.ts`: + +```ts +export const enum AutoUpgradeAction { + Enable = "Enable", + Disable = "Disable", +} +``` + +### CLI + +```bash +# Enable automatic agent upgrade (PATCH sets the desired state) +az resource patch \ + --resource-type "Microsoft.HybridCompute/machines" \ + --resource-group \ + --name \ + --api-version 2025-02-19-preview \ + --properties '{"agentUpgrade":{"enableAutomaticUpgrade":true}}' + +# Read the current auto-upgrade setting +az resource show \ + --resource-type "Microsoft.HybridCompute/machines" \ + --resource-group \ + --name \ + --api-version 2025-02-19-preview \ + --query "properties.agentUpgrade.enableAutomaticUpgrade" +``` + +For a fleet, prefer Azure Policy: there is a built-in initiative +"Configure Arc Agent for auto-upgrade" that scopes to a subscription or +management group. + +## Manual upgrade + +For machines that are **not** on auto-upgrade, or for an emergency +out-of-band patch: + +### From the machine + +```bash +# Windows / Linux +azcmagent upgrade +``` + +`azcmagent upgrade` downloads and installs the latest agent. Requires +outbound 443 to the agent download endpoint. + +### From the portal + +The Overview status bar surfaces an "Outdated agent" prompt when the +machine's `agentVersion` is older than the significant version line +defined in `ConnectedWindowsMachineAgentSignificantVersion` / +`ConnectedLinuxMachineAgentSignificantVersion` +(`Client/React/Views/ArcServers/Enums.d.ts`). + +The current significant versions in source (drift over time, verify +with the running portal): + +```ts +export const enum ConnectedWindowsMachineAgentSignificantVersion { + ExtensionsSupportAdded = "0.7.0.0", + FqdnSupportAdded = "0.8.0.0", + InsightsSupportAdded = "0.9.0.0", + VmUuIdSupportAdded = "0.7.20079.002", +} +``` + +These mark thresholds where capabilities switch on. Below +`InsightsSupportAdded`, the portal does not offer Insights enablement. + +### From a deployment tool (at scale) + +Wrap `azcmagent upgrade` in the same tool you used for onboarding +(Ansible, ConfigMgr, GPO computer-startup script). Run during a known +maintenance window - the agent restarts during upgrade. + +```yaml +# Ansible snippet +- name: Upgrade Arc agent + shell: azcmagent upgrade + register: upgrade_result + failed_when: upgrade_result.rc != 0 and ('already up to date' not in upgrade_result.stdout) +``` + +The `failed_when` clause treats "already up to date" as success so a +nightly run is idempotent. + +## When upgrade fails + +| Symptom | Cause | Fix | +|---|---|---| +| `azcmagent upgrade: 403` | Outbound 443 blocked or proxy needs new allowlist | Add the download endpoint to the proxy allowlist. | +| `azcmagent upgrade: signature verification failed` | Time skew | Fix NTP, retry. | +| Agent upgrades, then service won't start | Disk full or AV quarantine | Free disk, restore quarantined files, reinstall. | +| Agent upgrades but extensions break | Extension version incompatible with new agent | Update extension to `--auto-upgrade-minor-version true` model so it auto-bumps. | + +For systemic upgrade failures across a fleet, look at the upstream +deployment tool's logs, not Arc - Arc's job is just to expose the new +agent feed. + +## Special case: very old agents (< 1.0) + +Agents older than ~1.0 don't have `azcmagent upgrade` at all. For these +the only option is: + +1. Uninstall the agent (`AzureConnectedMachineAgent.msi /uninstall` + or `dpkg -r azcmagent` / `rpm -e azcmagent`). +2. Reinstall the latest from the agent download URL. +3. Run `azcmagent connect` with the same RG / sub / region - the + existing Arc resource will be reattached (no duplicate). + +For very large legacy fleets, this is best driven by a one-time +deployment-tool campaign. + +## Telemetry + +When the portal observes an outdated agent, it traces the event with +action `ShowOutdatedAgentStatusBar` (see +`ArcServerOverviewStatusBar.tsx`). That tells you which customers have +been notified - useful when triaging "I never knew my agent was +outdated". diff --git a/plugin/skills/azure-arc-servers/workflows/arc-server-manage/references/extended-security-updates.md b/plugin/skills/azure-arc-servers/workflows/arc-server-manage/references/extended-security-updates.md new file mode 100644 index 000000000..7a79c5ad9 --- /dev/null +++ b/plugin/skills/azure-arc-servers/workflows/arc-server-manage/references/extended-security-updates.md @@ -0,0 +1,134 @@ +# Extended Security Updates (ESU) for Arc + +Customers running Windows Server 2012 / 2012 R2 past end-of-life need +**Extended Security Updates** to keep getting security patches. ESU is +sold through Azure as a **license resource**, applied to Arc machines, +and consumed on the actual on-prem (or other-cloud) Windows Server. + +This is the primary monetization path Microsoft offers for keeping +EOL Windows Server compliant. It's also the most common reason a +customer onboards a server to Arc in the first place. + +> Feature flags: `enableextendedsecurityupdates`, `esushowcost`, +> `esustaticcostcontrol`, `esuyearoneinvoiceid` (see +> `featureFlags.md`). + +## Mental model + +```text +ESU License (ARM resource) + ↓ assigned to +Arc machine (Microsoft.HybridCompute/machines) + ↓ activates +Windows Update channel for ESU patches on the underlying OS +``` + +Two-step purchase + apply: + +1. **Create the ESU license resource** at the subscription or RG level. + This is the billable artifact. Type: + `Microsoft.HybridCompute/licenses`. +2. **Link the license** to the Arc machines that should consume it. This + is on the machine's + `licenseProfiles/default.properties.esuProfile.assignedLicense`. + +A single license covers a fixed number of cores. The portal shows the +estimated cost ahead of purchase when `esushowcost` is on. + +## Eligibility + +| Condition | Required | +|---|---| +| Machine is `Microsoft.HybridCompute/machines` (Arc) | Yes | +| Machine is `Connected` | Yes | +| OS is Windows Server 2012 or 2012 R2 | Yes (ESU is sold per-OS-line) | +| Subscription supports ESU | Confirmed per environment in `featureFlags.md` (`enableextendedsecurityupdates`) | +| Customer has paid for the right number of cores | Yes - over-subscribed licenses are flagged in the Overview status bar | + +The portal's `meetsEsuPrerequisites` helper (in `Shared/ESU/`) gates +the ESU UI. If the user is on an unsupported OS (e.g. 2016), tell them +ESU isn't applicable - they're on a supported OS already. + +## Buying flow + +The portal walks: subscription -> RG -> license name -> edition +(Standard / Datacenter) -> core count -> year (Year 1 / Year 2 / Year 3) +-> invoice ID (Year 1 only, when `esuyearoneinvoiceid` is on). + +CLI equivalent (rough): + +```bash +az resource create \ + --resource-type "Microsoft.HybridCompute/licenses" \ + --resource-group \ + --name "esu-2012r2-prod" \ + --location \ + --api-version 2025-02-19-preview \ + --properties '{ + "licenseType": "ESU", + "licenseDetails": { + "state": "Activated", + "target": "Windows Server 2012 R2", + "edition": "Datacenter", + "type": "pCore", + "processors": 16 + } + }' +``` + +## Activation flow + +After purchase, link the license to the machines: + +```bash +az resource create \ + --resource-type "Microsoft.HybridCompute/machines/licenseProfiles" \ + --resource-group \ + --name "default" \ + --parent "machines/" \ + --api-version 2025-02-19-preview \ + --properties '{ + "esuProfile": { + "assignedLicense": "/subscriptions//resourceGroups//providers/Microsoft.HybridCompute/licenses/esu-2012r2-prod" + } + }' +``` + +After ARM accepts the assignment, the machine pulls ESU patches on its +next Windows Update cycle. + +## Deactivation / state + +License `state` can be `Activated`, `Deactivated`. Switching to +`Deactivated` stops billing but **also stops ESU patches**. The portal +surfaces a banner when a machine is associated with a deactivated +license (`esuLicenseDeactivated` flag in +`ArcServerOverviewStatusBar.tsx`). + +## Cost handling + +| Flag | Meaning | +|---|---| +| `esushowcost` | Show estimated cost in the buying wizard (Public clouds). | +| `esustaticcostcontrol` | Use a static per-environment cost (Fairfax / US Sec / US Nat) instead of the live cost API. | + +Always show the user the cost before they purchase. ESU is one of the +more expensive line items in Azure billing per core. + +## Common ESU issues + +| Symptom | Cause | Fix | +|---|---|---| +| License purchased but machine still says "not eligible" | License not yet linked to the machine | Link via `licenseProfiles/default`. | +| Machine eligible but no patches arriving | `Connected` agent but Windows Update not pulling | On the machine: `wuauclt /detectnow`; check `%ProgramData%\AzureConnectedMachineAgent\Log\himds.log` for ESU activation token failures. | +| Over-subscribed (more cores assigned than purchased) | Customer added machines without expanding license | Expand the license `processors` or remove machines. The portal flashes a warning in the status bar. | +| License resource shows `Provisioning Failed` | Most often a region / RP issue | Re-create in a region known to support `Microsoft.HybridCompute/licenses` for the user's subscription. | + +## Routing back + +| Situation | Route to | +|---|---| +| Customer wants ESU on a Windows VM in Azure (not Arc) | Different mechanism entirely - Azure VMs get ESU via Azure Update Manager configuration, no `licenses` resource. Route to `azure-compute`. | +| Customer wants Pay-as-you-go instead | [pay-as-you-go.md](pay-as-you-go.md) | +| Customer wants ESU on 2008 R2 | EOL'd for ESU in mid-2023; tell them ESU 2008 R2 sales are closed. | +| Eligibility check needed across a fleet | KQL: filter `osName` for `Windows Server 2012*` and `properties.licenseProfile.esuProfile.assignedLicense` is null. | diff --git a/plugin/skills/azure-arc-servers/workflows/arc-server-manage/references/hotpatch.md b/plugin/skills/azure-arc-servers/workflows/arc-server-manage/references/hotpatch.md new file mode 100644 index 000000000..655937d9b --- /dev/null +++ b/plugin/skills/azure-arc-servers/workflows/arc-server-manage/references/hotpatch.md @@ -0,0 +1,135 @@ +# Hotpatch on Arc-enabled Windows Server + +Hotpatch lets Windows Server install security updates **without +rebooting**, dramatically reducing patch-window pain for production +fleets. On Arc, it's available for Windows Server 2025 Datacenter: +Azure Edition (and for some 2022 SKUs on Azure Local). + +> Feature flags: `arcserverhotpatching`, `arcserverenablehotpatch`, +> `hotpatchatscale`, `hciHotpatching`, `vmwareHotpatching`, +> `scvmmHotpatching` (per-domain). See `featureFlags.md`. + +## Mental model + +Hotpatch is **not** a separate product. It's a property of the OS that +gets toggled on, plus an opt-in for the **Hotpatch update channel** in +Azure Update Manager. The machine needs to be: + +1. Arc-Connected. +2. Running a supported Hotpatch-eligible Windows Server SKU. +3. Have an ESU or Software Assurance / PayGo license attached (Hotpatch + is a paid capability). +4. Configured to use Hotpatch via the license profile / management + plane. + +Source enum: `HotpatchEnablementStatus` in +`Client/React/Views/ArcServers/Enums.d.ts`: + +```ts +export const enum HotpatchEnablementStatus { + Unknown = "Unknown", + PendingEvaluation = "PendingEvaluation", + Enabled = "Enabled", + Disabled = "Disabled", + ActionRequired = "ActionRequired", +} +``` + +## Eligibility + +The portal's `meetsHotpatchPrerequisites` helper (in `Shared/Hotpatch/`) +gates the UI. Reproduce its checks before recommending Hotpatch: + +| Check | Required | +|---|---| +| Agent status `Connected` | Yes | +| OS is Windows Server 2025 Datacenter: Azure Edition (or supported Azure Local / VMware / SCVMM variant) | Yes | +| Agent version >= the Hotpatch threshold (current; see source) | Yes | +| Machine has a paid license model (PayGo, SA benefits, ESU) | Yes | +| Subscription / region supports Hotpatch enrollment | Yes - check `getHotpatchSubscriptionStatus` | + +If any check fails, the portal returns `ActionRequired` and the status +bar guides the user to remediate. Mirror that in your explanation. + +## Enable flow + +1. **Enable on the machine** - sets a flag on + `licenseProfiles/default.properties.softwareAssurance.softwareAssuranceCustomer = true` + (or the equivalent Hotpatch flag depending on license type). +2. **Switch the update channel** - the machine's Azure Update Manager + schedule needs to use the Hotpatch maintenance configuration. +3. **Apply patches** - next AUM run installs Hotpatches; only Latest + Cumulative Update (LCU) months trigger reboot. + +### CLI sketch + +```bash +# Step 1: flip license profile flag (PATCH) +az resource patch \ + --resource-type "Microsoft.HybridCompute/machines/licenseProfiles" \ + --resource-group \ + --name "default" \ + --parent "machines/" \ + --api-version 2025-02-19-preview \ + --properties '{ "softwareAssurance": { "softwareAssuranceCustomer": true } }' + +# Step 2: associate the machine with a Hotpatch-enabled AUM maintenance config +az maintenance assignment create \ + --resource-id "/subscriptions//resourceGroups//providers/Microsoft.HybridCompute/machines/" \ + --maintenance-configuration-id "/subscriptions//resourceGroups//providers/Microsoft.Maintenance/maintenanceConfigurations/hotpatch-monthly" +``` + +The exact REST shapes drift; trust the portal's network trace if you +need to verify. + +## Status interpretation + +| `HotpatchEnablementStatus` | Tell the user | +|---|---| +| `Unknown` | Status hasn't been evaluated yet. Wait a few minutes after enabling. | +| `PendingEvaluation` | Machine is being checked. Refresh in ~5 minutes. | +| `Enabled` | Patches install without reboot. Look for "Hotpatch Edition" in `winver`. | +| `Disabled` | Hotpatch is off. Show the eligibility checklist. | +| `ActionRequired` | Eligible but blocked. Common: license not attached, agent too old, AUM schedule not on Hotpatch channel. | + +## At-scale Hotpatch enablement + +The `hotpatchatscale` feature flag enables a fleet-level enablement +blade (Manage services at-scale). For at-scale enablement without the +portal: + +1. Filter the fleet to Hotpatch-eligible machines via Resource Graph: + ```kql + resources + | where type =~ "microsoft.hybridcompute/machines" + | extend osName = tostring(properties.osName) + | where osName has "Windows Server 2025" + | project id, name, resourceGroup, subscriptionId + ``` +2. Apply the license profile PATCH and AUM assignment via the same + automation tool used for onboarding (Ansible / scripted CLI). + +## Hotpatch on Azure Local / VMware / SCVMM + +Each Arc-stamped VM domain has its own Hotpatch flag (`hciHotpatching`, +`vmwareHotpatching`, `scvmmHotpatching`). The flow is conceptually +identical (license profile + AUM channel), but those scenarios belong +in the respective skills (`azure-local`, `arc-vmware`, `arc-scvmm`) +because the surrounding eligibility checks differ. + +## Common Hotpatch issues + +| Symptom | Cause | Fix | +|---|---|---| +| Status stays `PendingEvaluation` for > 1 hour | Agent isn't reporting back evaluation result | `azcmagent show` for status; check Defender / EDR isn't blocking himds. | +| `Enabled` but reboot still required | Month's update is an LCU month, not a Hotpatch month | Expected - LCU months reboot. | +| Hotpatch enabled but billing went up unexpectedly | Hotpatch implies a paid license; SA / PayGo / ESU cost increased | Confirm the customer agreed to the license model. | + +## Routing back + +| Situation | Route to | +|---|---| +| Customer wants Hotpatch on Azure Local | (proposed) `azure-local` skill | +| Customer wants Hotpatch on VMware / SCVMM VMs | (proposed) `arc-vmware` / `arc-scvmm` | +| Customer doesn't have any of PayGo / SA / ESU | They need to attach one - see [pay-as-you-go.md](pay-as-you-go.md) or [extended-security-updates.md](extended-security-updates.md) first. | +| Hotpatch enabled but a specific patch failed | That's an AUM / Windows Update troubleshooting issue, not Arc; route to the AUM docs. | diff --git a/plugin/skills/azure-arc-servers/workflows/arc-server-manage/references/pay-as-you-go.md b/plugin/skills/azure-arc-servers/workflows/arc-server-manage/references/pay-as-you-go.md new file mode 100644 index 000000000..d60e12ff8 --- /dev/null +++ b/plugin/skills/azure-arc-servers/workflows/arc-server-manage/references/pay-as-you-go.md @@ -0,0 +1,155 @@ +# Pay-as-you-go Windows Server on Arc + +Lets customers pay for Windows Server licensing **through their Azure +bill**, even when the Windows Server is running on customer-owned +hardware. Eliminates the need to buy retail Windows Server licenses up +front. + +> Feature flags: `arcserverpaygo` (Arc Servers), `hcipaygo` (Azure +> Local), `vmwarepaygo` (VMware), `scvmmpaygo` (SCVMM). All on in +> Public clouds (MPAC / PROD / PREV / RC) per `featureFlags.md`. + +## Mental model + +Pay-as-you-go (PayGo) is **alternative to Software Assurance** for +licensing Windows Server on Arc-managed machines. The customer doesn't +buy retail Windows Server licenses; they activate PayGo on the Arc +machine, and Azure bills them monthly per core. + +The activation is recorded on +`Microsoft.HybridCompute/machines/licenseProfiles/default.properties.productProfile` +(Windows Server PayGo) or +`...softwareAssurance` (SA benefits flag, equivalent at the protocol +level for some capabilities). + +```text +On-prem Windows Server (no retail license) + + +Arc agent (Connected) + + +PayGo activated on licenseProfile + = +Azure billing handles WS licensing per month +``` + +## Eligibility + +The portal's `arePayGoPrerequisitesMet` helper (`Shared/PayGo/`) gates +the UI. Reproduce its checks: + +| Check | Required | +|---|---| +| Agent status `Connected` | Yes | +| OS is Windows Server (2019, 2022, 2025) | Yes | +| Machine is **not** an Azure VM (Azure VMs use Hybrid Benefit, not PayGo) | Yes | +| Subscription supports PayGo (region availability) | Yes | +| User has the right role to PATCH license profile | Yes (`Azure Connected Machine Resource Administrator` or higher) | + +Linux is not in scope - PayGo is a Windows Server-only feature. + +## Enable flow + +```bash +# PATCH the license profile to set Windows Server PayGo +az resource patch \ + --resource-type "Microsoft.HybridCompute/machines/licenseProfiles" \ + --resource-group \ + --name "default" \ + --parent "machines/" \ + --api-version 2025-02-19-preview \ + --properties '{ + "productProfile": { + "subscriptionStatus": "Enabled", + "productType": "WindowsServer", + "productFeatures": [ + { "name": "WindowsServerLicense", "subscriptionStatus": "Enabled" } + ] + } + }' +``` + +The portal flips the same property; source patterns in +`ArcServerOverviewStatusBar.tsx` and `Shared/PayGo/`. + +## Status + +`SubscriptionStatus` enum (`Shared/Licensing/Enums`): + +| Value | Meaning | +|---|---| +| `Unknown` | Not yet evaluated. | +| `Enabling` | Activation in progress (a few minutes). | +| `Enabled` | Billing is active. | +| `Disabling` | Deactivation in progress. | +| `Disabled` | PayGo is off. Customer is using their own license / SA / nothing. | +| `Failed` | Activation failed - read the error in `licenseProfile.properties.licenseStatus`. | + +## Disable + +To stop PayGo billing (customer brought their own license, or +decommissioned the machine): + +```bash +az resource patch \ + --resource-type "Microsoft.HybridCompute/machines/licenseProfiles" \ + --resource-group \ + --name "default" \ + --parent "machines/" \ + --api-version 2025-02-19-preview \ + --properties '{ "productProfile": { "subscriptionStatus": "Disabled" } }' +``` + +Disabling stops Azure billing for Windows Server. The customer is +then responsible for having a separate valid license on the machine +(retail, SA, etc.) - or they're out of compliance. + +## At-scale enablement + +For a fleet, the portal exposes a **Manage services at-scale** flow. +Behind the scenes it iterates the PATCH above. CLI / scripted +equivalent: + +```bash +# Resource Graph: find eligible Windows Server Arc machines without PayGo +resources +| where type =~ "microsoft.hybridcompute/machines" +| extend osName = tostring(properties.osName) +| where osName has "Windows Server" +| extend licenseStatus = tostring(properties.licenseProfile.productProfile.subscriptionStatus) +| where licenseStatus != "Enabled" +| project id, name, resourceGroup, osName, licenseStatus + +# Iterate and PATCH each +``` + +## PayGo vs SA Benefits vs ESU (decision aid) + +| Scenario | Use | +|---|---| +| Windows Server with SA in your EA | **SA Benefits** (free for Arc capabilities) | +| Windows Server, no SA, want OPEX licensing through Azure | **PayGo** | +| Windows Server 2012 / 2012 R2 past EOL | **ESU** (see [extended-security-updates.md](extended-security-updates.md)) | +| Windows Server with retail license, no Azure integration desired | None - leave the license profile alone | + +These are mutually exclusive on a given machine for the Windows Server +licensing dimension. ESU is orthogonal to current-version licensing - +you can have ESU on a 2012 machine without PayGo. + +## Common PayGo issues + +| Symptom | Cause | Fix | +|---|---|---| +| Status stuck in `Enabling` for > 30 min | Agent not reporting evaluation back | `azcmagent show` on the machine; check Defender / EDR. | +| `Failed` with `LicenseValidationFailed` | OS doesn't match a billable Windows Server SKU | Confirm the OS - PayGo doesn't apply to Windows client or Linux. | +| Billing showed up but customer expected SA Benefits | Customer activated PayGo when they meant SA | Disable PayGo, enable SA via `softwareAssuranceCustomer: true`. | +| At-scale PATCH partially failed | Some machines lacked the right role for the executor | Use a managed identity with `Azure Connected Machine Resource Administrator` scope at the RG / sub. | + +## Routing back + +| Situation | Route to | +|---|---| +| Wants PayGo on Azure Local VMs | (proposed) `azure-local` skill (`hcipaygo` flag) | +| Wants PayGo on VMware / SCVMM Arc VMs | (proposed) `arc-vmware` / `arc-scvmm` | +| Wants Hotpatch enabled (PayGo is a prereq) | [hotpatch.md](hotpatch.md) | +| Has SA, not interested in PayGo | Use SA Benefits flag instead (`softwareAssuranceCustomer: true`). No separate workflow needed. | +| Wants Hybrid Benefit for Azure VMs | That's `azure-compute`, not Arc. | diff --git a/plugin/skills/azure-arc-servers/workflows/arc-server-onboard/arc-server-onboard.md b/plugin/skills/azure-arc-servers/workflows/arc-server-onboard/arc-server-onboard.md new file mode 100644 index 000000000..efdb233f4 --- /dev/null +++ b/plugin/skills/azure-arc-servers/workflows/arc-server-onboard/arc-server-onboard.md @@ -0,0 +1,191 @@ +# Arc Server Onboarding + +Guided flow for connecting one or many existing servers to Azure Arc via +the **Connected Machine agent** (`azcmagent`). + +## When to use + +- User wants to **connect**, **onboard**, **enroll**, **register**, or + **project** an existing server into Azure. +- User asks for an **install script**, **install command**, or + **agent download link** for Arc. +- User wants to onboard **at scale** (multiple servers): Configuration + Manager, Group Policy, Ansible, or a script handed to ops. +- User has an existing recommendation in hand and wants the deployable + artifact. + +> **Not for new VMs.** If the user wants Azure to provision a brand-new +> VM, route to `azure-compute`. Arc never creates a machine. + +> **Not for Arc-private-cloud VMs.** If the user wants a VM on Azure +> Local / VMware / SCVMM via a Resource Bridge, route to +> `azure-local` / `arc-vmware` / `arc-scvmm` (proposed skills). + +## Workflow + +### Step 1 - Single server or at scale? + +If the user said "a server" / "this machine" / "my laptop" -> single. +If the user said "all our servers" / "100 machines" / "fleet" / mentioned +Group Policy or Configuration Manager or Ansible -> at scale. +If unclear, ask one question: + +> "Onboarding one machine, or rolling this out across many at once?" + +At-scale path: load +[references/at-scale-onboarding.md](references/at-scale-onboarding.md). +Single-server path: continue below. + +### Step 2 - Confirm prerequisites are met + +Don't generate a script if a prereq is going to make it fail. Load +[references/prerequisites.md](references/prerequisites.md) and confirm: + +- The user has at least the **Azure Connected Machine Onboarding** role + on the target resource group. +- The `Microsoft.HybridCompute`, `Microsoft.GuestConfiguration`, and + `Microsoft.HybridConnectivity` resource providers are registered in + the target subscription. (Use `mcp__azure__resource_show` on the + subscription's `providers` to check.) +- The target machine OS is supported (recent Windows Server / desktop + Windows / supported Linux distro). +- The machine can reach the required endpoints (see + [references/connectivity-options.md](references/connectivity-options.md)). + +If any prereq is missing, surface it now and offer a fix. Do not +silently proceed. + +### Step 3 - Depth probe (adaptive gather) + +Classify the user's intent and ask **only** the questions that matter +for their path. Never re-ask anything the user already said. + +| Signal in user's message | Path | +|---|---| +| "just try it", "lab", "POC", "one box", nothing specific | **Fast path** - assume defaults, ask only sub / RG / region. | +| "production", "many machines", "automate" | **Operations path** - ask about auth (Service Principal), connectivity, and management services. | +| "VNet", "Private Link", "Arc Gateway", "proxy", "no public internet", "air-gapped" | **Networking-deep path** - load [references/connectivity-options.md](references/connectivity-options.md) and ask the right network questions. | +| "compliance", "FedRAMP", "Gov cloud", "Mooncake", "sovereign" | **Compliance path** - confirm cloud (Public / USGov / China / Edge), pick the matching download endpoint, force Service Principal auth. | + +### Step 4 - Gather minimum required inputs + +For every path, you need: + +| Input | Default if not given | Notes | +|---|---|---| +| **Subscription** | None - must ask | Use `mcp__azure__subscription_list` to show options. | +| **Resource group** | Propose `-arc-rg` or `arc-servers-rg` | Use `mcp__azure__group_list` to show existing. | +| **Region** | Propose `eastus` for Public, `usgovvirginia` for USGov, `chinanorth3` for China | Arc has regional residency for the resource record. Pick a region close to the machine. | +| **OS** | None - must ask if not obvious | Windows or Linux. Drives which script to generate. | +| **Authentication** | Interactive token (single machine) / Service Principal (multi or unattended) | See [references/connectivity-options.md](references/connectivity-options.md). | +| **Connectivity** | Public endpoint | Ask only if networking-deep signal. | +| **Tags** | `{ environment: , owner: }` | Accept "none" without follow-up. | + +For at-scale paths, add: + +| Input | Default | Notes | +|---|---|---| +| **Deployment method** | Ask | One of: `Basic` (run script per machine), `ConfigurationManager`, `GroupPolicy`, `Ansible`. Source enum: `DeploymentOptions` in `Client/React/Views/ArcServers/Enums.d.ts`. | +| **Service Principal** | Required for all non-Basic at-scale methods | The interactive token flow does not work for unattended installs. | + +For Edge / sovereign environments: + +| Input | Notes | +|---|---| +| **Cloud** | Public / USGov / China / Edge / Air-gapped. Different download URLs and endpoint blocks. The portal handles this via feature flags (`enableSpecifiedArmEndpoint`, `enableAltDownload`, `enableAltHisEndpoint`, `enableAgcAltDownload` for air-gapped). | +| **Alt HIS endpoint** | Required for Edge / air-gapped. | +| **Alt download URL** | Required for Edge / air-gapped where the standard blob endpoint is unreachable. | + +### Step 5 - Plan Card + +> **GATE.** Do not output a script until the user has seen and approved +> the Plan Card. + +Render a single markdown table summarizing every decision (asked + +defaulted). Example: + +| Decision | Value | Source | +|---|---|---| +| Subscription | Contoso-Prod (00000000-...) | User | +| Resource group | `arc-servers-rg` (new) | Default | +| Region | eastus | Default | +| OS | Windows | User | +| Auth | Service Principal `arc-onboard-sp` | User | +| Connectivity | Public endpoint with proxy `http://proxy.contoso.com:8080` | User | +| Tags | `environment=prod`, `team=infra` | User | +| Management services to install | Update Manager, Defender for Cloud | Default for prod | +| Cloud | AzureCloud (Public) | Inferred | + +Ask: *"Approve as-is, edit a row, or change output format?"*. Do not +generate until approved. + +### Step 6 - Generate the script + +Pick the script format that matches the user's deployment method: + +| Deployment method | Script format | +|---|---| +| Basic / single | PowerShell (Windows) or Bash (Linux), one-file installer | +| Configuration Manager | PowerShell wrapped in a CM package | +| Group Policy | PowerShell + GPO setup steps | +| Ansible | YAML playbook (`win_shell` / `shell` tasks, `block`/`rescue` error handling) | + +Load +[references/deployment-methods.md](references/deployment-methods.md) for +the canonical structure of each, then emit. Inline the actual values +from the Plan Card; do not leave `` unless the user asked +for a template. + +**Source of truth.** The portal generates these scripts from +`src/HybridComputeExtension/Client/React/Views/ArcServers/Create/ScriptUx/Utilities/`. +If the user wants the exact portal script, send them to the **Add +server** wizard in the Arc Center +([portal link](https://portal.azure.com/#blade/Microsoft_Azure_HybridCompute/HybridVmAddBlade)) +and tell them which deployment method tile to pick. + +### Step 7 - Delivery + +Ask once: *"Where should this script go?"* Three options: + +| Option | Action | +|---|---| +| **Print here** | Render the script inline. Add a copy-block fence. | +| **Save locally** | Write to a file using available file-system tools; tell the user the path. | +| **GitHub PR** | Open a PR with the script under `scripts/arc-onboarding/`. | + +### Step 8 - Tell the user what to do next + +For all paths, end with a short "what happens after you run it" note: + +1. `azcmagent connect` runs and prints a device-code URL (interactive + auth) or authenticates non-interactively (Service Principal). +2. The machine appears in the Arc Center with status `Connecting`, then + `Connected` within a minute. +3. Auto-upgrade is **enabled by default** on agent 1.41+. +4. Recommended next steps: enable Update Manager periodic assessment, + assign the recommended Arc-server policy initiative, link a Data + Collection Rule for Insights. + +Route the user to [arc-server-manage](../arc-server-manage/arc-server-manage.md) +if they ask "what now?". + +## Error handling + +| Scenario | Action | +|---|---| +| Subscription does not have `Microsoft.HybridCompute` registered | Offer to register via `az provider register --namespace Microsoft.HybridCompute`. The portal does this automatically; the script does not. | +| User does not have onboarding role | Surface required role: **Azure Connected Machine Onboarding** (or higher). Link them to the IAM blade for the target RG. | +| User asked for Private Link + Arc Gateway in the same breath | Allowed but unusual - confirm both are intended. Both require additional setup; see [references/connectivity-options.md](references/connectivity-options.md). | +| Air-gapped / Edge cloud and no alt-download URL was provided | Hard-stop. Ask for the alt-download URL (blob endpoint reachable from the air-gapped network). | +| User on Windows Server 2012 / 2012 R2 | Confirm they're onboarding for ESU. Mention ESU eligibility - route to [arc-server-manage](../arc-server-manage/arc-server-manage.md) after onboarding succeeds. | +| Script fails on machine with "no network" / 403 / connection refused | Route to [arc-server-troubleshoot](../arc-server-troubleshoot/arc-server-troubleshoot.md). Do not regenerate the script. | +| Linux machine but user wants RDP | Likely confusion - clarify OS. | + +## Routing back / handoff + +| Situation | Route to | +|---|---| +| Script ran but agent shows Disconnected / Expired / Error | [arc-server-troubleshoot](../arc-server-troubleshoot/arc-server-troubleshoot.md) | +| Onboarding worked, now what? | [arc-server-manage](../arc-server-manage/arc-server-manage.md) | +| Need to onboard hundreds of machines via Group Policy | [references/at-scale-onboarding.md](references/at-scale-onboarding.md) | +| Deep network questions (Private Link DNS, gateway allowlist) | [references/connectivity-options.md](references/connectivity-options.md) and (proposed) `arc-private-link` / `arc-gateway` skills | diff --git a/plugin/skills/azure-arc-servers/workflows/arc-server-onboard/references/at-scale-onboarding.md b/plugin/skills/azure-arc-servers/workflows/arc-server-onboard/references/at-scale-onboarding.md new file mode 100644 index 000000000..6075484e8 --- /dev/null +++ b/plugin/skills/azure-arc-servers/workflows/arc-server-onboard/references/at-scale-onboarding.md @@ -0,0 +1,157 @@ +# At-Scale Arc Server Onboarding + +Load this when the user is onboarding many machines at once. The +single-server fast path is in +[arc-server-onboard.md](../arc-server-onboard.md); this file extends it +with the at-scale-specific decisions. + +## When to use + +| Scale | Recommendation | +|---|---| +| 1-5 machines | Single-server flow per machine. Don't over-engineer. | +| 5-50 machines, no central management | Basic script + a Service Principal, distributed via shared share / email. | +| 50+ machines, AD-joined Windows | **Group Policy** (Windows) or **Ansible** (mixed). | +| Any scale, ConfigMgr shop | **Configuration Manager**. | +| Any scale, multi-cloud (AWS / GCP machines) | Either at-scale script per cloud, **or** consider the AWS / GCP connector (proposed `arc-multicloud` skill) for inventory sync. | +| 100s of machines, want one firewall rule | Add **Arc Gateway** to whichever deployment method you pick. | + +## Decisions you must collect before generating + +| Decision | Why it matters | +|---|---| +| **Deployment method** | Drives the script template. See [deployment-methods.md](deployment-methods.md). | +| **Service Principal** | All non-Basic at-scale paths require non-interactive auth. The interactive token flow does not scale. | +| **Connectivity** | Public, Private Link, Arc Gateway. Affects script content and possibly DNS setup. | +| **Tag schema** | Apply consistent tags so the fleet is filterable post-onboarding (env, owner, datacenter). | +| **Management services** | Decide once for the fleet. Adding extensions post-onboarding via Policy is fine, but it's cleaner to bake them into the onboarding script. | +| **Naming** | The resource name = OS hostname. If hostnames collide (rare but possible across air-gapped sites), plan a rename strategy before onboarding. | +| **OU / collection / inventory scope** | Where the deployment tool targets the script. | + +## Service Principal setup (one-time) + +Create a single onboarding SP for the whole fleet. Scope it minimally. + +```bash +# Create +az ad sp create-for-rbac \ + --name "arc-onboarding-sp" \ + --role "Azure Connected Machine Onboarding" \ + --scopes "/subscriptions//resourceGroups/" +``` + +Capture `appId`, `password`, `tenant`. Store the secret in: + +- For Group Policy: a sealed shared folder readable by SYSTEM on the GPO + target machines (use `cipher.exe /E` or DPAPI-NG). +- For ConfigMgr: a CM script parameter / secret store. +- For Ansible: Ansible Vault. + +**Rotate the secret every 6-12 months.** Expired SP secrets are the #2 +cause of "we onboarded 200 servers and now half are showing Expired in +the Arc Center" - just before "DNS misconfiguration on Private Link". + +## At-scale flow + +### Step 1 - Pilot + +Onboard 1-3 machines first using the chosen deployment method. Verify: + +- The machines show up in the Arc Center within 5 minutes. +- Status flips to `Connected`. +- `agentVersion` is current. +- The expected extensions install. +- Run `azcmagent check` on a pilot machine - all endpoints should be reachable. + +If any fail, fix before fanning out. + +### Step 2 - Stage + +Roll out to a representative subset (one OU, one site, one CM +collection). Watch the Arc Center for the next 24 hours. Use a Resource +Graph query to spot stragglers: + +```kql +resources +| where type =~ "microsoft.hybridcompute/machines" +| where tags.deploymentBatch == "stage-2025-05" +| extend status = tostring(properties.status), agentVersion = tostring(properties.agentVersion) +| summarize count() by status +``` + +### Step 3 - Production + +Fan out to the rest of the fleet. Continue to monitor with the same +query. Triage any non-`Connected` machines via +[arc-server-troubleshoot](../../arc-server-troubleshoot/arc-server-troubleshoot.md). + +### Step 4 - Lock in + +Once steady-state: + +- Assign the **Azure Arc-enabled Servers recommended initiative** + (Azure Policy) at the management group or subscription scope. The + portal's `Assign Recommended Policies` blade does this; the policy + IDs are in `ArcServerOverview.ReactView.tsx`. +- Enable **agent auto-upgrade** at the fleet level - see + [../../arc-server-manage/references/agent-upgrade.md](../../arc-server-manage/references/agent-upgrade.md). +- Link a **Data Collection Rule** for Insights / Update Manager so new + machines get the right config the moment they connect. + +## Per-method specifics + +### Group Policy + +- Use a Computer Startup PowerShell script (not a Logon script). +- Pin the script to a single version per OU - don't auto-pull "latest" + or you'll have skewed installs. +- The portal's GroupPolicyReviewComponents render the exact GPMC + click-paths. Reproduce them in user guidance. +- Validation command: `gpresult /h gpresult.html` on a target machine. + +### Configuration Manager + +- Application detection method: registry key + `HKLM\SOFTWARE\Microsoft\AzureConnectedMachineAgent`. +- User experience: Hidden, Whether or not a user is logged on. +- Allowed install time: any. +- Maximum allowed run time: 30 minutes (network can vary). + +### Ansible + +- Tag machines by readiness state in your inventory; only run the + playbook against `arc_ready=true`. +- Use `serial: 25` to throttle so 1000 machines don't all open + outbound 443 in the same second. +- The portal templates use `block` / `rescue` / `always` per step. Mirror + it. Specifically: catch installer failures separately from `connect` + failures, because the remediation is different. + +### Basic + custom orchestrator (Terraform, Bicep, Salt, custom Bash) + +The Basic PowerShell / Bash is a single self-contained script. Wrap +in whatever tool the user has. The non-obvious bit: pass the +correlation ID through so onboardings are traceable end-to-end. + +## What to tell the user about timing + +| Scale | Realistic onboarding window | +|---|---| +| 10 machines | minutes | +| 100 machines | 1-2 hours (depending on rollout cadence) | +| 1,000 machines | 1-3 days (staged) | +| 10,000+ machines | 1-2 weeks staged | + +The bottleneck is **not** Azure - it's the user's deployment tool and +their willingness to fan out without observing the pilot. + +## Source in this repo + +- `Client/React/Views/ArcServers/Create/ScriptUx/ArcServerCreateAtScale.ReactView.tsx` + (the at-scale wizard blade) +- `Client/React/Views/ArcServers/Create/ScriptUx/ArcServerCreateLoggerAtScale.ts` + (telemetry hooks - shows what fields the portal cares about) +- `Client/React/Views/ArcServers/Create/ScriptUx/Components/ArcServerCreateAtScaleReviewTabComponents.tsx` + (review tab) +- `Client/React/Views/ArcServers/Create/ScriptUx/Components/GroupPolicyReviewComponents.tsx` + (Group Policy click-paths) diff --git a/plugin/skills/azure-arc-servers/workflows/arc-server-onboard/references/connectivity-options.md b/plugin/skills/azure-arc-servers/workflows/arc-server-onboard/references/connectivity-options.md new file mode 100644 index 000000000..76e2945ab --- /dev/null +++ b/plugin/skills/azure-arc-servers/workflows/arc-server-onboard/references/connectivity-options.md @@ -0,0 +1,192 @@ +# Arc Server Connectivity Options + +How the Connected Machine agent talks to Azure. The user picks one +primary path; combinations are possible but rare. + +## The four primary paths + +```text +Machine on the internet? +├── Yes, direct outbound 443 ─────────────────────────→ Public endpoint +├── Yes, but through corporate proxy ─────────────────→ Public endpoint + proxy +├── No internet, only private ExpressRoute / VPN ─────→ Private Link Scope +├── No internet, but want one egress point ──────────→ Arc Gateway (alone or w/ Private Link) +└── Air-gapped / disconnected operations ─────────────→ Edge / Air-gapped cloud (alt endpoints) +``` + +The portal exposes the choice via the `ConnectivityMethod` enum in +`Client/React/Views/ArcServers/Enums.d.ts`: + +```ts +export const enum ConnectivityMethod { + PublicEndpoint = "publicEndpoint", + PrivateLink = "privateLink", +} +``` + +Arc Gateway is an orthogonal toggle that overlays on top of either +`PublicEndpoint` or `PrivateLink` and is gated by the `arcservergateways` +feature flag. + +## Public endpoint + +The default. The machine reaches Azure over the public internet using +the endpoints listed in +[prerequisites.md](prerequisites.md#network-egress-minimum-endpoint-set). + +**No extra Azure resources needed.** + +### Add a proxy + +For machines behind a corporate proxy, pass the proxy URL at install +time: + +```powershell +# Windows install snippet +$env:ArcOnboardingProxy = "http://proxy.contoso.com:8080" +& "$env:ProgramFiles\AzureConnectedMachineAgent\azcmagent.exe" connect ` + --resource-group $rg ` + --tenant-id $tenant ` + --location $region ` + --subscription-id $sub ` + --proxy-url $env:ArcOnboardingProxy +``` + +```bash +# Linux install snippet +sudo azcmagent connect \ + --resource-group "$RG" \ + --tenant-id "$TENANT" \ + --location "$REGION" \ + --subscription-id "$SUB" \ + --proxy-url "http://proxy.contoso.com:8080" +``` + +If the proxy requires auth, set credentials via the agent config file +or environment variables - do not bake them into the install command. + +### Add proxy bypass + +If specific endpoints (e.g. internal package mirrors) should bypass the +proxy: + +```bash +azcmagent config set proxy.bypass "ArcData,ArcServer,WAC" +``` + +(Use the symbolic bypass names; `azcmagent config list proxy.bypass` +shows what's available.) + +## Private Link Scope (AMPLS for Arc) + +Pre-creates a `Microsoft.HybridCompute/privateLinkScopes` resource and +associates the machine with it during onboarding. All Arc traffic flows +over private IPs via a Private Endpoint into the user's VNet. + +### When to choose + +- Compliance requires no public internet egress for the machine. +- The machine is on an isolated VNet reachable only over ExpressRoute / + VPN. +- The user wants per-scope DNS, audit, and IP control. + +### Costs / setup + +- One Private Link Scope per region you onboard machines into. +- One Private Endpoint per VNet that hosts agents. +- Private DNS zones for `*.guestconfiguration.azure.com`, + `*.his.arc.azure.com`, `*.dp.kubernetesconfiguration.azure.com`, + `*.servicebus.windows.net`, `*.guestnotificationservice.azure.com`, + `agentserviceapi.guestconfiguration.azure.com`. The portal can + auto-create these via `arcserverprivatelinkonboarding`. +- ExpressRoute / VPN between the on-prem network and the VNet. + +### Onboarding snippet + +```bash +azcmagent connect \ + --resource-group "$RG" \ + --tenant-id "$TENANT" \ + --location "$REGION" \ + --subscription-id "$SUB" \ + --private-link-scope "/subscriptions//resourceGroups//providers/Microsoft.HybridCompute/privateLinkScopes/" +``` + +### Common gotchas + +- DNS must resolve the private addresses on the machine. A split-horizon + DNS misconfiguration is the #1 cause of "agent installed but never + connects" with Private Link. +- A machine onboarded **with** Private Link cannot be moved to public + endpoint without `azcmagent disconnect` and re-onboard. + +## Arc Gateway + +A `Microsoft.HybridCompute/gateways` resource that acts as a controlled +egress point for many Arc machines. Reduces the number of FQDNs each +machine needs to reach to **one** gateway FQDN. + +### When to choose + +- The user wants to simplify firewall allowlists (one FQDN, not 20). +- The user wants central audit of all Arc agent traffic. +- The user is also onboarding multi-cloud (AWS / GCP) machines and wants + one egress for all of them. + +### Onboarding snippet + +```bash +azcmagent connect \ + --resource-group "$RG" \ + --tenant-id "$TENANT" \ + --location "$REGION" \ + --subscription-id "$SUB" \ + --gateway-id "/subscriptions//resourceGroups//providers/Microsoft.HybridCompute/gateways/" +``` + +### Combinable with Private Link + +Yes. Arc Gateway can sit in front of a Private Link Scope, so a single +allowlist entry covers both. Set both `--gateway-id` and +`--private-link-scope` on connect. + +## Air-gapped / Edge cloud + +The portal supports air-gapped operations via the `enableagcaltdownload` +(US Sec / US Nat) and `enablealtdownload` (Edge) feature flags. The +onboarding script substitutes: + +- An **alt download URL** for the agent installer (a blob endpoint + reachable from the air-gapped network). +- An **alt HIS endpoint** (`enablealthisendpoint`) for first-call + discovery. +- A specified **ARM endpoint** (`enablespecifiedarmendpoint`) for + Edge / Stack Hub scenarios. +- A custom **endpoints.json** file written under + `%ProgramData%\AzureConnectedMachineAgent\Config\endpoints.json` (the + `Save-EndpointsFile` function the portal injects - see + `OnboardingScriptCloudSpecificBlocks.ts`). + +If the user is on Public Cloud, none of this applies. + +## Decision matrix + +| User's situation | Connectivity | Auth | Extras | +|---|---|---|---| +| Single laptop POC | Public | Interactive token | none | +| 5 servers in a corp datacenter, internet via proxy | Public | Service Principal | proxy URL | +| 100 servers in a private datacenter, ExpressRoute to Azure | Private Link | Service Principal | PLS + DNS zones | +| 500 servers across multiple sites, want one firewall rule | Arc Gateway | Service Principal | Gateway resource | +| Air-gapped facility | Edge / Air-gapped | Service Principal | alt-download URL + alt-HIS endpoint + endpoints.json | + +## Source of truth in this repo + +The connectivity options the portal exposes are defined in: + +- `Client/React/Views/ArcServers/Enums.d.ts` (`ConnectivityMethod`) +- `Client/React/Views/ArcServers/Create/ScriptUx/Utilities/ResolveOnboardingInputs.ts` +- `Client/React/Views/ArcServers/Create/ScriptUx/Utilities/OnboardingScriptCloudSpecificBlocks.ts` +- `extension.config.json` (feature flags: `arcservergateways`, + `arcserverprivatelinkonboarding`, `enablespecifiedarmendpoint`, + `enablealtdownload`, `enablealthisendpoint`, `enableagcaltdownload`) +- `featureFlags.md` (per-environment status) diff --git a/plugin/skills/azure-arc-servers/workflows/arc-server-onboard/references/deployment-methods.md b/plugin/skills/azure-arc-servers/workflows/arc-server-onboard/references/deployment-methods.md new file mode 100644 index 000000000..969cc1c39 --- /dev/null +++ b/plugin/skills/azure-arc-servers/workflows/arc-server-onboard/references/deployment-methods.md @@ -0,0 +1,185 @@ +# Arc Server Deployment Methods + +The portal generates four flavors of onboarding artifact. Pick one based +on the user's deployment substrate. + +Source enum: `DeploymentOptions` in +`Client/React/Views/ArcServers/Enums.d.ts`: + +```ts +export const enum DeploymentOptions { + Basic = "Basic", + ConfigurationManager = "ConfigurationManager", + GroupPolicy = "GroupPolicy", + Ansible = "Ansible", +} +``` + +## Decision table + +| User says | Pick | +|---|---| +| "I'll run it on this one box" | **Basic** | +| "We use SCCM / ConfigMgr / MECM" | **Configuration Manager** | +| "We use Group Policy / AD / GPO" | **Group Policy** | +| "We use Ansible / Tower / AWX / AAP" | **Ansible** | +| "We use Puppet / Chef / Salt / Terraform / Bicep" | Use the **Basic** PowerShell / Bash and wrap it in their existing tool. There is no first-party template for those. | +| "We use Intune" | Use the **Basic** PowerShell and deploy as a Win32 app or platform script. | + +## 1. Basic (PowerShell or Bash) + +A self-contained installer script. The user copies it to the machine +and runs it as admin / root. The portal generates this for every +single-server flow. + +**Source templates:** +- `Client/React/Views/ArcServers/Create/ScriptUx/Utilities/Templates/WindowsScriptTemplate.ts` +- `Client/React/Views/ArcServers/Create/ScriptUx/Utilities/Templates/LinuxScriptTemplate.ts` + +**Pipeline:** +1. Set environment variables for cloud, region, tenant, subscription, + resource group, correlation ID, auth (Service Principal or token). +2. Download the agent installer from a known URL (or alt-download for + Edge / air-gapped). +3. Install (`msiexec` on Windows, `apt` / `dnf` / `zypper` on Linux). +4. Optionally write `endpoints.json` for sovereign / Edge clouds. +5. Run `azcmagent connect` with all the right flags. +6. Optionally install management-services extensions (Update Manager, + Defender, etc.) on the same machine in the same script. + +**Auth modes:** +- **Interactive token** (default for single machine): the script prints + a device-code URL; the user signs in once. +- **Service Principal**: the script reads `appId` / `clientSecret` from + env vars and authenticates non-interactively. Required for any + unattended path. + +## 2. Configuration Manager + +The portal generates a PowerShell script + the instructions for +packaging it as a CM application or package. Suitable for shops with +mature ConfigMgr practice. + +**Required inputs:** +- Service Principal (interactive token doesn't work for CM-managed installs). +- A remote share path where the agent installer can sit, accessible to + CM distribution points. + +**Typical structure:** +1. Copy `AzureConnectedMachineAgent.msi` to a network share. +2. Create a CM application with detection rule "registry key + `HKLM\SOFTWARE\Microsoft\AzureConnectedMachineAgent` exists". +3. Install command: `powershell.exe -ExecutionPolicy Bypass -File OnboardingScript.ps1`. +4. Deploy to the appropriate device collection. + +## 3. Group Policy + +The portal generates a PowerShell script + instructions for the GPMC +(Group Policy Management Console). Targeted at AD-joined Windows +servers. + +**Required inputs:** +- Service Principal stored in a domain-readable location (often a + read-only secret on the shared SYSVOL or a sealed shared folder). +- A remote share for the installer and script (the `RemoteShareName` + and `RemoteServerDomainName` inputs you'll see in the portal map to + this). + +**Typical structure:** +1. Place the installer and `OnboardingScript.ps1` on a domain-readable + share. +2. Create a Computer Configuration GPO with a Startup PowerShell Script + pointing to the share. +3. Link the GPO to the OU containing the target servers. +4. Force a `gpupdate /force` or wait for next boot. + +The portal's GroupPolicyReviewComponents render the click-through +instructions, available links to download +`DeployGPO.ps1`/`EnableCredSSP.ps1`, and the `gpresult` validation +steps. Mirror that in the generated guidance. + +## 4. Ansible + +A YAML playbook with `win_shell` / `shell` tasks and `block`/`rescue` +error handling. Source templates: + +- `Client/React/Views/ArcServers/Create/ScriptUx/Utilities/Templates/AnsibleWindowsTemplate.ts` +- `Client/React/Views/ArcServers/Create/ScriptUx/Utilities/Templates/AnsibleLinuxTemplate.ts` + +**Required inputs:** +- Service Principal (always - Ansible runs unattended). +- Tower / AWX / AAP inventory targeting the right hosts. + +**Typical structure (Linux):** +```yaml +- hosts: arc_targets + become: true + tasks: + - name: Download Arc agent install script + get_url: + url: "https://aka.ms/azcmagent" + dest: /tmp/install_linux_azcmagent.sh + mode: '0755' + + - name: Install the agent + shell: bash /tmp/install_linux_azcmagent.sh + + - name: Connect to Azure + shell: | + azcmagent connect \ + --resource-group "{{ rg }}" \ + --tenant-id "{{ tenant_id }}" \ + --location "{{ region }}" \ + --subscription-id "{{ subscription_id }}" \ + --service-principal-id "{{ sp_app_id }}" \ + --service-principal-secret "{{ sp_secret }}" \ + --tags "{{ tags }}" + register: connect_result + failed_when: connect_result.rc != 0 +``` + +The portal's Ansible templates wrap each step in `block` / `rescue` +with explicit `assert` tasks - mirror that pattern for production use. + +## What the portal does that the agent (and your script) should match + +Read this section if you're hand-rolling vs using the portal-generated +script. + +1. **Correlation ID.** The portal injects a GUID per onboarding so that + support can trace from the portal click to the Azure resource. Set + `--correlation-id ` on `azcmagent connect`. +2. **Tags.** The portal merges per-form tags + location tags + default + tags. Pass them on `connect` with `--tags "k=v,k=v"`. +3. **Cloud parameter.** Public / USGov / China / Edge - each has its + own value. The portal pulls this from `getEnvironmentValue` in the + ReactView. The agent flag is `--cloud AzureCloud` (default), + `AzureUSGovernment`, `AzureChinaCloud`, etc. +4. **Configuration profile (Automanage).** Optional. If the user picked + an Automanage best-practices profile in the wizard, the script + triggers it post-connect. This is gated by + `arcservercreateautomanagemachine` and currently off in all + environments per `featureFlags.md` - skip unless the user explicitly + asks for Automanage. +5. **Management services.** The selected services (Update Manager, + Defender, Insights, Change Tracking, Machine Config, Hotpatch) become + `az connectedmachine extension create` commands appended to the + script. The enum is `ManagementServiceKey`. +6. **TPM identity.** Behind `enabletpmservers`, off everywhere - skip. + +## When to send the user to the portal instead + +If the user wants: + +- The **canonical** script with all sovereign / Edge / proxy / gateway + / TPM blocks resolved automatically, OR +- A copy-paste Group Policy / Configuration Manager review screen, OR +- The agent download link signed with their tenant's correlation ID, + +...tell them: open Azure portal, search for **Azure Arc**, click +**Manage servers** > **Add**. Pick the deployment method tile that +matches the choice above. The wizard generates the exact same script +the portal would generate. + +Portal link: +[Add an Arc server](https://portal.azure.com/#blade/Microsoft_Azure_HybridCompute/HybridVmAddBlade). diff --git a/plugin/skills/azure-arc-servers/workflows/arc-server-onboard/references/prerequisites.md b/plugin/skills/azure-arc-servers/workflows/arc-server-onboard/references/prerequisites.md new file mode 100644 index 000000000..82765beed --- /dev/null +++ b/plugin/skills/azure-arc-servers/workflows/arc-server-onboard/references/prerequisites.md @@ -0,0 +1,124 @@ +# Arc Server Onboarding Prerequisites + +Check these before generating an onboarding script. A failing prereq is +the most common cause of a "the script ran but the machine never showed +up" support call. + +## On the Azure side + +### Required resource providers (per subscription) + +| Provider | Why | +|---|---| +| `Microsoft.HybridCompute` | The Arc machine resource itself. | +| `Microsoft.GuestConfiguration` | Required for Machine Configuration / policy assignment. | +| `Microsoft.HybridConnectivity` | Required for SSH / RDP via Arc and for Arc Gateway. | +| `Microsoft.AzureArcData` (optional) | Required only if the user will also use Arc-enabled SQL Server. | + +Check with `az provider show -n Microsoft.HybridCompute --query registrationState`. +Register with `az provider register -n Microsoft.HybridCompute`. +Registration is async and can take a few minutes. + +### Required RBAC role on the target resource group + +| Role | When | +|---|---| +| **Azure Connected Machine Onboarding** (`b64e21ea-ac4e-4cdf-9dc9-5b892992bee7`) | Minimum role to run `azcmagent connect`. Sufficient if the user only needs to onboard. | +| **Azure Connected Machine Resource Administrator** (`cd570a14-e51a-42ad-bac8-bafd67325302`) | Required to manage the machine resource after onboarding (install / remove extensions, change tags, delete). | +| **Contributor** | A superset; works but is over-privileged. | + +For at-scale onboarding via Service Principal, the SP needs **Azure +Connected Machine Onboarding** on the target RG. + +### Region availability + +Arc is available in every public Azure region. Sovereign and Edge clouds +vary - confirm the user's target region supports `Microsoft.HybridCompute`. + +## On the machine side + +### Supported OS + +| OS family | Supported versions (high-water mark - check docs for exact list) | +|---|---| +| Windows Server | 2008 R2 SP1, 2012, 2012 R2, 2016, 2019, 2022, 2025 (2012 / 2012 R2 require ESU) | +| Windows client (x64) | Windows 10 / 11 (limited to specific edition lines) | +| RHEL / CentOS / Oracle / Rocky / AlmaLinux | 7+, 8+, 9+ | +| Ubuntu | 18.04 / 20.04 / 22.04 / 24.04 LTS | +| SLES | 12 SP5+, 15 | +| Debian | 10, 11, 12 | +| Amazon Linux | 2, 2023 | + +ARM64 Linux is supported on a subset; ARM Windows is not. + +> Always link the user to the +> [official supported-OS list](https://learn.microsoft.com/azure/azure-arc/servers/prerequisites#supported-operating-systems) +> for the authoritative version matrix. + +### Hardware / runtime + +- 2 GB RAM minimum, 4 GB recommended. +- Local administrator / root to install the agent. +- ~500 MB free disk under `%ProgramData%\AzureConnectedMachineAgent` or `/var/opt/azcmagent`. +- TLS 1.2 enabled (most modern OSes; explicit on older Windows Server). + +### Conflicts to check before installing + +- If the machine **is already an Azure VM** (`Microsoft.Compute/virtualMachines`), + do **not** install `azcmagent` on it. An Azure VM is already projected into + Azure through the Azure VM Guest Agent (the Linux Azure Agent `waagent`, or + the Windows Guest Agent) and the Compute resource provider; running the + Connected Machine agent on top is not a supported configuration and + conflicts with the Guest Agent. Arc-enabled servers is only for machines + that live **outside** Azure. (If the user actually wants to *create* a VM + in Azure, that is the `azure-compute` skill.) +- The **Log Analytics agent (MMA)** / **Azure Monitor Agent (AMA)** can + coexist, but a machine that already has direct LA workspace association + should be migrated to Arc-managed associations to avoid double-ingest. +- Some EDR / AV products quarantine `azcmagent.exe` - check the user's + endpoint protection allowlist if onboarding silently fails. + +### Network egress (minimum endpoint set) + +The agent must be able to reach: + +| Endpoint | Purpose | +|---|---| +| `*..arc.azure.com` (HTTPS 443) | Per-region Hybrid Connectivity endpoint | +| `management.azure.com` (HTTPS 443) | ARM control plane | +| `login.microsoftonline.com` (HTTPS 443) | Entra ID auth | +| `.his.arc.azure.com` (HTTPS 443) | Hybrid Identity Service | +| `pas.windows.net` (HTTPS 443) | Privileged Access Service (extension manifest) | +| `dc.services.visualstudio.com` (HTTPS 443) | Telemetry | +| `gbl.his.arc.azure.com` (HTTPS 443) | Global HIS for first-call discovery | +| Sovereign / Edge cloud equivalents | Sovereign clouds use parallel endpoints (`*.azure.us`, `*.azure.cn`). Edge clouds may override via the `enablealthisendpoint` / `enablespecifiedarmendpoint` feature flags. | + +If the machine has no outbound internet, the user needs **Arc Private +Link Scope** and/or **Arc Gateway** - see +[connectivity-options.md](connectivity-options.md). + +Run `azcmagent check` on the machine to validate connectivity to all of +the above in one shot. + +### Optional but recommended + +- **Time synchronization.** Skew > 5 minutes breaks Entra token auth. +- **Hostname stability.** The agent uses the OS hostname as the resource + name. Renaming the host after onboarding causes a duplicate Arc + resource (the old one will show Disconnected). +- **Static or DHCP-reserved IP** for machines that will use Arc-enabled + SSH / RDP via Hybrid Connectivity. + +## Checklist before generating the script + +Pin this in your head: + +- [ ] Subscription chosen, providers registered. +- [ ] User has onboarding role on target RG. +- [ ] OS confirmed supported. +- [ ] Network egress path planned (public / proxy / Private Link / Gateway / air-gapped). +- [ ] Auth method chosen (interactive token vs Service Principal). +- [ ] Tags decided (or "none"). +- [ ] Management services chosen (or "none"). + +Only then proceed to script generation. diff --git a/plugin/skills/azure-arc-servers/workflows/arc-server-troubleshoot/arc-server-troubleshoot.md b/plugin/skills/azure-arc-servers/workflows/arc-server-troubleshoot/arc-server-troubleshoot.md new file mode 100644 index 000000000..446ccef53 --- /dev/null +++ b/plugin/skills/azure-arc-servers/workflows/arc-server-troubleshoot/arc-server-troubleshoot.md @@ -0,0 +1,160 @@ +# Arc Server Troubleshooting + +The Arc machine is registered but not in a healthy `Connected` state, or +extensions are misbehaving, or the onboarding script ran but no resource +appeared. Diagnose, then fix. + +## When to use + +| Signal in user's message | Yes | +|---|---| +| "Agent is disconnected / expired / error" | Yes | +| "Machine shows offline in Arc" | Yes | +| "azcmagent connect failed" | Yes | +| "Extension stuck in `Creating` / `Failed`" | Yes | +| "I ran the script but nothing shows up in the portal" | Yes | +| "Onboarding worked yesterday but agent stopped reporting" | Yes | + +If the user is **about** to onboard and hasn't run a script yet, route +to [arc-server-onboard](../arc-server-onboard/arc-server-onboard.md). + +## Workflow + +### Step 1 - Get the agent status + +The status comes from `properties.status` on the +`Microsoft.HybridCompute/machines` resource. The portal uses these values +(source: `AgentStatus` enum in `Utilities/Enums`): + +| Status | Meaning | First action | +|---|---|---| +| `Connected` | Agent reported in within the last 5 minutes. | Healthy. If user has a complaint, look at extensions, not agent. | +| `Disconnected` | Agent has not reported in 5-15+ minutes. | Check `lastStatusChange`. Network or service issue. See [agent-status.md](references/agent-status.md). | +| `Expired` | Agent has not reported in for hours / days. The Entra token has aged out. | Re-auth required. Most often Service Principal secret expired. | +| `Error` | Agent reported, but its self-report says something is wrong. | Read `errorDetails`. | + +Fastest way to see it: + +```bash +mcp__azure__resource_show \ + --resource-id "/subscriptions//resourceGroups//providers/Microsoft.HybridCompute/machines/" +``` + +Or for a fleet: + +```kql +resources +| where type =~ "microsoft.hybridcompute/machines" +| extend status = tostring(properties.status) +| where status != "Connected" +| project name, resourceGroup, status, + lastStatusChange = properties.lastStatusChange, + errorDetails = properties.errorDetails +``` + +### Step 2 - Route by status + +| Status | Where to go next | +|---|---| +| `Disconnected`, `Expired`, `Error` | [references/agent-status.md](references/agent-status.md) walks each case. | +| `Connected` but extension is broken | Skip to "Extension troubleshooting" below. | +| Resource does not exist at all | Skip to "The script ran but nothing appeared" below. | + +### Step 3 - Get on the machine + +For almost every non-`Connected` case, the fastest fix path is **on the +machine**. The agent self-diagnoses better than ARM can: + +```text +# Windows +& "$env:ProgramFiles\AzureConnectedMachineAgent\azcmagent.exe" show +& "$env:ProgramFiles\AzureConnectedMachineAgent\azcmagent.exe" check +& "$env:ProgramFiles\AzureConnectedMachineAgent\azcmagent.exe" logs + +# Linux +sudo azcmagent show +sudo azcmagent check +sudo azcmagent logs +``` + +`azcmagent show` reports the current state and the most recent error. +`azcmagent check` runs end-to-end connectivity probes to the required +Azure endpoints. `azcmagent logs` zips up the recent logs into a tar +file the user can attach to a support ticket. + +### Step 4 - Match the symptom to a remediation + +Load [references/common-issues.md](references/common-issues.md) and find +the row for the user's symptom. + +### Step 5 - Verify the fix took + +After remediation: + +1. On the machine: `azcmagent show` reports `Status: Connected`. +2. In ARM: status flips to `Connected` within ~2 minutes + (`mcp__azure__resource_show` to confirm). +3. The portal Overview blade shows the green status pill. + +If status flips back to `Disconnected` later, the underlying issue +wasn't fixed - re-diagnose. + +## Extension troubleshooting (when agent is Connected but extension is bad) + +Extensions are children of the machine: type +`Microsoft.HybridCompute/machines/extensions`. They have their own +`provisioningState` (`Creating`, `Succeeded`, `Failed`, `Deleting`). + +### Inspect + +```bash +# Via CLI +az connectedmachine extension list \ + --machine-name \ + --resource-group \ + --output table + +# Via Resource Graph for a fleet +resources +| where type =~ "microsoft.hybridcompute/machines/extensions" +| where properties.provisioningState != "Succeeded" +| project id, name, provisioningState = properties.provisioningState, + publisher = properties.publisher, + extensionType = properties.type, + version = properties.typeHandlerVersion +``` + +### Common extension-level issues + +| Symptom | Most likely cause | Fix | +|---|---|---| +| Extension stuck in `Creating` for > 30 min | Agent can't reach the extension distribution endpoint | `azcmagent check`; if `pas.windows.net` fails, fix proxy / firewall | +| Extension shows `Failed` with `404` | Stale extension version pinned | Update with `--type-handler-version` to the latest, or `--auto-upgrade-minor-version true` | +| Extension uninstall hangs | Extension's `Disable` script is failing | Force-uninstall: `az connectedmachine extension delete --force` | +| Many extensions failing on the same machine | Disk full / antivirus quarantine | Free space; check EDR quarantine for `extension*.exe` | +| Same extension failing across many machines | Extension was rolled out with a bad config | Roll back via Azure Policy "Configure to install ..." set to `Audit` while you fix | + +For specific extension issues (Azure Monitor Agent, Microsoft Defender, +Azure Update Manager), defer to that service's own troubleshooting. + +## The script ran but nothing appeared + +If `azcmagent connect` printed success but no resource is in ARM: + +| Check | Action | +|---|---| +| Did the script actually call `azcmagent connect`? | `azcmagent show` - if `Status: Disconnected (Not yet connected)`, the script bailed early. Scroll the script output for the error. | +| Did the script point at the right subscription? | Re-read the script - the `--subscription-id` and `--resource-group` flags must match where the user is looking. | +| Did auth succeed? | If interactive: did the user actually complete the device-code prompt? If SP: did the secret pass? Failure surfaces as `Authentication failed` in the script output. | +| Did the user have onboarding role? | Re-check role on the target RG. The script does **not** create the RG; missing RG also produces a clean failure. | +| Is the user looking in the right tenant? | Multi-tenant Service Principals sometimes connect to the wrong tenant. Confirm with `azcmagent show`. | + +## Routing back / handoff + +| Situation | Route to | +|---|---| +| Fundamental network gap (no egress, no Private Link) | [../arc-server-onboard/references/connectivity-options.md](../arc-server-onboard/references/connectivity-options.md) - re-onboard with the right path. | +| Service Principal secret expired | Rotate the secret, then re-run `azcmagent connect`. | +| Agent is many versions behind and won't upgrade | [../arc-server-manage/references/agent-upgrade.md](../arc-server-manage/references/agent-upgrade.md) | +| Onboarded the wrong machine / want to start over | `azcmagent disconnect` on the machine, then `mcp__azure__resource_delete` on the resource, then re-onboard. | +| User is convinced it's an Azure-side outage | Have them check [Azure Status](https://azure.status.microsoft/) for `Azure Arc`. If status shows degraded, they should open a support case rather than re-diagnose. | diff --git a/plugin/skills/azure-arc-servers/workflows/arc-server-troubleshoot/references/agent-status.md b/plugin/skills/azure-arc-servers/workflows/arc-server-troubleshoot/references/agent-status.md new file mode 100644 index 000000000..eda74f3bd --- /dev/null +++ b/plugin/skills/azure-arc-servers/workflows/arc-server-troubleshoot/references/agent-status.md @@ -0,0 +1,128 @@ +# Arc Agent Status Reference + +Each of the four agent statuses, what it means, and the most likely +remediation. Source: `AgentStatus` enum used across the +HybridComputeExtension (`Utilities/Enums.ts`, +`Client/React/Views/ArcServers/Blades/Overview/ArcServerOverviewStatusBar.tsx`). + +## `Connected` + +Agent has reported in within the last 5 minutes. The Overview blade +renders a green pill. + +Nothing to fix at the agent layer. If the user has a complaint, look at: +- Extensions (`Microsoft.HybridCompute/machines/extensions`) +- Recommended policies (`Assign Recommended Policies` blade) +- Updates (Azure Update Manager) +- Defender (Microsoft Defender for Cloud) + +## `Disconnected` + +Agent has not reported in for ~5-15 minutes. Could be transient +(machine rebooting, blip in connectivity) or persistent (broken). + +### Causes, in order of likelihood + +| Cause | Signal | Fix | +|---|---|---| +| Machine is off / rebooting | `lastStatusChange` < 30 min ago, no other symptom | Wait 5 minutes. If still down, check the machine. | +| `himds` / agent service stopped | Verifiable on machine: Windows `Get-Service himds`, Linux `systemctl status himds` | Restart the service. Investigate why it stopped (check Event Viewer / journalctl). | +| Outbound proxy / firewall changed | `azcmagent check` shows failed endpoints | Re-allow the endpoints in [connectivity-options.md](../../arc-server-onboard/references/connectivity-options.md#network-egress-minimum-endpoint-set). | +| Proxy credentials expired | `azcmagent show` reports `407 Proxy Authentication Required` | Reconfigure proxy with `azcmagent config set proxy.url ...`. | +| Machine clock skew > 5 min | Token validation fails; `azcmagent logs` shows `nbf` / `exp` errors | Fix NTP / `w32time`. | +| DNS failure | `azcmagent check` shows resolution failures | Restore DNS - especially common with Private Link split-horizon DNS. | +| Disk full | `himds.exe` cannot write logs and crashes | Free disk space. | +| OS upgrade broke the agent service | Service won't start at all | Reinstall the agent. Don't re-onboard - that creates a duplicate. Use the in-place agent installer. | + +### Fast diagnostic command + +```bash +# Windows +azcmagent show; azcmagent check + +# Linux +sudo azcmagent show; sudo azcmagent check +``` + +`azcmagent check` covers DNS, TLS, endpoint reachability, and proxy +validation in one shot. Read its output before anything else. + +## `Expired` + +Agent's Entra token has expired and it can't get a new one. The machine +is essentially unauthenticated even if the network is fine. + +### Causes, in order of likelihood + +| Cause | Signal | Fix | +|---|---|---| +| **Service Principal client secret expired** | The SP used for onboarding has a `passwordCredentials[].endDateTime` in the past. `azcmagent logs` shows `AADSTS7000222`. | Rotate the SP secret. Update it on every affected machine (env var / GPO / Ansible Vault / CM secret). Then `azcmagent connect` (no `--force` needed; supply the new secret). | +| **Tenant deleted the SP** | Resource Graph query `azureresources/identities` shows the SP gone. | Recreate the SP, scope it to the RG, re-connect. | +| **Machine moved tenants** (rare) | `azcmagent show` reports a different `Tenant ID` than expected | Disconnect (`azcmagent disconnect`) and re-onboard to the correct tenant. | +| Machine was offline for > 60 days | The agent's own refresh token aged out completely | Disconnect + re-onboard. | + +### Fleet-wide secret rotation + +If a single SP onboarded hundreds of machines, you have one secret to +rotate. Push the new secret via the same channel you onboarded +(GPO / CM / Ansible Vault), then run `azcmagent connect` on each +machine non-destructively to re-auth. + +```bash +# Non-destructive re-auth with new SP secret +azcmagent connect \ + --resource-group "$RG" \ + --tenant-id "$TENANT" \ + --location "$REGION" \ + --subscription-id "$SUB" \ + --service-principal-id "$SP_APP_ID" \ + --service-principal-secret "$NEW_SECRET" +``` + +If the machine resource already exists, the agent will simply re-auth +against it; you won't create a duplicate. + +## `Error` + +Agent reported in, but its own self-report says something is wrong. +`properties.errorDetails` will have a structured error code + message. + +### Common error codes + +| Code | Meaning | Fix | +|---|---|---| +| `AZCM0007` | Failed to renew certificate | Restart `himds`. If recurring, re-onboard. | +| `AZCM0011` | Cannot reach HIS endpoint | Check `azcmagent check`; fix DNS / firewall to `*.his.arc.azure.com`. | +| `AZCM0028` | Identity certificate expired | Re-auth via `azcmagent connect`. | +| `AZCM0039` | Time skew | Fix NTP. | +| Resource provider not registered | `errorDetails.code` references `Microsoft.HybridCompute` registration | Register the RP in the subscription (`az provider register -n Microsoft.HybridCompute`). | + +The full code list moves over time; consult the +[official azcmagent troubleshooting docs](https://learn.microsoft.com/azure/azure-arc/servers/troubleshoot-agent-onboard) +for the current set. + +### When to disconnect + re-onboard vs repair in place + +| Situation | Action | +|---|---| +| Agent service won't start at all | Reinstall agent in place (don't re-onboard - keeps the resource record stable). | +| Agent is connected but to the wrong RG / sub | `azcmagent disconnect`, delete stale resource, re-onboard to correct target. | +| Agent's identity is fundamentally broken (rotated tenant, corrupt cert store) | `azcmagent disconnect --force-local-only`, delete stale resource in ARM, re-onboard. | +| Just a transient network blip | Don't do anything destructive. Let it self-heal. | + +`azcmagent disconnect` removes the Azure resource. Use +`--force-local-only` only when you have to clean the agent state but +**not** the Azure resource (e.g. the resource is already gone or you'll +delete it manually). + +## `lastStatusChange` heuristics + +The `properties.lastStatusChange` timestamp tells you how long the +machine has been in its current state. + +| Time since `lastStatusChange` | Read | +|---|---| +| < 5 min | Probably transient. Don't act yet. | +| 5-30 min | Worth investigating. | +| 30 min - 24 hr | Definitely broken. | +| > 24 hr | Either the machine is truly offline, or the customer doesn't realize the machine was decommissioned. Consider deleting the resource. | diff --git a/plugin/skills/azure-arc-servers/workflows/arc-server-troubleshoot/references/common-issues.md b/plugin/skills/azure-arc-servers/workflows/arc-server-troubleshoot/references/common-issues.md new file mode 100644 index 000000000..d2a6d4ebe --- /dev/null +++ b/plugin/skills/azure-arc-servers/workflows/arc-server-troubleshoot/references/common-issues.md @@ -0,0 +1,64 @@ +# Common Arc Server Issues - Lookup Table + +Symptom -> most likely cause -> fix. Listed in rough order of frequency. + +## Onboarding failures + +| Symptom (during onboarding) | Cause | Fix | +|---|---|---| +| Script exits with `Failed to connect to Azure: Forbidden` | User / SP missing onboarding role on target RG | Assign **Azure Connected Machine Onboarding** on the RG. | +| Script exits with `Resource provider 'Microsoft.HybridCompute' not registered` | RP not registered in target subscription | `az provider register -n Microsoft.HybridCompute` (also register `Microsoft.GuestConfiguration`, `Microsoft.HybridConnectivity`). Wait 1-2 min. | +| `Failed to download agent: 403 / 404` (Edge / air-gapped) | The default download URL isn't reachable from the air-gapped network | Use alt-download URL (`enablealtdownload` / `enableagcaltdownload`); host the agent installer on an internal blob endpoint. | +| `azcmagent connect: timeout connecting to login.microsoftonline.com` | No outbound 443 or proxy not set | Configure proxy via `azcmagent config set proxy.url ...`, or switch connectivity to Private Link / Arc Gateway. | +| `azcmagent connect: AADSTS90002 Tenant not found` | Wrong tenant ID passed | Re-run with the correct tenant from `az account show --query tenantId`. | +| `azcmagent connect: AADSTS7000215 Invalid client secret` | SP secret typo or expired | Rotate / re-fetch the SP secret; supply on `--service-principal-secret`. | +| Resource shows up but immediately goes `Disconnected` | First heartbeat failed - usually DNS to `*.his.arc.azure.com` | Run `azcmagent check` on machine; fix DNS or proxy. | +| Onboarding succeeds for the user but the resource appears in a wrong tenant | SP belongs to a different tenant than the user expected | Disconnect, re-onboard with correct `--tenant-id`. | +| Group Policy onboarding works on first 80%, fails on remaining 20% | GPO script timed out (default 30s on slow disks) | Increase Computer Configuration > Administrative Templates > System > Group Policy > "Specify maximum wait time for Group Policy scripts" to 5+ minutes. | +| Linux install fails with package manager error | Distro repository unreachable through firewall | Use the alt-download method - download `azcmagent__.` directly and `dpkg -i` / `rpm -i`. | +| Windows install fails with MSI 1603 | MSI installer logging needed | Re-run with `msiexec /i ... /l*v install.log` and read the MSI log. Usually a missing pre-req (TLS 1.2, .NET) or AV quarantine. | + +## Steady-state failures + +| Symptom (after a healthy onboarding) | Cause | Fix | +|---|---|---| +| Machine flips `Connected` <-> `Disconnected` every few minutes | Unreliable network (Wi-Fi roaming, flaky VPN) | Wire the machine or stabilize VPN. Arc tolerates short blips but constant flapping looks unhealthy. | +| `Expired` after exactly 90 days | SP secret was created with default 90-day expiry | Rotate SP secret. Set a longer expiry or use certificate-based SP. | +| Extension stays `Creating` for hours | Manifest endpoint (`pas.windows.net`) unreachable | `azcmagent check`; fix proxy / firewall. | +| Many extensions show `Failed` after a Defender for Cloud rollout | Defender's MDE extension conflicts with existing AV (CrowdStrike, etc.) | Set MDE EDR to passive mode or exclude conflicting AV per Defender guidance. | +| Update Manager reports "No update data" indefinitely | AzureMonitorWindowsAgent / Linux extension not installed | Install the extension via `az connectedmachine extension create --publisher Microsoft.Azure.Monitor --type AzureMonitorWindowsAgent`. | +| Run Command failed | `Microsoft.HybridConnectivity` not registered, or machine doesn't have GuestConfig extension | Register the RP; assign the recommended policies to install GuestConfig. | +| SSH / RDP via Arc fails | Hybrid Connectivity endpoint not configured, or NSG blocks the loopback | `azcmagent config set extensions.allowlist `; enable Hybrid Connectivity endpoint via `az connectedmachine endpoint create`. | +| Tags reverted to "(none)" | A Policy assignment is enforcing tag values | Check policy assignments at sub / RG scope; either include the desired tags in the policy default or set tags via the policy itself. | +| Machine shows up duplicated in Arc | Hostname changed; agent registered a second resource | Delete the old resource via `mcp__azure__resource_delete`. Document a hostname-pinning policy to prevent recurrence. | + +## "Phantom" machines + +| Symptom | Cause | Fix | +|---|---|---| +| Resource exists in ARM but the machine doesn't anymore (decommissioned) | Someone deleted the VM without disconnecting the agent | `mcp__azure__resource_delete` on the Arc resource. | +| Duplicate resource for the same physical machine | Hostname change, OS reinstall, or onboarded twice | Confirm which is current (check `lastStatusChange`), delete the stale one. | +| Machine shows in two different RGs | Onboarded twice with different target RGs | Pick the canonical one (usually the more recent); disconnect from the wrong one with `azcmagent disconnect`. | + +## Permissions / RBAC + +| Symptom | Cause | Fix | +|---|---|---| +| User can see machine but can't install extensions | User has Reader, not Resource Administrator | Assign **Azure Connected Machine Resource Administrator** on the machine or its RG. | +| User can't see machine in browse | No role on the subscription / RG | Assign Reader at the appropriate scope. | +| Policy "Configure to install..." fails to install extension | The policy's managed identity doesn't have the right role on the target RG | Add **Azure Connected Machine Resource Administrator** to the policy's MI. | + +## Decision tree: "I don't know what's wrong" + +1. Run `azcmagent show` on the machine. Read the `Status` line. +2. If not `Connected`, run `azcmagent check`. Read every failed check. +3. If `check` is all green but status is wrong, run `azcmagent logs` + and grep for `ERROR`. +4. If logs are inconclusive, look at `properties.errorDetails` from + `mcp__azure__resource_show`. +5. If still inconclusive, disconnect + re-onboard. Don't keep poking + at a broken state - a clean reconnect is fast and frequently fixes + it. +6. If a clean reconnect also fails, you've narrowed it to a real + network or auth problem - open a support case with `azcmagent logs` + bundle attached. diff --git a/scripts/src/copilot-cli-char-budget.ts b/scripts/src/copilot-cli-char-budget.ts index 72755b1f5..8a88f2161 100644 --- a/scripts/src/copilot-cli-char-budget.ts +++ b/scripts/src/copilot-cli-char-budget.ts @@ -8,9 +8,9 @@ import { escapeXml } from "./shared/string-helpers.js"; const SKILL_CHAR_BUDGET_ENV = "SKILL_CHAR_BUDGET"; // Note: Copilot CLI's default skill char budget is 15000, -// we set our budget to 20000 as a soft cap to not block skill contributions +// we set our budget to 22000 as a soft cap to not block skill contributions // and remind us when the consumption grows too fast. -const DEFAULT_SKILL_CHAR_BUDGET = 20000; +const DEFAULT_SKILL_CHAR_BUDGET = 22000; function getSkillCharBudget() { const envBudget = process.env[SKILL_CHAR_BUDGET_ENV]; diff --git a/tests/azure-arc-servers/__snapshots__/triggers.test.ts.snap b/tests/azure-arc-servers/__snapshots__/triggers.test.ts.snap new file mode 100644 index 000000000..5619bb3af --- /dev/null +++ b/tests/azure-arc-servers/__snapshots__/triggers.test.ts.snap @@ -0,0 +1,155 @@ +// Jest Snapshot v1, https://jestjs.io/docs/snapshot-testing + +exports[`azure-arc-servers - Trigger Tests Trigger Keywords Snapshot skill description triggers match snapshot 1`] = ` +{ + "description": "Azure Arc-enabled servers router. WHEN: Arc Server, Arc-enabled server, Connected Machine agent, azcmagent, hybrid server, on-prem server, onboard to Arc, project a server into Azure, connect machine to Azure, install Connected Machine agent, generate onboarding script, at-scale onboarding, Group Policy onboarding, Configuration Manager onboarding, Ansible onboarding, Service Principal for Arc, Arc Private Link Scope, Arc Gateway, proxy for Arc, agent connectivity, agent status, Connected / Disconnected / Expired / Error, agent upgrade, automatic upgrade, manual upgrade, Extended Security Updates (ESU) for Arc, Hotpatch on Arc, Pay-as-you-go Windows Server, Software Assurance benefits, Microsoft.HybridCompute, hybridcompute/machines. PREFER OVER azure-compute when the user is talking about non-Azure (on-prem, other-cloud, edge) servers being projected into Azure - not about creating an Azure VM.", + "extractedKeywords": [ + "about", + "agent", + "ansible", + "arc", + "arc-enabled", + "assurance", + "at-scale", + "automatic", + "azcmagent", + "azure", + "azure-compute", + "being", + "benefits", + "cli", + "configuration", + "connect", + "connected", + "connectivity", + "creating", + "deploy", + "disconnected", + "edge", + "error", + "expired", + "extended", + "gateway", + "generate", + "group", + "hotpatch", + "hybrid", + "hybridcompute", + "install", + "into", + "link", + "machine", + "machines", + "manager", + "manual", + "mcp", + "microsoft", + "monitor", + "non-azure", + "on-prem", + "onboard", + "onboarding", + "other-cloud", + "over", + "pay-as-you-go", + "policy", + "prefer", + "principal", + "private", + "project", + "projected", + "proxy", + "router", + "scope", + "script", + "security", + "server", + "servers", + "service", + "software", + "status", + "talking", + "updates", + "upgrade", + "user", + "when", + "windows", + ], + "name": "azure-arc-servers", +} +`; + +exports[`azure-arc-servers - Trigger Tests Trigger Keywords Snapshot skill keywords match snapshot 1`] = ` +[ + "about", + "agent", + "ansible", + "arc", + "arc-enabled", + "assurance", + "at-scale", + "automatic", + "azcmagent", + "azure", + "azure-compute", + "being", + "benefits", + "cli", + "configuration", + "connect", + "connected", + "connectivity", + "creating", + "deploy", + "disconnected", + "edge", + "error", + "expired", + "extended", + "gateway", + "generate", + "group", + "hotpatch", + "hybrid", + "hybridcompute", + "install", + "into", + "link", + "machine", + "machines", + "manager", + "manual", + "mcp", + "microsoft", + "monitor", + "non-azure", + "on-prem", + "onboard", + "onboarding", + "other-cloud", + "over", + "pay-as-you-go", + "policy", + "prefer", + "principal", + "private", + "project", + "projected", + "proxy", + "router", + "scope", + "script", + "security", + "server", + "servers", + "service", + "software", + "status", + "talking", + "updates", + "upgrade", + "user", + "when", + "windows", +] +`; diff --git a/tests/azure-arc-servers/integration.test.ts b/tests/azure-arc-servers/integration.test.ts new file mode 100644 index 000000000..2a87b19a8 --- /dev/null +++ b/tests/azure-arc-servers/integration.test.ts @@ -0,0 +1,153 @@ +/** + * Integration Tests for azure-arc-servers + * + * Tests skill behavior with a real Copilot agent session. + * Runs prompts multiple times to measure skill invocation rate. + * + * Prerequisites: + * 1. npm install -g @github/copilot-cli + * 2. Run `copilot` and authenticate + * + * Run with: + * npm run test:integration -- --testPathPatterns=azure-arc-servers + */ + +import { + useAgentRunner, + shouldSkipIntegrationTests, + getIntegrationSkipReason, +} from "../utils/agent-runner"; +import { + isSkillInvoked, + isToolCalled, + softCheckSkill, + withTestResult, +} from "../utils/evaluate"; + +const SKILL_NAME = "azure-arc-servers"; +const ONBOARD_WORKFLOW_PATH = + /workflows\/arc-server-onboard\/arc-server-onboard\.md/i; +const TROUBLESHOOT_WORKFLOW_PATH = + /workflows\/arc-server-troubleshoot\/arc-server-troubleshoot\.md/i; +const MANAGE_WORKFLOW_PATH = + /workflows\/arc-server-manage\/arc-server-manage\.md/i; +const RUNS_PER_PROMPT = 5; +const INVOCATION_RATE_THRESHOLD = 0.8; + +const skipTests = shouldSkipIntegrationTests(); +const skipReason = getIntegrationSkipReason(); + +if (skipTests && skipReason) { + console.log(`⏭️ Skipping integration tests: ${skipReason}`); +} + +const describeIntegration = skipTests ? describe.skip : describe; + +describeIntegration(`${SKILL_NAME}_ - Integration Tests`, () => { + const agent = useAgentRunner(); + + async function expectPromptToInvokeWorkflow( + prompt: string, + workflowPathPattern: RegExp + ): Promise<{ + skillInvocationCount: number; + toolCallCount: number; + }> { + let invocationCount = 0; + let toolCallCount = 0; + for (let i = 0; i < RUNS_PER_PROMPT; i++) { + const agentMetadata = await agent.run({ prompt }); + + softCheckSkill(agentMetadata, SKILL_NAME); + if (isSkillInvoked(agentMetadata, SKILL_NAME)) { + invocationCount += 1; + } + if (isToolCalled(agentMetadata, "view", workflowPathPattern)) { + toolCallCount += 1; + } + } + return { + skillInvocationCount: invocationCount, + toolCallCount: toolCallCount, + }; + } + + describe("skill-invocation", () => { + test("routes onboarding prompt to arc-server-onboard", async () => { + await withTestResult(async ({ setSkillInvocationRate }) => { + const result = await expectPromptToInvokeWorkflow( + "Onboard my on-prem Windows Server to Azure Arc. Generate the install script.", + ONBOARD_WORKFLOW_PATH + ); + const rate = result.skillInvocationCount / RUNS_PER_PROMPT; + setSkillInvocationRate(rate); + expect(rate).toBeGreaterThanOrEqual(INVOCATION_RATE_THRESHOLD); + const referenceViewRate = result.toolCallCount / RUNS_PER_PROMPT; + expect(referenceViewRate).toBeGreaterThanOrEqual( + INVOCATION_RATE_THRESHOLD + ); + }); + }); + + test("'on-prem server to Azure' invokes arc-servers, NOT azure-compute", async () => { + await withTestResult(async ({ setSkillInvocationRate }) => { + let invocationCount = 0; + let wrongSkillInvocations = 0; + for (let i = 0; i < RUNS_PER_PROMPT; i++) { + const agentMetadata = await agent.run({ + prompt: + "I have an existing Windows server in my datacenter. " + + "I want to project it into Azure so I can manage it with Azure Policy.", + }); + + softCheckSkill(agentMetadata, SKILL_NAME); + if (isSkillInvoked(agentMetadata, SKILL_NAME)) { + invocationCount += 1; + } + if (isSkillInvoked(agentMetadata, "azure-compute")) { + wrongSkillInvocations += 1; + } + } + const rate = invocationCount / RUNS_PER_PROMPT; + setSkillInvocationRate(rate); + expect(rate).toBeGreaterThanOrEqual(INVOCATION_RATE_THRESHOLD); + // Wrong skill should never be invoked for a clearly-Arc prompt. + expect(wrongSkillInvocations).toBe(0); + }); + }); + + test("routes agent-disconnected prompt to arc-server-troubleshoot", async () => { + await withTestResult(async ({ setSkillInvocationRate }) => { + const result = await expectPromptToInvokeWorkflow( + "My Azure Arc server is showing Disconnected. The Connected Machine agent " + + "was working yesterday. How do I troubleshoot it?", + TROUBLESHOOT_WORKFLOW_PATH + ); + const rate = result.skillInvocationCount / RUNS_PER_PROMPT; + setSkillInvocationRate(rate); + expect(rate).toBeGreaterThanOrEqual(INVOCATION_RATE_THRESHOLD); + const referenceViewRate = result.toolCallCount / RUNS_PER_PROMPT; + expect(referenceViewRate).toBeGreaterThanOrEqual( + INVOCATION_RATE_THRESHOLD + ); + }); + }); + + test("routes Extended Security Updates prompt to arc-server-manage", async () => { + await withTestResult(async ({ setSkillInvocationRate }) => { + const result = await expectPromptToInvokeWorkflow( + "Enable Extended Security Updates for my Windows Server 2012 R2 " + + "Arc-enabled machine. I need it to keep getting security patches.", + MANAGE_WORKFLOW_PATH + ); + const rate = result.skillInvocationCount / RUNS_PER_PROMPT; + setSkillInvocationRate(rate); + expect(rate).toBeGreaterThanOrEqual(INVOCATION_RATE_THRESHOLD); + const referenceViewRate = result.toolCallCount / RUNS_PER_PROMPT; + expect(referenceViewRate).toBeGreaterThanOrEqual( + INVOCATION_RATE_THRESHOLD + ); + }); + }); + }); +}); diff --git a/tests/azure-arc-servers/triggers.test.ts b/tests/azure-arc-servers/triggers.test.ts new file mode 100644 index 000000000..a15a6cd15 --- /dev/null +++ b/tests/azure-arc-servers/triggers.test.ts @@ -0,0 +1,250 @@ +/** + * Trigger Tests for azure-arc-servers + * + * Tests that verify the skill triggers on appropriate prompts (on-prem, + * other-cloud, and edge servers being projected into Azure via the + * Connected Machine agent) and does NOT trigger on unrelated prompts + * (in particular: Azure VM creation, which belongs to azure-compute). + */ + +import { TriggerMatcher } from "../utils/trigger-matcher"; +import { loadSkill, LoadedSkill } from "../utils/skill-loader"; + +const SKILL_NAME = "azure-arc-servers"; + +describe(`${SKILL_NAME} - Trigger Tests`, () => { + let triggerMatcher: TriggerMatcher; + let skill: LoadedSkill; + + beforeAll(async () => { + skill = await loadSkill(SKILL_NAME); + triggerMatcher = new TriggerMatcher(skill); + }); + + describe("Should Trigger - Onboarding (single server)", () => { + const onboardPrompts: string[] = [ + "How do I onboard my on-prem Windows Server to Azure Arc?", + "Connect my Linux VM running in AWS to Azure Arc", + "Install the Connected Machine agent on my server and project it into Azure", + "Generate an Azure Arc onboarding script for my Windows server", + "Help me azcmagent connect my server to Azure", + "Onboard my hybrid server to Azure Arc so I can manage it with Azure Policy", + "Project my datacenter Linux server into Azure as an Arc-enabled server", + "How do I install the Azure Connected Machine agent on Ubuntu?", + ]; + + test.each(onboardPrompts)('triggers on onboard prompt: "%s"', (prompt) => { + const result = triggerMatcher.shouldTrigger(prompt); + expect(result.triggered).toBe(true); + expect(result.matchedKeywords.length).toBeGreaterThanOrEqual(2); + }); + }); + + describe("Should Trigger - At-scale onboarding", () => { + const atScalePrompts: string[] = [ + "How do I onboard 500 servers to Azure Arc using Group Policy?", + "Use Configuration Manager to onboard my server fleet to Arc", + "Run an Ansible playbook to onboard my Linux servers to Azure Arc", + "Create a Service Principal for at-scale Azure Arc onboarding", + "Roll out the Connected Machine agent across my domain with Group Policy", + "At-scale Arc onboarding with SCCM for Windows servers", + ]; + + test.each(atScalePrompts)( + 'triggers on at-scale onboard prompt: "%s"', + (prompt) => { + const result = triggerMatcher.shouldTrigger(prompt); + expect(result.triggered).toBe(true); + expect(result.matchedKeywords.length).toBeGreaterThanOrEqual(2); + } + ); + }); + + describe("Should Trigger - Connectivity (Private Link / Arc Gateway / proxy)", () => { + const connectivityPrompts: string[] = [ + "Set up an Azure Arc Private Link Scope for my onboarded servers", + "Onboard my server to Arc using Private Link", + "Configure Arc Gateway as a single egress point for my Connected Machine agents", + "My Arc server needs to go through a corporate proxy, how do I configure it?", + "Azure Arc agent connectivity through Private Endpoint", + ]; + + test.each(connectivityPrompts)( + 'triggers on connectivity prompt: "%s"', + (prompt) => { + const result = triggerMatcher.shouldTrigger(prompt); + expect(result.triggered).toBe(true); + expect(result.matchedKeywords.length).toBeGreaterThanOrEqual(2); + } + ); + }); + + describe("Should Trigger - Troubleshooting", () => { + const troubleshootPrompts: string[] = [ + "My Azure Arc server shows Disconnected, how do I troubleshoot it?", + "Connected Machine agent status is Expired on my Arc server", + "azcmagent show says Error, fix my Arc server connectivity", + "Arc server agent status is Disconnected but the machine is online", + "Why is my Connected Machine agent showing as Expired in the portal?", + "Troubleshoot azcmagent on my hybrid server", + ]; + + test.each(troubleshootPrompts)( + 'triggers on troubleshoot prompt: "%s"', + (prompt) => { + const result = triggerMatcher.shouldTrigger(prompt); + expect(result.triggered).toBe(true); + expect(result.matchedKeywords.length).toBeGreaterThanOrEqual(2); + } + ); + }); + + describe("Should Trigger - Agent upgrade / management", () => { + const managePrompts: string[] = [ + "How do I upgrade the Connected Machine agent on all my Arc servers?", + "Enable automatic agent upgrade for my Azure Arc servers", + "Manually upgrade azcmagent on my Linux Arc server", + "What is the current version of the Azure Connected Machine agent?", + ]; + + test.each(managePrompts)('triggers on manage prompt: "%s"', (prompt) => { + const result = triggerMatcher.shouldTrigger(prompt); + expect(result.triggered).toBe(true); + expect(result.matchedKeywords.length).toBeGreaterThanOrEqual(2); + }); + }); + + describe("Should Trigger - Extended Security Updates (ESU)", () => { + const esuPrompts: string[] = [ + "Enable Extended Security Updates for my Windows Server 2012 Arc machine", + "Buy ESU through Azure Arc for my end-of-support Windows servers", + "How do I attach an ESU license to my Arc-enabled server?", + "Activate Extended Security Updates on my Arc Windows Server 2012 R2", + ]; + + test.each(esuPrompts)('triggers on ESU prompt: "%s"', (prompt) => { + const result = triggerMatcher.shouldTrigger(prompt); + expect(result.triggered).toBe(true); + expect(result.matchedKeywords.length).toBeGreaterThanOrEqual(2); + }); + }); + + describe("Should Trigger - Hotpatch on Arc", () => { + const hotpatchPrompts: string[] = [ + "Enable Hotpatch on my Azure Arc Windows Server 2025", + "How do I turn on Hotpatching for my Arc server?", + "Set up Hotpatch on Arc-enabled Windows Server", + ]; + + test.each(hotpatchPrompts)( + 'triggers on Hotpatch prompt: "%s"', + (prompt) => { + const result = triggerMatcher.shouldTrigger(prompt); + expect(result.triggered).toBe(true); + expect(result.matchedKeywords.length).toBeGreaterThanOrEqual(2); + } + ); + }); + + describe("Should Trigger - Pay-as-you-go Windows Server", () => { + const paygoPrompts: string[] = [ + "Switch my Arc Windows server to pay-as-you-go licensing", + "Enable PAYG for Windows Server on my Arc machine", + "Bring my Software Assurance benefits to my Arc-enabled Windows server", + ]; + + test.each(paygoPrompts)('triggers on PAYG prompt: "%s"', (prompt) => { + const result = triggerMatcher.shouldTrigger(prompt); + expect(result.triggered).toBe(true); + expect(result.matchedKeywords.length).toBeGreaterThanOrEqual(2); + }); + }); + + describe("Should NOT Trigger - unrelated topics", () => { + const shouldNotTriggerPrompts: string[] = [ + "What is the weather today?", + "Help me write a poem", + "Explain quantum computing", + "Help me with AWS Systems Manager", + "Configure my PostgreSQL database", + "How do I write a Python web scraper?", + "Set up a Kubernetes cluster with Helm", + "Create a serverless function on AWS", + ]; + + test.each(shouldNotTriggerPrompts)( + 'does not trigger on: "%s"', + (prompt) => { + const result = triggerMatcher.shouldTrigger(prompt); + expect(result.triggered).toBe(false); + } + ); + }); + + // Near-neighbor protection: brand-new VMs *inside* Azure belong to the + // `azure-compute` skill, not arc-servers. + // + // NOTE: TriggerMatcher is a keyword pre-filter, not a router. SKILL.md + // contributes generic keywords (`azure`, `machine`, `server`, `windows`, + // `deploy`, `script`) that any Azure-VM prompt will hit, so we cannot + // assert non-triggering for prompts that share that vocabulary. The + // actual Arc-vs-Compute disambiguation is enforced by the LLM router + // and verified in `integration.test.ts` + // (`'on-prem server to Azure' invokes arc-servers, NOT azure-compute`), + // which asserts `wrongSkillInvocations === 0`. + // + // The two prompts below are unambiguous Azure-compute requests with no + // Arc-keyword overlap — they should not fire the keyword pre-filter and + // serve as a regression guard against keyword drift in SKILL.md. + describe("Should NOT Trigger - unambiguous azure-compute (no Arc keyword overlap)", () => { + const azureComputePrompts: string[] = [ + "Create an Azure VM running Ubuntu in East US", + "Provision an Azure VM scale set for my web tier", + ]; + + test.each(azureComputePrompts)( + 'does not trigger (azure-compute territory) on: "%s"', + (prompt) => { + const result = triggerMatcher.shouldTrigger(prompt); + expect(result.triggered).toBe(false); + } + ); + }); + + describe("Trigger Keywords Snapshot", () => { + test("skill keywords match snapshot", () => { + expect(triggerMatcher.getKeywords()).toMatchSnapshot(); + }); + + test("skill description triggers match snapshot", () => { + expect({ + name: skill.metadata.name, + description: skill.metadata.description, + extractedKeywords: triggerMatcher.getKeywords(), + }).toMatchSnapshot(); + }); + }); + + describe("Edge Cases", () => { + test("handles empty prompt", () => { + const result = triggerMatcher.shouldTrigger(""); + expect(result.triggered).toBe(false); + }); + + test("handles very long prompt", () => { + const longPrompt = "Azure Arc server ".repeat(500); + const result = triggerMatcher.shouldTrigger(longPrompt); + expect(typeof result.triggered).toBe("boolean"); + }); + + test("is case insensitive", () => { + const result1 = triggerMatcher.shouldTrigger( + "ONBOARD MY SERVER TO AZURE ARC" + ); + const result2 = triggerMatcher.shouldTrigger( + "onboard my server to azure arc" + ); + expect(result1.triggered).toBe(result2.triggered); + }); + }); +}); diff --git a/tests/skills.json b/tests/skills.json index 89b2bb6c6..56fd8c72e 100644 --- a/tests/skills.json +++ b/tests/skills.json @@ -4,6 +4,7 @@ "appinsights-instrumentation", "azure-ai", "azure-aigateway", + "azure-arc-servers", "azure-cloud-migrate", "azure-compliance", "azure-compute", @@ -31,6 +32,6 @@ "integrationTestSchedule": { "0 5 * * 2-6": "microsoft-foundry", "0 8 * * 2-6": "azure-deploy", - "0 12 * * 2-6": "airunway-aks-setup,appinsights-instrumentation,azure-ai,azure-aigateway,azure-cloud-migrate,azure-compliance,azure-compute,azure-cost,azure-diagnostics,azure-enterprise-infra-planner,azure-hosted-copilot-sdk,azure-kubernetes,azure-kusto,azure-messaging,azure-prepare,azure-quotas,azure-rbac,azure-resource-lookup,azure-resource-visualizer,azure-storage,azure-upgrade,azure-validate,entra-agent-id,entra-app-registration,azure-reliability" + "0 12 * * 2-6": "airunway-aks-setup,appinsights-instrumentation,azure-ai,azure-aigateway,azure-arc-servers,azure-cloud-migrate,azure-compliance,azure-compute,azure-cost,azure-diagnostics,azure-enterprise-infra-planner,azure-hosted-copilot-sdk,azure-kubernetes,azure-kusto,azure-messaging,azure-prepare,azure-quotas,azure-rbac,azure-resource-lookup,azure-resource-visualizer,azure-storage,azure-upgrade,azure-validate,entra-agent-id,entra-app-registration,azure-reliability" } } \ No newline at end of file