Step-by-step instructions for deploying the Azure SRE Agent closed-loop demo.
| Tool | Version | Notes |
|---|---|---|
| Terraform | >= 1.5 | |
| Azure CLI | latest | |
| PowerShell | >= 7 | |
| GitHub CLI | latest | Required for OIDC setup |
| Azure subscription | | Contributor + User Access Administrator |
| Azure DevOps organization | | Required for SRE Agent work item filing |
| GitHub Copilot Enterprise | | Required for coding agent PR creation |
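
Before deploying, it can help to confirm that the command-line tools from the table are on `PATH`. The following is a minimal sketch, not part of the repository: the function names are hypothetical, `pwsh` is assumed to be the PowerShell 7 executable, and the check does not verify versions.

```shell
# Print the name of every listed tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report missing prerequisites and return non-zero if any are absent.
# Presence only: this does not check the minimum versions from the table.
check_prereqs() {
  missing=$(missing_tools terraform az pwsh gh)
  if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing" >&2
    return 1
  fi
  echo "All prerequisite tools found."
}
```

Run `check_prereqs` in the shell you plan to deploy from; a non-zero exit status means at least one tool still needs to be installed.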
Region constraint: The SRE Agent resource (`Microsoft.App/agents`) is currently available only in `eastus2`, `swedencentral`, and `australiaeast`. Set `location` in `terraform.tfvars` accordingly.
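
A quick guard against an unsupported region might look like the sketch below. The function names are hypothetical, the region list is copied from the constraint above and may change, and the parser assumes a simple `location = "..."` line in the tfvars file.

```shell
# Regions listed in the constraint above; update if availability changes.
SRE_AGENT_REGIONS="eastus2 swedencentral australiaeast"

# Return 0 if the given region is in the supported list, else 1.
is_supported_region() {
  for region in $SRE_AGENT_REGIONS; do
    [ "$region" = "$1" ] && return 0
  done
  return 1
}

# Read the location value from a tfvars file (expects: location = "eastus2").
tfvars_location() {
  sed -n 's/^[[:space:]]*location[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}
```

For example, `is_supported_region "$(tfvars_location infra/terraform.tfvars)" || echo "Unsupported region"` flags a bad value before Terraform runs.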
git clone https://github.com/jonathanscholtes/azure-sre-agent-github-demo
cd azure-sre-agent-github-demo
az login
.\deploy.ps1 -Subscription "<subscription-name-or-id>"

The script runs four phases automatically:
| Phase | What happens |
|---|---|
| 0 – Bootstrap | Creates Azure Storage for Terraform remote state |
| 1 – Infrastructure | terraform init / plan / apply across all modules |
| 2 – Containers | Builds and pushes sre-api and sre-loadgen to ACR |
| 3 – Demo data | Seeds Cosmos DB with sample customers and orders |
Subsequent runs skip the bootstrap step:
.\deploy.ps1 -Subscription "<subscription-name-or-id>" -SkipBootstrap

Wire up OIDC federated credentials and GitHub repository secrets:
.\deploy.ps1 -Subscription "<subscription-name-or-id>" -SetupGitHub
# or standalone:
.\scripts\New-GitHubOidc.ps1

This creates an Entra app registration, grants it Contributor on the subscription, and sets the following GitHub secrets:
- AZURE_CLIENT_ID
- AZURE_TENANT_ID
- AZURE_SUBSCRIPTION_ID
- AZURE_SP_OBJECT_ID
Set one additional secret manually after the bootstrap phase completes:
TF_STATE_STORAGE_ACCOUNT = <storage account name printed by deploy.ps1 phase 0>
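
One way to set this secret from the command line is `gh secret set`, run from inside the cloned repository. The sketch below is an illustration rather than part of the demo scripts: the function names are hypothetical, and the validation reflects the general Azure rule that storage account names are 3 to 24 lowercase letters and digits.

```shell
# Azure storage account names: 3-24 characters, lowercase letters and digits only.
is_valid_storage_name() {
  case "$1" in
    *[!a-z0-9]*) return 1 ;;
  esac
  [ "${#1}" -ge 3 ] && [ "${#1}" -le 24 ]
}

# Set the repository secret with the GitHub CLI after a basic sanity check.
set_state_secret() {
  if ! is_valid_storage_name "$1"; then
    echo "Not a valid storage account name: $1" >&2
    return 1
  fi
  gh secret set TF_STATE_STORAGE_ACCOUNT --body "$1"
}
```

Call it with the name printed by phase 0, for example `set_state_secret "<storage-account-name>"`.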
In the SRE Agent resource, configure inbound incident routing from Azure Monitor:
- Open the SRE Agent resource in the Azure portal
- Go to Builder → Incident Platform
- Select Azure Monitor as the incident platform and complete the connection
- Go to Builder → Incident response plans and create (or customize) a plan for this demo
- Set plan filters (for example severity/service/title) and set run mode to Autonomous for this demo
- Save and enable the plan
The SRE Agent uses Memories to retain resolution, tracking, and routing intent (for example: create an Azure DevOps Issue and route code-fix work to GitHub Copilot). This helps keep behavior consistent across incidents.
You can use the built-in SRE Agent chat to refine the Memories that govern investigation and routing behavior, then update the response plan configuration based on those results.
When incidents match the enabled response plan filters, the agent runs automatically from that incident context.
After terraform apply completes:
- Open the SRE Agent resource in the Azure portal
- Go to Builder → Connectors and create an Azure DevOps connector (org, project, board, auth)
- Select the target Azure DevOps project/repository context for this demo
- Save — when incidents match the enabled response plan, the agent automatically creates/updates Azure DevOps work items using this connector
Known issue (PAT authentication): When using a Personal Access Token, the connector may fall back to Managed Identity, resulting in error `TF401444`. If this occurs, instruct the SRE Agent in chat to call the Azure DevOps REST API directly using the configured PAT.
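
For reference, a direct call to the Azure DevOps work items REST API with PAT authentication generally has the shape sketched below. This is an assumption-laden illustration, not what the agent runs: `ORG`, `PROJECT`, and `AZDO_PAT` are hypothetical environment variable names, the work item type is assumed to be `Issue`, and the escaping helper covers only backslashes and double quotes.

```shell
# Escape backslashes and double quotes so the text is safe inside a JSON string.
# Newlines and other control characters are not handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build the JSON Patch body that sets the work item title.
work_item_body() {
  printf '[{"op":"add","path":"/fields/System.Title","value":"%s"}]' "$(json_escape "$1")"
}

# Create an Issue work item. ORG, PROJECT and AZDO_PAT are hypothetical names.
# The PAT is sent as the password of a Basic auth pair with an empty user name.
create_issue() {
  curl --fail --silent --show-error \
    --user ":${AZDO_PAT}" \
    --header "Content-Type: application/json-patch+json" \
    --request POST \
    --data "$(work_item_body "$1")" \
    "https://dev.azure.com/${ORG}/${PROJECT}/_apis/wit/workitems/\$Issue?api-version=7.1"
}
```

If a manual call like `create_issue "[INC-123] test"` succeeds while the connector fails with `TF401444`, the PAT itself is likely valid and the problem sits in the connector's authentication fallback.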
In the same or a parallel workflow in the SRE Agent portal:
- Go to Builder → Connectors and create a GitHub connector for this repository
- Select this repository in the connector configuration
- Save — for each matching incident, the agent automatically opens a GitHub Issue with diagnostic context and assigns GitHub Copilot to complete the fix
Connector reference: Azure SRE Agent connectors documentation
Both the Azure DevOps work item and the GitHub Issue come from the same SRE Agent investigation. Azure DevOps tracks the incident operationally; the GitHub Issue is the work item Copilot acts on.
Use the following prompt in a new SRE Agent chat thread to confirm that the agent's instructions (ReadMe) are ready for your self-healing pipeline with Azure DevOps and GitHub Copilot. Complete any steps the agent requests to finish setup.
For every incident:
1. ALWAYS create (or reuse if existing) an Azure DevOps Issue.
   - Title: [INC-{id}] {summary}
   - Include full incident details
   - Ensure no duplicates (key = Incident ID)
2. Determine remediation type:
   - Infrastructure / configuration issue
   - Code defect requiring repository change
3. If Infrastructure / Configuration:
   - Execute or recommend remediation (e.g., restart, scale, config update, rollback, IaC change)
   - Record actions taken and outcome in the DevOps Issue
   - Stop
4. If Code Fix Required:
   - Create (or reuse) a GitHub Issue
     - Link the Azure DevOps Issue
     - Include root cause, repro steps, expected fix
     - Assign to Copilot
     - Label: incident, bug
   - Trigger PR workflow on dev branch:
     - Copilot proposes fix + tests
     - PR must:
       - Pass all existing CI checks
       - Include or update tests validating the fix
       - Reference both the GitHub Issue and DevOps Issue
5. Ensure all artifacts are linked (Incident ↔ DevOps ↔ GitHub ↔ PR when applicable)

Rules:
- Never skip DevOps issue creation
- Never create duplicates
- Prefer reuse over new artifacts
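
The title format and the "reuse before create" rule in the prompt can be illustrated with a small sketch. It only demonstrates the matching logic on a list of titles; the function names are hypothetical and it does not reflect how the SRE Agent implements deduplication internally.

```shell
# Format a work item title as described in the prompt: [INC-{id}] {summary}
issue_title() {
  printf '[INC-%s] %s' "$1" "$2"
}

# Given existing titles on stdin (one per line), print the first title that
# already carries this incident ID so it can be reused instead of duplicated.
# The closing bracket in the key prevents INC-4 from matching INC-42.
find_existing() {
  grep -F -- "[INC-$1]" | head -n 1
}
```

An empty result from `find_existing` means no artifact exists yet for that incident ID, so a new one should be created with `issue_title`.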
A branch ruleset on `main` enforces the human gate: Copilot's fix PR cannot merge without a review.
- Go to Settings → Branches → Add branch ruleset
- Target: `main`
- Enable:
  - Require a pull request before merging (1 required approval)
  - Require status checks to pass → add the check run from `.github/workflows/validate.yml` (typically `Lint`)
  - Block force pushes
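
The same ruleset can, in principle, be created through the GitHub REST API instead of the web UI. The sketch below assumes the repository rulesets endpoint and `gh api`; the ruleset name `main-human-gate` is hypothetical, and the check context `Lint` is taken from the "typically" note above, so verify it against your actual check run name.

```shell
# Print a ruleset payload mirroring the settings above.
# "main-human-gate" and the "Lint" context are assumptions; adjust as needed.
ruleset_payload() {
  cat <<'EOF'
{
  "name": "main-human-gate",
  "target": "branch",
  "enforcement": "active",
  "conditions": { "ref_name": { "include": ["refs/heads/main"], "exclude": [] } },
  "rules": [
    { "type": "pull_request",
      "parameters": {
        "required_approving_review_count": 1,
        "dismiss_stale_reviews_on_push": false,
        "require_code_owner_review": false,
        "require_last_push_approval": false,
        "required_review_thread_resolution": false } },
    { "type": "required_status_checks",
      "parameters": {
        "strict_required_status_checks_policy": false,
        "required_status_checks": [ { "context": "Lint" } ] } },
    { "type": "non_fast_forward" }
  ]
}
EOF
}

# Create the ruleset for the repository given as OWNER/REPO (a placeholder).
create_ruleset() {
  ruleset_payload | gh api --method POST "repos/$1/rulesets" --input -
}
```

Usage would be `create_ruleset "<owner>/<repo>"`; the `non_fast_forward` rule corresponds to "Block force pushes" in the UI.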
If you run Terraform directly, copy `infra/terraform.tfvars.example` to `infra/terraform.tfvars` and fill in your values. When you deploy with `deploy.ps1`, the script generates this file automatically.
| Variable | Required | Description |
|---|---|---|
| `subscription_id` | Yes | Azure subscription ID |
| `environment_name` | Yes | Short environment label, e.g. `dev` |
| `project_name` | Yes | Used in resource naming |
| `location` | Yes | Azure region; see the region constraint above |
| `resource_token` | No | Unique suffix; auto-generated if empty |
| `enable_sre_agent` | No | Deploy the SRE Agent resource (default: `false`) |
| `sre_agent_name` | No | Name of the SRE Agent resource (default: `sre-agent`) |
| `sre_agent_access_level` | No | `High` (Reader + Contributor) or `Low` (Reader only); default: `High` |
| `sre_agent_target_resource_groups` | No | Additional resource groups for the agent to monitor |
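
As an illustration of the required variables, the sketch below writes a minimal tfvars file. Every value is a placeholder rather than a default from the repository, `write_tfvars` is a hypothetical helper, and `enable_sre_agent` is assumed to be a boolean.

```shell
# Write a minimal terraform.tfvars to the given path.
# All values are placeholders; replace them before running Terraform.
write_tfvars() {
  cat > "$1" <<'EOF'
subscription_id  = "00000000-0000-0000-0000-000000000000"
environment_name = "dev"
project_name     = "sredemo"
location         = "eastus2"
enable_sre_agent = true
EOF
}
```

For example, `write_tfvars infra/terraform.tfvars` produces a starting point; optional variables are omitted so that their defaults from the table apply.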
.\deploy.ps1 -Subscription "<subscription-name-or-id>" -Destroy

The Terraform remote state storage account is in a separate resource group (`rg-tfstate-sre`) and is not deleted by the above command. Delete it manually when no longer needed.
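
The manual cleanup can be done with the Azure CLI. The sketch below is one possible approach with a hypothetical function name; it checks that the resource group exists before deleting it, and deletion is irreversible, so confirm the name first.

```shell
# Delete the Terraform state resource group if it still exists.
# The default name comes from the note above; pass another name to override.
delete_state_rg() {
  rg="${1:-rg-tfstate-sre}"
  if [ "$(az group exists --name "$rg")" = "true" ]; then
    az group delete --name "$rg" --yes
  else
    echo "Resource group $rg not found; nothing to delete."
  fi
}
```

Run `delete_state_rg` only after the `-Destroy` run has finished, since Terraform still needs the remote state while it tears down the other resources.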
| Symptom | Likely cause | Fix |
|---|---|---|
| `deploy.ps1` fails before Terraform | Not authenticated to the correct subscription | Run `az login`, then rerun with `-Subscription` |
| GitHub OIDC setup fails on `gh secret set` | GitHub CLI is not authenticated | Run `gh auth login` and retry Step 3 |
| SRE Agent does not create incidents from alerts | Azure Monitor incident platform or response plan is not enabled | Recheck Step 4 and confirm the response plan is turned on |
| SRE Agent does not create work items/issues | Azure DevOps/GitHub connector project or repository mapping is not configured correctly | Recheck Steps 5 and 6 and verify connector scope |
| Branch rule blocks merge unexpectedly | Required check name mismatch | Use the check run generated by .github/workflows/validate.yml |