Clean up documentation and resolve role duplicates #12

Merged
jonathanscholtes merged 1 commit into main from dev
May 4, 2026

Conversation

@jonathanscholtes
Owner

This pull request introduces major improvements to monitoring, alerting, and deployment automation for the SRE Agent demo environment. The changes include adding robust alerting capabilities in the infrastructure, enhancing deployment scripts for idempotency and uniqueness, and updating documentation to guide users through new workflows and known issues.

Monitoring and Alerting Enhancements:

  • Added configurable alerting to the monitor Terraform module, including variables and resources for action groups, email/webhook receivers, and alert rules for failed requests, response time, and exception spikes. This enables automated incident detection and notification via Application Insights and Azure Monitor.
  • Exposed alert configuration variables (enable_alerts, alert_email_receivers, alert_webhook_receivers, thresholds) at the root level for easy customization per environment.

Deployment and Automation Improvements:

  • Updated deployment scripts to generate a unique resource token per deployment using both subscription ID and timestamp, ensuring resource name uniqueness and preventing collisions. The script now skips token regeneration if a token already exists, making deployments idempotent.
  • Simplified and clarified SRE Agent role assignment logic in Terraform, removing redundant resource group role assignments and ensuring all necessary roles are assigned via the target resource groups list.
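
For readers who want a concrete picture of the resource token behavior described above, here is a minimal Python sketch. It is illustrative only and is not the deployment script from this PR: the function names, the JSON state file, the SHA-256 hash, and the 8-character length are all assumptions. It shows the two ideas from the bullet: derive the token from the subscription ID plus a timestamp, and reuse a stored token instead of regenerating it.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def generate_resource_token(subscription_id: str, timestamp: str, length: int = 8) -> str:
    """Derive a short hex token from the subscription ID and a timestamp.

    Hypothetical helper: the hash function and token length are assumptions.
    """
    digest = hashlib.sha256(f"{subscription_id}:{timestamp}".encode("utf-8")).hexdigest()
    return digest[:length]


def get_or_create_token(state_file: Path, subscription_id: str) -> str:
    """Return the stored token if one exists; otherwise create and persist one.

    The state file location and format are assumptions made for this sketch.
    """
    if state_file.exists():
        stored = json.loads(state_file.read_text()).get("resource_token")
        if stored:
            # Reuse the existing token so a re-run targets the same resource names.
            return stored

    # First run: mix in a UTC timestamp so separate deployments get distinct tokens.
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    token = generate_resource_token(subscription_id, timestamp)
    state_file.write_text(json.dumps({"resource_token": token}))
    return token
```

Calling get_or_create_token twice with the same state file returns the same token, which is what keeps repeated deployments from creating a second set of resources.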

Documentation Updates:

  • Added a new step to the deployment guide detailing how to configure a self-healing pipeline using SRE Chat, including a comprehensive workflow for incident handling, remediation, and integration between Azure DevOps and GitHub Copilot.
  • Documented a known issue with Azure DevOps PAT authentication and provided a workaround for users.

These changes collectively improve the reliability, observability, and usability of the SRE Agent demo, making it easier to detect, respond to, and remediate incidents in a production-like environment.

@jonathanscholtes jonathanscholtes merged commit c56aad6 into main May 4, 2026
1 check passed