Governance Policies Reliability Backup And Recovery
Governance policies for backup and recovery
Domain: reliability
| Description | Instead |
|---|---|
| Deploying databases without any backup configuration | Configure automated backups with retention matching environment tier (7+ days dev, 30+ days prod) |
| Using locally-redundant backup storage for production workloads | Use geo-redundant backup storage (GRS) for Recovery Services vaults and SQL databases |
| Deploying VMs without Recovery Services vault protection | Protect every production VM with a Recovery Services vault backup policy |
| Setting backup retention to the minimum without business justification | Set retention based on recovery requirements — 14+ days short-term, 12+ months long-term for production |
| Using Cosmos DB Periodic backup mode for production | Use Continuous backup mode for near-zero RPO and point-in-time restore capability |
| Disabling Key Vault purge protection | Always enable purge protection — it prevents permanent destruction of secrets, keys, and certificates |
- Azure Well-Architected Framework — Design for recovery
- SQL Database automated backups
- Cosmos DB continuous backup
- Recovery Services vault overview
- PostgreSQL Flexible backup and restore
| Check | Severity | Description |
|---|---|---|
| WAF-REL-BKP-001 | Required | Configure automated backup for ALL data services. Every database, storage account, and key vault MUST have automated backup enabled with retention policies matching the environment tier. SQL Database and PostgreSQL Flexible have built-in automated backups — configure retention. Cosmos DB has continuous backup mode. Storage accounts use soft delete and versioning. Key Vault uses soft delete and purge protection. NEVER deploy a data service without backup configuration. |
| WAF-REL-BKP-002 | Required | Deploy a Recovery Services vault for VM backups with geo-redundant storage, soft delete, immutability, and backup policies. Every production VM MUST be protected by a Recovery Services vault. Configure backup policies with daily backups, weekly/monthly/yearly retention, and cross-region restore capability. |
| WAF-REL-BKP-003 | Required | Configure point-in-time restore (PITR) for all production databases. SQL Database supports PITR within the short-term retention window (7-35 days). Cosmos DB Continuous backup enables PITR to any second within the retention period (7 or 30 days). PostgreSQL Flexible supports PITR within the backup retention window (7-35 days). PITR is the primary recovery mechanism for application bugs, accidental deletes, and data corruption — it is NOT optional. |
| WAF-REL-BKP-004 | Required | Configure geo-redundant backup storage for all production data services. SQL Database must use Geo backup storage redundancy. PostgreSQL Flexible must enable geoRedundantBackup. Storage accounts must use GZRS or RA-GZRS for critical data. Recovery Services vaults must use GeoRedundant storage with cross-region restore enabled. Backup data must survive a full regional outage. |
| WAF-REL-BKP-005 | Recommended | Implement backup verification and restore testing automation. Deploy Azure Automation runbooks or Logic Apps that periodically validate backup health, test restores to a staging environment, and alert on backup failures. Backup without tested restores is a false sense of security. Use Recovery Services vault backup reports and Azure Monitor alerts to track backup health. |
WAF-REL-BKP-001: Configure automated backup for ALL data services. Every database, storage account, and key vault MUST have automated backup enabled with retention policies matching the environment tier. SQL Database and PostgreSQL Flexible have built-in automated backups; configure their retention. Cosmos DB has continuous backup mode. Storage accounts use soft delete and versioning. Key Vault uses soft delete and purge protection. NEVER deploy a data service without backup configuration.
Severity: Required
Rationale: Data loss is the most severe reliability failure. Automated backups are the last line of defense against accidental deletion, corruption, ransomware, and application bugs. Manual backups are unreliable because they depend on human discipline.
Agents: terraform-agent, bicep-agent, cloud-architect
- Microsoft.Sql/servers/databases
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.DBforPostgreSQL/flexibleServers
- Microsoft.DBforMySQL/flexibleServers
- Microsoft.Sql/servers/databases
- Microsoft.Sql/servers/databases/backupShortTermRetentionPolicies
- Microsoft.Sql/servers/databases/backupLongTermRetentionPolicies
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.DBforPostgreSQL/flexibleServers
- Microsoft.Storage/storageAccounts/blobServices
- Microsoft.KeyVault/vaults
- Microsoft.KeyVault/vaults
- Microsoft.RecoveryServices/vaults
- Microsoft.DataProtection/backupVaults
- Microsoft.ContainerService/managedClusters
- Microsoft.Compute/virtualMachines
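The WAF-REL-BKP-001 requirements can be sketched as a small check over ARM-style resource dictionaries. This is a minimal illustration, not the agents' actual implementation: the function name `backup_findings` is hypothetical, the property paths (`properties.backup.backupRetentionDays`, `properties.backupPolicy.type`, `enableSoftDelete`, `enablePurgeProtection`) are assumptions about the ARM resource shapes that should be verified against the current API versions, and the retention minimums come from the 7+ day dev and 30+ day prod guidance on this page.

```python
from typing import Any

# Minimum retention per environment tier, from the guidance on this page
# (7+ days dev, 30+ days prod).
MIN_RETENTION_DAYS = {"dev": 7, "prod": 30}


def backup_findings(resource: dict[str, Any], tier: str) -> list[str]:
    """Return findings for one ARM-style resource dict (empty list = compliant)."""
    rtype = resource.get("type", "")
    props = resource.get("properties", {})
    findings: list[str] = []

    if rtype == "Microsoft.DBforPostgreSQL/flexibleServers":
        # Assumed property path: properties.backup.backupRetentionDays
        days = props.get("backup", {}).get("backupRetentionDays", 0)
        minimum = MIN_RETENTION_DAYS[tier]
        if days < minimum:
            findings.append(f"backup retention {days}d is below {minimum}d for {tier}")
    elif rtype == "Microsoft.DocumentDB/databaseAccounts":
        # Assumed property path: properties.backupPolicy.type
        if props.get("backupPolicy", {}).get("type") != "Continuous":
            findings.append("Cosmos DB is not using Continuous backup mode")
    elif rtype == "Microsoft.KeyVault/vaults":
        # Assumed property names: enableSoftDelete, enablePurgeProtection
        if props.get("enableSoftDelete") is False:
            findings.append("Key Vault soft delete is disabled")
        if not props.get("enablePurgeProtection", False):
            findings.append("Key Vault purge protection is not enabled")
    return findings
```

Only three of the listed resource types are covered here; SQL Database and storage accounts would need their own branches, and any other type currently returns an empty list rather than an error.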
WAF-REL-BKP-002: Deploy a Recovery Services vault for VM backups with geo-redundant storage, soft delete, immutability, and backup policies. Every production VM MUST be protected by a Recovery Services vault. Configure backup policies with daily backups, weekly/monthly/yearly retention, and cross-region restore capability.
Severity: Required
Rationale: Recovery Services vault is the central backup management plane for VMs, SQL in VMs, and file shares. Without vault protection, VM data can be lost to disk failure or accidental deletion. GRS ensures backups survive regional disasters.
Agents: terraform-agent, bicep-agent, cloud-architect
- Microsoft.Sql/servers/databases
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.DBforPostgreSQL/flexibleServers
- Microsoft.DBforMySQL/flexibleServers
- Microsoft.RecoveryServices/vaults
- Microsoft.RecoveryServices/vaults/backupstorageconfig
- Microsoft.RecoveryServices/vaults/backupPolicies
- Microsoft.RecoveryServices/vaults/backupFabrics/protectionContainers/protectedItems
- Microsoft.KeyVault/vaults
- Microsoft.RecoveryServices/vaults
- Microsoft.DataProtection/backupVaults
- Microsoft.ContainerService/managedClusters
- Microsoft.Compute/virtualMachines
| Resource | Name | Purpose |
|---|---|---|
| Microsoft.Network/privateEndpoints | pe-resource | Private endpoint for Recovery Services vault (groupId: AzureBackup) |
| Microsoft.Network/privateDnsZones | privatelink.service.azure.com | Private DNS zone privatelink.{region}.backup.windowsazure.com for vault private endpoint |
| Microsoft.Insights/diagnosticSettings | diag-udr | Diagnostic settings to route vault operation logs and backup health to Log Analytics |
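The WAF-REL-BKP-002 rule that every production VM is protected reduces to a set difference between the VM inventory and the source resource IDs of the vault's protected items. The sketch below assumes both lists have already been fetched (for example from an inventory query); the function name and the resource IDs are hypothetical and used for illustration only.

```python
def unprotected_vms(vm_ids: list[str], protected_source_ids: list[str]) -> list[str]:
    """Return production VM resource IDs that no protected item points at."""
    # Azure resource IDs are case-insensitive, so compare in lower case.
    protected = {rid.lower() for rid in protected_source_ids}
    return sorted(vm for vm in vm_ids if vm.lower() not in protected)


# Hypothetical resource IDs, for illustration only.
prefix = "/subscriptions/0000/resourceGroups/rg-prod/providers/Microsoft.Compute/virtualMachines/"
vms = [prefix + "vm-web", prefix + "vm-db"]
protected = [prefix.upper() + "VM-WEB"]
print(unprotected_vms(vms, protected))  # only vm-db is reported
```

Comparing in lower case avoids false positives when the vault and the inventory report the same ID with different casing.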
WAF-REL-BKP-003: Configure point-in-time restore (PITR) for all production databases. SQL Database supports PITR within the short-term retention window (7-35 days). Cosmos DB Continuous backup enables PITR to any second within the retention period (7 or 30 days). PostgreSQL Flexible supports PITR within the backup retention window (7-35 days). PITR is the primary recovery mechanism for application bugs, accidental deletes, and data corruption; it is NOT optional.
Severity: Required
Rationale: PITR enables recovery to the exact moment before a data-corrupting event. Traditional full-backup restore loses all data since the last backup (hours of RPO). PITR provides near-zero RPO (seconds for Cosmos, minutes for SQL/PostgreSQL).
Agents: terraform-agent, bicep-agent, cloud-architect
- Microsoft.Sql/servers/databases
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.DBforPostgreSQL/flexibleServers
- Microsoft.DBforMySQL/flexibleServers
- Microsoft.Sql/servers/databases/backupShortTermRetentionPolicies
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.DBforPostgreSQL/flexibleServers
- Microsoft.KeyVault/vaults
- Microsoft.RecoveryServices/vaults
- Microsoft.DataProtection/backupVaults
- Microsoft.ContainerService/managedClusters
- Microsoft.Compute/virtualMachines
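To make the WAF-REL-BKP-003 retention windows concrete, the sketch below checks whether a desired restore timestamp still falls inside the PITR window. The allowed retention values are copied from the policy text above; the function name `can_restore_to` and the service keys are hypothetical, and the dates are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Allowed retention values in days, as stated in the policy text above.
PITR_RETENTION_DAYS = {
    "sql": range(7, 36),         # 7-35 days
    "postgresql": range(7, 36),  # 7-35 days
    "cosmos": (7, 30),           # 7 or 30 days
}


def can_restore_to(service: str, retention_days: int,
                   target: datetime, now: datetime) -> bool:
    """True if `target` lies inside the PITR window that ends at `now`."""
    if retention_days not in PITR_RETENTION_DAYS[service]:
        raise ValueError(f"{retention_days} days is not a valid window for {service}")
    earliest = now - timedelta(days=retention_days)
    return earliest <= target <= now


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
incident = datetime(2024, 1, 20, tzinfo=timezone.utc)
print(can_restore_to("sql", 14, incident, now))    # window starts Jan 17
print(can_restore_to("cosmos", 7, incident, now))  # window starts Jan 24
```

With a 14-day window the incident on January 20 is still recoverable; with a 7-day window it is not, which is why the retention value should be chosen from how long a corruption can go unnoticed.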
WAF-REL-BKP-004: Configure geo-redundant backup storage for all production data services. SQL Database must use Geo backup storage redundancy. PostgreSQL Flexible must enable geoRedundantBackup. Storage accounts must use GZRS or RA-GZRS for critical data. Recovery Services vaults must use GeoRedundant storage with cross-region restore enabled. Backup data must survive a full regional outage.
Severity: Required
Rationale: Locally-redundant backups are lost in a regional disaster (earthquake, flood, extended power outage). Geo-redundant backups are replicated to the Azure paired region, ensuring recovery even when the entire primary region is unavailable.
Agents: terraform-agent, bicep-agent, cloud-architect
- Microsoft.Sql/servers/databases
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.DBforPostgreSQL/flexibleServers
- Microsoft.DBforMySQL/flexibleServers
- Microsoft.Sql/servers/databases
- Microsoft.DBforPostgreSQL/flexibleServers
- Microsoft.Storage/storageAccounts
- Microsoft.RecoveryServices/vaults/backupstorageconfig
- Microsoft.KeyVault/vaults
- Microsoft.RecoveryServices/vaults
- Microsoft.DataProtection/backupVaults
- Microsoft.ContainerService/managedClusters
- Microsoft.Compute/virtualMachines
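The four WAF-REL-BKP-004 conditions map onto one property test per resource type. The following is a sketch under assumptions: `is_geo_redundant` is a hypothetical helper, and the property names (`requestedBackupStorageRedundancy`, `backup.geoRedundantBackup`, `sku.name`, `storageModelType`, `crossRegionRestoreFlag`) reflect the ARM shapes as understood at the time of writing and should be checked against the API version in use. It accepts only the values this policy names.

```python
from typing import Any


def is_geo_redundant(resource: dict[str, Any]) -> bool:
    """Evaluate the geo-redundancy rule for one ARM-style resource dict."""
    rtype = resource.get("type", "")
    props = resource.get("properties", {})
    if rtype == "Microsoft.Sql/servers/databases":
        return props.get("requestedBackupStorageRedundancy") == "Geo"
    if rtype == "Microsoft.DBforPostgreSQL/flexibleServers":
        return props.get("backup", {}).get("geoRedundantBackup") == "Enabled"
    if rtype == "Microsoft.Storage/storageAccounts":
        # GZRS or RA-GZRS, as required for critical data.
        return resource.get("sku", {}).get("name") in {"Standard_GZRS", "Standard_RAGZRS"}
    if rtype == "Microsoft.RecoveryServices/vaults/backupstorageconfig":
        return (props.get("storageModelType") == "GeoRedundant"
                and props.get("crossRegionRestoreFlag") is True)
    # Fail loudly instead of silently passing an unknown resource type.
    raise ValueError(f"no geo-redundancy rule for {rtype!r}")
```

Raising on unknown types is a deliberate choice here: a checker that returns `True` for anything it does not recognise would hide gaps in coverage.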
WAF-REL-BKP-005: Implement backup verification and restore testing automation. Deploy Azure Automation runbooks or Logic Apps that periodically validate backup health, test restores to a staging environment, and alert on backup failures. Backup without tested restores is a false sense of security. Use Recovery Services vault backup reports and Azure Monitor alerts to track backup health.
Severity: Recommended
Rationale: Untested backups frequently fail at restore time due to corruption, missing dependencies, or configuration drift. Regular restore testing proves recoverability and measures actual RTO. Backup health monitoring catches failures before they become critical.
Agents: terraform-agent, bicep-agent, cloud-architect
- Microsoft.Sql/servers/databases
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.DBforPostgreSQL/flexibleServers
- Microsoft.DBforMySQL/flexibleServers
- Microsoft.Insights/scheduledQueryRules
- Microsoft.KeyVault/vaults
- Microsoft.RecoveryServices/vaults
- Microsoft.DataProtection/backupVaults
- Microsoft.ContainerService/managedClusters
- Microsoft.Compute/virtualMachines
| Resource | Name | Purpose |
|---|---|---|
| Microsoft.Insights/actionGroups | ag-ops | Action group for backup failure notifications (email, SMS, webhook) |
| Microsoft.Insights/diagnosticSettings | diag-resource | Diagnostic settings on Recovery Services vault to send backup logs to Log Analytics |
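One piece of the WAF-REL-BKP-005 health validation, flagging items whose last successful backup is too old, can be sketched as follows. The input is assumed to be a mapping from item name to last-success timestamp that a runbook has already extracted from vault logs; the function name, the item names, and the 26-hour default threshold (a daily schedule plus some slack) are all assumptions for illustration, not values from the policy.

```python
from datetime import datetime, timedelta, timezone


def stale_backups(last_success: dict[str, datetime], now: datetime,
                  max_age: timedelta = timedelta(hours=26)) -> list[str]:
    """Return the names of items whose last successful backup exceeds max_age."""
    return sorted(name for name, ts in last_success.items() if now - ts > max_age)


# Hypothetical sample data, for illustration only.
now = datetime(2024, 1, 31, 12, 0, tzinfo=timezone.utc)
last_success = {
    "vm-web": now - timedelta(hours=5),
    "vm-db": now - timedelta(days=3),
}
print(stale_backups(last_success, now))  # only vm-db exceeds the threshold
```

A check like this covers backup freshness only; it does not replace periodic restore tests, which are what prove recoverability and measure actual RTO.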