Governance Policies Reliability Backup And Recovery

Joshua Davis edited this page Apr 5, 2026 · 2 revisions

Backup & Recovery

Governance policies for Backup & Recovery

Domain: reliability

Anti-Patterns

| Description | Instead |
| --- | --- |
| Deploying databases without any backup configuration | Configure automated backups with retention matching environment tier (7+ days dev, 30+ days prod) |
| Using locally-redundant backup storage for production workloads | Use geo-redundant backup storage (GRS) for Recovery Services vaults and SQL databases |
| Deploying VMs without Recovery Services vault protection | Protect every production VM with a Recovery Services vault backup policy |
| Setting backup retention to the minimum without business justification | Set retention based on recovery requirements: 14+ days short-term, 12+ months long-term for production |
| Using Cosmos DB Periodic backup mode for production | Use Continuous backup mode for near-zero RPO and point-in-time restore capability |
| Disabling Key Vault purge protection | Always enable purge protection; it prevents permanent destruction of secrets, keys, and certificates |

Checks (5)

| Check | Severity | Description |
| --- | --- | --- |
| WAF-REL-BKP-001 | Required | Configure automated backup for ALL data services. Every database, storage account, and key vault MUST have automated backup enabled with retention policies matching the environment tier. SQL Database and PostgreSQL Flexible have built-in automated backups; configure retention. Cosmos DB has continuous backup mode. Storage accounts use soft delete and versioning. Key Vault uses soft delete and purge protection. NEVER deploy a data service without backup configuration. |
| WAF-REL-BKP-002 | Required | Deploy a Recovery Services vault for VM backups with geo-redundant storage, soft delete, immutability, and backup policies. Every production VM MUST be protected by a Recovery Services vault. Configure backup policies with daily backups, weekly/monthly/yearly retention, and cross-region restore capability. |
| WAF-REL-BKP-003 | Required | Configure point-in-time restore (PITR) for all production databases. SQL Database supports PITR within the short-term retention window (7-35 days). Cosmos DB Continuous backup enables PITR to any second within the retention period (7 or 30 days). PostgreSQL Flexible supports PITR within the backup retention window (7-35 days). PITR is the primary recovery mechanism for application bugs, accidental deletes, and data corruption; it is NOT optional. |
| WAF-REL-BKP-004 | Required | Configure geo-redundant backup storage for all production data services. SQL Database must use Geo backup storage redundancy. PostgreSQL Flexible must enable geoRedundantBackup. Storage accounts must use GZRS or RA-GZRS for critical data. Recovery Services vaults must use GeoRedundant storage with cross-region restore enabled. Backup data must survive a full regional outage. |
| WAF-REL-BKP-005 | Recommended | Implement backup verification and restore testing automation. Deploy Azure Automation runbooks or Logic Apps that periodically validate backup health, test restores to a staging environment, and alert on backup failures. Backup without tested restores is a false sense of security. Use Recovery Services vault backup reports and Azure Monitor alerts to track backup health. |

WAF-REL-BKP-001

Configure automated backup for ALL data services. Every database, storage account, and key vault MUST have automated backup enabled with retention policies matching the environment tier. SQL Database and PostgreSQL Flexible have built-in automated backups — configure retention. Cosmos DB has continuous backup mode. Storage accounts use soft delete and versioning. Key Vault uses soft delete and purge protection. NEVER deploy a data service without backup configuration.

Severity: Required
Rationale: Data loss is the most severe reliability failure. Automated backups are the last line of defense against accidental deletion, corruption, ransomware, and application bugs. Manual backups are unreliable because they depend on human discipline.
Agents: terraform-agent, bicep-agent, cloud-architect

Targets

  • Microsoft.Sql/servers/databases
  • Microsoft.DocumentDB/databaseAccounts
  • Microsoft.DBforPostgreSQL/flexibleServers
  • Microsoft.DBforMySQL/flexibleServers
  • Microsoft.Sql/servers/databases/backupShortTermRetentionPolicies
  • Microsoft.Sql/servers/databases/backupLongTermRetentionPolicies
  • Microsoft.Storage/storageAccounts/blobServices
  • Microsoft.KeyVault/vaults
  • Microsoft.RecoveryServices/vaults
  • Microsoft.DataProtection/backupVaults
  • Microsoft.ContainerService/managedClusters
  • Microsoft.Compute/virtualMachines
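
As an illustration of how an agent might evaluate WAF-REL-BKP-001, the following is a minimal sketch, not the wiki's actual implementation. The resource dictionary shape (`name`, `backup_enabled`, `retention_days`), the tier names, and the function name are all hypothetical; the minimums (7 days dev, 30 days prod) come from the Anti-Patterns table.

```python
from __future__ import annotations

# Minimum retention per environment tier, taken from the Anti-Patterns table
# (7+ days dev, 30+ days prod). The tier keys are assumptions.
MIN_RETENTION_DAYS = {"dev": 7, "prod": 30}


def check_backup_retention(resource: dict, tier: str) -> list[str]:
    """Return WAF-REL-BKP-001 findings for one data service.

    `resource` is a hypothetical flattened view of a deployed resource:
    {"name": str, "backup_enabled": bool, "retention_days": int}.
    """
    name = resource.get("name", "<unnamed>")
    # A data service with no backup configuration fails outright.
    if not resource.get("backup_enabled", False):
        return [f"{name}: no backup configuration"]
    required = MIN_RETENTION_DAYS[tier]
    actual = resource.get("retention_days", 0)
    if actual < required:
        return [
            f"{name}: retention {actual}d is below the {required}d minimum for {tier}"
        ]
    return []
```

An empty list means the resource passes; each string describes one violation that a reviewer or agent could surface.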

WAF-REL-BKP-002

Deploy a Recovery Services vault for VM backups with geo-redundant storage, soft delete, immutability, and backup policies. Every production VM MUST be protected by a Recovery Services vault. Configure backup policies with daily backups, weekly/monthly/yearly retention, and cross-region restore capability.

Severity: Required
Rationale: Recovery Services vault is the central backup management plane for VMs, SQL in VMs, and file shares. Without vault protection, VM data is lost on disk failure or accidental deletion. GRS ensures backups survive regional disasters.
Agents: terraform-agent, bicep-agent, cloud-architect

Targets

  • Microsoft.Sql/servers/databases
  • Microsoft.DocumentDB/databaseAccounts
  • Microsoft.DBforPostgreSQL/flexibleServers
  • Microsoft.DBforMySQL/flexibleServers
  • Microsoft.RecoveryServices/vaults
  • Microsoft.RecoveryServices/vaults/backupstorageconfig
  • Microsoft.RecoveryServices/vaults/backupPolicies
  • Microsoft.RecoveryServices/vaults/backupFabrics/protectionContainers/protectedItems
  • Microsoft.KeyVault/vaults
  • Microsoft.DataProtection/backupVaults
  • Microsoft.ContainerService/managedClusters
  • Microsoft.Compute/virtualMachines

Companion Resources

| Resource | Name | Purpose |
| --- | --- | --- |
| Microsoft.Network/privateEndpoints | pe-resource | Private endpoint for Recovery Services vault (groupId: AzureBackup) |
| Microsoft.Network/privateDnsZones | privatelink.service.azure.com | Private DNS zone privatelink.{region}.backup.windowsazure.com for vault private endpoint |
| Microsoft.Insights/diagnosticSettings | diag-udr | Diagnostic settings to route vault operation logs and backup health to Log Analytics |
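
The WAF-REL-BKP-002 requirements can be expressed as a simple checklist over a vault summary. The sketch below assumes a hypothetical dictionary shape (`storage_redundancy`, `soft_delete`, `immutability`, `cross_region_restore`, `policy`); these keys are illustrative and are not ARM property names.

```python
from __future__ import annotations

# Retention tiers the check requires in addition to the daily backup.
REQUIRED_RETENTION_TIERS = ("weekly", "monthly", "yearly")


def check_vault(vault: dict) -> list[str]:
    """Return WAF-REL-BKP-002 findings for a hypothetical vault summary dict."""
    findings = []
    if vault.get("storage_redundancy") != "GeoRedundant":
        findings.append("storage is not geo-redundant")
    # Each protective feature must be switched on explicitly.
    for flag in ("soft_delete", "immutability", "cross_region_restore"):
        if not vault.get(flag, False):
            findings.append(f"{flag} is not enabled")
    policy = vault.get("policy", {})
    if policy.get("schedule") != "daily":
        findings.append("backup policy does not run daily")
    retention = policy.get("retention", {})
    missing = [tier for tier in REQUIRED_RETENTION_TIERS if tier not in retention]
    if missing:
        findings.append("missing retention tiers: " + ", ".join(missing))
    return findings
```

A vault that satisfies every requirement yields an empty list, so the function can be used directly as a pass/fail gate.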

WAF-REL-BKP-003

Configure point-in-time restore (PITR) for all production databases. SQL Database supports PITR within the short-term retention window (7-35 days). Cosmos DB Continuous backup enables PITR to any second within the retention period (7 or 30 days). PostgreSQL Flexible supports PITR within the backup retention window (7-35 days). PITR is the primary recovery mechanism for application bugs, accidental deletes, and data corruption — it is NOT optional.

Severity: Required
Rationale: PITR enables recovery to the exact moment before a data-corrupting event. Traditional full-backup restore loses all data since the last backup (hours of RPO). PITR provides near-zero RPO (seconds for Cosmos, minutes for SQL/PostgreSQL).
Agents: terraform-agent, bicep-agent, cloud-architect

Targets

  • Microsoft.Sql/servers/databases
  • Microsoft.DocumentDB/databaseAccounts
  • Microsoft.DBforPostgreSQL/flexibleServers
  • Microsoft.DBforMySQL/flexibleServers
  • Microsoft.Sql/servers/databases/backupShortTermRetentionPolicies
  • Microsoft.KeyVault/vaults
  • Microsoft.RecoveryServices/vaults
  • Microsoft.DataProtection/backupVaults
  • Microsoft.ContainerService/managedClusters
  • Microsoft.Compute/virtualMachines
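
To make the retention windows in WAF-REL-BKP-003 concrete, the sketch below tests whether a desired restore point is still reachable for a given retention setting. It is a simplification: the function name is hypothetical, and it ignores service-specific constraints such as the earliest restore point available after a database is created.

```python
from datetime import datetime, timedelta, timezone


def within_pitr_window(restore_point: datetime, now: datetime, retention_days: int) -> bool:
    """Return True if `restore_point` lies within the last `retention_days` days."""
    earliest = now - timedelta(days=retention_days)
    return earliest <= restore_point <= now


now = datetime(2026, 4, 5, 12, 0, tzinfo=timezone.utc)
ten_days_ago = now - timedelta(days=10)

# With the 7-day minimum, an incident from 10 days ago is out of reach.
print(within_pitr_window(ten_days_ago, now, retention_days=7))   # False
# With the 35-day maximum, the same incident is recoverable.
print(within_pitr_window(ten_days_ago, now, retention_days=35))  # True
```

The comparison shows why the retention setting should reflect how long a corruption event might go unnoticed, not just the service default.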

WAF-REL-BKP-004

Configure geo-redundant backup storage for all production data services. SQL Database must use Geo backup storage redundancy. PostgreSQL Flexible must enable geoRedundantBackup. Storage accounts must use GZRS or RA-GZRS for critical data. Recovery Services vaults must use GeoRedundant storage with cross-region restore enabled. Backup data must survive a full regional outage.

Severity: Required
Rationale: Locally-redundant backups are lost in a regional disaster (earthquake, flood, extended power outage). Geo-redundant backups are replicated to the Azure paired region, ensuring recovery even when the entire primary region is unavailable.
Agents: terraform-agent, bicep-agent, cloud-architect

Targets

  • Microsoft.Sql/servers/databases
  • Microsoft.DocumentDB/databaseAccounts
  • Microsoft.DBforPostgreSQL/flexibleServers
  • Microsoft.DBforMySQL/flexibleServers
  • Microsoft.Storage/storageAccounts
  • Microsoft.RecoveryServices/vaults/backupstorageconfig
  • Microsoft.KeyVault/vaults
  • Microsoft.RecoveryServices/vaults
  • Microsoft.DataProtection/backupVaults
  • Microsoft.ContainerService/managedClusters
  • Microsoft.Compute/virtualMachines
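
The per-service requirements in WAF-REL-BKP-004 lend themselves to a lookup table. The following sketch is an assumption-laden illustration: the property keys (`backup_storage_redundancy`, `geo_redundant_backup`, `sku`, `storage_type`) are hypothetical, and the accepted values follow the wording of the check rather than exact ARM property values, which may differ.

```python
from typing import Optional

# Resource type -> (hypothetical property key, accepted values from the check text).
GEO_RULES = {
    "Microsoft.Sql/servers/databases": ("backup_storage_redundancy", {"Geo"}),
    "Microsoft.DBforPostgreSQL/flexibleServers": ("geo_redundant_backup", {"Enabled"}),
    "Microsoft.Storage/storageAccounts": ("sku", {"GZRS", "RA-GZRS"}),
    "Microsoft.RecoveryServices/vaults/backupstorageconfig": ("storage_type", {"GeoRedundant"}),
}


def is_geo_redundant(resource_type: str, properties: dict) -> Optional[bool]:
    """Return True/False for covered resource types, None if no rule applies."""
    rule = GEO_RULES.get(resource_type)
    if rule is None:
        return None
    key, accepted = rule
    return properties.get(key) in accepted
```

Returning `None` for unknown types keeps "not covered by this sketch" distinct from "covered and non-compliant".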

WAF-REL-BKP-005

Implement backup verification and restore testing automation. Deploy Azure Automation runbooks or Logic Apps that periodically validate backup health, test restores to a staging environment, and alert on backup failures. Backup without tested restores is a false sense of security. Use Recovery Services vault backup reports and Azure Monitor alerts to track backup health.

Severity: Recommended
Rationale: Untested backups frequently fail at restore time due to corruption, missing dependencies, or configuration drift. Regular restore testing proves recoverability and measures actual RTO. Backup health monitoring catches failures before they become critical.
Agents: terraform-agent, bicep-agent, cloud-architect

Targets

  • Microsoft.Sql/servers/databases
  • Microsoft.DocumentDB/databaseAccounts
  • Microsoft.DBforPostgreSQL/flexibleServers
  • Microsoft.DBforMySQL/flexibleServers
  • Microsoft.Insights/scheduledQueryRules
  • Microsoft.KeyVault/vaults
  • Microsoft.RecoveryServices/vaults
  • Microsoft.DataProtection/backupVaults
  • Microsoft.ContainerService/managedClusters
  • Microsoft.Compute/virtualMachines

Companion Resources

| Resource | Name | Purpose |
| --- | --- | --- |
| Microsoft.Insights/actionGroups | ag-ops | Action group for backup failure notifications (email, SMS, webhook) |
| Microsoft.Insights/diagnosticSettings | diag-resource | Diagnostic settings on Recovery Services vault to send backup logs to Log Analytics |
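
The health validation described in WAF-REL-BKP-005 might look like the sketch below when run inside a runbook. The job record shape and both thresholds (1 day for backup freshness, 30 days between restore tests) are assumptions chosen for illustration; the check itself does not specify them.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def backup_alerts(
    jobs: list[dict],
    now: datetime,
    max_backup_age: timedelta = timedelta(days=1),         # assumed threshold
    max_restore_test_age: timedelta = timedelta(days=30),  # assumed threshold
) -> list[str]:
    """Return alert messages for failed, stale, or untested backups.

    Each job is a hypothetical record:
    {"item": str, "status": "Completed" | "Failed",
     "finished": datetime, "last_restore_test": datetime | None}.
    """
    alerts = []
    for job in jobs:
        item = job["item"]
        if job["status"] == "Failed":
            alerts.append(f"{item}: last backup job failed")
        elif now - job["finished"] > max_backup_age:
            alerts.append(f"{item}: last successful backup is stale")
        # A backup that has never been restored, or not recently, is unproven.
        tested = job.get("last_restore_test")
        if tested is None or now - tested > max_restore_test_age:
            alerts.append(f"{item}: restore test overdue")
    return alerts
```

In practice the returned messages would be forwarded to the action group listed above; how the job records are collected is left open here.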
