Trigger: prod host unreachable / data centre down / hardware failure / ops decision to fail over.
RTO target: ~hours (host bring-up + restore time + DNS propagation). RPO target: ≤15 minutes (log-backup interval per ADR-0056).
The DR host must be pre-provisioned per infra/dns-tls/dr-provisioning-checklist.md. If it isn't, you can't fail over fast — fix that as a prerequisite, not during a disaster.
Operator + ops lead confirm DR promotion is the right call. Considerations:
- Is prod recoverable in less time than DR promotion? (Hardware fail with spare on hand: usually yes. Data-centre outage: usually no.)
- Has prod been ruled out as recoverable in the next N minutes?
- Is there active in-flight work (a deploy, a destructive migration) that would be lost?
Record the decision in the incident log with timestamp + signatories.
From the DR host:
$bkRoot = '\\${BACKUP_UNC_HOST}\${BACKUP_UNC_SHARE}\prod'
$drDest = 'D:\CCEBackups\restored'
robocopy $bkRoot $drDest /E /Z /R:3 /W:10 /LOG+:C:\ProgramData\CCE\logs\dr-fetch-$((Get-Date).ToUniversalTime().ToString('yyyyMMddTHHmmssZ')).log
Verify:
- Latest FULL backup is recent (within 24h).
- DIFF chain is contiguous from the FULL.
- LOG chain is contiguous from the latest FULL or DIFF.
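The three checks above can be partly scripted. The sketch below is a heuristic only: it looks at file timestamps, not LSNs, so it can flag an obviously stale FULL or a suspicious gap between log backups, but it cannot prove contiguity (RESTORE HEADERONLY on each file is the authoritative source for that). Test-BackupFreshness is a hypothetical helper, not part of the repo, and the "twice the log interval" slack is an assumption.

```powershell
# Hypothetical helper: timestamp-based sanity check of a fetched backup set.
function Test-BackupFreshness {
    param(
        [Parameter(Mandatory)] [datetime] $FullTime,   # LastWriteTime of the latest FULL
        [datetime[]] $LogTimes = @(),                  # LastWriteTime of each .trn file
        [datetime] $Now = (Get-Date),
        [int] $MaxFullAgeHours = 24,
        [int] $LogIntervalMinutes = 15                 # log-backup interval per ADR-0056
    )
    $problems = @()
    if (($Now - $FullTime).TotalHours -gt $MaxFullAgeHours) {
        $problems += "Latest FULL is older than $MaxFullAgeHours h"
    }
    # Walk FULL -> log1 -> log2 ... and flag gaps larger than twice the interval (assumed slack).
    $maxGap = 2 * $LogIntervalMinutes
    $previous = $FullTime
    foreach ($t in ($LogTimes | Where-Object { $_ -gt $FullTime } | Sort-Object)) {
        if (($t - $previous).TotalMinutes -gt $maxGap) {
            $problems += "Gap of more than $maxGap min before the log written at $t"
        }
        $previous = $t
    }
    # The unary comma keeps an empty result as an array instead of $null.
    return ,$problems
}

# Usage sketch against the restored folder:
# $fullTime = (Get-ChildItem D:\CCEBackups\restored\FULL\*.bak | Sort-Object LastWriteTime | Select-Object -Last 1).LastWriteTime
# $logTimes = (Get-ChildItem D:\CCEBackups\restored\LOG\*.trn).LastWriteTime
# Test-BackupFreshness -FullTime $fullTime -LogTimes $logTimes
```

An empty result means nothing looked wrong by timestamp; any returned string is a reason to inspect the chain before restoring.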
$full = (Get-ChildItem D:\CCEBackups\restored\FULL\*.bak | Sort-Object LastWriteTime | Select-Object -Last 1).FullName
$diff = (Get-ChildItem D:\CCEBackups\restored\DIFF\*.bak | Sort-Object LastWriteTime | Select-Object -Last 1).FullName
$logs = Get-ChildItem D:\CCEBackups\restored\LOG\*.trn |
    Where-Object LastWriteTime -gt (Get-Item $full).LastWriteTime |
    Sort-Object LastWriteTime | ForEach-Object FullName
.\infra\backup\Restore-FromBackup.ps1 `
-FullBackup $full `
-DiffBackup $diff `
-LogBackups $logs `
-TargetDb CCE -Force `
-Environment dr
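One case worth guarding against: the selection above takes the newest DIFF regardless of its age, so if a fresh FULL was written after the newest DIFF, that DIFF belongs to an older FULL. A minimal sketch of a selection that handles this, assuming file timestamps reflect backup order and that Restore-FromBackup.ps1 tolerates an absent -DiffBackup (this runbook does not confirm that). Select-RestoreChain is a hypothetical helper.

```powershell
# Hypothetical helper: pick the latest FULL, a DIFF only if it is newer than that FULL,
# and the logs written after whichever of the two is the restore base.
function Select-RestoreChain {
    param(
        [Parameter(Mandatory)] [object[]] $Full,
        [object[]] $Diff = @(),
        [object[]] $Log = @()
    )
    $latestFull = $Full | Sort-Object LastWriteTime | Select-Object -Last 1
    # Only a DIFF written after the chosen FULL can be based on it.
    $latestDiff = $Diff |
        Where-Object { $_.LastWriteTime -gt $latestFull.LastWriteTime } |
        Sort-Object LastWriteTime | Select-Object -Last 1
    $anchor = if ($latestDiff) { $latestDiff } else { $latestFull }
    $logs = @($Log |
        Where-Object { $_.LastWriteTime -gt $anchor.LastWriteTime } |
        Sort-Object LastWriteTime | ForEach-Object FullName)

    [pscustomobject]@{
        Full = $latestFull.FullName
        Diff = if ($latestDiff) { $latestDiff.FullName } else { $null }
        Logs = $logs
    }
}

# Usage sketch:
# $chain = Select-RestoreChain `
#     -Full (Get-ChildItem D:\CCEBackups\restored\FULL\*.bak) `
#     -Diff (Get-ChildItem D:\CCEBackups\restored\DIFF\*.bak) `
#     -Log  (Get-ChildItem D:\CCEBackups\restored\LOG\*.trn)
```

Filtering logs against the DIFF rather than the FULL avoids handing the restore script log backups that end before the DIFF was taken.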
Verify migration history matches what the to-be-deployed image expects:
sqlcmd -S localhost -d CCE -Q "SELECT MigrationId FROM __EFMigrationsHistory ORDER BY MigrationId"
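The query only lists what the restored database has applied; the comparison against the image is left to the operator. A small sketch of that comparison follows, assuming the expected migration IDs are available as a plain list (where that list comes from is not specified here, so the file name in the usage note is hypothetical, as is Compare-MigrationHistory).

```powershell
# Hypothetical helper: diff the migrations applied in the DB against those the image expects.
function Compare-MigrationHistory {
    param(
        [string[]] $Applied = @(),
        [string[]] $Expected = @()
    )
    [pscustomobject]@{
        # Expected by the image but absent from the restored DB.
        MissingFromDb  = @($Expected | Where-Object { $_ -notin $Applied })
        # Present in the DB but unknown to the image (the DB is ahead of the image).
        UnknownToImage = @($Applied | Where-Object { $_ -notin $Expected })
    }
}

# Usage sketch: -h -1 drops the column header, -W trims padding,
# and SET NOCOUNT ON suppresses the "rows affected" line.
# $applied  = sqlcmd -S localhost -d CCE -h -1 -W -Q "SET NOCOUNT ON; SELECT MigrationId FROM __EFMigrationsHistory ORDER BY MigrationId"
# $expected = Get-Content .\expected-migrations.txt   # hypothetical file
# Compare-MigrationHistory -Applied $applied -Expected $expected
```

Both lists empty means the restored schema matches the image; anything in UnknownToImage suggests the backup was taken after a newer deploy than the tag being promoted.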
.\deploy\deploy.ps1 -Environment dr
Uses .env.dr and deploys the same CCE_IMAGE_TAG that prod was running. Apps come up against the just-restored DB.
From the DR host itself:
.\deploy\smoke.ps1 -Environment dr -Timeout 90
Probes the 4 DR hostnames (cce-ext-dr, etc.) over HTTPS. Expected: All 4 probes PASSED.
Repoint the production hostnames at the DR host's IP. This is the moment where production traffic flips.
| Hostname | From | To |
|---|---|---|
| CCE-ext | prod IP | DR IP |
| CCE-admin-Panel | prod IP | DR IP |
| api.CCE | prod IP | DR IP |
| Api.CCE-admin-Panel | prod IP | DR IP |
DNS is operator-managed and differs per IDD environment. Keeping the DNS TTL low during steady state means propagation is fast (~60s).
If a load balancer fronts the host instead of direct DNS, reconfigure the LB's pool member from the prod IP to the DR IP.
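Rather than guessing when propagation has finished, the flip can be polled. The following is a sketch under assumptions: Wait-DnsFlip is a hypothetical helper, the timeout and interval defaults are arbitrary, and the result reflects only the resolver used by the machine running it, not every client.

```powershell
# Hypothetical helper: poll until a hostname resolves to the expected IP or the timeout elapses.
function Wait-DnsFlip {
    param(
        [Parameter(Mandatory)] [string] $Hostname,
        [Parameter(Mandatory)] [string] $ExpectedIp,
        [int] $TimeoutSeconds = 300,
        [int] $IntervalSeconds = 10,
        # The resolver is injectable so the loop can be exercised without real DNS.
        [scriptblock] $Resolver = {
            param($h)
            (Resolve-DnsName -Name $h -Type A -DnsOnly -ErrorAction SilentlyContinue).IPAddress
        }
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    do {
        $ips = @(& $Resolver $Hostname)
        if ($ips -contains $ExpectedIp) { return $true }
        Start-Sleep -Seconds $IntervalSeconds
    } while ((Get-Date) -lt $deadline)
    return $false
}

# Usage sketch (the IP is a placeholder for the DR host's address):
# Wait-DnsFlip -Hostname 'CCE-ext' -ExpectedIp '<DR IP>'
```

A $false result after the timeout points at the DNS change itself (wrong record, wrong zone) rather than slow propagation.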
From a real client outside the DR host:
Resolve-DnsName CCE-ext
# Expected: A record points at DR host IP.
Test-NetConnection -ComputerName CCE-ext -Port 443
# Expected: TcpTestSucceeded = True.
Invoke-WebRequest https://CCE-ext/ -UseBasicParsing | Select-Object StatusCode
# Expected: 200.
# Smoke probe from the client subnet.
.\deploy\smoke.ps1 -Environment dr # still probes the DR hostnames from .env.dr; the prod hostnames are probed separately below
Note: the smoke.ps1 -Environment dr step probes the dr hostnames per .env.dr. After Step 6, the prod hostnames also resolve to the DR IP, so additionally probe:
Invoke-WebRequest https://CCE-ext/ -UseBasicParsing
Invoke-WebRequest https://CCE-admin-Panel/ -UseBasicParsing
Invoke-WebRequest https://api.CCE/health -UseBasicParsing
Invoke-WebRequest https://Api.CCE-admin-Panel/health -UseBasicParsing
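The four probes above can be looped so that one failing hostname does not hide the state of the others. A minimal sketch, assuming HTTP 200 is the pass criterion for every endpoint (as it is for CCE-ext earlier in this step); Test-ProdEndpoints and the 15-second timeout are hypothetical.

```powershell
# Hypothetical helper: probe each URL and report a row per endpoint instead of stopping at the first failure.
function Test-ProdEndpoints {
    param(
        [string[]] $Urls = @(
            'https://CCE-ext/',
            'https://CCE-admin-Panel/',
            'https://api.CCE/health',
            'https://Api.CCE-admin-Panel/health'
        ),
        [int] $TimeoutSec = 15
    )
    foreach ($url in $Urls) {
        try {
            $resp = Invoke-WebRequest -Uri $url -UseBasicParsing -TimeoutSec $TimeoutSec -ErrorAction Stop
            [pscustomobject]@{ Url = $url; Ok = ($resp.StatusCode -eq 200); Detail = $resp.StatusCode }
        } catch {
            # Error statuses, TLS errors, and connection failures all land here.
            [pscustomobject]@{ Url = $url; Ok = $false; Detail = $_.Exception.Message }
        }
    }
}

# Usage sketch:
# Test-ProdEndpoints | Format-Table -AutoSize
```

Run it from the same external client used for the DNS checks, so the result reflects what real users see.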
- Status page update (degraded → operating from DR).
- Customer notification per ops policy.
- Internal incident channel: confirm DR promotion complete; estimate prod-recovery time.
DR is now production. Its backups need to start flowing.
# Register backup tasks on the (now-active) DR host.
.\infra\backup\Register-ScheduledTasks.ps1 -Environment dr `
-ServiceAccount cce.local\cce-sqlbackup-svc
# Verify.
Get-ScheduledTask | Where-Object TaskName -like 'CCE-Backup-*' | Format-Table TaskName, State
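The table printed by the verify command still has to be read by eye. The sketch below turns it into a pass/fail check, assuming Register-ScheduledTasks.ps1 registers the same five task names that the failback step later disables; Get-MissingBackupTasks is a hypothetical helper.

```powershell
# Hypothetical helper: return the expected backup tasks that are not in the supplied list.
function Get-MissingBackupTasks {
    param(
        [string[]] $Present = @(),
        # Assumed task set, taken from the names used in the failback step of this runbook.
        [string[]] $Expected = @(
            'CCE-Backup-Full',
            'CCE-Backup-Diff',
            'CCE-Backup-Log',
            'CCE-Backup-IntegrityCheck',
            'CCE-Backup-Sync-OffHost'
        )
    )
    @($Expected | Where-Object { $_ -notin $Present })
}

# Usage sketch: count only tasks that are registered and in the Ready state.
# $ready = (Get-ScheduledTask |
#     Where-Object { $_.TaskName -like 'CCE-Backup-*' -and $_.State -eq 'Ready' }).TaskName
# $missing = @(Get-MissingBackupTasks -Present $ready)
# if ($missing.Count) { Write-Warning "Backup tasks not ready: $($missing -join ', ')" }
```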
The DR host's backups go to the same UNC share but under \\<host>\<share>\dr\. Prod's old backups remain at \\<host>\<share>\prod\ for reference.
When prod is recoverable:
- Verify prod host healthy. Re-image, re-provision, get the host back to a clean state.
- Pull DR's latest backup chain to prod.
    robocopy "\\${BACKUP_UNC_HOST}\${BACKUP_UNC_SHARE}\dr" "D:\CCEBackups\restored" /E /Z
- Restore on prod. Same procedure as Step 3, but with -Environment prod -TargetDb CCE -Force.
- Deploy on prod:
    .\deploy\deploy.ps1 -Environment prod
- DNS swap back (Step 6 in reverse).
- Deactivate DR backup tasks + reactivate prod's:
    # On DR host:
    foreach ($n in 'CCE-Backup-Full','CCE-Backup-Diff','CCE-Backup-Log','CCE-Backup-IntegrityCheck','CCE-Backup-Sync-OffHost') { Disable-ScheduledTask -TaskName $n }
    # On prod host:
    foreach ($n in 'CCE-Backup-Full','CCE-Backup-Diff','CCE-Backup-Log','CCE-Backup-IntegrityCheck','CCE-Backup-Sync-OffHost') { Enable-ScheduledTask -TaskName $n }
- Communicate: incident closed; back to prod.
| Symptom | Cause | Fix |
|---|---|---|
| robocopy from UNC fails with auth error | DR host's cmdkey cache for the UNC not set | Re-run cmdkey /add:... as the deploy user |
| Restore-FromBackup.ps1 fails with "log chain broken" | One log backup is missing or corrupt | Restore just FULL + DIFF; accept the RPO loss; document |
| deploy.ps1 fails at image pull | DR host can't reach ghcr.io | Check docker login; verify outbound 443 firewall to ghcr.io |
| Smoke probes pass internally but external clients can't reach DR | DNS not yet propagated | Wait for TTL; use external DNS-resolution checker |
| Auth fails after DNS swap | Keycloak realm cce-dr issuer doesn't match the redirect URL | Verify KEYCLOAK_AUTHORITY in .env.dr uses api.cce-dr not api.CCE; or per IDD, accept that prod hostnames now resolve to DR + Keycloak realm config matches |
- DR provisioning checklist — pre-disaster setup.
- Backup-restore runbook — restore procedure detail.
- secret-rotation.md — for any secret rotation triggered by the incident.
- Sub-10c design spec §DR