DR-promotion runbook (Sub-10c)

Trigger: prod host unreachable / data centre down / hardware failure / ops decision to fail over.

RTO target: ~hours (host bring-up + restore time + DNS propagation). RPO target: ≤15 minutes (log-backup interval per ADR-0056).

The DR host must be pre-provisioned per infra/dns-tls/dr-provisioning-checklist.md. If it isn't, you can't fail over fast — fix that as a prerequisite, not during a disaster.

Step 1: Decision

Operator + ops lead confirm DR promotion is the right call. Considerations:

Is prod recoverable in less time than DR promotion? (Hardware fail with spare on hand: usually yes. Data-centre outage: usually no.)
Has prod been ruled out as recoverable in the next N minutes?
Is there active in-flight work (a deploy, a destructive migration) that would be lost?

Record the decision in the incident log with timestamp + signatories.

Step 2: Fetch latest backup chain from off-host store

From the DR host:

$bkRoot = '\\${BACKUP_UNC_HOST}\${BACKUP_UNC_SHARE}\prod'
$drDest = 'D:\CCEBackups\restored'
robocopy $bkRoot $drDest /E /Z /R:3 /W:10 /LOG+:C:\ProgramData\CCE\logs\dr-fetch-$(Get-Date -Format 'yyyyMMddTHHmmssZ').log

Verify:

Latest FULL backup is recent (within 24h).
DIFF chain is contiguous from the FULL.
LOG chain is contiguous from the latest FULL or DIFF.

Step 3: Restore DB

$full = (Get-ChildItem D:\CCEBackups\restored\FULL\*.bak | Sort LastWriteTime | Select -Last 1).FullName
$diff = (Get-ChildItem D:\CCEBackups\restored\DIFF\*.bak | Sort LastWriteTime | Select -Last 1).FullName
$logs = Get-ChildItem D:\CCEBackups\restored\LOG\*.trn |
        Where-Object LastWriteTime -gt (Get-Item $full).LastWriteTime |
        Sort-Object LastWriteTime | ForEach-Object FullName

.\infra\backup\Restore-FromBackup.ps1 `
    -FullBackup $full `
    -DiffBackup $diff `
    -LogBackups $logs `
    -TargetDb CCE -Force `
    -Environment dr

Verify migration history matches what the to-be-deployed image expects:

sqlcmd -S localhost -d CCE -Q "SELECT MigrationId FROM __EFMigrationsHistory ORDER BY MigrationId"

Step 4: Deploy apps on DR host

.\deploy\deploy.ps1 -Environment dr

Uses .env.dr, deploys at the same CCE_IMAGE_TAG prod was running. Apps come up against the just-restored DB.

Step 5: Verify DR host green internally

From the DR host itself:

.\deploy\smoke.ps1 -Environment dr -Timeout 90

Probes the 4 DR hostnames (cce-ext-dr, etc.) over HTTPS. Expected: All 4 probes PASSED.

Step 6: DNS swap

Repoint the production hostnames at the DR host's IP. This is the moment where production traffic flips.

Hostname	From	To
`CCE-ext`	prod IP	DR IP
`CCE-admin-Panel`	prod IP	DR IP
`api.CCE`	prod IP	DR IP
`Api.CCE-admin-Panel`	prod IP	DR IP

Operator-managed DNS; per IDD environment. Low DNS TTL during steady-state means propagation is fast (~60s).

If a load-balancer fronts the host instead of direct DNS → reconfigure the LB's pool member from prod IP to DR IP.

Step 7: End-to-end validation

From a real client outside the DR host:

Resolve-DnsName CCE-ext
# Expected: A record points at DR host IP.

Test-NetConnection -ComputerName CCE-ext -Port 443
# Expected: TcpTestSucceeded = True.

Invoke-WebRequest https://CCE-ext/ -UseBasicParsing | Select-Object StatusCode
# Expected: 200.

# Smoke probe from the client subnet.
.\deploy\smoke.ps1 -Environment dr  # but using the prod hostnames now

Note: the smoke.ps1 -Environment dr step probes the dr hostnames per .env.dr. After Step 6, the prod hostnames also resolve to the DR IP, so additionally probe:

Invoke-WebRequest https://CCE-ext/ -UseBasicParsing
Invoke-WebRequest https://CCE-admin-Panel/ -UseBasicParsing
Invoke-WebRequest https://api.CCE/health -UseBasicParsing
Invoke-WebRequest https://Api.CCE-admin-Panel/health -UseBasicParsing

Step 8: Communicate

Status page update (degraded → operating from DR).
Customer notification per ops policy.
Internal incident channel: confirm DR promotion complete; estimate prod-recovery time.

After promotion: activate DR backups

DR is now production. Its backups need to start flowing.

# Register backup tasks on the (now-active) DR host.
.\infra\backup\Register-ScheduledTasks.ps1 -Environment dr `
    -ServiceAccount cce.local\cce-sqlbackup-svc

# Verify.
Get-ScheduledTask | Where-Object TaskName -like 'CCE-Backup-*' | Format-Table TaskName, State

The DR host's backups go to the same UNC share but under \\<host>\<share>\dr\. Prod's old backups remain at \\<host>\<share>\prod\ for reference.

Failback (DR → prod)

When prod is recoverable:

Verify prod host healthy. Re-image, re-provision, get the host back to a clean state.

Pull DR's latest backup chain to prod.

robocopy "\\${BACKUP_UNC_HOST}\${BACKUP_UNC_SHARE}\dr" "D:\CCEBackups\restored" /E /Z

Restore on prod. Same procedure as Step 3, but -Environment prod -TargetDb CCE -Force.
Deploy on prod: .\deploy\deploy.ps1 -Environment prod.
DNS swap back (Step 6 in reverse).

Deactivate DR backup tasks + reactivate prod's:

# On DR host:
foreach ($n in 'CCE-Backup-Full','CCE-Backup-Diff','CCE-Backup-Log','CCE-Backup-IntegrityCheck','CCE-Backup-Sync-OffHost') {
    Disable-ScheduledTask -TaskName $n
}
# On prod host:
foreach ($n in 'CCE-Backup-Full','CCE-Backup-Diff','CCE-Backup-Log','CCE-Backup-IntegrityCheck','CCE-Backup-Sync-OffHost') {
    Enable-ScheduledTask -TaskName $n
}

Communicate: incident closed; back to prod.

Common failures during promotion

Symptom	Cause	Fix
`robocopy` from UNC fails with auth error	DR host's `cmdkey` cache for the UNC not set	Re-run `cmdkey /add:...` as the deploy user
`Restore-FromBackup.ps1` fails with "log chain broken"	One log backup is missing or corrupt	Restore just FULL + DIFF; accept the RPO loss; document
`deploy.ps1` fails at image pull	DR host can't reach ghcr.io	Check `docker login`; verify outbound 443 firewall to ghcr.io
Smoke probes pass internally but external clients can't reach DR	DNS not yet propagated	Wait for TTL; use external DNS-resolution checker
Auth fails after DNS swap	Keycloak realm `cce-dr` issuer doesn't match the redirect URL	Verify `KEYCLOAK_AUTHORITY` in `.env.dr` uses `api.cce-dr` not `api.CCE`; or per IDD, accept that prod hostnames now resolve to DR + Keycloak realm config matches

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DR-promotion runbook (Sub-10c)

Step 1: Decision

Step 2: Fetch latest backup chain from off-host store

Step 3: Restore DB

Step 4: Deploy apps on DR host

Step 5: Verify DR host green internally

Step 6: DNS swap

Step 7: End-to-end validation

Step 8: Communicate

After promotion: activate DR backups

Failback (DR → prod)

Common failures during promotion

See also

FilesExpand file tree

dr-promotion.md

Latest commit

History

dr-promotion.md

File metadata and controls

DR-promotion runbook (Sub-10c)

Step 1: Decision

Step 2: Fetch latest backup chain from off-host store

Step 3: Restore DB

Step 4: Deploy apps on DR host

Step 5: Verify DR host green internally

Step 6: DNS swap

Step 7: End-to-end validation

Step 8: Communicate

After promotion: activate DR backups

Failback (DR → prod)

Common failures during promotion

See also