CCE backups run via Ola Hallengren's maintenance solution + 5 Windows Task Scheduler tasks. ADR-0056 documents the design.
.\infra\backup\Test-BackupChain.ps1 -Environment <env>
Expected: Backup chain HEALTHY. The script reports failures, the last successful full backup time, and the log-backup count over the past 24h.
- Install Ola Hallengren maintenance solution.
  .\infra\backup\Install-OlaHallengren.ps1 -Environment <env>
  First run: prints the downloaded SHA256 and asks the operator to record it in MaintenanceSolution.checksum, then re-run.
- Provision the backup account. Create a Windows account cce.local\cce-sqlbackup-svc (or whatever the AD admins prefer) with SQL Server sysadmin (for BACKUP DATABASE), filesystem write access to D:\CCEBackups\, and UNC write access to the destination share.
- Cache the UNC credential. From the deploy host, as the backup account:
  cmdkey /add:${BACKUP_UNC_HOST} /user:${BACKUP_UNC_USER} /pass:${BACKUP_UNC_PASSWORD}
- Register scheduled tasks.
  .\infra\backup\Register-ScheduledTasks.ps1 -Environment <env> `
    -ServiceAccount cce.local\cce-sqlbackup-svc
- Verify.
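  Get-ScheduledTask | Where-Object TaskName -like 'CCE-Backup-*' | Format-Table TaskName, State
  Get-ScheduledTask | Where-Object TaskName -like 'CCE-Backup-*' | Get-ScheduledTaskInfo | Format-Table TaskName, LastRunTime, LastTaskResult, NextRunTime
  State comes from the task object; LastRunTime and NextRunTime come from Get-ScheduledTaskInfo.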
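The checksum pinning described in the install step can be sketched as follows. This is a minimal illustration, not the contents of Install-OlaHallengren.ps1: the function name Test-PinnedChecksum and the assumption that MaintenanceSolution.checksum holds a single hex digest are hypothetical.

```powershell
# Sketch: compare a downloaded file's SHA256 against a pinned value.
# Test-PinnedChecksum is a hypothetical helper; the checksum file is assumed
# to contain one hex digest and nothing else.
function Test-PinnedChecksum {
    param(
        [Parameter(Mandatory)] [string] $Path,          # downloaded script
        [Parameter(Mandatory)] [string] $ChecksumPath   # file holding the pinned hash
    )
    $actual = (Get-FileHash -LiteralPath $Path -Algorithm SHA256).Hash
    if (-not (Test-Path -LiteralPath $ChecksumPath)) {
        # First run: nothing pinned yet, so show the hash for the operator to record.
        Write-Host "No pinned checksum. Record this value and re-run: $actual"
        return $false
    }
    $pinned = ([string](Get-Content -LiteralPath $ChecksumPath -Raw)).Trim()
    # PowerShell's -eq is case-insensitive for strings, which suits hex digests.
    return ($actual -eq $pinned)
}
```

A caller would refuse to install when the function returns $false, so a changed upstream download cannot be applied silently.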
Recommended every quarter on a non-prod host:
# Pick the latest backup chain.
$full = Get-ChildItem D:\CCEBackups\FULL\*.bak | Sort-Object LastWriteTime -Descending | Select-Object -First 1
$diff = Get-ChildItem D:\CCEBackups\DIFF\*.bak | Sort-Object LastWriteTime -Descending | Select-Object -First 1
$logs = Get-ChildItem D:\CCEBackups\LOG\*.trn | Where-Object LastWriteTime -gt $full.LastWriteTime |
Sort-Object LastWriteTime | ForEach-Object FullName
# Restore to a test DB.
.\infra\backup\Restore-FromBackup.ps1 `
-FullBackup $full.FullName `
-DiffBackup $diff.FullName `
-LogBackups $logs `
-TargetDb CCE_restoretest
# Verify row counts vs the live CCE DB.
sqlcmd -S <server> -d CCE_restoretest -Q "SELECT COUNT(*) FROM <key-table>"
sqlcmd -S <server> -d CCE -Q "SELECT COUNT(*) FROM <key-table>"
# Cleanup.
sqlcmd -S <server> -Q "DROP DATABASE CCE_restoretest"
Record the result in the ops runbook log.
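The drill above picks the latest DIFF whether or not it is newer than the latest FULL, and filters logs against the FULL only. Whether Restore-FromBackup.ps1 tolerates that is not stated here, so a more defensive selection might look like the sketch below. Select-BackupChain is a hypothetical helper, and it relies on file LastWriteTime as an approximation of the real backup order recorded in msdb.

```powershell
# Sketch: choose the latest FULL, a DIFF only if it is newer than that FULL,
# and the LOG files written after whichever of the two is restored last.
# Select-BackupChain is hypothetical; inputs need FullName and LastWriteTime.
function Select-BackupChain {
    param(
        [object[]] $Full,
        [object[]] $Diff = @(),
        [object[]] $Log  = @()
    )
    $latestFull = $Full | Sort-Object LastWriteTime -Descending | Select-Object -First 1
    if (-not $latestFull) { throw 'No full backup found.' }

    # A DIFF older than the FULL belongs to a previous chain and must be skipped.
    $latestDiff = $Diff |
        Where-Object { $_.LastWriteTime -gt $latestFull.LastWriteTime } |
        Sort-Object LastWriteTime -Descending | Select-Object -First 1

    $base = if ($latestDiff) { $latestDiff } else { $latestFull }
    $diffPath = if ($latestDiff) { $latestDiff.FullName } else { $null }

    # Logs are applied oldest first, starting after the base backup.
    $logs = @($Log |
        Where-Object { $_.LastWriteTime -gt $base.LastWriteTime } |
        Sort-Object LastWriteTime | ForEach-Object { $_.FullName })

    [pscustomobject]@{ Full = $latestFull.FullName; Diff = $diffPath; Logs = $logs }
}
```

The three inputs would come from Get-ChildItem on the FULL, DIFF, and LOG folders, as in the drill.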
After a destructive incident (data corruption, accidental delete) on the live DB:
- Stop apps to prevent further writes:
  docker compose -f docker-compose.prod.yml down
- Identify the last good backup point. Use Test-BackupChain.ps1 to find the last full and last diff before the incident, plus all logs up to (but not past) the incident time.
- Run restore with -Force:
  .\infra\backup\Restore-FromBackup.ps1 `
    -FullBackup <path> -DiffBackup <path> -LogBackups <list> `
    -TargetDb CCE -Force
- Verify migration history matches what the running image expects:
  sqlcmd -S <server> -d CCE -Q "SELECT MigrationId FROM __EFMigrationsHistory ORDER BY MigrationId"
- Restart apps:
  .\deploy\deploy.ps1 -Environment <env>
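The migration-history check in the steps above is easier to read as a diff than as two lists compared by eye. The sketch below assumes the expected migration IDs for the running image are available as a list. Where that list comes from is not covered by this runbook, and Compare-MigrationHistory is a hypothetical helper.

```powershell
# Sketch: report how the restored DB's migration history differs from the
# list the running image expects. Compare-MigrationHistory is hypothetical.
function Compare-MigrationHistory {
    param(
        [string[]] $Restored,   # MigrationId values read from __EFMigrationsHistory
        [string[]] $Expected    # MigrationId values the image expects (assumed input)
    )
    # sqlcmd output can carry padding and blank lines; normalize both sides.
    $restoredIds = @($Restored | ForEach-Object { "$_".Trim() } | Where-Object { $_ })
    $expectedIds = @($Expected | ForEach-Object { "$_".Trim() } | Where-Object { $_ })

    [pscustomobject]@{
        # Expected by the image but absent from the restored DB.
        MissingFromDb  = @($expectedIds | Where-Object { $restoredIds -notcontains $_ })
        # Present in the restored DB but unknown to the image.
        UnknownToImage = @($restoredIds | Where-Object { $expectedIds -notcontains $_ })
    }
}
```

Under forward-only migrations, a non-empty UnknownToImage suggests the backup is newer than the image. A non-empty MissingFromDb suggests the deploy will need to apply migrations after the restore.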
After DR promotion (see dr-promotion.md):
- From the DR host, fetch the latest backup chain from the off-host UNC store:
  robocopy "\\${BACKUP_UNC_HOST}\${BACKUP_UNC_SHARE}\prod" "D:\CCEBackups\restored" /E /Z /R:3 /W:10
- Run restore with -Force against the DR host's SQL Server:
  .\infra\backup\Restore-FromBackup.ps1 `
    -FullBackup "D:\CCEBackups\restored\FULL\<latest>.bak" `
    -DiffBackup "D:\CCEBackups\restored\DIFF\<latest>.bak" `
    -LogBackups (Get-ChildItem "D:\CCEBackups\restored\LOG\*.trn" | Sort-Object Name).FullName `
    -TargetDb CCE -Force -Environment dr
- Continue with the deploy step in dr-promotion.md.
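The robocopy fetch in the first step does not signal success with exit code 0 alone. Its exit code is a bitmask, and values below 8 mean no copy failed (1, for example, means files were copied). A script that wraps the fetch might check the code as in this sketch; Test-RobocopySucceeded is a hypothetical helper name.

```powershell
# Sketch: interpret a robocopy exit code before starting the restore.
# Bit 8 = some files or directories could not be copied; bit 16 = fatal error.
# Test-RobocopySucceeded is a hypothetical helper, not part of the repo.
function Test-RobocopySucceeded {
    param([Parameter(Mandatory)] [int] $ExitCode)
    # Success means neither failure bit is set, which is the same as ExitCode < 8.
    return (($ExitCode -band 24) -eq 0)
}

# Usage, right after the robocopy call:
#   if (-not (Test-RobocopySucceeded -ExitCode $LASTEXITCODE)) {
#       throw "robocopy failed with exit code $LASTEXITCODE"
#   }
```

Checking this before the restore avoids restoring from a partially copied chain.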
| Symptom | Cause | Fix |
|---|---|---|
| RESTORE failed: file 'X' is being used by another process | Live CCE DB still in use | Stop apps via docker compose down before restore |
| RESTORE LOG fails: cannot find a backup that includes time T | Log chain has a gap (one log backup skipped) | Use the latest contiguous chain; restore just FULL + DIFF without the broken-chain LOG |
| cmdkey credentials missing (robocopy auth fails) | UNC credential not cached on host | Re-run cmdkey /add:... as the backup-account user |
| Backup chain healthcheck reports 0 FULL successes in 24h | Daily 02:00 task didn't run | Check Get-ScheduledTaskInfo CCE-Backup-Full; investigate task history |
| DBCC CHECKDB reports allocation errors | Possible disk corruption | STOP. File an incident; restore from latest known-good backup |
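The "log chain has a gap" row can be checked before a restore rather than discovered during it. In SQL Server, each log backup's first LSN equals the previous log backup's last LSN, and both values are recorded in msdb.dbo.backupset (first_lsn, last_lsn). The sketch below assumes those values have already been loaded into objects. The property names FirstLsn and LastLsn and the function Get-ContiguousLogChain are hypothetical, and the check does not verify that the first log connects to the FULL or DIFF.

```powershell
# Sketch: keep only the log backups that form an unbroken chain from the first one.
# Get-ContiguousLogChain is hypothetical; inputs need FirstLsn and LastLsn,
# e.g. mapped from first_lsn / last_lsn in msdb.dbo.backupset.
function Get-ContiguousLogChain {
    param([object[]] $LogBackups = @())
    # LSNs are numeric(25,0) in msdb, which fits in a .NET decimal.
    $ordered = @($LogBackups | Sort-Object { [decimal]$_.FirstLsn })
    $chain = @()
    foreach ($backup in $ordered) {
        $isFirst = ($chain.Count -eq 0)
        if (-not $isFirst -and ([decimal]$backup.FirstLsn -ne [decimal]$chain[-1].LastLsn)) {
            break   # gap: this backup does not start where the previous one ended
        }
        $chain += $backup
    }
    return ,$chain   # unary comma keeps the result an array even with 0 or 1 items
}
```

Comparing the returned count with the input count shows whether a gap exists and where the usable chain ends.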
- ADR-0056: Backup strategy
- migrations.md: forward-only migration discipline (relevant for restore-vs-migration-history checks)
- dr-promotion.md: DR promotion procedure (Phase 05)
- Ola Hallengren's docs
- Sub-10c design spec §Backup