# Disaster Recovery Runbook

Last Updated: 2026-04-01
Issue: `#86` OPS-08 backup/restore automation and disaster-recovery drill playbook

---

## Overview

Taskdeck is a local-first application backed by a single SQLite database file. All boards,
cards, columns, audit records, and automation state live in that file. This runbook covers:

- Backup automation (what the scripts do and when to run them)
- Manual restore procedure (step-by-step)
- RTO and RPO targets
- DR drill schedule and evidence requirements
- Access controls for backup artefacts

---

## RTO and RPO Targets

| Tier | Target | Notes |
| --- | --- | --- |
| RTO (local SQLite instance) | **< 30 minutes** | Time from decision-to-restore to API serving healthy requests |
| RTO (Docker / hosted instance) | **< 60 minutes** | Includes container restart and volume reattachment |
| RPO (default daily rotation) | **< 24 hours** | Maximum data loss under the default 7-backup daily schedule |
| RPO (high-frequency rotation) | **< 1 hour** | Achievable by scheduling `backup.sh` hourly via cron |

These are targets for a single-operator local-first deployment. Cloud or multi-user
deployments should tighten RPO by increasing backup frequency, and should consider
continuous WAL shipping if periodic snapshots cannot meet the required RPO.
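
For the high-frequency tier, a single extra cron entry is enough; a sketch (paths illustrative; `--retain 24` keeps roughly one day of hourly snapshots):

```cron
# Hourly backup at minute 0; retain the last 24 snapshots
0 * * * * /path/to/repo/scripts/backup.sh --db-path /app/data/taskdeck.db --output-dir /backups/taskdeck --retain 24
```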
| 33 | + |
| 34 | +--- |
| 35 | + |
| 36 | +## Backup Automation |
| 37 | + |
| 38 | +### Scripts |
| 39 | + |
| 40 | +| Script | Platform | Location | |
| 41 | +| --- | --- | --- | |
| 42 | +| `backup.sh` | Linux / macOS / WSL | `scripts/backup.sh` | |
| 43 | +| `backup.ps1` | Windows PowerShell | `scripts/backup.ps1` | |
| 44 | +| `restore.sh` | Linux / macOS / WSL | `scripts/restore.sh` | |
| 45 | +| `restore.ps1` | Windows PowerShell | `scripts/restore.ps1` | |
| 46 | + |
| 47 | +### How backups work |
| 48 | + |
| 49 | +`backup.sh` (and the PS1 equivalent) uses `sqlite3 .backup` — SQLite's online backup API. |
| 50 | +This acquires a shared lock, flushes any pending WAL (write-ahead log) frames, and copies |
| 51 | +pages to the destination. It is **safe while the API is running and writing**. The fallback |
| 52 | +(`cp`) is explicitly unsafe with active writers and should only be used in development. |
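
The underlying call is easy to reproduce by hand. A self-contained sketch against a throwaway database (temporary paths, not the real scripts):

```bash
# Build a throwaway database, back it up online, and verify the copy
SRC=$(mktemp /tmp/taskdeck-demo-XXXXXX.db)
DEST="${SRC%.db}-copy.db"
sqlite3 "$SRC" "CREATE TABLE Boards(Id INTEGER PRIMARY KEY, Name TEXT);
                INSERT INTO Boards(Name) VALUES ('demo');"
# .backup drives SQLite's online backup API: consistent even with concurrent writers
sqlite3 "$SRC" ".backup '$DEST'"
sqlite3 "$DEST" "PRAGMA integrity_check;"        # prints: ok
sqlite3 "$DEST" "SELECT COUNT(*) FROM Boards;"   # prints: 1
```

The same dot-command is what the scripts run against the live `taskdeck.db`.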

### Quick start

```bash
# Default paths (~/.taskdeck/taskdeck.db -> ~/.taskdeck/backups/)
bash scripts/backup.sh

# Explicit paths
bash scripts/backup.sh \
  --db-path /app/data/taskdeck.db \
  --output-dir /backups/taskdeck

# Keep 14 backups instead of the default 7
bash scripts/backup.sh --retain 14
```

PowerShell (Windows):

```powershell
.\scripts\backup.ps1
.\scripts\backup.ps1 -DbPath "C:\app\data\taskdeck.db" -OutputDir "D:\backups" -Retain 14
```

### Scheduling (cron / Task Scheduler)

**Linux / macOS — daily at 02:00:**

```cron
# Note: a crontab entry must fit on one line; cron does not support
# backslash line continuations.
0 2 * * * /path/to/repo/scripts/backup.sh --db-path /app/data/taskdeck.db --output-dir /backups/taskdeck >> /var/log/taskdeck-backup.log 2>&1
```

**Windows — Task Scheduler (run as the app-service account):**

```powershell
# Create a daily backup task
$action = New-ScheduledTaskAction -Execute "pwsh.exe" `
    -Argument "-NonInteractive -File C:\taskdeck\scripts\backup.ps1"
$trigger = New-ScheduledTaskTrigger -Daily -At "02:00"
Register-ScheduledTask -TaskName "Taskdeck-Daily-Backup" `
    -Action $action -Trigger $trigger -RunLevel Highest
```

### Docker volume backups

The Docker Compose deployment mounts `taskdeck-db:/app/data`. To back up from the host:

```bash
# Option A: exec into the container and run the backup script
docker compose -f deploy/docker-compose.yml --profile baseline exec api \
  bash /repo/scripts/backup.sh \
  --db-path /app/data/taskdeck.db \
  --output-dir /app/data/backups

# Option B: copy the volume contents to the host (requires API to be stopped or paused)
docker compose -f deploy/docker-compose.yml --profile baseline stop api
docker run --rm \
  -v taskdeck_taskdeck-db:/data \
  -v "$(pwd)/local-backups:/backup" \
  alpine:3 \
  sh -c "cp /data/taskdeck.db /backup/taskdeck-$(date +%Y%m%d-%H%M%S).db"
docker compose -f deploy/docker-compose.yml --profile baseline start api

# Option C: add a dedicated backup sidecar (extend docker-compose.yml).
# As with Option B, stop the api service first: a plain cp is only safe
# without active writers. Note the doubled $$ so Compose does not try to
# interpolate the shell substitution:
#
# backup:
#   profiles: ["backup"]
#   image: alpine:3
#   volumes:
#     - taskdeck-db:/data:ro
#     - ./backups:/backup
#   command: >
#     sh -c "cp /data/taskdeck.db /backup/taskdeck-$$(date +%Y%m%d-%H%M%S).db
#     && echo 'Backup done.'"
#
# Run one-off: docker compose --profile backup run --rm backup
```

---

## Restore Procedure

Use this procedure whenever a database restore is required (corruption, accidental deletion,
or rollback after a bad migration).

### Pre-conditions

- You have a known-good backup file (`taskdeck-backup-YYYY-MM-DD-HHmmss.db`).
- The Taskdeck API is stopped (or you are willing to restart it after restore).
- You have write access to the directory containing the live database.

### Step 1 — Stop the API (recommended)

Stopping the API avoids any writes racing with the restore. It is not strictly required
(`restore.sh` uses `sqlite3 .restore`, which acquires an exclusive lock), but stopping first
eliminates all risk.

```bash
# Docker Compose deployment
docker compose -f deploy/docker-compose.yml --profile baseline stop api

# Local dotnet run — send SIGTERM / Ctrl+C
# systemd
sudo systemctl stop taskdeck-api
```

### Step 2 — Choose the backup to restore

```bash
# List available backups, newest first
ls -lt ~/.taskdeck/backups/taskdeck-backup-*.db

# Or for Docker volume backups
ls -lt ./local-backups/
```

Select the most recent backup before the incident, or a specific point-in-time backup if
you know the target date.
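
Picking the newest file can be scripted; a minimal sketch using a throwaway directory (the two `touch -t` timestamps stand in for real backups):

```bash
# Select the newest taskdeck-backup-*.db by modification time
BACKUP_DIR=$(mktemp -d)
touch -t 202604010900 "$BACKUP_DIR/taskdeck-backup-2026-04-01-090000.db"
touch -t 202604011200 "$BACKUP_DIR/taskdeck-backup-2026-04-01-120000.db"
LATEST=$(ls -t "$BACKUP_DIR"/taskdeck-backup-*.db | head -n 1)
echo "Most recent: $LATEST"
```

Note this sorts by file mtime; if backups were copied between hosts, sort by the timestamp embedded in the filename instead (plain alphabetical order works for this naming scheme).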

### Step 3 — Run the restore script

```bash
bash scripts/restore.sh \
  --backup-file ~/.taskdeck/backups/taskdeck-backup-2026-04-01-120000.db

# With explicit DB path (required for Docker or non-default paths)
bash scripts/restore.sh \
  --backup-file /backups/taskdeck/taskdeck-backup-2026-04-01-120000.db \
  --db-path /app/data/taskdeck.db

# Skip interactive confirmation (for automation)
bash scripts/restore.sh \
  --backup-file /backups/taskdeck-backup-2026-04-01-120000.db \
  --yes
```

PowerShell (Windows):

```powershell
.\scripts\restore.ps1 `
    -BackupFile "$env:USERPROFILE\.taskdeck\backups\taskdeck-backup-2026-04-01-120000.db"

.\scripts\restore.ps1 `
    -BackupFile "D:\backups\taskdeck-backup-2026-04-01-120000.db" `
    -DbPath "C:\app\data\taskdeck.db" `
    -Yes
```

The script will:

1. Verify the backup is a valid SQLite file (magic bytes + `PRAGMA integrity_check`).
2. Check that the backup contains a `Boards` table (Taskdeck schema sanity check).
3. Prompt for confirmation (skip with `--yes` / `-Yes`).
4. Create a timestamped safety copy of the current live database.
5. Restore the backup into the live path.
6. Run a post-restore `PRAGMA integrity_check`.
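
Step 1's magic-byte check can be reproduced by hand, even without `sqlite3` installed; a sketch using stand-in files (every SQLite database starts with the 16-byte header `SQLite format 3\0`):

```bash
# Check the 15 printable header bytes; a real database also has a trailing NUL
check_header() {
  head -c 15 "$1" | grep -q '^SQLite format 3' && echo "header: ok" || echo "header: BAD"
}
printf 'SQLite format 3\0' > /tmp/good.db   # stand-in for a real backup
printf 'not a database'    > /tmp/bad.db
check_header /tmp/good.db   # -> header: ok
check_header /tmp/bad.db    # -> header: BAD
```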

### Step 4 — Verify row counts

After restore, spot-check that the data volume is plausible:

```bash
sqlite3 /path/to/taskdeck.db <<'SQL'
SELECT 'Boards' AS tbl, COUNT(*) AS row_count FROM Boards
UNION ALL
SELECT 'Columns', COUNT(*) FROM Columns
UNION ALL
SELECT 'Cards', COUNT(*) FROM Cards
UNION ALL
SELECT 'Users', COUNT(*) FROM Users;
SQL
```

Compare against your last known-good row counts (see the evidence log if available).

### Step 5 — Start the API and verify health

```bash
# Docker Compose deployment
docker compose -f deploy/docker-compose.yml --profile baseline start api

# Wait for health
for i in $(seq 1 30); do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:5000/health/ready 2>/dev/null || true)
  if [[ "$STATUS" == "200" ]]; then echo "API healthy."; break; fi
  echo "Waiting... ($i/30)"
  sleep 2
done

# Detailed health response
curl -s http://localhost:5000/health/ready | python3 -m json.tool
```

### Step 6 — Record the restore in the evidence log

File an evidence entry in `docs/ops/rehearsals/` using the template in
`docs/ops/EVIDENCE_TEMPLATE.md`. Tag it with `restore-event` rather than `rehearsal` if
this was a real recovery.

---

## Backup Verification

Run these checks after every backup to confirm it is usable for recovery. They can be
automated in CI or a monitoring cron job.

```bash
BACKUP_FILE="/path/to/latest.db"

# 1. Integrity check
sqlite3 "$BACKUP_FILE" 'PRAGMA integrity_check;'
# Expected: ok

# 2. Page count / file size sanity
sqlite3 "$BACKUP_FILE" 'PRAGMA page_count; PRAGMA page_size;'
# Should match or exceed the previous backup

# 3. Schema presence
sqlite3 "$BACKUP_FILE" '.tables'
# Should contain: Boards Columns Cards Users AuditLogs AutomationProposals ...

# 4. Row count spot check
sqlite3 "$BACKUP_FILE" 'SELECT COUNT(*) FROM Boards;'
# Should be > 0 for any non-empty deployment

# 5. Last write recency (check that the backup is not stale)
sqlite3 "$BACKUP_FILE" "
  SELECT MAX(UpdatedAt) AS last_write
  FROM (
    SELECT UpdatedAt FROM Boards
    UNION ALL SELECT UpdatedAt FROM Cards
  );
"
```
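
The checks above can be collapsed into one pass/fail gate for cron or CI; a minimal sketch (the `verify_backup` helper is illustrative, not part of the shipped scripts; it is demonstrated here against a throwaway database):

```bash
# Return 0 only if the backup passes integrity + schema checks
verify_backup() {
  local f="$1"
  [ "$(sqlite3 "$f" 'PRAGMA integrity_check;')" = "ok" ] \
    || { echo "FAIL: integrity ($f)"; return 1; }
  sqlite3 "$f" '.tables' | grep -qw Boards \
    || { echo "FAIL: schema, no Boards table ($f)"; return 1; }
  echo "PASS: $f"
}

# Demonstrate against a throwaway database
DB=$(mktemp /tmp/verify-demo-XXXXXX.db)
sqlite3 "$DB" "CREATE TABLE Boards(Id INTEGER PRIMARY KEY);"
verify_backup "$DB"
```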

---

## Access Controls

| Artefact | Required permission | How enforced |
| --- | --- | --- |
| Backup directory (`~/.taskdeck/backups/`) | Owner read/write only | `chmod 700` (bash) / restricted ACL (PowerShell) |
| Backup files (`taskdeck-backup-*.db`) | Owner read/write only | `chmod 600` (bash) / restricted ACL (PowerShell) |
| Pre-restore safety copies | Owner read/write only | Same as backup files |
| Live database (`taskdeck.db`) | Owner read/write only | Set after restore by restore scripts |

On Linux/macOS: the scripts set `chmod 700` on the backup directory and `chmod 600` on each
file. Verify with `ls -la ~/.taskdeck/backups/`.
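
The mode check can also be scripted; a self-contained sketch using a throwaway directory (the real path would be `~/.taskdeck/backups`; `find -perm` here matches the mode exactly):

```bash
# Stand-in backup directory with the expected 700/600 modes
BACKUP_DIR=$(mktemp -d)
chmod 700 "$BACKUP_DIR"
touch "$BACKUP_DIR/taskdeck-backup-demo.db"
chmod 600 "$BACKUP_DIR/taskdeck-backup-demo.db"

# Flag any backup file whose mode is not exactly 600
LOOSE=$(find "$BACKUP_DIR" -type f ! -perm 600)
if [ -n "$LOOSE" ]; then echo "loose permissions: $LOOSE"; else echo "permissions ok"; fi
```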

On Windows: the scripts apply a restricted ACL granting FullControl to the current user only
and removing inherited permissions. Verify with `Get-Acl <path> | Format-List`.

**For Docker deployments**: ensure the Docker volume is not world-readable. The named volume
`taskdeck-db` is accessible only to containers with the volume mounted. Restrict host-level
access to the volume directory if the host filesystem is shared.

---

## DR Drill Schedule

| Drill type | Cadence | Scope | Evidence required |
| --- | --- | --- | --- |
| Backup verification | Monthly (automated preferred) | Run `PRAGMA integrity_check` and a row-count spot-check on the latest backup | Log entry in backup cron output |
| Manual restore drill | Monthly | Full restore to a separate test directory; verify health | Evidence package in `docs/ops/rehearsals/` |
| Full DR drill | Quarterly | Restore + API restart + user acceptance test | Evidence package + retrospective |

Drill dates align with the cadence defined in `docs/ops/INCIDENT_REHEARSAL_CADENCE.md`.
The backup-restore scenario should be added to the monthly rotation.

---

## DR Drill Evidence Template

For each manual restore drill, file an evidence package at:

```
docs/ops/rehearsals/YYYY-MM-DD_backup-restore-drill.md
```

Use this table as a minimum record:

| Date | Operator | Backup Age | Backup File | Restore Duration | `integrity_check` | Row Count Match | Pass/Fail | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2026-04-01 | @operator | 3h | taskdeck-backup-2026-04-01-090000.db | 4m 12s | ok | yes | Pass | Docker volume restore |
| YYYY-MM-DD | @username | Xh | taskdeck-backup-YYYY-MM-DD-HHmmss.db | Xm Xs | ok/fail | yes/no | Pass/Fail | |

Attach or inline:

- `PRAGMA integrity_check` output
- Row count query results (before and after restore)
- API `/health/ready` response after restart
- Any deviations from expected state

---

## Escalation Path

| Condition | Action |
| --- | --- |
| `PRAGMA integrity_check` returns anything other than `ok` | Do NOT restore this backup. Try the next-oldest backup. File an issue tagged `P1`. |
| Restore script fails with permission error | Check file ownership, ACLs, and whether the API process holds an exclusive lock. |
| All available backups fail integrity check | Escalate to the project owner immediately. Check the live database — it may still be intact. |
| Post-restore API health check returns non-200 | Inspect the `/health/ready` response for which subsystem failed. Check for EF migration drift between the backup schema and the current binary. |
| Data loss confirmed after restore | File a P1 incident issue. Document the RPO gap in the evidence package. Increase backup frequency. |

For this project, escalation means: create a GitHub issue with the labels `incident` and
`data-loss` (or `data-risk`) and assign it to `@Chris0Jeky`.

---

## Related Documents

- `scripts/backup.sh` / `scripts/backup.ps1` — backup automation
- `scripts/restore.sh` / `scripts/restore.ps1` — restore automation
- `docs/ops/EVIDENCE_TEMPLATE.md` — evidence package format
- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` — rehearsal schedule
- `docs/ops/FAILURE_INJECTION_DRILLS.md` — automated failure-injection drills
- `docs/ops/REHEARSAL_BACKOFF_RULES.md` — issue filing rules for drill findings
- `docs/ops/rehearsal-scenarios/` — scenario library