Commit 4b6467a: Merge pull request #663 from Chris0Jeky/ops/backup-restore-dr-playbook
OPS-08: Backup/restore automation and DR drill playbook (7 files changed, 1484 additions)

# Disaster Recovery Runbook

Last Updated: 2026-04-01
Issue: `#86` OPS-08 backup/restore automation and disaster-recovery drill playbook

---

## Overview

Taskdeck is a local-first application backed by a single SQLite database file. All boards,
cards, columns, audit records, and automation state live in that file. This runbook covers:

- Backup automation (what the scripts do and when to run them)
- Manual restore procedure (step-by-step)
- RTO and RPO targets
- DR drill schedule and evidence requirements
- Access controls for backup artefacts

---

## RTO and RPO Targets

| Tier | Target | Notes |
| --- | --- | --- |
| RTO (local SQLite instance) | **< 30 minutes** | Time from decision-to-restore to API serving healthy requests |
| RTO (Docker / hosted instance) | **< 60 minutes** | Includes container restart and volume reattachment |
| RPO (default daily rotation) | **< 24 hours** | Maximum data loss under the default 7-backup daily schedule |
| RPO (high-frequency rotation) | **< 1 hour** | Achievable by scheduling `backup.sh` hourly via cron |

These are targets for a single-operator local-first deployment. Cloud/multi-user deployments
should tighten RPO by increasing backup frequency, and should consider continuous WAL
shipping if periodic snapshots cannot meet the required RPO.

---

## Backup Automation

### Scripts

| Script | Platform | Location |
| --- | --- | --- |
| `backup.sh` | Linux / macOS / WSL | `scripts/backup.sh` |
| `backup.ps1` | Windows PowerShell | `scripts/backup.ps1` |
| `restore.sh` | Linux / macOS / WSL | `scripts/restore.sh` |
| `restore.ps1` | Windows PowerShell | `scripts/restore.ps1` |

### How backups work

`backup.sh` (and the PS1 equivalent) uses `sqlite3 .backup` — SQLite's online backup API.
This acquires a shared lock, flushes any pending WAL (write-ahead log) frames, and copies
pages to the destination. It is **safe while the API is running and writing**. The fallback
(`cp`) is explicitly unsafe with active writers and should only be used in development.
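
The safe path can be reproduced by hand with the `sqlite3` CLI. The sketch below runs against a scratch database so it is copy-paste safe; in practice you would substitute the live and backup paths:

```shell
# Demonstrate the online-backup path on a scratch database.
workdir=$(mktemp -d)
sqlite3 "$workdir/live.db" \
  "CREATE TABLE Boards(Id INTEGER PRIMARY KEY, Name TEXT);
   INSERT INTO Boards(Name) VALUES ('demo');"

# .backup uses SQLite's online backup API: pages are copied under a
# shared lock and re-copied if a writer changes them mid-copy, so this
# is safe while the API is running. A plain cp is not.
sqlite3 "$workdir/live.db" ".backup '$workdir/snapshot.db'"

# Verify the snapshot immediately; expect a single line: ok
sqlite3 "$workdir/snapshot.db" "PRAGMA integrity_check;"
```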

### Quick start

```bash
# Default paths (~/.taskdeck/taskdeck.db -> ~/.taskdeck/backups/)
bash scripts/backup.sh

# Explicit paths
bash scripts/backup.sh \
  --db-path /app/data/taskdeck.db \
  --output-dir /backups/taskdeck

# Keep 14 backups instead of the default 7
bash scripts/backup.sh --retain 14
```

PowerShell (Windows):

```powershell
.\scripts\backup.ps1
.\scripts\backup.ps1 -DbPath "C:\app\data\taskdeck.db" -OutputDir "D:\backups" -Retain 14
```

### Scheduling (cron / Task Scheduler)

**Linux / macOS — daily at 02:00:**

```cron
0 2 * * * /path/to/repo/scripts/backup.sh --db-path /app/data/taskdeck.db --output-dir /backups/taskdeck >> /var/log/taskdeck-backup.log 2>&1
```

Note that a crontab entry must fit on a single line: cron does not honour backslash
line continuations.

**Windows — Task Scheduler (run as the app-service account):**

```powershell
# Create a daily backup task
$action = New-ScheduledTaskAction -Execute "pwsh.exe" `
  -Argument "-NonInteractive -File C:\taskdeck\scripts\backup.ps1"
$trigger = New-ScheduledTaskTrigger -Daily -At "02:00"
Register-ScheduledTask -TaskName "Taskdeck-Daily-Backup" `
  -Action $action -Trigger $trigger -RunLevel Highest
```

### Docker volume backups

The Docker Compose deployment mounts `taskdeck-db:/app/data`. To back up from the host:

```bash
# Option A: exec into the container and run the backup script
docker compose -f deploy/docker-compose.yml --profile baseline exec api \
  bash /repo/scripts/backup.sh \
  --db-path /app/data/taskdeck.db \
  --output-dir /app/data/backups

# Option B: copy the volume contents to the host (requires API to be stopped or paused)
docker compose -f deploy/docker-compose.yml --profile baseline stop api
docker run --rm \
  -v taskdeck_taskdeck-db:/data \
  -v "$(pwd)/local-backups:/backup" \
  alpine:3 \
  sh -c "cp /data/taskdeck.db /backup/taskdeck-$(date +%Y%m%d-%H%M%S).db"
docker compose -f deploy/docker-compose.yml --profile baseline start api

# Option C: add a dedicated backup sidecar (extend docker-compose.yml).
# Note the doubled $$ below — Compose would otherwise try to interpolate
# $(date ...) itself instead of passing it through to the shell:
#
#   backup:
#     profiles: ["backup"]
#     image: alpine:3
#     volumes:
#       - taskdeck-db:/data:ro
#       - ./backups:/backup
#     command: >
#       sh -c "cp /data/taskdeck.db /backup/taskdeck-$$(date +%Y%m%d-%H%M%S).db
#       && echo 'Backup done.'"
#
# Run one-off: docker compose --profile backup run --rm backup
```

---

## Restore Procedure

Use this procedure whenever a database restore is required (corruption, accidental deletion,
or rollback after a bad migration).

### Pre-conditions

- You have a known-good backup file (`taskdeck-backup-YYYY-MM-DD-HHmmss.db`).
- The Taskdeck API is stopped (or you are willing to restart it after restore).
- You have write access to the directory containing the live database.

### Step 1 — Stop the API (recommended)

Stopping the API avoids any writes racing with the restore. It is not strictly required
(`restore.sh` uses `sqlite3 .restore`, which acquires an exclusive lock), but stopping first
eliminates all risk.

```bash
# Docker Compose deployment
docker compose -f deploy/docker-compose.yml --profile baseline stop api

# Local dotnet run — send SIGTERM / Ctrl+C

# systemd
sudo systemctl stop taskdeck-api
```

### Step 2 — Choose the backup to restore

```bash
# List available backups, newest first
ls -lt ~/.taskdeck/backups/taskdeck-backup-*.db

# Or for Docker volume backups
ls -lt ./local-backups/
```

Select the most recent backup before the incident, or a specific point-in-time backup if
you know the target date.
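
For unattended tooling, the newest backup can also be selected programmatically. A minimal sketch — the `newest_backup` helper is ours, not part of the shipped scripts:

```shell
# newest_backup DIR — print the most recently modified backup in DIR,
# or nothing if DIR contains no backups.
newest_backup() {
  ls -1t "$1"/taskdeck-backup-*.db 2>/dev/null | head -n 1
}

# Example (illustrative): feed the newest backup straight into restore.sh
#   bash scripts/restore.sh --backup-file "$(newest_backup ~/.taskdeck/backups)" --yes
```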

### Step 3 — Run the restore script

```bash
bash scripts/restore.sh \
  --backup-file ~/.taskdeck/backups/taskdeck-backup-2026-04-01-120000.db

# With explicit DB path (required for Docker or non-default paths)
bash scripts/restore.sh \
  --backup-file /backups/taskdeck/taskdeck-backup-2026-04-01-120000.db \
  --db-path /app/data/taskdeck.db

# Skip interactive confirmation (for automation)
bash scripts/restore.sh \
  --backup-file /backups/taskdeck-backup-2026-04-01-120000.db \
  --yes
```

PowerShell (Windows):

```powershell
.\scripts\restore.ps1 `
  -BackupFile "$env:USERPROFILE\.taskdeck\backups\taskdeck-backup-2026-04-01-120000.db"

.\scripts\restore.ps1 `
  -BackupFile "D:\backups\taskdeck-backup-2026-04-01-120000.db" `
  -DbPath "C:\app\data\taskdeck.db" `
  -Yes
```

The script will:

1. Verify the backup is a valid SQLite file (magic bytes + `PRAGMA integrity_check`).
2. Check that the backup contains a `Boards` table (Taskdeck schema sanity check).
3. Prompt for confirmation (skip with `--yes` / `-Yes`).
4. Create a timestamped safety copy of the current live database.
5. Restore the backup into the live path.
6. Run a post-restore `PRAGMA integrity_check`.
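
Checks 1 and 2 can be reproduced by hand to vet a backup before handing it to the script. A sketch against a scratch database (a real run would point at the candidate backup file instead):

```shell
workdir=$(mktemp -d)
sqlite3 "$workdir/candidate.db" "CREATE TABLE Boards(Id INTEGER PRIMARY KEY);"

# Check 1a: a valid SQLite file begins with the header string "SQLite format 3"
head -c 15 "$workdir/candidate.db"; echo

# Check 1b: structural integrity; expect a single line: ok
sqlite3 "$workdir/candidate.db" "PRAGMA integrity_check;"

# Check 2: Taskdeck schema sanity — the Boards table must be present
sqlite3 "$workdir/candidate.db" \
  "SELECT name FROM sqlite_master WHERE type='table' AND name='Boards';"
```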

### Step 4 — Verify row counts

After restore, spot-check that the data volume is plausible:

```bash
sqlite3 /path/to/taskdeck.db <<'SQL'
SELECT 'Boards' AS tbl, COUNT(*) AS row_count FROM Boards
UNION ALL
SELECT 'Columns', COUNT(*) FROM Columns
UNION ALL
SELECT 'Cards', COUNT(*) FROM Cards
UNION ALL
SELECT 'Users', COUNT(*) FROM Users;
SQL
```

Compare against your last known-good row counts (see evidence log if available).

### Step 5 — Start the API and verify health

```bash
# Docker Compose deployment
docker compose -f deploy/docker-compose.yml --profile baseline start api

# Wait for health
for i in $(seq 1 30); do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:5000/health/ready 2>/dev/null || true)
  if [[ "$STATUS" == "200" ]]; then echo "API healthy."; break; fi
  echo "Waiting... ($i/30)"
  sleep 2
done

# Detailed health response
curl -s http://localhost:5000/health/ready | python3 -m json.tool
```

### Step 6 — Record the restore in the evidence log

File an evidence entry in `docs/ops/rehearsals/` using the template in
`docs/ops/EVIDENCE_TEMPLATE.md`. Tag it with `restore-event` rather than `rehearsal` if
this was a real recovery.

---

## Backup Verification

Run these checks after every backup to confirm it is usable for recovery. They can be
automated in CI or a monitoring cron job.

```bash
BACKUP_FILE="/path/to/latest.db"

# 1. Integrity check
sqlite3 "$BACKUP_FILE" 'PRAGMA integrity_check;'
# Expected: ok

# 2. Page count / file size sanity
sqlite3 "$BACKUP_FILE" 'PRAGMA page_count; PRAGMA page_size;'
# Should match or exceed the previous backup; a large unexplained drop is a red flag

# 3. Schema presence
sqlite3 "$BACKUP_FILE" '.tables'
# Should contain: Boards Columns Cards Users AuditLogs AutomationProposals ...

# 4. Row count spot check
sqlite3 "$BACKUP_FILE" 'SELECT COUNT(*) FROM Boards;'
# Should be positive for any non-empty deployment

# 5. Last write recency (check that the backup is not stale)
sqlite3 "$BACKUP_FILE" "
SELECT MAX(UpdatedAt) AS last_write
FROM (
  SELECT UpdatedAt FROM Boards
  UNION ALL SELECT UpdatedAt FROM Cards
);
"
```
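
For automation, the integrity and schema checks can be folded into a single pass/fail helper so a cron job fails loudly instead of silently rotating a corrupt backup. A sketch — the `verify_backup` name is ours, not part of the shipped scripts:

```shell
# verify_backup FILE — return non-zero if FILE is unusable for recovery.
verify_backup() {
  backup="$1"
  # Structural integrity must report exactly "ok"
  [ "$(sqlite3 "$backup" 'PRAGMA integrity_check;' 2>/dev/null)" = "ok" ] || return 1
  # Taskdeck schema sanity: the Boards table must exist
  [ -n "$(sqlite3 "$backup" \
      "SELECT name FROM sqlite_master WHERE type='table' AND name='Boards';" \
      2>/dev/null)" ] || return 1
  echo "backup OK: $backup"
}
```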

---

## Access Controls

| Artefact | Required permission | How enforced |
| --- | --- | --- |
| Backup directory (`~/.taskdeck/backups/`) | Owner read/write only | `chmod 700` (bash) / restricted ACL (PowerShell) |
| Backup files (`taskdeck-backup-*.db`) | Owner read/write only | `chmod 600` (bash) / restricted ACL (PowerShell) |
| Pre-restore safety copies | Owner read/write only | Same as backup files |
| Live database (`taskdeck.db`) | Owner read/write only | Set after restore by restore scripts |

On Linux/macOS: the scripts set `chmod 700` on the backup directory and `chmod 600` on each
file. Verify with `ls -la ~/.taskdeck/backups/`.

On Windows: the scripts apply a restricted ACL granting FullControl to the current user only
and removing inherited permissions. Verify with `Get-Acl <path> | Format-List`.

**For Docker deployments**: ensure the Docker volume is not world-readable. The named volume
`taskdeck-db` is accessible only to containers with the volume mounted. Restrict host-level
access to the volume directory if the host filesystem is shared.
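
If permissions have drifted (for example after copying backups between machines), the Linux/macOS policy above can be re-applied in one step. A sketch — the `lock_down` helper is ours, not part of the shipped scripts:

```shell
# lock_down DIR — re-apply the runbook's expected permissions:
# 700 on the backup directory, 600 on each backup file.
lock_down() {
  dir="$1"
  chmod 700 "$dir"
  # find handles the empty-directory case that a bare glob would not
  find "$dir" -type f -name 'taskdeck-backup-*.db' -exec chmod 600 {} +
}
```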

---

## DR Drill Schedule

| Drill type | Cadence | Scope | Evidence required |
| --- | --- | --- | --- |
| Backup verification | Monthly (automated preferred) | Run `PRAGMA integrity_check` and row-count spot-check on the latest backup | Log entry in backup cron output |
| Manual restore drill | Monthly | Full restore to a separate test directory; verify health | Evidence package in `docs/ops/rehearsals/` |
| Full DR drill | Quarterly | Restore + API restart + user acceptance test | Evidence package + retrospective |

Drill dates align with the cadence defined in `docs/ops/INCIDENT_REHEARSAL_CADENCE.md`.
The backup-restore scenario should be added to the monthly rotation.

---

## DR Drill Evidence Template

For each manual restore drill, file an evidence package at:

```
docs/ops/rehearsals/YYYY-MM-DD_backup-restore-drill.md
```

Use this table as a minimum record:

| Date | Operator | Backup Age | Backup File | Restore Duration | `integrity_check` | Row Count Match | Pass/Fail | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2026-04-01 | @operator | 3h | taskdeck-backup-2026-04-01-090000.db | 4m 12s | ok | yes | Pass | Docker volume restore |
| YYYY-MM-DD | @username | Xh | taskdeck-backup-YYYY-MM-DD-HHmmss.db | Xm Xs | ok/fail | yes/no | Pass/Fail | |

Attach or inline:

- `PRAGMA integrity_check` output
- Row count query results (before and after restore)
- API `/health/ready` response after restart
- Any deviations from expected state

---

## Escalation Path

| Condition | Action |
| --- | --- |
| `PRAGMA integrity_check` returns anything other than `ok` | Do NOT restore this backup. Try the next-oldest backup. File an issue tagged `P1`. |
| Restore script fails with permission error | Check file ownership, ACLs, and whether the API process holds an exclusive lock. |
| All available backups fail integrity check | Escalate to the project owner immediately. Check the live database — it may still be intact. |
| Post-restore API health check returns non-200 | Inspect `/health/ready` response for which subsystem failed. Check for EF migration drift between backup schema and current binary. |
| Data loss confirmed after restore | File a P1 incident issue. Document the RPO gap in the evidence package. Increase backup frequency. |

For this project, escalation means: create a GitHub issue with the labels `incident` and
`data-loss` (or `data-risk`) and assign it to `@Chris0Jeky`.

---

## Related Documents

- `scripts/backup.sh` / `scripts/backup.ps1` — backup automation
- `scripts/restore.sh` / `scripts/restore.ps1` — restore automation
- `docs/ops/EVIDENCE_TEMPLATE.md` — evidence package format
- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` — rehearsal schedule
- `docs/ops/FAILURE_INJECTION_DRILLS.md` — automated failure-injection drills
- `docs/ops/REHEARSAL_BACKOFF_RULES.md` — issue filing rules for drill findings
- `docs/ops/rehearsal-scenarios/` — scenario library

docs/ops/INCIDENT_REHEARSAL_CADENCE.md — 1 addition (the `backup-restore-drill.md` entry):

Available scenarios in `docs/ops/rehearsal-scenarios/`:

- `missing-telemetry-signal.md` -- Correlation ID missing from OpenTelemetry traces
- `mcp-server-startup-regression.md` -- Optional MCP server fails at boot
- `deployment-readiness-failure.md` -- Docker Compose startup fails readiness checks
- `backup-restore-drill.md` -- Full backup and restore loop; validates scripts, integrity checks, and RTO target

New scenarios should follow the same template structure (pre-conditions, injection, diagnosis, recovery, evidence checklist). File them in the `rehearsal-scenarios/` directory with a descriptive kebab-case filename.
