Rework rollback / recovery story around forward-only migrations #1540

@larryro

Context

PR #1539 locked `tale deploy` to the running CLI's version and intentionally did not lean on `tale rollback` as a recovery path. Reason: Tale's migrations are forward-only — new versions can write data the old binary cannot read, so swapping the binary back without reversing data leaves the system in a half-broken state. Keeping a command that looks like it recovers, but actually corrupts, is worse than not having it.

This issue tracks the proper redesign.

Proposed scope

1. Audit and likely remove `tale rollback`

  • Today `tale rollback --version X` accepts any version with no validation (verified in `tools/cli/src/commands/rollback.ts`) and re-implements its own deploy path.
  • Remove the command and its `setPreviousVersion` state-tracking, OR gate it behind a strict precondition (e.g. only allowed when no migrations have run since the previous version).
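If the gated variant is chosen, the precondition could be sketched roughly as below. This is only an illustration: `AppliedMigration`, `appliedInVersion`, `compareVersions`, and `checkRollbackAllowed` are hypothetical names, the shape of the state tracked via `setPreviousVersion` is not shown in this issue, and the comparison assumes plain dotted numeric versions with no pre-release suffixes.

```typescript
// Hypothetical record of a migration that has already run.
// The real state format used by the CLI is not shown in this issue.
interface AppliedMigration {
  name: string;
  appliedInVersion: string; // CLI version that ran the migration
}

interface RollbackDecision {
  allowed: boolean;
  reason: string;
}

// Compare dotted numeric versions such as "1.4.0".
// Returns a negative number, zero, or a positive number.
// Assumption: no pre-release or build suffixes.
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Allow a rollback only when the target is older than the running version
// and no migration has run in any version newer than the target.
function checkRollbackAllowed(
  targetVersion: string,
  currentVersion: string,
  applied: AppliedMigration[],
): RollbackDecision {
  if (compareVersions(targetVersion, currentVersion) >= 0) {
    return {
      allowed: false,
      reason: `target ${targetVersion} is not older than current ${currentVersion}`,
    };
  }
  const blocking = applied.filter(
    (m) => compareVersions(m.appliedInVersion, targetVersion) > 0,
  );
  if (blocking.length > 0) {
    const names = blocking.map((m) => m.name).join(", ");
    return { allowed: false, reason: `migrations ran after ${targetVersion}: ${names}` };
  }
  return { allowed: true, reason: "no migrations have run since the target version" };
}
```

With this sketch, `checkRollbackAllowed("1.4.0", "1.5.0", [])` is allowed, while the same call with any migration recorded under `1.5.0` is refused and lists the blocking migration names in the reason.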

2. Backup-based recovery as the official supported path

  • Automatic volume snapshot in `tale deploy` before `runPendingMigrations` (the only step that mutates data). Snapshot the `STATEFUL_SERVICES` volumes via `docker run --rm -v :/data -v :/backup alpine tar czf /backup/-.tar.gz /data`.
  • Rotation policy (keep last N snapshots / N days) to bound disk usage.
  • Companion `tale restore ` command — listing, restoring, integrity-checking snapshots.
  • Failure behavior: snapshot failure aborts deploy by default; `--skip-backup` to override explicitly.
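The snapshot step, rotation, and failure behavior above might fit together roughly as in the following sketch. Everything here is an assumption made for illustration: the function names, the `<volume>-<timestamp>.tar.gz` file naming, and the rotation rule (a snapshot is deleted only when it falls outside the newest N and is also older than the age limit) are one possible reading of the bullets, not the CLI's actual behavior.

```typescript
// Build the argument list for `docker` that tars one volume into a backup directory.
// The archive naming (<volume>-<timestamp>.tar.gz) is an assumption.
function buildSnapshotArgs(volume: string, backupDir: string, timestamp: string): string[] {
  return [
    "run", "--rm",
    "-v", `${volume}:/data`,
    "-v", `${backupDir}:/backup`,
    "alpine",
    "tar", "czf", `/backup/${volume}-${timestamp}.tar.gz`, "/data",
  ];
}

// Hypothetical snapshot metadata used by the rotation policy.
interface Snapshot {
  file: string;
  createdAt: Date;
}

// Pick snapshots to delete: those beyond the newest `keepLast`
// that are also older than `maxAgeDays`. A snapshot protected by
// either rule is kept.
function selectSnapshotsToDelete(
  snapshots: Snapshot[],
  keepLast: number,
  maxAgeDays: number,
  now: Date,
): Snapshot[] {
  const newestFirst = [...snapshots].sort(
    (a, b) => b.createdAt.getTime() - a.createdAt.getTime(),
  );
  const cutoff = now.getTime() - maxAgeDays * 24 * 60 * 60 * 1000;
  return newestFirst.filter(
    (s, index) => index >= keepLast && s.createdAt.getTime() < cutoff,
  );
}

interface DeployOptions {
  skipBackup: boolean; // mirrors the proposed --skip-backup flag
}

// Snapshot first, then migrate. If `takeSnapshot` rejects, the error
// propagates and `runPendingMigrations` is never called, which gives
// the "abort deploy by default" behavior.
async function deployWithSnapshot(
  options: DeployOptions,
  takeSnapshot: () => Promise<void>,
  runPendingMigrations: () => Promise<void>,
): Promise<void> {
  if (!options.skipBackup) {
    await takeSnapshot();
  }
  await runPendingMigrations();
}
```

Passing the snapshot and migration steps in as callbacks keeps the ordering rule testable without Docker. A real implementation would hand the array from `buildSnapshotArgs` to a process spawner for each volume in `STATEFUL_SERVICES`.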

3. Docs

  • Update `docs/production-deployment.md` to make the recovery story explicit: "recovery = restore snapshot + re-deploy", not "roll back the binary".

Open questions

  • Per-project backup directory location and ownership (host path vs. dedicated docker volume).
  • How to handle very large stateful volumes — incremental snapshots? Hooks for app-level dumps (Convex export, pg_dump) instead of raw volume tar?
  • Should snapshot creation be on `tale deploy` only, or also on `tale start` when migrations run?
  • Should existing migrations (`namespace-volumes`, `split-convex`) be retroactively wrapped in the snapshot flow on first run after this lands?

Out of scope (for this issue)
