Commit 7b5f9fb

improve disaster control docs
1 parent 4300cbd commit 7b5f9fb

2 files changed

Lines changed: 81 additions & 6 deletions

CONTRIBUTING.md

Lines changed: 30 additions & 0 deletions
@@ -334,6 +334,16 @@ fly vol create data --size 3 --region ams
334334
fly scale count 2
335335
```
336336

337+
For this app, avoid bringing up many regions simultaneously. Prefer cloning one
338+
machine at a time from the current primary and waiting for checks to pass
339+
before adding the next region:
340+
341+
```sh
342+
fly machine clone <PRIMARY_MACHINE_ID> -a kcd --region <REGION>
343+
fly machine status <NEW_MACHINE_ID> -a kcd
344+
fly checks list -a kcd
345+
```
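
The one-region-at-a-time procedure above could be wrapped in a small helper. This is a minimal sketch, not part of the commit: `rollout_serial` and `ROLLOUT_ASSUME_YES` are hypothetical names, and it assumes `fly` is on the `PATH` and already authenticated. It stops at the first failed clone and pauses after each region so the check output can be read before the next clone starts.

```sh
# Sketch: clone the primary into each region strictly one at a time.
# Usage: rollout_serial <PRIMARY_MACHINE_ID> <REGION>...
rollout_serial() {
  primary_id=$1
  shift
  for region in "$@"; do
    echo "Cloning $primary_id into $region"
    # Stop at the first failed clone instead of moving on to the next region.
    fly machine clone "$primary_id" -a kcd --region "$region" || {
      echo "Clone into $region failed; stopping rollout" >&2
      return 1
    }
    # Print check status so the operator can confirm health before continuing.
    fly checks list -a kcd
    # ROLLOUT_ASSUME_YES is a hypothetical switch that skips the manual pause.
    if [ "${ROLLOUT_ASSUME_YES:-0}" != 1 ]; then
      printf 'Checks passing for %s? Press Enter to continue, Ctrl-C to abort: ' "$region"
      read -r _
    fi
  done
}
```

For example, `rollout_serial <PRIMARY_MACHINE_ID> ams sin` would clone into `ams`, wait for confirmation, and only then clone into `sin`. The manual pause is deliberate: the helper does not try to parse check output itself.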
346+
337347
### Removing regions
338348

339349
As with adding regions, consider backing up the data first.
@@ -363,6 +373,26 @@ And when you're finished, scale down to the number of volumes you have:
363373
fly scale count <COUNT>
364374
```
365375

376+
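If cleanup leaves any machines in a non-started state, destroy them. The loop below removes every machine whose state is not `started`, including ones that are only stopped, so review `fly m list -a kcd` first: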
377+
378+
```sh
379+
for id in $(fly m list -a kcd --json | jq -r '.[] | select(.state != "started") | .id'); do
380+
fly machine destroy "$id" -a kcd --force
381+
done
382+
```
383+
384+
After removing regions, also clean up unattached volumes in those regions so
385+
you are not paying for orphaned storage:
386+
387+
```sh
388+
for id in $(fly vol list -a kcd --json | jq -r '.[] | select(.attached_machine_id == null and (.region=="jnb" or .region=="ams" or .region=="sin" or .region=="bom" or .region=="syd" or .region=="cdg")) | .id'); do
389+
fly vol destroy "$id" -a kcd --yes
390+
done
391+
```
392+
393+
Run `fly vol list -a kcd` first and do not delete volumes attached to active
394+
machines.
395+
366396
## Help needed
367397

368398
Please check out [the open issues][issues]

docs/fly-scale-down-recovery-plan.md

Lines changed: 51 additions & 6 deletions
@@ -58,17 +58,59 @@ FLY_MACHINE_ID fix).
5858
- Test site: `curl -I https://kentcdodds.com/healthcheck`
5959
- Check logs: `fly logs -a kcd` (look for ZodError — should be gone)
6060

61-
## Step 4: Scale Back Up
61+
## Step 4: Scale Back Up (one region at a time)
6262

63-
Run deploy again to recreate replicas:
63+
Do **not** bring up all replicas at once. Clone the primary into one region,
64+
wait for health checks to pass, then move to the next region.
65+
66+
Current standard regions:
67+
68+
- `gru`
69+
- `jnb`
70+
- `ams`
71+
- `sin`
72+
- `bom`
73+
- `syd`
74+
- `cdg`
75+
76+
Template (run sequentially, one region at a time):
6477

6578
```bash
66-
fly deploy -a kcd
79+
# ID of the primary when this plan was written; confirm it with `fly machines list -a kcd` first
fly machine clone 7817602a936548 -a kcd --region <REGION>
80+
fly machine status <NEW_MACHINE_ID> -a kcd
81+
fly checks list -a kcd
82+
```
83+
84+
Only continue to the next region after the new machine reports all checks passing (`3/3`).
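
Waiting for a machine to become healthy can be scripted with a generic polling helper. This is a sketch only: `wait_until` is a hypothetical function, and the probe command you give it is your choice. It retries the command until it succeeds or the attempts run out.

```bash
# Sketch: poll a command until it succeeds or the attempts run out.
# Usage: wait_until <ATTEMPTS> <DELAY_SECONDS> <COMMAND>...
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Success ends the wait immediately.
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

One possible probe is the site health check from Step 3, for instance `wait_until 30 10 curl -fsS -o /dev/null https://kentcdodds.com/healthcheck`. That only shows the site as a whole is responding, so `fly checks list -a kcd` remains the per-machine source of truth.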
85+
86+
If a machine fails to start or is left in a non-started state, destroy it
87+
before continuing:
88+
89+
```bash
90+
fly machine destroy <MACHINE_ID> -a kcd
91+
```
92+
93+
You can also prune all non-started machines in one pass. This removes every machine whose state is not `started`, so check `fly m list -a kcd` first:
94+
95+
```bash
96+
for id in $(fly m list -a kcd --json | jq -r '.[] | select(.state != "started") | .id'); do
97+
fly machine destroy "$id" -a kcd --force
98+
done
99+
```
100+
101+
## Step 5: Prune Unattached Volumes (removed regions)
102+
103+
After intentionally removing regions, delete unattached volumes in those
104+
regions to avoid ongoing storage costs.
105+
106+
```bash
107+
for id in $(fly vol list -a kcd --json | jq -r '.[] | select(.attached_machine_id == null and (.region=="jnb" or .region=="ams" or .region=="sin" or .region=="bom" or .region=="syd" or .region=="cdg")) | .id'); do
108+
fly vol destroy "$id" -a kcd --yes
109+
done
67110
```
68111

69-
Fly may recreate machines based on previous configuration. If replicas are not
70-
recreated, you may need to clone the primary to other regions via the Fly
71-
dashboard or `fly machine clone`.
112+
Always review `fly vol list -a kcd` first and keep attached volumes in active
113+
regions (`dfw`/`gru`).
72114

73115
## Notes
74116

@@ -77,3 +119,6 @@ dashboard or `fly machine clone`.
77119
(primary_region in fly.toml).
78120
- **Machine IDs**: Run `fly machines list -a kcd` to get current IDs before
79121
scaling down — they may change between runs.
122+
- **Avoid parallel startup**: Starting many regional machines concurrently can
123+
create noisy health-check failures and slow recovery; prefer strict serial
124+
rollout.
