Commit 488250c
ops(rolling-update): add GOMEMLIMIT=1800MiB + --memory=2500m defaults (#617)
## Summary
Add two OOM-defense defaults to `scripts/rolling-update.sh`:
- `GOMEMLIMIT=1800MiB` (via new `DEFAULT_EXTRA_ENV`, merged into the
existing `EXTRA_ENV` plumbing)
- `--memory=2500m` on the remote `docker run` (via new
`CONTAINER_MEMORY_LIMIT`)
Both are controlled by environment variables and support an empty-string
opt-out: the script uses `${VAR-default}` expansion, so an unset variable
takes the default while an explicit empty string disables the setting.
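
The opt-out semantics can be illustrated with a minimal sketch. The helper name `resolve_memory_limit` is hypothetical and not taken from the script; only the `${VAR-default}` expansion and the `2500m` default come from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; not part of scripts/rolling-update.sh.
resolve_memory_limit() {
  # ${VAR-default} (no colon): the default applies only when VAR is unset.
  # An explicitly empty VAR is kept as-is, which is what enables the opt-out.
  printf '%s\n' "${CONTAINER_MEMORY_LIMIT-2500m}"
}

unset CONTAINER_MEMORY_LIMIT
unset_result="$(resolve_memory_limit)"    # variable unset: default is used

CONTAINER_MEMORY_LIMIT=""
empty_result="$(resolve_memory_limit)"    # explicit empty string: disabled

CONTAINER_MEMORY_LIMIT="4g"
custom_result="$(resolve_memory_limit)"   # operator-supplied value wins

printf 'unset=%s empty=%s custom=%s\n' \
  "$unset_result" "$empty_result" "$custom_result"
```

The more common `${VAR:-default}` form would treat the empty string like an unset variable, so it could not express "disable this setting".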
## Motivation
2026-04-24 incident: all 4 live nodes were kernel-OOM-SIGKILLed 22-169
times in 24h under a traffic spike. Each kill risked WAL-tail truncation
and triggered election storms, cascading into p99 GET spikes to 6-8s.
The runtime defense was applied by hand during the incident; this PR
makes it the script default so future rollouts inherit it.
- `GOMEMLIMIT`: a soft limit on memory managed by the Go runtime. The GC
runs more aggressively as usage approaches the limit, which helps keep
RSS below the container ceiling.
- `--memory` (cgroup hard limit): if Go can't keep up (e.g. non-heap
growth), the kill is scoped to the container, not host processes like
`qemu-guest-agent` or `systemd`.
## Behavior changes
| Variable | Default | Opt-out |
|----------|---------|---------|
| `DEFAULT_EXTRA_ENV` | `GOMEMLIMIT=1800MiB` | `DEFAULT_EXTRA_ENV=""` |
| `CONTAINER_MEMORY_LIMIT` | `2500m` | `CONTAINER_MEMORY_LIMIT=""` |
Operator-supplied `EXTRA_ENV` keys override matching keys in
`DEFAULT_EXTRA_ENV` (e.g., `EXTRA_ENV="GOMEMLIMIT=3000MiB"` wins over
the default).
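
A minimal sketch of that override rule follows. It assumes both variables hold whitespace-separated `KEY=VALUE` pairs with no spaces inside values; the function name `merge_env` and the extra `GOGC=100` pair are hypothetical and used only for illustration. The script's actual merge logic may differ.

```shell
#!/usr/bin/env bash
# merge_env DEFAULTS OVERRIDES
# Prints merged KEY=VALUE pairs, one per line. A default is dropped when
# the operator-supplied list sets the same key.
# Assumption: pairs are whitespace-separated and values contain no spaces.
merge_env() {
  local defaults="$1" overrides="$2"
  local -a d=() o=()
  local pair opair key skip

  # read -ra splits on whitespace without triggering glob expansion.
  read -ra d <<< "$defaults"
  read -ra o <<< "$overrides"

  for pair in "${d[@]}"; do
    key="${pair%%=*}"
    skip=0
    for opair in "${o[@]}"; do
      if [[ "${opair%%=*}" == "$key" ]]; then
        skip=1
      fi
    done
    if (( ! skip )); then
      printf '%s\n' "$pair"
    fi
  done

  # Operator-supplied pairs always pass through unchanged.
  if (( ${#o[@]} )); then
    printf '%s\n' "${o[@]}"
  fi
}

merged="$(merge_env "GOMEMLIMIT=1800MiB GOGC=100" "GOMEMLIMIT=3000MiB")"
printf '%s\n' "$merged"
```

Here the default `GOMEMLIMIT=1800MiB` is dropped in favor of the operator's `GOMEMLIMIT=3000MiB`, while the unrelated default passes through.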
## Related
Companion PRs (defense-in-depth):
- #612 `memwatch` — graceful shutdown before kernel OOM (prevents WAL
corruption in the first place)
- #613 WAL auto-repair — recovers on startup when the above fails
- #616 rolling-update via GitHub Actions over Tailscale — consumes this
script
## Test plan
- [x] `bash -n scripts/rolling-update.sh` passes
- [x] Deployed equivalents manually on all 4 live nodes during the
incident (2026-04-24T07:44Z - 07:46Z); no OOM recurrence since
- [ ] Next rolling-update invocation should produce `docker run ...
--memory=2500m ... -e GOMEMLIMIT=1800MiB ...` on each node
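
To make the last item easier to check, the expected flags can be derived from the two settings with a small sketch. The function `build_run_args` is hypothetical and does not reproduce the script's structure; it ignores operator-supplied `EXTRA_ENV` and only shows how the defaults and the empty-string opt-out would map to `docker run` flags.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the docker run flags implied by the two
# settings, using the defaults stated in this PR.
build_run_args() {
  local memory_limit="${CONTAINER_MEMORY_LIMIT-2500m}"
  local default_env="${DEFAULT_EXTRA_ENV-GOMEMLIMIT=1800MiB}"
  local -a args=() pairs=()
  local pair

  # An empty limit means the operator opted out of the cgroup cap.
  if [[ -n "$memory_limit" ]]; then
    args+=("--memory=$memory_limit")
  fi

  # Each default KEY=VALUE pair becomes one -e flag.
  read -ra pairs <<< "$default_env"
  for pair in "${pairs[@]}"; do
    args+=(-e "$pair")
  done

  printf '%s\n' "${args[*]}"
}

unset CONTAINER_MEMORY_LIMIT DEFAULT_EXTRA_ENV
default_args="$(build_run_args)"

# Both opt-outs applied for a single call.
optout_args="$(CONTAINER_MEMORY_LIMIT="" DEFAULT_EXTRA_ENV="" build_run_args)"

printf 'defaults: %s\n' "$default_args"
printf 'opt-out:  %s\n' "$optout_args"
```

With both variables unset, the sketch yields the two flags named in the checklist item; with both set to the empty string, it yields no flags.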
## Design doc reference
`docs/design/2026_04_24_proposed_resilience_roadmap.md` (item 1 —
capacity/runtime defenses).

2 files changed
Lines changed: 111 additions & 2 deletions