|
| 1 | +--- |
| 2 | +name: scale-up-server |
| 3 | +description: Step-by-step workflow for resizing (scaling up) the Hetzner server in the torrust-tracker-demo stack. Use when asked to resize, scale up, or upgrade the server plan. Covers pre-resize preparation, graceful shutdown, provider panel action, post-resize recovery, and evidence capture. Triggers on "resize server", "scale up", "upgrade server plan", "Hetzner resize", "change server type". |
| 4 | +metadata: |
| 5 | + author: torrust |
| 6 | + version: "1.0" |
| 7 | +--- |
| 8 | + |
| 9 | +<!-- cspell:ignore nproc Rcvbuf snmp nstat urlencode --> |
| 10 | + |
| 11 | +# Scaling Up the Server |
| 12 | + |
| 13 | +## Overview |
| 14 | + |
| 15 | +This skill covers a **planned, live resize** of the Hetzner Cloud server: |
| 16 | +shut down services gracefully, resize the instance in the provider panel, |
| 17 | +restart services, and validate everything before re-opening to traffic. |
| 18 | + |
| 19 | +> **Important**: Resizing a Hetzner Cloud server **does not change IP addresses**. |
| 20 | +> Neither the public IPv4/IPv6 addresses nor any attached Floating IPs are |
| 21 | +> affected. DNS records and Floating IP assignments do not need updating. |
| 22 | +> This is standard cloud-provider behavior for in-place resizes. |
| 23 | + |
| 24 | +## Responsibilities |
| 25 | + |
| 26 | +| Step | Who | |
| 27 | +| ----------------------------------- | ---------------------- | |
| 28 | +| Capture pre-resize baseline | AI assistant | |
| 29 | +| Graceful service shutdown | AI assistant (via SSH) | |
| 30 | +| Resize in Hetzner Cloud panel | **Human operator** | |
| 31 | +| Post-resize recovery and validation | AI assistant (via SSH) | |
| 32 | +| Document evidence and commit | AI assistant | |
| 33 | + |
| 34 | +--- |
| 35 | + |
| 36 | +## Workflow |
| 37 | + |
| 38 | +### Step 1 — Capture pre-resize baseline |
| 39 | + |
| 40 | +Before touching the server, record the current state so there is a before/after |
| 41 | +reference. Save results to the issue-scoped evidence folder |
| 42 | +(`docs/issues/evidence/ISSUE-<N>/00-pre-resize-baseline.md`). |
| 43 | + |
| 44 | +```bash |
| 45 | +# Host snapshot |
| 46 | +ssh demotracker 'date -u; nproc; free -h; uptime; df -h' |
| 47 | + |
| 48 | +# Docker services |
| 49 | +ssh demotracker 'cd /opt/torrust && docker compose ps' |
| 50 | + |
| 51 | +# Prometheus request rates (5m window) |
| 52 | +ssh demotracker 'curl -sG "http://127.0.0.1:9090/api/v1/query" \ |
| 53 | + --data-urlencode "query=sum(rate(http_tracker_core_requests_received_total{server_binding_protocol=\"http\",server_binding_port=\"7070\"}[5m]))"' |
| 54 | + |
| 55 | +ssh demotracker 'curl -sG "http://127.0.0.1:9090/api/v1/query" \ |
| 56 | + --data-urlencode "query=sum(rate(udp_tracker_server_requests_received_total{server_binding_protocol=\"udp\",server_binding_port=\"6969\"}[5m]))"' |
| 57 | + |
| 58 | +# UDP buffer error counters |
| 59 | +ssh demotracker 'grep "^Udp:" /proc/net/snmp; nstat -az 2>/dev/null | grep -Ei "UdpRcvbufErrors|Udp6RcvbufErrors" || true' |
| 60 | +``` |
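The Prometheus queries above return a JSON envelope rather than a bare number. As a minimal sketch for filling in the baseline file, the helper below pulls the sample value out of an instant-query response. The function name `prom_scalar` is hypothetical and not part of the repository, and the `sed` pattern assumes the result contains exactly one series (which is what `sum(...)` produces). Where `jq` is installed, it would likely be the more robust choice.

```bash
# prom_scalar: read a Prometheus instant-query JSON response on stdin and
# print the sample value. Assumes exactly one series in the result;
# prints nothing when the result set is empty.
# Hypothetical helper, not part of the repository.
prom_scalar() {
  sed -n 's/.*"value":\[[^,]*,"\([^"]*\)"].*/\1/p'
}

# Example with a canned response (the numbers are made up for illustration).
sample='{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1714557600.123,"152.4"]}]}}'
printf '%s\n' "$sample" | prom_scalar
```

In practice, the output of either `curl` query could be piped into the helper by appending `| prom_scalar` after the closing quote of the `ssh` command.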
| 61 | + |
| 62 | +Commit the baseline file before proceeding to shutdown. |
| 63 | + |
| 64 | +### Step 2 — Confirm readiness |
| 65 | + |
| 66 | +Before shutting down: |
| 67 | + |
| 68 | +- Baseline file is complete and committed. |
| 69 | +- Branch is clean and pushed. |
| 70 | +- The nightly backup window (~03:00 UTC) has been taken into account. Prefer resizing outside that window. |
| 71 | +- Operator is available to complete the Hetzner panel action promptly. |
| 72 | + |
| 73 | +### Step 3 — Graceful service shutdown (AI assistant) |
| 74 | + |
| 75 | +Run from a local terminal. Capture the full output and record it in |
| 76 | +`docs/issues/evidence/ISSUE-<N>/01-resize-execution.md`. |
| 77 | + |
| 78 | +```bash |
| 79 | +ssh demotracker 'set -e |
| 80 | + echo "=== shutdown-start-utc ===" |
| 81 | + date -u +%Y-%m-%dT%H:%M:%SZ |
| 82 | + cd /opt/torrust |
| 83 | + echo "=== docker-compose-ps-before ===" |
| 84 | + docker compose ps |
| 85 | + echo "=== docker-compose-down ===" |
| 86 | + docker compose down |
| 87 | + echo "=== docker-compose-ps-after ===" |
| 88 | + docker compose ps |
| 89 | + echo "=== shutdown-end-utc ===" |
| 90 | + date -u +%Y-%m-%dT%H:%M:%SZ' |
| 91 | +``` |
| 92 | + |
| 93 | +Confirm all containers are stopped and networks are removed before handing over to the human operator. |
| 94 | + |
| 95 | +### Step 4 — Resize in Hetzner Cloud panel (human operator) |
| 96 | + |
| 97 | +1. Log in to [Hetzner Cloud Console](https://console.hetzner.cloud/). |
| 98 | +2. Navigate to the project and select the server (`torrust-tracker-demo` or similar). |
| 99 | +3. Go to the **Rescale** (or **Server type**) tab. |
| 100 | +4. Select the target server type (e.g. CCX33) and confirm. |
| 101 | +5. Wait for the resize to complete — typically under 2 minutes. |
| 102 | +6. Power on the server if it does not start automatically. |
| 103 | +7. Notify the AI assistant when the server is reachable again. |
| 104 | + |
| 105 | +> No IP address changes are required. Floating IPs, public IPs, and private |
| 106 | +> network IPs all remain the same after a Hetzner in-place resize. |
| 107 | +
|
| 108 | +### Step 5 — Post-resize recovery (AI assistant) |
| 109 | + |
| 110 | +Start all services and capture the new host profile: |
| 111 | + |
| 112 | +```bash |
| 113 | +ssh demotracker 'set -e |
| 114 | + echo "=== startup-utc ===" |
| 115 | + date -u +%Y-%m-%dT%H:%M:%SZ |
| 116 | + echo "=== host ===" |
| 117 | + nproc; free -h; uptime |
| 118 | + cd /opt/torrust |
| 119 | + echo "=== docker-compose-up ===" |
| 120 | + docker compose up -d |
| 121 | + echo "=== docker-compose-ps ===" |
| 122 | + docker compose ps' |
| 123 | +``` |
| 124 | + |
| 125 | +### Step 6 — Post-resize validation (AI assistant) |
| 126 | + |
| 127 | +Run all checks and record the outputs in the execution log (`01-resize-execution.md`). |
| 128 | + |
| 129 | +```bash |
| 130 | +# Container health |
| 131 | +ssh demotracker 'cd /opt/torrust && docker compose ps' |
| 132 | + |
| 133 | +# UDP buffer counters (should be zero after fresh boot) |
| 134 | +ssh demotracker 'grep "^Udp:" /proc/net/snmp; nstat -az 2>/dev/null | grep -Ei "UdpRcvbufErrors|Udp6RcvbufErrors" || true' |
| 135 | + |
| 136 | +# Prometheus targets |
| 137 | +ssh demotracker 'curl -sG "http://127.0.0.1:9090/api/v1/query" \ |
| 138 | + --data-urlencode "query=up{job=\"tracker_metrics\"}" |
| 139 | + curl -sG "http://127.0.0.1:9090/api/v1/query" \ |
| 140 | + --data-urlencode "query=up{job=\"tracker_stats\"}"' |
| 141 | +``` |
| 142 | + |
| 143 | +External checks (from local machine): |
| 144 | + |
| 145 | +```bash |
| 146 | +# HTTP tracker health |
| 147 | +curl -fsS "https://http1.torrust-tracker-demo.com/health_check" |
| 148 | + |
| 149 | +# Grafana (302 to /login is expected) |
| 150 | +curl -I "https://grafana.torrust-tracker-demo.com" |
| 151 | + |
| 152 | +# UDP port probe |
| 153 | +nc -zvu udp1.torrust-tracker-demo.com 6969 2>&1 | head -5 |
| 154 | +``` |
| 155 | + |
| 156 | +All services must reach `healthy` status, HTTP health must return `200`, |
| 157 | +and Prometheus targets must show `up=1` before the resize is considered |
| 158 | +complete. |
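The `healthy` criterion above can be checked mechanically instead of by eye. The following is a rough sketch, assuming the default table output of `docker compose ps` (a header line followed by one row per container) and assuming every service defines a healthcheck. The function names `unhealthy_rows` and `all_healthy` are hypothetical.

```bash
# unhealthy_rows: read `docker compose ps` table output on stdin and print
# every container row whose STATUS does not contain "(healthy)".
# Assumes the first line is the column header. Hypothetical helper.
unhealthy_rows() {
  awk 'NR > 1 && NF > 0 && !/\(healthy\)/'
}

# all_healthy: succeed only when no unhealthy rows remain on stdin.
all_healthy() {
  [ -z "$(unhealthy_rows)" ]
}
```

One possible use is `ssh demotracker 'cd /opt/torrust && docker compose ps' | all_healthy && echo "all services healthy"`. Containers still in `(health: starting)` are reported as not healthy, so the check may need to be repeated until they settle.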
| 159 | + |
| 160 | +> The tracker API health endpoint (`/health_check` on `api.torrust-tracker-demo.com`) |
| 161 | +> requires authentication and returns `500 unauthorized` without a token. |
| 162 | +> This is expected and not a failure indicator. |
| 163 | +
|
| 164 | +### Step 7 — Document and commit |
| 165 | + |
| 166 | +Fill in the execution log (`01-resize-execution.md`) with all checklist items, |
| 167 | +the full timeline (start UTC / end UTC / total impact window), the command |
| 168 | +outputs, and the validation results. |
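The total impact window can be derived from the UTC timestamps printed by the shutdown and startup commands. A minimal sketch, assuming GNU `date` (for the `-d` option) and timestamps in the `%Y-%m-%dT%H:%M:%SZ` format used in Step 3; the function name `impact_window_seconds` is hypothetical.

```bash
# impact_window_seconds: print the number of seconds between two UTC
# timestamps such as 2024-05-01T10:00:00Z. Requires GNU date (-d).
# Hypothetical helper, not part of the repository.
impact_window_seconds() {
  local start_s end_s
  start_s=$(date -u -d "$1" +%s) || return 1
  end_s=$(date -u -d "$2" +%s) || return 1
  echo $(( end_s - start_s ))
}
```

Calling it with the `shutdown-start-utc` value as the first argument and the time at which validation passed as the second gives the window to record in the timeline.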
| 169 | + |
| 170 | +Run linters before committing: |
| 171 | + |
| 172 | +```bash |
| 173 | +./scripts/lint.sh |
| 174 | +``` |
| 175 | + |
| 176 | +Commit with: |
| 177 | + |
| 178 | +```bash |
| 179 | +git commit -S -m "docs(issue-<N>): document resize execution and post-resize validation" \ |
| 180 | + -m "Refs: #<N>" |
| 181 | +``` |
| 182 | + |
| 183 | +### Step 8 — Update infrastructure docs |
| 184 | + |
| 185 | +After the resize is confirmed stable: |
| 186 | + |
| 187 | +- Update the hardware table in `docs/infrastructure.md` to reflect the new |
| 188 | + server type, vCPU count, RAM, storage, traffic allowance, and price. |
| 189 | +- Add a row to `docs/infrastructure-resize-history.md` with the resize date, |
| 190 | + old and new plan, throughput at resize time, normalized req/s per vCPU, |
| 191 | + and a link to the related issue. |
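The normalized figure for the resize-history row is simple arithmetic. The sketch below assumes that "normalized req/s per vCPU" means the measured request rate divided by the vCPU count reported by `nproc`; the document does not define the normalization, so treat that reading, and the function name `req_per_vcpu`, as assumptions.

```bash
# req_per_vcpu: divide a request rate (req/s) by a vCPU count and print the
# result with two decimals. Fails when the vCPU count is not positive.
# Hypothetical helper; the normalization itself is an assumed definition.
req_per_vcpu() {
  awk -v rate="$1" -v vcpus="$2" 'BEGIN {
    if (vcpus <= 0) { exit 1 }
    printf "%.2f\n", rate / vcpus
  }'
}
```

For example, `req_per_vcpu 1200 8` would be the calculation for 1200 req/s on an 8-vCPU plan (illustrative numbers only).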
| 192 | + |
| 193 | +--- |
| 194 | + |
| 195 | +## Post-Resize Observation Period |
| 196 | + |
| 197 | +After the resize, monitor for at least **7 days** before concluding success: |
| 198 | + |
| 199 | +- Fill one row per day in `docs/issues/evidence/ISSUE-<N>/02-post-resize-daily-checks.md` |
| 200 | + using the same Prometheus queries from Step 1. |
| 201 | +- Check external uptime from [newTrackon](https://newtrackon.com/) or similar. |
| 202 | +- Watch UDP buffer error counters for any resurgence. |
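For the daily rows, the UDP receive-buffer error counter can be reduced to a single number. This is a sketch only: it assumes the usual two-line `Udp:` layout of `/proc/net/snmp` (a header line with column names followed by a line of values), and the function name `udp_rcvbuf_errors` is hypothetical.

```bash
# udp_rcvbuf_errors: read /proc/net/snmp content on stdin and print the
# RcvbufErrors value from the Udp: section. The column position is looked
# up by name in the header line rather than hard-coded.
# Hypothetical helper, not part of the repository.
udp_rcvbuf_errors() {
  awk '
    $1 == "Udp:" && !col {
      for (i = 2; i <= NF; i++) if ($i == "RcvbufErrors") col = i
      next
    }
    $1 == "Udp:" && col { print $col; exit }
  '
}
```

It could be fed with `ssh demotracker 'grep "^Udp:" /proc/net/snmp' | udp_rcvbuf_errors`. Because the counter is cumulative since boot, day-to-day differences are what indicate a resurgence.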
| 203 | + |
| 204 | +Once the observation window is complete, fill the final comparison table in |
| 205 | +`docs/issues/evidence/ISSUE-<N>/03-pre-post-comparison.md` and decide whether |
| 206 | +the resize meets the acceptance criteria. |