Skip to content

Commit 4a8d1fc

Browse files
committed
Merge #22: feat(issue-21): scale up server CCX23 → CCX33 for better UDP uptime
dcc3cc3 docs(conntrack): add RPS/RFS softirq steering how-to to UDP runbook (Jose Celano) 5341ff9 docs(conntrack): add UDP conntrack runbook and check skill (Jose Celano) 7ca3751 docs(issue-21): record success — UDP uptime reached 99.9% over 7-day window (Jose Celano) 62fbef5 docs(issue-21): record D+2 live UDP verification state (Jose Celano) 1e8b0a1 docs(issue-21): record D+1 post-resize observations and comparison (Jose Celano) 11937b8 fix: increase nf_conntrack table size and reduce UDP timeouts (Jose Celano) 469e67d docs: add scale-up-server skill with resize workflow (Jose Celano) 90e0653 docs(issue-21): document post-resize validation results (Jose Celano) 56414cf docs(issue-21): add resize execution runbook (Jose Celano) 1a84f3b docs(issue-21): record measured pre-resize load baseline (Jose Celano) Pull request description: ## Summary Scales the Hetzner server from CCX23 (4 vCPU, 16 GB RAM) to CCX33 (8 vCPU, 32 GB RAM) to address the UDP uptime issues tracked in #19 and #21. The observation window is complete. This PR includes the full evidence trail, the conntrack fix required to sustain uptime, and permanent operational documentation. ## What Happened The resize alone was not sufficient. A secondary root cause was discovered during the observation window: Docker DNAT creates one conntrack entry per UDP packet. With the default `nf_conntrack_max=262144` and a 120 s UDP stream timeout, the conntrack table filled under load, silently dropping packets. **Fix applied (2026-04-20):** - `nf_conntrack_max=1048576` (4× previous) - `nf_conntrack_udp_timeout=10` - `nf_conntrack_udp_timeout_stream=15` - `nf_conntrack` kernel module pre-loaded via `/etc/modules-load.d/conntrack.conf` After this fix, UDP uptime rose from ~92% to **99.90%** and has held there for the full 7-day post-fix window. ## Outcome | Item | Before | After | |---|---|---| | Plan | CCX23 | CCX33 | | vCPU | 4 | 8 | | RAM | 16 GB | 32 GB | | Traffic | 20 TB | 30 TB | | Price | €31.49/mo | €62.49/mo | | HTTP req/s (peak) | ~1350 | ~2000 | | UDP req/s (peak) | ~1507 | ~750 | | UDP uptime | ~92.20% | **99.90%** | | HTTP uptime | ~99.90% | **99.90%** | ## Acceptance Criteria - [x] UDP newTrackon uptime ≥ 99.0% over rolling 7 days post-fix — **99.90% achieved** - [x] UDP buffer error counters remain near zero after the server has been under load - [x] Host load average stays below 70% of available capacity - [x] No new service degradation observed in HTTP tracker - [x] Pre/post comparison documented in `03-pre-post-comparison.md` - [x] Resize workflow skill added and referenced ## Changes **Evidence trail:** - `docs/issues/ISSUE-21-scale-up-server-for-udp-uptime.md` — issue spec, now marked RESOLVED - `docs/issues/evidence/ISSUE-21/00-pre-resize-baseline.md` — pre-resize Prometheus measurements - `docs/issues/evidence/ISSUE-21/01-resize-execution.md` — full resize log - `docs/issues/evidence/ISSUE-21/02-post-resize-daily-checks.md` — 7-day daily log (D+1–D+7 filled) - `docs/issues/evidence/ISSUE-21/03-pre-post-comparison.md` — pre/post comparison, decision: **Success** **Server configuration (deployed and in-repo):** - `server/etc/sysctl.d/99-conntrack.conf` — conntrack kernel parameters - `server/etc/modules-load.d/conntrack.conf` — ensures `nf_conntrack` loads at boot **Permanent operational documentation:** - `docs/udp-conntrack-runbook.md` — how to detect, fix, and validate conntrack saturation and softirq imbalance (including RPS/RFS how-to) - `.github/skills/check-udp-conntrack/skill.md` — agent workflow for future conntrack health checks **Infrastructure docs updated:** - `docs/infrastructure.md` — updated traffic figures and added runbook link - `docs/infrastructure-resize-history.md` — new file; resize events log with links Refs: #21 ACKs for top commit: josecelano: ACK dcc3cc3 Tree-SHA512: a4d0986681fa8d3f86945f3b6e1efa7635877cfe38bf7c735f56f429d195e120391f23b254a238cae46157bc5e322d0ae52d0e7c3411832538fe15912a03692f
2 parents 67a8f07 + dcc3cc3 commit 4a8d1fc

13 files changed

Lines changed: 990 additions & 65 deletions
Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,60 @@
1+
---
2+
name: check-udp-conntrack
3+
description: Workflow for checking whether UDP packet loss or uptime degradation may be caused by conntrack saturation on the torrust-tracker-demo server. Use when diagnosing UDP timeouts, low newTrackon uptime, packet drops, conntrack pressure, UDP receive-buffer errors, or when validating whether conntrack tuning is still healthy.
4+
metadata:
5+
author: torrust
6+
version: "1.0"
7+
---
8+
9+
<!-- cspell:ignore Rcvbuf conntrack NoPorts -->
10+
11+
# Check UDP Conntrack
12+
13+
## Overview
14+
15+
Use this skill to investigate whether UDP instability is caused by kernel-side
16+
conntrack saturation or related packet-path pressure.
17+
18+
The canonical human-facing reference is:
19+
20+
- `docs/udp-conntrack-runbook.md`
21+
22+
Keep durable explanations and operational guidance in that document. This skill
23+
should stay focused on workflow and safe execution.
24+
25+
## When To Use
26+
27+
Use this skill when the user asks to:
28+
29+
- check whether conntrack is too small
30+
- diagnose UDP timeouts or packet loss
31+
- validate that current conntrack tuning is still active
32+
- verify whether the server is dropping UDP packets
33+
- assess whether current symptoms point to conntrack saturation or something else
34+
35+
## Workflow
36+
37+
1. Run the host checks from `docs/udp-conntrack-runbook.md`.
38+
2. Summarize the results in terms of:
39+
- conntrack occupancy
40+
- presence or absence of `table full` events
41+
- IPv4 and IPv6 UDP receive-buffer errors
42+
- whether `NoPorts` counters are relevant or benign
43+
3. Distinguish conntrack saturation from softirq/RX steering imbalance.
44+
4. If the user asks to document the result, update the relevant issue evidence
45+
or incident file and reference the runbook when appropriate.
46+
47+
## Interpretation Rules
48+
49+
- `nf_conntrack_count` near or equal to `nf_conntrack_max` means real pressure.
50+
- Any fresh `nf_conntrack: table full, dropping packet` message is a confirmed problem.
51+
- `UdpRcvbufErrors` or `Udp6RcvbufErrors` increasing during the incident means packet loss below the application layer.
52+
- `NoPorts` counters alone do not prove tracker loss.
53+
- High load average with one CPU dominated by `%soft` points to softirq concentration, not necessarily conntrack exhaustion.
54+
55+
## Safety Constraints
56+
57+
- Do not change sysctl values unless the user explicitly asks for a fix.
58+
- If applying a fix, update both runtime state and persistent files when appropriate.
59+
- Preserve issue-specific evidence in `docs/issues/evidence/ISSUE-<N>/`.
60+
- Do not present the skill as the primary source of truth; the runbook in `docs/` is the canonical explanation.
Lines changed: 206 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,206 @@
1+
---
2+
name: scale-up-server
3+
description: Step-by-step workflow for resizing (scaling up) the Hetzner server in the torrust-tracker-demo stack. Use when asked to resize, scale up, or upgrade the server plan. Covers pre-resize preparation, graceful shutdown, provider panel action, post-resize recovery, and evidence capture. Triggers on "resize server", "scale up", "upgrade server plan", "Hetzner resize", "change server type".
4+
metadata:
5+
author: torrust
6+
version: "1.0"
7+
---
8+
9+
<!-- cspell:ignore nproc Rcvbuf snmp nstat urlencode -->
10+
11+
# Scaling Up the Server
12+
13+
## Overview
14+
15+
This skill covers a **planned, live resize** of the Hetzner Cloud server:
16+
shut down services gracefully, resize the instance in the provider panel,
17+
restart services, and validate everything before re-opening to traffic.
18+
19+
> **Important**: Resizing a Hetzner Cloud server **does not change IP addresses**.
20+
> Neither the public IPv4/IPv6 addresses nor any attached Floating IPs are
21+
> affected. DNS records and Floating IP assignments do not need updating.
22+
> This is standard cloud-provider behavior for in-place resizes.
23+
24+
## Responsibilities
25+
26+
| Step | Who |
27+
| ----------------------------------- | ---------------------- |
28+
| Capture pre-resize baseline | AI assistant |
29+
| Graceful service shutdown | AI assistant (via SSH) |
30+
| Resize in Hetzner Cloud panel | **Human operator** |
31+
| Post-resize recovery and validation | AI assistant (via SSH) |
32+
| Document evidence and commit | AI assistant |
33+
34+
---
35+
36+
## Workflow
37+
38+
### Step 1 — Capture pre-resize baseline
39+
40+
Before touching the server, record the current state so there is a before/after
41+
reference. Save results to the issue-scoped evidence folder
42+
(`docs/issues/evidence/ISSUE-<N>/00-pre-resize-baseline.md`).
43+
44+
```bash
45+
# Host snapshot
46+
ssh demotracker 'date -u; nproc; free -h; uptime; df -h'
47+
48+
# Docker services
49+
ssh demotracker 'cd /opt/torrust && docker compose ps'
50+
51+
# Prometheus request rates (5m window)
52+
ssh demotracker 'curl -sG "http://127.0.0.1:9090/api/v1/query" \
53+
--data-urlencode "query=sum(rate(http_tracker_core_requests_received_total{server_binding_protocol=\"http\",server_binding_port=\"7070\"}[5m]))"'
54+
55+
ssh demotracker 'curl -sG "http://127.0.0.1:9090/api/v1/query" \
56+
--data-urlencode "query=sum(rate(udp_tracker_server_requests_received_total{server_binding_protocol=\"udp\",server_binding_port=\"6969\"}[5m]))"'
57+
58+
# UDP buffer error counters
59+
ssh demotracker 'grep "^Udp:" /proc/net/snmp; nstat -az 2>/dev/null | grep -Ei "UdpRcvbufErrors|Udp6RcvbufErrors" || true'
60+
```
61+
62+
Commit the baseline file before proceeding to shutdown.
63+
64+
### Step 2 — Confirm readiness
65+
66+
Before shutting down:
67+
68+
- Baseline file is complete and committed.
69+
- Branch is clean and pushed.
70+
- Nightly backup window awareness (~03:00 UTC). Prefer resizing outside that window.
71+
- Operator is available to complete the Hetzner panel action promptly.
72+
73+
### Step 3 — Graceful service shutdown (AI assistant)
74+
75+
Run from a local terminal. Capture the full output and record it in
76+
`docs/issues/evidence/ISSUE-<N>/01-resize-execution.md`.
77+
78+
```bash
79+
ssh demotracker 'set -e
80+
echo "=== shutdown-start-utc ==="
81+
date -u +%Y-%m-%dT%H:%M:%SZ
82+
cd /opt/torrust
83+
echo "=== docker-compose-ps-before ==="
84+
docker compose ps
85+
echo "=== docker-compose-down ==="
86+
docker compose down
87+
echo "=== docker-compose-ps-after ==="
88+
docker compose ps
89+
echo "=== shutdown-end-utc ==="
90+
date -u +%Y-%m-%dT%H:%M:%SZ'
91+
```
92+
93+
Confirm all containers are stopped and networks are removed before handing over.
94+
95+
### Step 4 — Resize in Hetzner Cloud panel (human operator)
96+
97+
1. Log in to [Hetzner Cloud Console](https://console.hetzner.cloud/).
98+
2. Navigate to the project and select the server (`torrust-tracker-demo` or similar).
99+
3. Go to **Rescale** (or **Server type**) tab.
100+
4. Select the target server type (e.g. CCX33) and confirm.
101+
5. Wait for the resize to complete — typically under 2 minutes.
102+
6. Power on the server if it does not start automatically.
103+
7. Notify the AI assistant when the server is reachable again.
104+
105+
> No IP address changes are required. Floating IPs, public IPs, and private
106+
> network IPs all remain the same after a Hetzner in-place resize.
107+
108+
### Step 5 — Post-resize recovery (AI assistant)
109+
110+
Start all services and capture the new host profile:
111+
112+
```bash
113+
ssh demotracker 'set -e
114+
echo "=== startup-utc ==="
115+
date -u +%Y-%m-%dT%H:%M:%SZ
116+
echo "=== host ==="
117+
nproc; free -h; uptime
118+
cd /opt/torrust
119+
echo "=== docker-compose-up ==="
120+
docker compose up -d
121+
echo "=== docker-compose-ps ==="
122+
docker compose ps'
123+
```
124+
125+
### Step 6 — Post-resize validation (AI assistant)
126+
127+
Run all checks and record outputs in the execution log.
128+
129+
```bash
130+
# Container health
131+
ssh demotracker 'cd /opt/torrust && docker compose ps'
132+
133+
# UDP buffer counters (should be zero after fresh boot)
134+
ssh demotracker 'grep "^Udp:" /proc/net/snmp; nstat -az 2>/dev/null | grep -Ei "UdpRcvbufErrors|Udp6RcvbufErrors" || true'
135+
136+
# Prometheus targets
137+
ssh demotracker 'curl -sG "http://127.0.0.1:9090/api/v1/query" \
138+
--data-urlencode "query=up{job=\"tracker_metrics\"}"
139+
curl -sG "http://127.0.0.1:9090/api/v1/query" \
140+
--data-urlencode "query=up{job=\"tracker_stats\"}"'
141+
```
142+
143+
External checks (from local machine):
144+
145+
```bash
146+
# HTTP tracker health
147+
curl -fsS "https://http1.torrust-tracker-demo.com/health_check"
148+
149+
# Grafana (302 to /login is expected)
150+
curl -I "https://grafana.torrust-tracker-demo.com"
151+
152+
# UDP port probe
153+
nc -zvu udp1.torrust-tracker-demo.com 6969 2>&1 | head -5
154+
```
155+
156+
All services must reach `healthy` status, HTTP health must return `200`,
157+
and Prometheus targets must show `up=1` before the resize is considered
158+
complete.
159+
160+
> The tracker API health endpoint (`/health_check` on `api.torrust-tracker-demo.com`)
161+
> requires authentication and returns `500 unauthorized` without a token.
162+
> This is expected and not a failure indicator.
163+
164+
### Step 7 — Document and commit
165+
166+
Fill in the execution log (`01-resize-execution.md`) with all checklist items,
167+
the full timeline (start UTC / end UTC / total impact window), the command
168+
outputs, and the validation results.
169+
170+
Run linters before committing:
171+
172+
```bash
173+
./scripts/lint.sh
174+
```
175+
176+
Commit with:
177+
178+
```bash
179+
git commit -S -m "docs(issue-<N>): document resize execution and post-resize validation" \
180+
-m "Refs: #<N>"
181+
```
182+
183+
### Step 8 — Update infrastructure docs
184+
185+
After the resize is confirmed stable:
186+
187+
- Update the hardware table in `docs/infrastructure.md` to reflect the new
188+
server type, vCPU count, RAM, storage, traffic allowance, and price.
189+
- Add a row to `docs/infrastructure-resize-history.md` with the resize date,
190+
old and new plan, throughput at resize time, normalized req/s per vCPU,
191+
and a link to the related issue.
192+
193+
---
194+
195+
## Post-Resize Observation Period
196+
197+
After the resize, monitor for at least **7 days** before concluding success:
198+
199+
- Fill one row per day in `docs/issues/evidence/ISSUE-<N>/02-post-resize-daily-checks.md`
200+
using the same Prometheus queries from Step 1.
201+
- Check external uptime from [newTrackon](https://newtrackon.com/) or similar.
202+
- Watch UDP buffer error counters for any resurgence.
203+
204+
Once the observation window is complete, fill the final comparison table in
205+
`docs/issues/evidence/ISSUE-<N>/03-pre-post-comparison.md` and decide whether
206+
the resize meets the acceptance criteria.

docs/infrastructure-resize-history.md

Lines changed: 6 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -23,10 +23,10 @@ investigations (especially for UDP uptime on newTrackon).
2323

2424
## Timeline
2525

26-
| Date (UTC) | Change type | Server plan | vCPU | RAM | HTTP1 req/s | UDP1 req/s | Total req/s | Req/s per vCPU | UDP newTrackon uptime | Notes | Related |
27-
| ---------- | --------------------- | ----------- | ---- | ----- | ----------- | ---------- | ----------- | -------------- | --------------------- | ------------------------------------------------------------------------------------ | ---------------------------------------------------------------- |
28-
| 2026-04-13 | Baseline (pre-resize) | CCX23 | 4 | 16 GB | ~1300 | ~1500 | ~2800 | ~700 | 92.20% | High combined load. Capacity pressure suspected at current normalized request rate. | [#19](https://github.com/torrust/torrust-tracker-demo/issues/19) |
29-
| 2026-04-13 | Planned target resize | CCX33 | 8 | 32 GB | ~1300 | ~1500 | ~2800 | ~350 | 92.20% | Selected next plan: 30 TB traffic, €0.100/h - €62.49/mo. Value assumes similar load. | [#21](https://github.com/torrust/torrust-tracker-demo/issues/21) |
26+
| Date (UTC) | Change type | Server plan | vCPU | RAM | HTTP1 req/s | UDP1 req/s | Total req/s | Req/s per vCPU | UDP newTrackon uptime | Notes | Related |
27+
| ---------- | --------------------- | ----------- | ---- | ----- | ----------- | ---------- | ----------- | -------------- | --------------------- | ----------------------------------------------------------------------------------------------- | ---------------------------------------------------------------- |
28+
| 2026-04-13 | Baseline (pre-resize) | CCX23 | 4 | 16 GB | ~1350 | ~1507 | ~2857 | ~714 | 92.20% | Baseline from Prometheus 5m rate snapshot at 2026-04-13T15:27:46Z. Capacity pressure suspected. | [#19](https://github.com/torrust/torrust-tracker-demo/issues/19) |
29+
| 2026-04-13 | Planned target resize | CCX33 | 8 | 32 GB | ~1350 | ~1507 | ~2857 | ~357 | 92.20% | Selected next plan: 30 TB traffic, €0.100/h - €62.49/mo. Assumes similar load after resize. | [#21](https://github.com/torrust/torrust-tracker-demo/issues/21) |
3030

3131
## Decision Criteria (Suggested)
3232

@@ -39,5 +39,7 @@ investigations (especially for UDP uptime on newTrackon).
3939

4040
1. Track UDP uptime daily for at least 7 days.
4141
2. Re-check host load and UDP receive buffer errors.
42+
For conntrack-specific diagnosis and remediation, use
43+
[udp-conntrack-runbook.md](udp-conntrack-runbook.md).
4244
3. Compare tracker error/aborted counters before vs after resize.
4345
4. Record final conclusion in this file and in the related issue.

docs/infrastructure.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,8 @@ For raw command outputs (`ip addr`, `df -h`, etc.) see
66
[infrastructure-raw-outputs.md](infrastructure-raw-outputs.md).
77
For server resize and observed request-rate history see
88
[infrastructure-resize-history.md](infrastructure-resize-history.md).
9+
For UDP packet-loss diagnosis and conntrack tuning guidance see
10+
[udp-conntrack-runbook.md](udp-conntrack-runbook.md).
911

1012
## Server
1113

docs/issues/ISSUE-21-scale-up-server-for-udp-uptime.md

Lines changed: 30 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -11,8 +11,8 @@
1111
## Overview
1212

1313
Observed traffic and evidence suggest the current server size (CCX23, 4 vCPU,
14-
16 GB RAM) is likely under pressure for current request volume (roughly
15-
1300 HTTP req/s + 1500 UDP req/s).
14+
16 GB RAM) is likely under pressure for current request volume (about
15+
1350 HTTP req/s + 1507 UDP req/s at the latest baseline snapshot).
1616

1717
Current public uptime observed in newTrackon for UDP is below target:
1818

@@ -21,22 +21,39 @@ Current public uptime observed in newTrackon for UDP is below target:
2121
This issue tracks a controlled resize experiment to determine whether capacity
2222
is the main bottleneck and to restore/maintain UDP uptime at or above 99%.
2323

24+
## Current State (2026-04-27) — RESOLVED
25+
26+
- Resize (CCX23 -> CCX33) complete and stable.
27+
- Conntrack overflow root cause identified and fixed on 2026-04-20
28+
(`nf_conntrack_max` 262144 → 1048576, UDP timeouts reduced, module pre-load
29+
added).
30+
- 7-day post-fix observation window complete.
31+
- newTrackon rolling UDP uptime reached **99.9%** — above the 99.0% target.
32+
33+
Outcome: **Success**. See
34+
[03-pre-post-comparison.md](evidence/ISSUE-21/03-pre-post-comparison.md) for
35+
the final decision record. Permanent follow-up documentation now lives in
36+
[udp-conntrack-runbook.md](../../udp-conntrack-runbook.md), with a reusable
37+
workspace skill at `.github/skills/check-udp-conntrack/skill.md`.
38+
2439
## Goal
2540

2641
Increase UDP tracker uptime to at least 99.0% over a rolling 7-day window while
2742
keeping service behavior stable.
2843

2944
## Current Throughput Baseline (Pre-Resize)
3045

31-
Observed request rates (Grafana, recent 3h window):
46+
Observed request rates at baseline snapshot (`2026-04-13T15:27:46Z`):
47+
48+
- Source: Prometheus instant query using 5-minute rate windows
3249

33-
- HTTP1: ~1300 req/s
34-
- UDP1: ~1500 req/s
35-
- Combined: ~2800 req/s
50+
- HTTP1: ~1350 req/s
51+
- UDP1: ~1507 req/s
52+
- Combined: ~2857 req/s
3653

3754
On the current CCX23 (4 vCPU), this is approximately:
3855

39-
- ~700 req/s per vCPU (combined)
56+
- ~714 req/s per vCPU (combined)
4057

4158
This baseline must be preserved in the resize history so future sizing
4259
decisions can be based on both absolute load and normalized load per vCPU.
@@ -98,12 +115,12 @@ The next available option selected for this experiment is:
98115

99116
## Acceptance Criteria
100117

101-
- [ ] Resize executed and documented in resize history.
102-
- [ ] No critical service regression immediately after resize.
103-
- [ ] At least 7 days of post-resize observations recorded.
104-
- [ ] UDP newTrackon uptime reaches and stays >= 99.0% during evaluation window.
105-
- [ ] Pre/post comparison documented with clear conclusion.
106-
- [ ] Resize workflow skill added and referenced.
118+
- [x] Resize executed and documented in resize history.
119+
- [x] No critical service regression immediately after resize.
120+
- [x] At least 7 days of post-resize observations recorded.
121+
- [x] UDP newTrackon uptime reaches and stays >= 99.0% during evaluation window.
122+
- [x] Pre/post comparison documented with clear conclusion.
123+
- [x] Resize workflow skill added and referenced.
107124

108125
## Possible Outcomes
109126

0 commit comments

Comments
 (0)