feat(issue-21): scale up server CCX23 → CCX33 for better UDP uptime by josecelano · Pull Request #22 · torrust/torrust-tracker-demo

josecelano · 2026-04-13T15:53:28Z

Summary

Scales the Hetzner server from CCX23 (4 vCPU, 16 GB RAM) to CCX33 (8 vCPU,
32 GB RAM) to address the UDP uptime issues tracked in #19 and #21. The
observation window is complete. This PR includes the full evidence trail,
the conntrack fix required to sustain uptime, and permanent operational
documentation.

What Happened

The resize alone was not sufficient. A secondary root cause was discovered
during the observation window: Docker's bridge DNAT forces connection
tracking, so every new UDP flow creates a conntrack entry. With the default
nf_conntrack_max=262144 and a 120 s UDP stream timeout, the conntrack table
filled under load and the kernel dropped packets while the application log
stayed silent.
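
A back-of-the-envelope check on why the table filled, offered as a sketch rather than a measurement: by Little's law, steady-state occupancy is roughly the rate of new flows times the time each entry lives. The function names are illustrative, and the model assumes every entry stays for the full stream timeout, which overstates occupancy for flows that expire on the shorter unreplied timeout.

```python
def steady_state_entries(new_flows_per_sec: float, timeout_s: float) -> float:
    """Little's law: entries held = arrival rate of new flows x entry lifetime."""
    return new_flows_per_sec * timeout_s


def max_sustainable_flow_rate(table_size: int, timeout_s: float) -> float:
    """New flows per second at which a table of `table_size` entries saturates."""
    return table_size / timeout_s


if __name__ == "__main__":
    # Default limits quoted in this PR: 262144 entries, 120 s stream timeout.
    before = max_sustainable_flow_rate(262_144, 120)
    # Tuned limits applied on 2026-04-20: 1048576 entries, 15 s stream timeout.
    after = max_sustainable_flow_rate(1_048_576, 15)
    print(f"default: ~{before:.0f} new flows/s before the table fills")
    print(f"tuned:   ~{after:.0f} new flows/s before the table fills")
    print(f"headroom gain: ~{after / before:.0f}x")
```

Under these assumptions the 4x larger table and the 8x shorter timeout multiply, so the tuned settings tolerate about 32 times the rate of new flows.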

Fix applied (2026-04-20):

nf_conntrack_max=1048576 (4× previous)
nf_conntrack_udp_timeout=10
nf_conntrack_udp_timeout_stream=15
nf_conntrack kernel module pre-loaded via /etc/modules-load.d/conntrack.conf
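
For reference, the values above map onto `net.netfilter.*` sysctl keys. The sketch below renders what such a drop-in could look like; the helper name and layout are assumptions, and the actual files committed under server/etc/ may differ in comments and formatting.

```python
# Values taken from the fix list above; the key prefix is the kernel's
# net.netfilter.* sysctl namespace.
CONNTRACK_SETTINGS = {
    "nf_conntrack_max": 1_048_576,
    "nf_conntrack_udp_timeout": 10,
    "nf_conntrack_udp_timeout_stream": 15,
}

# modules-load.d files list one kernel module name per line.
MODULES_LOAD_CONF = "nf_conntrack\n"


def render_sysctl_conf(settings: dict[str, int]) -> str:
    """Render settings as `key = value` lines in sysctl.d syntax."""
    lines = [f"net.netfilter.{key} = {value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print("# sysctl.d drop-in (hypothetical rendering)")
    print(render_sysctl_conf(CONNTRACK_SETTINGS), end="")
    print("# modules-load.d drop-in (hypothetical rendering)")
    print(MODULES_LOAD_CONF, end="")
```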

After this fix, UDP uptime rose from ~92% to 99.90% and has held there for
the full 7-day post-fix window.

Outcome

| Item | Before | After |
| --- | --- | --- |
| Plan | CCX23 | CCX33 |
| vCPU | 4 | 8 |
| RAM | 16 GB | 32 GB |
| Traffic | 20 TB | 30 TB |
| Price | €31.49/mo | €62.49/mo |
| HTTP req/s (peak) | ~1350 | ~2000 |
| UDP req/s (peak) | ~1507 | ~750 |
| UDP uptime | ~92.20% | 99.90% |
| HTTP uptime | ~99.90% | 99.90% |

Acceptance Criteria

UDP newTrackon uptime ≥ 99.0% over rolling 7 days post-fix — 99.90% achieved
UDP buffer error counters remain near zero after the server has been under load
Host load average stays below 70% of available capacity
No new service degradation observed in HTTP tracker
Pre/post comparison documented in 03-pre-post-comparison.md
Resize workflow skill added and referenced
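
The load criterion can be checked mechanically. This is a minimal sketch that assumes "available capacity" means the number of vCPUs, which is an interpretation and not something the PR states; the function names are illustrative.

```python
import os


def load_fraction(load_avg: float, vcpus: int) -> float:
    """Load average expressed as a fraction of the vCPU count."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return load_avg / vcpus


def within_load_budget(load_avg: float, vcpus: int, budget: float = 0.70) -> bool:
    """True when the load average is strictly below `budget` of capacity."""
    return load_fraction(load_avg, vcpus) < budget


def current_load_ok(budget: float = 0.70) -> bool:
    """Check the live 15-minute load average of the host running this code."""
    _one, _five, fifteen = os.getloadavg()
    vcpus = os.cpu_count() or 1
    return within_load_budget(fifteen, vcpus, budget)
```

On an 8 vCPU host the 70% budget corresponds to a load average of 5.6.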

Changes

Evidence trail:

docs/issues/ISSUE-21-scale-up-server-for-udp-uptime.md — issue spec, now marked RESOLVED
docs/issues/evidence/ISSUE-21/00-pre-resize-baseline.md — pre-resize Prometheus measurements
docs/issues/evidence/ISSUE-21/01-resize-execution.md — full resize log
docs/issues/evidence/ISSUE-21/02-post-resize-daily-checks.md — 7-day daily log (D+1–D+7 filled)
docs/issues/evidence/ISSUE-21/03-pre-post-comparison.md — pre/post comparison, decision: Success

Server configuration (deployed and in-repo):

server/etc/sysctl.d/99-conntrack.conf — conntrack kernel parameters
server/etc/modules-load.d/conntrack.conf — ensures nf_conntrack loads at boot

Permanent operational documentation:

docs/udp-conntrack-runbook.md — how to detect, fix, and validate conntrack saturation and softirq imbalance (including RPS/RFS how-to)
.github/skills/check-udp-conntrack/skill.md — agent workflow for future conntrack health checks

Infrastructure docs updated:

docs/infrastructure.md — updated traffic figures and added runbook link
docs/infrastructure-resize-history.md — new file; resize events log with links

Refs: #21

Documents the full resize workflow: pre-resize baseline capture, graceful shutdown, Hetzner panel action by human operator, post-resize recovery and validation, evidence capture, and 7-day observation period. Notes the key behaviour that Hetzner in-place resizes preserve all IP addresses (public, private, and Floating IPs), so no DNS or IP reassignment is needed. Refs: #21

The default conntrack table (262144 entries) fills up under sustained UDP tracker load, causing "nf_conntrack: table full, dropping packet" kernel errors and intermittent UDP timeouts on uptime monitors.

Applied kernel tunables:

- nf_conntrack_max: 262144 → 1048576 (4x increase)
- nf_conntrack_udp_timeout_stream: 120 s → 15 s (8x reduction)
- nf_conntrack_udp_timeout: 30 s → 10 s

Added /etc/modules-load.d/conntrack.conf to pre-load the nf_conntrack module at boot so sysctl settings are applied before Docker starts. Without this, net.netfilter.* keys don't exist when sysctl runs and the settings are silently skipped after a reboot.

Refs: #21
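
The reboot trap described in this commit can be detected after boot by checking that the keys exist at all. A minimal sketch, assuming the standard /proc/sys/net/netfilter location; the function and constant names are hypothetical.

```python
from pathlib import Path

# The three keys this PR tunes. They only appear under /proc/sys once the
# nf_conntrack module is loaded.
REQUIRED_KEYS = (
    "nf_conntrack_max",
    "nf_conntrack_udp_timeout",
    "nf_conntrack_udp_timeout_stream",
)


def missing_conntrack_keys(
    proc_root: Path = Path("/proc/sys/net/netfilter"),
) -> list[str]:
    """Return the tuned keys that do not exist under `proc_root`.

    A non-empty result suggests the module was not loaded when sysctl ran,
    so the corresponding settings could not have been applied.
    """
    return [key for key in REQUIRED_KEYS if not (proc_root / key).is_file()]
```

An empty list only shows that the keys exist; comparing their values against the intended settings is a separate check.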

Fill in the D+1 row (2026-04-20) in the daily checks log:

- HTTP: ~1564 req/s, UDP: ~1015 req/s, total ~2579 req/s (~322/vCPU)
- Host load: 6.05/5.49/4.80
- UDP newTrackon uptime: 83.9% (includes resize downtime + conntrack overflow period; fix applied same day)

Update the pre/post comparison table with available metrics and mark the decision as "partial": resize alone was insufficient, conntrack overflow was the actual bottleneck. Follow-up plan added.

Refs: #21

…window

- Fill D+3–D+7 daily log rows; all post-fix days show uptime recovering to 99.9% by D+7 (2026-04-27)
- Add D+7 newTrackon snapshot: both HTTP and UDP trackers at 99.90%
- Add D+7 live conntrack verification: table at 32.6% utilization, no table-full dmesg events, zero IPv4 UDP receive-buffer errors
- Flip decision in 03-pre-post-comparison.md from Partial → Success
- Update main issue doc Current State to 2026-04-27 RESOLVED

Refs: #21
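
The utilization figure in this commit is the ratio of nf_conntrack_count to nf_conntrack_max. The sketch below shows one way to compute it and compare it with a fill-level threshold; the function names are illustrative, and the 0.80 default mirrors the 80% alerting rule mentioned in the related blog post.

```python
from pathlib import Path

NETFILTER_PROC = Path("/proc/sys/net/netfilter")


def conntrack_utilization(count: int, maximum: int) -> float:
    """Fraction of the conntrack table currently in use."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return count / maximum


def should_alert(count: int, maximum: int, threshold: float = 0.80) -> bool:
    """True when the table is at or above `threshold` of its capacity."""
    return conntrack_utilization(count, maximum) >= threshold


def read_live_utilization(base: Path = NETFILTER_PROC) -> float:
    """Read the live counters; requires the nf_conntrack module to be loaded."""
    count = int((base / "nf_conntrack_count").read_text().strip())
    maximum = int((base / "nf_conntrack_max").read_text().strip())
    return conntrack_utilization(count, maximum)
```

With nf_conntrack_max=1048576, a reading of 32.6% corresponds to roughly 342,000 tracked flows, well short of the 80% alert level.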

- add a permanent runbook for diagnosing and fixing UDP conntrack saturation
- add a reusable workspace skill for checking conntrack-related UDP loss
- link infrastructure and issue docs to the new canonical guidance
- add shared spell-check terms for the new runbook and skill

Refs: #21

josecelano · 2026-04-27T10:16:57Z

ACK dcc3cc3

…UDP tracker

2762210 docs: add blog post on nf_conntrack overflow with Docker UDP tracker (Jose Celano)

Pull request description:

## Summary

Adds a new blog post documenting the `nf_conntrack` table exhaustion problem that caused UDP tracker downtime on both the DigitalOcean and Hetzner Torrust demos.

## What the post covers

- **Mechanism**: how Docker bridge DNAT forces connection tracking for UDP flows, and why the table fills under tracker load
- **Symptom**: UDP availability drops while HTTP stays healthy, self-recovering outages, application log completely silent
- **Diagnosis**: `dmesg`, `/proc/sys/net/netfilter/nf_conntrack_count`, `conntrack -S`
- **Our experience**: three incidents across two demos (DigitalOcean × 2, Hetzner × 1); post-fix UDP uptime confirmed at 99.9%
- **The fix**: three-parameter sysctl config (`nf_conntrack_max`, `udp_timeout`, `udp_timeout_stream`) + module pre-load for reboot persistence
- **Hash table sizing**: `nf_conntrack_buckets` / `hashsize` to avoid O(n) lookup degradation after raising the ceiling
- **Reboot persistence trap**: why sysctl settings silently vanish after reboot without `modules-load.d`
- **Alternative approaches**: host networking (`--network=host`), `NOTRACK` rules (with real-world failure story from torrust/torrust-demo#72), and macvlan
- **Monitoring**: `conntrack -S` early_drop counter, 80% fill-level alerting rule
- **Independent documentation**: links to the Aquatic tracker Docker guide that covers the same problem

## Related issues

- torrust/torrust-demo#26: first occurrence (DigitalOcean)
- torrust/torrust-demo#72: second occurrence + failed NOTRACK attempt
- torrust/torrust-tracker-demo#21: third occurrence (Hetzner)
- torrust/torrust-tracker-demo#22: PR that deployed the fix

ACKs for top commit:

josecelano: ACK 2762210

Tree-SHA512: 593ac524b72d051b0330ec3a6cd006e155e56ac3aa17ffc03b426936c0c9f5313391f2920f604b55aad29e2bb82e3dea428fd1b1d9dfd691e28e04666b0cf2b2

josecelano added 3 commits April 13, 2026 16:31

docs(issue-21): record measured pre-resize load baseline

1a84f3b

Refs: #21

docs(issue-21): add resize execution runbook

56414cf

Refs: #21

docs(issue-21): document post-resize validation results

90e0653

Refs: #21

josecelano self-assigned this Apr 13, 2026

josecelano requested review from cgbosse and da2ce7 April 13, 2026 15:54

josecelano added 3 commits April 13, 2026 16:56

josecelano mentioned this pull request Apr 20, 2026

New article: nf_conntrack overflow causes intermittent UDP tracker downtime with Docker torrust/torrust-website#192

Closed

josecelano added 4 commits April 21, 2026 08:28

docs(issue-21): record D+2 live UDP verification state

62fbef5

docs(conntrack): add RPS/RFS softirq steering how-to to UDP runbook

dcc3cc3

josecelano marked this pull request as ready for review April 27, 2026 10:16

josecelano merged commit 4a8d1fc into main Apr 27, 2026
2 checks passed

josecelano linked an issue Apr 27, 2026 that may be closed by this pull request

chore(infra): scale up demo server to improve UDP uptime #21

Closed

6 tasks

This was referenced Apr 27, 2026

chore(infra): scale up demo server to improve UDP uptime #21

Closed

docs: add blog post on nf_conntrack overflow with Docker UDP tracker torrust/torrust-website#193

Merged

josecelano merged 10 commits into main from issue-21-scale-up-server
