
feat(issue-21): scale up server CCX23 → CCX33 for better UDP uptime#22

Merged: josecelano merged 10 commits into main from issue-21-scale-up-server on Apr 27, 2026

@josecelano josecelano commented Apr 13, 2026

Summary

Scales the Hetzner server from CCX23 (4 vCPU, 16 GB RAM) to CCX33 (8 vCPU,
32 GB RAM) to address the UDP uptime issues tracked in #19 and #21. The
observation window is complete. This PR includes the full evidence trail,
the conntrack fix required to sustain uptime, and permanent operational
documentation.

What Happened

The resize alone was not sufficient. A secondary root cause was discovered
during the observation window: Docker DNAT creates one conntrack entry per UDP
flow (each distinct client address and port), and UDP entries are removed only
when their timeout expires. With the default nf_conntrack_max=262144 and a
120 s UDP stream timeout, the conntrack table filled under load and the kernel
silently dropped packets.
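
A rough way to reason about this is Little's law: the average number of live entries is about the rate of new entries multiplied by how long each one lives. The sketch below applies that to the old and new settings. The flow rate of 3000 new flows per second is a hypothetical input chosen for illustration, not a measured value from this server, and the model assumes every entry lives for the full stream timeout.

```python
def steady_state_entries(new_flows_per_sec: float, timeout_s: float) -> float:
    """Little's law: average live entries = arrival rate x lifetime."""
    return new_flows_per_sec * timeout_s


def fill_ratio(new_flows_per_sec: float, timeout_s: float, table_max: int) -> float:
    """Estimated occupancy as a fraction of the table (above 1.0 means overflow)."""
    return steady_state_entries(new_flows_per_sec, timeout_s) / table_max


# Hypothetical rate of new conntrack entries per second (assumption).
HYPOTHETICAL_FLOWS_PER_SEC = 3000

# Old settings: default table size, 120 s stream timeout.
before = fill_ratio(HYPOTHETICAL_FLOWS_PER_SEC, 120, 262_144)
# New settings: 4x table size, 15 s stream timeout.
after = fill_ratio(HYPOTHETICAL_FLOWS_PER_SEC, 15, 1_048_576)

print(f"old settings: {before:.0%} of table")
print(f"new settings: {after:.0%} of table")
```

Under this model, shortening the timeout and enlarging the table both reduce the fill ratio, and the two effects multiply.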

Fix applied (2026-04-20):

  • nf_conntrack_max=1048576 (4× previous)
  • nf_conntrack_udp_timeout=10
  • nf_conntrack_udp_timeout_stream=15
  • nf_conntrack kernel module pre-loaded via /etc/modules-load.d/conntrack.conf
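
One way to confirm the values are live, for example after a reboot, is to read them back from `/proc/sys/net/netfilter`. The following is a minimal sketch, not the check used in this PR; the `check_tunables` helper name is hypothetical. A missing key is reported separately because the `net.netfilter.*` keys do not exist until the `nf_conntrack` module is loaded.

```python
from pathlib import Path

# Values taken from the fix list above.
EXPECTED = {
    "nf_conntrack_max": 1048576,
    "nf_conntrack_udp_timeout": 10,
    "nf_conntrack_udp_timeout_stream": 15,
}


def check_tunables(netfilter_dir="/proc/sys/net/netfilter"):
    """Return a list of human-readable problems; an empty list means all match."""
    problems = []
    base = Path(netfilter_dir)
    for name, want in EXPECTED.items():
        path = base / name
        if not path.exists():
            problems.append(f"{name}: missing (is the nf_conntrack module loaded?)")
            continue
        got = int(path.read_text().strip())
        if got != want:
            problems.append(f"{name}: expected {want}, found {got}")
    return problems


if __name__ == "__main__":
    for line in check_tunables() or ["all conntrack tunables match"]:
        print(line)
```

The directory is a parameter so the check can be exercised against a temporary directory as well as the real `/proc` tree.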

After this fix, UDP uptime rose from ~92% to 99.90% and has held there for
the full 7-day post-fix window.

Outcome

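| Item              | Before    | After     |
|-------------------|-----------|-----------|
| Plan              | CCX23     | CCX33     |
| vCPU              | 4         | 8         |
| RAM               | 16 GB     | 32 GB     |
| Traffic           | 20 TB     | 30 TB     |
| Price             | €31.49/mo | €62.49/mo |
| HTTP req/s (peak) | ~1350     | ~2000     |
| UDP req/s (peak)  | ~1507     | ~750      |
| UDP uptime        | ~92.20%   | 99.90%    |
| HTTP uptime       | ~99.90%   | 99.90%    |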

Acceptance Criteria

  • UDP newTrackon uptime ≥ 99.0% over rolling 7 days post-fix — 99.90% achieved
  • UDP buffer error counters remain near zero after the server has been under load
  • Host load average stays below 70% of available capacity
  • No new service degradation observed in HTTP tracker
  • Pre/post comparison documented in 03-pre-post-comparison.md
  • Resize workflow skill added and referenced
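
The UDP buffer error criterion can be checked from the kernel's `Udp:` counters in `/proc/net/snmp`, which are laid out as a header line followed by a value line. The sketch below parses that layout; the `udp_counters` helper and the sample text are illustrative assumptions, and the numbers in the sample are made up rather than taken from this server.

```python
def udp_counters(snmp_text: str) -> dict:
    """Parse the 'Udp:' header/value line pair from /proc/net/snmp content."""
    # 'UdpLite:' lines are excluded because they do not start with 'Udp:'.
    rows = [line.split() for line in snmp_text.splitlines() if line.startswith("Udp:")]
    if len(rows) < 2:
        raise ValueError("expected a 'Udp:' header line and a 'Udp:' value line")
    header, values = rows[0], rows[1]
    return dict(zip(header[1:], (int(v) for v in values[1:])))


# Illustrative sample only (made-up values).
SAMPLE = """\
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
Udp: 1000 2 0 990 0 0 0 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
UdpLite: 0 0 0 0 0 0 0 0
"""

if __name__ == "__main__":
    counters = udp_counters(SAMPLE)
    print("RcvbufErrors:", counters["RcvbufErrors"])
    print("SndbufErrors:", counters["SndbufErrors"])
```

On a host, the same function can be given the contents of `/proc/net/snmp`; comparing two readings taken some time apart shows whether the error counters are still growing.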

Changes

Evidence trail:

  • docs/issues/ISSUE-21-scale-up-server-for-udp-uptime.md — issue spec, now marked RESOLVED
  • docs/issues/evidence/ISSUE-21/00-pre-resize-baseline.md — pre-resize Prometheus measurements
  • docs/issues/evidence/ISSUE-21/01-resize-execution.md — full resize log
  • docs/issues/evidence/ISSUE-21/02-post-resize-daily-checks.md — 7-day daily log (D+1–D+7 filled)
  • docs/issues/evidence/ISSUE-21/03-pre-post-comparison.md — pre/post comparison, decision: Success

Server configuration (deployed and in-repo):

  • server/etc/sysctl.d/99-conntrack.conf — conntrack kernel parameters
  • server/etc/modules-load.d/conntrack.conf — ensures nf_conntrack loads at boot
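
For orientation, the sketch below renders what two such files could look like, using the values from the fix list and the standard `net.netfilter.*` sysctl key names. The exact contents of the files in the repository are not shown on this page, so treat the output as an assumption about their shape, not a copy; the helper names are hypothetical.

```python
# Values from the fix list in this PR; key names are the standard sysctl keys.
SYSCTL_SETTINGS = {
    "net.netfilter.nf_conntrack_max": 1048576,
    "net.netfilter.nf_conntrack_udp_timeout": 10,
    "net.netfilter.nf_conntrack_udp_timeout_stream": 15,
}


def render_sysctl_conf(settings: dict) -> str:
    """sysctl.d format: one 'key = value' pair per line."""
    return "".join(f"{key} = {value}\n" for key, value in settings.items())


def render_modules_load_conf(modules: list) -> str:
    """modules-load.d format: one kernel module name per line."""
    return "".join(f"{name}\n" for name in modules)


if __name__ == "__main__":
    print("# sysctl.d/99-conntrack.conf (assumed shape)")
    print(render_sysctl_conf(SYSCTL_SETTINGS), end="")
    print("# modules-load.d/conntrack.conf (assumed shape)")
    print(render_modules_load_conf(["nf_conntrack"]), end="")
```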

Permanent operational documentation:

  • docs/udp-conntrack-runbook.md — how to detect, fix, and validate conntrack saturation and softirq imbalance (including RPS/RFS how-to)
  • .github/skills/check-udp-conntrack/skill.md — agent workflow for future conntrack health checks

Infrastructure docs updated:

  • docs/infrastructure.md — updated traffic figures and added runbook link
  • docs/infrastructure-resize-history.md — new file; resize events log with links

Refs: #21

@josecelano josecelano self-assigned this Apr 13, 2026
@josecelano josecelano requested review from cgbosse and da2ce7 April 13, 2026 15:54
Documents the full resize workflow: pre-resize baseline capture, graceful shutdown, Hetzner panel action by human operator, post-resize recovery and validation, evidence capture, and 7-day observation period.

Notes the key behaviour that Hetzner in-place resizes preserve all IP addresses (public, private, and Floating IPs), so no DNS or IP reassignment is needed.

Refs: #21
The default conntrack table (262144 entries) fills up under sustained
UDP tracker load, causing "nf_conntrack: table full, dropping packet"
kernel errors and intermittent UDP timeouts on uptime monitors.

Applied kernel tunables:
- nf_conntrack_max: 262144 → 1048576 (4x increase)
- nf_conntrack_udp_timeout_stream: 120 s → 15 s (8x reduction)
- nf_conntrack_udp_timeout: 30 s → 10 s

Added /etc/modules-load.d/conntrack.conf to pre-load the nf_conntrack
module at boot so sysctl settings are applied before Docker starts.
Without this, net.netfilter.* keys don't exist when sysctl runs and
the settings are silently skipped after a reboot.

Refs: #21
Fill in the D+1 row (2026-04-20) in the daily checks log:
- HTTP: ~1564 req/s, UDP: ~1015 req/s, total ~2579 req/s (~322/vCPU)
- Host load: 6.05/5.49/4.80
- UDP newTrackon uptime: 83.9% (includes resize downtime + conntrack
  overflow period; fix applied same day)

Update the pre/post comparison table with available metrics and mark
the decision as "partial" — resize alone was insufficient, conntrack
overflow was the actual bottleneck. Follow-up plan added.

Refs: #21
…window

- Fill D+3–D+7 daily log rows; all post-fix days show uptime recovering
  to 99.9% by D+7 (2026-04-27)
- Add D+7 newTrackon snapshot: both HTTP and UDP trackers at 99.90%
- Add D+7 live conntrack verification: table at 32.6% utilization,
  no table-full dmesg events, zero IPv4 UDP receive-buffer errors
- Flip decision in 03-pre-post-comparison.md from Partial → Success
- Update main issue doc Current State to 2026-04-27 RESOLVED
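
The utilization figure is the ratio of `nf_conntrack_count` to `nf_conntrack_max`. The sketch below computes it from `/proc/sys/net/netfilter` and also converts the 32.6% reading back into an entry count, assuming the percentage refers to the raised limit of 1048576. The helper names are hypothetical.

```python
from pathlib import Path


def read_utilization(netfilter_dir="/proc/sys/net/netfilter") -> float:
    """Return nf_conntrack_count / nf_conntrack_max as a fraction."""
    base = Path(netfilter_dir)
    count = int((base / "nf_conntrack_count").read_text().strip())
    maximum = int((base / "nf_conntrack_max").read_text().strip())
    return count / maximum


def entries_at(utilization: float, table_max: int) -> int:
    """Convert a utilization fraction back into an approximate entry count."""
    return round(utilization * table_max)


if __name__ == "__main__":
    # Assumption: the 32.6% reading is relative to nf_conntrack_max=1048576.
    entries = entries_at(0.326, 1_048_576)
    print("approximate live entries:", entries)
    print("would exceed the old 262144 limit:", entries > 262_144)
```

If that assumption holds, the table was carrying more entries at D+7 than the old default limit could hold, which is consistent with the earlier overflow.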

Refs: #21
- add a permanent runbook for diagnosing and fixing UDP conntrack saturation
- add a reusable workspace skill for checking conntrack-related UDP loss
- link infrastructure and issue docs to the new canonical guidance
- add shared spell-check terms for the new runbook and skill

Refs: #21
@josecelano josecelano marked this pull request as ready for review April 27, 2026 10:16
@josecelano (Member, Author)

ACK dcc3cc3

@josecelano josecelano merged commit 4a8d1fc into main Apr 27, 2026
2 checks passed
@josecelano josecelano linked an issue Apr 27, 2026 that may be closed by this pull request
6 tasks
josecelano added a commit to torrust/torrust-website that referenced this pull request Apr 27, 2026
…UDP tracker

2762210 docs: add blog post on nf_conntrack overflow with Docker UDP tracker (Jose Celano)

Pull request description:

  ## Summary

  Adds a new blog post documenting the `nf_conntrack` table exhaustion problem that caused UDP tracker downtime on both the DigitalOcean and Hetzner Torrust demos.

  ## What the post covers

  - **Mechanism** — how Docker bridge DNAT forces connection tracking for UDP flows, and why the table fills under tracker load
  - **Symptom** — UDP availability drops while HTTP stays healthy, self-recovering outages, application log completely silent
  - **Diagnosis** — `dmesg`, `/proc/sys/net/netfilter/nf_conntrack_count`, `conntrack -S`
  - **Our experience** — three incidents across two demos (DigitalOcean × 2, Hetzner × 1); post-fix UDP uptime confirmed at 99.9%
  - **The fix** — three-parameter sysctl config (`nf_conntrack_max`, `udp_timeout`, `udp_timeout_stream`) + module pre-load for reboot persistence
  - **Hash table sizing** — `nf_conntrack_buckets` / `hashsize` to avoid O(n) lookup degradation after raising the ceiling
  - **Reboot persistence trap** — why sysctl settings silently vanish after reboot without `modules-load.d`
  - **Alternative approaches** — host networking (`--network=host`), `NOTRACK` rules (with real-world failure story from torrust/torrust-demo#72), and macvlan
  - **Monitoring** — `conntrack -S` early_drop counter, 80% fill-level alerting rule
  - **Independent documentation** — links to the Aquatic tracker Docker guide that covers the same problem
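
As a sketch of the monitoring idea, the code below sums the per-CPU counters that `conntrack -S` prints as `key=value` tokens and applies an 80% fill-level threshold. The sample text is illustrative with made-up numbers, the set of fields varies between conntrack-tools versions, and the helper names are hypothetical.

```python
from collections import Counter

FILL_ALERT_THRESHOLD = 0.80  # the 80% fill level mentioned in the post summary


def sum_conntrack_stats(text: str) -> Counter:
    """Sum key=value counters across all per-CPU lines, ignoring the cpu id."""
    totals = Counter()
    for line in text.splitlines():
        for token in line.split():
            key, sep, value = token.partition("=")
            if sep and key != "cpu" and value.isdigit():
                totals[key] += int(value)
    return totals


def should_alert(count: int, maximum: int) -> bool:
    """True when the table is at or above the alert threshold."""
    return count / maximum >= FILL_ALERT_THRESHOLD


# Illustrative sample only (made-up values, two CPUs).
SAMPLE = """\
cpu=0 found=0 invalid=12 insert=0 insert_failed=0 drop=0 early_drop=3 error=0 search_restart=5
cpu=1 found=0 invalid=8 insert=0 insert_failed=0 drop=1 early_drop=4 error=0 search_restart=2
"""

if __name__ == "__main__":
    totals = sum_conntrack_stats(SAMPLE)
    print("early_drop total:", totals["early_drop"])
    print("drop total:", totals["drop"])
    print("alert at 900000/1048576:", should_alert(900_000, 1_048_576))
```

A non-zero and growing `early_drop` total suggests the kernel is evicting entries to make room, which can show up before the table-full message appears in `dmesg`.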

  ## Related issues

  - torrust/torrust-demo#26 — first occurrence (DigitalOcean)
  - torrust/torrust-demo#72 — second occurrence + failed NOTRACK attempt
  - torrust/torrust-tracker-demo#21 — third occurrence (Hetzner)
  - torrust/torrust-tracker-demo#22 — PR that deployed the fix

ACKs for top commit:
  josecelano:
    ACK 2762210

Tree-SHA512: 593ac524b72d051b0330ec3a6cd006e155e56ac3aa17ffc03b426936c0c9f5313391f2920f604b55aad29e2bb82e3dea428fd1b1d9dfd691e28e04666b0cf2b2

Development

Successfully merging this pull request may close these issues.

chore(infra): scale up demo server to improve UDP uptime
