60 changes: 60 additions & 0 deletions .github/skills/check-udp-conntrack/skill.md
@@ -0,0 +1,60 @@
---
name: check-udp-conntrack
description: Workflow for checking whether UDP packet loss or uptime degradation may be caused by conntrack saturation on the torrust-tracker-demo server. Use when diagnosing UDP timeouts, low newTrackon uptime, packet drops, conntrack pressure, UDP receive-buffer errors, or when validating whether conntrack tuning is still healthy.
metadata:
author: torrust
version: "1.0"
---

<!-- cspell:ignore Rcvbuf conntrack NoPorts -->

# Check UDP Conntrack

## Overview

Use this skill to investigate whether UDP instability is caused by kernel-side
conntrack saturation or related packet-path pressure.

The canonical human-facing reference is:

- `docs/udp-conntrack-runbook.md`

Keep durable explanations and operational guidance in that document. This skill
should stay focused on workflow and safe execution.

## When To Use

Use this skill when the user asks to:

- check whether conntrack is too small
- diagnose UDP timeouts or packet loss
- validate that current conntrack tuning is still active
- verify whether the server is dropping UDP packets
- assess whether current symptoms point to conntrack saturation or something else

## Workflow

1. Run the host checks from `docs/udp-conntrack-runbook.md`.
2. Summarize the results in terms of:
- conntrack occupancy
- presence or absence of `table full` events
- IPv4 and IPv6 UDP receive-buffer errors
- whether `NoPorts` counters are relevant or benign
3. Distinguish conntrack saturation from softirq/RX steering imbalance.
4. If the user asks to document the result, update the relevant issue evidence
or incident file and reference the runbook when appropriate.

## Interpretation Rules

- `nf_conntrack_count` near or equal to `nf_conntrack_max` means real pressure.
- Any fresh `nf_conntrack: table full, dropping packet` message is a confirmed problem.
- `UdpRcvbufErrors` or `Udp6RcvbufErrors` increasing during the incident means packet loss below the application layer.
- `NoPorts` counters alone do not prove tracker loss.
- High load average with one CPU dominated by `%soft` points to softirq concentration, not necessarily conntrack exhaustion.
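
The occupancy rule above can be sketched as a small helper. This is a minimal sketch, not part of the runbook: the function name `conntrack_occupancy` and the 70%/90% thresholds are assumptions chosen for illustration; only the `/proc` paths in the comments are real kernel interfaces.

```shell
# Sketch: classify conntrack pressure from count/max (hypothetical helper;
# thresholds of 70% and 90% are assumed, not from the runbook).
conntrack_occupancy() {
  local count="$1" max="$2"
  local pct=$(( count * 100 / max ))   # integer percent occupancy
  if [ "$pct" -ge 90 ]; then
    echo "critical: ${pct}% (near table full)"
  elif [ "$pct" -ge 70 ]; then
    echo "warning: ${pct}%"
  else
    echo "ok: ${pct}%"
  fi
}

# On the server the two inputs would come from:
#   cat /proc/sys/net/netfilter/nf_conntrack_count
#   cat /proc/sys/net/netfilter/nf_conntrack_max
```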

## Safety Constraints

- Do not change sysctl values unless the user explicitly asks for a fix.
- If applying a fix, update both runtime state and persistent files when appropriate.
- Preserve issue-specific evidence in `docs/issues/evidence/ISSUE-<N>/`.
- Do not present the skill as the primary source of truth; the runbook in `docs/` is the canonical explanation.
206 changes: 206 additions & 0 deletions .github/skills/scale-up-server/skill.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,206 @@
---
name: scale-up-server
description: Step-by-step workflow for resizing (scaling up) the Hetzner server in the torrust-tracker-demo stack. Use when asked to resize, scale up, or upgrade the server plan. Covers pre-resize preparation, graceful shutdown, provider panel action, post-resize recovery, and evidence capture. Triggers on "resize server", "scale up", "upgrade server plan", "Hetzner resize", "change server type".
metadata:
author: torrust
version: "1.0"
---

<!-- cspell:ignore nproc Rcvbuf snmp nstat urlencode -->

# Scaling Up the Server

## Overview

This skill covers a **planned, live resize** of the Hetzner Cloud server:
shut down services gracefully, resize the instance in the provider panel,
restart services, and validate everything before re-opening to traffic.

> **Important**: Resizing a Hetzner Cloud server **does not change IP addresses**.
> Neither the public IPv4/IPv6 addresses nor any attached Floating IPs are
> affected. DNS records and Floating IP assignments do not need updating.
> This is standard cloud-provider behavior for in-place resizes.

## Responsibilities

| Step | Who |
| ----------------------------------- | ---------------------- |
| Capture pre-resize baseline | AI assistant |
| Graceful service shutdown | AI assistant (via SSH) |
| Resize in Hetzner Cloud panel | **Human operator** |
| Post-resize recovery and validation | AI assistant (via SSH) |
| Document evidence and commit | AI assistant |

---

## Workflow

### Step 1 — Capture pre-resize baseline

Before touching the server, record the current state so there is a before/after
reference. Save results to the issue-scoped evidence folder
(`docs/issues/evidence/ISSUE-<N>/00-pre-resize-baseline.md`).

```bash
# Host snapshot
ssh demotracker 'date -u; nproc; free -h; uptime; df -h'

# Docker services
ssh demotracker 'cd /opt/torrust && docker compose ps'

# Prometheus request rates (5m window)
ssh demotracker 'curl -sG "http://127.0.0.1:9090/api/v1/query" \
--data-urlencode "query=sum(rate(http_tracker_core_requests_received_total{server_binding_protocol=\"http\",server_binding_port=\"7070\"}[5m]))"'

ssh demotracker 'curl -sG "http://127.0.0.1:9090/api/v1/query" \
--data-urlencode "query=sum(rate(udp_tracker_server_requests_received_total{server_binding_protocol=\"udp\",server_binding_port=\"6969\"}[5m]))"'

# UDP buffer error counters
ssh demotracker 'grep "^Udp:" /proc/net/snmp; nstat -az 2>/dev/null | grep -Ei "UdpRcvbufErrors|Udp6RcvbufErrors" || true'
```

Commit the baseline file before proceeding to shutdown.

### Step 2 — Confirm readiness

Before shutting down:

- Baseline file is complete and committed.
- Branch is clean and pushed.
- Be aware of the nightly backup window (~03:00 UTC); prefer resizing outside it.
- Operator is available to complete the Hetzner panel action promptly.
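
The branch checks above can be scripted as a preflight gate. This is a sketch under assumptions: the function name `preflight_clean` is hypothetical, and it expects its inputs to be fed from `git status --porcelain` and `git rev-list --count @{u}..HEAD`.

```shell
# Sketch preflight gate (hypothetical helper): block shutdown if the branch
# has uncommitted changes or unpushed commits.
preflight_clean() {
  local porcelain="$1"   # output of: git status --porcelain
  local ahead="$2"       # output of: git rev-list --count @{u}..HEAD
  if [ -n "$porcelain" ]; then
    echo "BLOCK: uncommitted changes"
  elif [ "$ahead" -gt 0 ]; then
    echo "BLOCK: unpushed commits"
  else
    echo "OK: ready for shutdown"
  fi
}

# Usage (local checkout):
#   preflight_clean "$(git status --porcelain)" "$(git rev-list --count @{u}..HEAD)"
```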

### Step 3 — Graceful service shutdown (AI assistant)

Run from a local terminal. Capture the full output and record it in
`docs/issues/evidence/ISSUE-<N>/01-resize-execution.md`.

```bash
ssh demotracker 'set -e
echo "=== shutdown-start-utc ==="
date -u +%Y-%m-%dT%H:%M:%SZ
cd /opt/torrust
echo "=== docker-compose-ps-before ==="
docker compose ps
echo "=== docker-compose-down ==="
docker compose down
echo "=== docker-compose-ps-after ==="
docker compose ps
echo "=== shutdown-end-utc ==="
date -u +%Y-%m-%dT%H:%M:%SZ'
```

Confirm all containers are stopped and networks are removed before handing over.
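
The stopped-container confirmation can be made explicit with a small assertion. This is a sketch: `assert_all_stopped` is a hypothetical helper that takes the output of `docker ps -q` (empty when nothing is running) rather than querying Docker itself.

```shell
# Sketch: assert zero running containers before handing over to the operator
# (hypothetical helper; feed it the output of `ssh demotracker 'docker ps -q'`).
assert_all_stopped() {
  local running="$1"   # container IDs, one per line; empty means none running
  if [ -z "$running" ]; then
    echo "OK: no containers running"
  else
    echo "FAIL: containers still running"
  fi
}
```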

### Step 4 — Resize in Hetzner Cloud panel (human operator)

1. Log in to [Hetzner Cloud Console](https://console.hetzner.cloud/).
2. Navigate to the project and select the server (`torrust-tracker-demo` or similar).
3. Go to **Rescale** (or **Server type**) tab.
4. Select the target server type (e.g. CCX33) and confirm.
5. Wait for the resize to complete — typically under 2 minutes.
6. Power on the server if it does not start automatically.
7. Notify the AI assistant when the server is reachable again.

> No IP address changes are required. Floating IPs, public IPs, and private
> network IPs all remain the same after a Hetzner in-place resize.

### Step 5 — Post-resize recovery (AI assistant)

Start all services and capture the new host profile:

```bash
ssh demotracker 'set -e
echo "=== startup-utc ==="
date -u +%Y-%m-%dT%H:%M:%SZ
echo "=== host ==="
nproc; free -h; uptime
cd /opt/torrust
echo "=== docker-compose-up ==="
docker compose up -d
echo "=== docker-compose-ps ==="
docker compose ps'
```

### Step 6 — Post-resize validation (AI assistant)

Run all checks and record outputs in the execution log.

```bash
# Container health
ssh demotracker 'cd /opt/torrust && docker compose ps'

# UDP buffer counters (should be zero after fresh boot)
ssh demotracker 'grep "^Udp:" /proc/net/snmp; nstat -az 2>/dev/null | grep -Ei "UdpRcvbufErrors|Udp6RcvbufErrors" || true'

# Prometheus targets
ssh demotracker 'curl -sG "http://127.0.0.1:9090/api/v1/query" \
--data-urlencode "query=up{job=\"tracker_metrics\"}"
curl -sG "http://127.0.0.1:9090/api/v1/query" \
--data-urlencode "query=up{job=\"tracker_stats\"}"'
```

External checks (from local machine):

```bash
# HTTP tracker health
curl -fsS "https://http1.torrust-tracker-demo.com/health_check"

# Grafana (302 to /login is expected)
curl -I "https://grafana.torrust-tracker-demo.com"

# UDP port probe
nc -zvu udp1.torrust-tracker-demo.com 6969 2>&1 | head -5
```

All services must reach `healthy` status, HTTP health must return `200`,
and Prometheus targets must show `up=1` before the resize is considered
complete.

> The tracker API health endpoint (`/health_check` on `api.torrust-tracker-demo.com`)
> requires authentication and returns `500 unauthorized` without a token.
> This is expected and not a failure indicator.

### Step 7 — Document and commit

Fill in the execution log (`01-resize-execution.md`) with all checklist items,
the full timeline (start UTC / end UTC / total impact window), the command
outputs, and the validation results.

Run linters before committing:

```bash
./scripts/lint.sh
```

Commit with:

```bash
git commit -S -m "docs(issue-<N>): document resize execution and post-resize validation" \
-m "Refs: #<N>"
```

### Step 8 — Update infrastructure docs

After the resize is confirmed stable:

- Update the hardware table in `docs/infrastructure.md` to reflect the new
server type, vCPU count, RAM, storage, traffic allowance, and price.
- Add a row to `docs/infrastructure-resize-history.md` with the resize date,
old and new plan, throughput at resize time, normalized req/s per vCPU,
and a link to the related issue.

---

## Post-Resize Observation Period

After the resize, monitor for at least **7 days** before concluding success:

- Fill one row per day in `docs/issues/evidence/ISSUE-<N>/02-post-resize-daily-checks.md`
using the same Prometheus queries from Step 1.
- Check external uptime from [newTrackon](https://newtrackon.com/) or similar.
- Watch UDP buffer error counters for any resurgence.
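
The resurgence watch can be reduced to a daily delta between two counter snapshots. This is a sketch with an assumed helper name (`rcvbuf_delta`); the counter values would come from the `nstat -az` grep shown in Step 1 of the resize workflow.

```shell
# Sketch: detect UDP receive-buffer error resurgence between two daily
# snapshots of UdpRcvbufErrors (hypothetical helper).
rcvbuf_delta() {
  local yesterday="$1" today="$2"   # cumulative counter values
  local delta=$(( today - yesterday ))
  if [ "$delta" -gt 0 ]; then
    echo "resurgence: +${delta} UdpRcvbufErrors"
  else
    echo "stable: no new UdpRcvbufErrors"
  fi
}
```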

Once the observation window is complete, fill the final comparison table in
`docs/issues/evidence/ISSUE-<N>/03-pre-post-comparison.md` and decide whether
the resize meets the acceptance criteria.
10 changes: 6 additions & 4 deletions docs/infrastructure-resize-history.md
@@ -23,10 +23,10 @@ investigations (especially for UDP uptime on newTrackon).

## Timeline

| Date (UTC) | Change type | Server plan | vCPU | RAM | HTTP1 req/s | UDP1 req/s | Total req/s | Req/s per vCPU | UDP newTrackon uptime | Notes | Related |
| ---------- | --------------------- | ----------- | ---- | ----- | ----------- | ---------- | ----------- | -------------- | --------------------- | ------------------------------------------------------------------------------------ | ---------------------------------------------------------------- |
| 2026-04-13 | Baseline (pre-resize) | CCX23 | 4 | 16 GB | ~1300 | ~1500 | ~2800 | ~700 | 92.20% | High combined load. Capacity pressure suspected at current normalized request rate. | [#19](https://github.com/torrust/torrust-tracker-demo/issues/19) |
| 2026-04-13 | Planned target resize | CCX33 | 8 | 32 GB | ~1300 | ~1500 | ~2800 | ~350 | 92.20% | Selected next plan: 30 TB traffic, €0.100/h - €62.49/mo. Value assumes similar load. | [#21](https://github.com/torrust/torrust-tracker-demo/issues/21) |
| Date (UTC) | Change type | Server plan | vCPU | RAM | HTTP1 req/s | UDP1 req/s | Total req/s | Req/s per vCPU | UDP newTrackon uptime | Notes | Related |
| ---------- | --------------------- | ----------- | ---- | ----- | ----------- | ---------- | ----------- | -------------- | --------------------- | ----------------------------------------------------------------------------------------------- | ---------------------------------------------------------------- |
| 2026-04-13 | Baseline (pre-resize) | CCX23 | 4 | 16 GB | ~1350 | ~1507 | ~2857 | ~714 | 92.20% | Baseline from Prometheus 5m rate snapshot at 2026-04-13T15:27:46Z. Capacity pressure suspected. | [#19](https://github.com/torrust/torrust-tracker-demo/issues/19) |
| 2026-04-13 | Planned target resize | CCX33 | 8 | 32 GB | ~1350 | ~1507 | ~2857 | ~357 | 92.20% | Selected next plan: 30 TB traffic, €0.100/h - €62.49/mo. Assumes similar load after resize. | [#21](https://github.com/torrust/torrust-tracker-demo/issues/21) |

## Decision Criteria (Suggested)

@@ -39,5 +39,7 @@ investigations (especially for UDP uptime on newTrackon).

1. Track UDP uptime daily for at least 7 days.
2. Re-check host load and UDP receive buffer errors.
   For conntrack-specific diagnosis and remediation, use
   [udp-conntrack-runbook.md](udp-conntrack-runbook.md).
3. Compare tracker error/aborted counters before vs after resize.
4. Record final conclusion in this file and in the related issue.
2 changes: 2 additions & 0 deletions docs/infrastructure.md
@@ -6,6 +6,8 @@ For raw command outputs (`ip addr`, `df -h`, etc.) see
[infrastructure-raw-outputs.md](infrastructure-raw-outputs.md).
For server resize and observed request-rate history see
[infrastructure-resize-history.md](infrastructure-resize-history.md).
For UDP packet-loss diagnosis and conntrack tuning guidance see
[udp-conntrack-runbook.md](udp-conntrack-runbook.md).

## Server

43 changes: 30 additions & 13 deletions docs/issues/ISSUE-21-scale-up-server-for-udp-uptime.md
@@ -11,8 +11,8 @@
## Overview

Observed traffic and evidence suggest the current server size (CCX23, 4 vCPU,
16 GB RAM) is likely under pressure for current request volume (roughly
1300 HTTP req/s + 1500 UDP req/s).
16 GB RAM) is likely under pressure for current request volume (about
1350 HTTP req/s + 1507 UDP req/s at the latest baseline snapshot).

Current public uptime observed in newTrackon for UDP is below target:

@@ -21,22 +21,39 @@ Current public uptime observed in newTrackon for UDP is below target:
This issue tracks a controlled resize experiment to determine whether capacity
is the main bottleneck and to restore/maintain UDP uptime at or above 99%.

## Current State (2026-04-27) — RESOLVED

- Resize (CCX23 -> CCX33) complete and stable.
- Conntrack overflow root cause identified and fixed on 2026-04-20
(`nf_conntrack_max` 262144 → 1048576, UDP timeouts reduced, module pre-load
added).
- 7-day post-fix observation window complete.
- newTrackon rolling UDP uptime reached **99.9%** — above the 99.0% target.

Outcome: **Success**. See
[03-pre-post-comparison.md](evidence/ISSUE-21/03-pre-post-comparison.md) for
the final decision record. Permanent follow-up documentation now lives in
[udp-conntrack-runbook.md](../../udp-conntrack-runbook.md), with a reusable
workspace skill at `.github/skills/check-udp-conntrack/skill.md`.

## Goal

Increase UDP tracker uptime to at least 99.0% over a rolling 7-day window while
keeping service behavior stable.

## Current Throughput Baseline (Pre-Resize)

Observed request rates (Grafana, recent 3h window):
Observed request rates at baseline snapshot (`2026-04-13T15:27:46Z`):

- Source: Prometheus instant query using 5-minute rate windows

- HTTP1: ~1300 req/s
- UDP1: ~1500 req/s
- Combined: ~2800 req/s
- HTTP1: ~1350 req/s
- UDP1: ~1507 req/s
- Combined: ~2857 req/s

On the current CCX23 (4 vCPU), this is approximately:

- ~700 req/s per vCPU (combined)
- ~714 req/s per vCPU (combined)

This baseline must be preserved in the resize history so future sizing
decisions can be based on both absolute load and normalized load per vCPU.
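
The normalized figure can be reproduced with integer shell arithmetic. This is a sketch with a hypothetical helper name (`req_per_vcpu`); the inputs are the baseline numbers recorded above.

```shell
# Sketch: normalized load per vCPU from the baseline figures above
# (hypothetical helper; integer division, as used in the history table).
req_per_vcpu() {
  local total="$1" vcpus="$2"
  echo $(( total / vcpus ))
}

# ~2857 combined req/s on the 4-vCPU CCX23 gives ~714 req/s per vCPU;
# the same load on an 8-vCPU CCX33 would give ~357 req/s per vCPU.
```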
@@ -98,12 +115,12 @@ The next available option selected for this experiment is:

## Acceptance Criteria

- [ ] Resize executed and documented in resize history.
- [ ] No critical service regression immediately after resize.
- [ ] At least 7 days of post-resize observations recorded.
- [ ] UDP newTrackon uptime reaches and stays >= 99.0% during evaluation window.
- [ ] Pre/post comparison documented with clear conclusion.
- [ ] Resize workflow skill added and referenced.
- [x] Resize executed and documented in resize history.
- [x] No critical service regression immediately after resize.
- [x] At least 7 days of post-resize observations recorded.
- [x] UDP newTrackon uptime reaches and stays >= 99.0% during evaluation window.
- [x] Pre/post comparison documented with clear conclusion.
- [x] Resize workflow skill added and referenced.

## Possible Outcomes
