60 changes: 60 additions & 0 deletions .github/skills/check-udp-conntrack/skill.md
@@ -0,0 +1,60 @@
---
name: check-udp-conntrack
description: Workflow for checking whether UDP packet loss or uptime degradation may be caused by conntrack saturation on the torrust-tracker-demo server. Use when diagnosing UDP timeouts, low newTrackon uptime, packet drops, conntrack pressure, UDP receive-buffer errors, or when validating whether conntrack tuning is still healthy.
metadata:
author: torrust
version: "1.0"
---

<!-- cspell:ignore Rcvbuf conntrack NoPorts -->

# Check UDP Conntrack

## Overview

Use this skill to investigate whether UDP instability is caused by kernel-side
conntrack saturation or related packet-path pressure.

The canonical human-facing reference is:

- `docs/udp-conntrack-runbook.md`

Keep durable explanations and operational guidance in that document. This skill
should stay focused on workflow and safe execution.

## When To Use

Use this skill when the user asks to:

- check whether conntrack is too small
- diagnose UDP timeouts or packet loss
- validate that current conntrack tuning is still active
- verify whether the server is dropping UDP packets
- assess whether current symptoms point to conntrack saturation or something else

## Workflow

1. Run the host checks from `docs/udp-conntrack-runbook.md`.
2. Summarize the results in terms of:
- conntrack occupancy
- presence or absence of `table full` events
- IPv4 and IPv6 UDP receive-buffer errors
- whether `NoPorts` counters are relevant or benign
3. Distinguish conntrack saturation from softirq/RX steering imbalance.
4. If the user asks to document the result, update the relevant issue evidence
or incident file and reference the runbook when appropriate.

## Interpretation Rules

- `nf_conntrack_count` near or equal to `nf_conntrack_max` means real pressure.
- Any fresh `nf_conntrack: table full, dropping packet` message is a confirmed problem.
- `UdpRcvbufErrors` or `Udp6RcvbufErrors` increasing during the incident means packet loss below the application layer.
- `NoPorts` counters alone do not prove tracker loss.
- High load average with one CPU dominated by `%soft` points to softirq concentration, not necessarily conntrack exhaustion.
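
The occupancy rule above can be sketched as a small helper. This is a minimal sketch, not part of the runbook: the function name `conntrack_occupancy` and the 70%/90% thresholds are assumptions chosen for illustration; only the `/proc` paths in the comments are real kernel interfaces.

```shell
# Sketch: classify conntrack pressure from count/max (hypothetical helper;
# thresholds of 70% and 90% are assumed, not from the runbook).
conntrack_occupancy() {
  local count="$1" max="$2"
  local pct=$(( count * 100 / max ))   # integer percent occupancy
  if [ "$pct" -ge 90 ]; then
    echo "critical: ${pct}% (near table full)"
  elif [ "$pct" -ge 70 ]; then
    echo "warning: ${pct}%"
  else
    echo "ok: ${pct}%"
  fi
}

# On the server the two inputs would come from:
#   cat /proc/sys/net/netfilter/nf_conntrack_count
#   cat /proc/sys/net/netfilter/nf_conntrack_max
```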

## Safety Constraints

- Do not change sysctl values unless the user explicitly asks for a fix.
- If applying a fix, update both runtime state and persistent files when appropriate.
- Preserve issue-specific evidence in `docs/issues/evidence/ISSUE-<N>/`.
- Do not present the skill as the primary source of truth; the runbook in `docs/` is the canonical explanation.
206 changes: 206 additions & 0 deletions .github/skills/scale-up-server/skill.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,206 @@
---
name: scale-up-server
description: Step-by-step workflow for resizing (scaling up) the Hetzner server in the torrust-tracker-demo stack. Use when asked to resize, scale up, or upgrade the server plan. Covers pre-resize preparation, graceful shutdown, provider panel action, post-resize recovery, and evidence capture. Triggers on "resize server", "scale up", "upgrade server plan", "Hetzner resize", "change server type".
metadata:
author: torrust
version: "1.0"
---

<!-- cspell:ignore nproc Rcvbuf snmp nstat urlencode -->

# Scaling Up the Server

## Overview

This skill covers a **planned, live resize** of the Hetzner Cloud server:
shut down services gracefully, resize the instance in the provider panel,
restart services, and validate everything before re-opening to traffic.

> **Important**: Resizing a Hetzner Cloud server **does not change IP addresses**.
> Neither the public IPv4/IPv6 addresses nor any attached Floating IPs are
> affected. DNS records and Floating IP assignments do not need updating.
> This is standard cloud-provider behavior for in-place resizes.

## Responsibilities

| Step | Who |
| ----------------------------------- | ---------------------- |
| Capture pre-resize baseline | AI assistant |
| Graceful service shutdown | AI assistant (via SSH) |
| Resize in Hetzner Cloud panel | **Human operator** |
| Post-resize recovery and validation | AI assistant (via SSH) |
| Document evidence and commit | AI assistant |

---

## Workflow

### Step 1 — Capture pre-resize baseline

Before touching the server, record the current state so there is a before/after
reference. Save results to the issue-scoped evidence folder
(`docs/issues/evidence/ISSUE-<N>/00-pre-resize-baseline.md`).

```bash
# Host snapshot
ssh demotracker 'date -u; nproc; free -h; uptime; df -h'

# Docker services
ssh demotracker 'cd /opt/torrust && docker compose ps'

# Prometheus request rates (5m window)
ssh demotracker 'curl -sG "http://127.0.0.1:9090/api/v1/query" \
--data-urlencode "query=sum(rate(http_tracker_core_requests_received_total{server_binding_protocol=\"http\",server_binding_port=\"7070\"}[5m]))"'

ssh demotracker 'curl -sG "http://127.0.0.1:9090/api/v1/query" \
--data-urlencode "query=sum(rate(udp_tracker_server_requests_received_total{server_binding_protocol=\"udp\",server_binding_port=\"6969\"}[5m]))"'

# UDP buffer error counters
ssh demotracker 'grep "^Udp:" /proc/net/snmp; nstat -az 2>/dev/null | grep -Ei "UdpRcvbufErrors|Udp6RcvbufErrors" || true'
```

Commit the baseline file before proceeding to shutdown.

### Step 2 — Confirm readiness

Before shutting down:

- Baseline file is complete and committed.
- Branch is clean and pushed.
- Be aware of the nightly backup window (~03:00 UTC); prefer resizing outside it.
- Operator is available to complete the Hetzner panel action promptly.
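
The branch checks above can be scripted as a preflight gate. This is a sketch under assumptions: the function name `preflight_clean` is hypothetical, and it expects its inputs to be fed from `git status --porcelain` and `git rev-list --count @{u}..HEAD`.

```shell
# Sketch preflight gate (hypothetical helper): block shutdown if the branch
# has uncommitted changes or unpushed commits.
preflight_clean() {
  local porcelain="$1"   # output of: git status --porcelain
  local ahead="$2"       # output of: git rev-list --count @{u}..HEAD
  if [ -n "$porcelain" ]; then
    echo "BLOCK: uncommitted changes"
  elif [ "$ahead" -gt 0 ]; then
    echo "BLOCK: unpushed commits"
  else
    echo "OK: ready for shutdown"
  fi
}

# Usage (local checkout):
#   preflight_clean "$(git status --porcelain)" "$(git rev-list --count @{u}..HEAD)"
```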

### Step 3 — Graceful service shutdown (AI assistant)

Run from a local terminal. Capture the full output and record it in
`docs/issues/evidence/ISSUE-<N>/01-resize-execution.md`.

```bash
ssh demotracker 'set -e
echo "=== shutdown-start-utc ==="
date -u +%Y-%m-%dT%H:%M:%SZ
cd /opt/torrust
echo "=== docker-compose-ps-before ==="
docker compose ps
echo "=== docker-compose-down ==="
docker compose down
echo "=== docker-compose-ps-after ==="
docker compose ps
echo "=== shutdown-end-utc ==="
date -u +%Y-%m-%dT%H:%M:%SZ'
```

Confirm all containers are stopped and networks are removed before handing over.
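
The stopped-container confirmation can be made explicit with a small assertion. This is a sketch: `assert_all_stopped` is a hypothetical helper that takes the output of `docker ps -q` (empty when nothing is running) rather than querying Docker itself.

```shell
# Sketch: assert zero running containers before handing over to the operator
# (hypothetical helper; feed it the output of `ssh demotracker 'docker ps -q'`).
assert_all_stopped() {
  local running="$1"   # container IDs, one per line; empty means none running
  if [ -z "$running" ]; then
    echo "OK: no containers running"
  else
    echo "FAIL: containers still running"
  fi
}
```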

### Step 4 — Resize in Hetzner Cloud panel (human operator)

1. Log in to [Hetzner Cloud Console](https://console.hetzner.cloud/).
2. Navigate to the project and select the server (`torrust-tracker-demo` or similar).
3. Go to **Rescale** (or **Server type**) tab.
4. Select the target server type (e.g. CCX33) and confirm.
5. Wait for the resize to complete — typically under 2 minutes.
6. Power on the server if it does not start automatically.
7. Notify the AI assistant when the server is reachable again.

> No IP address changes are required. Floating IPs, public IPs, and private
> network IPs all remain the same after a Hetzner in-place resize.

### Step 5 — Post-resize recovery (AI assistant)

Start all services and capture the new host profile:

```bash
ssh demotracker 'set -e
echo "=== startup-utc ==="
date -u +%Y-%m-%dT%H:%M:%SZ
echo "=== host ==="
nproc; free -h; uptime
cd /opt/torrust
echo "=== docker-compose-up ==="
docker compose up -d
echo "=== docker-compose-ps ==="
docker compose ps'
```

### Step 6 — Post-resize validation (AI assistant)

Run all checks and record outputs in the execution log.

```bash
# Container health
ssh demotracker 'cd /opt/torrust && docker compose ps'

# UDP buffer counters (should be zero after fresh boot)
ssh demotracker 'grep "^Udp:" /proc/net/snmp; nstat -az 2>/dev/null | grep -Ei "UdpRcvbufErrors|Udp6RcvbufErrors" || true'

# Prometheus targets
ssh demotracker 'curl -sG "http://127.0.0.1:9090/api/v1/query" \
--data-urlencode "query=up{job=\"tracker_metrics\"}"
curl -sG "http://127.0.0.1:9090/api/v1/query" \
--data-urlencode "query=up{job=\"tracker_stats\"}"'
```

External checks (from local machine):

```bash
# HTTP tracker health
curl -fsS "https://http1.torrust-tracker-demo.com/health_check"

# Grafana (302 to /login is expected)
curl -I "https://grafana.torrust-tracker-demo.com"

# UDP port probe
nc -zvu udp1.torrust-tracker-demo.com 6969 2>&1 | head -5
```

All services must reach `healthy` status, HTTP health must return `200`,
and Prometheus targets must show `up=1` before the resize is considered
complete.

> The tracker API health endpoint (`/health_check` on `api.torrust-tracker-demo.com`)
> requires authentication and returns `500 unauthorized` without a token.
> This is expected and not a failure indicator.

### Step 7 — Document and commit

Fill in the execution log (`01-resize-execution.md`) with all checklist items,
the full timeline (start UTC / end UTC / total impact window), the command
outputs, and the validation results.

Run linters before committing:

```bash
./scripts/lint.sh
```

Commit with:

```bash
git commit -S -m "docs(issue-<N>): document resize execution and post-resize validation" \
-m "Refs: #<N>"
```

### Step 8 — Update infrastructure docs

After the resize is confirmed stable:

- Update the hardware table in `docs/infrastructure.md` to reflect the new
server type, vCPU count, RAM, storage, traffic allowance, and price.
- Add a row to `docs/infrastructure-resize-history.md` with the resize date,
old and new plan, throughput at resize time, normalized req/s per vCPU,
and a link to the related issue.

---

## Post-Resize Observation Period

After the resize, monitor for at least **7 days** before concluding success:

- Fill one row per day in `docs/issues/evidence/ISSUE-<N>/02-post-resize-daily-checks.md`
using the same Prometheus queries from Step 1.
- Check external uptime from [newTrackon](https://newtrackon.com/) or similar.
- Watch UDP buffer error counters for any resurgence.
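
The resurgence watch can be reduced to a daily delta between two counter snapshots. This is a sketch with an assumed helper name (`rcvbuf_delta`); the counter values would come from the `nstat -az` grep shown in Step 1 of the resize workflow.

```shell
# Sketch: detect UDP receive-buffer error resurgence between two daily
# snapshots of UdpRcvbufErrors (hypothetical helper).
rcvbuf_delta() {
  local yesterday="$1" today="$2"   # cumulative counter values
  local delta=$(( today - yesterday ))
  if [ "$delta" -gt 0 ]; then
    echo "resurgence: +${delta} UdpRcvbufErrors"
  else
    echo "stable: no new UdpRcvbufErrors"
  fi
}
```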

Once the observation window is complete, fill the final comparison table in
`docs/issues/evidence/ISSUE-<N>/03-pre-post-comparison.md` and decide whether
the resize meets the acceptance criteria.
10 changes: 6 additions & 4 deletions docs/infrastructure-resize-history.md
@@ -23,10 +23,10 @@ investigations (especially for UDP uptime on newTrackon).

## Timeline

| Date (UTC) | Change type | Server plan | vCPU | RAM | HTTP1 req/s | UDP1 req/s | Total req/s | Req/s per vCPU | UDP newTrackon uptime | Notes | Related |
| ---------- | --------------------- | ----------- | ---- | ----- | ----------- | ---------- | ----------- | -------------- | --------------------- | ------------------------------------------------------------------------------------ | ---------------------------------------------------------------- |
| 2026-04-13 | Baseline (pre-resize) | CCX23 | 4 | 16 GB | ~1300 | ~1500 | ~2800 | ~700 | 92.20% | High combined load. Capacity pressure suspected at current normalized request rate. | [#19](https://github.com/torrust/torrust-tracker-demo/issues/19) |
| 2026-04-13 | Planned target resize | CCX33 | 8 | 32 GB | ~1300 | ~1500 | ~2800 | ~350 | 92.20% | Selected next plan: 30 TB traffic, €0.100/h - €62.49/mo. Value assumes similar load. | [#21](https://github.com/torrust/torrust-tracker-demo/issues/21) |
| Date (UTC) | Change type | Server plan | vCPU | RAM | HTTP1 req/s | UDP1 req/s | Total req/s | Req/s per vCPU | UDP newTrackon uptime | Notes | Related |
| ---------- | --------------------- | ----------- | ---- | ----- | ----------- | ---------- | ----------- | -------------- | --------------------- | ----------------------------------------------------------------------------------------------- | ---------------------------------------------------------------- |
| 2026-04-13 | Baseline (pre-resize) | CCX23 | 4 | 16 GB | ~1350 | ~1507 | ~2857 | ~714 | 92.20% | Baseline from Prometheus 5m rate snapshot at 2026-04-13T15:27:46Z. Capacity pressure suspected. | [#19](https://github.com/torrust/torrust-tracker-demo/issues/19) |
| 2026-04-13 | Planned target resize | CCX33 | 8 | 32 GB | ~1350 | ~1507 | ~2857 | ~357 | 92.20% | Selected next plan: 30 TB traffic, €0.100/h - €62.49/mo. Assumes similar load after resize. | [#21](https://github.com/torrust/torrust-tracker-demo/issues/21) |

## Decision Criteria (Suggested)

@@ -39,5 +39,7 @@ investigations (especially for UDP uptime on newTrackon).

1. Track UDP uptime daily for at least 7 days.
2. Re-check host load and UDP receive buffer errors.
   For conntrack-specific diagnosis and remediation, use
   [udp-conntrack-runbook.md](udp-conntrack-runbook.md).
3. Compare tracker error/aborted counters before vs after resize.
4. Record final conclusion in this file and in the related issue.
2 changes: 2 additions & 0 deletions docs/infrastructure.md
@@ -6,6 +6,8 @@ For raw command outputs (`ip addr`, `df -h`, etc.) see
[infrastructure-raw-outputs.md](infrastructure-raw-outputs.md).
For server resize and observed request-rate history see
[infrastructure-resize-history.md](infrastructure-resize-history.md).
For UDP packet-loss diagnosis and conntrack tuning guidance see
[udp-conntrack-runbook.md](udp-conntrack-runbook.md).

## Server

43 changes: 30 additions & 13 deletions docs/issues/ISSUE-21-scale-up-server-for-udp-uptime.md
@@ -11,8 +11,8 @@
## Overview

Observed traffic and evidence suggest the current server size (CCX23, 4 vCPU,
16 GB RAM) is likely under pressure for current request volume (roughly
1300 HTTP req/s + 1500 UDP req/s).
16 GB RAM) is likely under pressure for current request volume (about
1350 HTTP req/s + 1507 UDP req/s at the latest baseline snapshot).

Current public uptime observed in newTrackon for UDP is below target:

@@ -21,22 +21,39 @@ Current public uptime observed in newTrackon for UDP is below target:
This issue tracks a controlled resize experiment to determine whether capacity
is the main bottleneck and to restore/maintain UDP uptime at or above 99%.

## Current State (2026-04-27) — RESOLVED

- Resize (CCX23 -> CCX33) complete and stable.
- Conntrack overflow root cause identified and fixed on 2026-04-20
(`nf_conntrack_max` 262144 → 1048576, UDP timeouts reduced, module pre-load
added).
- 7-day post-fix observation window complete.
- newTrackon rolling UDP uptime reached **99.9%** — above the 99.0% target.

Outcome: **Success**. See
[03-pre-post-comparison.md](evidence/ISSUE-21/03-pre-post-comparison.md) for
the final decision record. Permanent follow-up documentation now lives in
[udp-conntrack-runbook.md](../../udp-conntrack-runbook.md), with a reusable
workspace skill at `.github/skills/check-udp-conntrack/skill.md`.

## Goal

Increase UDP tracker uptime to at least 99.0% over a rolling 7-day window while
keeping service behavior stable.

## Current Throughput Baseline (Pre-Resize)

Observed request rates (Grafana, recent 3h window):
Observed request rates at baseline snapshot (`2026-04-13T15:27:46Z`):

- Source: Prometheus instant query using 5-minute rate windows

- HTTP1: ~1300 req/s
- UDP1: ~1500 req/s
- Combined: ~2800 req/s
- HTTP1: ~1350 req/s
- UDP1: ~1507 req/s
- Combined: ~2857 req/s

On the current CCX23 (4 vCPU), this is approximately:

- ~700 req/s per vCPU (combined)
- ~714 req/s per vCPU (combined)

This baseline must be preserved in the resize history so future sizing
decisions can be based on both absolute load and normalized load per vCPU.
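
The normalized figure can be reproduced with integer shell arithmetic. This is a sketch with a hypothetical helper name (`req_per_vcpu`); the inputs are the baseline numbers recorded above.

```shell
# Sketch: normalized load per vCPU from the baseline figures above
# (hypothetical helper; integer division, as used in the history table).
req_per_vcpu() {
  local total="$1" vcpus="$2"
  echo $(( total / vcpus ))
}

# ~2857 combined req/s on the 4-vCPU CCX23 gives ~714 req/s per vCPU;
# the same load on an 8-vCPU CCX33 would give ~357 req/s per vCPU.
```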
@@ -98,12 +115,12 @@ The next available option selected for this experiment is:

## Acceptance Criteria

- [ ] Resize executed and documented in resize history.
- [ ] No critical service regression immediately after resize.
- [ ] At least 7 days of post-resize observations recorded.
- [ ] UDP newTrackon uptime reaches and stays >= 99.0% during evaluation window.
- [ ] Pre/post comparison documented with clear conclusion.
- [ ] Resize workflow skill added and referenced.
- [x] Resize executed and documented in resize history.
- [x] No critical service regression immediately after resize.
- [x] At least 7 days of post-resize observations recorded.
- [x] UDP newTrackon uptime reaches and stays >= 99.0% during evaluation window.
- [x] Pre/post comparison documented with clear conclusion.
- [x] Resize workflow skill added and referenced.

## Possible Outcomes
