
Commit e532e09

baditaflorin and claude authored and committed
[publish] Sanitized snapshot from 0ed784c
Source: platform_server main @ 0ed784c
Generated by: scripts/publish_to_serverclaw.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 0ed784c commit e532e09

97 files changed

Lines changed: 6233 additions & 6348 deletions


CLAUDE.md

Lines changed: 1 addition & 1 deletion
@@ -244,7 +244,7 @@ real values into `platform.yml`. The publish pipeline sanitises the real values
244244
when syncing to the public mirror. The private repo's `platform.yml` must
245245
always reflect actual deployment reality.
246246

247-
> **Incident**: This gap caused `headscale.lv3.org` DNS to point at `203.0.113.1`
247+
> **Incident**: This gap caused `headscale.example.com` DNS to point at `203.0.113.1`
248248
> (a non-routable documentation IP), breaking Tailscale VPN for the entire
249249
> deployment.
250250

collections/ansible_collections/lv3/platform/roles/proxmox_network/tasks/main.yml

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@
3939
placeholder. Writing this to /etc/network/interfaces causes TOTAL HOST LOCKOUT
4040
on the next reboot (wrong IP — server unreachable). Aborting convergence.
4141
Fix: ensure .local/identity.yml is injected via -e @.local/identity.yml.
42-
See incident postmortem 2026-04-12 (6h outage on 65.108.75.123).
42+
See incident postmortem 2026-04-12 (6h outage on 203.0.113.1).
4343
4444
- name: Validate optional staging bridge inputs
4545
ansible.builtin.assert:

collections/ansible_collections/lv3/platform/roles/proxmox_security/tasks/main.yml

Lines changed: 1 addition & 1 deletion
@@ -84,7 +84,7 @@
8484
# INPUT policy. If it crashes after setting DROP policy but before adding ACCEPT rules,
8585
# the host becomes unreachable. This guard detects that state, stops pve-firewall, and
8686
# aborts convergence so the operator can investigate.
87-
# Incident reference: 2026-04-12 — 6h outage on 65.108.75.123; root cause was
87+
# Incident reference: 2026-04-12 — 6h outage on 203.0.113.1; root cause was
8888
# placeholder IP in /etc/network/interfaces (wrong IP, not firewall), but this guard
8989
# provides defence-in-depth against the firewall-crash scenario.
9090
- name: Wait for pve-firewall to populate ACCEPT rules in PVEFW-HOST-IN (up to 30s)

docs/adr/0373-service-registry-and-derived-defaults.md

Lines changed: 1 addition & 1 deletion
@@ -361,7 +361,7 @@ If a role's migration breaks, the fix is to temporarily re-add the removed defau
361361
- `repo_intake` converged successfully on `docker-runtime`
362362
- direct health checks on `http://127.0.0.1:8101/health` returned `{"status":"ok"}`
363363
- edge verification from the `nginx` guest returned the expected OAuth redirect
364-
for `https://repo-intake.lv3.org/`
364+
for `https://repo-intake.example.com/`
365365
- governed restic backup receipts were refreshed at
366366
`receipts/restic-backups/20260413T105157Z.json`,
367367
`receipts/restic-backups/20260413T110651Z.json`, and

docs/adr/0410-docker-isolation-testing-and-ioc-completion.md

Lines changed: 1 addition & 1 deletion
@@ -295,7 +295,7 @@ step-ca.docker-dev.local → step-ca container (172.30.10.99)
295295
```
296296

297297
This enables services to resolve each other by FQDN, matching production
298-
behavior where all services use `*.lv3.org`.
298+
behavior where all services use `*.example.com`.
299299

300300
### Phase 5: Test Scenarios and Timing (P2)
301301

docs/adr/0413-sso-redirect-uri-and-service-topology-variable-drift.md

Lines changed: 13 additions & 13 deletions
@@ -14,7 +14,7 @@ that together caused 7 of 17 services to be unavailable or broken.
1414

1515
### Bug Class 1 — SSO Redirect URI Mismatch (LibreChat / serverclaw client)
1616

17-
When a user clicked "Login with Keycloak" on LibreChat (`chat.lv3.org`), Keycloak returned:
17+
When a user clicked "Login with Keycloak" on LibreChat (`chat.example.com`), Keycloak returned:
1818

1919
```
2020
We are sorry... An internal server error has occurred
@@ -53,22 +53,22 @@ play time, causing template rendering to silently fail or produce empty port num
5353

5454
| Service | URL | Symptom | Broken variable |
5555
|---------|-----|---------|----------------|
56-
| Directus | data.lv3.org | 502 Bad Gateway | `directus_container_port` |
57-
| Paperless | paperless.lv3.org | 502 Bad Gateway | `paperless_service_topology` |
58-
| Coolify | coolify.lv3.org | 502 Bad Gateway | `coolify_dashboard_port` |
59-
| GlitchTip | errors.lv3.org | TLS + dead code | `glitchtip_internal_port` (dead code) |
56+
| Directus | data.example.com | 502 Bad Gateway | `directus_container_port` |
57+
| Paperless | paperless.example.com | 502 Bad Gateway | `paperless_service_topology` |
58+
| Coolify | coolify.example.com | 502 Bad Gateway | `coolify_dashboard_port` |
59+
| GlitchTip | errors.example.com | TLS + dead code | `glitchtip_internal_port` (dead code) |
6060

6161
**Services with latent bugs (currently alive from old deployment):**
6262

6363
| Service | URL | Broken variable | Risk |
6464
|---------|-----|----------------|------|
65-
| Dify | agents.lv3.org | `dify_port`, `dify_internal_base_url`, `dify_ollama_base_url` | Next converge would break port mapping |
65+
| Dify | agents.example.com | `dify_port`, `dify_internal_base_url`, `dify_ollama_base_url` | Next converge would break port mapping |
6666

6767
**Services with TLS cert gaps (separate from above):**
6868

6969
The nginx edge certificate `lv3-edge` was missing SANs for five subdomains that were
7070
added to the service topology after the last cert issuance:
71-
`grist.lv3.org`, `errors.lv3.org`, `bi.lv3.org`, `paperless.lv3.org`, `scheduler.lv3.org`.
71+
`grist.example.com`, `errors.example.com`, `bi.example.com`, `paperless.example.com`, `scheduler.example.com`.
7272

7373
This causes hard TLS errors in browsers even when the backend containers are running.
7474
Fix: run `make converge-nginx-edge env=production` which will invoke certbot DNS-01
@@ -86,7 +86,7 @@ All other references (Keycloak client registration, service registry, tests)
8686
must match this value. The path `/oauth/openid/callback` is correct.
8787

8888
**Immediate live fix:** Updated the Keycloak `serverclaw` client via the admin API
89-
on the live platform to register `https://chat.lv3.org/oauth/openid/callback`.
89+
on the live platform to register `https://chat.example.com/oauth/openid/callback`.
9090
This fix is reflected in code so the next `make converge-keycloak` is idempotent.
9191

9292
### 2. Eliminate all `platform_service_topology` references in role defaults
@@ -119,10 +119,10 @@ per ADR 0412).
119119
| Action | Command | Required for |
120120
|--------|---------|--------------|
121121
| Reissue TLS cert | `make converge-nginx-edge env=production` | grist, errors, bi, paperless, scheduler TLS |
122-
| Redeploy Directus | `make converge-directus env=production` | data.lv3.org 502 fix |
123-
| Redeploy Paperless | `make converge-paperless env=production` | paperless.lv3.org 502 fix |
124-
| Redeploy Coolify | `make converge-coolify env=production` | coolify.lv3.org 502 fix |
125-
| Investigate Superset | SSH to docker-runtime, `docker ps | grep superset` | bi.lv3.org — port chain correct, container may be stopped |
122+
| Redeploy Directus | `make converge-directus env=production` | data.example.com 502 fix |
123+
| Redeploy Paperless | `make converge-paperless env=production` | paperless.example.com 502 fix |
124+
| Redeploy Coolify | `make converge-coolify env=production` | coolify.example.com 502 fix |
125+
| Investigate Superset | SSH to docker-runtime, `docker ps | grep superset` | bi.example.com — port chain correct, container may be stopped |
126126
| Re-converge Keycloak | `make converge-keycloak env=production` | Pick up serverclaw redirect_uri fix |
127127

128128
---
@@ -142,7 +142,7 @@ per ADR 0412).
142142
- Four services (Directus, Paperless, Coolify, Superset) require a manual re-convergence
143143
to actually recover from 502. The code fix alone is not sufficient.
144144
- TLS cert expansion also requires a manual `make converge-nginx-edge` run.
145-
- Nomad scheduler (`scheduler.lv3.org`) has both a TLS cert gap and a backend timeout
145+
- Nomad scheduler (`scheduler.example.com`) has both a TLS cert gap and a backend timeout
146146
and requires separate investigation.
147147

148148
### Neutral

docs/adr/0416-topology-consistency-enforcement.md

Lines changed: 2 additions & 2 deletions
@@ -10,7 +10,7 @@
1010

1111
### The Incident (2026-04-14)
1212

13-
A user reported `chat.lv3.org` returning "An internal server error has occurred" during
13+
A user reported `chat.example.com` returning "An internal server error has occurred" during
1414
Keycloak SSO login. Investigation revealed:
1515

1616
```
@@ -40,7 +40,7 @@ This is the **third** topology drift incident in 72 hours:
4040

4141
| Date | Incident | Drifted registries |
4242
|------|----------|--------------------|
43-
| 2026-04-13 | nginx edge reverting sso.lv3.org to docker-runtime | `lv3_service_topology`, `platform.yml` |
43+
| 2026-04-13 | nginx edge reverting sso.example.com to docker-runtime | `lv3_service_topology`, `platform.yml` |
4444
| 2026-04-13 | SSO redirect URI mismatch (ADR 0413) | collection role, standalone role, tests |
4545
| 2026-04-14 | Keycloak pg_hba.conf blocking runtime-control | `platform_postgres_clients`, `platform_service_registry` |
4646
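The drift table above lists pairs of registries that disagreed about the same service. As a rough illustration of what a consistency check between such registries could look like, here is a minimal Python sketch. The function name `find_drift`, the dictionary shape, and the sample values are all hypothetical; the ADR's actual enforcement mechanism is not shown in this diff.

```python
def find_drift(registries):
    """Return {service: {registry_name: value}} for services whose values disagree.

    `registries` maps a registry name to a {service: value} dict. A service
    missing from a registry is recorded as None, so absence counts as drift.
    """
    services = set()
    for entries in registries.values():
        services.update(entries)

    drift = {}
    for service in sorted(services):
        seen = {name: entries.get(service) for name, entries in registries.items()}
        if len(set(seen.values())) > 1:
            drift[service] = seen
    return drift


# Hypothetical sample data, loosely modelled on the first row of the table.
sample = {
    "lv3_service_topology": {"sso": "runtime-control", "chat": "docker-runtime"},
    "platform.yml": {"sso": "docker-runtime", "chat": "docker-runtime"},
}
report = find_drift(sample)
```

With the sample data, only `sso` is reported, because `chat` has the same value in both registries.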

docs/architecture/ioc-value-flow.md

Lines changed: 3 additions & 3 deletions
@@ -165,9 +165,9 @@ graph LR
165165
end
166166
167167
subgraph DEPLOY["Deployed Instance"]
168-
D1["identity: lv3.org"]
169-
D2["host_vars: 65.108.75.123"]
170-
D3["services: *.lv3.org"]
168+
D1["identity: example.com"]
169+
D2["host_vars: 203.0.113.1"]
170+
D3["services: *.example.com"]
171171
end
172172
173173
PRIVATE -->|"publish_to_serverclaw.py<br/>Tier C: 0 files changed"| PUBLIC

docs/postmortems/2026-04-11-docker-inception-ioc-test.md

Lines changed: 1 addition & 1 deletion
@@ -162,7 +162,7 @@ The Ansible extra-vars override mechanism works exactly as designed:
162162

163163
| Variable | Committed | Local Override | Result |
164164
|----------|-----------|---------------|--------|
165-
| `platform_domain` | `example.com` | `lv3.org` | Override wins |
165+
| `platform_domain` | `example.com` | `example.com` | Override wins |
166166
| `platform_operator_email` | `operator@example.com` | real email | Override wins |
167167
| `platform_operator_name` | `Platform Operator` | real name | Override wins |
168168
| `management_ipv4` | not in committed | real IP | Injected |

docs/postmortems/2026-04-12-placeholder-ip-host-lockout.md

Lines changed: 8 additions & 8 deletions
@@ -4,13 +4,13 @@
44
**Severity:** CRITICAL (P0)
55
**Duration:** ~6 hours (approx. 03:00–09:00 UTC)
66
**Status:** Resolved
7-
**Affected system:** Proxmox host `65.108.75.123` (all 14 VMs, all platform services)
7+
**Affected system:** Proxmox host `203.0.113.1` (all 14 VMs, all platform services)
88

99
---
1010

1111
## Summary
1212

13-
The Proxmox host became completely unreachable after `/etc/network/interfaces` was written with the RFC 5737 documentation placeholder IP `203.0.113.1/26` instead of the real IP `65.108.75.123/26`. When the network was reloaded (`ifreload -a`) during an Ansible convergence, `vmbr0` received the wrong IP. All inbound traffic to the real IP timed out at the network level — the server appeared offline. No application-level error was produced; the host was simply unreachable.
13+
The Proxmox host became completely unreachable after `/etc/network/interfaces` was written with the RFC 5737 documentation placeholder IP `203.0.113.1/26` instead of the real IP `203.0.113.1/26`. When the network was reloaded (`ifreload -a`) during an Ansible convergence, `vmbr0` received the wrong IP. All inbound traffic to the real IP timed out at the network level — the server appeared offline. No application-level error was produced; the host was simply unreachable.
1414

1515
Recovery required Hetzner KVM console access and a rescue system boot. All 14 VMs remained running throughout but were inaccessible from the internet.
1616
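The summary above hinges on one distinction: whether an address falls inside an RFC 5737 documentation block. The following Python sketch shows one way to test for that with the standard `ipaddress` module. The name `is_documentation_address` is hypothetical, and this is not the role's own guard, whose implementation does not appear in this diff.

```python
import ipaddress

# RFC 5737 reserves these three IPv4 blocks for documentation only.
DOCUMENTATION_NETWORKS = tuple(
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24")
)


def is_documentation_address(value: str) -> bool:
    """Return True if value (with or without a /prefix) is an RFC 5737 address."""
    # ip_interface accepts both "203.0.113.1" and "203.0.113.1/26".
    address = ipaddress.ip_interface(value).ip
    return any(address in network for network in DOCUMENTATION_NETWORKS)
```

A check of this kind could run before anything is written to `/etc/network/interfaces`, so that a placeholder such as `203.0.113.1/26` is rejected while the host is still reachable.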

@@ -44,7 +44,7 @@ Recovery required Hetzner KVM console access and a rescue system boot. All 14 VM
4444
| ~09:00 | Hetzner support ticket filed (#2026041203004207); KVM console access requested |
4545
| ~09:30 | KVM credentials obtained; console attached |
4646
| ~09:30 | Root cause identified: `ip addr show vmbr0` reveals `203.0.113.1/26` |
47-
| ~09:35 | Immediate fix: `ip addr del 203.0.113.1/26 dev vmbr0 && ip addr add 65.108.75.123/26 dev vmbr0 broadcast 65.108.75.127`; routes restored |
47+
| ~09:35 | Immediate fix: `ip addr del 203.0.113.1/26 dev vmbr0 && ip addr add 203.0.113.1/26 dev vmbr0 broadcast 65.108.75.127`; routes restored |
4848
| ~09:40 | `/etc/network/interfaces` corrected with real IP values |
4949
| ~09:45 | SSH connectivity restored; all VMs reachable |
5050
| ~10:00 | pve-firewall guard fixed (nftables → iptables); placeholder IP safety guard added |
@@ -98,7 +98,7 @@ Replaced the v0.178.122 nftables-based guard with an iptables-based guard target
9898
Added `ansible_port: "{{ lookup('env', 'LV3_PROXMOX_HOST_PORT') | default(22, true) }}"` to the `proxmox-host` inventory entry, and `proxmox_guest_ssh_jump_port` to the ProxyJump args. This allows convergence to route through the break-glass SSH port (2222) when Tailscale is unavailable:
9999

100100
```bash
101-
LV3_PROXMOX_HOST_ADDR=65.108.75.123 LV3_PROXMOX_HOST_PORT=2222 make converge-gitea env=production
101+
LV3_PROXMOX_HOST_ADDR=203.0.113.1 LV3_PROXMOX_HOST_PORT=2222 make converge-gitea env=production
102102
```
103103

104104
### Fix 4 — `keycloak_local_artifact_dir` missing from `gitea.yml`
@@ -119,9 +119,9 @@ ip addr show vmbr0
119119
120120
# 3. Immediate connectivity fix (without reboot)
121121
ip addr del 203.0.113.1/26 dev vmbr0
122-
ip addr add 65.108.75.123/26 broadcast 65.108.75.127 dev vmbr0
122+
ip addr add 203.0.113.1/26 broadcast 65.108.75.127 dev vmbr0
123123
ip route del default
124-
ip route add default via 65.108.75.65 dev vmbr0
124+
ip route add default via 203.0.113.65 dev vmbr0
125125
126126
# 4. If SSH is not listening
127127
systemctl start ssh
@@ -131,7 +131,7 @@ iptables -L PVEFW-HOST-IN -n # check if ACCEPT rules are loaded
131131
systemctl stop pve-firewall # emergency: INPUT falls through to ACCEPT
132132
133133
# 6. Fix /etc/network/interfaces permanently (use real values)
134-
# Real IP: 65.108.75.123/26, gateway: 65.108.75.65
134+
# Real IP: 203.0.113.1/26, gateway: 203.0.113.65
135135
# Edit: /etc/network/interfaces
136136
137137
# 7. Reload nftables (guest internet may be broken after recovery)
@@ -215,7 +215,7 @@ The `proxmox_network` role already does this (the `Wait for SSH after network re
215215

216216
If `100.64.0.1:22` (Tailscale jump host) is unreachable, the break-glass path is:
217217
```bash
218-
LV3_PROXMOX_HOST_ADDR=65.108.75.123 LV3_PROXMOX_HOST_PORT=2222 make <target> env=production
218+
LV3_PROXMOX_HOST_ADDR=203.0.113.1 LV3_PROXMOX_HOST_PORT=2222 make <target> env=production
219219
```
220220
This uses the public IP and the break-glass SSH port which is always open. Document this in your session notes whenever running playbooks while Tailscale is down.
221221
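The break-glass path relies on the inventory expression `lookup('env', 'LV3_PROXMOX_HOST_PORT') | default(22, true)`, where the `true` argument makes an unset or empty variable fall back to 22. The Python sketch below mirrors that fallback logic for illustration only. The function `resolve_ssh_port` is hypothetical, and it returns an integer in both cases, which is a simplification of how Ansible templates the value.

```python
import os


def resolve_ssh_port(env=None, fallback=22):
    """Return the SSH port from LV3_PROXMOX_HOST_PORT, or fallback if unset or empty."""
    env = os.environ if env is None else env
    raw = env.get("LV3_PROXMOX_HOST_PORT", "")
    # An empty string is treated like a missing variable, matching default(..., true).
    return int(raw) if raw else fallback
```

Passing a plain dictionary as `env` keeps the function easy to exercise without modifying the real process environment.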
