doctor subcommand + uninstaller + installer hardening + VM CI by andrewroydshayes · Pull Request #1 · andrewroydshayes/rdc-proxy

andrewroydshayes · 2026-04-25T16:23:18Z

Summary

- New rdc-proxy doctor subcommand: read-only post-install diagnostic with 14 checks (host / network / TPROXY plumbing / runtime), with a --json mode for tooling. Catches the silent-failure mode where install.sh reports green but br0 was never actually created (the TPROXY rule and the ebtables broute rule both reference -i br0, so they're inert).
- New install/uninstall.sh reverses everything install.sh does, idempotently. Handles legacy filenames and restores netplan backups.
- Real installer fix: setup-bridge.sh now actively dispossesses NetworkManager (deletes the active connection profile per member) and netplan (rewrites YAML to renderer:networkd, backs up original). Renames its own systemd-networkd config files to 05-rdc-proxy-* so they win the lexical match against any 10-netplan-* files. Reloads systemd-networkd, brings members up, and polls until br0 exists with both members enslaved before exiting.
- install.sh self-check now asserts br0 exists AND every expected member is enslaved, not just that the rules are present.
- CI:
  - installer-smoke.yml (fast lane, every PR): runs shellcheck on install/*.sh, then runs doctor against an empty config dir on a bare runner and asserts it correctly reports FAIL with the expected check signature.
  - installer-vm.yml (slow lane): boots a Debian Bookworm aarch64 cloud image under qemu-system-aarch64 on the ubuntu-24.04-arm hosted runner, then runs install → doctor → uninstall → verify-clean. Currently gated on workflow_dispatch + cron only while iterating: hosted ARM runners don't expose /dev/kvm so we run TCG, and the qemu user-mode NAT topology doesn't gracefully survive the bridge takeover (eth0 dispossession kills the only ssh path). Fast lane gates PRs in the meantime.

Why

The installer was reporting success on hosts where another network manager (NetworkManager on Pi OS Bookworm, netplan on cloud Debian) already owned the bridge member interfaces. The systemd-networkd config got written but lost the priority contest, so br0 was never actually created — yet the TPROXY + ebtables rules referencing -i br0 got installed cleanly and the proxy started listening. From the install summary alone, everything looked green; in reality nothing was being intercepted. Doctor catches this now, the bridge setup actually solves it, and the fast-lane CI guarantees doctor's logic doesn't regress.
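The "priority contest" above comes down to systemd-networkd sorting all .network files by filename and applying the first one that matches an interface. A minimal sketch of that rule, assuming simple `Name=` glob patterns; the pattern values and the `05-rdc-proxy-members.network` filename are hypothetical, only the numeric prefixes come from this PR:

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def winning_network_file(iface, candidates):
    """Return the first .network file, in lexical filename order, whose
    Name= pattern matches iface, or None if nothing matches.

    candidates maps a file path to the glob from its [Match] Name= line.
    """
    for path in sorted(candidates, key=lambda p: PurePosixPath(p).name):
        if fnmatch(iface, candidates[path]):
            return path
    return None


# Hypothetical Name= patterns, for illustration only.
before = {
    "/run/systemd/network/10-netplan-eth0.network": "eth0",
    "/etc/systemd/network/20-br0-members.network": "eth*",
}
after = {
    "/run/systemd/network/10-netplan-eth0.network": "eth0",
    "/etc/systemd/network/05-rdc-proxy-members.network": "eth*",
}

print(winning_network_file("eth0", before))
print(winning_network_file("eth0", after))
```

With the old `20-` prefix the netplan-generated file is chosen for eth0; with the `05-` prefix the rdc-proxy file sorts first and is chosen instead.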

Test plan

- 77/77 unit tests pass (24 new for doctor)
- End-to-end manual run in qemu-aarch64 + Debian Bookworm cloud image (HVF-accelerated on local Mac): install reports clean, doctor green, uninstall clean
- installer-smoke workflow passes
- lint, test (3.10/3.11/3.12), shellcheck pass
- installer-vm workflow (slow lane): runs nightly, currently failing on bridge-takeover SSH disconnect under qemu user-mode NAT; tracked as follow-up

🤖 Generated with Claude Code

doctor: read-only post-install diagnostic. 14 checks across host / network / TPROXY / runtime, with per-check fix hints. Text and --json output. Pure-function checks behind a stubbable Probe so they're unit-testable without touching real system state. Surfaces failures the install.sh self-check misses, notably an install where the iptables/ebtables rules are correctly applied (matching `-i br0`) but br0 itself was never brought up by systemd-networkd, so no traffic ever flows. install.sh declares success in that case; doctor catches it.

uninstall.sh: reverses install.sh. Stops + disables the service, kills any leftover process, removes the unit file, deletes the iptables / ebtables / ip-rule entries, removes the systemd-networkd br0 configs, removes the NM unmanaged.conf, and removes /opt/rdc-proxy and /etc/rdc-proxy. Idempotent. Honors KEEP_CONFIG=1, PURGE_APT=1, YES=1. Does NOT tear down br0 itself at runtime (would drop SSH if you're on it), only its boot config.

CI: installer-smoke workflow runs shellcheck on install/*.sh and runs doctor against an empty config dir on a bare ubuntu-latest runner, asserting it correctly reports FAIL with the expected check names.

Verified end-to-end in qemu-aarch64 (Debian Bookworm cloud image, two virtio NICs, HVF accel): install.sh runs clean, doctor surfaces the missing-bridge bug, uninstall.sh removes everything cleanly.
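The "pure-function checks behind a stubbable Probe" idea from the doctor notes above might look roughly like this. It is a sketch only: `CheckResult`, `FakeProbe`, `check_bridge`, the probe method names, and the fix-hint text are all hypothetical, not the project's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CheckResult:
    name: str
    status: str  # "ok", "warn", or "fail"
    detail: str
    fix_hint: Optional[str] = None


class FakeProbe:
    """Test double: answers link queries from a dict instead of the host."""

    def __init__(self, links):
        self._links = links  # interface name -> master name (or None)

    def link_exists(self, iface):
        return iface in self._links

    def master_of(self, iface):
        return self._links.get(iface)


def check_bridge(probe, bridge, members):
    """Pure check: every fact about the system comes in through probe."""
    if not probe.link_exists(bridge):
        return CheckResult("bridge", "fail", f"{bridge} does not exist",
                           "re-run setup-bridge.sh")  # hypothetical hint
    loose = [m for m in members if probe.master_of(m) != bridge]
    if loose:
        return CheckResult("bridge", "fail",
                           f"not enslaved to {bridge}: {', '.join(loose)}",
                           "re-run setup-bridge.sh")  # hypothetical hint
    return CheckResult("bridge", "ok", f"{bridge} up with all members")


# The silent-failure case: rules may exist, but br0 was never created.
broken = FakeProbe({"eth0": None, "eth1": None})
healthy = FakeProbe({"br0": None, "eth0": "br0", "eth1": "br0"})
print(check_bridge(broken, "br0", ["eth0", "eth1"]).status)
print(check_bridge(healthy, "br0", ["eth0", "eth1"]).status)
```

Because the check never shells out itself, a unit test can describe a broken host as a plain dict and assert on the result.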

…comes up

Two bugs surfaced by `rdc-proxy doctor`:

1. setup-bridge.sh wrote /etc/systemd/network/{10,20,30}-br0*.network but on hosts where another manager already owned the would-be bridge members, our config never took effect:
   - Pi OS Bookworm: NetworkManager has an active connection on eth0, and writing unmanaged.conf alone doesn't release it (NM keeps the existing profile up).
   - Cloud Debian / Ubuntu Server: netplan compiles its YAML into /run/systemd/network/10-netplan-eth0.network at priority 10, which beats our 20-br0-members.network in the systemd-networkd match order. eth0 stays out of the bridge silently.

   Either way the install scripts reported success (TPROXY rule + ebtables broute rule + listening port all green) but `-i br0` matched nothing because br0 had no members. Same shape as the field symptom of "proxy running, no telemetry."
2. install.sh's self-check verified the rules exist but never confirmed br0 itself was up with members enslaved.

Fixes:

- setup-bridge.sh now actively dispossesses NetworkManager (delete the active connection profile on each member iface, then write unmanaged.conf so NM never reclaims) and netplan (rewrite the YAML to renderer:networkd with no per-iface stanzas, backing up the original).
- Renames the systemd-networkd config files to 05-rdc-proxy-* so they win the lexical match order against any 10-netplan-*.network that might still be in /run.
- Triggers `networkctl reload` after writing config and verifies br0 exists with all members enslaved before exiting (waits up to 45s total). Hard-fails with a diagnostic dump otherwise.
- install.sh self-check now asserts `ip link show br0` succeeds AND every expected member is enslaved.
- uninstall.sh removes the new 05-rdc-proxy-* filenames AND restores any netplan .rdc-backup files setup-bridge wrote.
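The "verify br0 exists with all members enslaved, waiting up to 45s" step above is a bounded poll. The real logic lives in setup-bridge.sh (shell); this Python sketch only illustrates the shape, and `wait_for_bridge` plus its parameters are hypothetical names. The readiness predicate is injected, so the loop itself makes no assumptions about how the bridge state is queried.

```python
import time


def wait_for_bridge(is_ready, timeout_s=45.0, interval_s=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout_s elapses.

    Returns True on success and False on timeout, so the caller can
    hard-fail with a diagnostic dump. clock and sleep are injectable
    so tests do not have to wait in real time.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller would pass a predicate that checks both conditions (bridge present, every expected member enslaved) and exit non-zero when the function returns False.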

Boots Debian Bookworm aarch64 cloud image under qemu+KVM on the ubuntu-24.04-arm hosted runner, then exercises the whole installer pipeline end-to-end:

install.sh -> doctor (must be ok=true and bridge up) -> uninstall.sh -> verify clean state

Triggers: PR/push touching install/, daily cron at 11:00 UTC, and manual workflow_dispatch. Complement to installer-smoke.yml's fast lane (shellcheck + doctor against an empty config).

Asserts the specific check names that matter: bridge, bridge_ip, iptables_mangle_tproxy, ebtables_broute, listen_port, service must be ok. rdc_reachable is allowed to warn since CI doesn't have a real generator wired in. Any fail anywhere uploads serial.log + the four captured outputs as a workflow artifact.
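The per-check assertion described above could be expressed as a small gate over the doctor's --json output. The check names are the ones listed in this commit message; the JSON shape (`{"ok": ..., "checks": [{"name": ..., "status": ...}]}`) is an assumption made for illustration and may not match the real schema.

```python
import json

REQUIRED_OK = ("bridge", "bridge_ip", "iptables_mangle_tproxy",
               "ebtables_broute", "listen_port", "service")
MAY_WARN = ("rdc_reachable",)


def violations(report_json):
    """Return human-readable problems; an empty list means the gate passes."""
    report = json.loads(report_json)
    status = {c["name"]: c["status"] for c in report["checks"]}
    problems = []
    for name in REQUIRED_OK:
        if status.get(name) != "ok":
            problems.append(
                f"{name}: expected ok, got {status.get(name, 'missing')}")
    for name in MAY_WARN:
        if status.get(name) not in ("ok", "warn"):
            problems.append(
                f"{name}: expected ok or warn, got {status.get(name, 'missing')}")
    return problems


# Hypothetical report: everything ok except the tolerated warning.
sample = json.dumps({
    "ok": True,
    "checks": [{"name": n, "status": "ok"} for n in REQUIRED_OK]
              + [{"name": "rdc_reachable", "status": "warn"}],
})
print(violations(sample))
```

Treating a missing check as a violation keeps the gate from passing silently if a check is renamed or dropped.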

The ubuntu-24.04-arm hosted runners don't expose /dev/kvm, so 'accel=kvm -cpu host' fails immediately. Detect /dev/kvm at runtime; use it when present, otherwise fall back to '-accel tcg,thread=multi -cpu cortex-a72' (Pi-4 profile, multi-threaded TCG which keeps guest/host arch aligned and runs at usable speed on the ARM host). Bumped timeout to 45 min for the TCG path.
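The runtime detection above reduces to one branch. A sketch, assuming the KVM case is spelled with qemu's `-accel kvm` form rather than the `accel=kvm` machine property quoted in the message; `qemu_accel_args` is a hypothetical helper name, and the workflow itself presumably does this in shell.

```python
import os


def qemu_accel_args(kvm_path="/dev/kvm", exists=os.path.exists):
    """Pick qemu accelerator and CPU flags based on KVM availability.

    exists is injectable so the choice can be tested on any host.
    """
    if exists(kvm_path):
        return ["-accel", "kvm", "-cpu", "host"]
    # Fallback quoted in the commit message: multi-threaded TCG, Pi-4-like CPU.
    return ["-accel", "tcg,thread=multi", "-cpu", "cortex-a72"]


print(qemu_accel_args())
```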

TCG is slow and variable, so a fixed 30s sleep between cloud-init's first ssh-up and the post-reboot ssh-up was racing — the probe step hit the VM mid-reboot and got 'kex_exchange_identification: read: Connection reset by peer'. Switch to polling /proc/uptime via ssh and continuing once it reports >= 90s. That guarantees we're past the reboot regardless of TCG performance.
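The uptime gate above only needs the first field of /proc/uptime, which is seconds since boot. A minimal sketch of the comparison; the function names are hypothetical and the text would come from something like `cat /proc/uptime` run over ssh.

```python
def uptime_seconds(proc_uptime_text):
    """First field of /proc/uptime is seconds since boot."""
    return float(proc_uptime_text.split()[0])


def past_threshold(proc_uptime_text, min_uptime_s=90.0):
    """True once the guest reports at least min_uptime_s of uptime."""
    return uptime_seconds(proc_uptime_text) >= min_uptime_s


print(past_threshold("12.50 20.01\n"))
print(past_threshold("95.12 180.40\n"))
```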

…-shot

cloud-init's power_state runs in the final stage, which fires on every boot, so 'condition: True' produced an infinite reboot loop the moment cloud-init status --wait returned and we tried to use the VM. Replaced with a runcmd that touches /var/lib/rdc-vm-rebooted and reboots only when the marker is absent, so it fires exactly once on first boot.
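The marker-file guard above is a standard run-once pattern. The real implementation is a cloud-init runcmd (shell); this Python sketch only shows the logic, with `should_reboot_once` as a hypothetical name. The marker is written before the caller reboots, which is what lets later steps treat its presence as proof of being post-reboot.

```python
from pathlib import Path


def should_reboot_once(marker: Path) -> bool:
    """Return True exactly once: the first call creates the marker."""
    if marker.exists():
        return False
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()  # written before the reboot is triggered
    return True
```

On the guest the marker would be /var/lib/rdc-vm-rebooted, and a True result would be followed by the actual reboot.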

Two-condition wait removes both the apt-lock race (cloud-init still installing git when we reach Stage) and the reboot race (we treat first-boot ssh as ready, then cloud-init reboots and kills our session). The marker file /var/lib/rdc-vm-rebooted is touched immediately before the first-boot reboot, so seeing it guarantees we are post-reboot. Combined with cloud-init status==done, the VM is fully settled before any of our later steps touch it. Stage step's belt-and-braces apt install of git becomes conditional on git being absent, which is usually a no-op since cloud-init's packages: already ran.
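The two-condition wait above can be read as a conjunction: marker present AND cloud-init finished. A sketch, assuming the status text is the output of `cloud-init status` (which prints a `status: ...` line); both function names are hypothetical.

```python
def cloud_init_state(status_output):
    """Extract the value of the 'status:' line, or None if absent."""
    for line in status_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "status":
            return value.strip()
    return None


def vm_settled(marker_present, status_output):
    """Both conditions must hold before later steps touch the VM."""
    return bool(marker_present) and cloud_init_state(status_output) == "done"


print(vm_settled(True, "status: running\n"))
print(vm_settled(True, "status: done\n"))
```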

netplan's set-name rename is applied by udev at link-add time using the file under /run/systemd/network — the initramfs rebuild was overhead without payoff. Removing it cuts first-boot time substantially under TCG (where each apt + initramfs operation is multi-minutes). Bumped wait step + internal deadline to 25/23 min as headroom for the cloud-init apt install + reboot cycle when KVM is unavailable.

…otherwise)

Under TCG on the hosted ARM runner, install.sh's apt-get install of git+python3+pip+venv+flask+iptables+ebtables+tcpdump+bridge-utils was eating 30+ minutes of wall clock and consistently blew the 45-min job budget. Move that apt install into cloud-init's packages: list so it runs once during cloud-init (still slow but happens before our test loop starts and doesn't block our budget). install.sh's apt step becomes a no-op since DEBIAN_FRONTEND=noninteractive apt-get install on already-present packages exits in seconds.

setup-bridge.sh's networkctl reload moves eth0 into br0, which strips eth0's IP and resets every TCP connection bound to it — including the ssh session streaming the install output. The install would continue inside the VM, but the GH step had no visibility and just waited until the 45-min job timeout. Run install.sh under setsid+nohup with output to /tmp/install.log, then poll for /tmp/install.done from the host side. SSH reconnect picks up the same NAT-DHCP IP (now bound to br0). Tail the log periodically for live visibility, fetch full output + rc at the end.
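The detach-then-poll pattern above (setsid+nohup, log file, done marker) can be sketched locally in Python. `start_detached` and `wait_done` are hypothetical names; `start_new_session=True` plays the role of setsid, and the done file carries the exit code so the poller gets both "finished" and "rc" from one artifact. The workflow itself does this in shell over ssh.

```python
import shlex
import subprocess
import time
from pathlib import Path


def start_detached(command, log_path, done_path):
    """Run command in its own session, logging to log_path and writing
    its exit code to done_path when it finishes."""
    wrapped = (f"{command} > {shlex.quote(str(log_path))} 2>&1; "
               f"echo $? > {shlex.quote(str(done_path))}")
    return subprocess.Popen(
        ["sh", "-c", wrapped],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # survives loss of the launching session
    )


def wait_done(done_path, timeout_s, interval_s=0.1):
    """Poll for the done marker; return the recorded exit code or None."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        path = Path(done_path)
        if path.exists():
            text = path.read_text().strip()
            if text:  # the file can exist briefly before the code is written
                return int(text)
        time.sleep(interval_s)
    return None
```

A None result distinguishes "still running at the deadline" from any real exit code, including non-zero ones.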

The VM workflow needs more work to handle the qemu user-mode NAT interaction with the bridge takeover (eth0 dispossession kills the ssh path; reconnect via the bridge MAC isn't reliable yet on the hosted ARM runner under TCG). Take it off PR/push triggers so it doesn't block fast iteration; keep daily cron + manual run for ongoing work. Fast lane (installer-smoke.yml) still gates PRs on shellcheck + doctor logic.

Adds a third virtio NIC (eth2) on its own user-mode NAT, moves the hostfwd 2222->22 onto it, and points cloud-init's netplan + a .link file at it. The installer never touches eth2 (MEMBERS hardcodes eth0+SECOND_IFACE), so ssh survives the bridge takeover instead of racing the unanswered SLIRP DHCP for br0's freshly-synthesized MAC. With ssh stable, re-enable PR + push triggers on install/** so the slow lane gates again. Tighten Wait-for-ssh to 17m / 15m deadline (no longer waiting on cloud-init apt; packages are pre-installed in user-data).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Imager 2.x replaced the three-button hub + popup-with-tabs flow with a six-step wizard, so the old screenshot description, "EDIT SETTINGS" popup, and General/Services/Options tab walkthrough no longer match what users see.

Rewrites Chapter 4:

- Replaces the ASCII three-button mock with a wizard description and a one-line note that older-Imager users can still follow the same field names.
- Section 4b is now "Step through the wizard": same picks, framed as sequential Next-clicked steps instead of three buttons.
- Section 4c collapses the General/Services/Options tabs into one scrolling-form table, adds the new "Capital city" combined locale picker, calls out the new "Raspberry Pi Connect" toggle (leave off), and replaces the SAVE / YES / YES button sequence with the 2.x Save + confirm flow.
- 4d: Continue/Done hedge for the success screen.

Reported by pauljp during a fresh install with Imager 2.0.7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

andrewroydshayes and others added 17 commits April 24, 2026 17:42

installer-vm: install git via cloud-init (Debian cloud image is minimal)

0f938ae

installer-vm: explicit apt install git in Stage step (don't race cloud-init)

f1eb3e4

installer-vm: apt-get update before install (cloud-init's update may not have run yet)

c85e46c

installer-vm: wait on 'cloud-init status --wait' before staging

ff47dbb

andrewroydshayes wants to merge 17 commits into main from doctor-subcommand
