doctor subcommand + uninstaller + installer hardening + VM CI #1
Open
andrewroydshayes wants to merge 17 commits into
doctor: read-only post-install diagnostic. 14 checks across host / network / TPROXY / runtime, with per-check fix hints. Text and --json output. Pure-function checks behind a stubbable Probe so they're unit-testable without touching real system state. Surfaces failures the install.sh self-check misses, notably an install where the iptables/ebtables rules are correctly applied (matching `-i br0`) but br0 itself was never brought up by systemd-networkd, so no traffic ever flows. install.sh declares success in that case; doctor catches it.

uninstall.sh: reverses install.sh. Stops + disables the service, kills any leftover process, removes the unit file, deletes the iptables / ebtables / ip-rule entries, removes the systemd-networkd br0 configs, removes the NM unmanaged.conf, and removes /opt/rdc-proxy and /etc/rdc-proxy. Idempotent. Honors KEEP_CONFIG=1, PURGE_APT=1, YES=1. Does NOT tear down br0 itself at runtime (would drop SSH if you're on it), only its boot config.

CI: installer-smoke workflow runs shellcheck on install/*.sh and runs doctor against an empty config dir on a bare ubuntu-latest runner, asserting it correctly reports FAIL with the expected check names.

Verified end-to-end in qemu-aarch64 (Debian Bookworm cloud image, two virtio NICs, HVF accel): install.sh runs clean, doctor surfaces the missing-bridge bug, uninstall.sh removes everything cleanly.
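As a rough illustration of the "pure-function checks behind a stubbable Probe" idea, here is a minimal sketch. The `Probe` method names, the `CheckResult` shape, and the hint strings are all hypothetical; only the check name `bridge` comes from this PR.

```python
from dataclasses import dataclass
from typing import Dict, Protocol


class Probe(Protocol):
    """Hypothetical interface: everything a check may ask about the system."""

    def link_exists(self, name: str) -> bool: ...

    def link_is_up(self, name: str) -> bool: ...


@dataclass
class CheckResult:
    name: str
    status: str  # "ok", "warn" or "fail"
    hint: str = ""


def check_bridge(probe: Probe, bridge: str = "br0") -> CheckResult:
    """Pure function: the verdict depends only on what the probe reports."""
    if not probe.link_exists(bridge):
        return CheckResult("bridge", "fail", f"{bridge} does not exist")
    if not probe.link_is_up(bridge):
        return CheckResult("bridge", "fail", f"{bridge} exists but is down")
    return CheckResult("bridge", "ok")


@dataclass
class FakeProbe:
    """Test double: maps interface name to its up/down state."""

    links: Dict[str, bool]

    def link_exists(self, name: str) -> bool:
        return name in self.links

    def link_is_up(self, name: str) -> bool:
        return self.links.get(name, False)
```

With a stub like `FakeProbe`, the "rules present but no bridge" case can be reproduced in a unit test by passing an empty `links` mapping, without creating or deleting any real interface.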
…comes up
Two bugs surfaced by `rdc-proxy doctor`:
1. setup-bridge.sh wrote /etc/systemd/network/{10,20,30}-br0*.network but
on hosts where another manager already owned the would-be bridge
members, our config never took effect:
- Pi OS Bookworm: NetworkManager has an active connection on eth0,
writing unmanaged.conf alone doesn't release it (NM keeps the
existing profile up).
- Cloud Debian / Ubuntu Server: netplan compiles its YAML into
/run/systemd/network/10-netplan-eth0.network at priority 10, which
beats our 20-br0-members.network in the systemd-networkd match
order. eth0 stays out of the bridge silently.
Either way the install scripts reported success — TPROXY rule + ebtables
broute rule + listening port all green — but `-i br0` matched nothing
because br0 had no members. Same shape as the field symptom of "proxy
running, no telemetry."
2. install.sh's self-check verified the rules exist but never confirmed
br0 itself was up with members enslaved.
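The netplan case in item 1 comes down to filename ordering: systemd-networkd sorts all `.network` files alphanumerically across its config directories and applies the first one that matches an interface. A small sketch of that contest, under the simplifying assumption that every listed file matches eth0; the name `05-rdc-proxy-members.network` is a hypothetical instance of the `05-rdc-proxy-*` pattern.

```python
def winning_config(filenames):
    """Return the file consulted first for an interface.

    Simplification: every file in the list is assumed to match the
    interface, so the alphanumerically smallest name wins.
    """
    return sorted(filenames)[0]


before = ["20-br0-members.network", "10-netplan-eth0.network"]
# Hypothetical renamed file following the 05-rdc-proxy-* pattern.
after = before + ["05-rdc-proxy-members.network"]

print(winning_config(before))
print(winning_config(after))
```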
Fixes:
- setup-bridge.sh now actively dispossesses NetworkManager (delete the
active connection profile on each member iface, then write
unmanaged.conf so NM never reclaims) and netplan (rewrite the YAML to
renderer:networkd with no per-iface stanzas, backing up the original).
- Renames the systemd-networkd config files to 05-rdc-proxy-* so they
win the lexical match order against any 10-netplan-*.network that
might still be in /run.
- Triggers `networkctl reload` after writing config and verifies br0
exists with all members enslaved before exiting (waits up to 45s
total). Hard-fails with a diagnostic dump otherwise.
- install.sh self-check now asserts `ip link show br0` succeeds AND
every expected member is enslaved.
- uninstall.sh removes the new 05-rdc-proxy-* filenames AND restores any
netplan .rdc-backup files setup-bridge wrote.
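The "verify br0 exists with all members enslaved, waiting up to 45s" step is a poll-with-deadline loop. The real script is shell; the sketch below shows the same logic in Python with an injectable `get_members` callable (hypothetical) that returns `None` while the bridge does not exist and a list of enslaved interface names once it does. The member name `eth1` is only an example.

```python
import time


def wait_for_bridge(get_members, expected, timeout=45.0, interval=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll until every expected member is enslaved or the deadline passes.

    get_members() returns None if the bridge is missing, otherwise an
    iterable of member interface names.
    """
    deadline = clock() + timeout
    while True:
        members = get_members()
        if members is not None and set(expected) <= set(members):
            return True
        if clock() >= deadline:
            return False  # caller should dump diagnostics and hard-fail
        sleep(interval)
```

Passing `clock` and `sleep` as parameters keeps the loop testable without real waiting.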
Boots Debian Bookworm aarch64 cloud image under qemu+KVM on the ubuntu-24.04-arm hosted runner, then exercises the whole installer pipeline end-to-end:

install.sh -> doctor (must be ok=true and bridge up) -> uninstall.sh -> verify clean state

Triggers: PR/push touching install/, daily cron at 11:00 UTC, and manual workflow_dispatch. Complement to installer-smoke.yml's fast lane (shellcheck + doctor against an empty config).

Asserts the specific check names that matter: bridge, bridge_ip, iptables_mangle_tproxy, ebtables_broute, listen_port, service must be ok. rdc_reachable is allowed to warn since CI doesn't have a real generator wired in. Any fail anywhere uploads serial.log + the four captured outputs as a workflow artifact.
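The per-check assertion can be sketched as a small evaluator over the `--json` output. The check names are the ones listed above; the JSON shape (a `checks` list of objects with `name` and `status`) is an assumption made for illustration, not the documented schema.

```python
import json

REQUIRED_OK = ("bridge", "bridge_ip", "iptables_mangle_tproxy",
               "ebtables_broute", "listen_port", "service")
MAY_WARN = ("rdc_reachable",)


def evaluate(report_json):
    """Return a list of problems; an empty list means the run passes.

    Assumed (hypothetical) schema:
    {"checks": [{"name": "...", "status": "ok" | "warn" | "fail"}]}
    """
    report = json.loads(report_json)
    status = {c["name"]: c["status"] for c in report["checks"]}
    problems = []
    for name in REQUIRED_OK:
        got = status.get(name, "missing")
        if got != "ok":
            problems.append(f"{name}: expected ok, got {got}")
    for name in MAY_WARN:
        got = status.get(name, "missing")
        if got not in ("ok", "warn"):
            problems.append(f"{name}: expected ok or warn, got {got}")
    return problems
```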
The ubuntu-24.04-arm hosted runners don't expose /dev/kvm, so 'accel=kvm -cpu host' fails immediately. Detect /dev/kvm at runtime; use it when present, otherwise fall back to '-accel tcg,thread=multi -cpu cortex-a72' (Pi-4 profile, multi-threaded TCG which keeps guest/host arch aligned and runs at usable speed on the ARM host). Bumped timeout to 45 min for the TCG path.
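A minimal sketch of the runtime accelerator choice, written in Python rather than the workflow's shell. It uses qemu's `-accel` option for both branches (the original KVM invocation is quoted above only as 'accel=kvm -cpu host'), and it treats the mere existence of `/dev/kvm` as "usable", which is a simplification.

```python
import os


def qemu_accel_args(kvm_path="/dev/kvm", exists=os.path.exists):
    """Pick qemu acceleration flags based on whether the KVM device is present."""
    if exists(kvm_path):
        return ["-accel", "kvm", "-cpu", "host"]
    # No KVM: multi-threaded TCG with a Pi-4-like CPU model.
    return ["-accel", "tcg,thread=multi", "-cpu", "cortex-a72"]


print(" ".join(qemu_accel_args()))
```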
TCG is slow and variable, so a fixed 30s sleep between cloud-init's first ssh-up and the post-reboot ssh-up was racing — the probe step hit the VM mid-reboot and got 'kex_exchange_identification: read: Connection reset by peer'. Switch to polling /proc/uptime via ssh and continuing once it reports >= 90s. That guarantees we're past the reboot regardless of TCG performance.
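The uptime probe reduces to parsing the first field of `/proc/uptime` (seconds since boot) and comparing it to the threshold. A sketch, assuming the text has already been fetched over ssh; the function name is hypothetical.

```python
def uptime_at_least(proc_uptime_text, min_uptime=90.0):
    """True once the first field of /proc/uptime reaches min_uptime seconds."""
    uptime_seconds = float(proc_uptime_text.split()[0])
    return uptime_seconds >= min_uptime
```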
…not have run yet)
…-shot
cloud-init's power_state runs in the final stage, which fires on every boot, so 'condition: True' produced an infinite reboot loop the moment cloud-init status --wait returned and we tried to use the VM. Replaced with a runcmd that touches /var/lib/rdc-vm-rebooted and reboots only when the marker is absent, so it fires exactly once on first boot.
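The marker-file guard is a general one-shot pattern. The actual fix is a cloud-init runcmd; the Python sketch below only illustrates the logic, with a hypothetical helper name and the marker path passed in so it can be tried against a temporary directory.

```python
from pathlib import Path


def should_reboot(marker: Path) -> bool:
    """Return True only on the first call, when the marker is still absent.

    The marker is created before returning True, mirroring "touch the
    file, then reboot", so later boots see it and skip the reboot.
    """
    if marker.exists():
        return False
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()
    return True
```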
Two-condition wait removes both the apt-lock race (cloud-init still installing git when we reach Stage) and the reboot race (we treat first-boot ssh as ready, then cloud-init reboots and kills our session). The marker file /var/lib/rdc-vm-rebooted is touched immediately before the first-boot reboot, so seeing it guarantees we are post-reboot. Combined with cloud-init status==done, the VM is fully settled before any of our later steps touch it. Stage step's belt-and-braces apt install of git becomes conditional on git being absent, usually a no-op since cloud-init's packages: already ran.
netplan's set-name rename is applied by udev at link-add time using the file under /run/systemd/network; the initramfs rebuild was overhead without payoff. Removing it cuts first-boot time substantially under TCG (where each apt + initramfs operation takes multiple minutes). Bumped wait step + internal deadline to 25/23 min as headroom for the cloud-init apt install + reboot cycle when KVM is unavailable.
…otherwise)
Under TCG on the hosted ARM runner, install.sh's apt-get install of git+python3+pip+venv+flask+iptables+ebtables+tcpdump+bridge-utils was eating 30+ minutes of wall clock and consistently blew the 45-min job budget. Move that apt install into cloud-init's packages: list so it runs once during cloud-init (still slow but happens before our test loop starts and doesn't block our budget). install.sh's apt step becomes a no-op since DEBIAN_FRONTEND=noninteractive apt-get install on already-present packages exits in seconds.
setup-bridge.sh's networkctl reload moves eth0 into br0, which strips eth0's IP and resets every TCP connection bound to it — including the ssh session streaming the install output. The install would continue inside the VM, but the GH step had no visibility and just waited until the 45-min job timeout. Run install.sh under setsid+nohup with output to /tmp/install.log, then poll for /tmp/install.done from the host side. SSH reconnect picks up the same NAT-DHCP IP (now bound to br0). Tail the log periodically for live visibility, fetch full output + rc at the end.
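One way to picture the detached run is as the command line sent over ssh. The sketch below builds such a line; the exact command used by the workflow is not shown in this PR, and writing the exit code into the done file is an assumption.

```python
import shlex


def detached_command(script, log_path="/tmp/install.log",
                     done_path="/tmp/install.done"):
    """Build a shell line that runs `script` detached from the ssh session.

    Output goes to log_path. When the script ends, its exit code is
    written to done_path (an assumption), which the host can poll for.
    """
    inner = (f"{shlex.quote(script)} > {shlex.quote(log_path)} 2>&1; "
             f"echo $? > {shlex.quote(done_path)}")
    # setsid starts a new session and nohup ignores SIGHUP, so the
    # job survives the ssh connection being reset by the bridge takeover.
    return f"setsid nohup sh -c {shlex.quote(inner)} < /dev/null > /dev/null 2>&1 &"


print(detached_command("./install.sh"))
```

The host side then reconnects and checks for the done file in a loop, tailing the log between attempts.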
The VM workflow needs more work to handle the qemu user-mode NAT interaction with the bridge takeover (eth0 dispossession kills the ssh path; reconnect via the bridge MAC isn't reliable yet on the hosted ARM runner under TCG). Take it off PR/push triggers so it doesn't block fast iteration; keep daily cron + manual run for ongoing work. Fast lane (installer-smoke.yml) still gates PRs on shellcheck + doctor logic.
Adds a third virtio NIC (eth2) on its own user-mode NAT, moves the hostfwd 2222->22 onto it, and points cloud-init's netplan + a .link file at it. The installer never touches eth2 (MEMBERS hardcodes eth0+SECOND_IFACE), so ssh survives the bridge takeover instead of racing the unanswered SLIRP DHCP for br0's freshly-synthesized MAC.

With ssh stable, re-enable PR + push triggers on install/** so the slow lane gates again. Tighten Wait-for-ssh to 17m / 15m deadline (no longer waiting on cloud-init apt, which is pre-installed in user-data).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Imager 2.x replaced the three-button hub + popup-with-tabs flow with a six-step wizard, so the old screenshot description, "EDIT SETTINGS" popup, and General/Services/Options tab walkthrough no longer match what users see.

Rewrites Chapter 4:
- Replaces the ASCII three-button mock with a wizard description and a one-line note that older-Imager users can still follow the same field names.
- Section 4b is now "Step through the wizard": same picks, framed as sequential Next-clicked steps instead of three buttons.
- Section 4c collapses the General/Services/Options tabs into one scrolling-form table, adds the new "Capital city" combined locale picker, calls out the new "Raspberry Pi Connect" toggle (leave off), and replaces the SAVE / YES / YES button sequence with the 2.x Save + confirm flow.
- 4d: Continue/Done hedge for the success screen.

Reported by pauljp during a fresh install with Imager 2.0.7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- `rdc-proxy doctor` subcommand: read-only post-install diagnostic with 14 checks (host / network / TPROXY plumbing / runtime), with a `--json` mode for tooling. Catches the silent-failure mode where `install.sh` reports green but `br0` was never actually created (TPROXY rule + ebtables broute rule reference `-i br0` so they're inert).
- `install/uninstall.sh` reverses everything `install.sh` does, idempotently. Handles legacy filenames + restores netplan backups.
- `setup-bridge.sh` now actively dispossesses NetworkManager (deletes the active connection profile per member) and netplan (rewrites YAML to `renderer:networkd`, backs up original). Renames its own systemd-networkd config files to `05-rdc-proxy-*` so they win the lexical match against any `10-netplan-*` files. Reloads `systemd-networkd`, brings members up, polls until `br0` exists with both members enslaved before exiting.
- `install.sh` self-check now asserts `br0` exists AND every expected member is enslaved, not just that the rules are present.
- `installer-smoke.yml` (fast lane, every PR): shellcheck on `install/*.sh`, run `doctor` against an empty config dir on a bare runner and assert it correctly reports FAIL with the expected check signature.
- `installer-vm.yml` (slow lane): boots Debian Bookworm aarch64 cloud image under `qemu-system-aarch64` on the `ubuntu-24.04-arm` hosted runner, runs install → doctor → uninstall → verify-clean. Currently gated on `workflow_dispatch` + `cron` only while iterating: hosted ARM runners don't expose `/dev/kvm` so we run TCG, and the qemu user-mode NAT topology doesn't gracefully survive the bridge takeover (eth0 dispossession kills the only ssh path). Fast lane gates PRs in the meantime.

Why
The installer was reporting success on hosts where another network manager (NetworkManager on Pi OS Bookworm, netplan on cloud Debian) already owned the bridge member interfaces. The systemd-networkd config got written but lost the priority contest, so `br0` was never actually created, yet the TPROXY + ebtables rules referencing `-i br0` got installed cleanly and the proxy started listening. From the install summary alone, everything looked green; in reality nothing was being intercepted. Doctor catches this now, the bridge setup actually solves it, and the fast-lane CI guarantees doctor's logic doesn't regress.

Test plan
- `installer-smoke` workflow passes
- `lint`, `test (3.10/3.11/3.12)`, `shellcheck` pass
- `installer-vm` workflow (slow lane): runs nightly, currently failing on bridge-takeover SSH disconnect under qemu user-mode NAT; tracked as follow-up

🤖 Generated with Claude Code