doctor subcommand + uninstaller + installer hardening + VM CI #1

Open
andrewroydshayes wants to merge 17 commits into main from doctor-subcommand

andrewroydshayes (Owner) commented Apr 25, 2026

Summary

  • New rdc-proxy doctor subcommand: read-only post-install diagnostic with 14 checks (host / network / TPROXY plumbing / runtime), with a --json mode for tooling. Catches the silent-failure mode where install.sh reports green but br0 was never actually created (TPROXY rule + ebtables broute rule reference -i br0 so they're inert).
  • New install/uninstall.sh reverses everything install.sh does, idempotently. Handles legacy filenames + restores netplan backups.
  • Real installer fix: setup-bridge.sh now actively dispossesses NetworkManager (deletes the active connection profile per member) and netplan (rewrites YAML to renderer:networkd, backs up original). Renames its own systemd-networkd config files to 05-rdc-proxy-* so they win the lexical match against any 10-netplan-* files. Reloads systemd-networkd, brings members up, polls until br0 exists with both members enslaved before exiting.
  • install.sh self-check now asserts br0 exists AND every expected member is enslaved, not just that the rules are present.
  • CI:
    • installer-smoke.yml (fast lane, every PR) — shellcheck on install/*.sh, run doctor against an empty config dir on a bare runner and assert it correctly reports FAIL with the expected check signature.
    • installer-vm.yml (slow lane) — boots Debian Bookworm aarch64 cloud image under qemu-system-aarch64 on the ubuntu-24.04-arm hosted runner, runs install → doctor → uninstall → verify-clean. Currently gated on workflow_dispatch + cron only while iterating: hosted ARM runners don't expose /dev/kvm so we run TCG, and the qemu user-mode NAT topology doesn't gracefully survive the bridge takeover (eth0 dispossession kills the only ssh path). Fast lane gates PRs in the meantime.

Why

The installer was reporting success on hosts where another network manager (NetworkManager on Pi OS Bookworm, netplan on cloud Debian) already owned the bridge member interfaces. The systemd-networkd config got written but lost the priority contest, so br0 was never actually created — yet the TPROXY + ebtables rules referencing -i br0 got installed cleanly and the proxy started listening. From the install summary alone, everything looked green; in reality nothing was being intercepted. Doctor catches this now, the bridge setup actually solves it, and the fast-lane CI guarantees doctor's logic doesn't regress.

Test plan

  • 77/77 unit tests pass (24 new for doctor)
  • End-to-end manual run in qemu-aarch64 + Debian Bookworm cloud image (HVF-accelerated on local Mac): install reports clean, doctor green, uninstall clean
  • installer-smoke workflow passes
  • lint, test (3.10/3.11/3.12), shellcheck pass
  • installer-vm workflow (slow lane) — runs nightly, currently failing on bridge-takeover SSH disconnect under qemu user-mode NAT; tracked as follow-up

🤖 Generated with Claude Code

andrewroydshayes and others added 17 commits April 24, 2026 17:42
doctor: read-only post-install diagnostic. 14 checks across host /
network / TPROXY / runtime, with per-check fix hints. Text and --json
output. Pure-function checks behind a stubbable Probe so they're unit-
testable without touching real system state.
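A minimal sketch of what "pure-function checks behind a stubbable Probe" could look like. Everything here is an assumption for illustration: the `Probe` methods, the `CheckResult` fields, the hint wording, and `FakeProbe` are hypothetical names, not the actual rdc-proxy code. Only the check name `bridge` comes from this PR.

```python
from dataclasses import dataclass
from typing import Protocol


class Probe(Protocol):
    """Hypothetical read-only view of system state that checks depend on."""

    def link_exists(self, iface: str) -> bool: ...
    def bridge_members(self, bridge: str) -> list[str]: ...


@dataclass(frozen=True)
class CheckResult:
    name: str
    status: str  # "ok", "warn", or "fail" (assumed vocabulary)
    hint: str = ""


def check_bridge(probe: Probe, bridge: str, expected: list[str]) -> CheckResult:
    """Pure function: decides from probe answers only, never touches the host."""
    if not probe.link_exists(bridge):
        return CheckResult("bridge", "fail", f"{bridge} does not exist")
    missing = sorted(set(expected) - set(probe.bridge_members(bridge)))
    if missing:
        return CheckResult("bridge", "fail", f"not enslaved to {bridge}: {', '.join(missing)}")
    return CheckResult("bridge", "ok")


class FakeProbe:
    """Test stub standing in for a probe that would query the real system."""

    def __init__(self, links: dict[str, list[str]]):
        self._links = links  # bridge name -> member interfaces

    def link_exists(self, iface: str) -> bool:
        return iface in self._links

    def bridge_members(self, bridge: str) -> list[str]:
        return self._links.get(bridge, [])
```

With this shape a unit test can hand `check_bridge` a `FakeProbe({})` to reproduce the missing-bridge case without any network configuration on the test machine.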

Surfaces failures the install.sh self-check misses — notably an install
where the iptables/ebtables rules are correctly applied (matching `-i
br0`) but br0 itself was never brought up by systemd-networkd, so no
traffic ever flows. install.sh declares success in that case; doctor
catches it.

uninstall.sh: reverses install.sh. Stops + disables the service, kills
any leftover process, removes the unit file, deletes the iptables /
ebtables / ip-rule entries, removes the systemd-networkd br0 configs,
removes the NM unmanaged.conf, and removes /opt/rdc-proxy and
/etc/rdc-proxy. Idempotent. Honors KEEP_CONFIG=1, PURGE_APT=1, YES=1.
Does NOT tear down br0 itself at runtime (would drop SSH if you're on
it), only its boot config.
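The idempotency and `KEEP_CONFIG=1` behaviour described above can be sketched roughly as follows. This is an illustration in Python, not the shell script itself; the `root` parameter is a hypothetical addition so the sketch can run against a scratch directory instead of the real `/opt` and `/etc`.

```python
import os
import shutil
from pathlib import Path


def env_flag(name: str, env=os.environ) -> bool:
    """Treat NAME=1 as enabled, anything else (or unset) as disabled."""
    return env.get(name) == "1"


def remove_path(path: Path) -> bool:
    """Remove a file or directory tree; return False if it was already gone."""
    if path.is_dir() and not path.is_symlink():
        shutil.rmtree(path)
        return True
    if path.exists() or path.is_symlink():
        path.unlink()
        return True
    return False


def uninstall(root: Path, env=os.environ) -> list[str]:
    """Remove install artifacts under root; safe to run repeatedly."""
    targets = [root / "opt/rdc-proxy"]
    if not env_flag("KEEP_CONFIG", env):
        targets.append(root / "etc/rdc-proxy")
    # Report only what was actually removed, so a second run returns [].
    return [t.relative_to(root).as_posix() for t in targets if remove_path(t)]
```

Returning the list of removed paths makes the "second run is a no-op" property easy to assert in a test.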

CI: installer-smoke workflow runs shellcheck on install/*.sh and runs
doctor against an empty config dir on a bare ubuntu-latest runner,
asserting it correctly reports FAIL with the expected check names.

Verified end-to-end in qemu-aarch64 (Debian Bookworm cloud image, two
virtio NICs, HVF accel): install.sh runs clean, doctor surfaces the
missing-bridge bug, uninstall.sh removes everything cleanly.
…comes up

Two bugs surfaced by `rdc-proxy doctor`:

1. setup-bridge.sh wrote /etc/systemd/network/{10,20,30}-br0*.network but
   on hosts where another manager already owned the would-be bridge
   members, our config never took effect:
     - Pi OS Bookworm: NetworkManager has an active connection on eth0,
       writing unmanaged.conf alone doesn't release it (NM keeps the
       existing profile up).
     - Cloud Debian / Ubuntu Server: netplan compiles its YAML into
       /run/systemd/network/10-netplan-eth0.network at priority 10, which
       beats our 20-br0-members.network in the systemd-networkd match
       order. eth0 stays out of the bridge silently.
   Either way the install scripts reported success — TPROXY rule + ebtables
   broute rule + listening port all green — but `-i br0` matched nothing
   because br0 had no members. Same shape as the field symptom of "proxy
   running, no telemetry."

2. install.sh's self-check verified the rules exist but never confirmed
   br0 itself was up with members enslaved.
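The match-order problem in item 1 can be modelled in a few lines. This is a simplified sketch of how systemd-networkd picks a `.network` file: files from all directories are sorted together by filename, same-named files in `/etc` shadow those in `/run`, and the first file that matches an interface wins. Reducing `[Match]` to a set of interface names is an assumption made for brevity, and `05-rdc-proxy-members.network` in the usage note is a hypothetical filename (the PR only gives the `05-rdc-proxy-*` prefix).

```python
from pathlib import PurePosixPath
from typing import Optional

# Directories in decreasing precedence for files that share a name.
SEARCH_DIRS = ["/etc/systemd/network", "/run/systemd/network", "/usr/lib/systemd/network"]


def effective_files(paths: list[str]) -> list[str]:
    """Merge .network files across directories, then sort by basename."""
    chosen: dict[str, str] = {}
    for directory in SEARCH_DIRS:
        for p in paths:
            path = PurePosixPath(p)
            if str(path.parent) == directory and path.suffix == ".network":
                chosen.setdefault(path.name, p)  # earlier directory wins
    return [chosen[name] for name in sorted(chosen)]


def winner(paths: list[str], matches: dict[str, set[str]], iface: str) -> Optional[str]:
    """Return the first file, in basename order, whose match set includes iface."""
    for p in effective_files(paths):
        if iface in matches.get(p, set()):
            return p
    return None
```

In this model, `10-netplan-eth0.network` wins `eth0` against `20-br0-members.network`, while a file named `05-rdc-proxy-members.network` would sort ahead of it and win instead.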

Fixes:
- setup-bridge.sh now actively dispossesses NetworkManager (delete the
  active connection profile on each member iface, then write
  unmanaged.conf so NM never reclaims) and netplan (rewrite the YAML to
  renderer:networkd with no per-iface stanzas, backing up the original).
- Renames the systemd-networkd config files to 05-rdc-proxy-* so they
  win the lexical match order against any 10-netplan-*.network that
  might still be in /run.
- Triggers `networkctl reload` after writing config and verifies br0
  exists with all members enslaved before exiting (waits up to 45s
  total). Hard-fails with a diagnostic dump otherwise.
- install.sh self-check now asserts `ip link show br0` succeeds AND
  every expected member is enslaved.
- uninstall.sh removes the new 05-rdc-proxy-* filenames AND restores any
  netplan .rdc-backup files setup-bridge wrote.
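The "verify br0 exists with all members enslaved, waiting up to 45s" step in the fixes above could be sketched like this. It is a hedged Python illustration of the polling logic only, not the shell implementation; `members_of`, `clock`, and `sleep` are hypothetical injection points so the loop can be tested without a real bridge.

```python
import time
from typing import Callable, Optional, Sequence


def wait_for_bridge(
    bridge: str,
    expected: Sequence[str],
    members_of: Callable[[str], Optional[Sequence[str]]],
    timeout_s: float = 45.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the bridge exists and every expected member is enslaved.

    members_of(bridge) returns None while the bridge is absent, otherwise
    the interfaces currently enslaved to it.
    """
    deadline = clock() + timeout_s
    while True:
        members = members_of(bridge)
        if members is not None and set(expected) <= set(members):
            return True
        if clock() >= deadline:
            return False  # caller would dump diagnostics and hard-fail
        sleep(interval_s)
```

On a real host, `members_of` might list the entries under `/sys/class/net/<bridge>/brif`; that choice is an assumption, not something the PR states.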
Boots Debian Bookworm aarch64 cloud image under qemu+KVM on the
ubuntu-24.04-arm hosted runner, then exercises the whole installer
pipeline end-to-end:

  install.sh -> doctor (must be ok=true and bridge up) ->
  uninstall.sh -> verify clean state

Triggers: PR/push touching install/, daily cron at 11:00 UTC, and
manual workflow_dispatch. Complement to installer-smoke.yml's fast
lane (shellcheck + doctor against an empty config).

Asserts the specific check names that matter: bridge, bridge_ip,
iptables_mangle_tproxy, ebtables_broute, listen_port, service must
be ok. rdc_reachable is allowed to warn since CI doesn't have a real
generator wired in. Any fail anywhere uploads serial.log + the four
captured outputs as a workflow artifact.
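A rough sketch of the assertion this workflow describes, written against an assumed `--json` shape. The check names and the `ok` flag come from the commit message; the `checks` list with `name`/`status` fields is a guess at the schema and may not match the real output.

```python
import json

# Check names taken from the commit message.
REQUIRED_OK = [
    "bridge", "bridge_ip", "iptables_mangle_tproxy",
    "ebtables_broute", "listen_port", "service",
]


def evaluate(report_json: str) -> list[str]:
    """Return a list of problems; an empty list means the run passes."""
    report = json.loads(report_json)
    # Assumed schema: {"ok": bool, "checks": [{"name": ..., "status": ...}]}
    status = {c["name"]: c["status"] for c in report["checks"]}
    problems = []
    if report.get("ok") is not True:
        problems.append("ok is not true")
    for name in REQUIRED_OK:
        if status.get(name) != "ok":
            problems.append(f"{name}: expected ok, got {status.get(name, 'missing')}")
    # Other checks (for example rdc_reachable) may warn, but must not fail.
    for name, value in status.items():
        if name not in REQUIRED_OK and value == "fail":
            problems.append(f"{name}: fail")
    return problems
```

A non-empty result would be the trigger for uploading `serial.log` and the captured outputs.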
The ubuntu-24.04-arm hosted runners don't expose /dev/kvm, so 'accel=kvm
-cpu host' fails immediately. Detect /dev/kvm at runtime; use it when
present, otherwise fall back to '-accel tcg,thread=multi -cpu cortex-a72'
(Pi-4 profile, multi-threaded TCG which keeps guest/host arch aligned and
runs at usable speed on the ARM host). Bumped timeout to 45 min for the
TCG path.
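The runtime detection could look roughly like the sketch below. The TCG flags are copied from the commit message; spelling the KVM path as `-accel kvm` (rather than the `accel=kvm` machine property quoted above) and testing for read/write access are assumptions.

```python
import os


def accel_args(kvm_path: str = "/dev/kvm") -> list[str]:
    """Pick QEMU acceleration flags based on whether KVM is usable."""
    if os.access(kvm_path, os.R_OK | os.W_OK):
        return ["-accel", "kvm", "-cpu", "host"]
    # Fallback from the commit message: multi-threaded TCG, Pi 4 class CPU.
    return ["-accel", "tcg,thread=multi", "-cpu", "cortex-a72"]
```

Checking access rather than mere existence also covers runners where the device node is present but the job user cannot open it.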
TCG is slow and variable, so a fixed 30s sleep between cloud-init's
first ssh-up and the post-reboot ssh-up was racing — the probe step
hit the VM mid-reboot and got 'kex_exchange_identification: read:
Connection reset by peer'. Switch to polling /proc/uptime via ssh
and continuing once it reports >= 90s. That guarantees we're past
the reboot regardless of TCG performance.
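A sketch of the uptime poll, under stated assumptions: `fetch` is a hypothetical callable that returns the text of `/proc/uptime` from the VM (for example via ssh) and raises `OSError` when the connection fails; the attempt count and interval are illustrative.

```python
import time
from typing import Callable


def uptime_seconds(proc_uptime: str) -> float:
    """Parse the first field of /proc/uptime (seconds since boot)."""
    return float(proc_uptime.split()[0])


def wait_for_uptime(
    fetch: Callable[[], str],
    min_uptime_s: float = 90.0,
    attempts: int = 60,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry until the VM reports at least min_uptime_s of uptime."""
    for _ in range(attempts):
        try:
            if uptime_seconds(fetch()) >= min_uptime_s:
                return True
        except (OSError, ValueError, IndexError):
            pass  # VM mid-reboot, ssh reset, or empty output: try again
        sleep(interval_s)
    return False
```

Treating a connection reset as "not ready yet" instead of a failure is what removes the race with the fixed sleep.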
…-shot

cloud-init's power_state runs in the final stage which fires on every
boot, so 'condition: True' produced an infinite reboot loop the moment
cloud-init status --wait returned and we tried to use the VM. Replaced
with a runcmd that touches /var/lib/rdc-vm-rebooted and reboots only
when the marker is absent — fires exactly once on first boot.
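The one-shot logic is small enough to show in full. This is a Python illustration of the guard only (the real change is a cloud-init `runcmd`); the `reboot` callable is a hypothetical stand-in for the actual reboot command.

```python
from pathlib import Path
from typing import Callable


def reboot_once(marker: Path, reboot: Callable[[], None]) -> bool:
    """Reboot only if the marker is absent; touch the marker first."""
    if marker.exists():
        return False  # already rebooted on an earlier boot
    marker.touch()  # written before rebooting, so the next boot sees it
    reboot()
    return True
```

Because the marker is created before the reboot is requested, a later boot that runs the same command again finds the file and does nothing.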
Two-condition wait removes both the apt-lock race (cloud-init still
installing git when we reach Stage) and the reboot race (we treat first-
boot ssh as ready, then cloud-init reboots and kills our session). The
marker file /var/lib/rdc-vm-rebooted is touched immediately before the
first-boot reboot, so seeing it guarantees we are post-reboot. Combined
with cloud-init status==done, the VM is fully settled before any of our
later steps touch it.

Stage step's belt-and-braces apt install of git becomes conditional on
git being absent — usually a no-op since cloud-init's packages: already
ran.
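The two-condition readiness test can be sketched as a small predicate. It assumes the `status: <state>` line that `cloud-init status` prints; how the marker's presence is probed (for example `test -e` over ssh) is left to the caller and is an assumption.

```python
def parse_cloud_init_status(output: str) -> str:
    """Extract the state from `cloud-init status` style output."""
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "status":
            return value.strip()
    return "unknown"


def vm_settled(cloud_init_output: str, marker_present: bool) -> bool:
    """Ready only when cloud-init is done AND the post-reboot marker exists."""
    return parse_cloud_init_status(cloud_init_output) == "done" and marker_present
```

Requiring both conditions is the point: `done` alone can be observed before the first-boot reboot, and the marker alone can appear while cloud-init is still installing packages.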
netplan's set-name rename is applied by udev at link-add time using the
file under /run/systemd/network, so the initramfs rebuild was overhead
without payoff. Removing it cuts first-boot time substantially under
TCG (where each apt or initramfs operation takes several minutes).

Bumped wait step + internal deadline to 25/23 min as headroom for the
cloud-init apt install + reboot cycle when KVM is unavailable.
…otherwise)

Under TCG on the hosted ARM runner, install.sh's apt-get install of
git+python3+pip+venv+flask+iptables+ebtables+tcpdump+bridge-utils
was eating 30+ minutes of wall clock and consistently blew the 45-min
job budget. Move that apt install into cloud-init's packages: list so
it runs once during cloud-init (still slow but happens before our
test loop starts and doesn't block our budget). install.sh's apt step
becomes a no-op since DEBIAN_FRONTEND=noninteractive apt-get install
on already-present packages exits in seconds.

setup-bridge.sh's networkctl reload moves eth0 into br0, which strips
eth0's IP and resets every TCP connection bound to it — including the
ssh session streaming the install output. The install would continue
inside the VM, but the GH step had no visibility and just waited until
the 45-min job timeout.

Run install.sh under setsid+nohup with output to /tmp/install.log,
then poll for /tmp/install.done from the host side. SSH reconnect picks
up the same NAT-DHCP IP (now bound to br0). Tail the log periodically
for live visibility, fetch full output + rc at the end.
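The detach-and-poll pattern can be illustrated locally in Python. This sketch runs everything on one machine, whereas the workflow starts the job inside the VM and polls over ssh. `start_new_session=True` plays the role of `setsid`; writing the exit code into the done file is an assumption about what `/tmp/install.done` contains.

```python
import shlex
import subprocess
import time
from pathlib import Path
from typing import Optional


def start_detached(command: str, log: Path, done: Path) -> subprocess.Popen:
    """Run command in its own session, logging output and recording its exit code."""
    wrapped = (
        f"( {command} ) > {shlex.quote(str(log))} 2>&1; "
        f"echo $? > {shlex.quote(str(done))}"
    )
    return subprocess.Popen(
        ["sh", "-c", wrapped],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # setsid(): detached from the caller's session
    )


def wait_done(done: Path, timeout_s: float = 60.0, interval_s: float = 0.2) -> Optional[int]:
    """Poll for the done file; return the recorded exit code, or None on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if done.exists():
            text = done.read_text().strip()
            if text:  # skip the brief window where the file exists but is empty
                return int(text)
        time.sleep(interval_s)
    return None
```

The subshell around `command` matters: without it, an `exit` inside the command would end the wrapper before the done file is written.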
The VM workflow needs more work to handle the qemu user-mode NAT
interaction with the bridge takeover (eth0 dispossession kills the
ssh path; reconnect via the bridge MAC isn't reliable yet on the
hosted ARM runner under TCG). Take it off PR/push triggers so it
doesn't block fast iteration; keep daily cron + manual run for
ongoing work. Fast lane (installer-smoke.yml) still gates PRs on
shellcheck + doctor logic.

Adds a third virtio NIC (eth2) on its own user-mode NAT, moves the
hostfwd 2222->22 onto it, and points cloud-init's netplan + a .link
file at it. The installer never touches eth2 (MEMBERS hardcodes
eth0+SECOND_IFACE), so ssh survives the bridge takeover instead of
racing the unanswered SLIRP DHCP for br0's freshly-synthesized MAC.
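A sketch of the QEMU arguments such a management NIC might use. The `-netdev user,...,hostfwd=` and `-device virtio-net-pci` options are standard QEMU syntax, but the `mgmt` netdev id and the choice of `virtio-net-pci` are assumptions; the workflow's actual command line is not shown in this PR text.

```python
def mgmt_nic_args(host_port: int = 2222, guest_port: int = 22, netdev_id: str = "mgmt") -> list[str]:
    """QEMU arguments for a virtio NIC on its own user-mode (SLIRP) network."""
    return [
        "-netdev", f"user,id={netdev_id},hostfwd=tcp::{host_port}-:{guest_port}",
        "-device", f"virtio-net-pci,netdev={netdev_id}",
    ]
```

Keeping the ssh forward on a netdev that the installer never enslaves means the forward's guest side keeps its address while eth0 moves into br0.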

With ssh stable, re-enable PR + push triggers on install/** so the
slow lane gates again. Tighten Wait-for-ssh to 17m / 15m deadline
(no longer waiting on cloud-init apt — pre-installed in user-data).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Imager 2.x replaced the three-button hub + popup-with-tabs flow with a
six-step wizard, so the old screenshot description, "EDIT SETTINGS"
popup, and General/Services/Options tab walkthrough no longer match
what users see. Rewrites Chapter 4:

- Replaces the ASCII three-button mock with a wizard description and
  a one-line note that older-Imager users can still follow the same
  field names.
- Section 4b is now "Step through the wizard" — same picks, framed
  as sequential Next-clicked steps instead of three buttons.
- Section 4c collapses the General/Services/Options tabs into one
  scrolling-form table, adds the new "Capital city" combined locale
  picker, calls out the new "Raspberry Pi Connect" toggle (leave off),
  and replaces the SAVE / YES / YES button sequence with the 2.x
  Save + confirm flow.
- Section 4d hedges the success-screen button name as Continue/Done.

Reported by pauljp during a fresh install with Imager 2.0.7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>