Skip to content

network: periodic TAP reaper + stop/create release fallbacks#222

Draft
sjmiller609 wants to merge 1 commit intomainfrom
hypeship/tap-gc-and-stop-nil-guard
Draft

network: periodic TAP reaper + stop/create release fallbacks#222
sjmiller609 wants to merge 1 commit intomainfrom
hypeship/tap-gc-and-stop-nil-guard

Conversation

@sjmiller609
Copy link
Copy Markdown
Collaborator

Summary

  • Add a ReleaseByInstanceID best-effort fallback on the network manager and wire it into stop.go and create.go rollback. Both currently rely on GetAllocation succeeding; when it fails, the existing nil-guard silently skips TAP release despite the warning claiming cleanup will be attempted.
  • Add a periodic TAP garbage collector that calls the existing CleanupOrphanedTAPs every 60s with a 5-minute age filter (sysfs ctime). Same preservation rules as the startup pass: Running, Initializing, Unknown.
  • CleanupOrphanedTAPs now takes a minAge time.Duration; startup passes 0 (no concurrent creates can be in flight at boot).

Why

CleanupOrphanedTAPs was startup-only. While hypeman runs, TAPs leak from:

  • stop.go nil-guard: GetAllocation errors → networkAlloc = nilReleaseAllocation is skipped at the != nil guard despite the warning log claiming otherwise.
  • create.go rollback: same pattern; if GetAllocation fails after CreateAllocation, the just-created TAP is never released.
  • VMM crashes outside hypeman: nothing triggers ReleaseAllocation.

These leaks contributed to a recent bridge-full incident (orphan TAPs accumulated until vmbr0 hit BR_MAX_PORTS = 1024).

Race avoidance

The reaper uses sysfs ctime of /sys/class/net/<dev> to skip TAPs younger than 5 minutes. Heavy create work (image pull, etc.) happens before CreateAllocation; post-allocation work is bounded by volume attach + config disk + boot, which is well under 5 minutes in practice. ctime resets on host reboot, but the startup CleanupOrphanedTAPs pass runs once at boot with no age filter at a time when no concurrent creates exist, so the reset is safe.

The reaper also short-circuits when the preserve set is empty (matches startup behavior, prevents clobbering TAPs from concurrent hypeman processes/tests) and skips when ListInstances errors.

Observability

hypeman_network_tap_operations_total{operation="cleanup_orphan"} is now incremented for each GC delete (the existing metric counter already tracked create/delete; orphan cleanup wasn't counted before).

Test plan

  • CI green
  • Manual: deploy to dev-yul-hypeman-0, leave a deliberately-orphaned TAP (ip tuntap add hype-fake mode tap), wait ~6 min, confirm it's reaped via ip link show | grep hype- and the new metric counter
  • Manual: confirm the reaper does not touch a TAP for an Unknown-state instance

The startup-only CleanupOrphanedTAPs lets TAPs accumulate while hypeman is
running when:
- stop.go's GetAllocation fails: networkAlloc stays nil and the existing
  guard skips ReleaseAllocation despite the warning saying it would attempt
  cleanup, silently leaking the TAP.
- create.go's rollback hits the same pattern: the cleanup closure depends
  on GetAllocation succeeding, otherwise the just-allocated TAP is never
  released.
- A VMM crashes outside hypeman, so stop/release never runs.

These leaks compounded into a recent bridge_full incident.

Add a ReleaseByInstanceID best-effort fallback that synthesizes a minimal
Allocation from the deterministic TAP name and persisted class ID, then
wire it into both leak sites.

Add a periodic reaper that calls CleanupOrphanedTAPs with a 5-minute age
filter (sysfs ctime). Heavy create work (image pull) happens before
CreateAllocation, so post-allocation work is bounded well under 5 minutes
and the age filter avoids deleting TAPs whose owning instance metadata
hasn't been persisted yet. The reaper preserves TAPs for Running,
Initializing, and Unknown instances — same rules as the startup pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant