network: periodic TAP reaper + stop/create release fallbacks#222
Draft
sjmiller609 wants to merge 1 commit intomainfrom
Draft
network: periodic TAP reaper + stop/create release fallbacks#222sjmiller609 wants to merge 1 commit intomainfrom
sjmiller609 wants to merge 1 commit intomainfrom
Conversation
The startup-only CleanupOrphanedTAPs lets TAPs accumulate while hypeman is running when: - stop.go's GetAllocation fails: networkAlloc stays nil and the existing guard skips ReleaseAllocation despite the warning saying it would attempt cleanup, silently leaking the TAP. - create.go's rollback hits the same pattern: the cleanup closure depends on GetAllocation succeeding, otherwise the just-allocated TAP is never released. - A VMM crashes outside hypeman, so stop/release never runs. These leaks compounded into a recent bridge_full incident. Add a ReleaseByInstanceID best-effort fallback that synthesizes a minimal Allocation from the deterministic TAP name and persisted class ID, then wire it into both leak sites. Add a periodic reaper that calls CleanupOrphanedTAPs with a 5-minute age filter (sysfs ctime). Heavy create work (image pull) happens before CreateAllocation, so post-allocation work is bounded well under 5 minutes and the age filter avoids deleting TAPs whose owning instance metadata hasn't been persisted yet. The reaper preserves TAPs for Running, Initializing, and Unknown instances — same rules as the startup pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
ReleaseByInstanceIDbest-effort fallback on the network manager and wire it intostop.goandcreate.gorollback. Both currently rely onGetAllocationsucceeding; when it fails, the existing nil-guard silently skips TAP release despite the warning claiming cleanup will be attempted.CleanupOrphanedTAPsevery 60s with a 5-minute age filter (sysfs ctime). Same preservation rules as the startup pass: Running, Initializing, Unknown.CleanupOrphanedTAPsnow takes aminAge time.Duration; startup passes 0 (no concurrent creates can be in flight at boot).Why
CleanupOrphanedTAPswas startup-only. While hypeman runs, TAPs leak from:stop.gonil-guard:GetAllocationerrors →networkAlloc = nil→ReleaseAllocationis skipped at the!= nilguard despite the warning log claiming otherwise.create.gorollback: same pattern; ifGetAllocationfails afterCreateAllocation, the just-created TAP is never released.ReleaseAllocation.These leaks contributed to a recent bridge-full incident (orphan TAPs accumulated until
vmbr0hitBR_MAX_PORTS = 1024).Race avoidance
The reaper uses sysfs
ctimeof/sys/class/net/<dev>to skip TAPs younger than 5 minutes. Heavy create work (image pull, etc.) happens beforeCreateAllocation; post-allocation work is bounded by volume attach + config disk + boot, which is well under 5 minutes in practice.ctimeresets on host reboot, but the startupCleanupOrphanedTAPspass runs once at boot with no age filter at a time when no concurrent creates exist, so the reset is safe.The reaper also short-circuits when the preserve set is empty (matches startup behavior, prevents clobbering TAPs from concurrent hypeman processes/tests) and skips when
ListInstanceserrors.Observability
hypeman_network_tap_operations_total{operation="cleanup_orphan"}is now incremented for each GC delete (the existing metric counter already tracked create/delete; orphan cleanup wasn't counted before).Test plan
dev-yul-hypeman-0, leave a deliberately-orphaned TAP (ip tuntap add hype-fake mode tap), wait ~6 min, confirm it's reaped viaip link show | grep hype-and the new metric counterUnknown-state instance