Skip to content

Delete tc classes by TAP filter ownership#296

Merged
sjmiller609 merged 8 commits into
mainfrom
hypeship/minimal-tc-cleanup
Jun 30, 2026
Merged

Delete tc classes by TAP filter ownership#296
sjmiller609 merged 8 commits into
mainfrom
hypeship/minimal-tc-cleanup

Conversation

@sjmiller609

@sjmiller609 sjmiller609 commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Linear: KERNEL-1500

Summary

This changes TAP deletion to remove bridge tc state only when ownership is proven by the TAP’s live filter. Before deleting a TAP, hypeman now reads the TAP ifindex, finds bridge filters whose meta(rt_iif) matches that ifindex, deletes those filter handles, and deletes the classes referenced by their flowids.

It also extends the existing TAP GC cadence to reap orphaned bridge tc state. The reaper uses live hype-* TAP ifindexes as the source of truth, deletes filters whose rt_iif is not live, then deletes non-root HTB classes not referenced by the remaining live filters. If filters exist but no rt_iif values parse, it bails without deleting anything.

The read-only dev-host probe matched this model: 16 live TAPs, 109 bridge filters/classes, and 93 stale filters/classes that the reaper would prune.

Finally, it adds a raw host accounting metric, hypeman_network_bridge_htb_class_count, which observes the current number of non-root HTB classes on the configured bridge by listing tc state directly.

Tests

  • go test ./lib/network
  • GOOS=darwin go test -c ./lib/network -o /tmp/hypeman-network-darwin.test
  • Updated TestCreateInstanceWithNetwork to exercise real bridge tc state: it creates upload shaping, stages one stale bridge filter/class pair, runs cleanup, and asserts the live filter/class remains while the stale pair is removed.
  • go test ./lib/instances -run TestCreateInstanceWithNetwork -count=1 did not run locally because required embedded binaries are absent from this checkout:
    • lib/vmm/binaries_linux.go: pattern binaries/cloud-hypervisor/v49.0/aarch64/cloud-hypervisor: no matching files found
    • lib/system/guest_agent_binary.go: pattern guest_agent/guest-agent: no matching files found

Note

Medium Risk
Changes Linux host bridge traffic-control delete/GC paths; mistakes could remove live shaping or leave stale state, though parsing failures skip destructive cleanup and live TAP filters are explicitly protected.

Overview
Bridge upload shaping cleanup now keys off live tc ownership instead of persisted or hash-derived class IDs. On TAP delete/release, hypeman resolves the TAP ifindex, finds bridge basic filters whose meta(rt_iif) matches it, removes those handles, then drops the referenced HTB classes—so probed/collision class IDs are torn down correctly without relying on classid files for deletion.

Orphaned bridge tc reaping is reworked to treat live hype-* TAP ifindexes as truth: CleanupOrphanedClasses parses filter output, deletes filters whose rt_iif is not live, then removes non-root classes not protected by remaining filters; it no-ops if filters exist but no rt_iif values parse. The periodic TAP GC reconciler still runs this pass when instance listing fails (TAP cleanup may be skipped, but stale filters/classes can still be pruned).

Adds hypeman_network_bridge_htb_class_count (non-root HTB classes on the bridge), unit tests for filter/class planning, and integration coverage that stages live vs stale bridge tc and asserts cleanup keeps the live pair.

Reviewed by Cursor Bugbot for commit aa42934. Bugbot is set up for automated code reviews on this repo. Configure here.

@sjmiller609 sjmiller609 marked this pull request as ready for review June 30, 2026 18:52
Comment thread lib/network/bridge_linux.go
Comment thread lib/instances/tap_gc.go Outdated
Comment thread lib/network/bridge_linux.go
@sjmiller609 sjmiller609 requested a review from hiroTamada June 30, 2026 19:53

@hiroTamada hiroTamada left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approving — solid PR. re-rooting tc cleanup on live filter ownership is the right call, and it fixes a real, measured leak. a couple of intentional-behavior confirmations and some nits below, none blocking.

what I verified holds up:

  • the tcMu lock added to CleanupOrphanedClasses genuinely closes the create-vs-GC race: addVMClass runs under the same lock (allocate.go:268) and creates the class before its filter, so without the lock GC could delete a half-built class out from under an in-flight allocation. good catch.
  • every ambiguous case errs toward leaving state for a later sweep rather than deleting live state; ifindex reuse can only cause a benign leak, never a wrong deletion.
  • decoupling the class sweep from ListInstances success is a real robustness win — the sweep is driven by live netlink TAPs, so it's safe to run even when the instance store is down.
  • builds on linux + darwin; pure parser/planner unit tests are good and green.

Questions / confirm intent

  • lib/network/bridge_linux.go:949planOrphanedBridgeTC's safety bail is all-or-nothing (fires only when every candidate filter fails to parse rt_iif). a lone live filter that parses as rtIif=-1 would be classified stale and have its filter+class deleted, silently dropping that VM's upload shaping (HTB has no default class). TestPlanOrphanedBridgeTC encodes this as intended — worth weighing a per-filter "unparsed → skip, don't delete" stance vs the global bail.
  • lib/network/bridge_linux.go:1036 — behavior change: deleteTAPDevice no longer reaps the HTB class when the TAP link is already gone (old code cleaned via stored/derived id); now deferred to the periodic GC. confirm intentional — class survives until the next sweep instead of a synchronous reap on release.
  • lib/network/allocate.go:418 / types.go:39 — the persisted classid file + ClassID field are now write-only: still saved / loaded into the struct / cleared on delete, but nothing reads the value for any decision (the old classIDForDelete was the only consumer). keep for observability or remove in a follow-up?

Nits

  • lib/network/bridge_linux.go:992removeVMClass logs the list-filters error via context.Background(), losing trace correlation (fn doesn't take a ctx).
  • lib/network/bridge_linux.go:930bridgeHTBClassCount omits the SysProcAttr{AmbientCaps: CAP_NET_ADMIN} that every other tc call in this file sets; mirrors the existing class show so probably fine, but worth confirming it doesn't need it.

Rollout note

the integration test that exercises real deletes wasn't run (missing embedded binaries per the description), and the dev-host probe was read-only — so what's been validated is what gets selected for deletion (the high-risk half), not the live delete execution. suggest canarying one host and watching hypeman_network_bridge_htb_class_count against live-VM count, since that metric is purpose-built to catch this exact failure mode.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using high effort and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit dfba062. Configure here.

Comment thread lib/network/bridge_linux.go
@sjmiller609 sjmiller609 merged commit 189c12f into main Jun 30, 2026
13 of 14 checks passed
@sjmiller609 sjmiller609 deleted the hypeship/minimal-tc-cleanup branch June 30, 2026 21:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants