Delete tc classes by TAP filter ownership#296
Conversation
hiroTamada
left a comment
There was a problem hiding this comment.
approving — solid PR. re-rooting tc cleanup on live filter ownership is the right call, and it fixes a real, measured leak. a couple of intentional-behavior confirmations and some nits below, none blocking.
what I verified holds up:
- the
tcMulock added toCleanupOrphanedClassesgenuinely closes the create-vs-GC race:addVMClassruns under the same lock (allocate.go:268) and creates the class before its filter, so without the lock GC could delete a half-built class out from under an in-flight allocation. good catch. - every ambiguous case errs toward leaving state for a later sweep rather than deleting live state; ifindex reuse can only cause a benign leak, never a wrong deletion.
- decoupling the class sweep from
ListInstancessuccess is a real robustness win — the sweep is driven by live netlink TAPs, so it's safe to run even when the instance store is down. - builds on linux + darwin; pure parser/planner unit tests are good and green.
Questions / confirm intent
lib/network/bridge_linux.go:949—planOrphanedBridgeTC's safety bail is all-or-nothing (fires only when every candidate filter fails to parsert_iif). a lone live filter that parses asrtIif=-1would be classified stale and have its filter+class deleted, silently dropping that VM's upload shaping (HTB has no default class).TestPlanOrphanedBridgeTCencodes this as intended — worth weighing a per-filter "unparsed → skip, don't delete" stance vs the global bail.lib/network/bridge_linux.go:1036— behavior change:deleteTAPDeviceno longer reaps the HTB class when the TAP link is already gone (old code cleaned via stored/derived id); now deferred to the periodic GC. confirm intentional — class survives until the next sweep instead of a synchronous reap on release.lib/network/allocate.go:418/types.go:39— the persistedclassidfile +ClassIDfield are now write-only: still saved / loaded into the struct / cleared on delete, but nothing reads the value for any decision (the oldclassIDForDeletewas the only consumer). keep for observability or remove in a follow-up?
Nits
lib/network/bridge_linux.go:992—removeVMClasslogs the list-filters error viacontext.Background(), losing trace correlation (fn doesn't take actx).lib/network/bridge_linux.go:930—bridgeHTBClassCountomits theSysProcAttr{AmbientCaps: CAP_NET_ADMIN}that every othertccall in this file sets; mirrors the existingclass showso probably fine, but worth confirming it doesn't need it.
Rollout note
the integration test that exercises real deletes wasn't run (missing embedded binaries per the description), and the dev-host probe was read-only — so what's been validated is what gets selected for deletion (the high-risk half), not the live delete execution. suggest canarying one host and watching hypeman_network_bridge_htb_class_count against live-VM count, since that metric is purpose-built to catch this exact failure mode.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using high effort and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit dfba062. Configure here.

Linear: KERNEL-1500
Summary
This changes TAP deletion to remove bridge tc state only when ownership is proven by the TAP’s live filter. Before deleting a TAP, hypeman now reads the TAP ifindex, finds bridge filters whose
meta(rt_iif)matches that ifindex, deletes those filter handles, and deletes the classes referenced by their flowids.It also extends the existing TAP GC cadence to reap orphaned bridge tc state. The reaper uses live
hype-*TAP ifindexes as the source of truth, deletes filters whosert_iifis not live, then deletes non-root HTB classes not referenced by the remaining live filters. If filters exist but nort_iifvalues parse, it bails without deleting anything.The read-only dev-host probe matched this model: 16 live TAPs, 109 bridge filters/classes, and 93 stale filters/classes that the reaper would prune.
Finally, it adds a raw host accounting metric,
hypeman_network_bridge_htb_class_count, which observes the current number of non-root HTB classes on the configured bridge by listing tc state directly.Tests
go test ./lib/networkGOOS=darwin go test -c ./lib/network -o /tmp/hypeman-network-darwin.testTestCreateInstanceWithNetworkto exercise real bridge tc state: it creates upload shaping, stages one stale bridge filter/class pair, runs cleanup, and asserts the live filter/class remains while the stale pair is removed.go test ./lib/instances -run TestCreateInstanceWithNetwork -count=1did not run locally because required embedded binaries are absent from this checkout:lib/vmm/binaries_linux.go: pattern binaries/cloud-hypervisor/v49.0/aarch64/cloud-hypervisor: no matching files foundlib/system/guest_agent_binary.go: pattern guest_agent/guest-agent: no matching files foundNote
Medium Risk
Changes Linux host bridge traffic-control delete/GC paths; mistakes could remove live shaping or leave stale state, though parsing failures skip destructive cleanup and live TAP filters are explicitly protected.
Overview
Bridge upload shaping cleanup now keys off live
tcownership instead of persisted or hash-derived class IDs. On TAP delete/release, hypeman resolves the TAP ifindex, finds bridgebasicfilters whosemeta(rt_iif)matches it, removes those handles, then drops the referenced HTB classes—so probed/collision class IDs are torn down correctly without relying onclassidfiles for deletion.Orphaned bridge
tcreaping is reworked to treat livehype-*TAP ifindexes as truth:CleanupOrphanedClassesparses filter output, deletes filters whosert_iifis not live, then removes non-root classes not protected by remaining filters; it no-ops if filters exist but nort_iifvalues parse. The periodic TAP GC reconciler still runs this pass when instance listing fails (TAP cleanup may be skipped, but stale filters/classes can still be pruned).Adds
hypeman_network_bridge_htb_class_count(non-root HTB classes on the bridge), unit tests for filter/class planning, and integration coverage that stages live vs stale bridgetcand asserts cleanup keeps the live pair.Reviewed by Cursor Bugbot for commit aa42934. Bugbot is set up for automated code reviews on this repo. Configure here.