@@ -244,6 +244,43 @@ kubectl -n "$NS" patch nodefeature "${NODE}-gpustack-worker" --type=merge \
 kubectl -n default delete instance gpustack-e2e-instance
 ```
 
+### 4b. Managed toggle: a *second, independent* drain trigger (run when the change touches the ResourceFlavor/Cohort Node-watch)
+
+Excluding a node from management (`gpustack.ai/managed=false`) must drain its single-node
+ResourceFlavors with the **same** chain as §4 (flavor `schedule.gpustack.ai/drain=true` → ClusterQueue
+`HoldAndDrain` → the InstanceType's running Instances `spec.stop=true`). What is non-obvious is that it
+is a *different trigger on a different code path*:
+
+- A §4 capacity reshape changes a *feature label*, so any feature-prefix predicate fires. **A managed
+  toggle changes only `gpustack.ai/managed`** (no feature label), so it drains **only if** the
+  `ResourceFlavorReconciler`/`CohortReconciler` Node-watch `UpdateFunc` predicates include
+  `systemname.ManagedLabelKey` in their `mapx.EqualWithStringPrefix(...)`
+  (`pkg/worker/controllers/worker/{resourceflavor,cohort}.go`). Missing it is the historical bug: the
+  flavor is never enqueued or drained, while the ClusterQueue silently recomputes to a misleading
+  `0/-1` (Active but negative-remaining) quota and the Instance keeps running.
+- **Restart masks it.** The `For`-watch start-up resync re-reconciles every ResourceFlavor, so a freshly
+  (re)started operator drains the orphan regardless of the predicate. Verify against a **continuously
+  running** operator: do not restart between the toggle and the assertion.
+- Toggle via the NodeFeature, not the node (§4 explains why: NFD reverts a direct node label). The unit
+  cases `unmanaged node drains flavor` / `unmanaged node deletes cohort` only guard the index filter,
+  **not** the predicate, so this live check is the only guard for the enqueue path.
+
+```bash
+NS=gpustack-system; NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
+before=$(kubectl get node "$NODE" -o jsonpath='{.metadata.labels.gpustack\.ai/managed}')
+
+# Toggle out of management, then poll the §4 chain (flavor drain → CQ HoldAndDrain → Instance stop).
+kubectl -n "$NS" patch nodefeature "${NODE}-gpustack-worker" --type=merge \
+  -p '{"spec":{"labels":{"gpustack.ai/managed":"false"}}}'
+
+# Restore (skip if doing a full §6 teardown).
+kubectl -n "$NS" patch nodefeature "${NODE}-gpustack-worker" --type=merge \
+  -p "{\"spec\":{\"labels\":{\"gpustack.ai/managed\":\"${before:-true}\"}}}"
+```
+
+> Toggling a node that hosts a *running* Instance stops that Instance, so on a shared cluster pick a node
+> whose Instances you can disrupt (or one with none, to assert the flavor/CQ drain alone).
+
 ## 5. Optional — simulated accelerator & drain-recycle (accelerated chain)
 
 This exercises the accelerated chain and the drain-recycle behavior (the `ResourceFlavor` tombstone,
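
The §4b snippet says to poll the §4 chain but leaves the polling itself to the reader. A minimal sketch of one way to do it follows. `poll_until`, `flavor_draining`, and `FLAVOR` are hypothetical names, not part of the repository, and the check assumes the drain marker is a label on a ResourceFlavor readable via `kubectl get resourceflavor`, which the diff does not confirm.

```shell
# poll_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it exits 0, or returns 1 once TIMEOUT_SECONDS have elapsed.
poll_until() {
  _timeout=$1; _interval=$2; shift 2
  _elapsed=0
  until "$@"; do
    if [ "$_elapsed" -ge "$_timeout" ]; then
      echo "timed out after ${_timeout}s: $*" >&2
      return 1
    fi
    sleep "$_interval"
    _elapsed=$((_elapsed + _interval))
  done
}

# Hypothetical check (assumption): the flavor carries the label
# schedule.gpustack.ai/drain=true, and FLAVOR holds the flavor's name.
flavor_draining() {
  [ "$(kubectl get resourceflavor "$FLAVOR" \
        -o jsonpath='{.metadata.labels.schedule\.gpustack\.ai/drain}')" = "true" ]
}

# Usage, between the toggle and the restore, without restarting the operator:
#   FLAVOR=<single-node flavor name>; poll_until 120 5 flavor_draining
```

Polling with a bounded timeout keeps the operator running continuously during the assertion, which matters here because a restart would drain the flavor regardless of the predicate.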