## Feature: Prevent Workload Reshuffling During Control-Plane Disconnects #173

@HAHermsen

The Problem

When control-plane connectivity to a worker node is lost, K8s treats the node as failed: it marks the node unreachable and, once the pods' toleration for that condition expires (300 seconds by default), evicts the workloads and reschedules them on other nodes.

For location-bound real-time workloads (for example, those holding EtherCAT or PROFINET connections), this is catastrophic. The workload is still running fine on its hardware, but K8s reshuffles it anyway, breaking the control loop.

K8s can't distinguish between:

  • Transient network loss (worker is healthy, just disconnected)
  • Actual node failure (worker is dead)

So it treats both the same: reshuffle everything.
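For context, stock Kubernetes offers only a blunt workaround. A pod that tolerates the `node.kubernetes.io/unreachable` and `node.kubernetes.io/not-ready` taints with effect `NoExecute` and no `tolerationSeconds` stays bound to its node indefinitely. The sketch below generates such a manifest as JSON. The pod name, hostname, image, and the helper `pin_during_disconnect` are hypothetical placeholders, not part of any Margo or Kubernetes API. This setting keeps the pod pinned through real node failures as well, so it does not provide the distinction this issue asks for.

```python
import json

# Taints Kubernetes places on a node whose heartbeats stop arriving.
NODE_TAINT_KEYS = (
    "node.kubernetes.io/unreachable",
    "node.kubernetes.io/not-ready",
)


def pin_during_disconnect(pod: dict) -> dict:
    """Return a copy of a pod manifest that tolerates the node taints forever.

    Hypothetical helper for illustration only.
    """
    pinned = json.loads(json.dumps(pod))  # deep copy via a JSON round trip
    tolerations = pinned.setdefault("spec", {}).setdefault("tolerations", [])
    for key in NODE_TAINT_KEYS:
        # Leaving out tolerationSeconds means the taint is tolerated indefinitely.
        tolerations.append({"key": key, "operator": "Exists", "effect": "NoExecute"})
    return pinned


# Hypothetical location-bound workload; all names below are placeholders.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "ethercat-controller"},
    "spec": {
        "nodeSelector": {"kubernetes.io/hostname": "edge-node-1"},
        "containers": [
            {"name": "controller", "image": "example.invalid/ethercat-controller:1.0"}
        ],
    },
}

print(json.dumps(pin_during_disconnect(pod), indent=2))
```

The printed JSON can be applied like any other manifest. The limitation is visible in the code: the toleration is static, so nothing in it reacts to whether the node is actually healthy.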

Motivation

Industrial edge deployments operate in environments with unreliable network connectivity (4G/5G dropouts, Wi-Fi interference, coverage gaps). Control-plane disconnects there are temporary and expected.

But current K8s behavior treats every control-plane disconnect that outlasts the eviction timeout as a permanent node failure and reshuffles the workloads. This breaks location-bound real-time control loops that are physically wired to specific hardware.

A network hiccup (or longer outage) shouldn't destroy production. Margo needs semantics to distinguish transient control-plane loss from actual node failure, so that location-bound workloads survive network interruptions without being evicted, including after the connection is restored.

What We Need

Orchestration semantics that:

  • Don't assume node failure just because the control plane has lost the node's heartbeat
  • Keep location-bound workloads pinned during control-plane disconnects
  • Only evict if the worker node itself is actually unhealthy

This requires the ability to mark workloads as "location-bound" so the orchestrator knows: transient network loss ≠ node failure.
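The three requirements above can be sketched as a small decision function. This is only an illustration of the requested semantics under stated assumptions: `Action`, `NodeObservation`, and `decide` are hypothetical names, and `locally_healthy` assumes some out-of-band health signal (for example a local watchdog) exists, which this issue does not specify.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    KEEP = "keep"
    EVICT = "evict"


@dataclass(frozen=True)
class NodeObservation:
    # True while the control plane still receives the node's heartbeat.
    control_plane_reachable: bool
    # Assumed out-of-band health signal; None means "unknown".
    locally_healthy: Optional[bool]


def decide(location_bound: bool, node: NodeObservation) -> Action:
    """Hypothetical eviction policy for one workload on one node."""
    # Evict when the node itself is known to be unhealthy.
    if node.locally_healthy is False:
        return Action.EVICT
    # Heartbeat present and no known fault: nothing to do.
    if node.control_plane_reachable:
        return Action.KEEP
    # Heartbeat lost, node not known to be unhealthy:
    # pin location-bound workloads, reschedule everything else as today.
    return Action.KEEP if location_bound else Action.EVICT


disconnected = NodeObservation(control_plane_reachable=False, locally_healthy=None)
print(decide(True, disconnected))   # location-bound workload
print(decide(False, disconnected))  # ordinary workload
```

Treating an unknown health state as "keep" for location-bound workloads is a design choice in this sketch. It favors an uninterrupted control loop over fast failover, which matches the priority stated in this issue.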

How This Relates to Margo

Device capabilities (#96, #136) could enable this by allowing the Workload Fleet Manager (WFM) to understand which workloads are location-bound and shouldn't be evicted on control-plane loss.


Posted as an individual contributor.
