feature: metis: implement dynamic allocation and draining in metis daemon #1110

Open
YifeiZhuang wants to merge 16 commits into kubernetes:master from YifeiZhuang:dynamic-alloc-impl

@YifeiZhuang

This change implements the dynamic allocation and draining of Pod CIDR blocks in the Metis Daemon.

Components and Changes:

  1. Daemon Server
    Maintains an in-memory registry of blocked requests (requestsMap). This is a nested map keyed first by network and then by containerID, whose values are notification channels (chan struct{}).

  2. Daemon Monitor (monitor.go)
    It uses a standard Kubernetes workqueue (specifically a TypedRateLimitingInterface). Networks are enqueued for evaluation.
    Periodic Enqueuer (for pre-fetch, runs the syncNetwork task): A background goroutine periodically lists all active networks from the store and pushes them into the workqueue so that no network starves.
    Dynamic Allocation: The daemon server enqueues a syncNetwork task. Pre-fetch and dynamic allocation run the same scale-up code path.
    Scale-Up Logic: Computes desired target pods to ensure that the total capacity represented in the CR (Pods + Initial IPs) matches or exceeds the current local total IPs plus pending requests, with a buffer to maintain target utilization.
    Scale-Down (Drain) Logic: Identifies excess blocks when utilization stays below the threshold for a sustained period and marks them as Draining.
    Stateful Timers: It maintains an in-memory map of timestamps (lowUtilizationTimers) to track how long a network has been under-utilized, ensuring capacity is not released too eagerly during transient low-load periods.
    CRD Interaction: It acts as a client to the Kubernetes API, sending JSON Merge Patches to the NodeNetworkConfig (NNC) custom resource.

  3. Watcher (watcher.go)
    Informer Pattern: It uses a Kubernetes SharedInformer to watch for modifications to the NodeNetworkConfig resource matching the local node's name.
    DB Operations: On receiving an event, it compares the NodeNetworkConfig Status.PodCIDRs with the local store. It inserts newly added CIDRs and removes from the local store any CIDR blocks that have been removed from GCE.
    Callback Interface: It does not directly know about the Server's blocked requests. Instead, it uses a registered callback (onCIDRAdded) to notify the server when new capacity is successfully written to the DB.
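
The scale-up rule described under the Monitor can be sketched as follows. This is only one plausible reading of the description, not the code in this PR: the function names, the integer-percent utilization parameter, and the rounding are all assumptions made for illustration.

```go
package main

import "fmt"

// desiredPods is a hypothetical helper. It returns the Pods value the CR
// would need so that (pods + initialIPs) covers the local total IPs plus
// pending requests, padded so that the baseline sits at or below the target
// utilization. targetUtilizationPct is an assumed integer percentage.
func desiredPods(localTotalIPs, pendingRequests, initialIPs, targetUtilizationPct int) int {
	baseline := localTotalIPs + pendingRequests

	// Treat out-of-range targets as "no buffer".
	if targetUtilizationPct <= 0 || targetUtilizationPct > 100 {
		targetUtilizationPct = 100
	}

	// Ceiling division: the smallest capacity such that
	// baseline / capacity <= targetUtilizationPct / 100.
	capacity := (baseline*100 + targetUtilizationPct - 1) / targetUtilizationPct

	pods := capacity - initialIPs
	if pods < 0 {
		return 0
	}
	return pods
}

// shouldScaleUp reports whether a scale-up patch is needed: only when the
// newly computed value exceeds what the CR currently requests.
func shouldScaleUp(newPods, currentPods int) bool {
	return newPods > currentPods
}

func main() {
	newPods := desiredPods(64, 4, 32, 80)
	fmt.Println(newPods, shouldScaleUp(newPods, 32))
}
```

With 64 local IPs, 4 pending requests, 32 initial IPs, and an 80% target, the baseline is 68, the padded capacity is 85, and the sketch asks for 53 pods.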

Dynamic Allocation Path

1: Exhaustion & Request Blocking (Server)
    A CNI request arrives at AllocatePodIP.
    The server attempts to allocate an IP from the local store via store.AllocateIPv4. If all blocks are full, it returns ErrNoAvailableIPs.
    The server calls handleDynamicAllocation, which:
        Enqueues the network to the Monitor.
        Creates a channel, stores it in requestsMap, and executes a select statement blocking on that channel and a context timeout.
2: Capacity Request (Monitor)
    A Monitor worker pops the network from the queue and calls syncNetwork.
    It calculates the required baseline capacity as total local IPs + pending requests.
    It ensures that the total IPs represented in the CR (newPods + initialIPs) will match this baseline at minimum, or be larger to maintain a target utilization buffer. If the newly calculated newPods is greater than currentPods in the CRD, it triggers the scale-up patch.
3: Cloud Allocation (GCE): External
4: Synchronization & Wakeup (Watcher & Server)
    The daemon's Watcher receives the NNC update event via the Informer.
    It iterates through Status.PodCIDRs, finds the new block, and inserts it into the local SQLite DB.
    The Watcher calls server.onCIDRAdded().
    The server looks up the blocked channels for that network in requestsMap and closes them.

cc. @zhaoqsh @arvindbr8 @gnossen

@k8s-ci-robot added the cncf-cla: yes and needs-triage labels May 6, 2026
@k8s-ci-robot

This issue is currently awaiting triage.

If the repository maintainers determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: YifeiZhuang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the needs-rebase and approved labels May 6, 2026
@k8s-ci-robot requested review from cheftako and justinsb May 6, 2026 23:39
@k8s-ci-robot added the size/XXL label May 6, 2026
@k8s-ci-robot removed the needs-rebase label May 7, 2026
@YifeiZhuang force-pushed the dynamic-alloc-impl branch from ed1eef9 to 915df63 May 7, 2026 00:16
@YifeiZhuang force-pushed the dynamic-alloc-impl branch from 10bbde7 to e457f25 May 8, 2026 18:28
@k8s-ci-robot added the needs-rebase label May 13, 2026
@k8s-ci-robot removed the needs-rebase label May 15, 2026
@k8s-ci-robot

@YifeiZhuang: The following test failed. Say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| cloud-provider-gcp-tests | cc01171 | link | true | /test cloud-provider-gcp-tests |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

@k8s-ci-robot added the needs-rebase label May 20, 2026
@k8s-ci-robot

PR needs rebase.

@gnossen left a comment


Just a partial review. Will try to get the rest in by the end of the day.

fs.StringVar(&o.DBPath, "db-path", pkg.DefaultDBPath, "Path to the SQLite database file")
fs.StringVar(&o.SocketPath, "socket-path", pkg.DefaultSockPath, "Path to the Unix domain socket")
fs.DurationVar(&o.DrainingExpiration, "draining-expiration", daemon.DefaultDrainingExpiration, "Draining expiration duration (e.g., 5h)")
fs.DurationVar(&o.SustainedLowUtilizationDuration, "sustained-low-utilization-duration", daemon.DefaultSustainedLowUtilizationDuration, "Sustained low utilization duration (e.g., 8h)")

Can we get a little bit more explanation here? Currently, this kind of just repeats the flag name without expanding. I'd also be fine if this went in as docstrings instead.


nodeName := os.Getenv("NODE_NAME")
if nodeName == "" {
logger.Info("NODE_NAME environment variable not set")

When you deploy kubelet to a Node, it defaults the name of that node to the hostname of the machine it's running on. I'd recommend using os.Hostname in case NODE_NAME is not set.

var nncClient nncclientset.Interface
var kubeClient kubernetes.Interface
if err != nil {
logger.Info("Failed to get in-cluster config, clients will not be initialized")

Should this be a fatal failure? You could argue no and that the daemon should operate in a degraded mode and still allocate IPs from the initial range.

But it should at least log at a level higher than Info, right?

Same question/comment for nncClient and kubeClient below.

defaultMonitorWorkers = 4
)

type Monitor struct {

This is a pretty central struct. It would be nice to have a docstring explaining what Monitor does and what it's responsible for (there will definitely be a bit of duplication with your design doc).

})

cooldownPushbackInterval := cfg.CooldownPushbackInterval
if cooldownPushbackInterval <= 0 {

This "0 is interpreted as default value" semantic should be reflected in the flag help text. Same for all of the flags doing this.


const defaultWatcherWorkers = 4

type Watcher struct {

Docstring explaining Watcher would be nice.

Labels

approved - Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
needs-rebase - Indicates a PR cannot be merged because it has merge conflicts with HEAD.
needs-triage - Indicates an issue or PR lacks a `triage/foo` label and requires one.
size/XXL - Denotes a PR that changes 1000+ lines, ignoring generated files.

3 participants