feature: metis: implement dynamic allocation and draining in metis daemon #1110
YifeiZhuang wants to merge 16 commits
gnossen left a comment:
Just a partial review. Will try to get the rest in by the end of the day.
    fs.StringVar(&o.DBPath, "db-path", pkg.DefaultDBPath, "Path to the SQLite database file")
    fs.StringVar(&o.SocketPath, "socket-path", pkg.DefaultSockPath, "Path to the Unix domain socket")
    fs.DurationVar(&o.DrainingExpiration, "draining-expiration", daemon.DefaultDrainingExpiration, "Draining expiration duration (e.g., 5h)")
    fs.DurationVar(&o.SustainedLowUtilizationDuration, "sustained-low-utilization-duration", daemon.DefaultSustainedLowUtilizationDuration, "Sustained low utilization duration (e.g., 8h)")
Can we get a little bit more explanation here? Currently, this kind of just repeats the flag name without expanding. I'd also be fine if this went in as docstrings instead.
    nodeName := os.Getenv("NODE_NAME")
    if nodeName == "" {
        logger.Info("NODE_NAME environment variable not set")
When you deploy kubelet to a Node, it defaults the name of that node to the hostname of the machine it's running on. I'd recommend using os.Hostname in case NODE_NAME is not set.
    var nncClient nncclientset.Interface
    var kubeClient kubernetes.Interface
    if err != nil {
        logger.Info("Failed to get in-cluster config, clients will not be initialized")
Should this be a fatal failure? You could argue no and that the daemon should operate in a degraded mode and still allocate IPs from the initial range.
But it should at least log at a level higher than Info, right?
Same question/comment for nncClient and kubeClient below.
        defaultMonitorWorkers = 4
    )

    type Monitor struct {
This is a pretty central struct. It would be nice to have a docstring explaining what Monitor does and what it's responsible for (there will definitely be a bit of duplication with your design doc).
    })

    cooldownPushbackInterval := cfg.CooldownPushbackInterval
    if cooldownPushbackInterval <= 0 {
This "0 is interpreted as default value" semantic should be reflected in the flag help text. Same for all of the flags doing this.
    const defaultWatcherWorkers = 4

    type Watcher struct {
Docstring explaining Watcher would be nice.
This change implements the dynamic allocation and draining of Pod CIDR blocks in the Metis Daemon.
Components and Changes:
Daemon Server
- Maintains an in-memory registry of blocked requests (requestsMap). This is a nested map keyed by network and containerID, pointing to notification channels (chan struct{}).

Daemon Monitor (monitor.go)
- It uses a standard Kubernetes workqueue (specifically a TypedRateLimitingInterface). Networks are enqueued for evaluation.
- Periodic Enqueuer (for pre-fetch, runs the syncNetwork task): A background goroutine periodically lists all active networks from the store and pushes them into the workqueue to ensure no network starves.
- Dynamic Allocation: The daemon server enqueues a syncNetwork task. Pre-fetch and dynamic allocation run the same scale-up code path.
- Scale-Up Logic: Computes the desired target pods to ensure that the total capacity represented in the CR (Pods + Initial IPs) matches or exceeds the current local total IPs plus pending requests, with a buffer to maintain target utilization.
- Scale-Down (Drain) Logic: Identifies excess blocks when utilization falls below the threshold for a sustained period and marks them as Draining.
- Stateful Timers: It maintains an in-memory map of timestamps (lowUtilizationTimers) to track how long a network has been under-utilized, ensuring capacity is not released too eagerly during transient low-load periods.
- CRD Interaction: It acts as a client to the Kubernetes API, sending JSON Merge Patches to the NodeNetworkConfig (NNC) custom resource.

Watcher (watcher.go)
- Informer Pattern: It uses a Kubernetes SharedInformer to watch for modifications to the NodeNetworkConfig resource matching the local node's name.
- DB Operations: On receiving an event, it compares NodeNetworkConfigStatus.PodCIDRs with the local store. It inserts newly added CIDRs and removes from the local store any CIDR blocks that have been removed from GCE.
- Callback Interface: It does not directly know about the Server's blocked requests. Instead, it uses a registered callback (onCIDRAdded) to notify the server when new capacity is successfully written to the DB.

Dynamic Allocation Path
1: Exhaustion & Request Blocking (Server)
- A CNI request arrives at AllocatePodIP.
- The server attempts to allocate an IP from the local store via store.AllocateIPv4. If all blocks are full, it returns ErrNoAvailableIPs.
- The server calls handleDynamicAllocation, which:
  - Enqueues the network to the Monitor.
  - Creates a channel, stores it in requestsMap, and executes a select statement blocking on that channel and a context timeout.
2: Capacity Request (Monitor)
- A Monitor worker pops the network from the queue and calls syncNetwork.
- It calculates the required baseline capacity as total local IPs + pending requests.
- It ensures that the total IPs represented in the CR (newPods + initialIPs) will match this baseline at minimum, or be larger to maintain a target utilization buffer. If the newly calculated newPods is greater than currentPods in the CR, it triggers the scale-up patch.
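As a rough illustration of the arithmetic, the sketch below models the buffer by dividing the baseline by a target utilization percentage. That buffer formula, the function names, and the example numbers are assumptions; the description does not spell out how the PR computes the buffer.

```go
package main

import "fmt"

// desiredPods returns a pod count such that newPods + initialIPs covers
// (localIPs + pending) at the given target utilization.
// ASSUMPTION: the buffer is modeled as ceil(baseline * 100 / targetPct).
func desiredPods(localIPs, pending, initialIPs, targetPct int) int {
	if targetPct <= 0 || targetPct > 100 {
		targetPct = 100 // no buffer when the target is out of range
	}
	baseline := localIPs + pending
	total := (baseline*100 + targetPct - 1) / targetPct // integer ceiling
	newPods := total - initialIPs
	if newPods < 0 {
		newPods = 0
	}
	return newPods
}

// needsScaleUp reports whether a scale-up patch should be sent.
func needsScaleUp(newPods, currentPods int) bool {
	return newPods > currentPods
}

func main() {
	// 64 local IPs + 4 pending = 68 baseline; at 80% target the total is 85,
	// of which 32 are initial IPs, leaving 53 pods to request.
	newPods := desiredPods(64, 4, 32, 80)
	fmt.Println(newPods, needsScaleUp(newPods, 32))
}
```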
3: Cloud Allocation (GCE): External
4: Synchronization & Wakeup (Watcher & Server)
- The daemon's Watcher receives the NNC update event via the Informer.
- It iterates through Status.PodCIDRs, finds the new block, and inserts it into the local SQLite DB.
- The Watcher calls server.onCIDRAdded().
- The server looks up the blocked channels for that network in requestsMap and closes them.
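The synchronization step amounts to a set difference between the CR status and the local store, followed by a callback. The sketch below uses an in-memory map in place of the SQLite store; `memStore`, `diffCIDRs`, `reconcile`, and the callback signature taking a network name are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// diffCIDRs compares the CIDRs listed in the CR status with those in the
// local store and returns what must be inserted and what must be deleted.
func diffCIDRs(status, local []string) (added, removed []string) {
	inStatus := make(map[string]bool, len(status))
	for _, c := range status {
		inStatus[c] = true
	}
	inLocal := make(map[string]bool, len(local))
	for _, c := range local {
		inLocal[c] = true
	}
	for c := range inStatus {
		if !inLocal[c] {
			added = append(added, c)
		}
	}
	for c := range inLocal {
		if !inStatus[c] {
			removed = append(removed, c)
		}
	}
	sort.Strings(added)
	sort.Strings(removed)
	return added, removed
}

// memStore is an in-memory stand-in for the SQLite-backed store.
type memStore struct{ cidrs map[string]bool }

func (s *memStore) list() []string {
	out := make([]string, 0, len(s.cidrs))
	for c := range s.cidrs {
		out = append(out, c)
	}
	return out
}

// reconcile applies the diff to the store and invokes onCIDRAdded once if
// any new block was written.
func reconcile(s *memStore, network string, status []string, onCIDRAdded func(network string)) (added, removed []string) {
	added, removed = diffCIDRs(status, s.list())
	for _, c := range added {
		s.cidrs[c] = true
	}
	for _, c := range removed {
		delete(s.cidrs, c)
	}
	if len(added) > 0 && onCIDRAdded != nil {
		onCIDRAdded(network)
	}
	return added, removed
}

func main() {
	s := &memStore{cidrs: map[string]bool{"10.0.0.0/28": true, "10.0.1.0/28": true}}
	added, removed := reconcile(s, "net-a", []string{"10.0.0.0/28", "10.0.2.0/28"},
		func(network string) { fmt.Println("wake waiters for", network) })
	fmt.Println("added:", added, "removed:", removed)
}
```

Keeping the callback as a plain function value means the watcher side needs no knowledge of how the server tracks blocked requests.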
cc. @zhaoqsh @arvindbr8 @gnossen