-
Notifications
You must be signed in to change notification settings - Fork 51
CFP-44188: Vtep Improvements with CRD #92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,296 @@ | ||
| # CFP-44188: CiliumVTEPConfig CRD for Dynamic VTEP Management | ||
|
|
||
| **SIG:** SIG-Datapath ([View all current SIGs](https://docs.cilium.io/en/stable/community/community/#all-sigs)) | ||
|
|
||
| **Begin Design Discussion:** 2026-04-07 | ||
|
|
||
| **Cilium Release:** 1.20 | ||
|
|
||
| **Authors:** Murat Parlakisik <parlakisik@gmail.com> | ||
|
|
||
| **Status:** Draft | ||
|
|
||
| ## Summary | ||
|
|
||
| Replace the static CLI flag-based VTEP configuration with a cluster-scoped | ||
| `CiliumVTEPConfig` CRD that supports dynamic updates, per-node assignment via | ||
| `nodeSelector`, and per-endpoint status reporting. This enables production use | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is "endpoint" here a "VTEP endpoint"? I would suggest not using shorthand, because endpoint is already overloaded (twice - k8s and cilium have established meanings for this word which are not fully aligned) |
||
| cases where operators need simple overlay | ||
| connectivity to external gateways without requiring BGP or L2 announcements. | ||
|
|
||
| ## Motivation | ||
|
|
||
| The VTEP integration has been in beta since its introduction in | ||
| [PR #17370](https://github.com/cilium/cilium/pull/17370) (Cilium 1.12). The | ||
| original design uses static CLI flags (`--vtep-endpoint`, `--vtep-cidr`, | ||
| `--vtep-mac`, `--vtep-mask`) baked into the Cilium ConfigMap at install time. | ||
| This has several operational problems: | ||
|
|
||
| 1. **Agent restarts required for any change.** Adding, removing, or modifying a | ||
| VTEP endpoint requires updating the ConfigMap and restarting every Cilium | ||
| agent in the cluster, causing datapath disruption. | ||
|
|
||
| 2. **Single mask for all CIDRs.** The `--vtep-mask` flag applies one prefix | ||
| length to every VTEP CIDR, preventing mixed prefix lengths (e.g., `/24` and | ||
| `/16` in the same cluster). | ||
|
|
||
| 3. **No per-node differentiation.** Every node gets the same VTEP | ||
| configuration. In multi-zone or multi-site deployments, different nodes | ||
| need to reach the same external CIDR via different VTEP gateways. | ||
|
|
||
| 4. **No operational visibility.** There is no way to determine whether a VTEP | ||
| endpoint is successfully programmed in the BPF map or if configuration | ||
| errors exist. | ||
|
|
||
| 5. **No CI/CD integration**. The feature is still in beta stage due to support and e2e test cases . | ||
|
|
||
| These limitations block adoption in environments where VTEP integration would | ||
| otherwise be the simplest and most natural connectivity solution. | ||
|
|
||
| ### The Case for Overlay-to-Gateway Simplicity | ||
|
|
||
| If user doesnt want to manage BGP or L2 annoutment to send traffic to some network via external gateway. | ||
| The VTEP approach offers a fundamentally simpler model: | ||
|
|
||
| Pods send traffic via the existing VXLAN overlay directly to an external | ||
| vtep endpoint. No BGP sessions to configure and maintain. No L2 announcement | ||
| policies. No route redistribution. The Cilium agent simply encapsulates | ||
| traffic destined for external CIDRs and sends it to a known VTEP endpoint. | ||
|
Comment on lines
+52
to
+58
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. BGP and L2 announcements features are about advertising addresses on connected networks in order to configure remote clusters how to transmit towards Cilium. This feature rather seems to be about how to configure Cilium in order to route traffic from Cilium. Are they really equivalent? |
||
|
|
||
|
|
||
|
|
||
| ## Goals | ||
|
|
||
| * Replace static CLI flags with a `CiliumVTEPConfig` CRD for VTEP | ||
| configuration | ||
| * Support dynamic add/update/remove of VTEP endpoints without agent restarts | ||
| * Enable per-node VTEP assignment via `nodeSelector` for multi-zone deployments | ||
| * Provide per-endpoint status reporting (synced, errors, last sync time) | ||
| * Support variable prefix lengths per VTEP CIDR (via BPF LPM Trie) | ||
| * Enable simple overlay-to-gateway connectivity without requiring BGP or L2 | ||
| announcement infrastructure | ||
|
|
||
| ## Non-Goals | ||
|
|
||
| * Multi-VNI support — the existing VNI=2 / world-identity model is preserved | ||
| * IPv6 VTEP endpoints — IPv4 only, consistent with current VTEP support | ||
| * Changes to the VXLAN encapsulation format or behavior | ||
|
|
||
|
|
||
| ## Proposal | ||
|
|
||
| ### Overview | ||
|
|
||
|  | ||
|
|
||
| Worker nodes can be independently | ||
| assigned to different VTEP endpoints using `nodeSelector`. The red and blue arrows represent traffic | ||
| from different `CiliumVTEPConfig` objects — each config targets a subset of | ||
| nodes via label selectors and directs their VXLAN-encapsulated traffic to | ||
| the appropriate external VTEP endpoint. This per-node assignment is what | ||
| enables multi-zone and multi-site deployments where each group of nodes has | ||
| its own local gateway. | ||
|
|
||
| The design introduces three components: | ||
|
|
||
| 1. **CiliumVTEPConfig CRD** — a cluster-scoped custom resource that declares | ||
| VTEP endpoints with optional node targeting | ||
| 2. **VTEPReconciler** — an agent-side controller that watches CRD events, | ||
| evaluates `nodeSelector`, and reconciles the BPF map | ||
| 3. **BPF LPM Trie map** — replaces the existing Hash map to support | ||
| variable-length prefix matching | ||
|
|
||
| ### CRD API | ||
|
|
||
| ```yaml | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I won't review the API now since there's plenty of other open questions to consider first, but the API will need to be reviewed. |
||
| apiVersion: cilium.io/v2 | ||
| kind: CiliumVTEPConfig | ||
| metadata: | ||
| name: zone-a | ||
| spec: | ||
| nodeSelector: # optional | ||
| matchLabels: | ||
| topology.kubernetes.io/zone: "zone-a" | ||
| endpoints: | ||
| - name: dc1-router | ||
| cidr: "10.1.1.0/24" | ||
| tunnelEndpoint: "10.169.72.236" | ||
| mac: "82:36:4c:98:2e:56" | ||
| - name: dc1-lb | ||
| cidr: "10.2.0.0/16" | ||
| tunnelEndpoint: "10.169.72.237" | ||
| mac: "aa:bb:cc:dd:ee:01" | ||
| ``` | ||
|
|
||
| **Key design decisions:** | ||
|
|
||
| | Decision | Rationale | | ||
| |---|---| | ||
| | Cluster-scoped (not namespaced) | VTEP config is infrastructure-level, managed by platform teams | | ||
| | `nodeSelector` on the config, not per-endpoint | Matches the physical topology model: a set of endpoints belongs to a site/zone | | ||
| | Max 8 endpoints per config | BPF map size constraint; multiple configs can target different node groups | | ||
| | `shortName: cvtep` | Quick operational access: `kubectl get cvtep` | | ||
| | Per-endpoint named entries with `+listType=map` | Enables strategic merge patch for individual endpoint updates | | ||
|
|
||
| **Validation:** | ||
|
|
||
| All fields use kubebuilder validation markers: | ||
| - `tunnelEndpoint`: IPv4 address regex | ||
| - `cidr`: IPv4 CIDR notation regex (e.g., `10.1.1.0/24`) | ||
| - `mac`: MAC address regex (colon-separated hex) | ||
| - `name`: DNS label format (lowercase alphanumeric, hyphens, 1-63 chars) | ||
| - `endpoints`: MinItems=1, MaxItems=8 | ||
|
|
||
| ### VTEPReconciler | ||
|
|
||
| The reconciler runs as a Hive cell in each Cilium agent | ||
|
|
||
|
|
||
| **Reconciliation flow:** | ||
|
|
||
| 1. On startup, the reconciler receives all `CiliumVTEPConfig` objects via | ||
| `resource.Resource[*CiliumVTEPConfig]`. | ||
|
|
||
| 2. For each event (upsert or delete), it re-evaluates which configs match the | ||
| local node's labels using `nodeSelector`. | ||
|
|
||
| 3. It computes the **desired state** — a map of normalized CIDR → endpoint | ||
| from all matching configs. | ||
|
|
||
| 4. It detects **CIDR conflicts** — if the same CIDR appears in multiple | ||
| matching configs, neither is applied and both configs receive an error | ||
| status. | ||
|
|
||
| 5. It diffs desired state against **last-applied state** and performs | ||
| incremental BPF map updates: | ||
| - New CIDRs → `UpdateEntry()` | ||
| - Changed endpoints → `UpdateEntry()` (overwrite) | ||
| - Removed CIDRs → `DeleteByCIDR()` | ||
|
|
||
| 6. It updates Linux routing table entries for VTEP CIDRs. | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This actually seems like the crux of the functionality proposed here (+ the encap/decap config). I wonder whether we really want to have a dedicated VTEP CRD for this or whether it's time to consider a more generic "routing" CRD (even if initially focused just on the VTEP use case). I had a draft for such an idea about three years ago, but it never got traction so I didn't end up posting it publicly. But if this is interesting, maybe I can dust it off. |
||
|
|
||
| 7. It writes per-endpoint status back to the CRD's `.status` subresource. | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As a general statement, we avoid putting logic into cilium-agent to update
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Probably this is a case for understanding the tradeoff with @cilium/sig-scalability , especially size of targeted environments, and then considering how we might gain the desired operational visibility without introducing scalability concerns. |
||
|
|
||
| **Node label change handling:** The reconciler watches the local node's labels. When labels change, it re-evaluates all | ||
| `nodeSelector` predicates and reconciles the BPF map accordingly. | ||
|
|
||
| **Benefits:** | ||
| - Each endpoint can use a different prefix length (`/16`, `/24`, `/25`, etc.) | ||
| - The BPF LPM trie performs longest-prefix-match automatically in the datapath | ||
| - Removes the `--vtep-mask` global setting entirely | ||
|
|
||
| ### Status Reporting | ||
|
|
||
| Each `CiliumVTEPConfig` object reports status via the `.status` subresource: | ||
|
|
||
| ```yaml | ||
| status: | ||
| endpointCount: 2 | ||
| conditions: | ||
| - type: Ready | ||
| status: "True" | ||
| lastTransitionTime: "2026-04-07T10:00:00Z" | ||
| reason: AllEndpointsSynced | ||
| message: "All 2 endpoints synced to BPF map" | ||
| endpointStatuses: | ||
| - name: dc1-router | ||
| synced: true | ||
| lastSyncTime: "2026-04-07T10:00:00Z" | ||
| - name: dc1-lb | ||
| synced: true | ||
| lastSyncTime: "2026-04-07T10:00:00Z" | ||
| ``` | ||
|
|
||
| Operators can monitor VTEP health at a glance: | ||
|
|
||
| ```shell | ||
| $ kubectl get cvtep | ||
| NAME ENDPOINTS READY AGE | ||
| zone-a 2 True 1h | ||
| zone-b 2 True 1h | ||
| ``` | ||
|
|
||
| ### Helm Integration | ||
|
|
||
| VTEP is enabled via Helm with a single toggle: | ||
|
|
||
| ```shell | ||
| helm upgrade cilium cilium/cilium \ | ||
| --namespace kube-system \ | ||
| --reuse-values \ | ||
| --set vtep.enabled=true | ||
| ``` | ||
|
|
||
| This registers the `CiliumVTEPConfig` CRD. VTEP endpoints are then configured | ||
| by applying CRD objects. | ||
|
|
||
| ### Per-Zone VTEP Endpoints: Worked Example | ||
|
|
||
| Two zones with different VTEP gateways for the same destination CIDR: | ||
|
|
||
| ```yaml | ||
| # Zone-A nodes route 10.200.0.0/16 via VTEP gateway 10.100.1.1 | ||
| apiVersion: cilium.io/v2 | ||
| kind: CiliumVTEPConfig | ||
| metadata: | ||
| name: zone-a | ||
| spec: | ||
| nodeSelector: | ||
| matchLabels: | ||
| topology.kubernetes.io/zone: "zone-a" | ||
| endpoints: | ||
| - name: gw-a | ||
| cidr: "10.200.0.0/16" | ||
| tunnelEndpoint: "10.100.1.1" | ||
| mac: "aa:bb:cc:00:01:01" | ||
| --- | ||
| # Zone-B nodes route 10.200.0.0/16 via VTEP gateway 10.100.2.1 | ||
| apiVersion: cilium.io/v2 | ||
| kind: CiliumVTEPConfig | ||
| metadata: | ||
| name: zone-b | ||
| spec: | ||
| nodeSelector: | ||
| matchLabels: | ||
| topology.kubernetes.io/zone: "zone-b" | ||
| endpoints: | ||
| - name: gw-b | ||
| cidr: "10.200.0.0/16" | ||
| tunnelEndpoint: "10.100.2.1" | ||
| mac: "aa:bb:cc:00:02:01" | ||
| ``` | ||
|
|
||
| Both zones route traffic to `10.200.0.0/16` but via their local VTEP gateway. | ||
| When a new zone comes online, the | ||
| operator applies a new `CiliumVTEPConfig` | ||
|
|
||
| ## Impacts / Key Questions | ||
|
|
||
| ### Impact: Existing VTEP Users | ||
|
|
||
| The CLI flags (`--vtep-endpoint`, `--vtep-cidr`, `--vtep-mac`, `--vtep-mask`, | ||
| `--vtep-sync-interval`) have been removed. Users must migrate to the CRD. | ||
| The migration is straightforward: each set of CLI flag values maps to one | ||
| `CiliumVTEPConfig` object with a single endpoint. | ||
|
|
||
| ### Impact: BPF Map Type Change | ||
|
|
||
| Changing from Hash to LPM Trie alters the map's behavior: | ||
|
|
||
| | Property | Hash | LPM Trie | | ||
| |---|---|---| | ||
| | Lookup semantics | Exact match | Longest prefix match | | ||
| | Key size | 4 bytes | 8 bytes (4 prefix + 4 IP) | | ||
| | Preallocation | Supported | Not supported (`BPF_F_NO_PREALLOC` required) | | ||
| | Max entries | 8 | 8 | | ||
|
|
||
| The LPM Trie is strictly more capable. Existing configurations with uniform | ||
| prefix lengths produce identical routing behavior. | ||
|
|
||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Key question: How will this integrate with the broader architecture? Ideally whatever we come up with, it has a path to eventually integrate with all other GA features where applicable. Specifically, consider how does this interface with masquerading, egress gateway, encryption? |
||
| ## Future Milestones | ||
|
|
||
| ### Graduating VTEP to GA | ||
|
|
||
| With CRD-based management, status reporting, and CI conformance tests, the | ||
| VTEP feature has a clearer path to GA status. The CRD API provides the | ||
| validation, observability, and operational tooling expected of a GA feature. | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have two high level thoughts, related, which I wonder about for the overall architecture. I don't have a strong opinion on these approaches, but they seem within the possible design space, so they're worth considering: