|
| 1 | +# CFP-44188: CiliumVTEPConfig CRD for Dynamic VTEP Management |
| 2 | + |
| 3 | +**SIG:** SIG-Datapath ([View all current SIGs](https://docs.cilium.io/en/stable/community/community/#all-sigs)) |
| 4 | + |
| 5 | +**Begin Design Discussion:** 2026-04-07 |
| 6 | + |
| 7 | +**Cilium Release:** 1.20 |
| 8 | + |
| 9 | +**Authors:** Murat Parlakisik <parlakisik@gmail.com> |
| 10 | + |
| 11 | +**Status:** Draft |
| 12 | + |
| 13 | +## Summary |
| 14 | + |
| 15 | +Replace the static CLI flag-based VTEP configuration with a cluster-scoped |
| 16 | +`CiliumVTEPConfig` CRD that supports dynamic updates, per-node assignment via |
| 17 | +`nodeSelector`, and per-endpoint status reporting. This enables production use |
| 18 | +cases where operators need simple overlay |
| 19 | +connectivity to external gateways without requiring BGP or L2 announcements. |
| 20 | + |
| 21 | +## Motivation |
| 22 | + |
| 23 | +The VTEP integration has been in beta since its introduction in |
| 24 | +[PR #17370](https://github.com/cilium/cilium/pull/17370) (Cilium 1.12). The |
| 25 | +original design uses static CLI flags (`--vtep-endpoint`, `--vtep-cidr`, |
| 26 | +`--vtep-mac`, `--vtep-mask`) baked into the Cilium ConfigMap at install time. |
| 27 | +This has several operational problems: |
| 28 | + |
| 29 | +1. **Agent restarts required for any change.** Adding, removing, or modifying a |
| 30 | + VTEP endpoint requires updating the ConfigMap and restarting every Cilium |
| 31 | + agent in the cluster, causing datapath disruption. |
| 32 | + |
| 33 | +2. **Single mask for all CIDRs.** The `--vtep-mask` flag applies one prefix |
| 34 | + length to every VTEP CIDR, preventing mixed prefix lengths (e.g., `/24` and |
| 35 | + `/16` in the same cluster). |
| 36 | + |
| 37 | +3. **No per-node differentiation.** Every node gets the same VTEP |
| 38 | + configuration. In multi-zone or multi-site deployments, different nodes |
| 39 | + need to reach the same external CIDR via different VTEP gateways. |
| 40 | + |
| 41 | +4. **No operational visibility.** There is no way to determine whether a VTEP |
| 42 | + endpoint is successfully programmed in the BPF map or if configuration |
| 43 | + errors exist. |
| 44 | + |
| 45 | +5. **No CI/CD integration.** The feature remains in beta, in part because it lacks dedicated support and end-to-end (e2e) test cases.
| 46 | + |
| 47 | +These limitations block adoption in environments where VTEP integration would |
| 48 | +otherwise be the simplest and most natural connectivity solution. |
| 49 | + |
| 50 | +### The Case for Overlay-to-Gateway Simplicity |
| 51 | + |
| 52 | +Some users do not want to manage BGP or L2 announcements just to send traffic to a network behind an external gateway.
| 53 | +For them, the VTEP approach offers a fundamentally simpler model:
| 54 | + |
| 55 | +Pods send traffic via the existing VXLAN overlay directly to an external |
| 56 | +VTEP endpoint. No BGP sessions to configure and maintain. No L2 announcement
| 57 | +policies. No route redistribution. The Cilium agent simply encapsulates |
| 58 | +traffic destined for external CIDRs and sends it to a known VTEP endpoint. |
| 59 | + |
| 60 | + |
| 61 | + |
| 62 | +## Goals |
| 63 | + |
| 64 | +* Replace static CLI flags with a `CiliumVTEPConfig` CRD for VTEP |
| 65 | + configuration |
| 66 | +* Support dynamic add/update/remove of VTEP endpoints without agent restarts |
| 67 | +* Enable per-node VTEP assignment via `nodeSelector` for multi-zone deployments |
| 68 | +* Provide per-endpoint status reporting (synced, errors, last sync time) |
| 69 | +* Support variable prefix lengths per VTEP CIDR (via BPF LPM Trie) |
| 70 | +* Enable simple overlay-to-gateway connectivity without requiring BGP or L2 |
| 71 | + announcement infrastructure |
| 72 | + |
| 73 | +## Non-Goals |
| 74 | + |
| 75 | +* Multi-VNI support — the existing VNI=2 / world-identity model is preserved |
| 76 | +* IPv6 VTEP endpoints — IPv4 only, consistent with current VTEP support |
| 77 | +* Changes to the VXLAN encapsulation format or behavior |
| 78 | + |
| 79 | + |
| 80 | +## Proposal |
| 81 | + |
| 82 | +### Overview |
| 83 | + |
| 84 | + |
| 85 | + |
| 86 | +Worker nodes can be independently assigned to different VTEP endpoints
| 87 | +using `nodeSelector`. In the overview diagram, the red and blue arrows
| 88 | +represent traffic from different `CiliumVTEPConfig` objects: each config
| 89 | +targets a subset of nodes via label selectors and directs their
| 90 | +VXLAN-encapsulated traffic to the appropriate external VTEP endpoint. This
| 91 | +per-node assignment is what enables multi-zone and multi-site deployments
| 92 | +where each group of nodes has its own local gateway.
| 93 | + |
| 94 | +The design introduces three components: |
| 95 | + |
| 96 | +1. **CiliumVTEPConfig CRD** — a cluster-scoped custom resource that declares |
| 97 | + VTEP endpoints with optional node targeting |
| 98 | +2. **VTEPReconciler** — an agent-side controller that watches CRD events, |
| 99 | + evaluates `nodeSelector`, and reconciles the BPF map |
| 100 | +3. **BPF LPM Trie map** — replaces the existing Hash map to support |
| 101 | + variable-length prefix matching |
| 102 | + |
| 103 | +### CRD API |
| 104 | + |
| 105 | +```yaml |
| 106 | +apiVersion: cilium.io/v2 |
| 107 | +kind: CiliumVTEPConfig |
| 108 | +metadata: |
| 109 | +  name: zone-a
| 110 | +spec:
| 111 | +  nodeSelector:            # optional
| 112 | +    matchLabels:
| 113 | +      topology.kubernetes.io/zone: "zone-a"
| 114 | +  endpoints:
| 115 | +  - name: dc1-router
| 116 | +    cidr: "10.1.1.0/24"
| 117 | +    tunnelEndpoint: "10.169.72.236"
| 118 | +    mac: "82:36:4c:98:2e:56"
| 119 | +  - name: dc1-lb
| 120 | +    cidr: "10.2.0.0/16"
| 121 | +    tunnelEndpoint: "10.169.72.237"
| 122 | +    mac: "aa:bb:cc:dd:ee:01"
| 123 | +``` |
| 124 | +
|
| 125 | +**Key design decisions:** |
| 126 | +
|
| 127 | +| Decision | Rationale | |
| 128 | +|---|---| |
| 129 | +| Cluster-scoped (not namespaced) | VTEP config is infrastructure-level, managed by platform teams | |
| 130 | +| `nodeSelector` on the config, not per-endpoint | Matches the physical topology model: a set of endpoints belongs to a site/zone | |
| 131 | +| Max 8 endpoints per config | BPF map size constraint; multiple configs can target different node groups | |
| 132 | +| `shortName: cvtep` | Quick operational access: `kubectl get cvtep` | |
| 133 | +| Per-endpoint named entries with `+listType=map` | Enables strategic merge patch for individual endpoint updates | |
| 134 | + |
| 135 | +**Validation:** |
| 136 | + |
| 137 | +All fields use kubebuilder validation markers: |
| 138 | +- `tunnelEndpoint`: IPv4 address regex |
| 139 | +- `cidr`: IPv4 CIDR notation regex (e.g., `10.1.1.0/24`) |
| 140 | +- `mac`: MAC address regex (colon-separated hex) |
| 141 | +- `name`: DNS label format (lowercase alphanumeric, hyphens, 1-63 chars) |
| 142 | +- `endpoints`: MinItems=1, MaxItems=8 |
| 143 | + |
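The validation rules above can be approximated outside the API server, for example in a pre-commit check. The following is a minimal Python sketch, not the kubebuilder markers themselves: the function names and the exact regular expressions are assumptions made for illustration, and the real CRD patterns may differ.

```python
import ipaddress
import re

# Illustrative patterns; the actual kubebuilder regexes are not given in this proposal.
MAC_RE = re.compile(r"([0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}")
NAME_RE = re.compile(r"[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?")
MAX_ENDPOINTS = 8


def is_ipv4_cidr(value):
    """Accept only 'a.b.c.d/n' notation (no bare addresses, no netmask form)."""
    _, sep, prefix = value.partition("/")
    if not sep or not prefix.isdigit():
        return False
    try:
        ipaddress.IPv4Network(value, strict=False)
    except ValueError:
        return False
    return True


def validate_endpoint(ep):
    """Return a list of human-readable errors for one endpoint entry."""
    errors = []
    if not NAME_RE.fullmatch(ep.get("name", "")):
        errors.append("name: must be a DNS label (1-63 chars)")
    try:
        ipaddress.IPv4Address(ep.get("tunnelEndpoint", ""))
    except ValueError:
        errors.append("tunnelEndpoint: must be an IPv4 address")
    if not is_ipv4_cidr(ep.get("cidr", "")):
        errors.append("cidr: must be IPv4 CIDR notation")
    if not MAC_RE.fullmatch(ep.get("mac", "")):
        errors.append("mac: must be colon-separated hex")
    return errors


def validate_endpoints(endpoints):
    """Apply the MinItems/MaxItems bounds, then validate each entry."""
    errors = []
    if not 1 <= len(endpoints) <= MAX_ENDPOINTS:
        errors.append(f"endpoints: need 1 to {MAX_ENDPOINTS} items")
    for i, ep in enumerate(endpoints):
        errors.extend(f"endpoints[{i}].{e}" for e in validate_endpoint(ep))
    return errors
```

A check like this only mirrors the schema; the API server remains the authority on what is accepted.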
| 144 | +### VTEPReconciler |
| 145 | + |
| 146 | +The reconciler runs as a Hive cell in each Cilium agent.
| 147 | + |
| 148 | + |
| 149 | +**Reconciliation flow:** |
| 150 | + |
| 151 | +1. On startup, the reconciler receives all `CiliumVTEPConfig` objects via |
| 152 | + `resource.Resource[*CiliumVTEPConfig]`. |
| 153 | + |
| 154 | +2. For each event (upsert or delete), it re-evaluates which configs match the |
| 155 | + local node's labels using `nodeSelector`. |
| 156 | + |
| 157 | +3. It computes the **desired state** — a map of normalized CIDR → endpoint |
| 158 | + from all matching configs. |
| 159 | + |
| 160 | +4. It detects **CIDR conflicts**: if the same CIDR appears in more than one
| 161 | + matching config, none of the conflicting entries is applied and each of
| 162 | + those configs receives an error status.
| 163 | + |
| 164 | +5. It diffs desired state against **last-applied state** and performs |
| 165 | + incremental BPF map updates: |
| 166 | + - New CIDRs → `UpdateEntry()` |
| 167 | + - Changed endpoints → `UpdateEntry()` (overwrite) |
| 168 | + - Removed CIDRs → `DeleteByCIDR()` |
| 169 | + |
| 170 | +6. It updates Linux routing table entries for VTEP CIDRs. |
| 171 | + |
| 172 | +7. It writes per-endpoint status back to the CRD's `.status` subresource. |
| 173 | + |
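Steps 2 through 5 of the flow above can be sketched in a few lines. This is a Python illustration of the logic only, not the agent's Go implementation. The function names are hypothetical, only `matchLabels` is handled, and treating a missing `nodeSelector` as "match every node" is an assumption, since this proposal does not spell out that case.

```python
import ipaddress

ZONE = "topology.kubernetes.io/zone"


def selector_matches(selector, node_labels):
    # Assumption: a missing selector matches all nodes; only matchLabels is handled.
    wanted = (selector or {}).get("matchLabels", {})
    return all(node_labels.get(k) == v for k, v in wanted.items())


def desired_state(configs, node_labels):
    """Return (desired, conflicts) for the local node.

    desired:   normalized CIDR -> (tunnelEndpoint, mac)
    conflicts: normalized CIDR -> sorted names of the configs that claim it
    """
    claims = {}
    for cfg in configs:
        spec = cfg["spec"]
        if not selector_matches(spec.get("nodeSelector"), node_labels):
            continue
        for ep in spec["endpoints"]:
            cidr = str(ipaddress.IPv4Network(ep["cidr"], strict=False))
            value = (ep["tunnelEndpoint"], ep["mac"])
            claims.setdefault(cidr, []).append((cfg["metadata"]["name"], value))

    desired, conflicts = {}, {}
    for cidr, owners in claims.items():
        names = sorted({name for name, _ in owners})
        if len(names) > 1:
            conflicts[cidr] = names  # none of the conflicting entries is applied
        else:
            desired[cidr] = owners[0][1]
    return desired, conflicts


def diff(desired, applied):
    """Return (upserts, deletes) needed to move the map from applied to desired."""
    upserts = {c: v for c, v in desired.items() if applied.get(c) != v}
    deletes = sorted(c for c in applied if c not in desired)
    return upserts, deletes


def make_config(name, zone, tunnel, mac):
    # Helper for the example below: one endpoint for 10.200.0.0/16 per zone.
    return {
        "metadata": {"name": name},
        "spec": {
            "nodeSelector": {"matchLabels": {ZONE: zone}},
            "endpoints": [{"name": "gw", "cidr": "10.200.0.0/16",
                           "tunnelEndpoint": tunnel, "mac": mac}],
        },
    }


CONFIGS = [
    make_config("zone-a", "zone-a", "10.100.1.1", "aa:bb:cc:00:01:01"),
    make_config("zone-b", "zone-b", "10.100.2.1", "aa:bb:cc:00:02:01"),
]

# A node in zone-a only sees the zone-a gateway, so the shared CIDR does not conflict.
desired, conflicts = desired_state(CONFIGS, {ZONE: "zone-a"})
upserts, deletes = diff(desired, applied={})
```

In this sketch, `upserts` would correspond to `UpdateEntry()` calls and `deletes` to `DeleteByCIDR()` calls. Keeping the last-applied state in memory is what makes the update incremental rather than a full map rewrite.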
| 174 | +**Node label change handling:** The reconciler watches the local node's labels. When labels change, it re-evaluates all |
| 175 | +`nodeSelector` predicates and reconciles the BPF map accordingly. |
| 176 | + |
| 177 | +**Benefits of the LPM Trie map:**
| 178 | +- Each endpoint can use a different prefix length (`/16`, `/24`, `/25`, etc.) |
| 179 | +- The BPF LPM trie performs longest-prefix-match automatically in the datapath |
| 180 | +- Removes the `--vtep-mask` global setting entirely |
| 181 | + |
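To make the longest-prefix-match behavior concrete, here is a small Python model of the lookup that the BPF LPM trie performs in the datapath. It is a linear scan for illustration only, and the `10.2.5.0/24` entry is a hypothetical addition used to show a more specific prefix overriding a broader one.

```python
import ipaddress


def lpm_lookup(entries, dst):
    """Return the tunnel endpoint of the most specific CIDR containing dst, or None."""
    addr = ipaddress.IPv4Address(dst)
    best = None
    for cidr, tunnel_endpoint in entries.items():
        net = ipaddress.IPv4Network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, tunnel_endpoint)
    return best[1] if best else None


ENTRIES = {
    "10.2.0.0/16": "10.169.72.237",  # broad prefix
    "10.2.5.0/24": "10.169.72.236",  # hypothetical, more specific prefix
}

specific = lpm_lookup(ENTRIES, "10.2.5.9")   # falls in both; the /24 wins
broad = lpm_lookup(ENTRIES, "10.2.9.9")      # only the /16 matches
miss = lpm_lookup(ENTRIES, "192.0.2.1")      # no VTEP entry applies
```

With the old Hash map and a single global mask, the two entries above could not coexist, because every lookup used the same prefix length.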
| 182 | +### Status Reporting |
| 183 | + |
| 184 | +Each `CiliumVTEPConfig` object reports status via the `.status` subresource: |
| 185 | + |
| 186 | +```yaml |
| 187 | +status: |
| 188 | +  endpointCount: 2
| 189 | +  conditions:
| 190 | +  - type: Ready
| 191 | +    status: "True"
| 192 | +    lastTransitionTime: "2026-04-07T10:00:00Z"
| 193 | +    reason: AllEndpointsSynced
| 194 | +    message: "All 2 endpoints synced to BPF map"
| 195 | +  endpointStatuses:
| 196 | +  - name: dc1-router
| 197 | +    synced: true
| 198 | +    lastSyncTime: "2026-04-07T10:00:00Z"
| 199 | +  - name: dc1-lb
| 200 | +    synced: true
| 201 | +    lastSyncTime: "2026-04-07T10:00:00Z"
| 202 | +``` |
| 203 | + |
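As a rough illustration of how the `Ready` condition could be derived from per-endpoint results, consider the Python sketch below. The field names in the success case follow the example above. The `error` field, the `EndpointSyncFailed` reason, and the failure message are hypothetical placeholders, and a real controller would update `lastTransitionTime` only when the condition actually changes.

```python
from datetime import datetime, timezone


def build_status(results, now=None):
    """Build a status dict from a list of (endpoint name, error or None) pairs."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")

    statuses = []
    for name, err in results:
        entry = {"name": name, "synced": err is None}
        if err is None:
            entry["lastSyncTime"] = stamp
        else:
            entry["error"] = err  # hypothetical field name
        statuses.append(entry)

    failed = [s for s in statuses if not s["synced"]]
    if failed:
        reason = "EndpointSyncFailed"  # hypothetical reason string
        message = f"{len(failed)} of {len(statuses)} endpoints failed to sync"
    else:
        reason = "AllEndpointsSynced"
        message = f"All {len(statuses)} endpoints synced to BPF map"

    condition = {
        "type": "Ready",
        "status": "False" if failed else "True",
        # Simplification: stamped on every call rather than only on transitions.
        "lastTransitionTime": stamp,
        "reason": reason,
        "message": message,
    }
    return {
        "endpointCount": len(statuses),
        "conditions": [condition],
        "endpointStatuses": statuses,
    }
```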
| 204 | +Operators can monitor VTEP health at a glance: |
| 205 | + |
| 206 | +```shell |
| 207 | +$ kubectl get cvtep |
| 208 | +NAME     ENDPOINTS   READY   AGE
| 209 | +zone-a   2           True    1h
| 210 | +zone-b   2           True    1h
| 211 | +``` |
| 212 | + |
| 213 | +### Helm Integration |
| 214 | + |
| 215 | +VTEP is enabled via Helm with a single toggle: |
| 216 | + |
| 217 | +```shell |
| 218 | +helm upgrade cilium cilium/cilium \ |
| 219 | +  --namespace kube-system \
| 220 | +  --reuse-values \
| 221 | +  --set vtep.enabled=true
| 222 | +``` |
| 223 | + |
| 224 | +This registers the `CiliumVTEPConfig` CRD. VTEP endpoints are then configured |
| 225 | +by applying CRD objects. |
| 226 | + |
| 227 | +### Per-Zone VTEP Endpoints: Worked Example |
| 228 | + |
| 229 | +Two zones with different VTEP gateways for the same destination CIDR: |
| 230 | + |
| 231 | +```yaml |
| 232 | +# Zone-A nodes route 10.200.0.0/16 via VTEP gateway 10.100.1.1 |
| 233 | +apiVersion: cilium.io/v2 |
| 234 | +kind: CiliumVTEPConfig |
| 235 | +metadata: |
| 236 | +  name: zone-a
| 237 | +spec:
| 238 | +  nodeSelector:
| 239 | +    matchLabels:
| 240 | +      topology.kubernetes.io/zone: "zone-a"
| 241 | +  endpoints:
| 242 | +  - name: gw-a
| 243 | +    cidr: "10.200.0.0/16"
| 244 | +    tunnelEndpoint: "10.100.1.1"
| 245 | +    mac: "aa:bb:cc:00:01:01"
| 246 | +---
| 247 | +# Zone-B nodes route 10.200.0.0/16 via VTEP gateway 10.100.2.1
| 248 | +apiVersion: cilium.io/v2
| 249 | +kind: CiliumVTEPConfig
| 250 | +metadata:
| 251 | +  name: zone-b
| 252 | +spec:
| 253 | +  nodeSelector:
| 254 | +    matchLabels:
| 255 | +      topology.kubernetes.io/zone: "zone-b"
| 256 | +  endpoints:
| 257 | +  - name: gw-b
| 258 | +    cidr: "10.200.0.0/16"
| 259 | +    tunnelEndpoint: "10.100.2.1"
| 260 | +    mac: "aa:bb:cc:00:02:01"
| 261 | +``` |
| 262 | + |
| 263 | +Both zones route traffic to `10.200.0.0/16`, each via its local VTEP gateway.
| 264 | +When a new zone comes online, the operator applies a new `CiliumVTEPConfig`
| 265 | +whose `nodeSelector` matches that zone's nodes.
| 266 | + |
| 267 | +## Impacts / Key Questions |
| 268 | + |
| 269 | +### Impact: Existing VTEP Users |
| 270 | + |
| 271 | +The CLI flags (`--vtep-endpoint`, `--vtep-cidr`, `--vtep-mac`, `--vtep-mask`, |
| 272 | +`--vtep-sync-interval`) have been removed. Users must migrate to the CRD. |
| 273 | +The migration is straightforward: each set of CLI flag values maps to one |
| 274 | +`CiliumVTEPConfig` object with a single endpoint. |
| 275 | + |
| 276 | +### Impact: BPF Map Type Change |
| 277 | + |
| 278 | +Changing from Hash to LPM Trie alters the map's behavior: |
| 279 | + |
| 280 | +| Property | Hash | LPM Trie | |
| 281 | +|---|---|---| |
| 282 | +| Lookup semantics | Exact match | Longest prefix match | |
| 283 | +| Key size | 4 bytes | 8 bytes (4-byte prefix length + 4-byte IPv4 address) |
| 284 | +| Preallocation | Supported | Not supported (`BPF_F_NO_PREALLOC` required) | |
| 285 | +| Max entries | 8 | 8 | |
| 286 | + |
| 287 | +The LPM Trie is strictly more capable. Existing configurations with uniform |
| 288 | +prefix lengths produce identical routing behavior. |
| 289 | + |
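The 8-byte key in the table can be pictured with the Python sketch below. It assumes the generic kernel `bpf_lpm_trie_key` layout: a 32-bit prefix length in host byte order, followed by the address bytes in network byte order. The exact key struct used for the VTEP map is not specified in this proposal, so treat this as an illustration of the size and ordering only.

```python
import ipaddress
import struct


def lpm_key(cidr):
    """Pack an IPv4 CIDR as <u32 prefixlen (host order)><4 address bytes (network order)>."""
    net = ipaddress.IPv4Network(cidr, strict=False)
    # "=" selects native byte order with standard sizes and no padding.
    return struct.pack("=I4s", net.prefixlen, net.network_address.packed)


key = lpm_key("10.1.1.0/24")
prefix_len = struct.unpack("=I", key[:4])[0]
address_bytes = key[4:]
```

The old Hash map key carried only the 4 address bytes, which is why a single global mask had to be applied before every lookup.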
| 290 | +## Future Milestones |
| 291 | + |
| 292 | +### Graduating VTEP to GA |
| 293 | + |
| 294 | +With CRD-based management, status reporting, and CI conformance tests, the |
| 295 | +VTEP feature has a clearer path to GA status. The CRD API provides the |
| 296 | +validation, observability, and operational tooling expected of a GA feature. |