Commit d0c2ee8

CFP-44188: Vtep Improvements with CRD
Replace the static CLI flag-based VTEP configuration with a cluster-scoped CiliumVTEPConfig CRD that supports dynamic updates, per-node assignment via nodeSelector, and per-endpoint status reporting. Signed-off-by: Murat Parlakisik <parlakisik@gmail.com>
1 parent 3abd976 commit d0c2ee8

2 files changed

Lines changed: 296 additions & 0 deletions

@@ -0,0 +1,296 @@
1+
# CFP-44188: CiliumVTEPConfig CRD for Dynamic VTEP Management
2+
3+
**SIG:** SIG-Datapath ([View all current SIGs](https://docs.cilium.io/en/stable/community/community/#all-sigs))
4+
5+
**Begin Design Discussion:** 2026-04-07
6+
7+
**Cilium Release:** 1.20
8+
9+
**Authors:** Murat Parlakisik <parlakisik@gmail.com>
10+
11+
**Status:** Draft
12+
13+
## Summary
14+
15+
Replace the static CLI flag-based VTEP configuration with a cluster-scoped
16+
`CiliumVTEPConfig` CRD that supports dynamic updates, per-node assignment via
17+
`nodeSelector`, and per-endpoint status reporting. This enables production use
18+
cases where operators need simple overlay
19+
connectivity to external gateways without requiring BGP or L2 announcements.
20+
21+
## Motivation
22+
23+
The VTEP integration has been in beta since its introduction in
24+
[PR #17370](https://github.com/cilium/cilium/pull/17370) (Cilium 1.12). The
25+
original design uses static CLI flags (`--vtep-endpoint`, `--vtep-cidr`,
26+
`--vtep-mac`, `--vtep-mask`) baked into the Cilium ConfigMap at install time.
27+
This has several operational problems:
28+
29+
1. **Agent restarts required for any change.** Adding, removing, or modifying a
30+
VTEP endpoint requires updating the ConfigMap and restarting every Cilium
31+
agent in the cluster, causing datapath disruption.
32+
33+
2. **Single mask for all CIDRs.** The `--vtep-mask` flag applies one prefix
34+
length to every VTEP CIDR, preventing mixed prefix lengths (e.g., `/24` and
35+
`/16` in the same cluster).
36+
37+
3. **No per-node differentiation.** Every node gets the same VTEP
38+
configuration. In multi-zone or multi-site deployments, different nodes
39+
need to reach the same external CIDR via different VTEP gateways.
40+
41+
4. **No operational visibility.** There is no way to determine whether a VTEP
42+
endpoint is successfully programmed in the BPF map or if configuration
43+
errors exist.
44+
45+
5. **No CI/CD integration.** The feature is still in beta, in part because it lacks support and e2e test coverage.
46+
47+
These limitations block adoption in environments where VTEP integration would
48+
otherwise be the simplest and most natural connectivity solution.
49+
50+
### The Case for Overlay-to-Gateway Simplicity
51+
52+
Some users do not want to manage BGP or L2 announcements just to send traffic to an external network through a gateway.
53+
The VTEP approach offers a fundamentally simpler model:
54+
55+
Pods send traffic via the existing VXLAN overlay directly to an external
56+
VTEP endpoint. No BGP sessions to configure and maintain. No L2 announcement
57+
policies. No route redistribution. The Cilium agent simply encapsulates
58+
traffic destined for external CIDRs and sends it to a known VTEP endpoint.
59+
60+
61+
62+
## Goals
63+
64+
* Replace static CLI flags with a `CiliumVTEPConfig` CRD for VTEP
65+
configuration
66+
* Support dynamic add/update/remove of VTEP endpoints without agent restarts
67+
* Enable per-node VTEP assignment via `nodeSelector` for multi-zone deployments
68+
* Provide per-endpoint status reporting (synced, errors, last sync time)
69+
* Support variable prefix lengths per VTEP CIDR (via BPF LPM Trie)
70+
* Enable simple overlay-to-gateway connectivity without requiring BGP or L2
71+
announcement infrastructure
72+
73+
## Non-Goals
74+
75+
* Multi-VNI support — the existing VNI=2 / world-identity model is preserved
76+
* IPv6 VTEP endpoints — IPv4 only, consistent with current VTEP support
77+
* Changes to the VXLAN encapsulation format or behavior
78+
79+
80+
## Proposal
81+
82+
### Overview
83+
84+
![VTEP Architecture](../cilium/images/CFP-44188-vtep-connectivity.png)
85+
86+
Worker nodes can be independently
87+
assigned to different VTEP endpoints using `nodeSelector`. The red and blue arrows represent traffic
88+
from different `CiliumVTEPConfig` objects — each config targets a subset of
89+
nodes via label selectors and directs their VXLAN-encapsulated traffic to
90+
the appropriate external VTEP endpoint. This per-node assignment is what
91+
enables multi-zone and multi-site deployments where each group of nodes has
92+
its own local gateway.
93+
94+
The design introduces three components:
95+
96+
1. **CiliumVTEPConfig CRD** — a cluster-scoped custom resource that declares
97+
VTEP endpoints with optional node targeting
98+
2. **VTEPReconciler** — an agent-side controller that watches CRD events,
99+
evaluates `nodeSelector`, and reconciles the BPF map
100+
3. **BPF LPM Trie map** — replaces the existing Hash map to support
101+
variable-length prefix matching
102+
103+
### CRD API
104+
105+
```yaml
apiVersion: cilium.io/v2
kind: CiliumVTEPConfig
metadata:
  name: zone-a
spec:
  nodeSelector:        # optional
    matchLabels:
      topology.kubernetes.io/zone: "zone-a"
  endpoints:
    - name: dc1-router
      cidr: "10.1.1.0/24"
      tunnelEndpoint: "10.169.72.236"
      mac: "82:36:4c:98:2e:56"
    - name: dc1-lb
      cidr: "10.2.0.0/16"
      tunnelEndpoint: "10.169.72.237"
      mac: "aa:bb:cc:dd:ee:01"
```
124+
125+
**Key design decisions:**
126+
127+
| Decision | Rationale |
|---|---|
| Cluster-scoped (not namespaced) | VTEP config is infrastructure-level, managed by platform teams |
| `nodeSelector` on the config, not per-endpoint | Matches the physical topology model: a set of endpoints belongs to a site/zone |
| Max 8 endpoints per config | BPF map size constraint; multiple configs can target different node groups |
| `shortName: cvtep` | Quick operational access: `kubectl get cvtep` |
| Per-endpoint named entries with `+listType=map` | Enables strategic merge patch for individual endpoint updates |
134+
135+
**Validation:**
136+
137+
All fields use kubebuilder validation markers:
138+
- `tunnelEndpoint`: IPv4 address regex
139+
- `cidr`: IPv4 CIDR notation regex (e.g., `10.1.1.0/24`)
140+
- `mac`: MAC address regex (colon-separated hex)
141+
- `name`: DNS label format (lowercase alphanumeric, hyphens, 1-63 chars)
142+
- `endpoints`: MinItems=1, MaxItems=8
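
The validation rules above can also be approximated outside the API server, for example in a pre-merge check on manifests. The following Python sketch is illustrative only: the function names `validate_endpoint` and `validate_config` and both regexes are assumptions, not the kubebuilder markers shipped with the CRD.

```python
import ipaddress
import re

# Hypothetical patterns; the CRD's actual kubebuilder regexes may differ.
MAC_RE = re.compile(r"([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}")
NAME_RE = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def validate_endpoint(ep: dict) -> list[str]:
    """Return a list of problems found in one entry of spec.endpoints."""
    errors = []
    if not NAME_RE.fullmatch(ep.get("name", "")):
        errors.append("name: must be a DNS label (1-63 chars)")
    try:
        ipaddress.IPv4Address(ep.get("tunnelEndpoint", ""))
    except ValueError:
        errors.append("tunnelEndpoint: must be an IPv4 address")
    cidr = ep.get("cidr", "")
    try:
        if "/" not in cidr:
            raise ValueError("missing prefix length")
        # strict=False tolerates host bits; normalization is a later step.
        ipaddress.IPv4Network(cidr, strict=False)
    except ValueError:
        errors.append("cidr: must be IPv4 CIDR notation")
    if not MAC_RE.fullmatch(ep.get("mac", "")):
        errors.append("mac: must be colon-separated hex")
    return errors


def validate_config(endpoints: list[dict]) -> list[str]:
    """Apply the MinItems=1 / MaxItems=8 rule plus per-endpoint checks."""
    errors = []
    if not 1 <= len(endpoints) <= 8:
        errors.append("endpoints: must contain between 1 and 8 items")
    for ep in endpoints:
        errors += [f"{ep.get('name', '?')}: {e}" for e in validate_endpoint(ep)]
    return errors
```

A client-side check like this only gives earlier feedback; the API server's schema validation remains authoritative.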
143+
144+
### VTEPReconciler
145+
146+
The reconciler runs as a Hive cell in each Cilium agent.
147+
148+
149+
**Reconciliation flow:**
150+
151+
1. On startup, the reconciler receives all `CiliumVTEPConfig` objects via
152+
`resource.Resource[*CiliumVTEPConfig]`.
153+
154+
2. For each event (upsert or delete), it re-evaluates which configs match the
155+
local node's labels using `nodeSelector`.
156+
157+
3. It computes the **desired state** — a map of normalized CIDR → endpoint
158+
from all matching configs.
159+
160+
4. It detects **CIDR conflicts** — if the same CIDR appears in multiple
161+
matching configs, neither is applied and both configs receive an error
162+
status.
163+
164+
5. It diffs desired state against **last-applied state** and performs
165+
incremental BPF map updates:
166+
- New CIDRs → `UpdateEntry()`
167+
- Changed endpoints → `UpdateEntry()` (overwrite)
168+
- Removed CIDRs → `DeleteByCIDR()`
169+
170+
6. It updates Linux routing table entries for VTEP CIDRs.
171+
172+
7. It writes per-endpoint status back to the CRD's `.status` subresource.
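
Steps 3 to 5 of the flow above can be sketched in a few lines. This is a Python illustration of the logic, not the agent's Go code: the names `desired_state` and `diff_state` are hypothetical, the input is assumed to be the configs that already matched the local node, and duplicate CIDRs inside a single config are not handled.

```python
import ipaddress


def desired_state(matching_configs: dict[str, list[dict]]):
    """Build normalized CIDR -> (tunnelEndpoint, mac) from matching configs.

    A CIDR claimed by more than one config is dropped from the result and
    reported in `conflicts` (CIDR -> sorted config names) instead.
    """
    owners: dict[str, set[str]] = {}
    entries: dict[str, tuple[str, str]] = {}
    for cfg_name, endpoints in matching_configs.items():
        for ep in endpoints:
            # Normalize so that 10.1.1.5/24 and 10.1.1.0/24 compare equal.
            cidr = str(ipaddress.IPv4Network(ep["cidr"], strict=False))
            owners.setdefault(cidr, set()).add(cfg_name)
            entries[cidr] = (ep["tunnelEndpoint"], ep["mac"])
    conflicts = {c: sorted(o) for c, o in owners.items() if len(o) > 1}
    desired = {c: e for c, e in entries.items() if c not in conflicts}
    return desired, conflicts


def diff_state(desired: dict, last_applied: dict):
    """Return (upserts, deletes) needed to move the map to `desired`."""
    upserts = {c: e for c, e in desired.items() if last_applied.get(c) != e}
    deletes = [c for c in last_applied if c not in desired]
    return upserts, deletes
```

In this model, each entry in `upserts` would correspond to one `UpdateEntry()` call and each entry in `deletes` to one `DeleteByCIDR()` call, so an unchanged endpoint causes no map write.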
173+
174+
**Node label change handling:** The reconciler watches the local node's labels. When labels change, it re-evaluates all
175+
`nodeSelector` predicates and reconciles the BPF map accordingly.
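
A minimal Python sketch of the selector evaluation follows. It covers `matchLabels` only, and it assumes that a config without a `nodeSelector` applies to every node, which the text implies by marking the field optional but does not state. The names `selector_matches` and `configs_for_node` are hypothetical.

```python
def selector_matches(node_selector, node_labels: dict) -> bool:
    """Evaluate a matchLabels-only selector against a node's labels."""
    if not node_selector:
        return True  # assumption: no selector means "all nodes"
    wanted = node_selector.get("matchLabels", {})
    return all(node_labels.get(k) == v for k, v in wanted.items())


def configs_for_node(configs: dict, node_labels: dict) -> list[str]:
    """Return the names of configs whose selector matches this node."""
    return sorted(
        name
        for name, spec in configs.items()
        if selector_matches(spec.get("nodeSelector"), node_labels)
    )
```

Re-running `configs_for_node` whenever the node's labels change yields the new set of matching configs, which then feeds the same reconciliation path as a CRD event.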
176+
177+
**Benefits of the LPM Trie map:**
178+
- Each endpoint can use a different prefix length (`/16`, `/24`, `/25`, etc.)
179+
- The BPF LPM trie performs longest-prefix-match automatically in the datapath
180+
- Removes the `--vtep-mask` global setting entirely
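
To make the lookup semantics concrete, here is a small user-space model of longest-prefix matching in Python. It is a linear scan for illustration, not how the kernel's LPM trie is implemented, and the function name `lpm_lookup` is hypothetical.

```python
import ipaddress


def lpm_lookup(table: dict[str, str], dst_ip: str):
    """Return the tunnel endpoint of the most specific CIDR covering dst_ip."""
    addr = ipaddress.IPv4Address(dst_ip)
    best_len, best_endpoint = -1, None
    for cidr, endpoint in table.items():
        net = ipaddress.IPv4Network(cidr)
        if addr in net and net.prefixlen > best_len:
            best_len, best_endpoint = net.prefixlen, endpoint
    return best_endpoint
```

With both `10.2.0.0/16` and `10.2.3.0/24` in the table, a packet to `10.2.3.7` resolves to the `/24` entry while `10.2.9.9` falls back to the `/16`, which a single global mask cannot express.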
181+
182+
### Status Reporting
183+
184+
Each `CiliumVTEPConfig` object reports status via the `.status` subresource:
185+
186+
```yaml
status:
  endpointCount: 2
  conditions:
    - type: Ready
      status: "True"
      lastTransitionTime: "2026-04-07T10:00:00Z"
      reason: AllEndpointsSynced
      message: "All 2 endpoints synced to BPF map"
  endpointStatuses:
    - name: dc1-router
      synced: true
      lastSyncTime: "2026-04-07T10:00:00Z"
    - name: dc1-lb
      synced: true
      lastSyncTime: "2026-04-07T10:00:00Z"
```
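
One plausible way to derive the `Ready` condition from the per-endpoint entries is sketched below in Python. The rule "Ready only when every endpoint is synced" and the reason `EndpointsNotSynced` are assumptions made for illustration; only `AllEndpointsSynced` and its message format appear in the example above.

```python
def summarize_status(endpoint_statuses: list[dict]) -> dict:
    """Collapse endpointStatuses into endpointCount plus a Ready condition."""
    total = len(endpoint_statuses)
    synced = sum(1 for s in endpoint_statuses if s.get("synced"))
    ready = total > 0 and synced == total
    if ready:
        reason = "AllEndpointsSynced"
        message = f"All {total} endpoints synced to BPF map"
    else:
        reason = "EndpointsNotSynced"  # hypothetical reason string
        message = f"{synced} of {total} endpoints synced to BPF map"
    return {
        "endpointCount": total,
        "condition": {
            "type": "Ready",
            "status": "True" if ready else "False",
            "reason": reason,
            "message": message,
        },
    }
```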
203+
204+
Operators can monitor VTEP health at a glance:
205+
206+
```shell
$ kubectl get cvtep
NAME     ENDPOINTS   READY   AGE
zone-a   2           True    1h
zone-b   2           True    1h
```
212+
213+
### Helm Integration
214+
215+
VTEP is enabled via Helm with a single toggle:
216+
217+
```shell
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set vtep.enabled=true
```
223+
224+
This registers the `CiliumVTEPConfig` CRD. VTEP endpoints are then configured
225+
by applying CRD objects.
226+
227+
### Per-Zone VTEP Endpoints: Worked Example
228+
229+
Two zones with different VTEP gateways for the same destination CIDR:
230+
231+
```yaml
# Zone-A nodes route 10.200.0.0/16 via VTEP gateway 10.100.1.1
apiVersion: cilium.io/v2
kind: CiliumVTEPConfig
metadata:
  name: zone-a
spec:
  nodeSelector:
    matchLabels:
      topology.kubernetes.io/zone: "zone-a"
  endpoints:
    - name: gw-a
      cidr: "10.200.0.0/16"
      tunnelEndpoint: "10.100.1.1"
      mac: "aa:bb:cc:00:01:01"
---
# Zone-B nodes route 10.200.0.0/16 via VTEP gateway 10.100.2.1
apiVersion: cilium.io/v2
kind: CiliumVTEPConfig
metadata:
  name: zone-b
spec:
  nodeSelector:
    matchLabels:
      topology.kubernetes.io/zone: "zone-b"
  endpoints:
    - name: gw-b
      cidr: "10.200.0.0/16"
      tunnelEndpoint: "10.100.2.1"
      mac: "aa:bb:cc:00:02:01"
```
262+
263+
Both zones route traffic to `10.200.0.0/16` but via their local VTEP gateway.
264+
When a new zone comes online, the
265+
operator applies a new `CiliumVTEPConfig` for that zone, with no agent restart.
266+
267+
## Impacts / Key Questions
268+
269+
### Impact: Existing VTEP Users
270+
271+
The CLI flags (`--vtep-endpoint`, `--vtep-cidr`, `--vtep-mac`, `--vtep-mask`,
272+
`--vtep-sync-interval`) have been removed. Users must migrate to the CRD.
273+
The migration is straightforward: each set of CLI flag values maps to one
274+
`CiliumVTEPConfig` object with a single endpoint.
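
As a rough illustration of that mapping, the Python sketch below turns one set of legacy flag values into the body of a `CiliumVTEPConfig`. The function name `flags_to_config`, the choice to reuse one string for both the object name and the endpoint name, and the decision to take the prefix length from `--vtep-mask` rather than from any suffix on `--vtep-cidr` are all assumptions; a real migration tool may behave differently.

```python
import ipaddress


def flags_to_config(name: str, endpoint: str, cidr: str, mac: str, mask: str) -> dict:
    """Map one (--vtep-endpoint, --vtep-cidr, --vtep-mac, --vtep-mask) set
    to a single-endpoint CiliumVTEPConfig manifest, expressed as a dict."""
    base = cidr.split("/")[0]
    # The legacy global mask decides the prefix length (assumption).
    network = ipaddress.IPv4Network(f"{base}/{mask}", strict=False)
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumVTEPConfig",
        "metadata": {"name": name},
        "spec": {
            "endpoints": [
                {
                    "name": name,
                    "cidr": str(network),
                    "tunnelEndpoint": endpoint,
                    "mac": mac,
                }
            ]
        },
    }
```

The resulting dict can be serialized to YAML or JSON and applied with `kubectl apply -f`.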
275+
276+
### Impact: BPF Map Type Change
277+
278+
Changing from Hash to LPM Trie alters the map's behavior:
279+
280+
| Property | Hash | LPM Trie |
|---|---|---|
| Lookup semantics | Exact match | Longest prefix match |
| Key size | 4 bytes | 8 bytes (4 prefix + 4 IP) |
| Preallocation | Supported | Not supported (`BPF_F_NO_PREALLOC` required) |
| Max entries | 8 | 8 |
286+
287+
The LPM Trie is strictly more capable. Existing configurations with uniform
288+
prefix lengths produce identical routing behavior.
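
The key-size row can be illustrated by packing both key shapes in Python. The LPM layout below follows the kernel's generic `struct bpf_lpm_trie_key` (a 32-bit prefix length in host byte order followed by the address bytes); whether the Cilium map uses exactly this struct, and the helper names `hash_key` and `lpm_key`, are assumptions.

```python
import ipaddress
import struct


def hash_key(cidr: str) -> bytes:
    """Old-style key: the 4-byte network address only (exact match)."""
    net = ipaddress.IPv4Network(cidr, strict=False)
    return net.network_address.packed


def lpm_key(cidr: str) -> bytes:
    """LPM-style key: u32 prefix length + 4 address bytes (8 bytes total)."""
    net = ipaddress.IPv4Network(cidr, strict=False)
    # "=I" packs the prefix length in native byte order without padding;
    # the address itself stays in network byte order.
    return struct.pack("=I4s", net.prefixlen, net.network_address.packed)
```

Because the prefix length travels inside each key, two entries such as `10.2.0.0/16` and `10.2.3.0/24` can coexist in one map, which is what makes the global `--vtep-mask` unnecessary.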
289+
290+
## Future Milestones
291+
292+
### Graduating VTEP to GA
293+
294+
With CRD-based management, status reporting, and CI conformance tests, the
295+
VTEP feature has a clearer path to GA status. The CRD API provides the
296+
validation, observability, and operational tooling expected of a GA feature.