296 changes: 296 additions & 0 deletions cilium/CFP-44188-ciliumvtepconfig-crd.md
Member

I have two high level thoughts, related, which I wonder about for the overall architecture. I don't have a strong opinion on these approaches, but they seem within the possible design space, so they're worth considering:

  1. Could Cilium integrate natively to the Linux stack to delegate routing of this traffic to the Linux routing table, then have another component sync the desired state into the kernel routing table?
  2. Alternatively if this is difficult due to VNI selection, could perhaps datapath plugins provide an alternative integration mode?

@@ -0,0 +1,296 @@
# CFP-44188: CiliumVTEPConfig CRD for Dynamic VTEP Management

**SIG:** SIG-Datapath ([View all current SIGs](https://docs.cilium.io/en/stable/community/community/#all-sigs))

**Begin Design Discussion:** 2026-04-07

**Cilium Release:** 1.20

**Authors:** Murat Parlakisik <parlakisik@gmail.com>

**Status:** Draft

## Summary

Replace the static CLI flag-based VTEP configuration with a cluster-scoped
`CiliumVTEPConfig` CRD that supports dynamic updates, per-node assignment via
`nodeSelector`, and per-endpoint status reporting. This enables production use
cases where operators need simple overlay
connectivity to external gateways without requiring BGP or L2 announcements.

Member

is "endpoint" here a "VTEP endpoint"? I would suggest not using shorthand, because endpoint is already overloaded (twice - k8s and cilium have established meanings for this word which are not fully aligned)

## Motivation

The VTEP integration has been in beta since its introduction in
[PR #17370](https://github.com/cilium/cilium/pull/17370) (Cilium 1.12). The
original design uses static CLI flags (`--vtep-endpoint`, `--vtep-cidr`,
`--vtep-mac`, `--vtep-mask`) baked into the Cilium ConfigMap at install time.
This has several operational problems:

1. **Agent restarts required for any change.** Adding, removing, or modifying a
VTEP endpoint requires updating the ConfigMap and restarting every Cilium
agent in the cluster, causing datapath disruption.

2. **Single mask for all CIDRs.** The `--vtep-mask` flag applies one prefix
length to every VTEP CIDR, preventing mixed prefix lengths (e.g., `/24` and
`/16` in the same cluster).

3. **No per-node differentiation.** Every node gets the same VTEP
configuration. In multi-zone or multi-site deployments, different nodes
need to reach the same external CIDR via different VTEP gateways.

4. **No operational visibility.** There is no way to determine whether a VTEP
endpoint is successfully programmed in the BPF map or if configuration
errors exist.

5. **No CI/CD integration.** The feature remains in beta because of gaps in support and e2e test coverage.

These limitations block adoption in environments where VTEP integration would
otherwise be the simplest and most natural connectivity solution.

### The Case for Overlay-to-Gateway Simplicity

Some users do not want to manage BGP or L2 announcements just to send traffic to a network behind an external gateway.
For them, the VTEP approach offers a fundamentally simpler model:

Pods send traffic via the existing VXLAN overlay directly to an external
VTEP endpoint. No BGP sessions to configure and maintain. No L2 announcement
policies. No route redistribution. The Cilium agent simply encapsulates
traffic destined for external CIDRs and sends it to a known VTEP endpoint.
Comment on lines +52 to +58
Member

BGP and L2 announcements features are about advertising addresses on connected networks in order to configure remote clusters how to transmit towards Cilium. This feature rather seems to be about how to configure Cilium in order to route traffic from Cilium. Are they really equivalent?




## Goals

* Replace static CLI flags with a `CiliumVTEPConfig` CRD for VTEP
configuration
* Support dynamic add/update/remove of VTEP endpoints without agent restarts
* Enable per-node VTEP assignment via `nodeSelector` for multi-zone deployments
* Provide per-endpoint status reporting (synced, errors, last sync time)
* Support variable prefix lengths per VTEP CIDR (via BPF LPM Trie)
* Enable simple overlay-to-gateway connectivity without requiring BGP or L2
announcement infrastructure

## Non-Goals

* Multi-VNI support — the existing VNI=2 / world-identity model is preserved
* IPv6 VTEP endpoints — IPv4 only, consistent with current VTEP support
* Changes to the VXLAN encapsulation format or behavior


## Proposal

### Overview

![VTEP Architecture](../cilium/images/CFP-44188-vtep-connectivity.png)

Worker nodes can be independently
assigned to different VTEP endpoints using `nodeSelector`. The red and blue arrows represent traffic
from different `CiliumVTEPConfig` objects — each config targets a subset of
nodes via label selectors and directs their VXLAN-encapsulated traffic to
the appropriate external VTEP endpoint. This per-node assignment is what
enables multi-zone and multi-site deployments where each group of nodes has
its own local gateway.

The design introduces three components:

1. **CiliumVTEPConfig CRD** — a cluster-scoped custom resource that declares
VTEP endpoints with optional node targeting
2. **VTEPReconciler** — an agent-side controller that watches CRD events,
evaluates `nodeSelector`, and reconciles the BPF map
3. **BPF LPM Trie map** — replaces the existing Hash map to support
variable-length prefix matching

### CRD API

Member

I won't review the API now since there's plenty of other open questions to consider first, but the API will need to be reviewed.

```yaml
apiVersion: cilium.io/v2
kind: CiliumVTEPConfig
metadata:
  name: zone-a
spec:
  nodeSelector:    # optional
    matchLabels:
      topology.kubernetes.io/zone: "zone-a"
  endpoints:
    - name: dc1-router
      cidr: "10.1.1.0/24"
      tunnelEndpoint: "10.169.72.236"
      mac: "82:36:4c:98:2e:56"
    - name: dc1-lb
      cidr: "10.2.0.0/16"
      tunnelEndpoint: "10.169.72.237"
      mac: "aa:bb:cc:dd:ee:01"
```

**Key design decisions:**

| Decision | Rationale |
|---|---|
| Cluster-scoped (not namespaced) | VTEP config is infrastructure-level, managed by platform teams |
| `nodeSelector` on the config, not per-endpoint | Matches the physical topology model: a set of endpoints belongs to a site/zone |
| Max 8 endpoints per config | BPF map size constraint; multiple configs can target different node groups |
| `shortName: cvtep` | Quick operational access: `kubectl get cvtep` |
| Per-endpoint named entries with `+listType=map` | Enables strategic merge patch for individual endpoint updates |

**Validation:**

All fields use kubebuilder validation markers:
- `tunnelEndpoint`: IPv4 address regex
- `cidr`: IPv4 CIDR notation regex (e.g., `10.1.1.0/24`)
- `mac`: MAC address regex (colon-separated hex)
- `name`: DNS label format (lowercase alphanumeric, hyphens, 1-63 chars)
- `endpoints`: MinItems=1, MaxItems=8
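
The same rules can be approximated outside the API server, for example in a pre-commit check on manifests. The following is a minimal Python sketch, not the kubebuilder markers themselves: the regular expressions are assumptions (the proposal does not spell out the real patterns), and `validate_endpoint` / `validate_endpoints` are hypothetical helper names.

```python
import ipaddress
import re

# Assumed patterns; the CFP does not list the actual kubebuilder regexes.
MAC_RE = re.compile(r"[0-9A-Fa-f]{2}(:[0-9A-Fa-f]{2}){5}")
DNS_LABEL_RE = re.compile(r"[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?")


def validate_endpoint(ep):
    """Return a list of validation errors for one endpoint dict."""
    errors = []
    if not DNS_LABEL_RE.fullmatch(ep.get("name", "")):
        errors.append("name: must be a DNS label (1-63 chars)")
    try:
        ipaddress.IPv4Address(ep.get("tunnelEndpoint", ""))
    except ValueError:
        errors.append("tunnelEndpoint: must be an IPv4 address")
    cidr = ep.get("cidr", "")
    try:
        # Require an explicit prefix length; host bits are tolerated here.
        if "/" not in cidr:
            raise ValueError("missing prefix length")
        ipaddress.IPv4Network(cidr, strict=False)
    except ValueError:
        errors.append("cidr: must be IPv4 CIDR notation")
    if not MAC_RE.fullmatch(ep.get("mac", "")):
        errors.append("mac: must be colon-separated hex")
    return errors


def validate_endpoints(endpoints):
    """Apply the MinItems=1 / MaxItems=8 rule, then validate each entry."""
    if not 1 <= len(endpoints) <= 8:
        return ["endpoints: must contain between 1 and 8 items"]
    errors = []
    for ep in endpoints:
        errors.extend(validate_endpoint(ep))
    return errors
```

An empty list from `validate_endpoints` means the manifest passes these approximate checks; the API server's own validation remains authoritative.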

### VTEPReconciler

The reconciler runs as a Hive cell in each Cilium agent.


**Reconciliation flow:**

1. On startup, the reconciler receives all `CiliumVTEPConfig` objects via
`resource.Resource[*CiliumVTEPConfig]`.

2. For each event (upsert or delete), it re-evaluates which configs match the
local node's labels using `nodeSelector`.

3. It computes the **desired state** — a map of normalized CIDR → endpoint
from all matching configs.

4. It detects **CIDR conflicts** — if the same CIDR appears in multiple
matching configs, neither is applied and both configs receive an error
status.

5. It diffs desired state against **last-applied state** and performs
incremental BPF map updates:
- New CIDRs → `UpdateEntry()`
- Changed endpoints → `UpdateEntry()` (overwrite)
- Removed CIDRs → `DeleteByCIDR()`

6. It updates Linux routing table entries for VTEP CIDRs.
Member

This actually seems like the crux of the functionality proposed here (+ the encap/decap config). I wonder whether we really want to have a dedicated VTEP CRD for this or whether it's time to consider a more generic "routing" CRD (even if initially focused just on the VTEP use case). I had a draft for such an idea about three years ago, but it never got traction so I didn't end up posting it publicly. But if this is interesting, maybe I can dust it off.


7. It writes per-endpoint status back to the CRD's `.status` subresource.
Member

As a general statement, we avoid putting logic into cilium-agent to update .status because as you scale up, this causes significant load and conflicts on kube-apiserver due to competing agents attempting to make similar updates, often at the same time.

Member

Probably this is a case for understanding the tradeoff with @cilium/sig-scalability , especially size of targeted environments, and then considering how we might gain the desired operational visibility without introducing scalability concerns.
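
Steps 3 to 5 of the reconciliation flow (desired state, conflict detection, incremental diff) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the agent's Go code: `desired_state` and `diff` are hypothetical names, the input is assumed to be already filtered to configs matching the local node, and the sketch drops only the conflicting CIDR, whereas the proposal may intend to reject the whole config.

```python
import ipaddress


def desired_state(matching_configs):
    """Build {normalized CIDR: (tunnelEndpoint, mac)} plus a conflict report.

    matching_configs: {config name: list of endpoint dicts}, assumed to be
    pre-filtered to the configs whose nodeSelector matches this node.
    """
    owners = {}   # normalized CIDR -> set of config names declaring it
    entries = {}  # normalized CIDR -> (tunnelEndpoint, mac)
    for cfg_name, endpoints in matching_configs.items():
        for ep in endpoints:
            # Step 3: normalize so 10.1.1.5/24 and 10.1.1.0/24 compare equal.
            cidr = str(ipaddress.IPv4Network(ep["cidr"], strict=False))
            owners.setdefault(cidr, set()).add(cfg_name)
            entries[cidr] = (ep["tunnelEndpoint"], ep["mac"])
    # Step 4: a CIDR declared by more than one matching config is a conflict.
    conflicts = {c: sorted(n) for c, n in owners.items() if len(n) > 1}
    desired = {c: e for c, e in entries.items() if c not in conflicts}
    return desired, conflicts


def diff(desired, last_applied):
    """Step 5: split the change set into map upserts and deletes."""
    upserts = {c: e for c, e in desired.items() if last_applied.get(c) != e}
    deletes = sorted(c for c in last_applied if c not in desired)
    return upserts, deletes
```

In this model, `upserts` would correspond to `UpdateEntry()` calls and `deletes` to `DeleteByCIDR()` calls; an unchanged desired state yields no map operations at all.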


**Node label change handling:** The reconciler watches the local node's labels. When labels change, it re-evaluates all
`nodeSelector` predicates and reconciles the BPF map accordingly.
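
The `nodeSelector` evaluation described above can be modeled as a pure function of the node's labels, which is what makes re-evaluation on label change straightforward. A minimal Python sketch follows; it handles `matchLabels` only (`matchExpressions` is ignored), and it assumes that an omitted selector matches every node, which the proposal implies by marking the field optional but does not state. `selector_matches` and `matching_configs` are hypothetical names.

```python
def selector_matches(node_selector, node_labels):
    """True if every matchLabels pair is present on the node.

    Assumption: a missing or empty selector selects all nodes.
    """
    if not node_selector:
        return True
    wanted = node_selector.get("matchLabels", {})
    return all(node_labels.get(k) == v for k, v in wanted.items())


def matching_configs(configs, node_labels):
    """Sorted names of the configs whose nodeSelector selects this node.

    configs: {config name: spec dict}, where spec may contain "nodeSelector".
    """
    return sorted(
        name
        for name, spec in configs.items()
        if selector_matches(spec.get("nodeSelector"), node_labels)
    )
```

When the node's labels change, calling `matching_configs` again with the new labels yields the new set of applicable configs, which then feeds the normal reconciliation path.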

**Benefits:**
- Each endpoint can use a different prefix length (`/16`, `/24`, `/25`, etc.)
- The BPF LPM trie performs longest-prefix-match automatically in the datapath
- Removes the `--vtep-mask` global setting entirely
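
To make the longest-prefix-match behavior concrete, here is a small Python model of the lookup semantics. It is a linear scan over a dictionary, so it illustrates what the LPM trie returns rather than how the kernel data structure works; `lpm_lookup` and the gateway labels in the usage example are hypothetical.

```python
import ipaddress


def lpm_lookup(table, dst):
    """Return the value of the most specific prefix containing dst, or None.

    table: {IPv4 CIDR string: value}
    """
    addr = ipaddress.IPv4Address(dst)
    best = None  # (network, value) of the longest matching prefix so far
    for cidr, value in table.items():
        net = ipaddress.IPv4Network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, value)
    return best[1] if best else None
```

With a table holding both `10.2.0.0/16` and a nested `10.2.3.0/25`, a destination of `10.2.3.7` resolves to the `/25` entry while `10.2.3.200` falls back to the `/16` entry, which a single global mask cannot express.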

### Status Reporting

Each `CiliumVTEPConfig` object reports status via the `.status` subresource:

```yaml
status:
  endpointCount: 2
  conditions:
    - type: Ready
      status: "True"
      lastTransitionTime: "2026-04-07T10:00:00Z"
      reason: AllEndpointsSynced
      message: "All 2 endpoints synced to BPF map"
  endpointStatuses:
    - name: dc1-router
      synced: true
      lastSyncTime: "2026-04-07T10:00:00Z"
    - name: dc1-lb
      synced: true
      lastSyncTime: "2026-04-07T10:00:00Z"
```
```
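
A consumer of this status, such as an alerting script, might read it as follows. This Python sketch assumes the status has already been loaded into a dictionary with the field names shown above; `is_ready` and `unsynced_endpoints` are hypothetical helpers, not part of the proposal.

```python
def is_ready(status):
    """True if the status carries a Ready condition with status "True"."""
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )


def unsynced_endpoints(status):
    """Names of endpoints whose `synced` field is missing or false."""
    return [
        s["name"]
        for s in status.get("endpointStatuses", [])
        if not s.get("synced", False)
    ]
```

For the example status above, `is_ready` would return `True` and `unsynced_endpoints` would return an empty list.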

Operators can monitor VTEP health at a glance:

```shell
$ kubectl get cvtep
NAME     ENDPOINTS   READY   AGE
zone-a   2           True    1h
zone-b   2           True    1h
```

### Helm Integration

VTEP is enabled via Helm with a single toggle:

```shell
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set vtep.enabled=true
```

This registers the `CiliumVTEPConfig` CRD. VTEP endpoints are then configured
by applying CRD objects.

### Per-Zone VTEP Endpoints: Worked Example

Two zones with different VTEP gateways for the same destination CIDR:

```yaml
# Zone-A nodes route 10.200.0.0/16 via VTEP gateway 10.100.1.1
apiVersion: cilium.io/v2
kind: CiliumVTEPConfig
metadata:
  name: zone-a
spec:
  nodeSelector:
    matchLabels:
      topology.kubernetes.io/zone: "zone-a"
  endpoints:
    - name: gw-a
      cidr: "10.200.0.0/16"
      tunnelEndpoint: "10.100.1.1"
      mac: "aa:bb:cc:00:01:01"
---
# Zone-B nodes route 10.200.0.0/16 via VTEP gateway 10.100.2.1
apiVersion: cilium.io/v2
kind: CiliumVTEPConfig
metadata:
  name: zone-b
spec:
  nodeSelector:
    matchLabels:
      topology.kubernetes.io/zone: "zone-b"
  endpoints:
    - name: gw-b
      cidr: "10.200.0.0/16"
      tunnelEndpoint: "10.100.2.1"
      mac: "aa:bb:cc:00:02:01"
```

Both zones route traffic to `10.200.0.0/16` but via their local VTEP gateway.
When a new zone comes online, the operator applies a new `CiliumVTEPConfig`.

## Impacts / Key Questions

### Impact: Existing VTEP Users

This proposal removes the CLI flags (`--vtep-endpoint`, `--vtep-cidr`, `--vtep-mac`,
`--vtep-mask`, `--vtep-sync-interval`). Users must migrate to the CRD.
The migration is straightforward: each set of CLI flag values maps to one
`CiliumVTEPConfig` object with a single endpoint.
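
The mapping can be sketched as a small Python conversion helper. Several things here are assumptions rather than part of the proposal: the legacy flag values are taken to be already split into parallel lists, each CIDR is taken to carry its prefix length (so `--vtep-mask` is not needed), the generated object names are placeholders, and omitting `nodeSelector` is assumed to apply the config to every node, as the flags did. `flags_to_configs` is a hypothetical name.

```python
def flags_to_configs(endpoints, cidrs, macs, name_prefix="legacy-vtep"):
    """Turn parallel legacy flag values into CiliumVTEPConfig manifests.

    Produces one object per (endpoint, cidr, mac) triple, each with a
    single endpoint and no nodeSelector.
    """
    if not (len(endpoints) == len(cidrs) == len(macs)):
        raise ValueError("flag lists must have the same length")
    configs = []
    for i, (ep, cidr, mac) in enumerate(zip(endpoints, cidrs, macs)):
        name = f"{name_prefix}-{i}"  # placeholder naming scheme
        configs.append({
            "apiVersion": "cilium.io/v2",
            "kind": "CiliumVTEPConfig",
            "metadata": {"name": name},
            "spec": {
                "endpoints": [{
                    "name": name,
                    "cidr": cidr,
                    "tunnelEndpoint": ep,
                    "mac": mac,
                }],
            },
        })
    return configs
```

The resulting dictionaries can be serialized to YAML with any YAML library and applied with `kubectl apply`.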

### Impact: BPF Map Type Change

Changing from Hash to LPM Trie alters the map's behavior:

| Property | Hash | LPM Trie |
|---|---|---|
| Lookup semantics | Exact match | Longest prefix match |
| Key size | 4 bytes | 8 bytes (4 prefix + 4 IP) |
| Preallocation | Supported | Not supported (`BPF_F_NO_PREALLOC` required) |
| Max entries | 8 | 8 |

The LPM Trie is strictly more capable. Existing configurations with uniform
prefix lengths produce identical routing behavior.
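
The 8-byte key in the table follows the kernel's LPM trie key layout: a 32-bit prefix length in host byte order followed by the address bytes in network byte order. A minimal Python sketch of that packing is shown below; `lpm_key` is a hypothetical helper, and the actual Cilium key struct may differ in naming.

```python
import ipaddress
import struct


def lpm_key(cidr):
    """Pack an IPv4 CIDR as 4 bytes of prefix length plus 4 address bytes."""
    net = ipaddress.IPv4Network(cidr, strict=False)
    # "=I": native byte order, standard 4-byte size, no padding.
    prefix = struct.pack("=I", net.prefixlen)
    # .packed is the address in network (big-endian) byte order.
    return prefix + net.network_address.packed
```

For comparison, the Hash map key is just the 4 address bytes, which is why it can only express a single implicit prefix length.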

Member

Key question: How will this integrate with the broader architecture? Ideally whatever we come up with, it has a path to eventually integrate with all other GA features where applicable.

Specifically, consider how this interfaces with masquerading, egress gateway, and encryption.

## Future Milestones

### Graduating VTEP to GA

With CRD-based management, status reporting, and CI conformance tests, the
VTEP feature has a clearer path to GA status. The CRD API provides the
validation, observability, and operational tooling expected of a GA feature.
Binary file added cilium/images/CFP-44188-vtep-connectivity.png