321 changes: 321 additions & 0 deletions cilium/CFP-34702-per-endpoint-vlan.md
# CFP-34702: Per-Endpoint VLAN support in datapath

**SIG: SIG-DATAPATH**

**Begin Design Discussion:** 2026-05-11

**Cilium Release:** 1.21

**Authors:** l1b0k <libokang.dev@gmail.com>

**Status:** Draft

## Summary

Associate each Cilium-managed endpoint with an 802.1Q VLAN ID. On ingress
the datapath strips the VLAN tag so that BPF can route based on the
destination IP. On egress the datapath pushes the source endpoint's VLAN
tag back onto the frame before it leaves through the parent device.

The feature is opt-in via `--enable-endpoint-vlan` and is mutually
exclusive with `--vlan-bpf-bypass` (`VLAN_FILTER`).

Tracking issue: [cilium/cilium#34702](https://github.com/cilium/cilium/issues/34702).
Implementation: [cilium/cilium#44619](https://github.com/cilium/cilium/pull/44619).

## Motivation

In several deployment models a single host NIC carries multiple VLANs and
each pod logically belongs to exactly one of them:

- Cloud environments using trunk ENIs: a single trunk ENI on the host
carries traffic for many pods, each tagged with its own VLAN ID.
- On-prem networks that segment pods across VLAN sub-interfaces of a
shared uplink.

Today Cilium offers two workarounds, both with significant downsides:

1. Pre-create a Linux VLAN sub-interface per VLAN and attach Cilium to
each. This multiplies datapath state per node and complicates IPAM
and BPF map management.
2. Use `--vlan-bpf-bypass` to let VLAN-tagged traffic skip BPF. This
gives up policy enforcement and observability on exactly the traffic
that needs it.

There is no way today to keep one tc/tcx attachment on the parent NIC
while still letting BPF know which VLAN each endpoint belongs to. This
CFP closes that gap so that policy, identity, and observability work
uniformly across VLANs sharing a single parent NIC.

## Goals

- Persist a `vlan_id` attribute on each endpoint, exposed to BPF via the
existing `lxcmap` `endpoint_info`.
- On ingress, strip the VLAN tag before further BPF processing so that
routing decisions are based on the destination IP.
- On egress, push the VLAN tag based on the source endpoint's `vlan_id`.
- No change to existing data flow paths. Intra-node pod-to-pod traffic
continues to use veth pairs and is not subject to VLAN tagging. VLAN
operations only apply to traffic traversing the trunk NIC.
- Expose an API surface (`vlan-id` on `EndpointChangeRequest`) so that
upstream CNI plugins can populate the VLAN ID. Cilium does not own
VLAN allocation or chaining.
- Keep the feature opt-in and off by default.

## Non-Goals

- VLAN discovery, VLAN allocation policy, and IP-to-VLAN mapping. These
remain the responsibility of the upstream CNI plugin.
- Extending Cilium-native IPAM to model VLANs as first-class objects.
- Coexistence with `--vlan-bpf-bypass`. The two are mutually exclusive
by design (see Impacts).
- QinQ / double-tagged VLANs.
- Interactions with VXLAN / Geneve encapsulation. Overlay modes are out
of scope for this CFP.

## Proposal

### Overview

Add a per-endpoint `vlan_id` (uint16) that flows from the CNI plugin
into the daemon, into `lxcmap`, and finally into BPF. Extend the tc/tcx
hooks on the parent NIC to strip VLAN tags on ingress and push them on
egress:

**Ingress (`cil_from_netdev`):** if `ctx->vlan_present` is set, extract
the VLAN ID from the frame. If `allow_vlan(ifindex, vlan_id)` returns
true (the predicate generated by the existing `VLAN_FILTER` macro from
`--vlan-bpf-bypass`), pass the frame to the kernel for sub-interface
handling. Otherwise, if `CONFIG(enable_endpoint_vlan)` is set, call
`skb_vlan_pop` to strip the tag so that BPF routes the packet by
destination IP as usual. If neither condition matches, drop the frame
with `DROP_VLAN_FILTERED`. This is the drop reason the existing
bypass-list path already uses; reusing it means operators don't need a
new metric to track "VLAN tagged but no endpoint configured".

**Egress (`cil_to_netdev`):** after all BPF processing is done, if
`CONFIG(enable_endpoint_vlan)` is set and `ctx->vlan_present` is false,
call `ep_vlan_push_egress`. This helper resolves the source endpoint
via IP lookup (`__lookup_ip4_endpoint` / `__lookup_ip6_endpoint`, so
both v4 and v6 are handled); if `ep->vlan_id` is non-zero, it pushes
the corresponding 802.1Q tag via `skb_vlan_push`. Source-IP lookup is
chosen over ifindex/skb metadata because at `cil_to_netdev` the frame
has already been routed onto the trunk NIC and no per-veth context
remains; the source IP at this point is the post-policy, post-NAT
address that uniquely identifies a local endpoint.

```
        Ingress                         Egress
   ─────────────────               ─────────────────
  parent NIC (trunk)              parent NIC (trunk)
           │                               ▲
           ▼                               │
┌─────────────────────┐         ┌─────────────────────┐
│  cil_from_netdev    │         │  cil_to_netdev      │
│                     │         │                     │
│  vlan_present?      │         │ ep = lookup(src_ip) │
│  ├─ allow_vlan()    │         │ if ep->vlan_id:     │
│  │  → pass to kern  │         │   skb_vlan_push()   │
│  ├─ endpoint_vlan?  │         └─────────────────────┘
│  │  → skb_vlan_pop  │                    ▲
│  └─ else            │                    │
│     → DROP          │               veth ← pod
└──────────┬──────────┘
┌─────────────────────┐
│  route by dst IP    │
└──────────┬──────────┘
      veth → pod
```
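The "extract the VLAN ID" step on ingress amounts to masking the 802.1Q Tag Control Information (TCI). The following is a minimal user-space sketch, not the proposed BPF code: the function and macro names are hypothetical, and it assumes the TCI has already been read in host byte order (in a tc program it would typically come from the `vlan_tci` field of `struct __sk_buff`).

```c
#include <stdint.h>

/* Illustrative constants; names are not taken from the Cilium tree. */
#define TCI_VID_MASK  0x0fffu  /* low 12 bits of the TCI carry the VLAN ID */
#define TCI_PCP_SHIFT 13       /* top 3 bits carry the priority code point */

/* Return the 12-bit VLAN ID from a host-byte-order 802.1Q TCI. */
static inline uint16_t vlan_id_from_tci(uint16_t tci)
{
	return (uint16_t)(tci & TCI_VID_MASK);
}

/* Return the 3-bit priority code point from the same TCI. */
static inline uint8_t vlan_pcp_from_tci(uint16_t tci)
{
	return (uint8_t)(tci >> TCI_PCP_SHIFT);
}
```

Only the VLAN ID matters for the bypass-list check; the priority bits are shown to make clear that they must be masked off before comparing against a configured ID.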

### Configuration

- New agent flag `--enable-endpoint-vlan`, default `false`.
- New option constant `option.EnableEndpointVLAN` and matching field on
`DaemonConfig`.
- The toggle is exposed to BPF via the `DECLARE_CONFIG` /
`CONFIG(enable_endpoint_vlan)` mechanism (no new `#define` in
`node_config.h`); the declaration lives in `bpf/lib/endpoint_vlan.h`
and the Go side is wired through `pkg/datapath/config/host_config.go`.
- The daemon rejects startup when both `--enable-endpoint-vlan` and
`--vlan-bpf-bypass` are set.
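
The startup check in the last bullet is a simple conjunction. The daemon itself is written in Go; the C sketch below only illustrates the rule, and the struct and function names are hypothetical rather than part of the proposal.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the two agent options involved. */
struct vlan_options {
	bool enable_endpoint_vlan;  /* --enable-endpoint-vlan */
	size_t bypass_vlan_count;   /* number of IDs passed to --vlan-bpf-bypass */
};

/* Return 0 when the combination is valid, -1 when both features are on. */
static int validate_vlan_options(const struct vlan_options *opts)
{
	if (opts->enable_endpoint_vlan && opts->bypass_vlan_count > 0)
		return -1;
	return 0;
}
```

Either option on its own, or neither, passes the check; only the combination is rejected.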

### Data model

`bpf/lib/eps.h` (where `struct endpoint_info` is defined):

```c
struct endpoint_info {
	...
	__u16 vlan_id; /* 802.1Q VLAN ID for trunk ENI, 0 = no VLAN */
	...
};
```

`pkg/maps/lxcmap`:

- `EndpointInfo` gains a `VlanID uint16` field (BTF-aligned to
`vlan_id`).
- `EndpointFrontend` interface gains `GetVlanID() uint16`.
- `String()` includes `vlan_id=` (only when non-zero) so
`cilium bpf endpoint list` shows it.

`pkg/endpoint`:

- `Endpoint` struct gains a `vlanID` field, persisted across restart
via `restore.go`, `cache.go`, and `api.go`.
- API setter validates the range `0..4094` (with `0` meaning "no
VLAN"). Out-of-range values are rejected at endpoint create / update
time.
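
The range rule for the API setter can be stated as a single predicate. This is an illustrative sketch in C (the real setter lives in the Go `pkg/endpoint` package); the names are hypothetical, and the input is taken as a signed 64-bit value on the assumption that the API integer is decoded before being narrowed to `uint16`.

```c
#include <stdbool.h>
#include <stdint.h>

#define EP_VLAN_NONE 0     /* endpoint is untagged */
#define EP_VLAN_MAX  4094  /* 4095 is reserved by 802.1Q */

/* Return true when the requested value is acceptable as an endpoint VLAN ID. */
static bool ep_vlan_id_is_valid(int64_t requested)
{
	return requested >= EP_VLAN_NONE && requested <= EP_VLAN_MAX;
}
```

Checking before narrowing matters: a value such as 65636 would otherwise wrap to 100 when stored in a 16-bit field and pass unnoticed.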

### Datapath

`bpf/lib/endpoint_vlan.h` (new) hosts the `DECLARE_CONFIG` for
`enable_endpoint_vlan` and a single helper:

- `ep_vlan_push_egress(ctx, proto)` resolves the source endpoint via
`__lookup_ip4_endpoint` / `__lookup_ip6_endpoint` and, when
`ep->vlan_id != 0`, calls `skb_vlan_push(ctx, ETH_P_8021Q,
ep->vlan_id)`. Returns `CTX_ACT_OK` on success or no-op, or a negative
drop reason on failure.

Ingress stripping does not get its own wrapper: `cil_from_netdev` calls
`skb_vlan_pop(ctx)` directly inside the ingress decision tree (see
Impacts).

`bpf_host.c`:

- `cil_from_netdev`: when `CONFIG(enable_endpoint_vlan)` is set and
`ctx->vlan_present` is true (and `allow_vlan` did not match), call
`skb_vlan_pop(ctx)` so BPF then routes the packet based on the
destination IP as usual.
- `cil_to_netdev`: after BPF processing, when
`CONFIG(enable_endpoint_vlan) && !ctx->vlan_present`, call
`ep_vlan_push_egress(ctx, proto)`.
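
The egress behaviour described above can be modelled in plain C to make the three no-op cases explicit. This is a user-space sketch under simplifying assumptions, not the BPF helper: the types, the linear-scan lookup, and the function names are stand-ins for `lxcmap` and `ep_vlan_push_egress`, and only IPv4 is modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for an lxcmap entry. */
struct ep_entry {
	uint32_t ipv4;     /* endpoint IPv4 address */
	uint16_t vlan_id;  /* 0 = no VLAN */
};

/* Simplified stand-in for the skb fields the egress hook looks at. */
struct frame {
	uint32_t src_ipv4;
	int vlan_present;
	uint16_t vlan_tci;
};

/* Linear scan standing in for the hash-map lookup by source IP. */
static const struct ep_entry *lookup_ep(const struct ep_entry *tbl, size_t n,
					uint32_t ip)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].ipv4 == ip)
			return &tbl[i];
	return NULL;
}

/* Return 1 if a tag was pushed, 0 if the frame was left untouched. */
static int model_vlan_push_egress(struct frame *f, const struct ep_entry *tbl,
				  size_t n)
{
	const struct ep_entry *ep;

	if (f->vlan_present)            /* already tagged: hook is skipped */
		return 0;
	ep = lookup_ep(tbl, n, f->src_ipv4);
	if (!ep || ep->vlan_id == 0)    /* unknown source or untagged endpoint */
		return 0;
	f->vlan_present = 1;
	f->vlan_tci = ep->vlan_id;      /* priority bits left at zero */
	return 1;
}
```

A frame is tagged only when it arrives untagged and its source IP resolves to a local endpoint with a non-zero `vlan_id`; host traffic and endpoints without a VLAN pass through unchanged.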



### CNI integration

This CFP adds a `vlan-id` field to `EndpointChangeRequest` in
`api/v1/openapi.yaml`:

```yaml
# api/v1/openapi.yaml – EndpointChangeRequest properties
vlan-id:
  description: >-
    802.1Q VLAN ID for endpoint traffic isolation. When set, VLAN tags
    are applied at the network boundary for this endpoint.
    0 means no VLAN tagging.
  type: integer
```

Cilium itself does not manage, allocate, or discover VLANs. Upstream
CNI plugins that wish to use this feature must populate the `vlan-id`
field when creating or updating endpoints through the Cilium API.

The field is optional and defaults to `0` (no VLAN), so older clients
and existing CNI integrations continue to work unchanged. Older agents
that predate this CFP ignore the field on the wire.

### Testing

- BPF unit test `bpf/tests/tc_endpoint_vlan.c` covers ingress VLAN
pop, egress VLAN push (v4 and v6), and the no-VLAN passthrough case.
- Go unit tests for the new `lxcmap` field and for option validation
(mutual-exclusion with `--vlan-bpf-bypass`).

## Impacts / Key Questions

### Impact: mutual exclusion with `--vlan-bpf-bypass`

`--vlan-bpf-bypass` (`VLAN_FILTER`) and `--enable-endpoint-vlan` have
opposite semantics and cannot coexist:

| | `--vlan-bpf-bypass` | `--enable-endpoint-vlan` |
|---|---|---|
| **Purpose** | Let listed VLANs bypass BPF entirely | Let BPF strip/push VLAN tags per endpoint |
| **BPF involvement** | None — tagged frames go straight to kernel sub-interfaces | Full — BPF routes by dst IP after stripping the tag |
| **Policy / observability** | Skipped for bypassed VLANs | Fully enforced |

Enabling both is a configuration error; the daemon will refuse to start.

**Ingress decision tree in `cil_from_netdev`:**

```
ctx->vlan_present?
├─ YES
│ ├─ allow_vlan(ifindex, vlan_id)? ← VLAN_FILTER macro
│ │ └─ YES → return CTX_ACT_OK (pass to kernel sub-interface)
│ ├─ CONFIG(enable_endpoint_vlan)?
│ │ └─ YES → skb_vlan_pop() → continue BPF routing by dst IP
│ └─ else → DROP_VLAN_FILTERED
└─ NO → normal BPF processing
```

The `VLAN_FILTER` macro is generated at compile time from the
`--vlan-bpf-bypass` VLAN ID list and expands inline to the
`allow_vlan(ifindex, vlan_id)` predicate referenced above. When
`--vlan-bpf-bypass` is not set, the macro expands to `return false`,
so `allow_vlan()` never matches and the branch is dead code. This means
when only `--enable-endpoint-vlan` is active, every VLAN-tagged frame
hits the `skb_vlan_pop` path — no traffic silently bypasses BPF.
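
The ingress decision tree can also be written as a pure function of three booleans, which makes the branch ordering easy to test in isolation. This is an illustrative model only: the enum and function names are hypothetical, and `bypass_match` stands for the result of `allow_vlan(ifindex, vlan_id)`.

```c
#include <stdbool.h>

enum vlan_verdict {
	VERDICT_NORMAL_BPF,      /* untagged: normal BPF processing */
	VERDICT_PASS_TO_KERNEL,  /* bypass list matched: CTX_ACT_OK */
	VERDICT_POP_AND_ROUTE,   /* strip the tag, then route by dst IP */
	VERDICT_DROP_FILTERED,   /* DROP_VLAN_FILTERED */
};

/* Mirror the order of checks in the decision tree above. */
static enum vlan_verdict classify_ingress(bool vlan_present, bool bypass_match,
					  bool endpoint_vlan_enabled)
{
	if (!vlan_present)
		return VERDICT_NORMAL_BPF;
	if (bypass_match)
		return VERDICT_PASS_TO_KERNEL;
	if (endpoint_vlan_enabled)
		return VERDICT_POP_AND_ROUTE;
	return VERDICT_DROP_FILTERED;
}
```

Because the two flags are mutually exclusive, `bypass_match` is always false when `endpoint_vlan_enabled` is true, so in that configuration every tagged frame yields `VERDICT_POP_AND_ROUTE`.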

**Egress path in `cil_to_netdev`:** after all BPF processing completes,
if `CONFIG(enable_endpoint_vlan)` is set and `ctx->vlan_present` is
false, the helper `ep_vlan_push_egress` resolves the source endpoint
by IP (v4 or v6) and pushes its `ep->vlan_id` via `skb_vlan_push`.

### Impact: relies on VLAN skb helpers

This design uses the existing kernel helpers for VLAN pop and push at
the host-device hook. The implementation should verify that the selected
hook point observes the VLAN information needed for ingress pop and can
emit the configured endpoint VLAN on egress. Environment-specific
behavior should be validated in the implementation PR on the target
deployments.

### Impact: performance

When `--enable-endpoint-vlan` is disabled, the existing datapath is
unchanged. When enabled, VLAN handling is limited to traffic crossing
the parent device: ingress strips the tag once before normal BPF
processing, and egress pushes the configured endpoint VLAN once after
normal BPF processing. Intra-node pod traffic continues to use the veth
path and does not take the VLAN path.

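The expected runtime cost is therefore bounded to the enabled
trunk-device path: one `skb_vlan_pop` per tagged ingress frame, and one
endpoint lookup plus at most one `skb_vlan_push` per untagged egress
frame.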

### Impact: VLAN spoofing

VLAN tags in this design serve as a datapath-level routing label, not a
security boundary. Network isolation between pods is enforced by
NetworkPolicy, not by VLAN membership.

**Ingress:** traffic arrives from the physical network through the trunk
NIC. The VLAN tag is set by the upstream switch or hypervisor and is not
controlled by the pod, so spoofing is not a concern on this path. The
ingress decision tree therefore strips the tag unconditionally (when
endpoint-VLAN is enabled and the frame did not match the bypass list)
without verifying that the tag matches the destination endpoint's
configured VLAN. Endpoint-to-VLAN binding correctness is enforced at
the upstream switch/hypervisor and on egress (below); duplicating it on
ingress would cost a per-packet endpoint lookup with no security
benefit on this trust boundary.

**Egress:** `cil_to_netdev` performs a source endpoint lookup and pushes
`ep->vlan_id` onto each untagged frame whose source endpoint has a
non-zero VLAN ID, so the tag on the wire comes from the endpoint's
configuration and not from the pod. Because the push is gated on
`!ctx->vlan_present`, a frame that reaches this hook already tagged is
left as-is rather than overwritten; the spoofing argument therefore
assumes that pod-set tags do not survive to `cil_to_netdev`. The
overhead is small: the lookup is a single `lxcmap` hash hit and
`skb_vlan_push` on an untagged frame operates on skb metadata without
touching packet data. Pods on the same node communicate through their
veth pairs without traversing the trunk NIC, so VLAN tags are never
involved in intra-node traffic.

## Future Milestones

### Hubble visibility

Surface `vlan_id` as a field on `Flow` (in addition to the drop reason)
so users can filter and group flows by VLAN.