diff --git a/cilium/CFP-34702-per-endpoint-vlan.md b/cilium/CFP-34702-per-endpoint-vlan.md
new file mode 100644
index 0000000..94b1aec
--- /dev/null
+++ b/cilium/CFP-34702-per-endpoint-vlan.md
@@ -0,0 +1,321 @@
+# CFP-34702: Per-Endpoint VLAN support in datapath
+
+**SIG: SIG-DATAPATH**
+
+**Begin Design Discussion:** 2026-05-11
+
+**Cilium Release:** 1.21
+
+**Authors:** l1b0k
+
+**Status:** Draft
+
+## Summary
+
+Associate each Cilium-managed endpoint with an 802.1Q VLAN ID. On ingress
+the datapath strips the VLAN tag so that BPF can route based on the
+destination IP. On egress the datapath pushes the source endpoint's VLAN
+tag back onto the frame before it leaves through the parent device.
+
+The feature is opt-in via `--enable-endpoint-vlan` and is mutually
+exclusive with `--vlan-bpf-bypass` (`VLAN_FILTER`).
+
+Tracking issue: [cilium/cilium#34702](https://github.com/cilium/cilium/issues/34702).
+Implementation: [cilium/cilium#44619](https://github.com/cilium/cilium/pull/44619).
+
+## Motivation
+
+In several deployment models a single host NIC carries multiple VLANs and
+each pod logically belongs to exactly one of them:
+
+- Cloud environments using trunk ENIs: a single trunk ENI on the host
+  carries traffic for many pods, each tagged with its own VLAN ID.
+- On-prem networks that segment pods across VLAN sub-interfaces of a
+  shared uplink.
+
+Today Cilium offers two workarounds, both with significant downsides:
+
+1. Pre-create a Linux VLAN sub-interface per VLAN and attach Cilium to
+   each. This multiplies datapath state per node and complicates IPAM
+   and BPF map management.
+2. Use `--vlan-bpf-bypass` to let VLAN-tagged traffic skip BPF. This
+   gives up policy enforcement and observability on exactly the traffic
+   that needs it.
+
+There is no way today to keep one tc/tcx attachment on the parent NIC
+while still letting BPF know which VLAN each endpoint belongs to.
+This CFP closes that gap so that policy, identity, and observability
+work uniformly across VLANs sharing a single parent NIC.
+
+## Goals
+
+- Persist a `vlan_id` attribute on each endpoint, exposed to BPF via the
+  existing `lxcmap` `endpoint_info`.
+- On ingress, strip the VLAN tag before further BPF processing so that
+  routing decisions are based on the destination IP.
+- On egress, push the VLAN tag based on the source endpoint's `vlan_id`.
+- No change to existing data flow paths. Intra-node pod-to-pod traffic
+  continues to use veth pairs and is not subject to VLAN tagging. VLAN
+  operations only apply to traffic traversing the trunk NIC.
+- Expose an API surface (`vlan-id` on `EndpointChangeRequest`) so that
+  upstream CNI plugins can populate the VLAN ID. Cilium does not own
+  VLAN allocation or chaining.
+- Keep the feature opt-in and off by default.
+
+## Non-Goals
+
+- VLAN discovery, VLAN allocation policy, and IP-to-VLAN mapping. These
+  remain the responsibility of the upstream CNI plugin.
+- Extending Cilium-native IPAM to model VLANs as first-class objects.
+- Coexistence with `--vlan-bpf-bypass`. The two are mutually exclusive
+  by design (see Impacts).
+- QinQ / double-tagged VLANs.
+- Interactions with VXLAN / Geneve encapsulation. Overlay modes are out
+  of scope for this CFP.
+
+## Proposal
+
+### Overview
+
+Add a per-endpoint `vlan_id` (uint16) that flows from the CNI plugin
+into the daemon, into `lxcmap`, and finally into BPF. Extend the tc/tcx
+hooks on the parent NIC to strip VLAN tags on ingress and push them on
+egress:
+
+**Ingress (`cil_from_netdev`):** when a VLAN-tagged frame arrives, check
+`ctx->vlan_present` and extract the VLAN ID. If `allow_vlan(ifindex,
+vlan_id)` returns true (the predicate generated by the existing
+`VLAN_FILTER` macro from `--vlan-bpf-bypass`), pass it to the kernel
+for sub-interface handling.
+Otherwise, if `CONFIG(enable_endpoint_vlan)` is set, call
+`skb_vlan_pop` to strip the tag so that BPF routes the packet by
+destination IP as usual. If neither condition matches, drop with
+`DROP_VLAN_FILTERED` (the same drop reason the existing bypass-list
+path uses; reused here so operators don't need a new metric to track
+"VLAN tagged but no endpoint configured").
+
+**Egress (`cil_to_netdev`):** after all BPF processing is done, if
+`CONFIG(enable_endpoint_vlan)` is set and `ctx->vlan_present` is false,
+call `ep_vlan_push_egress`. This helper resolves the source endpoint
+via IP lookup (`__lookup_ip4_endpoint` / `__lookup_ip6_endpoint`, so
+both v4 and v6 are handled); if `ep->vlan_id` is non-zero, it pushes
+the corresponding 802.1Q tag via `skb_vlan_push`. Source-IP lookup is
+chosen over ifindex/skb metadata because at `cil_to_netdev` the frame
+has already been routed onto the trunk NIC and no per-veth context
+remains; the source IP at this point is the post-policy, post-NAT
+address that uniquely identifies a local endpoint.
+
+```
+          Ingress                            Egress
+     ─────────────────                  ─────────────────
+    parent NIC (trunk)                 parent NIC (trunk)
+             │                                  ▲
+             ▼                                  │
+  ┌─────────────────────┐            ┌─────────────────────┐
+  │  cil_from_netdev    │            │  cil_to_netdev      │
+  │                     │            │                     │
+  │  vlan_present?      │            │ ep = lookup(src_ip) │
+  │  ├─ allow_vlan()    │            │ if ep->vlan_id:     │
+  │  │  → pass to kern  │            │   skb_vlan_push()   │
+  │  ├─ endpoint_vlan?  │            └─────────────────────┘
+  │  │  → skb_vlan_pop  │                       ▲
+  │  └─ else            │                       │
+  │     → DROP          │                  veth ← pod
+  └──────────┬──────────┘
+             ▼
+  ┌─────────────────────┐
+  │  route by dst IP    │
+  └──────────┬──────────┘
+             ▼
+        veth → pod
+```
+
+### Configuration
+
+- New agent flag `--enable-endpoint-vlan`, default `false`.
+- New option constant `option.EnableEndpointVLAN` and matching field on
+  `DaemonConfig`.
+- The toggle is exposed to BPF via the `DECLARE_CONFIG` /
+  `CONFIG(enable_endpoint_vlan)` mechanism (no new `#define` in
+  `node_config.h`); the declaration lives in `bpf/lib/endpoint_vlan.h`
+  and the Go side is wired through `pkg/datapath/config/host_config.go`.
+- The daemon rejects startup when both `--enable-endpoint-vlan` and
+  `--vlan-bpf-bypass` are set.
+
+### Data model
+
+`bpf/lib/eps.h` (where `struct endpoint_info` is defined):
+
+```c
+struct endpoint_info {
+    ...
+    __u16 vlan_id; /* 802.1Q VLAN ID for trunk ENI, 0 = no VLAN */
+    ...
+};
+```
+
+`pkg/maps/lxcmap`:
+
+- `EndpointInfo` gains a `VlanID uint16` field (BTF-aligned to
+  `vlan_id`).
+- `EndpointFrontend` interface gains `GetVlanID() uint16`.
+- `String()` includes `vlan_id=` (only when non-zero) so
+  `cilium bpf endpoint list` shows it.
+
+`pkg/endpoint`:
+
+- `Endpoint` struct gains a `vlanID` field, persisted across restart
+  via `restore.go`, `cache.go`, and `api.go`.
+- API setter validates the range `0..4094` (with `0` meaning "no
+  VLAN"). Out-of-range values are rejected at endpoint create / update
+  time.
+
+### Datapath
+
+`bpf/lib/endpoint_vlan.h` (new) hosts the `DECLARE_CONFIG` for
+`enable_endpoint_vlan` and a single helper:
+
+- `ep_vlan_push_egress(ctx, proto)` resolves the source endpoint via
+  `__lookup_ip4_endpoint` / `__lookup_ip6_endpoint` and, when
+  `ep->vlan_id != 0`, calls `skb_vlan_push(ctx, ETH_P_8021Q,
+  ep->vlan_id)`. Returns `CTX_ACT_OK` on success or no-op, or a negative
+  drop reason on failure.
+
+Ingress stripping does not get its own wrapper: `cil_from_netdev` calls
+`skb_vlan_pop(ctx)` directly inside the decision tree below.
+
+`bpf_host.c`:
+
+- `cil_from_netdev`: when `CONFIG(enable_endpoint_vlan)` is set and
+  `ctx->vlan_present` is true (and `allow_vlan` did not match), call
+  `skb_vlan_pop(ctx)` so BPF then routes the packet based on the
+  destination IP as usual.
+- `cil_to_netdev`: after BPF processing, when
+  `CONFIG(enable_endpoint_vlan) && !ctx->vlan_present`, call
+  `ep_vlan_push_egress(ctx, proto)`.
+
+### CNI integration
+
+This CFP adds a `vlan-id` field to `EndpointChangeRequest` in
+`api/v1/openapi.yaml`:
+
+```yaml
+# api/v1/openapi.yaml – EndpointChangeRequest properties
+vlan-id:
+  description: >-
+    802.1Q VLAN ID for endpoint traffic isolation. When set, VLAN tags
+    are applied at the network boundary for this endpoint.
+    0 means no VLAN tagging.
+  type: integer
+```
+
+Cilium itself does not manage, allocate, or discover VLANs. Upstream
+CNI plugins that wish to use this feature must populate the `vlan-id`
+field when creating or updating endpoints through the Cilium API.
+
+The field is optional and defaults to `0` (no VLAN), so older clients
+and existing CNI integrations continue to work unchanged. Older agents
+that predate this CFP ignore the field on the wire.
+
+### Testing
+
+- BPF unit test `bpf/tests/tc_endpoint_vlan.c` covers ingress VLAN
+  pop, egress VLAN push (v4 and v6), and the no-VLAN passthrough case.
+- Go unit tests for the new `lxcmap` field and for option validation
+  (mutual-exclusion with `--vlan-bpf-bypass`).
+
+## Impacts / Key Questions
+
+### Impact: mutual exclusion with `--vlan-bpf-bypass`
+
+`--vlan-bpf-bypass` (`VLAN_FILTER`) and `--enable-endpoint-vlan` have
+opposite semantics and cannot coexist:
+
+| | `--vlan-bpf-bypass` | `--enable-endpoint-vlan` |
+|---|---|---|
+| **Purpose** | Let listed VLANs bypass BPF entirely | Let BPF strip/push VLAN tags per endpoint |
+| **BPF involvement** | None: tagged frames go straight to kernel sub-interfaces | Full: BPF routes by dst IP after stripping the tag |
+| **Policy / observability** | Skipped for bypassed VLANs | Fully enforced |
+
+Enabling both is a configuration error; the daemon will refuse to start.
+
+**Ingress decision tree in `cil_from_netdev`:**
+
+```
+ctx->vlan_present?
+  ├─ YES
+  │   ├─ allow_vlan(ifindex, vlan_id)?   ← VLAN_FILTER macro
+  │   │   └─ YES → return CTX_ACT_OK (pass to kernel sub-interface)
+  │   ├─ CONFIG(enable_endpoint_vlan)?
+  │   │   └─ YES → skb_vlan_pop() → continue BPF routing by dst IP
+  │   └─ else → DROP_VLAN_FILTERED
+  └─ NO → normal BPF processing
+```
+
+The `VLAN_FILTER` macro is generated at compile time from the
+`--vlan-bpf-bypass` VLAN ID list and expands inline to the
+`allow_vlan(ifindex, vlan_id)` predicate referenced above. When
+`--vlan-bpf-bypass` is not set, the macro expands to `return false`,
+so `allow_vlan()` never matches and the branch is dead code. This means
+when only `--enable-endpoint-vlan` is active, every VLAN-tagged frame
+hits the `skb_vlan_pop` path; no traffic silently bypasses BPF.
+
+**Egress path in `cil_to_netdev`:** after all BPF processing completes,
+if `CONFIG(enable_endpoint_vlan)` is set and `ctx->vlan_present` is
+false, the helper `ep_vlan_push_egress` resolves the source endpoint
+by IP (v4 or v6) and pushes its `ep->vlan_id` via `skb_vlan_push`.
+
+### Impact: relies on VLAN skb helpers
+
+This design uses the existing kernel helpers for VLAN pop and push at
+the host-device hook. The implementation should verify that the selected
+hook point observes the VLAN information needed for ingress pop and can
+emit the configured endpoint VLAN on egress. Environment-specific
+behavior should be validated in the implementation PR on the target
+deployments.
+
+### Impact: performance
+
+When `--enable-endpoint-vlan` is disabled, the existing datapath is
+unchanged. When enabled, VLAN handling is limited to traffic crossing
+the parent device: ingress strips the tag once before normal BPF
+processing, and egress pushes the configured endpoint VLAN once after
+normal BPF processing. Intra-node pod traffic continues to use the veth
+path and does not take the VLAN path.
+
+The expected runtime cost is therefore bounded to the enabled
+trunk-device path: one `skb_vlan_pop` per tagged ingress frame, and one
+endpoint lookup plus one `skb_vlan_push` per untagged egress frame.
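
The per-packet work described above is a short branch sequence on each path. The following is a minimal userspace sketch of that logic, not the BPF implementation: `struct frame`, `struct endpoint`, `struct datapath`, the `V_*` verdicts, and `egress_vlan_to_push` are hypothetical names invented for illustration, the endpoint table stands in for `lxcmap`, and only IPv4 is modelled. The bypass list and the endpoint-VLAN toggle share one struct purely so that every branch can be exercised; the daemon rejects that combination.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical verdicts; the real hooks return CTX_ACT_OK or a drop reason. */
enum verdict {
    V_NORMAL,         /* untagged frame: normal BPF processing */
    V_PASS_TO_KERNEL, /* bypass-list match: kernel sub-interface handles it */
    V_POP_AND_ROUTE,  /* strip the tag, then route by destination IP */
    V_DROP_FILTERED,  /* tagged, no bypass match, feature disabled */
};

/* Simplified stand-in for the skb fields the hooks consult. */
struct frame {
    bool     vlan_present;
    uint16_t vlan_id;
    uint32_t src_ip; /* IPv4 only, to keep the sketch short */
};

/* Simplified stand-in for an lxcmap entry. */
struct endpoint {
    uint32_t ip;
    uint16_t vlan_id; /* 0 = no VLAN */
};

struct datapath {
    bool                   enable_endpoint_vlan;
    const uint16_t        *bypass_vlans; /* models the VLAN_FILTER list */
    size_t                 n_bypass;
    const struct endpoint *endpoints;    /* models lxcmap */
    size_t                 n_endpoints;
};

/* Models allow_vlan(): true only for VLAN IDs on the bypass list. */
static bool allow_vlan(const struct datapath *dp, uint16_t vlan_id)
{
    for (size_t i = 0; i < dp->n_bypass; i++)
        if (dp->bypass_vlans[i] == vlan_id)
            return true;
    return false;
}

/* Ingress decision tree, evaluated in the same order as the diagram. */
static enum verdict ingress_verdict(const struct datapath *dp,
                                    const struct frame *f)
{
    if (!f->vlan_present)
        return V_NORMAL;
    if (allow_vlan(dp, f->vlan_id))
        return V_PASS_TO_KERNEL;
    if (dp->enable_endpoint_vlan)
        return V_POP_AND_ROUTE;
    return V_DROP_FILTERED;
}

/* Egress: returns the VLAN ID to push, or 0 when nothing is pushed. */
static uint16_t egress_vlan_to_push(const struct datapath *dp,
                                    const struct frame *f)
{
    if (!dp->enable_endpoint_vlan || f->vlan_present)
        return 0;
    for (size_t i = 0; i < dp->n_endpoints; i++)
        if (dp->endpoints[i].ip == f->src_ip)
            return dp->endpoints[i].vlan_id;
    return 0; /* source IP is not a local endpoint */
}

/* Usage example: one endpoint on VLAN 42, feature enabled, no bypass list. */
static void sketch_self_check(void)
{
    const struct endpoint eps[] = { { 0x0a000001u, 42 } };
    const struct datapath dp = { true, NULL, 0, eps, 1 };
    const struct frame in = { true, 42, 0x0a0000feu };
    const struct frame out = { false, 0, 0x0a000001u };

    assert(ingress_verdict(&dp, &in) == V_POP_AND_ROUTE);
    assert(egress_vlan_to_push(&dp, &out) == 42);
}
```

In this model the ingress side never consults the endpoint table, and the egress side consults it at most once per frame, which matches the cost described in this section.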
+
+### Impact: VLAN spoofing
+
+VLAN tags in this design serve as a datapath-level routing label, not a
+security boundary. Network isolation between pods is enforced by
+NetworkPolicy, not by VLAN membership.
+
+**Ingress:** traffic arrives from the physical network through the trunk
+NIC. The VLAN tag is set by the upstream switch or hypervisor and is not
+controlled by the pod, so spoofing is not a concern on this path. The
+ingress decision tree therefore strips the tag unconditionally (when
+endpoint-VLAN is enabled and the frame did not match the bypass list)
+without verifying that the tag matches the destination endpoint's
+configured VLAN. Endpoint-to-VLAN binding correctness is enforced at
+the upstream switch/hypervisor and on egress (below); duplicating it on
+ingress would cost a per-packet endpoint lookup with no security
+benefit on this trust boundary.
+
+**Egress:** for every frame that reaches `cil_to_netdev` untagged
+(`!ctx->vlan_present`), the hook performs a source endpoint lookup and
+pushes `ep->vlan_id`, so the tag on the wire comes from the endpoint's
+configuration and not from the pod. Frames that still carry a tag at
+this hook skip the push under that same guard, so they are not
+re-tagged here. The push path adds negligible overhead: the lookup is
+a single `lxcmap` hash hit (already cache-hot) and `skb_vlan_push`
+operates on skb metadata without touching packet data. Pods on the same
+node communicate through their veth pairs without traversing the trunk
+NIC, so VLAN tags are never involved in intra-node traffic.
+
+## Future Milestones
+
+### Hubble visibility
+
+Surface `vlan_id` as a field on `Flow` (in addition to the drop reason)
+so users can filter and group flows by VLAN.