cilium · l1b0k · May 11, 2026
diff --git a/cilium/CFP-34702-per-endpoint-vlan.md b/cilium/CFP-34702-per-endpoint-vlan.md
@@ -0,0 +1,321 @@
+# CFP-34702: Per-Endpoint VLAN support in datapath
+
+**SIG: SIG-DATAPATH**
+
+**Begin Design Discussion:** 2026-05-11
+
+**Cilium Release:** 1.21
+
+**Authors:** l1b0k <libokang.dev@gmail.com>
+
+**Status:** Draft
+
+## Summary
+
+Associate each Cilium-managed endpoint with an 802.1Q VLAN ID. On ingress
+the datapath strips the VLAN tag so that BPF can route based on the
+destination IP. On egress the datapath pushes the source endpoint's VLAN
+tag back onto the frame before it leaves through the parent device.
+
+The feature is opt-in via `--enable-endpoint-vlan` and is mutually
+exclusive with `--vlan-bpf-bypass` (`VLAN_FILTER`).
+
+Tracking issue: [cilium/cilium#34702](https://github.com/cilium/cilium/issues/34702).
+Implementation: [cilium/cilium#44619](https://github.com/cilium/cilium/pull/44619).
+
+## Motivation
+
+In several deployment models a single host NIC carries multiple VLANs and
+each pod logically belongs to exactly one of them:
+
+- Cloud environments using trunk ENIs: a single trunk ENI on the host
+  carries traffic for many pods, each tagged with its own VLAN ID.
+- On-prem networks that segment pods across VLAN sub-interfaces of a
+  shared uplink.
+
+Today Cilium offers two workarounds, both with significant downsides:
+
+1. Pre-create a Linux VLAN sub-interface per VLAN and attach Cilium to
+   each. This multiplies datapath state per node and complicates IPAM
+   and BPF map management.
+2. Use `--vlan-bpf-bypass` to let VLAN-tagged traffic skip BPF. This
+   gives up policy enforcement and observability on exactly the traffic
+   that needs it.
+
+There is no way today to keep one tc/tcx attachment on the parent NIC
+while still letting BPF know which VLAN each endpoint belongs to. This
+CFP closes that gap so that policy, identity, and observability work
+uniformly across VLANs sharing a single parent NIC.
+
+## Goals
+
+- Persist a `vlan_id` attribute on each endpoint, exposed to BPF via the
+  existing `lxcmap` `endpoint_info`.
+- On ingress, strip the VLAN tag before further BPF processing so that
+  routing decisions are based on the destination IP.
+- On egress, push the VLAN tag based on the source endpoint's `vlan_id`.
+- No change to existing data flow paths. Intra-node pod-to-pod traffic
+  continues to use veth pairs and is not subject to VLAN tagging. VLAN
+  operations only apply to traffic traversing the trunk NIC.
+- Expose an API surface (`vlan-id` on `EndpointChangeRequest`) so that
+  upstream CNI plugins can populate the VLAN ID. Cilium does not own
+  VLAN allocation or chaining.
+- Keep the feature opt-in and off by default.
+
+## Non-Goals
+
+- VLAN discovery, VLAN allocation policy, and IP-to-VLAN mapping. These
+  remain the responsibility of the upstream CNI plugin.
+- Extending Cilium-native IPAM to model VLANs as first-class objects.
+- Coexistence with `--vlan-bpf-bypass`. The two are mutually exclusive
+  by design (see Impacts).
+- QinQ / double-tagged VLANs.
+- Interactions with VXLAN / Geneve encapsulation. Overlay modes are out
+  of scope for this CFP.
+
+## Proposal
+
+### Overview
+
+Add a per-endpoint `vlan_id` (uint16) that flows from the CNI plugin
+into the daemon, into `lxcmap`, and finally into BPF. Extend the tc/tcx
+hooks on the parent NIC to strip VLAN tags on ingress and push them on
+egress:
+
+**Ingress (`cil_from_netdev`):** when a VLAN-tagged frame arrives, check
+`ctx->vlan_present` and extract the VLAN ID. If `allow_vlan(ifindex,
+vlan_id)` returns true (the predicate generated by the existing
+`VLAN_FILTER` macro from `--vlan-bpf-bypass`), pass it to the kernel
+for sub-interface handling. Otherwise, if `CONFIG(enable_endpoint_vlan)`
+is set, call `skb_vlan_pop` to strip the tag so that BPF routes the
+packet by destination IP as usual. If neither condition matches, drop
+with `DROP_VLAN_FILTERED`. This is the drop reason the existing
+bypass-list path already uses, so operators need no new metric for
+tagged frames that nothing is configured to handle.
+
+**Egress (`cil_to_netdev`):** after all BPF processing is done, if
+`CONFIG(enable_endpoint_vlan)` is set and `ctx->vlan_present` is false,
+call `ep_vlan_push_egress`. This helper resolves the source endpoint
+via IP lookup (`__lookup_ip4_endpoint` / `__lookup_ip6_endpoint`, so
+both v4 and v6 are handled); if `ep->vlan_id` is non-zero, it pushes
+the corresponding 802.1Q tag via `skb_vlan_push`. Source-IP lookup is
+chosen over ifindex/skb metadata because at `cil_to_netdev` the frame
+has already been routed onto the trunk NIC and no per-veth context
+remains; the source IP at this point is the post-policy, post-NAT
+address that uniquely identifies a local endpoint.
+
+```
+              Ingress                          Egress
+       ─────────────────              ─────────────────
+       parent NIC (trunk)              parent NIC (trunk)
+              │                               ▲
+              ▼                               │
+   ┌─────────────────────┐        ┌─────────────────────┐
+   │ cil_from_netdev     │        │ cil_to_netdev       │
+   │                     │        │                     │
+   │ vlan_present?       │        │ ep = lookup(src_ip) │
+   │  ├─ allow_vlan()    │        │ if ep->vlan_id:     │
+   │  │   → pass to kern │        │   skb_vlan_push()   │
+   │  ├─ endpoint_vlan?  │        └─────────────────────┘
+   │  │   → skb_vlan_pop │                  ▲
+   │  └─ else            │                  │
+   │      → DROP         │            veth ← pod
+   └──────────┬──────────┘
+              ▼
+   ┌─────────────────────┐
+   │ route by dst IP     │
+   └──────────┬──────────┘
+              ▼
+        veth → pod
+```
+
+### Configuration
+
+- New agent flag `--enable-endpoint-vlan`, default `false`.
+- New option constant `option.EnableEndpointVLAN` and matching field on
+  `DaemonConfig`.
+- The toggle is exposed to BPF via the `DECLARE_CONFIG` /
+  `CONFIG(enable_endpoint_vlan)` mechanism (no new `#define` in
+  `node_config.h`); the declaration lives in `bpf/lib/endpoint_vlan.h`
+  and the Go side is wired through `pkg/datapath/config/host_config.go`.
+- The daemon rejects startup when both `--enable-endpoint-vlan` and
+  `--vlan-bpf-bypass` are set.
+
+### Data model
+
+`bpf/lib/eps.h` (where `struct endpoint_info` is defined):
+
+```c
+struct endpoint_info {
+    ...
+    __u16 vlan_id;    /* 802.1Q VLAN ID for trunk ENI, 0 = no VLAN */
+    ...
+};
+```
+
+`pkg/maps/lxcmap`:
+
+- `EndpointInfo` gains a `VlanID uint16` field (BTF-aligned to
+  `vlan_id`).
+- `EndpointFrontend` interface gains `GetVlanID() uint16`.
+- `String()` includes `vlan_id=` (only when non-zero) so
+  `cilium bpf endpoint list` shows it.
+
+`pkg/endpoint`:
+
+- `Endpoint` struct gains a `vlanID` field, persisted across restart
+  via `restore.go`, `cache.go`, and `api.go`.
+- API setter validates the range `0..4094` (with `0` meaning "no
+  VLAN"). Out-of-range values are rejected at endpoint create / update
+  time.
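+
+As a rough illustration of the range rule above, the following is a
+minimal userspace C sketch, not the agent's Go setter. The names
+`vlan_id_valid`, `vlan_tci_to_vid`, and the `VLAN_*` constants are
+hypothetical; only the `0..4094` range and the "0 = no VLAN" meaning
+come from this CFP. The 12-bit VID mask is standard 802.1Q.
+
+```c
+#include <stdbool.h>
+#include <stdint.h>
+
+/* 802.1Q: the VLAN ID (VID) is the low 12 bits of the 16-bit TCI. */
+#define VLAN_VID_MASK 0x0fffu
+#define VLAN_ID_NONE  0u     /* CFP: 0 means "no VLAN" */
+#define VLAN_ID_MAX   4094u  /* 4095 (0xfff) is reserved by 802.1Q */
+
+/* Hypothetical mirror of the API setter's range check (0..4094). */
+static inline bool vlan_id_valid(uint32_t id)
+{
+    return id <= VLAN_ID_MAX;
+}
+
+/* Extract the VID from a TCI, dropping the PCP and DEI bits. */
+static inline uint16_t vlan_tci_to_vid(uint16_t tci)
+{
+    return (uint16_t)(tci & VLAN_VID_MASK);
+}
+```
+
+Taking a `uint32_t` in the check lets a caller reject values that would
+otherwise be silently truncated when stored in the `__u16 vlan_id`
+field.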
+
+### Datapath
+
+`bpf/lib/endpoint_vlan.h` (new) hosts the `DECLARE_CONFIG` for
+`enable_endpoint_vlan` and a single helper:
+
+- `ep_vlan_push_egress(ctx, proto)` resolves the source endpoint via
+  `__lookup_ip4_endpoint` / `__lookup_ip6_endpoint` and, when
+  `ep->vlan_id != 0`, calls `skb_vlan_push(ctx, ETH_P_8021Q,
+  ep->vlan_id)`. Returns `CTX_ACT_OK` on success or no-op, or a negative
+  drop reason on failure.
+
+Ingress stripping does not get its own wrapper: `cil_from_netdev` calls
+`skb_vlan_pop(ctx)` directly inside the decision tree below.
+
+`bpf_host.c`:
+
+- `cil_from_netdev`: when `CONFIG(enable_endpoint_vlan)` is set and
+  `ctx->vlan_present` is true (and `allow_vlan` did not match), call
+  `skb_vlan_pop(ctx)` so BPF then routes the packet based on the
+  destination IP as usual.
+- `cil_to_netdev`: after BPF processing, when
+  `CONFIG(enable_endpoint_vlan) && !ctx->vlan_present`, call
+  `ep_vlan_push_egress(ctx, proto)`.
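+
+The egress rule can be modelled in plain userspace C as shown here.
+This is a sketch under stated assumptions, not the BPF helper itself:
+`struct ep_entry`, `struct frame`, `lookup_ep`, and `egress_vlan_push`
+are hypothetical stand-ins for `endpoint_info`, the skb context, the
+`__lookup_ip4_endpoint` call, and `ep_vlan_push_egress`, and the linear
+scan replaces the real map lookup.
+
+```c
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+
+/* Toy endpoint table entry: IPv4 address and configured VLAN ID. */
+struct ep_entry { uint32_t ip4; uint16_t vlan_id; };
+
+/* Toy frame: only the fields the egress decision reads or writes. */
+struct frame { uint32_t src_ip4; bool vlan_present; uint16_t vlan_tci; };
+
+static const struct ep_entry *
+lookup_ep(const struct ep_entry *eps, size_t n, uint32_t ip4)
+{
+    for (size_t i = 0; i < n; i++)
+        if (eps[i].ip4 == ip4)
+            return &eps[i];
+    return NULL;
+}
+
+/* Returns true when a tag was pushed onto the frame. */
+static bool
+egress_vlan_push(struct frame *f, const struct ep_entry *eps, size_t n)
+{
+    const struct ep_entry *ep;
+
+    if (f->vlan_present)
+        return false;   /* push is gated on !ctx->vlan_present */
+    ep = lookup_ep(eps, n, f->src_ip4);
+    if (!ep || ep->vlan_id == 0)
+        return false;   /* not a local endpoint, or no VLAN set */
+    f->vlan_present = true;
+    f->vlan_tci = ep->vlan_id;
+    return true;
+}
+```
+
+In this model, host-originated traffic and endpoints with `vlan_id == 0`
+leave the frame untouched, which corresponds to the "no-op" return
+described for `ep_vlan_push_egress`.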
+
+
+
+### CNI integration
+
+This CFP adds a `vlan-id` field to `EndpointChangeRequest` in
+`api/v1/openapi.yaml`:
+
+```yaml
+# api/v1/openapi.yaml – EndpointChangeRequest properties
+vlan-id:
+  description: >-
+    802.1Q VLAN ID for endpoint traffic isolation. When set, VLAN tags
+    are applied at the network boundary for this endpoint.
+    0 means no VLAN tagging.
+  type: integer
+```
+
+Cilium itself does not manage, allocate, or discover VLANs. Upstream
+CNI plugins that wish to use this feature must populate the `vlan-id`
+field when creating or updating endpoints through the Cilium API.
+
+The field is optional and defaults to `0` (no VLAN), so older clients
+and existing CNI integrations continue to work unchanged. Older agents
+that predate this CFP ignore the field on the wire.
+
+### Testing
+
+- BPF unit test `bpf/tests/tc_endpoint_vlan.c` covers ingress VLAN
+  pop, egress VLAN push (v4 and v6), and the no-VLAN passthrough case.
+- Go unit tests for the new `lxcmap` field and for option validation
+  (mutual exclusion with `--vlan-bpf-bypass`).
+
+## Impacts / Key Questions
+
+### Impact: mutual exclusion with `--vlan-bpf-bypass`
+
+`--vlan-bpf-bypass` (`VLAN_FILTER`) and `--enable-endpoint-vlan` have
+opposite semantics and cannot coexist:
+
+| | `--vlan-bpf-bypass` | `--enable-endpoint-vlan` |
+|---|---|---|
+| **Purpose** | Let listed VLANs bypass BPF entirely | Let BPF strip/push VLAN tags per endpoint |
+| **BPF involvement** | None — tagged frames go straight to kernel sub-interfaces | Full — BPF routes by dst IP after stripping the tag |
+| **Policy / observability** | Skipped for bypassed VLANs | Fully enforced |
+
+Enabling both is a configuration error; the daemon will refuse to start.
+
+**Ingress decision tree in `cil_from_netdev`:**
+
+```
+ctx->vlan_present?
+ ├─ YES
+ │   ├─ allow_vlan(ifindex, vlan_id)?          ← VLAN_FILTER macro
+ │   │   └─ YES → return CTX_ACT_OK (pass to kernel sub-interface)
+ │   ├─ CONFIG(enable_endpoint_vlan)?
+ │   │   └─ YES → skb_vlan_pop() → continue BPF routing by dst IP
+ │   └─ else → DROP_VLAN_FILTERED
+ └─ NO → normal BPF processing
+```
+
+The `VLAN_FILTER` macro is generated at compile time from the
+`--vlan-bpf-bypass` VLAN ID list and expands inline to the
+`allow_vlan(ifindex, vlan_id)` predicate referenced above. When
+`--vlan-bpf-bypass` is not set, the macro expands to `return false`,
+so `allow_vlan()` never matches and the branch is dead code. This means
+when only `--enable-endpoint-vlan` is active, every VLAN-tagged frame
+hits the `skb_vlan_pop` path — no traffic silently bypasses BPF.
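+
+The ingress decision tree can be written as a pure function, which
+makes the branch order easy to unit test. The sketch below is a
+userspace C model only: `enum vlan_verdict` and `ingress_vlan_verdict`
+are hypothetical names, and the three booleans stand in for
+`ctx->vlan_present`, the result of `allow_vlan(ifindex, vlan_id)`, and
+`CONFIG(enable_endpoint_vlan)`.
+
+```c
+#include <stdbool.h>
+
+enum vlan_verdict {
+    VLAN_CONTINUE,       /* untagged: normal BPF processing */
+    VLAN_PASS_TO_STACK,  /* bypass list matched: CTX_ACT_OK */
+    VLAN_POP_AND_ROUTE,  /* skb_vlan_pop(), then route by dst IP */
+    VLAN_DROP,           /* DROP_VLAN_FILTERED */
+};
+
+/* Branch order mirrors the decision tree: bypass list first, then
+ * the endpoint-VLAN toggle, then drop. */
+static enum vlan_verdict
+ingress_vlan_verdict(bool vlan_present, bool bypass_match,
+                     bool endpoint_vlan_enabled)
+{
+    if (!vlan_present)
+        return VLAN_CONTINUE;
+    if (bypass_match)
+        return VLAN_PASS_TO_STACK;
+    if (endpoint_vlan_enabled)
+        return VLAN_POP_AND_ROUTE;
+    return VLAN_DROP;
+}
+```
+
+Because the two flags are mutually exclusive, `bypass_match` and
+`endpoint_vlan_enabled` are never both true in a running agent; the
+model still orders the checks as the tree does.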
+
+**Egress path in `cil_to_netdev`:** after all BPF processing completes,
+if `CONFIG(enable_endpoint_vlan)` is set and `ctx->vlan_present` is
+false, the helper `ep_vlan_push_egress` resolves the source endpoint
+by IP (v4 or v6) and pushes its `ep->vlan_id` via `skb_vlan_push`.
+
+### Impact: relies on VLAN skb helpers
+
+This design uses the existing kernel helpers for VLAN pop and push
+(`bpf_skb_vlan_pop` and `bpf_skb_vlan_push`) at the host-device hook.
+The implementation should verify that the selected
+hook point observes the VLAN information needed for ingress pop and can
+emit the configured endpoint VLAN on egress. Environment-specific
+behavior should be validated in the implementation PR on the target
+deployments.
+
+### Impact: performance
+
+When `--enable-endpoint-vlan` is disabled, the existing datapath is
+unchanged. When enabled, VLAN handling is limited to traffic crossing
+the parent device: ingress strips the tag once before normal BPF
+processing, and egress pushes the configured endpoint VLAN once after
+normal BPF processing. Intra-node pod traffic continues to use the veth
+path and does not take the VLAN path.
+
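+The expected per-packet cost on the trunk-device path is therefore
+bounded: one `skb_vlan_pop` for each tagged ingress frame, and one
+source-endpoint lookup plus at most one `skb_vlan_push` for each
+untagged egress frame.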
+
+### Impact: VLAN spoofing
+
+VLAN tags in this design serve as a datapath-level routing label, not a
+security boundary. Network isolation between pods is enforced by
+NetworkPolicy, not by VLAN membership.
+
+**Ingress:** traffic arrives from the physical network through the trunk
+NIC. The VLAN tag is set by the upstream switch or hypervisor and is not
+controlled by the pod, so spoofing is not a concern on this path. The
+ingress decision tree therefore strips the tag unconditionally (when
+endpoint-VLAN is enabled and the frame did not match the bypass list)
+without verifying that the tag matches the destination endpoint's
+configured VLAN. Endpoint-to-VLAN binding correctness is enforced at
+the upstream switch/hypervisor and on egress (below); duplicating it on
+ingress would cost a per-packet endpoint lookup with no security
+benefit on this trust boundary.
+
+**Egress:** `cil_to_netdev` performs a source endpoint lookup and
+pushes `ep->vlan_id` onto every untagged frame whose source endpoint
+has a non-zero `vlan_id`, so the tag on the wire is chosen by Cilium
+from endpoint state rather than by the pod. Because the push is gated
+on `!ctx->vlan_present` (see Datapath), a frame that still carries a
+tag at this hook is not re-tagged by this helper. The overhead is
+small: the lookup is a single `lxcmap` hash lookup (already cache-hot),
+and `skb_vlan_push` on an untagged frame records the tag in skb
+metadata without touching packet data. Pods on the same node
+communicate through their veth pairs without traversing the trunk NIC,
+so VLAN tags are never involved in intra-node traffic.
+
+## Future Milestones
+
+### Hubble visibility
+
+Surface `vlan_id` as a field on `Flow` (in addition to the drop reason)
+so users can filter and group flows by VLAN.