| 1 | +# CFP-XXX: Shared Policy LPM |
| 2 | + |
| 3 | +**SIG: SIG-Policy, SIG-Datapath, SIG-Scalability** |
| 4 | + |
| 5 | +**Begin Design Discussion:** 2026-04-01 |
| 6 | + |
| 7 | +**Cilium Release:** 1.20+ |
| 8 | + |
| 9 | +**Authors:** Tsotne Chakhvadze <tsotne@google.com> |
| 10 | + |
| 11 | +**Status:** Proposal |
| 12 | + |
| 13 | +# Sharing Policy Maps to Save Memory (Without BPF Arena) |
| 14 | + |
| 15 | +## Summary |
| 16 | + |
| 17 | +This proposal introduces a **Shared LPM Trie architecture** to deduplicate network policy rules across endpoints on a single node. |
| 18 | + |
| 19 | +Currently, Cilium creates a private policy map for each endpoint. If multiple pods run the same application profile (e.g., 1000 replicas of a microservice), Cilium replicates the same rules for every endpoint. This proposal adds a single, node-wide Shared LPM Trie map in which endpoints with identical policy share one set of rules, significantly reducing policy map memory on the node. |
| 20 | + |
| 21 | +This is a practical, immediate alternative to **BPF Arena**, as it works on older kernels and avoids complex pointer management. |
| 22 | + |
| 23 | +## Motivation |
| 24 | + |
| 25 | +As Kubernetes clusters scale to thousands of nodes and tens of thousands of pods, Cilium's per-endpoint policy architecture faces critical scaling bottlenecks: |
| 26 | + |
| 27 | +1. **Per-Endpoint Map Exhaustion**: Each endpoint has a fixed-size policy map (typically 16,384 entries). Complex environments with fine-grained segmentation can exceed this limit, causing policy drops or failures to attach. |
| 28 | +2. **Node Memory Pressure**: Replicating the same rule set across thousands of pods consumes gigabytes of memory per node. This reduces the memory available for user applications and increases infrastructure costs. |
| 29 | +3. **BPF Arena Blockers**: While BPF Arena can solve this, it requires very new kernels (Linux 6.9+) and introduces native pointer complexities that slow down adoption. |
| 30 | + |
| 31 | +Most of the memory savings (estimated at ~99%) come from **Rule Set Sharing** (Level 1 Deduplication). We can deliver this immediately using a standard, broadly supported BPF primitive (`BPF_MAP_TYPE_LPM_TRIE`) that works on older kernels. This lets us scale to massive clusters without hitting map limits or wasting node RAM. |
| 32 | + |
| 33 | +## Goals |
| 34 | + |
| 35 | +* **Memory Efficiency**: Reduce node memory footprint by sharing rule sets across identical endpoints. |
| 36 | +* **Kernel Compatibility**: Support older kernels without requiring Arena. |
| 37 | +* **Safety**: Update rules without dropping packets and without ever letting endpoints with different policies share a rule set. |
| 38 | + |
| 39 | +## Non-Goals |
| 40 | + |
| 41 | +* Per-rule fine-grained pointer sharing (Level 2 Deduplication), which remains the domain of future BPF Arena enhancements. |
| 42 | + |
| 43 | +## Proposal |
| 44 | + |
| 45 | +### Overview |
| 46 | + |
| 47 | +The design replaces the private per-endpoint policy map with a two-stage lookup mechanism using two BPF maps: |
| 48 | + |
| 49 | +1. **Overlay Map** (Hash): Maps an `Endpoint ID` to a `Rule Set ID`. |
| 50 | +2. **Shared LPM Trie Map** (Global): Maps `Rule Set ID` + packet details to a verdict. |
| 51 | + |
| 52 | +```mermaid |
| 53 | +graph LR |
| 54 | + subgraph Traffic ["Packet Flow"] |
| 55 | + direction TB |
| 56 | + EP["Endpoint ID: 42"] |
| 57 | + Pkt["[Identity: 100, Port: 80]"] |
| 58 | + end |
| 59 | + |
| 60 | + subgraph Overlay ["Step 1: Get Rule Set ID"] |
| 61 | + direction TB |
| 62 | + EP_Key["Lookup Ep 42"] |
| 63 | + EP_Val["Returns: Rule Set 7"] |
| 64 | + end |
| 65 | + |
| 66 | + subgraph SharedLPM ["Step 2: Evaluate Rules"] |
| 67 | + direction TB |
| 68 | + LPM_Key["Match Set 7 + Identity 100 + Port 80"] |
| 69 | + LPM_Val["Verdict: ALLOW"] |
| 70 | + end |
| 71 | + |
| 72 | + EP --> EP_Key |
| 73 | + EP_Val --> LPM_Key |
| 74 | + Pkt --> LPM_Key |
| 75 | + LPM_Key --> LPM_Val |
| 76 | +``` |
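
As a minimal sketch of the two-stage lookup in the diagram, the Go program below models both maps with ordinary Go maps. It is illustrative only: the real datapath would be BPF C, the second stage would use longest-prefix matching rather than the exact match shown here, and the names `ruleKey` and `lookupVerdict` as well as the deny-on-miss fallback are assumptions of this sketch, not part of the proposal.

```go
package main

import "fmt"

// ruleKey is a simplified, exact-match stand-in for the shared LPM key.
type ruleKey struct {
	RuleSetID uint32
	Identity  uint32
	Port      uint16
}

// overlay models the Overlay Map: Endpoint ID -> Rule Set ID.
var overlay = map[uint16]uint32{42: 7}

// shared models the Shared LPM Trie Map: (Rule Set ID, packet details) -> allow.
var shared = map[ruleKey]bool{
	{RuleSetID: 7, Identity: 100, Port: 80}: true,
}

// lookupVerdict performs the two lookups in order. A miss at either stage is
// treated as a deny, which is an assumption of this sketch.
func lookupVerdict(endpointID uint16, identity uint32, port uint16) string {
	ruleSetID, ok := overlay[endpointID] // Step 1: get the Rule Set ID
	if !ok {
		return "DENY"
	}
	if shared[ruleKey{ruleSetID, identity, port}] { // Step 2: evaluate rules
		return "ALLOW"
	}
	return "DENY"
}

func main() {
	fmt.Println(lookupVerdict(42, 100, 80))  // ALLOW
	fmt.Println(lookupVerdict(42, 100, 443)) // DENY
}
```

Because the Rule Set ID is part of the second-stage key, any number of endpoints can map to Rule Set 7 in the overlay without adding entries to the shared map.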
| 77 | + |
| 78 | +### 1. BPF Maps Structure |
| 79 | + |
| 80 | +#### The Overlay Map (Endpoint → Rule Set ID) |
| 81 | + |
| 82 | +```c |
| 83 | +struct { |
| 84 | + __uint(type, BPF_MAP_TYPE_HASH); |
| 85 | + __type(key, __u16); // Endpoint ID |
| 86 | + __type(value, __u32); // Rule Set ID |
| 87 | + __uint(max_entries, 65535); // Example size |
| 88 | +} cilium_policy_overlay SEC(".maps"); |
| 89 | +``` |
| 90 | + |
| 91 | +#### The Shared Rules Map (LPM Trie) |
| 92 | + |
| 93 | +The key combines the `rule_set_id` with the identity, direction, protocol, and port selectors. |
| 94 | + |
| 95 | +```c |
| 96 | +struct shared_policy_key { |
| 97 | + __u32 prefixlen; // LPM prefix length |
| 98 | + __u32 rule_set_id; // Scopes the rules to a set |
| 99 | + __u32 sec_label; // Security Identity of the remote peer |
| 100 | + __u8 egress; // Direction |
| 101 | + __u8 protocol; // L4 Proto |
| 102 | + __u16 dport; // Dest Port |
| 103 | +} __attribute__((packed)); |
| 104 | + |
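| 105 | +struct policy_entry { |
| 106 | + __be16 proxy_port; // Proxy redirect port (network byte order) |
| 107 | + __u8 deny; // Non-zero for a deny rule |
| 108 | + __u8 wildcard_protocol; // Set when the rule matches any protocol |
| 109 | + __u8 wildcard_dport; // Set when the rule matches any port |
| 110 | + __u16 auth_type; // Required authentication type |
| 111 | + __u8 pad; // Explicit padding |
| 112 | + __u8 pad2; // Explicit padding |
| 113 | +}; |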
| 114 | + |
| 115 | +struct { |
| 116 | + __uint(type, BPF_MAP_TYPE_LPM_TRIE); |
| 117 | + __type(key, struct shared_policy_key); |
| 118 | + __type(value, struct policy_entry); |
| 119 | + __uint(max_entries, 1000000); // Configurable example |
| 120 | + __uint(map_flags, BPF_F_NO_PREALLOC); |
| 121 | +} cilium_policy_shared SEC(".maps"); |
| 122 | +``` |
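
To illustrate how `prefixlen` could scope a match within one rule set, here is a hedged Go emulation of longest-prefix matching over the key layout above. The field widths come from `shared_policy_key` (32 + 32 + 8 + 8 + 16 = 96 bits after `prefixlen`). Big-endian encoding of the multi-byte fields, the helper names (`encode`, `matches`, `longestMatch`), and the verdict strings are assumptions of this sketch; the kernel's LPM trie does the matching in the real datapath.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Bit widths of the key fields that follow prefixlen in shared_policy_key.
const (
	ruleSetBits  = 32
	identityBits = 32
	egressBits   = 8
	protoBits    = 8
	dportBits    = 16

	fullPrefix     = ruleSetBits + identityBits + egressBits + protoBits + dportBits // 96
	portWildPrefix = fullPrefix - dportBits                                          // 80
)

// encode lays out the key data in field order. Big-endian encoding of the
// multi-byte fields is an assumption of this sketch.
func encode(ruleSet, identity uint32, egress, proto uint8, dport uint16) [12]byte {
	var b [12]byte
	binary.BigEndian.PutUint32(b[0:4], ruleSet)
	binary.BigEndian.PutUint32(b[4:8], identity)
	b[8], b[9] = egress, proto
	binary.BigEndian.PutUint16(b[10:12], dport)
	return b
}

// matches reports whether the first prefixLen bits of rule and pkt are equal.
func matches(rule, pkt [12]byte, prefixLen int) bool {
	for i := 0; i < prefixLen; i++ {
		mask := byte(0x80) >> uint(i%8)
		if rule[i/8]&mask != pkt[i/8]&mask {
			return false
		}
	}
	return true
}

type rule struct {
	prefixLen int
	data      [12]byte
	verdict   string
}

// longestMatch returns the verdict of the matching rule with the longest prefix.
func longestMatch(rules []rule, pkt [12]byte) (string, bool) {
	best, verdict := -1, ""
	for _, r := range rules {
		if r.prefixLen > best && matches(r.data, pkt, r.prefixLen) {
			best, verdict = r.prefixLen, r.verdict
		}
	}
	return verdict, best >= 0
}

func main() {
	rules := []rule{
		{portWildPrefix, encode(7, 100, 0, 6, 0), "ALLOW any TCP port"},
		{fullPrefix, encode(7, 100, 0, 6, 22), "DENY port 22"},
	}
	v, _ := longestMatch(rules, encode(7, 100, 0, 6, 80))
	fmt.Println(v) // ALLOW any TCP port
	v, _ = longestMatch(rules, encode(7, 100, 0, 6, 22))
	fmt.Println(v) // DENY port 22
	_, ok := longestMatch(rules, encode(8, 100, 0, 6, 80))
	fmt.Println(ok) // false: Rule Set 8 never sees Rule Set 7's entries
}
```

Placing `rule_set_id` first in the key means every prefix of at least 32 bits is confined to a single rule set, which is what keeps the sets isolated inside one shared trie.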
| 123 | + |
| 124 | +### 2. Userspace Manager (Go) |
| 125 | + |
| 126 | +The Go agent owns Rule Set ID allocation and programs both BPF maps. |
| 127 | + |
| 128 | +* **Finding whether a Rule Set already exists (Zero Collisions)**: |
| 129 | + The agent computes a fingerprint (hash) of each endpoint's rules. When another endpoint's rules have the same fingerprint, the agent compares the two rule sets entry by entry before treating them as identical. If they match, both endpoints share one Rule Set ID; if they differ, the new rule set gets its own ID. A hash collision therefore can never cause two different policies to share rules. |
| 130 | +* **Updating rules without dropping packets**: |
| 131 | + When a rule changes for a group of endpoints, the agent does not edit the live rules in place, since a partially applied edit could briefly drop packets or open a security hole. Instead, the agent: |
| 132 | + 1. Writes a **complete copy** of the new rules into the Shared Map under a new Rule Set ID. |
| 133 | + 2. Updates the Overlay Map so that each affected endpoint points to the new ID. Each update is a single map write, so a lookup sees either the old rule set or the new one, never a mix. |
| 134 | + 3. **Deletes** the old rules once no endpoint references them, freeing the memory. |
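
The deduplication and update steps above can be sketched in Go as follows. This is a hypothetical model, not the agent's actual code: `Manager`, `SetPolicy`, `acquire`, and `release` are illustrative names, plain Go maps stand in for the BPF maps, reference counting is one assumed way to decide when a rule set is unused, and the `slices` package requires Go 1.21 or newer.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"slices"
)

// Rule is a simplified stand-in for one entry of a rule set.
type Rule struct {
	Identity uint32
	Port     uint16
	Deny     bool
}

type ruleSet struct {
	rules    []Rule
	refCount int // number of endpoints pointing at this set
}

// Manager models the agent's bookkeeping; Go maps stand in for the BPF maps.
type Manager struct {
	nextID  uint32
	byHash  map[[32]byte][]uint32 // fingerprint -> candidate Rule Set IDs
	sets    map[uint32]*ruleSet   // stands in for the Shared Map
	overlay map[uint16]uint32     // stands in for the Overlay Map
}

func NewManager() *Manager {
	return &Manager{
		byHash:  map[[32]byte][]uint32{},
		sets:    map[uint32]*ruleSet{},
		overlay: map[uint16]uint32{},
	}
}

// fingerprint hashes the rules in the order given; callers are assumed to
// pass rules in a canonical order.
func fingerprint(rules []Rule) [32]byte {
	return sha256.Sum256([]byte(fmt.Sprint(rules)))
}

// acquire reuses an ID only after a full comparison confirms the match.
func (m *Manager) acquire(rules []Rule) uint32 {
	fp := fingerprint(rules)
	for _, id := range m.byHash[fp] {
		if slices.Equal(m.sets[id].rules, rules) {
			m.sets[id].refCount++
			return id
		}
	}
	m.nextID++
	m.sets[m.nextID] = &ruleSet{rules: slices.Clone(rules), refCount: 1}
	m.byHash[fp] = append(m.byHash[fp], m.nextID)
	return m.nextID
}

// release drops one reference and deletes the set when none remain.
func (m *Manager) release(id uint32) {
	s := m.sets[id]
	if s.refCount--; s.refCount > 0 {
		return
	}
	fp := fingerprint(s.rules)
	m.byHash[fp] = slices.DeleteFunc(m.byHash[fp], func(x uint32) bool { return x == id })
	delete(m.sets, id)
}

// SetPolicy installs the new rules first, then repoints the endpoint, then
// releases the old set, mirroring steps 1 to 3 above.
func (m *Manager) SetPolicy(ep uint16, rules []Rule) uint32 {
	newID := m.acquire(rules) // step 1: new rules exist before the switch
	oldID, had := m.overlay[ep]
	m.overlay[ep] = newID // step 2: single overlay write
	if had {
		m.release(oldID) // step 3: clean up once unreferenced
	}
	return newID
}

func main() {
	m := NewManager()
	web := []Rule{{Identity: 100, Port: 80}}
	a, b := m.SetPolicy(1, web), m.SetPolicy(2, web)
	fmt.Println(a == b, len(m.sets)) // true 1
	c := m.SetPolicy(1, []Rule{{Identity: 100, Port: 443}})
	fmt.Println(c != a, len(m.sets)) // true 2
	m.SetPolicy(2, []Rule{{Identity: 100, Port: 443}})
	fmt.Println(len(m.sets)) // 1
}
```

In this model the old rule set stays fully populated until the last endpoint has moved off it, so no endpoint ever points at a half-written set.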
| 135 | + |
| 136 | +## Impacts / Key Questions |
| 137 | + |
| 138 | +### Impact: Memory Scale |
| 139 | + |
| 140 | +Estimates assume 500 rules per application profile, 10 application profiles, and 1000 pods on the node. |
| 141 | + |
| 142 | +| Architecture | Total Map Entries | Est. Node Memory | |
| 143 | +| :--- | :--- | :--- | |
| 144 | +| **Legacy (Individual maps)** | $500,000\text{ entries}$ | $\approx 50\text{ MB}$ | |
| 145 | +| **Proposed (Shared LPM)** | $5,000\text{ entries}$ | $< 1\text{ MB}$ | |
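
The figures in the table follow from simple arithmetic. As a worked sketch, assuming every entry costs roughly the same amount of memory (about 100 bytes, which is what the table's 50 MB estimate implies rather than a measured value):

$$
\begin{aligned}
N_{\text{legacy}} &= 1000 \text{ pods} \times 500 \text{ rules} = 500{,}000 \text{ entries} \\
N_{\text{shared}} &= 10 \text{ profiles} \times 500 \text{ rules} = 5{,}000 \text{ entries} \\
\text{Reduction} &= 1 - \frac{5{,}000}{500{,}000} = 99\% \\
\text{Bytes per entry} &\approx \frac{50 \text{ MB}}{500{,}000} = 100 \text{ B} \\
M_{\text{shared}} &\approx 5{,}000 \times 100 \text{ B} = 0.5 \text{ MB}
\end{aligned}
$$

The Overlay Map adds one small entry per endpoint (1000 entries in this scenario), which should not change the order of magnitude.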
| 146 | + |
| 147 | +### Key Question: Map Limitations |
| 148 | +The kernel requires the `BPF_F_NO_PREALLOC` flag for LPM tries, and it matters here: memory is allocated per entry on insertion rather than up front, so a high `max_entries` (e.g., millions) does not by itself consume node memory. |
| 149 | + |
| 150 | +## Future Milestones |
| 151 | + |
| 152 | +### BPF Arena Consolidation |
| 153 | +In the future we can transition to Level 2 Deduplication via BPF Arena without breaking the API surface between Go and BPF. |