Commit 4df11fa

Add shared policy lpm trie proposal.
Signed-off-by: Tsotne Chakhvadze <tsotne@google.com>
1 parent 3abd976 commit 4df11fa

1 file changed

Lines changed: 153 additions & 0 deletions

@@ -0,0 +1,153 @@
# CFP-XXX: Shared Policy LPM

**SIG: SIG-Policy, SIG-Datapath, SIG-Scalability**

**Begin Design Discussion:** 2026-04-01

**Cilium Release:** 1.20+

**Authors:** Tsotne Chakhvadze <tsotne@google.com>

**Status:** Proposal

# Sharing Policy Maps to Save Memory (Without BPF Arena)

## Summary

This proposal introduces a **Shared LPM Trie architecture** to deduplicate network policy rules across endpoints on a single node.

Currently, Cilium creates one policy map per endpoint. If many pods run the same application profile (e.g. 1000 replicas of a microservice), Cilium replicates the same rules into every endpoint's map. This proposal replaces those copies with a single node-wide Shared LPM Trie map in which endpoints with identical policy share one rule set, significantly reducing node memory usage.

This is a practical, immediate alternative to **BPF Arena**: it works on older kernels and avoids complex pointer management.

## Motivation

As Kubernetes clusters scale to thousands of nodes and tens of thousands of pods, Cilium's per-endpoint policy architecture faces critical scaling bottlenecks:

1. **Per-Endpoint Map Exhaustion**: Each endpoint has a fixed-size policy map (typically 16,384 entries). Environments with fine-grained segmentation can exceed this limit, causing policy drops or failures to attach.
2. **Node Memory Pressure**: Replicating the same rule set across thousands of pods consumes gigabytes of memory per node. This reduces the memory available for user applications and increases infrastructure costs.
3. **BPF Arena Blockers**: While BPF Arena can solve this, it requires very new kernels (Linux 6.9+) and introduces native pointer complexity that slows adoption.

Most of the memory savings (an estimated ~99%) come from **Rule Set Sharing** (Level 1 Deduplication). We can address this blocker immediately using standard, broadly supported BPF primitives (`BPF_MAP_TYPE_LPM_TRIE`) that work on older kernels. This lets Cilium scale to very large clusters without hitting map limits or wasting node RAM.

## Goals

* **Memory Efficiency**: Reduce node memory footprint by sharing rule sets across identical endpoints.
* **Kernel Compatibility**: Support older kernels without requiring Arena.
* **Safety**: Update rules without dropping packets and without ever applying one application's rules to another.

## Non-Goals

* Per-rule fine-grained pointer sharing (Level 2 Deduplication), which remains the domain of future BPF Arena enhancements.

## Proposal

### Overview

The design replaces the private per-endpoint policy map with a two-stage lookup across two BPF maps:

1. **Overlay Map** (Hash): Maps an `Endpoint ID` to a `Rule Set ID`.
2. **Shared LPM Trie Map** (Global): Maps `Rule Set ID` plus packet attributes (identity, direction, protocol, port) to a verdict.

```mermaid
graph LR
    subgraph Traffic ["Packet Flow"]
        direction TB
        EP["Endpoint ID: 42"]
        Pkt["[Identity: 100, Port: 80]"]
    end

    subgraph Overlay ["Step 1: Get Rule Set ID"]
        direction TB
        EP_Key["Lookup Ep 42"]
        EP_Val["Returns: Rule Set 7"]
    end

    subgraph SharedLPM ["Step 2: Evaluate Rules"]
        direction TB
        LPM_Key["Match Set 7 + Identity 100 + Port 80"]
        LPM_Val["Verdict: ALLOW"]
    end

    EP --> EP_Key
    EP_Val --> LPM_Key
    Pkt --> LPM_Key
    LPM_Key --> LPM_Val
```

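The two stages in the diagram can be modelled in plain userspace C to make the control flow concrete. This is a minimal sketch, not datapath code: the arrays stand in for the two BPF maps, stage 2 uses exact matching instead of longest-prefix matching, and every name here (`overlay_entry`, `shared_rule`, `policy_lookup`) is hypothetical. Denying when no entry is found is an assumption of the sketch, not something the proposal specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum verdict { VERDICT_DENY = 0, VERDICT_ALLOW = 1 };

/* Stage 1 model: one entry per endpoint, pointing at a rule set. */
struct overlay_entry {
    uint16_t endpoint_id;
    uint32_t rule_set_id;
};

/* Stage 2 model: rules are scoped by rule_set_id, not by endpoint. */
struct shared_rule {
    uint32_t rule_set_id;
    uint32_t identity;
    uint16_t dport;
    enum verdict verdict;
};

static enum verdict policy_lookup(const struct overlay_entry *overlay, size_t n_overlay,
                                  const struct shared_rule *rules, size_t n_rules,
                                  uint16_t endpoint_id, uint32_t identity, uint16_t dport)
{
    const struct overlay_entry *ep = NULL;

    assert(overlay != NULL && rules != NULL);

    /* Step 1: resolve the endpoint to its rule set ID. */
    for (size_t i = 0; i < n_overlay; i++) {
        if (overlay[i].endpoint_id == endpoint_id) {
            ep = &overlay[i];
            break;
        }
    }
    if (ep == NULL)
        return VERDICT_DENY; /* assumption: unknown endpoint means deny */

    /* Step 2: evaluate the packet against the shared rules of that set. */
    for (size_t i = 0; i < n_rules; i++) {
        if (rules[i].rule_set_id == ep->rule_set_id &&
            rules[i].identity == identity &&
            rules[i].dport == dport)
            return rules[i].verdict;
    }
    return VERDICT_DENY; /* assumption: no matching rule means deny */
}
```

With endpoints 42 and 43 both mapped to rule set 7, a single stored rule `{7, 100, 80, VERDICT_ALLOW}` serves both of them, which is the deduplication the proposal is after.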
### 1. BPF Maps Structure

#### The Overlay Map (Endpoint → Rule Set ID)

```c
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, __u16);            // Endpoint ID
    __type(value, __u32);          // Rule Set ID
    __uint(max_entries, 65535);    // Example size
} cilium_policy_overlay SEC(".maps");
```

#### The Shared Rules Map (LPM Trie)

The key combines the `rule_set_id` with the identity and L4 selectors.

```c
struct shared_policy_key {
    __u32 prefixlen;    // LPM prefix length in bits; must be the first member
    __u32 rule_set_id;  // Scopes the rules to a set
    __u32 sec_label;    // Peer security identity
    __u8  egress;       // Direction
    __u8  protocol;     // L4 protocol
    __u16 dport;        // Destination port
} __attribute__((packed));

// Note: this struct is not packed, so the compiler inserts one byte of
// padding before auth_type to keep it 2-byte aligned.
struct policy_entry {
    __be16 proxy_port;
    __u8   deny;
    __u8   wildcard_protocol;
    __u8   wildcard_dport;
    __u16  auth_type;
    __u8   pad;
    __u8   pad2;
};

struct {
    __uint(type, BPF_MAP_TYPE_LPM_TRIE);
    __type(key, struct shared_policy_key);
    __type(value, struct policy_entry);
    __uint(max_entries, 1000000);          // Configurable example
    __uint(map_flags, BPF_F_NO_PREALLOC);  // Required for LPM trie maps
} cilium_policy_shared SEC(".maps");
```

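Because the map is an LPM trie, each rule's `prefixlen` decides how much of the key must match, counted in bits from the first byte after `prefixlen`. The sketch below works out the prefix lengths implied by the key layout above and models the prefix comparison in userspace C. It is an illustration under assumptions: the struct is a fixed-width mirror of `shared_policy_key`, the helper names are hypothetical, and only whole-field wildcards are shown (prefixes that end in the middle of a multi-byte field would additionally depend on that field's byte order).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the proposal's key, using fixed-width types. */
struct shared_policy_key {
    uint32_t prefixlen;
    uint32_t rule_set_id;
    uint32_t sec_label;
    uint8_t  egress;
    uint8_t  protocol;
    uint16_t dport;
} __attribute__((packed));

/* Bits of key data that precede the given byte offset (prefixlen excluded). */
static uint32_t bits_before(size_t offset)
{
    return (uint32_t)(8 * (offset - sizeof(uint32_t)));
}

/* Every field must match: 4 + 4 + 1 + 1 + 2 bytes = 96 bits. */
static uint32_t prefix_full(void)
{
    return bits_before(sizeof(struct shared_policy_key));
}

/* Any destination port: everything up to dport = 80 bits. */
static uint32_t prefix_any_port(void)
{
    return bits_before(offsetof(struct shared_policy_key, dport));
}

/* Any protocol and port: everything up to protocol = 72 bits. */
static uint32_t prefix_any_l4(void)
{
    return bits_before(offsetof(struct shared_policy_key, protocol));
}

/* Returns 1 if the first rule->prefixlen bits of both keys are equal. */
static int key_matches(const struct shared_policy_key *rule,
                       const struct shared_policy_key *pkt)
{
    const uint8_t *a = (const uint8_t *)rule + sizeof(uint32_t);
    const uint8_t *b = (const uint8_t *)pkt + sizeof(uint32_t);
    uint32_t full_bytes = rule->prefixlen / 8;
    uint32_t rem_bits = rule->prefixlen % 8;

    if (rule->prefixlen > prefix_full())
        return 0;
    if (memcmp(a, b, full_bytes) != 0)
        return 0;
    if (rem_bits != 0) {
        /* Compare only the most significant rem_bits of the next byte. */
        uint8_t mask = (uint8_t)(0xFF << (8 - rem_bits));
        if (((a[full_bytes] ^ b[full_bytes]) & mask) != 0)
            return 0;
    }
    return 1;
}
```

Since `rule_set_id` is the first field after `prefixlen`, every prefix of 32 bits or more pins the rule set, so a rule can never match a packet from an endpoint that points at a different set.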
### 2. Userspace Manager (Go)

The Go agent owns rule set allocation and programs both BPF maps.

* **Detecting an existing rule set (zero collisions)**:
  The agent computes a fingerprint (hash) of the rule set. If an existing rule set has the same fingerprint, the agent compares the two rule by rule to confirm they are identical. If they match, the endpoints share one rule set ID; if they differ, the new rule set gets its own ID. A hash collision can therefore never cause two different policies to be merged.
* **Updating rules without dropping packets**:
  When a rule changes for a group of endpoints, the agent does not edit the live entries in place, which could cause transient packet drops or security holes mid-update. Instead, it:
  1. Writes a **complete copy** of the new rules into the Shared Map under a new rule set ID.
  2. Updates each affected endpoint's Overlay Map entry to point to the new ID, so a packet is evaluated against either the complete old set or the complete new set.
  3. **Deletes** the old rules once no endpoint references them, freeing the memory.

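The fingerprint-then-compare step can be sketched as follows. The proposal places this logic in the Go agent; the sketch uses C only to stay consistent with the other listings. All names (`rule`, `rule_set`, `manager`, `acquire_rule_set`, `release_rule_set`), the FNV-1a style hash, the fixed table sizes, and the reference counting are assumptions made for illustration. It also assumes rules arrive in a canonical (sorted) order, since both the hash and the comparison are order-sensitive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_RULES 8   /* illustrative limits, not from the proposal */
#define MAX_SETS  16

struct rule {
    uint32_t identity;
    uint16_t dport;
    uint8_t  protocol;
    uint8_t  deny;
};

struct rule_set {
    uint32_t id;          /* 0 marks a free slot */
    uint64_t fingerprint;
    uint32_t refcount;    /* endpoints currently pointing at this set */
    size_t   n_rules;
    struct rule rules[MAX_RULES];
};

struct manager {
    struct rule_set sets[MAX_SETS];
    uint32_t next_id;
};

/* FNV-1a style hash over the rule fields. */
static uint64_t fingerprint(const struct rule *rules, size_t n)
{
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < n; i++) {
        uint64_t fields[4] = { rules[i].identity, rules[i].dport,
                               rules[i].protocol, rules[i].deny };
        for (size_t f = 0; f < 4; f++) {
            h ^= fields[f];
            h *= 1099511628211ULL;
        }
    }
    return h;
}

/* Full comparison: the fingerprint alone is never trusted. */
static int same_rules(const struct rule *a, size_t na,
                      const struct rule *b, size_t nb)
{
    if (na != nb)
        return 0;
    for (size_t i = 0; i < na; i++) {
        if (a[i].identity != b[i].identity || a[i].dport != b[i].dport ||
            a[i].protocol != b[i].protocol || a[i].deny != b[i].deny)
            return 0;
    }
    return 1;
}

/* Returns the rule set ID to store in the overlay map, or 0 if out of space. */
static uint32_t acquire_rule_set(struct manager *m, const struct rule *rules, size_t n)
{
    struct rule_set *free_slot = NULL;
    uint64_t fp;

    if (n > MAX_RULES)
        return 0;
    fp = fingerprint(rules, n);
    for (size_t i = 0; i < MAX_SETS; i++) {
        struct rule_set *s = &m->sets[i];
        if (s->id == 0) {
            if (free_slot == NULL)
                free_slot = s;
            continue;
        }
        if (s->fingerprint == fp && same_rules(s->rules, s->n_rules, rules, n)) {
            s->refcount++;          /* identical set: share it */
            return s->id;
        }
    }
    if (free_slot == NULL)
        return 0;
    free_slot->id = ++m->next_id;   /* IDs are never reused in this sketch */
    free_slot->fingerprint = fp;
    free_slot->refcount = 1;
    free_slot->n_rules = n;
    memcpy(free_slot->rules, rules, n * sizeof(*rules));
    return free_slot->id;
}

/* Drops one reference; frees the slot when no endpoint uses the set. */
static void release_rule_set(struct manager *m, uint32_t id)
{
    for (size_t i = 0; i < MAX_SETS; i++) {
        struct rule_set *s = &m->sets[i];
        if (id != 0 && s->id == id) {
            if (--s->refcount == 0)
                memset(s, 0, sizeof(*s));
            return;
        }
    }
}
```

In terms of the update sequence above, a policy change would call `acquire_rule_set` for the new rules, repoint the endpoint in the overlay map, and only then call `release_rule_set` on the old ID, so the old entries are removed once their reference count reaches zero.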
## Impacts / Key Questions

### Impact: Memory Scale

Estimated for 1000 pods spread across 10 application profiles, with 500 rules per profile.

| Architecture | Total Map Entries | Est. Node Memory |
| :--- | :--- | :--- |
| **Legacy (per-endpoint maps)** | 500,000 | ≈ 50 MB |
| **Proposed (Shared LPM)** | 5,000 | < 1 MB |

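The entry counts in the table follow directly from the stated scenario. The per-entry cost used below (about 100 bytes) is not a measurement; it is simply what the table's legacy figure implies when 50 MB is divided over 500,000 entries, taking 1 MB as $10^6$ bytes.

$$
\begin{aligned}
N_{\text{legacy}} &= 1000 \text{ pods} \times 500 \text{ rules} = 500{,}000 \text{ entries} \\
N_{\text{shared}} &= 10 \text{ profiles} \times 500 \text{ rules} = 5{,}000 \text{ entries} \\
\text{implied cost per entry} &\approx \frac{50 \text{ MB}}{500{,}000} = 100 \text{ B} \\
M_{\text{shared}} &\approx 5{,}000 \times 100 \text{ B} = 0.5 \text{ MB} \\
1 - \frac{N_{\text{shared}}}{N_{\text{legacy}}} &= 1 - \frac{5{,}000}{500{,}000} = 99\%
\end{aligned}
$$

The 99% reduction matches the estimate given in the Motivation section. The Overlay Map adds one small entry per endpoint on top of this, which is not included in the figures above.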
### Key Question: Map Limitations
With `BPF_F_NO_PREALLOC`, which the kernel requires for `BPF_MAP_TYPE_LPM_TRIE` maps, memory is allocated per entry on demand rather than up front. Setting `max_entries` high (e.g., millions) therefore does not by itself consume node memory.

## Future Milestones

### BPF Arena Consolidation
In the future, we can transition to Level 2 Deduplication via BPF Arena without breaking the API surface between Go and BPF.
