Commit 7c6318e

Add documentation for replaceChainRules enforcement cycle
Documents the vxlan-policy-agent enforcement cycle, including the decision tree for replaceChainRules, naming conventions, state evaluation and recovery, and safety explanations. Made-with: Cursor
1 parent 1bf2ce5 commit 7c6318e

1 file changed

Lines changed: 202 additions & 0 deletions

@@ -0,0 +1,202 @@
---
title: VXLAN Policy Agent Enforcement Cycle
expires_at: never
tags: [cf-networking-release, silk-release, asg]
---

<!-- vim-markdown-toc GFM -->

* [VXLAN Policy Agent Enforcement Cycle](#vxlan-policy-agent-enforcement-cycle)
  * [Overview](#overview)
  * [Naming Conventions](#naming-conventions)
  * [replaceChainRules Decision Tree](#replacechainrules-decision-tree)
  * [State Evaluation and Recovery](#state-evaluation-and-recovery)
  * [Normal Update Path Detail](#normal-update-path-detail)
  * [Happy Path iptables Examples](#happy-path-iptables-examples)

<!-- vim-markdown-toc -->

# VXLAN Policy Agent Enforcement Cycle

## Overview

The `vxlan-policy-agent` periodically polls the policy server for Application Security Group (ASG) rules and enforces them on the host using `iptables`. To ensure that network traffic is not interrupted or dropped during rule updates, the agent uses a "candidate and rename" strategy. Instead of modifying the active chain directly, it builds a new "candidate" chain, inserts a jump rule to it, and then swaps it into place by deleting the old chain and renaming the new one.

This document explains the decision tree in the `replaceChainRules` function, which is responsible for this atomic swap and for recovering from partial failures if the agent crashes mid-update.
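
The strategy above can be sketched on a toy in-memory table. This is a Python model for illustration only, not the agent's code: the dictionary layout, the `("jump", ...)` / `("rule", ...)` tuples, and the function name `swap_in_candidate` are all assumptions made for this example.

```python
# Toy model of one iptables table: chain name -> ordered list of rules.
# A rule is ("jump", chain_name) or ("rule", text). Hypothetical layout.

def swap_in_candidate(table, parent, active, candidate, new_rules):
    """Replace `active` with a freshly built chain without editing it in place."""
    table[candidate] = []                          # create the empty candidate
    table[parent].insert(0, ("jump", candidate))   # jump at position 1 of the parent
    table[candidate].extend(new_rules)             # fill in the new rules
    table[parent].remove(("jump", active))         # drop the old jump ...
    del table[active]                              # ... and the old chain
    # Rename: the chain takes over the old name and jumps to it follow along.
    table[active] = table.pop(candidate)
    table[parent] = [("jump", active) if r == ("jump", candidate) else r
                     for r in table[parent]]
    return table


table = {
    "netout-XYZ": [("jump", "asg-XYZ"), ("rule", "REJECT")],
    "asg-XYZ": [("rule", "tcp dport 80 ACCEPT")],
}
swap_in_candidate(table, "netout-XYZ", "asg-XYZ", "casg-XYZ",
                  [("rule", "tcp dport 443 ACCEPT")])
print(table)
```

At no point in the sketch is the parent chain left without a jump to a populated ASG chain, which is the property the strategy is meant to preserve.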

## Naming Conventions

- **Original / Active Chain (`asg-XYZ`)**: The chain currently serving traffic for a container. `XYZ` is derived from the container handle.
- **Candidate Chain (`casg-XYZ`)**: A temporary chain built during an update to hold the new rules.
- **Jump Rule**: A rule in the parent chain (e.g., `netout-XYZ`) that directs traffic into the ASG chain (e.g., `-A netout-XYZ -j asg-XYZ`). The presence of a jump rule determines whether a chain is actively evaluating traffic.
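
As a small illustration of these conventions, the following Python sketch builds the three names from a suffix and checks a parent chain for a jump. The helper names (`chain_names`, `jump_target`, `has_jump`) are hypothetical, and the suffix is taken as given because the exact derivation from the container handle is not described here. One detail worth noting: `asg-XYZ` is a substring of `casg-XYZ`, so the check compares the whole `-j` target rather than searching the rule text.

```python
def chain_names(suffix):
    """Chain names for one container; `suffix` is the `XYZ` part, taken as given."""
    return {
        "parent": f"netout-{suffix}",
        "original": f"asg-{suffix}",
        "candidate": f"casg-{suffix}",
    }


def jump_target(rule):
    """Return the chain or target named after `-j` in a rule spec, else None."""
    tokens = rule.split()
    if "-j" in tokens:
        idx = tokens.index("-j")
        if idx + 1 < len(tokens):
            return tokens[idx + 1]
    return None


def has_jump(parent_rules, chain):
    """True if any rule in the parent chain jumps to exactly `chain`."""
    return any(jump_target(rule) == chain for rule in parent_rules)


names = chain_names("XYZ")
parent_rules = [
    "-A netout-XYZ -j casg-XYZ",
    "-A netout-XYZ -j REJECT --reject-with icmp-port-unreachable",
]
print(has_jump(parent_rules, names["candidate"]))  # candidate is active
print(has_jump(parent_rules, names["original"]))   # original is not
```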

## replaceChainRules Decision Tree

The `replaceChainRules` function begins by checking the parent chain for jump rules pointing to either the original chain or the candidate chain. Based on what it finds, it determines the current state and how to proceed.

```mermaid
flowchart TD
    Start["replaceChainRules entry"] --> CheckJumps["Check jump rules in parent chain"]

    CheckJumps --> ParentCheck{"Parent chain\nexists?"}
    ParentCheck -->|"No"| Skip["Skip container\n(Container Creation Race)"]
    Skip --> Abort["Abort (Retry next sync)"]

    ParentCheck -->|"Yes"| EvaluateState["Evaluate jump rule state"]

    EvaluateState --> Neither{"Neither jump\nexists?"}
    EvaluateState --> OnlyOrig{"Only original\njump exists?"}
    EvaluateState --> OnlyCandidate{"Only candidate\njump exists?"}
    EvaluateState --> BothExist{"Both jumps\nexist?"}

    Neither -->|"First time setup"| EnforceDirect["Enforce directly on asg-XYZ\n[active: asg-XYZ]"]
    EnforceDirect --> Done["Done"]

    OnlyOrig -->|"Normal path"| NormalUpdate["Normal Update Path\n(see below)"]
    NormalUpdate --> Done

    OnlyCandidate -->|"Failed to rename"| CheckOrphan["Check if orphaned\nasg-XYZ chain exists"]
    CheckOrphan -->|"Yes"| FlushDelete["Flush and delete\norphaned asg-XYZ\n[active: casg-XYZ]"]
    CheckOrphan -->|"No"| RenameRecovery1["Rename casg-XYZ to asg-XYZ\n[active: asg-XYZ]"]
    FlushDelete --> RenameRecovery1
    RenameRecovery1 --> NormalUpdate

    BothExist -->|"Failed to delete old"| DeleteOrig["Delete asg-XYZ chain\nand jump rule\n[active: casg-XYZ]"]
    DeleteOrig --> RenameRecovery2["Rename casg-XYZ to asg-XYZ\n[active: asg-XYZ]"]
    RenameRecovery2 --> NormalUpdate
```
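
The top of the decision tree amounts to a classification over two booleans plus the parent-chain check. A minimal Python sketch follows; the `State` enum, the `classify` function, and the table layout (chain name mapped to the list of chains its rules jump to) are assumptions for illustration and do not mirror the agent's actual types.

```python
from enum import Enum


class State(Enum):
    PARENT_MISSING = "skip container, retry next sync"
    NEITHER = "first-time setup"
    ONLY_ORIGINAL = "normal update"
    ONLY_CANDIDATE = "recover: failed to rename"
    BOTH = "recover: failed to delete old"


def classify(table, parent, original, candidate):
    """Map the jump rules found in `parent` to one branch of the decision tree.

    `table` maps a chain name to the list of chains its rules jump to.
    """
    if parent not in table:
        return State.PARENT_MISSING
    targets = set(table[parent])
    has_original = original in targets
    has_candidate = candidate in targets
    if has_original and has_candidate:
        return State.BOTH
    if has_candidate:
        return State.ONLY_CANDIDATE
    if has_original:
        return State.ONLY_ORIGINAL
    return State.NEITHER


print(classify({"netout-XYZ": ["casg-XYZ", "asg-XYZ"]},
               "netout-XYZ", "asg-XYZ", "casg-XYZ").value)
```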

## State Evaluation and Recovery

The following table lists the jump-rule states that `replaceChainRules` detects, the failure modes that lead to each one, the recovery actions taken, and why those actions are safe for the running application's traffic.

| State / Failure Mode | Jump Rules Found | Recovery Action | Active Chain (Latest-known-good) | Why it is Safe |
| :--- | :--- | :--- | :--- | :--- |
| **First-time Setup** | Neither | Enforce directly on `asg-XYZ` (create chain, append rules, insert jump). | `asg-XYZ` | No existing traffic or rules to disrupt. |
| **Normal Update** | Only `asg-XYZ` | Create `casg-XYZ`, insert its jump at position 1, append the new rules, delete `asg-XYZ` jump & chain, rename `casg-XYZ` -> `asg-XYZ`. | `asg-XYZ` -> `casg-XYZ` -> `asg-XYZ` | The old chain and its jump stay in place until `casg-XYZ` holds the complete new ruleset behind a jump at position 1. Traffic shifts to the new rules before the old chain is deleted. |
| **Container Creation Race (Parent chain not ready)** | N/A (Check fails) | Skip container enforcement. Retry on next sync cycle. | None | The container's network interface (and parent chain) is still being created by the CNI plugin. No application traffic can escape the container until the parent chain is wired up, so skipping enforcement temporarily does not leak traffic. |
| **Interrupted Update (Failed to create candidate)** | Only `asg-XYZ` | Normal update path retries. | `asg-XYZ` | The original chain and jump rule were never modified. Traffic continues to flow through the old rules uninterrupted. |
| **Interrupted Update (Failed to insert candidate jump)** | Only `asg-XYZ` | Normal update path retries. The agent attempts to delete the orphaned candidate chain when the failure occurs. | `asg-XYZ` | The parent chain was never successfully modified to point to the candidate. Traffic continues through the original chain. |
| **Interrupted Update (Failed to append rules)** | Only `asg-XYZ` (if immediate cleanup succeeds) or Both (if cleanup fails) | Normal update or "Both exist" recovery. | `asg-XYZ` (if cleanup succeeds) or `casg-XYZ` (if cleanup fails) | The agent immediately attempts to delete the candidate chain and jump rule to revert traffic to the original chain. If this cleanup fails, it falls into the "Both exist" recovery state on the next sync. |
| **Interrupted Update (Failed to rename)** | Only `casg-XYZ` | Flush/delete orphaned `asg-XYZ` chain, rename `casg-XYZ` -> `asg-XYZ`, run normal update. | `casg-XYZ` | The candidate chain contains the fully built ruleset from the previous run. Renaming it restores the standard naming convention without dropping traffic. Clearing the orphaned chain prevents `RenameChain` failures (indefinite loop bug). |
| **Interrupted Update (Failed to delete old)** | Both | Delete `asg-XYZ` chain & jump, rename `casg-XYZ` -> `asg-XYZ`, run normal update. | `casg-XYZ` | `casg-XYZ` was inserted at position 1, so it is already actively evaluating traffic with the newer rules. Deleting the old chain safely cleans up unused rules and prevents traffic from falling back to old rules (traffic regression bug). |
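
The two recovery rows ("Failed to rename" and "Failed to delete old") both end in the same place: a single `asg-XYZ` chain holding the newer rules. The Python sketch below models that convergence on an in-memory table. The `recover` function and the tuple-based rule format are hypothetical, and the follow-up normal update that the table mentions is left out.

```python
def recover(table, parent, original, candidate):
    """Return the table to the 'only original jump exists' state, if needed."""
    jumps = {r[1] for r in table[parent] if r[0] == "jump"}
    if candidate not in jumps:
        return table  # nothing to recover
    if original in jumps:
        # Both exist: the old jump was never deleted, so remove it first.
        table[parent] = [r for r in table[parent] if r != ("jump", original)]
    # In either state, a leftover `original` chain would block the rename.
    table.pop(original, None)
    table[original] = table.pop(candidate)
    table[parent] = [("jump", original) if r == ("jump", candidate) else r
                     for r in table[parent]]
    return table


both = {
    "netout-XYZ": [("jump", "casg-XYZ"), ("jump", "asg-XYZ"), ("rule", "REJECT")],
    "casg-XYZ": [("rule", "new")],
    "asg-XYZ": [("rule", "old")],
}
orphan = {
    "netout-XYZ": [("jump", "casg-XYZ"), ("rule", "REJECT")],
    "casg-XYZ": [("rule", "new")],
    "asg-XYZ": [("rule", "old")],  # orphaned: no jump points here
}
for t in (both, orphan):
    recover(t, "netout-XYZ", "asg-XYZ", "casg-XYZ")
print(both == orphan)
```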

## Normal Update Path Detail

When the agent executes a normal update (starting from "Only original exists" or after recovering from a partial failure), it follows these steps:

```mermaid
flowchart TD
    Start["Normal Update Path"] --> CreateCand["Create casg-XYZ chain"]
    CreateCand -->|"Error"| FailCreate["Abort\n(Failed to create candidate)"]

    CreateCand -->|"Success"| InsertJump["Insert jump rule to casg-XYZ\nat position 1 in parent"]
    InsertJump -->|"Error"| FailInsert["Delete casg-XYZ & Abort\n(Failed to insert candidate jump)"]

    InsertJump -->|"Success\n[active: casg-XYZ]"| AppendRules["Append new rules to casg-XYZ"]
    AppendRules -->|"Error"| FailAppend["Delete casg-XYZ & jump rule, then Abort\n(Failed to append rules)"]

    AppendRules -->|"Success"| DeleteOld["Delete old asg-XYZ chain\nand its jump rule"]
    DeleteOld -->|"Error"| FailDelete["Abort\n(Failed to delete old)"]

    DeleteOld -->|"Success"| RenameCand["Rename casg-XYZ to asg-XYZ"]
    RenameCand -->|"Error"| FailRename["Abort\n(Failed to rename)"]

    RenameCand -->|"Success\n[active: asg-XYZ]"| Cleanup["Cleanup extra parent jump rules"]
    Cleanup --> Done["Done"]
```
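
To see which jump-rule state each failure leaves behind, the flowchart can be replayed against a fake table with fault injection. This is a hedged Python sketch: `FakeTable`, `normal_update`, the step names, and `jumps_after` are all invented for the example, the cleanup actions are assumed to always succeed, and the final "cleanup extra parent jump rules" step is omitted.

```python
class FakeTable:
    """In-memory chains plus one optional step name that is forced to fail."""

    def __init__(self, chains, fail_on=None):
        self.chains = chains
        self.fail_on = fail_on

    def step(self, name):
        if name == self.fail_on:
            raise RuntimeError(name)


def normal_update(t, parent, original, candidate, new_rules):
    """Run the update; return None on success or a failure label."""
    try:
        t.step("create")
        t.chains[candidate] = []
    except RuntimeError:
        return "failed to create candidate"
    try:
        t.step("insert_jump")
        t.chains[parent].insert(0, ("jump", candidate))
    except RuntimeError:
        del t.chains[candidate]  # remove the orphaned candidate
        return "failed to insert candidate jump"
    try:
        t.step("append")
        t.chains[candidate].extend(new_rules)
    except RuntimeError:
        t.chains[parent].remove(("jump", candidate))
        del t.chains[candidate]  # revert to the original chain
        return "failed to append rules"
    try:
        t.step("delete_old")
        t.chains[parent].remove(("jump", original))
        del t.chains[original]
    except RuntimeError:
        return "failed to delete old"
    try:
        t.step("rename")
        t.chains[original] = t.chains.pop(candidate)
        t.chains[parent] = [("jump", original) if r == ("jump", candidate) else r
                            for r in t.chains[parent]]
    except RuntimeError:
        return "failed to rename"
    return None


def jumps_after(fail_on):
    """Jump targets left in the parent chain after a run that fails at `fail_on`."""
    t = FakeTable({"netout-XYZ": [("jump", "asg-XYZ")],
                   "asg-XYZ": [("rule", "old")]}, fail_on)
    normal_update(t, "netout-XYZ", "asg-XYZ", "casg-XYZ", [("rule", "new")])
    return [r[1] for r in t.chains["netout-XYZ"]]


for name in ("create", "insert_jump", "append", "delete_old", "rename", None):
    print(name, jumps_after(name))
```

In this model, failures in the first three steps leave only the original jump, a failed delete leaves both jumps, and a failed rename leaves only the candidate jump, which lines up with the states described in the recovery table.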

## Happy Path iptables Examples

The following examples show the state of the `iptables` rules for a container (handle `abc123def456`) at each step of its lifecycle, from initial creation by the CNI plugin to a successful ASG update by the `vxlan-policy-agent`.

### 1. CNI netrules create
The CNI plugin creates the parent chain (`netout-abc123def456`) and adds default rules to allow established connections and reject everything else. The ASG chain does not exist yet.

```iptables
-N netout-abc123def456
-A netout-abc123def456 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A netout-abc123def456 -j REJECT --reject-with icmp-port-unreachable
```

### 2. CNI force asg (Initial enforcement)
The CNI plugin calls the `vxlan-policy-agent` to force an immediate ASG sync. The agent creates the `asg-abc123def456` chain, populates it with the initial rules, and inserts a jump rule at position 1 in the parent chain.

```iptables
-N asg-abc123def456
-A asg-abc123def456 -d 10.0.0.0/8 -p tcp -m tcp --dport 80 -j ACCEPT
-N netout-abc123def456
-A netout-abc123def456 -j asg-abc123def456
-A netout-abc123def456 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A netout-abc123def456 -j REJECT --reject-with icmp-port-unreachable
```
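
Listings in this `-N` / `-A` form are easy to load into a dictionary when comparing states by hand or in a test. The Python sketch below is one way to do it; `parse_listing` is a hypothetical helper that handles only the two line types used in these examples.

```python
def parse_listing(text):
    """Parse `-N chain` / `-A chain spec...` lines into {chain: [spec, ...]}."""
    chains = {}
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue  # skip blank or malformed lines
        flag, chain = tokens[0], tokens[1]
        if flag == "-N":
            chains.setdefault(chain, [])
        elif flag == "-A":
            chains.setdefault(chain, []).append(" ".join(tokens[2:]))
    return chains


listing = """\
-N asg-abc123def456
-A asg-abc123def456 -d 10.0.0.0/8 -p tcp -m tcp --dport 80 -j ACCEPT
-N netout-abc123def456
-A netout-abc123def456 -j asg-abc123def456
-A netout-abc123def456 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A netout-abc123def456 -j REJECT --reject-with icmp-port-unreachable
"""
chains = parse_listing(listing)
print(chains["netout-abc123def456"][0])
```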

### 3. vxlan-policy-agent updating (new chain created)
During a periodic sync, the agent detects a rule change. It creates a new candidate chain (`casg-abc123def456`). The original chain is still active.

```iptables
-N asg-abc123def456
-A asg-abc123def456 -d 10.0.0.0/8 -p tcp -m tcp --dport 80 -j ACCEPT
-N casg-abc123def456
-N netout-abc123def456
-A netout-abc123def456 -j asg-abc123def456
-A netout-abc123def456 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A netout-abc123def456 -j REJECT --reject-with icmp-port-unreachable
```

### 4. vxlan-policy-agent updating (new jump inserted & rules appended)
The agent inserts a jump rule to the candidate chain at position 1 in the parent chain, and appends the new rules to the candidate chain. Traffic is now evaluated by the candidate chain first; anything it does not match still falls through to the original chain until the old jump rule is removed.

```iptables
-N asg-abc123def456
-A asg-abc123def456 -d 10.0.0.0/8 -p tcp -m tcp --dport 80 -j ACCEPT
-N casg-abc123def456
-A casg-abc123def456 -d 10.0.0.0/8 -p tcp -m tcp --dport 443 -j ACCEPT
-N netout-abc123def456
-A netout-abc123def456 -j casg-abc123def456
-A netout-abc123def456 -j asg-abc123def456
-A netout-abc123def456 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A netout-abc123def456 -j REJECT --reject-with icmp-port-unreachable
```

### 5. vxlan-policy-agent updating (remove old jump rule)
The agent deletes the jump rule pointing to the original `asg-abc123def456` chain.

```iptables
-N asg-abc123def456
-A asg-abc123def456 -d 10.0.0.0/8 -p tcp -m tcp --dport 80 -j ACCEPT
-N casg-abc123def456
-A casg-abc123def456 -d 10.0.0.0/8 -p tcp -m tcp --dport 443 -j ACCEPT
-N netout-abc123def456
-A netout-abc123def456 -j casg-abc123def456
-A netout-abc123def456 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A netout-abc123def456 -j REJECT --reject-with icmp-port-unreachable
```
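
The difference between the step 4 and step 5 listings is only the old jump rule, but it changes which connections are accepted. The Python sketch below is a simplified first-match evaluator to make that visible. It is an assumption-laden model: it matches on destination port only, assumes a new TCP connection to an address inside `10.0.0.0/8` (so the `RELATED,ESTABLISHED` rule never applies and is left out), and uses shortened chain names.

```python
def verdict(table, chain, dport):
    """First terminal verdict for a new TCP connection to `dport`, else None."""
    for kind, arg in table[chain]:
        if kind == "jump":
            result = verdict(table, arg, dport)
            if result is not None:  # no match in the sub-chain: keep going
                return result
        elif kind == "accept" and arg == dport:
            return "ACCEPT"
        elif kind == "reject":
            return "REJECT"
    return None


# Step 4: both jumps present, candidate first.
step4 = {
    "netout": [("jump", "casg"), ("jump", "asg"), ("reject", None)],
    "casg": [("accept", 443)],
    "asg": [("accept", 80)],
}
# Step 5: the jump to the original chain has been removed.
step5 = dict(step4, netout=[("jump", "casg"), ("reject", None)])

for port in (443, 80, 22):
    print(port, verdict(step4, "netout", port), verdict(step5, "netout", port))
```

While both jumps exist, port 80 is still accepted through the old chain; once the old jump is gone, only the new ruleset applies.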

### 6. vxlan-policy-agent updating (remove old chain)
The agent flushes and deletes the original `asg-abc123def456` chain.

```iptables
-N casg-abc123def456
-A casg-abc123def456 -d 10.0.0.0/8 -p tcp -m tcp --dport 443 -j ACCEPT
-N netout-abc123def456
-A netout-abc123def456 -j casg-abc123def456
-A netout-abc123def456 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A netout-abc123def456 -j REJECT --reject-with icmp-port-unreachable
```

### 7. vxlan-policy-agent updating (rename new chain to old chain)
The agent renames `casg-abc123def456` to `asg-abc123def456`. The jump rule in the parent chain is automatically updated by `iptables` to reflect the new name. The update is complete.

```iptables
-N asg-abc123def456
-A asg-abc123def456 -d 10.0.0.0/8 -p tcp -m tcp --dport 443 -j ACCEPT
-N netout-abc123def456
-A netout-abc123def456 -j asg-abc123def456
-A netout-abc123def456 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A netout-abc123def456 -j REJECT --reject-with icmp-port-unreachable
```
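
A test or debugging script could confirm that a container has reached this final state. The Python sketch below shows one possible check. The function name `update_complete`, the `{chain: [rule spec, ...]}` layout, and the specific invariants (no candidate chain left, jump to the `asg-` chain as the parent's first rule) are assumptions drawn from the listings above, not checks the agent is known to perform.

```python
def update_complete(chains, suffix):
    """True if the chains look like the final listing for this container."""
    parent = f"netout-{suffix}"
    original = f"asg-{suffix}"
    candidate = f"casg-{suffix}"
    if candidate in chains or original not in chains or parent not in chains:
        return False
    parent_rules = chains[parent]
    return (bool(parent_rules)
            and parent_rules[0] == f"-j {original}"       # jump at position 1
            and f"-j {candidate}" not in parent_rules)    # no stale candidate jump


step7 = {
    "asg-abc123def456": ["-d 10.0.0.0/8 -p tcp -m tcp --dport 443 -j ACCEPT"],
    "netout-abc123def456": [
        "-j asg-abc123def456",
        "-m state --state RELATED,ESTABLISHED -j ACCEPT",
        "-j REJECT --reject-with icmp-port-unreachable",
    ],
}
step6 = {
    "casg-abc123def456": ["-d 10.0.0.0/8 -p tcp -m tcp --dport 443 -j ACCEPT"],
    "netout-abc123def456": [
        "-j casg-abc123def456",
        "-m state --state RELATED,ESTABLISHED -j ACCEPT",
        "-j REJECT --reject-with icmp-port-unreachable",
    ],
}
print(update_complete(step7, "abc123def456"), update_complete(step6, "abc123def456"))
```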
