
Commit bc36b6c

Add documentation for replaceChainRules enforcement cycle
Documents the vxlan-policy-agent enforcement cycle, including the decision tree for replaceChainRules, naming conventions, state evaluation and recovery, and safety explanations.

Made-with: Cursor
1 parent 1bf2ce5 commit bc36b6c

1 file changed

Lines changed: 119 additions & 0 deletions

@@ -0,0 +1,119 @@
---
title: VXLAN Policy Agent Enforcement Cycle
expires_at: never
tags: [cf-networking-release, silk-release, asg]
---

<!-- vim-markdown-toc GFM -->

* [VXLAN Policy Agent Enforcement Cycle](#vxlan-policy-agent-enforcement-cycle)
  * [Overview](#overview)
  * [Naming Conventions](#naming-conventions)
  * [replaceChainRules Decision Tree](#replacechainrules-decision-tree)
  * [State Evaluation and Recovery](#state-evaluation-and-recovery)
  * [Normal Update Path Detail](#normal-update-path-detail)

<!-- vim-markdown-toc -->

# VXLAN Policy Agent Enforcement Cycle

## Overview

The `vxlan-policy-agent` periodically polls the policy server for Application Security Group (ASG) rules and enforces them on the host using `iptables`. To avoid interrupting or dropping network traffic during rule updates, the agent uses a "candidate and rename" strategy. Instead of modifying the active chain directly, it creates a new "candidate" chain, inserts a jump rule to it ahead of the existing one, appends the new rules, and then swaps the candidate into place by deleting the old chain and renaming the new one.

This document explains the decision tree in the `replaceChainRules` function, which performs this swap and recovers from partial failures, for example when the agent crashes mid-update.

## Naming Conventions

- **Original / Active Chain (`asg-XYZ`)**: The chain currently serving traffic for a container. `XYZ` is derived from the container handle.
- **Candidate Chain (`casg-XYZ`)**: A temporary chain built during an update to hold the new rules.
- **Jump Rule**: A rule in the parent chain (e.g., `netout-XYZ`) that directs traffic into the ASG chain (e.g., `-A netout-XYZ -j asg-XYZ`). The presence of a jump rule determines whether a chain is actively evaluating traffic.

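
To make the jump-rule convention concrete, here is a minimal Python sketch of checking whether a parent chain contains a jump to a given chain. It is an illustration, not the agent's own code: it assumes the rules are available as strings in the format printed by `iptables -S`, and the helper name `has_jump` is hypothetical.

```python
import shlex


def has_jump(rule_lines, parent_chain, target_chain):
    """Return True if parent_chain contains a rule that jumps to target_chain.

    rule_lines: rule specs as printed by `iptables -S`, for example
    "-A netout-XYZ -j asg-XYZ". (Hypothetical helper, for illustration only.)
    """
    for line in rule_lines:
        tokens = shlex.split(line)
        # Only consider rules that belong to the parent chain: "-A <chain> ..."
        if len(tokens) < 2 or tokens[0] != "-A" or tokens[1] != parent_chain:
            continue
        # Look for "-j <target>" (or "--jump <target>") anywhere in the rule.
        for flag, value in zip(tokens, tokens[1:]):
            if flag in ("-j", "--jump") and value == target_chain:
                return True
    return False


rules = [
    "-N netout-XYZ",
    "-A netout-XYZ -j asg-XYZ",
    "-A netout-XYZ -j REJECT --reject-with icmp-port-unreachable",
]
print(has_jump(rules, "netout-XYZ", "asg-XYZ"))   # True
print(has_jump(rules, "netout-XYZ", "casg-XYZ"))  # False
```

The comparison is on whole tokens rather than substrings, which matters here because `asg-XYZ` is a substring of `casg-XYZ`.
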

## replaceChainRules Decision Tree

The `replaceChainRules` function begins by checking the parent chain for jump rules pointing to either the original chain or the candidate chain. Based on what it finds, it determines the current state and how to proceed.

```mermaid
flowchart TD
    Start["replaceChainRules entry"] --> CheckJumps["Check jump rules in parent chain"]

    CheckJumps --> ParentCheck{"Parent chain\nexists?"}
    ParentCheck -->|"No"| Skip["Skip container\n(Container Creation Race)"]
    Skip --> Abort["Abort (Retry next sync)"]

    ParentCheck -->|"Yes"| EvaluateState["Evaluate jump rule state"]

    EvaluateState --> Neither{"Neither jump\nexists?"}
    EvaluateState --> OnlyOrig{"Only original\njump exists?"}
    EvaluateState --> OnlyCandidate{"Only candidate\njump exists?"}
    EvaluateState --> BothExist{"Both jumps\nexist?"}

    Neither -->|"First time setup"| EnforceDirect["Enforce directly on asg-XYZ\n[active: asg-XYZ]"]
    EnforceDirect --> Done["Done"]

    OnlyOrig -->|"Normal path"| NormalUpdate["Normal Update Path\n(see below)"]
    NormalUpdate --> Done

    OnlyCandidate -->|"Failed to rename"| CheckOrphan["Check if orphaned\nasg-XYZ chain exists"]
    CheckOrphan -->|"Yes"| FlushDelete["Flush and delete\norphaned asg-XYZ\n[active: casg-XYZ]"]
    CheckOrphan -->|"No"| RenameRecovery1["Rename casg-XYZ to asg-XYZ\n[active: asg-XYZ]"]
    FlushDelete --> RenameRecovery1
    RenameRecovery1 --> NormalUpdate

    BothExist -->|"Failed to delete old"| DeleteOrig["Delete asg-XYZ chain\nand jump rule\n[active: casg-XYZ]"]
    DeleteOrig --> RenameRecovery2["Rename casg-XYZ to asg-XYZ\n[active: asg-XYZ]"]
    RenameRecovery2 --> NormalUpdate
```

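
The branching in the flowchart above can be sketched as a small classifier. This is a Python illustration under stated assumptions, not the actual `replaceChainRules` implementation: the `JumpState` enum and the `classify` function are hypothetical names, and the three booleans stand in for whatever checks the agent performs against `iptables`.

```python
from enum import Enum


class JumpState(Enum):
    """States from the decision tree (names are illustrative)."""
    PARENT_MISSING = "parent chain missing"
    NEITHER = "neither jump exists"
    ONLY_ORIGINAL = "only original jump exists"
    ONLY_CANDIDATE = "only candidate jump exists"
    BOTH = "both jumps exist"


def classify(parent_exists, has_original_jump, has_candidate_jump):
    """Map the three observations onto one state of the decision tree."""
    if not parent_exists:
        # Container creation race: skip and retry on the next sync.
        return JumpState.PARENT_MISSING
    if has_original_jump and has_candidate_jump:
        return JumpState.BOTH
    if has_candidate_jump:
        return JumpState.ONLY_CANDIDATE
    if has_original_jump:
        return JumpState.ONLY_ORIGINAL
    return JumpState.NEITHER


print(classify(True, True, False).value)  # only original jump exists
```

Checking the parent chain first mirrors the flowchart, where a missing parent chain short-circuits the rest of the evaluation.
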

## State Evaluation and Recovery

The following table lists each state or failure mode, the jump rules that `replaceChainRules` finds in that situation, the recovery action taken, and why that action is safe for the running application's traffic.

| State / Failure Mode | Jump Rules Found | Recovery Action | Active Chain (Latest-known-good) | Why it is Safe |
| :--- | :--- | :--- | :--- | :--- |
| **First-time Setup** | Neither | Enforce directly on `asg-XYZ` (create chain, append rules, insert jump). | `asg-XYZ` | No existing traffic or rules to disrupt. |
| **Normal Update** | Only `asg-XYZ` | Create `casg-XYZ`, insert its jump at position 1, append the new rules, delete the `asg-XYZ` jump & chain, rename `casg-XYZ` -> `asg-XYZ`. | `asg-XYZ` -> `casg-XYZ` -> `asg-XYZ` | The original chain and its jump rule stay in place until the candidate's jump rule is at position 1 and its rules have been appended, so traffic shifts to the new rules before the old chain is deleted. |
| **Container Creation Race (Parent chain not ready)** | N/A (Check fails) | Skip container enforcement. Retry on next sync cycle. | None | The container's network interface (and parent chain) is still being created by the CNI plugin. No application traffic can escape the container until the parent chain is wired up, so skipping enforcement temporarily does not leak traffic. |
| **Interrupted Update (Failed to create candidate)** | Only `asg-XYZ` | Normal update path retries. | `asg-XYZ` | The original chain and jump rule were never modified. Traffic continues to flow through the old rules uninterrupted. |
| **Interrupted Update (Failed to insert candidate jump)** | Only `asg-XYZ` | Normal update path retries. Agent attempts to delete the orphaned candidate chain during the failure. | `asg-XYZ` | The parent chain was never successfully modified to point to the candidate. Traffic continues through the original chain. |
| **Interrupted Update (Failed to append rules)** | Only `asg-XYZ` (if immediate cleanup succeeds) or Both (if cleanup fails) | Normal update or "Both exist" recovery. | `asg-XYZ` (if cleanup succeeds) or `casg-XYZ` (if cleanup fails) | The agent immediately attempts to delete the candidate chain and jump rule to revert traffic to the original chain. If this cleanup fails, it falls into the "Both exist" recovery state on the next sync. |
| **Interrupted Update (Failed to rename)** | Only `casg-XYZ` | Flush/delete orphaned `asg-XYZ` chain, rename `casg-XYZ` -> `asg-XYZ`, run normal update. | `casg-XYZ` | The candidate chain contains the fully built ruleset from the previous run. Renaming it restores the standard naming convention without dropping traffic. Clearing any orphaned `asg-XYZ` chain first keeps `RenameChain` from failing on every sync because the target name is still taken (the indefinite loop bug). |
| **Interrupted Update (Failed to delete old)** | Both | Delete `asg-XYZ` chain & jump, rename `casg-XYZ` -> `asg-XYZ`, run normal update. | `casg-XYZ` | `casg-XYZ` was inserted at position 1, so it is already actively evaluating traffic with the newer rules. Deleting the old chain cleans up unused rules and prevents traffic from falling back to the old rules (the traffic regression bug). |

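
As a rough companion to the table, the following Python sketch turns the observed jump rules into an ordered list of recovery actions. It is only an illustration of the table's logic, not the agent's code: the function name `recovery_steps`, its parameters, and the action strings are hypothetical, and it assumes the parent chain already exists.

```python
def recovery_steps(has_original_jump, has_candidate_jump,
                   orphaned_original_chain=False):
    """Return the ordered actions for one container, following the table.

    Assumes the parent chain exists. All names here are illustrative.
    """
    if not has_original_jump and not has_candidate_jump:
        # First-time setup: nothing to recover, nothing to swap.
        return ["enforce directly on asg-XYZ"]

    steps = []
    if has_original_jump and has_candidate_jump:
        # Failed to delete old: casg-XYZ is already at position 1.
        steps.append("delete asg-XYZ jump rule and chain")
        steps.append("rename casg-XYZ to asg-XYZ")
    elif has_candidate_jump:
        # Failed to rename: clear any leftover asg-XYZ so the rename can succeed.
        if orphaned_original_chain:
            steps.append("flush and delete orphaned asg-XYZ chain")
        steps.append("rename casg-XYZ to asg-XYZ")

    # Every non-first-time path ends in the normal update.
    steps.append("run normal update")
    return steps


for args in [(False, False), (True, False), (False, True, True), (True, True)]:
    print(args, "->", recovery_steps(*args))
```

Both recovery branches end by running the normal update, which matches the flowchart arrows from the rename steps into the normal update path.
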

## Normal Update Path Detail

When the agent executes a normal update (starting from the "only original jump exists" state or after recovering from a partial failure), it follows these steps:

```mermaid
flowchart TD
    Start["Normal Update Path"] --> CreateCand["Create casg-XYZ chain"]
    CreateCand -->|"Error"| FailCreate["Abort\n(Failed to create candidate)"]

    CreateCand -->|"Success"| InsertJump["Insert jump rule to casg-XYZ\nat position 1 in parent"]
    InsertJump -->|"Error"| FailInsert["Delete casg-XYZ & Abort\n(Failed to insert candidate jump)"]

    InsertJump -->|"Success\n[active: casg-XYZ]"| AppendRules["Append new rules to casg-XYZ"]
    AppendRules -->|"Error"| FailAppend["Delete casg-XYZ & jump rule, then Abort\n(Failed to append rules)"]

    AppendRules -->|"Success"| DeleteOld["Delete old asg-XYZ chain\nand its jump rule"]
    DeleteOld -->|"Error"| FailDelete["Abort\n(Failed to delete old)"]

    DeleteOld -->|"Success"| RenameCand["Rename casg-XYZ to asg-XYZ"]
    RenameCand -->|"Error"| FailRename["Abort\n(Failed to rename)"]

    RenameCand -->|"Success\n[active: asg-XYZ]"| Cleanup["Cleanup extra parent jump rules"]
    Cleanup --> Done["Done"]
```

1. **Enforce on `casg-XYZ`**: Create the candidate chain, insert a jump rule to it at position 1 in the parent chain, and append the new ASG rules.
    * **Active chain:** `casg-XYZ` (takes priority over `asg-XYZ` because it is at position 1).
2. **Delete `asg-XYZ`**: Delete the original jump rule and the original chain.
    * **Active chain:** `casg-XYZ` (now the only jump rule).
3. **Rename `casg-XYZ` to `asg-XYZ`**: Rename the chain. The jump rule automatically updates to point to the new name.
    * **Active chain:** `asg-XYZ`.
4. **Cleanup**: Remove any extra jump rules in the parent chain that might have been left behind.
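
The first three steps can be traced on a toy in-memory model. The Python sketch below is an illustration only: `FakeIPTables`, `active_chain`, and `normal_update` are hypothetical names, rules are plain tuples rather than real `iptables` calls, the "active chain" is simplified to the first jump in the parent chain, and step 4 (cleanup) and all error handling are left out.

```python
class FakeIPTables:
    """Tiny in-memory stand-in for one iptables table (illustration only)."""

    def __init__(self):
        self.chains = {}  # chain name -> ordered list of rule tuples

    def new_chain(self, name):
        if name in self.chains:
            raise ValueError(f"chain {name} already exists")
        self.chains[name] = []

    def insert(self, chain, position, rule):
        # Positions are 1-based, as in iptables.
        self.chains[chain].insert(position - 1, rule)

    def append(self, chain, *rules):
        self.chains[chain].extend(rules)

    def delete_rule(self, chain, rule):
        self.chains[chain].remove(rule)

    def delete_chain(self, name):
        del self.chains[name]

    def rename_chain(self, old, new):
        if new in self.chains:
            raise ValueError(f"chain {new} already exists")
        self.chains[new] = self.chains.pop(old)
        # Jumps follow the rename, as described in step 3.
        for rules in self.chains.values():
            for i, rule in enumerate(rules):
                if rule == ("-j", old):
                    rules[i] = ("-j", new)


def active_chain(ipt, parent):
    """Simplification: the first jump in the parent chain is the active one."""
    for rule in ipt.chains[parent]:
        if rule[0] == "-j":
            return rule[1]
    return None


def normal_update(ipt, parent, original, candidate, new_rules):
    """Run steps 1 to 3 and record the active chain after each step."""
    trace = []
    # Step 1: create the candidate, jump to it at position 1, append rules.
    ipt.new_chain(candidate)
    ipt.insert(parent, 1, ("-j", candidate))
    ipt.append(candidate, *new_rules)
    trace.append(active_chain(ipt, parent))
    # Step 2: delete the original jump rule and the original chain.
    ipt.delete_rule(parent, ("-j", original))
    ipt.delete_chain(original)
    trace.append(active_chain(ipt, parent))
    # Step 3: rename the candidate to the original name.
    ipt.rename_chain(candidate, original)
    trace.append(active_chain(ipt, parent))
    return trace


ipt = FakeIPTables()
ipt.chains["netout-XYZ"] = [("-j", "asg-XYZ")]
ipt.chains["asg-XYZ"] = [("-p", "tcp", "-j", "ACCEPT")]
trace = normal_update(ipt, "netout-XYZ", "asg-XYZ", "casg-XYZ",
                      [("-p", "udp", "-j", "ACCEPT")])
print(trace)  # ['casg-XYZ', 'casg-XYZ', 'asg-XYZ']
```

The printed trace matches the "Active chain" notes under steps 1 to 3. Note that `rename_chain` refuses to overwrite an existing chain, which is why the old `asg-XYZ` chain has to be gone before step 3.
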
