Commit 1656712

Fix multi-instance state coordination with file locking and expiration persistence [minor] (#62)
When multiple Traefik routers use this plugin, each creates a separate instance that competes for the same state file. This caused verified IPs to lose their TTLs and get re-challenged across instances.

Changes:
- Add expiration timestamps to State struct for proper TTL serialization
- Implement file locking to prevent concurrent write conflicts
- Add state reconciliation to merge in-memory and file-based state
- Keep periodic full state saves (10 min) as backup for rate/bot caches
- Verified IPs now maintain their TTLs across plugin instances, preventing unnecessary re-challenges when requests hit different routers.
1 parent 865a64d commit 1656712

14 files changed

Lines changed: 1858 additions & 145 deletions

.github/workflows/github-release.yml

Lines changed: 8 additions & 22 deletions
@@ -5,27 +5,13 @@ on:
       - main
     types:
       - closed
-permissions:
-  contents: write
-  actions: write
 jobs:
   release:
-    if: github.event.pull_request.merged == true && !contains(github.event.pull_request.title, '[skip-release]')
-    runs-on: ubuntu-24.04
-    steps:
-      - name: Checkout
-        uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5
-        with:
-          fetch-depth: 0
-
-      - name: install autotag binary
-        run: curl -sL https://git.io/autotag-install | sudo sh -s -- -b /usr/bin
-
-      - name: create release
-        run: |-
-          TAG=$(autotag)
-          git push origin v$TAG
-          gh release create v$TAG --title "v$TAG" --generate-notes
-        env:
-          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
+    if: github.event.pull_request.merged == true && !contains(github.event.pull_request.title, 'skip-release')
+    uses: libops/actions/.github/workflows/bump-release.yaml@main
+    with:
+      prefix: v
+    permissions:
+      contents: write
+      actions: write
+    secrets: inherit

.github/workflows/lint-test.yml

Lines changed: 25 additions & 3 deletions
@@ -53,14 +53,36 @@ jobs:
         env:
           CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
 
-  integration-test:
+  integration-test-latest:
     needs: [run]
     permissions:
       contents: read
     runs-on: ubuntu-24.04
     strategy:
       matrix:
-        traefik: [v2.11, v3.0, v3.1, v3.2, v3.3, v3.4]
+        traefik: [latest]
+    steps:
+      - uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5
+
+      - name: run
+        run: go run test.go
+        working-directory: ./ci
+        env:
+          TRAEFIK_TAG: ${{ matrix.traefik }}
+
+      - name: cleanup
+        if: ${{ always() }}
+        run: docker compose logs --tail 100 nginx nginx2 traefik && docker compose down
+        working-directory: ./ci
+
+  integration-test:
+    needs: [integration-test-latest]
+    permissions:
+      contents: read
+    runs-on: ubuntu-24.04
+    strategy:
+      matrix:
+        traefik: [v2.11, v3.0, v3.1, v3.2, v3.3, v3.4, v3.5]
     steps:
       - uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5
 
@@ -72,5 +94,5 @@ jobs:
 
       - name: cleanup
         if: ${{ always() }}
-        run: docker compose down
+        run: docker compose logs --tail 100 nginx nginx2 traefik && docker compose down
         working-directory: ./ci

.traefik.yml

Lines changed: 1 addition & 0 deletions
@@ -11,3 +11,4 @@ testData:
   CaptchaProvider: turnstile
   SiteKey: 1x00000000000000000000AA
   SecretKey: 1x0000000000000000000000000000000AA
+  EnableStateReconciliation: "false"

CLAUDE.md

Lines changed: 14 additions & 4 deletions
@@ -16,9 +16,10 @@ This is a Traefik middleware plugin that protects websites from bot traffic by c
 - `CaptchaProtect` struct: Main middleware handler with rate limiting, bot detection, and challenge serving
 - `Config` struct: Configuration from Traefik labels
 - Three in-memory caches (using `github.com/patrickmn/go-cache`):
-  - `rateCache`: Tracks request counts per subnet
+  - `rateCache`: Tracks request counts per subnet (TTL = `window` config value)
   - `verifiedCache`: Stores IPs that have passed challenges (24h default TTL)
-  - `botCache`: Caches reverse DNS lookups for bot verification
+  - `botCache`: Caches reverse DNS lookups for bot verification (1h TTL)
+  - **Why go-cache instead of sync.Map?** The plugin requires automatic TTL-based expiration for all caches. `sync.Map` has no built-in expiration mechanism, requiring manual cleanup goroutines. `go-cache` provides thread-safe maps with automatic expiration and cleanup.
 
 ### Request Flow Decision Tree
 
@@ -120,9 +121,18 @@ Regex is significantly slower (~41ns vs ~3.4ns per operation) - see README bench
 ### State Persistence
 
 When `persistentStateFile` is configured:
-- State saves every 1 minute to JSON file (`saveState()` at `main.go:695-727`)
-- On startup, loads previous state from file (`loadState()` at `main.go:729-756`)
+- State saves every 10 seconds (with 0-2s random jitter) to JSON file (`saveState()` at `main.go:716-746`)
+- Uses file locking (`.lock` files) to prevent concurrent writes (`internal/state/state.go:61-129`)
+- On startup, loads previous state from file (`loadState()` at `main.go:729-761`)
 - Contains: rate limits per subnet, bot verification cache, verified IPs
+- **Important**: Each middleware instance runs its own save goroutine. If multiple instances share the same `persistentStateFile`, they will write more frequently (e.g., 2 instances = writes every ~5 seconds)
+- **State Reconciliation**: When `enableStateReconciliation: "true"`, each save performs a read-modify-write cycle to merge state from other instances. This adds I/O overhead but prevents data loss in multi-instance deployments (see `internal/state/state.go:86-100`)
+
+**Why not Redis?** Traefik plugins are loaded via Yaegi (a Go interpreter), which has significant limitations:
+- Yaegi cannot interpret Go packages that use `unsafe`, cgo, or complex reflection patterns
+- Popular Redis clients like `go-redis/redis` are incompatible with Yaegi
+
+**Current solution**: File-based persistence with reconciliation avoids these issues. Local caches remain fast (no network overhead), state saves are batched (every 10s), and reconciliation handles conflicts without complex coordination. The tradeoff is accepting slightly stale data across instances (max 10s delay) rather than the complexity and performance cost of real-time Redis synchronization.
 
 ### Good Bot Detection
 

README.md

Lines changed: 1 addition & 0 deletions
@@ -119,6 +119,7 @@ services:
 | `enableStatsPage` | `string` | `"false"` | Allows `exemptIps` to access `/captcha-protect/stats` to monitor the rate limiter. |
 | `logLevel` | `string` | `"INFO"` | Log level for the middleware. Options: `ERROR`, `WARNING`, `INFO`, or `DEBUG`. |
 | `persistentStateFile` | `string` | `""` | File path to persist rate limiter state across Traefik restarts. In Docker, mount this file from the host. |
+| `enableStateReconciliation` | `string` | `"false"` | When `"true"`, reads and merges disk state before each save to prevent multiple instances from overwriting data. Adds extra I/O overhead. Only enable for multi-instance deployments sharing state. |
 
 
 ### Good Bots

ci/.env

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-TRAEFIK_TAG=v3.3.3
+TRAEFIK_TAG=v3.5
 NGINX_TAG=1.27.4-alpine3.21
 TURNSTILE_SITE_KEY=1x00000000000000000000AA
 TURNSTILE_SECRET_KEY=1x0000000000000000000000000000000AA

ci/docker-compose.yml

Lines changed: 31 additions & 1 deletion
@@ -21,14 +21,44 @@ services:
       traefik.http.middlewares.captcha-protect.plugin.captcha-protect.goodBots: ""
       traefik.http.middlewares.captcha-protect.plugin.captcha-protect.protectRoutes: "/"
       traefik.http.middlewares.captcha-protect.plugin.captcha-protect.persistentStateFile: "/tmp/state.json"
+      traefik.http.middlewares.captcha-protect.plugin.captcha-protect.enableStateReconciliation: "true"
     healthcheck:
       test: curl -fs http://localhost/healthz | grep -q OK || exit 1
     volumes:
       - ./conf/nginx/default.conf:/etc/nginx/conf.d/default.conf:r
     networks:
       default:
         aliases:
-          - nginx
+          - nginx
+  nginx2:
+    image: nginx:${NGINX_TAG}
+    labels:
+      traefik.enable: true
+      traefik.http.routers.nginx2.entrypoints: http
+      traefik.http.routers.nginx2.service: nginx2
+      traefik.http.routers.nginx2.rule: Host(`localhost`) && PathPrefix(`/app2`)
+      traefik.http.services.nginx2.loadbalancer.server.port: 80
+      traefik.http.routers.nginx2.middlewares: captcha-protect@docker
+      traefik.http.middlewares.captcha-protect.plugin.captcha-protect.captchaProvider: turnstile
+      traefik.http.middlewares.captcha-protect.plugin.captcha-protect.window: 120
+      traefik.http.middlewares.captcha-protect.plugin.captcha-protect.rateLimit: ${RATE_LIMIT}
+      traefik.http.middlewares.captcha-protect.plugin.captcha-protect.siteKey: ${TURNSTILE_SITE_KEY}
+      traefik.http.middlewares.captcha-protect.plugin.captcha-protect.secretKey: ${TURNSTILE_SECRET_KEY}
+      traefik.http.middlewares.captcha-protect.plugin.captcha-protect.enableStatsPage: "true"
+      traefik.http.middlewares.captcha-protect.plugin.captcha-protect.ipForwardedHeader: "X-Forwarded-For"
+      traefik.http.middlewares.captcha-protect.plugin.captcha-protect.logLevel: "DEBUG"
+      traefik.http.middlewares.captcha-protect.plugin.captcha-protect.goodBots: ""
+      traefik.http.middlewares.captcha-protect.plugin.captcha-protect.protectRoutes: "/"
+      traefik.http.middlewares.captcha-protect.plugin.captcha-protect.persistentStateFile: "/tmp/state.json"
+      traefik.http.middlewares.captcha-protect.plugin.captcha-protect.enableStateReconciliation: "true"
+    healthcheck:
+      test: curl -fs http://localhost/healthz | grep -q OK || exit 1
+    volumes:
+      - ./conf/nginx/default.conf:/etc/nginx/conf.d/default.conf:r
+    networks:
+      default:
+        aliases:
+          - nginx2
   traefik:
     image: traefik:${TRAEFIK_TAG}
     command: >-

ci/test.go

Lines changed: 42 additions & 22 deletions
@@ -25,7 +25,6 @@ var (
 
 const numIPs = 100
 const parallelism = 10
-const expectedRedirectURL = "http://localhost/challenge?destination=%2F"
 
 func main() {
 	_ips := []string{
@@ -48,24 +47,19 @@ func main() {
 	fmt.Println("Bringing traefik/nginx online")
 	runCommand("docker", "compose", "up", "-d")
 	waitForService("http://localhost")
+	waitForService("http://localhost/app2")
 
 	fmt.Printf("Making sure %d attempt(s) pass\n", rateLimit)
-	runParallelChecks(ips, rateLimit)
+	runParallelChecks(ips, rateLimit, "http://localhost")
 
-	fmt.Printf("Making sure attempt #%d causes a redirect to the challenge page\n", rateLimit+1)
-	ensureRedirect(ips)
+	time.Sleep(cp.StateSaveInterval + cp.StateSaveJitter + (1 * time.Second))
+	runCommand("jq", ".", "tmp/state.json")
 
-	fmt.Println("Sleeping for 2m")
-	time.Sleep(125 * time.Second)
-	fmt.Println("Making sure one attempt passes after 2m window")
-	runParallelChecks(ips, 1)
-	fmt.Println("All good 🚀")
+	fmt.Printf("Making sure attempt #%d causes a redirect to the challenge page\n", rateLimit+1)
+	ensureRedirect(ips, "http://localhost")
 
-	// make sure the state has time to save
-	fmt.Println("Waiting for state to save")
-	runCommand("jq", ".", "tmp/state.json")
-	time.Sleep(80 * time.Second)
-	runCommand("jq", ".", "tmp/state.json")
+	fmt.Println("\nTesting state sharing between nginx instances...")
+	testStateSharing(ips)
 
 	runCommand("docker", "container", "stats", "--no-stream")
 
@@ -138,7 +132,7 @@ func waitForService(url string) {
 	}
 }
 
-func runParallelChecks(ips []string, rateLimit int) {
+func runParallelChecks(ips []string, rateLimit int, url string) {
 	var wg sync.WaitGroup
 	sem := make(chan struct{}, parallelism)
 
@@ -151,7 +145,7 @@ func runParallelChecks(ips []string, rateLimit int) {
 				defer func() { <-sem }()
 
 				fmt.Printf("Checking %s\n", ip)
-				output := httpRequest(ip)
+				output := httpRequest(ip, url)
 				if output != "" {
 					slog.Error("Unexpected output", "ip", ip, "output", output)
 					os.Exit(1)
@@ -164,21 +158,47 @@ func runParallelChecks(ips []string, rateLimit int) {
 	wg.Wait()
 }
 
-func ensureRedirect(ips []string) {
+func ensureRedirect(ips []string, url string) {
+	expectedURL := url + "/challenge?destination=%2F"
+	if url != "http://localhost" {
+		// For /app2, the destination should be the app2 path
+		expectedURL = "http://localhost/challenge?destination=%2Fapp2"
+	}
+
 	for _, ip := range ips {
 		fmt.Printf("Checking %s\n", ip)
-		output := httpRequest(ip)
+		output := httpRequest(ip, url)
 
-		if output != expectedRedirectURL {
-			slog.Error("Unexpected output", "ip", ip, "output", output)
+		if output != expectedURL {
+			slog.Error("Unexpected output", "ip", ip, "output", output, "expected", expectedURL)
 			os.Exit(1)
 		}
 
 		fmt.Printf("Got a redirect! %s\n", output)
 	}
 }
 
-func httpRequest(ip string) string {
+func testStateSharing(ips []string) {
+	// Use first IP to test state sharing
+	testIP := ips[0]
+
+	fmt.Printf("Testing with IP: %s\n", testIP)
+
+	// The IP should already be at rate limit from previous tests on localhost/
+	// Now verify it's also rate limited on localhost/app2 (shared state)
+	fmt.Println("Verifying IP is rate limited on /app2 (state should be shared)...")
+	output := httpRequest(testIP, "http://localhost/app2")
+	expectedURL := "http://localhost/challenge?destination=%2Fapp2"
+
+	if output != expectedURL {
+		slog.Error("State NOT shared between instances!", "ip", testIP, "output", output, "expected", expectedURL)
+		os.Exit(1)
+	}
+
+	fmt.Println("✓ State is correctly shared between nginx instances!")
+}
+
+func httpRequest(ip, url string) string {
 	client := &http.Client{
 		CheckRedirect: func(req *http.Request, via []*http.Request) error {
 			// Capture the redirect URL and stop following it
@@ -189,7 +209,7 @@ func httpRequest(ip string) string {
 		},
 	}
 
-	req, err := http.NewRequest("GET", "http://localhost", nil)
+	req, err := http.NewRequest("GET", url, nil)
 	if err != nil {
 		slog.Error("Failed to create request", "err", err)
 		os.Exit(1)
