Cache warmer thundering herd on rolling deploys (N pods warming in parallel)

## Summary

On Plone-pod startup, `zodb_pgjsonb.cache_warmer` preloads the top-N most-referenced objects into the L2 cache (target=8803 in the observed production site). This is valuable for request-latency SLOs *after* startup, but **every pod runs the warmer independently in its own process**, multiplying DB load by the replica count during the startup window.

Observed on aaf-prod (6 Plone backend pods, 3.9M-row catalog, b54 deploy):

```
INFO  zodb_pgjsonb.storage:356  Cache warmer started (target=8803, decay=0.8)
INFO  zodb_pgjsonb.cache_warmer:165  Cache warmer: loaded 8803 objects into L2
```

When a rolling deploy lands 3 pods within ~30s, the cache warmer runs 3× concurrently — each pod issues `target` object fetches against the same PostgreSQL primary. Combined with the parallel `plone.pgcatalog` schema-check + `ANALYZE object_state` chatter, this pushes the primary CPU from its baseline ~10 % to saturation (3.7/4 cores observed), feeding the cascade of slow `/ok` probes → sick backends → 503s until queues drain.

## Why it's worse than it looks

- The L2 cache is **per-process**. So the warmer *does* need to populate each pod's cache. You can't just run it on one pod and have others benefit.
- But **N pods starting concurrently** all issuing the same preload queries produces no additional benefit over serialised preloading — they do the same work on the same hot rows, paying N× DB cost for no per-pod advantage.

## Ideas

Three independent options, usable alone or in any combination:

### 1. Defer cache warmer until after pod is Ready

Currently the warmer runs during Plone startup, on the main startup path. The pod is not serving traffic yet, but its readiness probe is gated on `Zope: Ready to handle requests`, which fires *after* the warmer completes. That means the warmer load lands in the "vulnerable" startup window where DB headroom is tight.

Change: spawn the warmer as a background thread / async task that runs **after** the HTTP server starts accepting traffic. The pod reports Ready as soon as the server binds, and the warmer fills L2 in the background over the following seconds.

Trade-off: first few requests land on a cold cache → slightly slower until warmer completes. Generally acceptable because Varnish absorbs most anonymous traffic.
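
A minimal sketch of the deferral, assuming the warm-up can be wrapped in a zero-argument callable. `start_warmer_in_background` and the `warm` callable are hypothetical names for illustration, not part of `zodb_pgjsonb`; where exactly this would be hooked in after the server binds is left open.

```python
import logging
import threading
from typing import Callable

log = logging.getLogger("cache_warmer_sketch")


def start_warmer_in_background(warm: Callable[[], None]) -> threading.Thread:
    """Run `warm` on a daemon thread so startup does not wait for it."""

    def _run() -> None:
        try:
            warm()
        except Exception:
            # A failed warm-up should only cost cache hits, never the pod.
            log.exception("cache warm-up failed; continuing with a cold L2")

    # daemon=True so a still-running warmer never blocks process shutdown.
    thread = threading.Thread(target=_run, name="cache-warmer", daemon=True)
    thread.start()
    return thread
```

The thread is returned so callers (or tests) can `join()` it; in the pod itself nothing needs to wait on it.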

### 2. Jittered / rate-limited warm-up

Even as a background task, N pods starting concurrently still hit the DB simultaneously. Add random jitter `(0 .. N×interval)` before warmer start + token-bucket pacing so at any moment only K pods are actively warming.

Jitter alone (cheap, single-line change) would already break the thundering-herd on rolling deploys: pod 1 starts warming at 0s, pod 2 at ~5s, pod 3 at ~12s, etc., so the DB load is smeared instead of stacked.
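
As a rough sketch of the jitter-only variant: sleep for a uniformly random delay, then warm. The function names and the `max_jitter_s` parameter are assumptions for illustration; the real warmer entry point may look different.

```python
import random
import time
from typing import Callable, Optional


def jittered_delay(max_jitter_s: float, rng: Optional[random.Random] = None) -> float:
    """Pick a delay uniformly from [0, max_jitter_s]."""
    rng = rng or random.Random()
    return rng.uniform(0.0, max_jitter_s)


def warm_after_jitter(
    warm: Callable[[], None],
    max_jitter_s: float,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> float:
    """Sleep for a random delay, then run the warm-up. Returns the delay used."""
    delay = jittered_delay(max_jitter_s, rng)
    sleep(delay)  # injectable so tests do not actually wait
    warm()
    return delay
```

Each pod picks its delay independently, so no coordination is needed; the cost is that two pods can still overlap by chance.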

Pacing (slightly more involved) would cap concurrent warmers cluster-wide — e.g. via an advisory lock slot allocator or a small Redis-less leader election.
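
One possible shape for the advisory-lock slot allocator, sketched under assumptions: a DB-API cursor with `%s` parameters (as in psycopg), PostgreSQL's two-key `pg_try_advisory_lock` / `pg_advisory_unlock`, and a hypothetical lock namespace constant. With `slots=K`, at most K pods hold a slot at once; a pod that gets `None` would retry later or skip warming.

```python
from typing import Optional

# Hypothetical namespace key; any int4 value not used by other advisory locks.
WARMER_LOCK_NAMESPACE = 0x5A0D


def acquire_warm_slot(cursor, slots: int,
                      namespace: int = WARMER_LOCK_NAMESPACE) -> Optional[int]:
    """Try each slot in turn; return the slot number obtained, or None."""
    for slot in range(slots):
        # Non-blocking: returns true only if this session now holds the lock.
        cursor.execute("SELECT pg_try_advisory_lock(%s, %s)", (namespace, slot))
        if cursor.fetchone()[0]:
            return slot
    return None


def release_warm_slot(cursor, slot: int,
                      namespace: int = WARMER_LOCK_NAMESPACE) -> bool:
    """Release a slot taken by acquire_warm_slot on the same session."""
    cursor.execute("SELECT pg_advisory_unlock(%s, %s)", (namespace, slot))
    return bool(cursor.fetchone()[0])
```

These are session-level locks, so the slot has to be acquired and released on the same connection, and it is freed automatically if that connection drops mid-warm.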

### 3. Opt-out / config knob

Allow deployments to disable the warmer entirely via environment variable or config flag for situations where Varnish absorbs most traffic anyway and the cold-cache penalty on the first few authenticated requests is acceptable.
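
A sketch of what the environment-variable form could look like. The variable name `ZODB_PGJSONB_CACHE_WARMER` is made up for this example and does not exist in `zodb_pgjsonb` today; the default stays "enabled" so current deployments keep their behaviour.

```python
import os
from typing import Mapping

# Values treated as "off"; anything else (or unset) leaves the warmer on.
_FALSE_VALUES = {"0", "false", "no", "off"}


def warmer_enabled(
    env: Mapping[str, str] = os.environ,
    var: str = "ZODB_PGJSONB_CACHE_WARMER",  # hypothetical name
) -> bool:
    """Return False only when the variable is explicitly set to an off value."""
    raw = env.get(var)
    if raw is None:
        return True
    return raw.strip().lower() not in _FALSE_VALUES
```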

## Observed numbers

On our baseline (post-warm, steady state):
- DB primary: ~0.4 cores
- Warmer per pod: ~8800 queries in ~5s ≈ 1700 qps per warmer

6 pods starting in a 30s window: worst-case burst (all warm-ups overlapping) of 6 × ~1700 ≈ 10 kqps against a DB that normally handles ~100 qps. That's the ~90th-percentile CPU spike on every redeploy.
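
The back-of-envelope numbers can be reproduced as follows, taking `target=8803` from the log and the ~5s warm duration as given. The helper names are illustrative only. Unrounded, this comes to about 1760 qps per pod and about 10.6 kqps for six fully overlapping warmers, the same order as the rounded figures above.

```python
def warmer_qps(target: int, duration_s: float) -> float:
    """Average query rate of one warmer: objects fetched / seconds taken."""
    return target / duration_s


def worst_case_burst_qps(pods: int, target: int, duration_s: float) -> float:
    """Upper bound: every pod's warm-up window overlaps completely."""
    return pods * warmer_qps(target, duration_s)


per_pod = warmer_qps(8803, 5.0)
burst = worst_case_burst_qps(6, 8803, 5.0)
print(f"per pod: {per_pod:.0f} qps, 6 pods stacked: {burst / 1000:.1f} kqps")
```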

## Related

- plone-pgcatalog startup gate proposal (filed separately): reduces the DDL-probe herd but not the cache-warmer herd.

## Environment

- zodb-pgjsonb 1.11.x
- PostgreSQL 16 (CloudNativePG), 3.9M rows
- aaf-prod, 6 backend pods

