Skip to content

fix: CacheWarmer hardening (#65)#68

Merged
jensens merged 7 commits into
mainfrom
fix/warmer-hardening
Apr 23, 2026
Merged

fix: CacheWarmer hardening (#65)#68
jensens merged 7 commits into
mainfrom
fix/warmer-hardening

Conversation

@jensens
Copy link
Copy Markdown
Member

@jensens jensens commented Apr 23, 2026

Closes #65.

Summary

Three hardening items + one coverage gap, all flagged during the #63 final review and filed as follow-ups in #65.

I1 — silent error swallow

The warmer's inline _current_max_tid closure on PGJsonbStorage.__init__ caught any exception and silently returned 0, making operator-visible failures disappear (closed connection, dropped `transaction_log`, wrong role). Replaced with a module-level _read_max_tid(conn) helper (raises to caller) and a public PGJsonbStorage.current_max_tid() method that logs at WARNING and returns None. The warmer now skips warmup gracefully on None instead of installing a fabricated `consensus=0`.

I2 — DRY the 3 copies of the SELECT

The same SELECT COALESCE(MAX(tid), 0) FROM transaction_log lived in three places with three different error-handling policies (_restore_state, poll_invalidations, and the inline closure). All three now route through _read_max_tid. Each caller keeps its own error policy: startup + poll_invalidations let the exception propagate (fatal paths), warmer uses the wrapped current_max_tid() (best-effort path). grep -rn "SELECT COALESCE(MAX(tid)" src/ returns exactly one hit (inside the helper).

I3 — startup race

Warmer samples current_tid=T0, a concurrent instance poll advances shared-cache consensus to T1 > T0, warmer's poll_advance(T0, []) becomes a no-op, every subsequent set(..., polled_tid=T0) is rejected because T0 < consensus=T1. Mitigation:

  • After poll_advance(current_tid, []), re-read `self._shared_cache.consensus_tid` → effective_tid.
  • Use effective_tid as `polled_tid` for subsequent set() calls so they pass the gate even when the sampled TID is now stale.

Plus observability: SharedLoadCache.set() now returns `bool` (accepted/rejected). The warmer counts `written` via the bool and logs a WARNING when `written == 0` despite a non-empty result set — so if the mitigation's window isn't enough (consensus advances mid-loop), the silent-failure is no longer silent.

Required API additions:

  • SharedLoadCache.consensus_tid — read-only property (lock-guarded).
  • SharedLoadCache.set() — returns `bool`.

Both are additive. Existing callers (instance.load, load_multiple) discard the new return value, so no behaviour change elsewhere.

Coverage gap

New regression tests in TestMainStorageFinishAdvancesConsensus pin the `PGJsonbStorage._finish → shared.poll_advance` wiring (direct-use write path, exercised only implicitly by conformance tests before this PR).

Commits

6189def docs: CHANGES entry for 1.12.1 warmer hardening (#65)
7847eda test(shared-cache): pin main-storage _finish → shared.poll_advance (#65)
cc89d46 fix(warmer): recover from consensus race; warn on zero accepted writes (#65)
de91e98 feat(shared-cache): consensus_tid property + set() returns bool (#65)
766bde1 refactor(instance): hoist _read_max_tid import to module top (#65)
0a76574 refactor(warmer): consolidate MAX(tid) query + log/skip on failure (#65)
b29e89c docs(plan): warmer hardening plan (#65)

Each commit is green (verified for `git bisect`).

Test plan

  • 508 passing, 0 failing, 72 s (up from 493 pre-change — +15 new tests across 5 tasks, zero regressions)
  • ZODB conformance tests unaffected
  • Grep: exactly one remaining inline `SELECT COALESCE(MAX(tid)` (the helper body)
  • Full `ruff check` + `ruff format --check` clean on all touched files

Release note

Tagged for 1.12.1. This PR unblocks the release that was held pending #65.

🤖 Generated with Claude Code

jensens and others added 7 commits April 23, 2026 12:08
Add module-level _read_max_tid(conn) helper and a public
PGJsonbStorage.current_max_tid() method that logs on failure and
returns None.  Route _restore_state, poll_invalidations, and the
warmer's load_current_tid_fn through them.  Warmer skips warmup
when current_max_tid returns None instead of installing consensus=0.

Closes I1 and I2 of #65.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Matches the existing import-style block (6 other names from .storage
already imported at module top). Avoids a per-call name lookup in
the poll_invalidations hot path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both are wanted by the cache warmer's race-detection fix (next
commit).  The existing instance.load / load_multiple call sites
ignore the new return value; no behaviour change there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#65)

Closes I3 of #65.  After poll_advance, re-read consensus_tid and use
that as the polled_tid for subsequent set() calls — if another
instance advanced consensus ahead of our sampled TID, this ensures
our sets still pass the gate.  Track accepted vs. attempted writes
via the new set() return value; log a WARNING when the entire
warmup was rejected despite non-empty results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Coverage gap flagged in #63 final review.  Pins the invariant that
the direct-use write path advances consensus and invalidates
changed zoids in the shared cache, independent of any instance's
tpc_finish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jensens jensens merged commit b9ade3f into main Apr 23, 2026
5 checks passed
@jensens jensens deleted the fix/warmer-hardening branch April 23, 2026 11:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CacheWarmer: follow-up hardening after #63 (error-swallow, DI seam, startup race)

1 participant