From f6fdb22edc477e7b8318df3daeeb6d5f9ab16a54 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?F=C3=A9lix=20Saparelli?=
Date: Fri, 5 Jun 2026 22:49:44 +1200
Subject: [PATCH] docs(skill): flag that Ready phase isn't necessarily healthy
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

A Ready replica can have a stale Active restore and a growing
consecutiveRestoreFailures counter — every recent restore attempt has
failed, so its data is older than lastRestoreCompletedAt suggests. To
users "the replica isn't working" usually means stale data, not refused
connections.

Note this in the pgro-status skill's overview checks so future agents
cross-reference the failure counter against the last successful restore
before declaring a Ready replica healthy.
---
 .claude/skills/pgro-status/SKILL.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/.claude/skills/pgro-status/SKILL.md b/.claude/skills/pgro-status/SKILL.md
index 67b85b6..28c7392 100644
--- a/.claude/skills/pgro-status/SKILL.md
+++ b/.claude/skills/pgro-status/SKILL.md
@@ -36,6 +36,8 @@ Look for:
 - Restore objects: each replica should have **exactly one** `Active` restore in steady state. A transient `Pending` / `Restoring` / `Ready` / `Switching` restore is normal during a cycle. More than one `Active` indicates the sweep isn't pruning.
 - Pending pod count > 0 is worth digging into before reporting healthy — could be a scheduling problem (Karpenter, taints, resource pressure).
 
+**A `Ready` phase replica is not necessarily healthy.** `Ready` only means the operator's switchover state machine is at rest — the previous restore is still serving traffic. If `consecutiveRestoreFailures > 0` and growing, *every restore attempt since the last good one has failed*, so the data is staler than its `lastRestoreCompletedAt` claims. To users, "the replica isn't working" usually means the data is days behind, not that connections are refused. Always cross-check `consecutiveRestoreFailures` against `lastRestoreCompletedAt` and the replica's expected cadence before calling a `Ready` replica healthy.
+
 ### Phase 2 — per-replica detail
 
 For each replica that looks off — and whenever a thorough check is requested — fetch the key status fields and conditions: