From f6fdb22edc477e7b8318df3daeeb6d5f9ab16a54 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?F=C3=A9lix=20Saparelli?= <felix@bes.au>
Date: Fri, 5 Jun 2026 22:49:44 +1200
Subject: [PATCH] docs(skill): flag that Ready phase isn't necessarily healthy
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

A Ready replica can have a stale Active restore and a growing
consecutiveRestoreFailures counter: every recent restore attempt has
failed, so its data is older than lastRestoreCompletedAt suggests. To
users, "the replica isn't working" usually means stale data, not
refused connections. Note this in the pgro-status skill's overview
checks so future agents cross-reference the failure counter against
the last successful restore before declaring a Ready replica healthy.
---
 .claude/skills/pgro-status/SKILL.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/.claude/skills/pgro-status/SKILL.md b/.claude/skills/pgro-status/SKILL.md
index 67b85b6..28c7392 100644
--- a/.claude/skills/pgro-status/SKILL.md
+++ b/.claude/skills/pgro-status/SKILL.md
@@ -36,6 +36,8 @@ Look for:
 - Restore objects: each replica should have **exactly one** `Active` restore in steady state. A transient `Pending` / `Restoring` / `Ready` / `Switching` restore is normal during a cycle. More than one `Active` indicates the sweep isn't pruning.
 - Pending pod count > 0 is worth digging into before reporting healthy — could be a scheduling problem (Karpenter, taints, resource pressure).
 
+**A replica in the `Ready` phase is not necessarily healthy.** `Ready` only means the operator's switchover state machine is at rest, with the previous restore still serving traffic. If `consecutiveRestoreFailures` is greater than 0 and growing, *every restore attempt since the last good one has failed*, so the data is staler than its `lastRestoreCompletedAt` claims. To users, "the replica isn't working" usually means the data is days behind, not that connections are refused. Always cross-check `consecutiveRestoreFailures` against `lastRestoreCompletedAt` and the replica's expected cadence before calling a `Ready` replica healthy.
+
 ### Phase 2 — per-replica detail
 
 For each replica that looks off — and whenever a thorough check is requested — fetch the key status fields and conditions: