fix(rbac,nr-test): grant worker jobs/pods/events read; resolve NR test conflict (#71)

mastermanas805 · claude · web-flow · commit 4a61733b6c13 · 2026-06-10T20:06:22.000Z
FIX 1 (P1, rule-27 regression live in prod): the instant-worker SA in instant-infra could only `get apps/deployments`, so DeployStatusReconciler's `BatchV1().Jobs(ns).Get` was denied every ~30s (jobs.deploy_status_reconcile.job_query_failed: cannot get resource "jobs" in API group "batch") and the rule-27 silent-build-failure detection path was disabled in prod. Extend the existing instant-worker-deploy-reader ClusterRole with the minimal read-only verbs the deploy status + autopsy path actually call (no create/delete/patch/watch): batch/jobs get ← deploy_status_reconcile.go:256 Jobs().Get pods list ← deploy_failure_autopsy.go:208 Pods().List pods/log get ← deploy_failure_autopsy.go:220,230 Pods().GetLogs events list ← deploy_failure_autopsy.go:215 Events().List ClusterRole-scoped because per-app namespaces (instant-deploy-<appID>) are created on demand and can't be enumerated at bind time. RBAC-only — no Deployment / secret / image change. APPLY-CHECKLIST.md gains an operator apply + `kubectl auth can-i` verification row (this repo has no auto-apply). FIX 2 (P2): newrelic/tests/apply.test.sh carried committed merge-conflict markers around line 174 (from PR #14) plus stale hardcoded artifact counts (33 dry-run entries, 26/16 JSON files), leaving the NR apply test suite un-runnable. Resolve the markers and replace every hardcoded count with one derived from the on-disk JSON glob (15 dashboards + 83 alerts = 98 today), so the assertions track the registry instead of going stale again. Suite now passes 49/0. Local validation: kubeconform -strict (5/5 valid) + CI-equivalent yamllint on the RBAC manifest; bash -n + full run of the NR test suite (49 passed, 0 failed). Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
diff --git a/k8s/APPLY-CHECKLIST.md b/k8s/APPLY-CHECKLIST.md
@@ -254,6 +254,64 @@ doesn't work for wildcards). The runbook has the verification commands.
 
 ---
 
+## Worker deploy-status RBAC — jobs/pods/events read grant (2026-06-11)
+
+`k8s/worker-rbac.yaml`'s `instant-worker-deploy-reader` ClusterRole was
+extended to grant the worker SA (`instant-worker` in `instant-infra`) the
+read-only k8s access its deploy-status reconciler + failure-autopsy path
+actually need. Before this, the SA could only `get apps/deployments`, so
+the rule-27 silent-build-failure detector logged
+`jobs.deploy_status_reconcile.job_query_failed … cannot get resource
+"jobs" in API group "batch"` every ~30s and that whole detection path was
+**disabled in prod**.
+
+The added verbs map 1:1 to live k8s calls (no over-grant — read-only, no
+create/delete/patch/watch):
+
+| apiGroup / resource | verb | worker call (file:line) |
+|---|---|---|
+| `batch` / `jobs`     | `get`  | `deploy_status_reconcile.go:256` `BatchV1().Jobs(ns).Get` |
+| `""` / `pods`        | `list` | `deploy_failure_autopsy.go:208` `CoreV1().Pods(ns).List` |
+| `""` / `pods/log`    | `get`  | `deploy_failure_autopsy.go:220,230` `CoreV1().Pods(ns).GetLogs` |
+| `""` / `events`      | `list` | `deploy_failure_autopsy.go:215` `CoreV1().Events(ns).List` |
+
+(`apps/deployments get` was already granted and is unchanged.)
+
+This is an RBAC-only change — no Deployment, no secrets, no image tag. It
+is safe to `kubectl apply` directly (none of the app.yaml clobber hazards
+above apply to a ClusterRole/ClusterRoleBinding/ServiceAccount file).
+
+```bash
+# 1. Confirm context
+kubectl config current-context        # expect do-nyc3-instant-prod
+
+# 2. Dry-run server-side and read the diff (RBAC verbs only should change)
+kubectl apply --dry-run=server -f k8s/worker-rbac.yaml
+
+# 3. Apply
+kubectl apply -f k8s/worker-rbac.yaml
+
+# 4. Verify the SA can now reach jobs/pods/events in a deploy namespace.
+#    (Use any live instant-deploy-* ns; RBAC is cluster-scoped so the
+#    namespace just has to exist. All four must print "yes".)
+NS=$(kubectl get ns -o name | grep -m1 'instant-deploy-' | cut -d/ -f2)
+kubectl auth can-i get  jobs.batch  --as=system:serviceaccount:instant-infra:instant-worker -n "$NS"
+kubectl auth can-i list pods        --as=system:serviceaccount:instant-infra:instant-worker -n "$NS"
+kubectl auth can-i get  pods/log    --as=system:serviceaccount:instant-infra:instant-worker -n "$NS"
+kubectl auth can-i list events      --as=system:serviceaccount:instant-infra:instant-worker -n "$NS"
+
+# 5. Confirm the error stops within one reconcile tick (~30s). This should
+#    print nothing after the apply:
+kubectl logs -n instant-infra deploy/instant-worker --since=2m \
+  | grep -i 'job_query_failed' || echo "ok — no job_query_failed in last 2m"
+```
+
+Rollback: `git revert` the merge commit and re-apply — the verbs are
+purely additive, so reverting only narrows the grant back to
+`apps/deployments get` (re-disables rule-27 detection, no other effect).
+
+---
+
 ## Related files
 
 - `README.md` — secrets clobber warning (the same class of bug, but for
diff --git a/k8s/worker-rbac.yaml b/k8s/worker-rbac.yaml
@@ -47,15 +47,46 @@ subjects:
     namespace: instant-infra
 
 ---
-# ClusterRole granting the worker permission to read Deployment objects across
-# all "instant-deploy-*" namespaces. Used by DeployStatusReconciler (see
-# worker/internal/jobs/deploy_status_reconcile.go) to reconcile the
-# deployments.status DB column from live k8s state every 30s.
+# ClusterRole granting the worker the read-only k8s access needed to reconcile
+# deployment status AND capture failure autopsies across all "instant-deploy-*"
+# (and "instant-stack-*") namespaces. Used by DeployStatusReconciler +
+# the deploy-failure autopsy client, both wired into the same River worker via
+# WithAutopsyK8s (see worker/internal/jobs/deploy_status_reconcile.go and
+# worker/internal/jobs/deploy_failure_autopsy.go).
 #
 # This is intentionally cluster-scoped because per-app namespaces are created
 # on demand by the api process — we can't enumerate them at RBAC-bind time.
-# Read-only get on deployments.apps is the tightest verb set that makes the
-# reconciler work; the worker has no need to list/watch/patch and gets none.
+#
+# Every verb below maps to exactly one live k8s call; nothing is over-granted
+# (no create/delete/patch/watch — the worker only reads):
+#
+#   apps/deployments       get   ← deploy_status_reconcile.go:251
+#                                   AppsV1().Deployments(ns).Get(...)  — roll the
+#                                   deployments.status DB column forward every 30s.
+#
+#   batch/jobs             get   ← deploy_status_reconcile.go:256
+#                                   BatchV1().Jobs(ns).Get(...)  — detect a Failed
+#                                   Kaniko build Job (BackoffLimitExceeded /
+#                                   DeadlineExceeded) so the row flips to `failed`
+#                                   even when the build pod is GC'd before the
+#                                   runtime Deployment exists. THIS is the grant
+#                                   whose absence disabled CLAUDE.md rule-27
+#                                   silent-build-failure detection in prod
+#                                   (worker logged job_query_failed every tick).
+#
+#   pods                   list  ← deploy_failure_autopsy.go:208
+#                                   CoreV1().Pods(ns).List(...)  — find the failed
+#                                   build/runtime pods by label selector.
+#
+#   pods/log               get   ← deploy_failure_autopsy.go:220,230
+#                                   CoreV1().Pods(ns).GetLogs(...)  — capture the
+#                                   last_lines of build-pod logs for the autopsy
+#                                   row + the deploy.failed audit_log / email.
+#
+#   events                 list  ← deploy_failure_autopsy.go:215
+#                                   CoreV1().Events(ns).List(...)  — read k8s
+#                                   Events (ImagePullBackOff, FailedScheduling, …)
+#                                   to enrich the autopsy reason/hint.
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRole
 metadata:
@@ -64,6 +95,18 @@ rules:
   - apiGroups: ["apps"]
     resources: ["deployments"]
     verbs: ["get"]
+  - apiGroups: ["batch"]
+    resources: ["jobs"]
+    verbs: ["get"]
+  - apiGroups: [""]
+    resources: ["pods"]
+    verbs: ["list"]
+  - apiGroups: [""]
+    resources: ["pods/log"]
+    verbs: ["get"]
+  - apiGroups: [""]
+    resources: ["events"]
+    verbs: ["list"]
 
 ---
 # Bind the deploy-reader ClusterRole to the instant-worker ServiceAccount.
diff --git a/newrelic/tests/apply.test.sh b/newrelic/tests/apply.test.sh
@@ -93,7 +93,7 @@ assert_eq "exit code is 1" "1" "$code"
 assert_contains "stderr mentions NEW_RELIC_ACCOUNT_ID" "NEW_RELIC_ACCOUNT_ID" "$out"
 
 # ----------------------------------------------------------------------------
-echo "test: --dry-run with both env vars set prints all 33 names, no API calls"
+echo "test: --dry-run with both env vars set prints all 98 names, no API calls"
 # ----------------------------------------------------------------------------
 out=$(env -i PATH="$PATH" NEW_RELIC_API_KEY=fake NEW_RELIC_ACCOUNT_ID=1234567 \
   "$APPLY" --dry-run 2>&1)
@@ -145,10 +145,12 @@ assert_contains "dry-run prints decrypt-burst alert"    "connection_url.decrypte
 assert_contains "dry-run prints deploy-by-team alert"   "deploy failure rate > 30% (1h) faceted"       "$out"
 assert_contains "dry-run prints backup-stuck alert"     "backup.requested with no follow-up"           "$out"
 
-# No real HTTP traffic — the [dry-run] tag must appear on every name
-# 12 dashboards + 21 alerts = 33 total after W10 follow-up.
+# No real HTTP traffic — the [dry-run] tag must appear on every name.
+# Count is derived from the JSON on disk (15 dashboards + 83 alerts = 98),
+# not a hand-typed list, so it stays correct as artifacts are added/removed.
+expected_count=$(ls "$NR_DIR"/dashboards/*.json "$NR_DIR"/alerts/*.json 2>/dev/null | wc -l | tr -d ' ')
 dryrun_count=$(echo "$out" | grep -c '^\[dry-run\]' || true)
-assert_eq "every name prefixed with [dry-run] (33 total)" "33" "$dryrun_count"
+assert_eq "every name prefixed with [dry-run] (one per JSON file)" "$expected_count" "$dryrun_count"
 
 # ----------------------------------------------------------------------------
 echo "test: --dry-run without env vars still validates JSON + warns"
@@ -158,24 +160,22 @@ code=$?
 assert_eq "exit code is 0 with no env (dry-run)" "0" "$code"
 assert_contains "warns about unset env" "warning" "$out"
 dryrun_count=$(echo "$out" | grep -c '^\[dry-run\]' || true)
-assert_eq "still prints 33 [dry-run] entries" "33" "$dryrun_count"
+assert_eq "still prints one [dry-run] entry per JSON file" "$expected_count" "$dryrun_count"
 
 # ----------------------------------------------------------------------------
 echo "test: every JSON file in dashboards/ and alerts/ parses cleanly"
 # ----------------------------------------------------------------------------
 broken=0
+total_json=0
 for f in "$NR_DIR"/dashboards/*.json "$NR_DIR"/alerts/*.json; do
+  total_json=$((total_json+1))
   if ! jq empty "$f" >/dev/null 2>&1; then
     echo "  FAIL invalid JSON: $f"
     broken=$((broken+1))
   fi
 done
 if [ "$broken" -eq 0 ]; then
-<<<<<<< HEAD
-  echo "  ok  all 26 JSON files parse"
-=======
-  echo "  ok  all 16 JSON files parse"
->>>>>>> b41339a (newrelic: dashboards + alerts for W7/W8/W9 audit kinds (W10 follow-up))
+  echo "  ok  all $total_json JSON files parse"
   PASS=$((PASS+1))
 else
   FAIL=$((FAIL+1))