Skip to content

Commit 4a61733

Browse files
fix(rbac,nr-test): grant worker jobs/pods/events read; resolve NR test conflict (#71)
FIX 1 (P1, rule-27 regression live in prod): the instant-worker SA in instant-infra could only `get apps/deployments`, so DeployStatusReconciler's `BatchV1().Jobs(ns).Get` was denied every ~30s (jobs.deploy_status_reconcile.job_query_failed: cannot get resource "jobs" in API group "batch") and the rule-27 silent-build-failure detection path was disabled in prod. Extend the existing instant-worker-deploy-reader ClusterRole with the minimal read-only verbs the deploy status + autopsy path actually call (no create/delete/patch/watch): batch/jobs get ← deploy_status_reconcile.go:256 Jobs().Get pods list ← deploy_failure_autopsy.go:208 Pods().List pods/log get ← deploy_failure_autopsy.go:220,230 Pods().GetLogs events list ← deploy_failure_autopsy.go:215 Events().List ClusterRole-scoped because per-app namespaces (instant-deploy-<appID>) are created on demand and can't be enumerated at bind time. RBAC-only — no Deployment / secret / image change. APPLY-CHECKLIST.md gains an operator apply + `kubectl auth can-i` verification row (this repo has no auto-apply). FIX 2 (P2): newrelic/tests/apply.test.sh carried committed merge-conflict markers around line 174 (from PR #14) plus stale hardcoded artifact counts (33 dry-run entries, 26/16 JSON files), leaving the NR apply test suite un-runnable. Resolve the markers and replace every hardcoded count with one derived from the on-disk JSON glob (15 dashboards + 83 alerts = 98 today), so the assertions track the registry instead of going stale again. Suite now passes 49/0. Local validation: kubeconform -strict (5/5 valid) + CI-equivalent yamllint on the RBAC manifest; bash -n + full run of the NR test suite (49 passed, 0 failed). Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
1 parent 58b7bca commit 4a61733

3 files changed

Lines changed: 117 additions & 16 deletions

File tree

k8s/APPLY-CHECKLIST.md

Lines changed: 58 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -254,6 +254,64 @@ doesn't work for wildcards). The runbook has the verification commands.
254254

255255
---
256256

257+
## Worker deploy-status RBAC — jobs/pods/events read grant (2026-06-11)
258+
259+
`k8s/worker-rbac.yaml`'s `instant-worker-deploy-reader` ClusterRole was
260+
extended to grant the worker SA (`instant-worker` in `instant-infra`) the
261+
read-only k8s access its deploy-status reconciler + failure-autopsy path
262+
actually need. Before this, the SA could only `get apps/deployments`, so
263+
the rule-27 silent-build-failure detector logged
264+
`jobs.deploy_status_reconcile.job_query_failed … cannot get resource
265+
"jobs" in API group "batch"` every ~30s and that whole detection path was
266+
**disabled in prod**.
267+
268+
The added verbs map 1:1 to live k8s calls (no over-grant — read-only, no
269+
create/delete/patch/watch):
270+
271+
| apiGroup / resource | verb | worker call (file:line) |
272+
|---|---|---|
273+
| `batch` / `jobs` | `get` | `deploy_status_reconcile.go:256` `BatchV1().Jobs(ns).Get` |
274+
| `""` / `pods` | `list` | `deploy_failure_autopsy.go:208` `CoreV1().Pods(ns).List` |
275+
| `""` / `pods/log` | `get` | `deploy_failure_autopsy.go:220,230` `CoreV1().Pods(ns).GetLogs` |
276+
| `""` / `events` | `list` | `deploy_failure_autopsy.go:215` `CoreV1().Events(ns).List` |
277+
278+
(`apps/deployments get` was already granted and is unchanged.)
279+
280+
This is an RBAC-only change — no Deployment, no secrets, no image tag. It
281+
is safe to `kubectl apply` directly (none of the app.yaml clobber hazards
282+
above apply to a ClusterRole/ClusterRoleBinding/ServiceAccount file).
283+
284+
```bash
285+
# 1. Confirm context
286+
kubectl config current-context # expect do-nyc3-instant-prod
287+
288+
# 2. Dry-run server-side and read the diff (RBAC verbs only should change)
289+
kubectl apply --dry-run=server -f k8s/worker-rbac.yaml
290+
291+
# 3. Apply
292+
kubectl apply -f k8s/worker-rbac.yaml
293+
294+
# 4. Verify the SA can now reach jobs/pods/events in a deploy namespace.
295+
# (Use any live instant-deploy-* ns; RBAC is cluster-scoped so the
296+
# namespace just has to exist. All four must print "yes".)
297+
NS=$(kubectl get ns -o name | grep -m1 'instant-deploy-' | cut -d/ -f2)
298+
kubectl auth can-i get jobs.batch --as=system:serviceaccount:instant-infra:instant-worker -n "$NS"
299+
kubectl auth can-i list pods --as=system:serviceaccount:instant-infra:instant-worker -n "$NS"
300+
kubectl auth can-i get pods/log --as=system:serviceaccount:instant-infra:instant-worker -n "$NS"
301+
kubectl auth can-i list events --as=system:serviceaccount:instant-infra:instant-worker -n "$NS"
302+
303+
# 5. Confirm the error stops within one reconcile tick (~30s). This should
304+
# print nothing after the apply:
305+
kubectl logs -n instant-infra deploy/instant-worker --since=2m \
306+
| grep -i 'job_query_failed' || echo "ok — no job_query_failed in last 2m"
307+
```
308+
309+
Rollback: `git revert` the merge commit and re-apply — the verbs are
310+
purely additive, so reverting only narrows the grant back to
311+
`apps/deployments get` (re-disables rule-27 detection, no other effect).
312+
313+
---
314+
257315
## Related files
258316

259317
- `README.md` — secrets clobber warning (the same class of bug, but for

k8s/worker-rbac.yaml

Lines changed: 49 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -47,15 +47,46 @@ subjects:
4747
namespace: instant-infra
4848

4949
---
50-
# ClusterRole granting the worker permission to read Deployment objects across
51-
# all "instant-deploy-*" namespaces. Used by DeployStatusReconciler (see
52-
# worker/internal/jobs/deploy_status_reconcile.go) to reconcile the
53-
# deployments.status DB column from live k8s state every 30s.
50+
# ClusterRole granting the worker the read-only k8s access needed to reconcile
51+
# deployment status AND capture failure autopsies across all "instant-deploy-*"
52+
# (and "instant-stack-*") namespaces. Used by DeployStatusReconciler +
53+
# the deploy-failure autopsy client, both wired into the same River worker via
54+
# WithAutopsyK8s (see worker/internal/jobs/deploy_status_reconcile.go and
55+
# worker/internal/jobs/deploy_failure_autopsy.go).
5456
#
5557
# This is intentionally cluster-scoped because per-app namespaces are created
5658
# on demand by the api process — we can't enumerate them at RBAC-bind time.
57-
# Read-only get on deployments.apps is the tightest verb set that makes the
58-
# reconciler work; the worker has no need to list/watch/patch and gets none.
59+
#
60+
# Every verb below maps to exactly one live k8s call; nothing is over-granted
61+
# (no create/delete/patch/watch — the worker only reads):
62+
#
63+
# apps/deployments get ← deploy_status_reconcile.go:251
64+
# AppsV1().Deployments(ns).Get(...) — roll the
65+
# deployments.status DB column forward every 30s.
66+
#
67+
# batch/jobs get ← deploy_status_reconcile.go:256
68+
# BatchV1().Jobs(ns).Get(...) — detect a Failed
69+
# Kaniko build Job (BackoffLimitExceeded /
70+
# DeadlineExceeded) so the row flips to `failed`
71+
# even when the build pod is GC'd before the
72+
# runtime Deployment exists. THIS is the grant
73+
# whose absence disabled CLAUDE.md rule-27
74+
# silent-build-failure detection in prod
75+
# (worker logged job_query_failed every tick).
76+
#
77+
# pods list ← deploy_failure_autopsy.go:208
78+
# CoreV1().Pods(ns).List(...) — find the failed
79+
# build/runtime pods by label selector.
80+
#
81+
# pods/log get ← deploy_failure_autopsy.go:220,230
82+
# CoreV1().Pods(ns).GetLogs(...) — capture the
83+
# last_lines of build-pod logs for the autopsy
84+
# row + the deploy.failed audit_log / email.
85+
#
86+
# events list ← deploy_failure_autopsy.go:215
87+
# CoreV1().Events(ns).List(...) — read k8s
88+
# Events (ImagePullBackOff, FailedScheduling, …)
89+
# to enrich the autopsy reason/hint.
5990
apiVersion: rbac.authorization.k8s.io/v1
6091
kind: ClusterRole
6192
metadata:
@@ -64,6 +95,18 @@ rules:
6495
- apiGroups: ["apps"]
6596
resources: ["deployments"]
6697
verbs: ["get"]
98+
- apiGroups: ["batch"]
99+
resources: ["jobs"]
100+
verbs: ["get"]
101+
- apiGroups: [""]
102+
resources: ["pods"]
103+
verbs: ["list"]
104+
- apiGroups: [""]
105+
resources: ["pods/log"]
106+
verbs: ["get"]
107+
- apiGroups: [""]
108+
resources: ["events"]
109+
verbs: ["list"]
67110

68111
---
69112
# Bind the deploy-reader ClusterRole to the instant-worker ServiceAccount.

newrelic/tests/apply.test.sh

Lines changed: 10 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -93,7 +93,7 @@ assert_eq "exit code is 1" "1" "$code"
9393
assert_contains "stderr mentions NEW_RELIC_ACCOUNT_ID" "NEW_RELIC_ACCOUNT_ID" "$out"
9494

9595
# ----------------------------------------------------------------------------
96-
echo "test: --dry-run with both env vars set prints all 33 names, no API calls"
96+
echo "test: --dry-run with both env vars set prints all 98 names, no API calls"
9797
# ----------------------------------------------------------------------------
9898
out=$(env -i PATH="$PATH" NEW_RELIC_API_KEY=fake NEW_RELIC_ACCOUNT_ID=1234567 \
9999
"$APPLY" --dry-run 2>&1)
@@ -145,10 +145,12 @@ assert_contains "dry-run prints decrypt-burst alert" "connection_url.decrypte
145145
assert_contains "dry-run prints deploy-by-team alert" "deploy failure rate > 30% (1h) faceted" "$out"
146146
assert_contains "dry-run prints backup-stuck alert" "backup.requested with no follow-up" "$out"
147147

148-
# No real HTTP traffic — the [dry-run] tag must appear on every name
149-
# 12 dashboards + 21 alerts = 33 total after W10 follow-up.
148+
# No real HTTP traffic — the [dry-run] tag must appear on every name.
149+
# Count is derived from the JSON on disk (15 dashboards + 83 alerts = 98),
150+
# not a hand-typed list, so it stays correct as artifacts are added/removed.
151+
expected_count=$(ls "$NR_DIR"/dashboards/*.json "$NR_DIR"/alerts/*.json 2>/dev/null | wc -l | tr -d ' ')
150152
dryrun_count=$(echo "$out" | grep -c '^\[dry-run\]' || true)
151-
assert_eq "every name prefixed with [dry-run] (33 total)" "33" "$dryrun_count"
153+
assert_eq "every name prefixed with [dry-run] (one per JSON file)" "$expected_count" "$dryrun_count"
152154

153155
# ----------------------------------------------------------------------------
154156
echo "test: --dry-run without env vars still validates JSON + warns"
@@ -158,24 +160,22 @@ code=$?
158160
assert_eq "exit code is 0 with no env (dry-run)" "0" "$code"
159161
assert_contains "warns about unset env" "warning" "$out"
160162
dryrun_count=$(echo "$out" | grep -c '^\[dry-run\]' || true)
161-
assert_eq "still prints 33 [dry-run] entries" "33" "$dryrun_count"
163+
assert_eq "still prints one [dry-run] entry per JSON file" "$expected_count" "$dryrun_count"
162164

163165
# ----------------------------------------------------------------------------
164166
echo "test: every JSON file in dashboards/ and alerts/ parses cleanly"
165167
# ----------------------------------------------------------------------------
166168
broken=0
169+
total_json=0
167170
for f in "$NR_DIR"/dashboards/*.json "$NR_DIR"/alerts/*.json; do
171+
total_json=$((total_json+1))
168172
if ! jq empty "$f" >/dev/null 2>&1; then
169173
echo " FAIL invalid JSON: $f"
170174
broken=$((broken+1))
171175
fi
172176
done
173177
if [ "$broken" -eq 0 ]; then
174-
<<<<<<< HEAD
175-
echo " ok all 26 JSON files parse"
176-
=======
177-
echo " ok all 16 JSON files parse"
178-
>>>>>>> b41339a (newrelic: dashboards + alerts for W7/W8/W9 audit kinds (W10 follow-up))
178+
echo " ok all $total_json JSON files parse"
179179
PASS=$((PASS+1))
180180
else
181181
FAIL=$((FAIL+1))

0 commit comments

Comments
 (0)