Commit c667432

fix(pvc): legacy hostPath-PV compat, scoped Reloader, flow alignment with #614
Applies the verified findings from the cross-review against PR #614 (every item adversarially confirmed against the refs; union merge-tree clean).

PVC / upgrade path:
- llm.yaml: restore container-level runAsUser/runAsGroup 1000 on x402-buyer. Clusters upgraded in place from <= rc12 keep hostPath-typed PVs where kubelet skips fsGroup; their /state dir is 1000:1000 with consumed.json written 0600 by UID 1000, so a 65532 sidecar cannot read it, Fatalf's on `load state`, and takes every paid/<model> route down. On fresh local-type PVs the explicit UID is harmless (fsGroup 65532 grants group access). embed_buyer_state_test.go updated to pin the new contract.
- plans/volume-permission-hardening.md: new "Upgrading from <= v0.10.0-rc12" section. The supported path is cluster recreation (wallet backup/restore), with a documented k3d chown escape hatch. troubleshooting.md gets the symptom->fix entry. The Hermes half of the legacy-PV breakage cannot be patched at runtime without reintroducing the chown machinery this PR removes, so it is a documented breaking change instead.

Paid-route availability:
- llm.yaml: Reloader annotation narrowed to litellm-config only. The buyer ConfigMaps (x402-buyer-config/x402-buyer-auths) are rewritten by the controller on every buy, top-up, auto-refill, and tombstone cleanup; with strategy Recreate + 1 replica the previous annotation bounced the entire inference gateway (all Hermes traffic, in-flight SSE streams) on every purchase event, inverting CLAUDE.md pitfall 7 (restart is fallback, not the default buy path). The buyer hot-reloads via /admin/reload. stack_test.go updated to pin litellm-config-only.

Flow alignment with #614:
- lib.sh: `stack down` -> `stack down --yes` in reset_flow_workspace. #614's flow-16 (now last in the single-stack array) intentionally leaves a live agent offer; without --yes the non-TTY ConfirmRunningServicesLoss gate refuses, graceful down is silently skipped (`|| true`), and teardown degrades to the raw k3d-delete fallback on every release-smoke run.
- flow-11: post-register Ready poll 120s -> 300s to match flow-14's identical live-Base-Sepolia chain-watch path (pitfall 13 free-tier RPC throttling).

Known follow-ups (not in this commit):
- flow-08 buy-retry top-up vs exactly-N assertions on rare partial failures.
- flow-11 lacks flow-14's remote-signer rolled guard.
- aztec PVC has no permission story (runs as root today).
- post-merge controller repin so released sub-agents pick up this PR's UID-1000 render (tracked in #614's pin-test note).
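The read failure described under "PVC / upgrade path" follows from the classic POSIX owner/group/other check. The sketch below is illustrative only: `canRead` is a hypothetical helper, not code from this repository, and it ignores root, capabilities, and ACLs. It shows why UID 65532 is denied on a 0600 file owned by 1000:1000 while UID 1000 is not.

```go
package main

import "fmt"

// canRead reports whether a process with the given UID and group IDs may read
// a file with the given owner, group, and mode bits. It follows the classic
// POSIX order: owner bits if the UID matches, else group bits if any GID
// matches, else the "other" bits. Root and ACLs are ignored in this sketch.
func canRead(uid uint32, gids []uint32, fileUID, fileGID uint32, mode uint32) bool {
	if uid == fileUID {
		return mode&0o400 != 0
	}
	for _, g := range gids {
		if g == fileGID {
			return mode&0o040 != 0
		}
	}
	return mode&0o004 != 0
}

func main() {
	// Legacy hostPath PV: consumed.json is owned 1000:1000 with mode 0600.
	fmt.Println(canRead(65532, []uint32{65532}, 1000, 1000, 0o600))      // false
	fmt.Println(canRead(1000, []uint32{1000, 65532}, 1000, 1000, 0o600)) // true

	// Assumed fresh local PV: group 65532 with group-read set, and the
	// process holding 65532 as a supplemental group via fsGroup.
	fmt.Println(canRead(1000, []uint32{1000, 65532}, 0, 65532, 0o660)) // true
}
```

The third call models the "explicit UID is harmless" case: access comes from the supplemental group rather than from the UID.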
1 parent 1f3088b commit c667432

7 files changed

Lines changed: 89 additions & 16 deletions

.agents/skills/obol-stack-dev/references/troubleshooting.md

Lines changed: 11 additions & 0 deletions
@@ -32,6 +32,17 @@ data alone. Use `FLOW_FORCE_PURGE_DATA=true` or `RELEASE_SMOKE_FORCE_PURGE_DATA=
 only when the operator explicitly wants full persistent-data deletion and has an
 interactive sudo path.
 
+### Hermes/x402-buyer EACCES or crashloop after upgrading a pre-v0.10.0 cluster
+
+PVs provisioned before v0.10.0 are hostPath-typed — kubelet skips fsGroup
+ownership there, and the v0.10.0 pods (UID 1000, no root chown init) cannot
+read legacy data owned 10000:10000. Symptoms: Hermes gateway crashloops on
+state.db / config.yaml; x402-buyer exits `load state:` at startup, killing
+every `paid/<model>` route. Fix: recreate the cluster (`stack down` →
+`purge -f` → `init` → `up`; back up agent wallets first), or for k3d chown
+the PV backing dirs to 1000:1000 from inside the node and restart the pods.
+See plans/volume-permission-hardening.md "Upgrading from <= v0.10.0-rc12".
+
 ### k3d port 80 privileged on macOS
 
 Always use `http://obol.stack:8080/`, not `http://obol.stack/`. Port 8080 maps to the same Traefik LB.

flows/flow-11-dual-stack.sh

Lines changed: 4 additions & 1 deletion
@@ -1054,7 +1054,10 @@ if [ "$register_rc" -ne 0 ]; then
 fi
 pass "obol sell register issued"
 
-poll_step_grep "Alice: ServiceOffer Ready" "True" 24 5 \
+# 300s to match flow-14's post-register poll: Ready requires the controller's
+# Base Sepolia chain watch (via eRPC) to observe the register tx, and the
+# free-tier RPC throttling in pitfall 13 makes 120s intermittently tight.
+poll_step_grep "Alice: ServiceOffer Ready" "True" 60 5 \
   alice kubectl get serviceoffers.obol.org alice-inference -n llm \
   -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'

flows/lib.sh

Lines changed: 1 addition & 1 deletion
@@ -488,7 +488,7 @@ reset_flow_workspace() {
 OBOL_CONFIG_DIR="$dir/config" \
 OBOL_BIN_DIR="$dir/bin" \
 OBOL_DATA_DIR="$dir/data" \
-run_with_timeout 120 "$obol_cmd" stack down >/dev/null 2>&1 || true
+run_with_timeout 120 "$obol_cmd" stack down --yes >/dev/null 2>&1 || true
 if [ "$force_purge" = "true" ]; then
   OBOL_DEVELOPMENT=true \
   OBOL_NONINTERACTIVE=true \

internal/embed/embed_buyer_state_test.go

Lines changed: 10 additions & 7 deletions
@@ -92,9 +92,12 @@ func TestBuyerStatePVC(t *testing.T) {
 		t.Errorf("litellm pod fsGroupChangePolicy = %v, want OnRootMismatch", policy)
 	}
 
-	// x402-buyer should inherit the pod-level 65532 identity and rely on
-	// fsGroup-applied local PV ownership. A container-level UID/GID 1000 is
-	// the old hostPath workaround and should not come back.
+	// x402-buyer must keep container-level UID/GID 1000 while hostPath PVs
+	// from <= v0.10.0-rc12 clusters remain in support: those PVs ignore the
+	// pod fsGroup (kubelet skips ownership management on hostPath) and hold
+	// a consumed.json written 0600 by UID 1000 — a 65532 sidecar crashloops
+	// on `load state` and takes every paid/* route down. On fresh local-type
+	// PVs the explicit UID is harmless (fsGroup 65532 grants group access).
 	containers, ok := nested(dep, "spec", "template", "spec", "containers").([]any)
 	if !ok {
 		t.Fatal("litellm Deployment has no containers")
@@ -110,11 +113,11 @@ func TestBuyerStatePVC(t *testing.T) {
 	if buyer == nil {
 		t.Fatal("x402-buyer container missing from litellm Deployment")
 	}
-	if u := nested(buyer, "securityContext", "runAsUser"); u != nil {
-		t.Errorf("x402-buyer securityContext.runAsUser = %v, want unset (inherits pod UID 65532)", u)
+	if u := nested(buyer, "securityContext", "runAsUser"); u != 1000 {
+		t.Errorf("x402-buyer securityContext.runAsUser = %v, want 1000 (legacy hostPath-PV state compat)", u)
 	}
-	if g := nested(buyer, "securityContext", "runAsGroup"); g != nil {
-		t.Errorf("x402-buyer securityContext.runAsGroup = %v, want unset (inherits pod GID 65532)", g)
+	if g := nested(buyer, "securityContext", "runAsGroup"); g != 1000 {
+		t.Errorf("x402-buyer securityContext.runAsGroup = %v, want 1000 (legacy hostPath-PV state compat)", g)
 	}
 	if nr := nested(buyer, "securityContext", "runAsNonRoot"); nr != true {
 		t.Errorf("x402-buyer securityContext.runAsNonRoot = %v, want true (restricted PSS)", nr)
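One Go detail is worth keeping in mind when reading `u != 1000`: if `nested` returns an interface value (an assumption here, suggested by the `.([]any)` assertion above), the comparison only holds when the decoder produced an `int`, because the untyped constant 1000 takes its default type `int`. The diff does not show which decoder the test uses, so the sketch below is a hedged illustration with a hypothetical `asInt64` helper, not a claim that the test is wrong.

```go
package main

import "fmt"

// asInt64 normalizes the numeric types that YAML and JSON decoders commonly
// produce for integers, so a comparison does not depend on the decoder.
func asInt64(v any) (int64, bool) {
	switch n := v.(type) {
	case int:
		return int64(n), true
	case int64:
		return n, true
	case float64:
		// JSON-style decoders yield float64; accept only whole numbers.
		if n == float64(int64(n)) {
			return int64(n), true
		}
	}
	return 0, false
}

func main() {
	var asFloat any = float64(1000) // what a JSON-style decoder would yield
	var asInt any = 1000            // what an int-producing decoder would yield

	// Interface equality compares dynamic type and value.
	fmt.Println(asFloat == 1000) // false
	fmt.Println(asInt == 1000)   // true

	v, ok := asInt64(asFloat)
	fmt.Println(v, ok) // 1000 true
}
```

If the decoder in use already yields `int`, the direct comparison in the test is sufficient and a helper like this is unnecessary.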

internal/embed/infrastructure/base/templates/llm.yaml

Lines changed: 21 additions & 6 deletions
@@ -176,7 +176,15 @@ spec:
       labels:
         app: litellm
       annotations:
-        configmap.reloader.stakater.com/reload: "litellm-config,x402-buyer-config,x402-buyer-auths"
+        # litellm-config only: model-list changes must roll the pod so paid
+        # routes pick up new entries. The buyer ConfigMaps are deliberately
+        # NOT listed — x402-buyer hot-reloads them via /admin/reload (driven
+        # by the serviceoffer-controller on every buy/refill), and this
+        # Deployment is strategy Recreate with one replica: reloading on
+        # x402-buyer-config/x402-buyer-auths would take the whole inference
+        # gateway down on every purchase event (CLAUDE.md pitfall 7: restart
+        # is the fallback, not the default buy path).
+        configmap.reloader.stakater.com/reload: "litellm-config"
         secret.reloader.stakater.com/reload: "litellm-secrets"
     spec:
       terminationGracePeriodSeconds: 60
@@ -304,13 +312,20 @@ spec:
           # across flow-08/11/14/13. See internal/embed/embed_image_pin_test.go.
           image: ghcr.io/obolnetwork/x402-buyer:f5d94fc@sha256:0c431eda44e9e2fe5dd50c82cf4885f9be5037e592478781c51e9c510171265c
           imagePullPolicy: IfNotPresent
-          # PSS Restricted + writable PVC. The StorageClass asks
-          # local-path-provisioner for local PVs, so kubelet applies the
-          # pod-level fsGroup to /state on mount. The sidecar inherits the
-          # pod's nonroot 65532 UID/GID and needs no root init or UID 1000
-          # alignment workaround.
+          # PSS Restricted + writable PVC. On fresh clusters the StorageClass
+          # asks local-path-provisioner for local PVs, so kubelet applies the
+          # pod-level fsGroup to /state on mount and any UID works. The
+          # explicit UID/GID 1000 below is for clusters UPGRADED in place
+          # from <= v0.10.0-rc12: their x402-buyer-state PV is hostPath-typed
+          # (kubelet ignores fsGroup there) and the old provisioner chowned
+          # the dir 1000:1000 with consumed.json written 0600 by UID 1000 —
+          # a 65532 sidecar cannot even read it and crashloops on startup
+          # (`load state` Fatalf), taking every paid/* route down. Keep 1000
+          # until hostPath PVs are out of support.
           securityContext:
             runAsNonRoot: true
+            runAsUser: 1000
+            runAsGroup: 1000
             allowPrivilegeEscalation: false
             readOnlyRootFilesystem: true
             capabilities:

internal/stack/stack_test.go

Lines changed: 9 additions & 1 deletion
@@ -491,14 +491,22 @@ func TestLLMTemplate_IncludesPaidRouteAndBuyerSidecar(t *testing.T) {
 		`name: buyer-http`,
 		`name: x402-buyer-config`,
 		`name: x402-buyer-auths`,
-		`configmap.reloader.stakater.com/reload: "litellm-config,x402-buyer-config,x402-buyer-auths"`,
+		// litellm-config ONLY: the buyer ConfigMaps must stay out of the
+		// Reloader annotation — x402-buyer hot-reloads them via /admin/reload
+		// and the Recreate strategy would otherwise bounce the whole gateway
+		// on every buy/refill (CLAUDE.md pitfall 7).
+		`configmap.reloader.stakater.com/reload: "litellm-config"`,
 		`emptyDir:`,
 	} {
 		if !strings.Contains(out, want) {
 			t.Fatalf("llm template missing %q:\n%s", want, out)
 		}
 	}
 
+	if strings.Contains(out, `configmap.reloader.stakater.com/reload: "litellm-config,`) {
+		t.Fatal("llm template reload annotation must list litellm-config only — buyer CM writes happen per purchase and would Recreate-bounce the gateway")
+	}
+
 	if strings.Contains(out, "custom_provider_map") {
 		t.Fatalf("llm template should not require a custom provider:\n%s", out)
 	}
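The test pins the annotation by substring match, which catches a list that starts with `litellm-config,` but not one where another name comes first. A stricter variant could parse the comma-separated value and require exactly one entry. The sketch below is an assumption-laden alternative, not the repository's code; `reloadTargets` and `onlyTarget` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// reloadTargets splits a comma-separated Reloader annotation value into
// trimmed, non-empty resource names.
func reloadTargets(value string) []string {
	var out []string
	for _, part := range strings.Split(value, ",") {
		if name := strings.TrimSpace(part); name != "" {
			out = append(out, name)
		}
	}
	return out
}

// onlyTarget reports whether value lists exactly one resource, named want.
func onlyTarget(value, want string) bool {
	targets := reloadTargets(value)
	return len(targets) == 1 && targets[0] == want
}

func main() {
	fmt.Println(onlyTarget("litellm-config", "litellm-config")) // true
	fmt.Println(onlyTarget("litellm-config,x402-buyer-config,x402-buyer-auths", "litellm-config")) // false
	fmt.Println(onlyTarget("x402-buyer-config,litellm-config", "litellm-config")) // false
}
```

Using this in a test would first require extracting the annotation value from the rendered template, which the sketch leaves out.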

plans/volume-permission-hardening.md

Lines changed: 33 additions & 0 deletions
@@ -49,6 +49,39 @@ For a fresh-stack PVC check, use only the CLI surface: create the stack, inspect
 the `local-path` StorageClass, then inspect every bound PV and confirm it has
 `.spec.local.path` and no `.spec.hostPath`.
 
+## Upgrading from <= v0.10.0-rc12 (breaking)
+
+The new model only governs PVs provisioned AFTER this change. PV specs are
+immutable: a cluster created on rc12 or earlier keeps hostPath-typed PVs,
+where the kubelet skips fsGroup ownership management entirely, and its
+hermes-data files are owned 10000:10000 (the old containerUID, chowned by the
+removed root init on every start). Running a v0.10.0 CLI `obol stack up`
+against such a cluster re-applies the new pod specs in place and the Hermes
+pod (now UID 1000, no chown init, no `fixHermesDataPVCK3dFallback`) loses
+read/write access to its own state.
+
+Supported upgrade path: **recreate the cluster**.
+
+```bash
+obol agent wallet backup # if any agent wallet holds funds
+obol stack down && obol stack purge -f
+obol stack init && obol stack up
+obol agent wallet restore # as needed
+```
+
+Escape hatch for clusters that must not be recreated (k3d): chown the legacy
+backing dirs to the new UID from inside the node, then restart the pods:
+
+```bash
+docker exec k3d-obol-stack-<id>-server-0 \
+  sh -c 'chown -R 1000:1000 /var/lib/rancher/k3s/storage/pvc-*hermes-data*'
+```
+
+The x402-buyer sidecar deliberately keeps container-level UID/GID 1000
+(llm.yaml) so the legacy `x402-buyer-state` PV — dir 1000:1000, consumed.json
+0600 — stays readable without any migration; do not remove that alignment
+until hostPath PVs are out of support.
+
 ## Remaining Debt
 
 OpenClaw and wallet provisioning still contain legacy host-side staging paths
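The fresh-stack check in the plan ("has `.spec.local.path` and no `.spec.hostPath`") can be expressed over a decoded PersistentVolume object. The sketch below assumes the PV has already been decoded into nested maps (for example from JSON output); `lookup` and `pvVolumeType` are hypothetical helpers, not functions from this repository.

```go
package main

import "fmt"

// lookup walks nested map[string]any values along path and returns the value
// found there, or nil if any step is missing or not a map.
func lookup(m map[string]any, path ...string) any {
	var cur any = m
	for _, key := range path {
		obj, ok := cur.(map[string]any)
		if !ok {
			return nil
		}
		cur = obj[key]
	}
	return cur
}

// pvVolumeType classifies a decoded PersistentVolume as "hostPath" (legacy,
// fsGroup not applied), "local" (new model), or "unknown".
func pvVolumeType(pv map[string]any) string {
	if lookup(pv, "spec", "hostPath") != nil {
		return "hostPath"
	}
	if p, ok := lookup(pv, "spec", "local", "path").(string); ok && p != "" {
		return "local"
	}
	return "unknown"
}

func main() {
	// Paths below are placeholders for illustration only.
	legacy := map[string]any{"spec": map[string]any{
		"hostPath": map[string]any{"path": "/example/legacy"},
	}}
	fresh := map[string]any{"spec": map[string]any{
		"local": map[string]any{"path": "/example/fresh"},
	}}
	fmt.Println(pvVolumeType(legacy)) // hostPath
	fmt.Println(pvVolumeType(fresh))  // local
}
```

A cluster reporting any `hostPath` result falls under the upgrade section above rather than the fresh-stack model.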
