@@ -96,3 +96,107 @@ to Kubernetes-native inputs without making editable skills/workspaces immutable:
96 96 Until that migration is done, do not reintroduce global provisioner chown,
97 97 root init containers, or UID 1000 alignment as the primary fix for local-path
98 98 PVC permissions.
99+
100+ ## Roadmap: multi-node ("nodes joining my cluster")
101+
102+ Not scheduled — far-future. Recorded here so near-term permission decisions
103+ don't foreclose it.
104+
105+ ### The coupling we accepted
106+
107+ "Host FS is canonical" is an identity: one directory is simultaneously a
108+ plain host path the CLI writes (`SeedHostFiles`, `syncRuntimeFiles`, wallet
109+ staging, flow-16 host asserts) AND the backing store of a PVC a pod mounts.
110+ The identity holds only while ALL of these are true:
111+
112+ 1. The PV is path-addressable on a node (local-path/hostPath, not block or
113+ network storage).
114+ 2. The node's FS is the host's FS (k3d bind-mount of `OBOL_DATA_DIR`, or
115+ native k3s where host == node).
116+ 3. There is effectively one node, so "the node that has the directory" and
117+ "the node the pod runs on" cannot diverge.
118+ 4. Host and container share UID/GID semantics (same kernel; on macOS faked
119+ by Docker Desktop file sharing).
120+
121+ A second node breaks (2) and (3) independently. Every permission mechanism
122+ in this plan (group sharing, fsGroup walks, the removed chowns) operates on
123+ (4) and silently assumes (1)–(3).
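
The four conditions can be read as a predicate over cluster topology. The following is a minimal Go sketch, assuming hypothetical `Node` and `Topology` types that are not part of the codebase; it only illustrates why adding a node invalidates (2) and (3) while leaving (1) and (4) untouched.

```go
package main

import "fmt"

// Node and Topology are hypothetical types used for illustration only.
type Node struct {
	Name     string
	FSIsHost bool // node FS is the host FS (k3d bind-mount, or native k3s on the host)
}

type Topology struct {
	PathAddressablePV bool   // condition 1: local-path/hostPath, not block or network storage
	Nodes             []Node // conditions 2 and 3 are derived from this list
	SharedUIDGID      bool   // condition 4: host and container share UID/GID semantics
}

// BrokenConditions returns the numbers of the coupling conditions that fail.
func BrokenConditions(t Topology) []int {
	var broken []int
	if !t.PathAddressablePV {
		broken = append(broken, 1)
	}
	for _, n := range t.Nodes {
		if !n.FSIsHost {
			broken = append(broken, 2)
			break
		}
	}
	if len(t.Nodes) != 1 {
		broken = append(broken, 3)
	}
	if !t.SharedUIDGID {
		broken = append(broken, 4)
	}
	return broken
}

func main() {
	single := Topology{
		PathAddressablePV: true,
		Nodes:             []Node{{Name: "home", FSIsHost: true}},
		SharedUIDGID:      true,
	}
	joined := single
	joined.Nodes = []Node{
		{Name: "home", FSIsHost: true},
		{Name: "node-b", FSIsHost: false}, // hypothetical remote node
	}
	fmt.Println(BrokenConditions(single)) // no broken conditions
	fmt.Println(BrokenConditions(joined)) // conditions 2 and 3
}
```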
124+
125+ ### Scope of the coupling — agent homes only
126+
127+ | Data | Host access needed? |
128+ | --- | --- |
129+ | hermes-data / agent homes (config, SOUL.md, skills) | Yes — the product promise |
130+ | OpenClaw skills | Yes |
131+ | x402-buyer-state (consumed.json) | No — pod-private |
132+ | Chain data (reth, aztec) | No — pod-private |
133+ | Wallet keystores | Only at creation; Remaining Debt moves them to Secrets |
134+
135+ Scaling out means shrinking the host-canonical surface to the first two
136+ rows, not replicating it onto every node.
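
The table above can be treated as a classification that later decisions (pinning, preflight) consume. This is a sketch only: the `DataClass` type and `HostCanonicalSurface` function are hypothetical names, and treating wallet keystores as not host-canonical is an assumption based on the "only at creation" note.

```go
package main

import "fmt"

// DataClass mirrors one row of the table. The type is hypothetical;
// only the row contents come from the table.
type DataClass struct {
	Name          string
	HostCanonical bool // true when the host must read and write the files directly
}

var dataClasses = []DataClass{
	{Name: "hermes-data / agent homes", HostCanonical: true},
	{Name: "OpenClaw skills", HostCanonical: true},
	{Name: "x402-buyer-state", HostCanonical: false},
	{Name: "chain data (reth, aztec)", HostCanonical: false},
	// Assumption: keystores need host access only at creation, so they
	// are not counted as part of the long-lived host-canonical surface.
	{Name: "wallet keystores", HostCanonical: false},
}

// HostCanonicalSurface returns the names of the classes that must stay
// on the home node's filesystem.
func HostCanonicalSurface(classes []DataClass) []string {
	var out []string
	for _, c := range classes {
		if c.HostCanonical {
			out = append(out, c.Name)
		}
	}
	return out
}

func main() {
	for _, name := range HostCanonicalSurface(dataClasses) {
		fmt.Println(name)
	}
}
```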
137+
138+ ### What breaks on day one of a join (as of this writing)
139+
140+ - `local-path` uses `volumeBindingMode: WaitForFirstConsumer` and NO render
141+ (hermes.go, agent_render.go, llm.yaml) sets a nodeSelector/affinity: a new
142+ agent's pod can schedule on node B, the PV is provisioned on node B's
143+ disk, while `SeedHostFiles` writes to `$DATA_DIR` on the home host. The
144+ pod boots against an empty home with no error anywhere.
145+ - `ensureVolumeWritable` is a `docker exec` into the k3d node container, with
146+ no transport to a remote node (it already early-returns on the k3s backend).
147+ - Existing PVs pin pods to their node forever via nodeAffinity, so wrong
148+ first placements are sticky.
149+
150+ ### Options ladder (increasing decoupling)
151+
152+ 0. **Home-node pattern (recommended v1, prerequisite for any join path).**
153+ Label the home node (`obol.org/home=true`), render nodeSelector into
154+ every host-canonical workload (hermes master, CRD agents, litellm/buyer).
155+ Joined nodes take stateless or pod-private work only: vLLM/Ollama
156+ upstreams, network nodes (the biggest storage consumers, zero host
157+ visibility needed), demo servers. Agents cannot migrate and the home
158+ node is the SPOF, but the change is cheap and non-breaking.
159+ 1. **Host exports `$DATA_DIR` over NFS** (csi-driver-nfs): agent-home PVCs
160+ become RWX network mounts; files still physically live on the host so
161+ direct editing keeps working. `all_squash,anonuid=1000` solves ownership
162+ flapping at the protocol level. Hard caveat: Hermes `state.db` is SQLite,
163+ and SQLite over NFS is a corruption hazard. The workable shape is inputs
164+ over NFS plus state.db on a node-local PV, which is already half of option 3.
165+ 2. **Distributed storage (Longhorn/Rook).** Solves migration, destroys host
166+ access entirely, far too heavy for local-first. Ruled out except as the
167+ storage class users bring on managed k8s.
168+ 3. **API-mediated host access (the #610 direction, revisited).** Inputs
169+ (config, SOUL.md, skills, markers) as ConfigMaps/Secrets/OCI artifacts,
170+ delivered to any node, checksum-rolled; state as pod-private PVs on any
171+ provisioner; host access becomes a verb (`obol agent fs ls|cat|edit|cp`
172+ over kubectl exec/cp or a sidecar) instead of a shared mount. Survives
173+ arbitrary topology and PSS-restricted namespaces. Cost: live-editing a
174+ skill needs a sync round-trip instead of `:w`, a UX problem solvable
175+ with a `--watch` loop; this is why #610 was reverted and why it comes
176+ back when (3) in the coupling list stops being true.
177+ 4. **Hybrid by data class (target end-state).** Operator-authored inputs →
178+ API objects; machine state → pod-private PVs; human-inspectable outputs
179+ → `obol agent fs` or a write-once RWX exports share. Single-node k3d
180+ keeps the local-path fast path as an optimization, not as the contract.
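
Option 0 comes down to one rendering rule: host-canonical workloads get a nodeSelector for the home label, everything else is left schedulable anywhere. A minimal sketch of that rule follows. The `Workload` type and `PinToHome` function are hypothetical and do not reflect the actual render code in hermes.go or agent_render.go; only the `obol.org/home=true` label comes from the text above.

```go
package main

import "fmt"

// homeLabelKey is the label named in option 0.
const homeLabelKey = "obol.org/home"

// Workload is a hypothetical stand-in for whatever the renderers use.
type Workload struct {
	Name          string
	HostCanonical bool
	NodeSelector  map[string]string
}

// PinToHome returns a copy of w whose nodeSelector includes the home
// label when the workload is host-canonical. Other workloads are
// returned unchanged so they can land on joined nodes.
func PinToHome(w Workload) Workload {
	if !w.HostCanonical {
		return w
	}
	sel := make(map[string]string, len(w.NodeSelector)+1)
	for k, v := range w.NodeSelector {
		sel[k] = v // preserve any selector entries already present
	}
	sel[homeLabelKey] = "true"
	w.NodeSelector = sel
	return w
}

func main() {
	agent := PinToHome(Workload{Name: "hermes", HostCanonical: true})
	upstream := PinToHome(Workload{Name: "vllm", HostCanonical: false})
	fmt.Println(agent.Name, agent.NodeSelector)
	fmt.Println(upstream.Name, len(upstream.NodeSelector))
}
```

Copying the selector map rather than mutating it keeps the function safe to call on shared template values.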
181+
182+ ### Join mechanics
183+
184+ k3d `node create` only adds agent containers on the same host. The real
185+ multi-node story is the native k3s backend: Linux home runs `k3s server`
186+ (host FS == node FS, home-node pattern costs nothing), remote boxes join
187+ with `k3s agent --server ... --token ...`. macOS stays single-node k3d;
188+ remote GPU capacity is better reached as an external endpoint
189+ (`obol model setup custom --endpoint http://gpu-box:8000/v1`), which the
190+ stack already supports and which sidesteps this entire section.
191+
192+ ### Decisions binding today
193+
194+ - Prefer pure group-1000 sharing (setgid dirs, g+rw, nobody chowns owners)
195+ over render-time `os.Getuid()` UID matching: group sharing is
196+ topology-neutral; UID matching bakes one machine's identity into
197+ manifests and deepens the coupling.
198+ - Before any join path ships: home-node nodeSelector rendering + a
199+ join-time preflight that names the host-canonical volumes pinned to the
200+ home node.
201+ - Input migration to API objects proceeds on the Remaining Debt schedule
202+ above; it is also the multi-node prerequisite.
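
The join-time preflight in the second decision could look roughly like the sketch below. Everything here is an assumption made for illustration: the `Volume` type, the `JoinPreflight` name, and the claim and node names in `main` are hypothetical, and a real check would read PV nodeAffinity from the cluster instead of taking a slice.

```go
package main

import (
	"fmt"
	"strings"
)

// Volume is a hypothetical summary of one PVC/PV pair.
type Volume struct {
	Claim         string
	HostCanonical bool
	Node          string // node the PV is pinned to
}

// JoinPreflight names the host-canonical claims pinned to the home node
// and returns an error if any host-canonical volume sits elsewhere.
func JoinPreflight(homeNode string, vols []Volume) ([]string, error) {
	var pinned, misplaced []string
	for _, v := range vols {
		if !v.HostCanonical {
			continue // pod-private data may live on any node
		}
		if v.Node == homeNode {
			pinned = append(pinned, v.Claim)
		} else {
			misplaced = append(misplaced, v.Claim+"@"+v.Node)
		}
	}
	if len(misplaced) > 0 {
		return pinned, fmt.Errorf("host-canonical volumes not on home node %q: %s",
			homeNode, strings.Join(misplaced, ", "))
	}
	return pinned, nil
}

func main() {
	vols := []Volume{
		{Claim: "hermes-data", HostCanonical: true, Node: "home"},
		{Claim: "reth-data", HostCanonical: false, Node: "home"},
	}
	pinned, err := JoinPreflight("home", vols)
	if err != nil {
		fmt.Println("preflight failed:", err)
		return
	}
	fmt.Println("pinned to home node:", strings.Join(pinned, ", "))
}
```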