feat(tunnel): simplify tunnel setup #645

Merged
bussyjd merged 4 commits into main from oisin/tunnelcleanup on Jun 16, 2026

@OisinKyne OisinKyne commented Jun 15, 2026

Contributor

Summary

obol tunnel cleanup + obol domain UX, with two
reliability fixes

Reworks the Cloudflare tunnel commands around a
least-privilege connector token as the default and
only remote path, brings obol domain into line with
that model, and fixes two bugs surfaced while testing
the flow end-to-end.

obol tunnel

  • Connector token is the default (and only) remote path. Removed the account-wide
    API-token provisioning path entirely (tunnel provision, and setup
    --api-token/--account-id/--zone-id/--register-domain are gone). DNS/ingress are
    configured by the user in the Cloudflare dashboard (route Public Hostname →
    http://traefik.traefik.svc.cluster.local:80).
  • Paste-friendly setup. obol tunnel setup takes the token positionally or
    via --token; accepts the bare eyJ… value or the whole cloudflared tunnel run
    --token … line (prefix stripped). On a TTY with no token, it walks the user through
    the dashboard steps and prompts.
  • No host binary required for the default path: Obol runs the connector
    in-cluster.
  • tunnel status rework. Reads connector health from cloudflared's in-cluster /ready
    + /metrics (port 2000, no token) plus a public HTTP probe. Concise by default;
    --verbose adds replicas/pods; --no-probe stays offline. Shows a clear
    temporary-vs-permanent mode.
  • Browser login is now a hidden advanced fallback (setup --management local, alias
    tunnel login) for users who'd rather not use the dashboard.
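
For reviewers who want a concrete picture of the paste-friendly input handling, here is a minimal sketch of the idea. The helper name extractConnectorToken and its error messages are hypothetical, not the PR's actual code; it only illustrates accepting either a bare token or a pasted cloudflared tunnel run --token … line.

```go
package main

import (
	"fmt"
	"strings"
)

// extractConnectorToken is a hypothetical helper (not the PR's real function).
// It accepts a bare token or a pasted "cloudflared tunnel run --token <token>"
// line and returns only the token.
func extractConnectorToken(input string) (string, error) {
	fields := strings.Fields(input)
	if len(fields) == 0 {
		return "", fmt.Errorf("no token provided")
	}
	// Look for "--token <value>" or "--token=<value>" anywhere in the line.
	for i, f := range fields {
		if f == "--token" {
			if i+1 >= len(fields) {
				return "", fmt.Errorf("--token has no value")
			}
			return fields[i+1], nil
		}
		if v, ok := strings.CutPrefix(f, "--token="); ok {
			if v == "" {
				return "", fmt.Errorf("--token has no value")
			}
			return v, nil
		}
	}
	// No flag found: the input must be a single bare value.
	if len(fields) != 1 {
		return "", fmt.Errorf("expected a bare token or a cloudflared command line")
	}
	return fields[0], nil
}

func main() {
	inputs := []string{
		"eyJhIjoiZXhhbXBsZSJ9",
		"cloudflared tunnel run --token eyJhIjoiZXhhbXBsZSJ9",
	}
	for _, in := range inputs {
		tok, err := extractConnectorToken(in)
		fmt.Println(tok, err)
	}
}
```

Splitting on whitespace keeps the sketch tolerant of trailing newlines and extra spaces from a copy-paste; quoted tokens are not handled here.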

obol domain

  • New obol domain list — shows registrar domains already in the account (the
    natural "do I have a domain for the tunnel?" check).
  • Fixed a credential footgun: --api-token no longer carries the -t alias, which
    collided with tunnel setup -t (a different credential — connector token).
  • Interactive API-token walkthrough when missing on a TTY (scope + token-creation
    URL), mirroring the tunnel flow; clear actionable error otherwise.
  • Handoff to tunnel: register success prints the obol tunnel setup --hostname …
    next step; search/check suggest register.
  • Hid the leaky --respond-async knob; register now nudges toward a saved payment
    method + registrant contacts on action_required. Framed throughout as an optional
    convenience over doing it in the dashboard.

Reliability fixes (surfaced during testing)

  • Hermes pod recreation (fix(agent)): syncRuntimeFiles chowned the Hermes home dir
    to the host UID for host-side writes but never handed ownership back to the
    container UID, so the next non-root pod restart died with mkdir: cannot create
    directory '/data/.hermes': Permission denied. Added the fixRuntimeVolumeOwnership
    bookend + regression test.
  • Helm 4 / k3d apply flakes: tunnel kubectl apply paths now use --server-side
    --force-conflicts (avoids the /openapi/v2 EOF flake); SyncAgentBaseURL appends
    SyncFlagsForVersion so Helm 4 gets --force-conflicts on the AGENT_BASE_URL sync.

Tests & docs

  • New/updated tests: connector-token parsing, status probes, the -t collision
    guard, the domain API-token resolver, ListRegistrarDomains, and the hermes
    ownership bookend.
  • Updated CLAUDE.md, README.md, and docs/guides/monetize-inference.md.

▎ Companion docs live in the obol-gitbook repo (rewritten "Set up a permanent URL"
▎ guide + screenshots, ObolNetwork/obol-gitbook#152) and skills (version bump), both in separate PRs.

@OisinKyne OisinKyne force-pushed the oisin/tunnelcleanup branch from 8f0e620 to c838cc1 June 15, 2026 22:15
@OisinKyne OisinKyne requested a review from bussyjd June 16, 2026 11:10
@OisinKyne OisinKyne force-pushed the oisin/tunnelcleanup branch from 614ca7a to 5e81a42 June 16, 2026 11:10

bussyjd commented Jun 16, 2026

Contributor

Diagram-based review: how this improves the stack

The PR’s main win is that permanent tunnel setup no longer asks the CLI to hold broad Cloudflare authority. It moves the default path to a single-tunnel connector token, with Cloudflare-side hostname/DNS setup done explicitly in the dashboard.

flowchart LR
  subgraph Before["Before: API-token provisioning"]
    U1["User"] --> CLI1["obol tunnel provision / setup"]
    CLI1 --> API["Cloudflare API token"]
    API --> CF1["CLI creates tunnel"]
    API --> DNS1["CLI mutates DNS / ingress"]
    CLI1 --> K1["cloudflared runs in cluster"]
  end

  subgraph After["After: connector-token setup"]
    U2["User creates tunnel + public hostname in Cloudflare dashboard"]
    U2 --> TOK["Single tunnel connector token"]
    TOK --> CLI2["obol tunnel setup <token>"]
    CLI2 --> SEC["K8s Secret: TUNNEL_TOKEN"]
    CLI2 --> CM["management_mode=remote"]
    SEC --> K2["cloudflared runs in cluster"]
    CM --> K2
  end

flowchart TD
  Setup["obol tunnel setup"] --> Extract["Accept bare token, --token, positional arg, or full cloudflared command"]
  Extract --> Validate["Decode connector token: account tag + tunnel UUID + secret"]
  Validate --> Store["Persist token locally and as K8s Secret"]
  Store --> Helm["Helm upgrade cloudflared into remote-managed mode"]
  Helm --> Sync["Sync AGENT_BASE_URL + frontend tunnel ConfigMap"]
  Sync --> URL["Permanent public URL: https://<hostname>"]

What gets better

  • Smaller credential blast radius: default tunnel setup no longer needs an account-wide Cloudflare API token, account ID, zone ID, or CLI-side DNS mutation.
  • Better UX: users can paste the whole cloudflared tunnel run --token ... command and let Obol extract the token.
  • Cleaner CLI model: tunnel setup becomes the permanent-URL command; browser login is retained as a hidden advanced fallback; domain purchasing is separated under obol domain.
  • Less credential confusion: domain --api-token no longer uses -t, leaving tunnel setup -t for the connector token.
  • Better status without Cloudflare API: tunnel status now reads cloudflared /ready and /metrics in-cluster, plus a public HTTP probe.
flowchart LR
  Status["obol tunnel status"] --> Kube["Deployment / pod readiness"]
  Status --> Ready["cloudflared :2000 /ready"]
  Status --> Metrics["cloudflared :2000 /metrics"]
  Status --> Public["Public URL HTTP probe"]
  Kube --> Report["Concise active / degraded / starting report"]
  Ready --> Report
  Metrics --> Report
  Public --> Report
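
As a rough illustration of how those four signals could fold into one report, here is a sketch. The mapping to active / degraded / starting, the signals struct, and the readyConnections field in the /ready body are all assumptions for the example; the sketch ignores /metrics and does no network I/O.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// readyBody models the JSON returned by the connector's /ready endpoint.
// The field name is an assumption for this sketch.
type readyBody struct {
	ReadyConnections int `json:"readyConnections"`
}

// signals is a hypothetical bundle of the inputs shown in the diagram.
type signals struct {
	PodsReady    bool   // Deployment / pod readiness
	ReadyJSON    []byte // body of :2000/ready
	PublicProbed bool   // false when --no-probe is set
	PublicOK     bool   // result of the public HTTP probe
}

// classify reduces the signals to a single status word.
func classify(s signals) string {
	if !s.PodsReady {
		return "starting"
	}
	var rb readyBody
	if err := json.Unmarshal(s.ReadyJSON, &rb); err != nil || rb.ReadyConnections == 0 {
		return "degraded"
	}
	// Only count the public probe against the tunnel when it actually ran.
	if s.PublicProbed && !s.PublicOK {
		return "degraded"
	}
	return "active"
}

func main() {
	s := signals{
		PodsReady:    true,
		ReadyJSON:    []byte(`{"readyConnections":4}`),
		PublicProbed: true,
		PublicOK:     true,
	}
	fmt.Println(classify(s))
}
```

Keeping the classification a pure function over already-collected signals makes it straightforward to unit-test without a cluster.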

Findings I’m following with fixes

  1. Blocking: the Hermes ownership regression guard fails. syncRuntimeFiles still calls ensureVolumeWritable directly and never restores container ownership with fixRuntimeVolumeOwnership, so the pod-recreation permission bug is not actually fixed yet.
  2. Minor: tunnel status --no-probe can label a remote connector-token tunnel as managed-locally because the fallback status string does not check the management mode.

I’m following this review comment with a small patch for both issues and focused test evidence.


bussyjd commented Jun 16, 2026

Contributor

Follow-up fixes pushed in 21dcc09:

  • Restored the Hermes runtime ownership bookend in syncRuntimeFiles by using the testable ownership hooks and deferring fixRuntimeVolumeOwnershipFn after host-side writes.
  • Fixed the tunnel status fallback label so remote connector-token tunnels report not_probed when probes are skipped/unavailable, instead of managed-locally.
  • Added a regression test for the persistent connector status fallback.
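
The status fallback fix amounts to checking the management mode before picking a label. A minimal sketch, assuming a hypothetical fallbackStatus helper and the mode strings "remote" and "local"; only the not_probed and managed-locally labels come from this PR.

```go
package main

import "fmt"

// fallbackStatus is a hypothetical helper showing the corrected branch order.
// When a probe result exists it wins; otherwise the label depends on the mode.
func fallbackStatus(managementMode string, probed bool, probeStatus string) string {
	if probed {
		return probeStatus
	}
	if managementMode == "remote" {
		// Connector-token tunnels are managed in the Cloudflare dashboard,
		// so a skipped probe must not be reported as managed-locally.
		return "not_probed"
	}
	return "managed-locally"
}

func main() {
	fmt.Println(fallbackStatus("remote", false, ""))
	fmt.Println(fallbackStatus("local", false, ""))
}
```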

Verification:

go test ./cmd/obol ./internal/tunnel ./internal/hermes ./internal/helmcmd -count=1
git diff --check

Both pass locally from the PR worktree.


bussyjd commented Jun 16, 2026

Contributor

Live smoke for the Hermes ownership fix passed on an isolated k3d stack using the PR binary at 21dcc09.

Smoke shape:

OBOL_DEVELOPMENT=true \
OBOL_CONFIG_DIR=/tmp/obol-pr-645-review/.workspace/config \
OBOL_BIN_DIR=/tmp/obol-pr-645-review/.workspace/bin \
OBOL_DATA_DIR=/tmp/obol-pr-645-review/.workspace/data \
obol stack init --backend k3d --force

obol stack up
obol agent sync obol-agent
obol kubectl rollout restart deployment/hermes -n hermes-obol-agent
obol kubectl rollout status deployment/hermes -n hermes-obol-agent --timeout=180s

Evidence:

  • obol stack up initialized and synced the default Hermes agent successfully.
  • After explicit obol agent sync obol-agent, node-side hostPath ownership was restored:
1000:1000 700 /data/hermes-obol-agent/hermes-data/.hermes
1000:1000 660 /data/hermes-obol-agent/hermes-data/.hermes/config.yaml
1000:1000 2775 /data/hermes-obol-agent/hermes-data/.hermes/obol-skills
  • Forced Hermes pod recreation succeeded:
deployment "hermes" successfully rolled out
hermes-56bd8476bf-bjszm
init:init-hermes-data:Completed:0
container:hermes:ready=true:restarts=0
container:hermes-dashboard:ready=true:restarts=0
  • Events showed normal Created/Started lifecycle after rollout, and Hermes logs did not show the prior /data/.hermes permission-denied failure.
  • The isolated smoke cluster obol-stack-useful-mallard was stopped afterward with obol stack down.

@bussyjd bussyjd merged commit 6c77f17 into main Jun 16, 2026
8 checks passed