snapshot: --probe flag + weekly CI cron #182

Open
barbatos2011 wants to merge 2 commits into tronprotocol:develop from barbatos2011:feat/snapshot-sources-probe

@barbatos2011

Summary

  • New --probe flag on the existing trond snapshot sources command: HEAD-checks every upstream mirror, classifies each as ok / stale / unreachable / no_backups / bad_config, and exits non-zero if any source is not ok.
  • Probe(ctx, src, opts) + ProbeAll in internal/snapshot/. For date-strategy mirrors (nile) walks the generated date list newest-to-oldest, stopping at the first HTTP 200; for html mirrors scrapes the index. Distinguishes "no recent backup" from "endpoint vanished".
  • Weekly cron workflow (.github/workflows/snapshot-sources-probe.yml): Mon 09:00 UTC + manual dispatch. On failure opens a rolling snapshot-probe-stale issue (one issue, comments for subsequent failures), auto-closes when sources recover. Probe artifact uploaded for 30-day retention.

Why

Task #161 (Nile S3 mirror → 403) was the visible symptom of a deeper gap: the SourceTable is a hardcoded list with no feedback loop. The structural fix is a cron-driven probe so the next URL rotation surfaces in CI within a week instead of in a user bug report.

Smoke test (live SourceTable, this branch)

STATUS       NETWORK  KIND  DOMAIN              LATEST          AGE
ok           mainnet  lite  34.143.247.77       backup20260522  3d
ok           mainnet  full  34.143.247.77       backup20260522  3d
ok           mainnet  full  35.247.128.170      backup20260522  3d
ok           mainnet  full  34.86.86.229        backup20260523  2d
ok           mainnet  full  34.48.6.163         backup20260520  5d
ok           mainnet  full  35.197.17.205       backup20260522  3d
unreachable  nile     lite  database.nileex.io  -               -
summary: ok=6 stale=0 unreachable=1 no_backups=0 bad_config=0
exit code: 1

(Nile is currently flagged because this branch is off develop, predating the #161 URL fix on the shadowfork-phase1 PR.)

Test plan

  • Unit tests via httptest cover ok / stale / unreachable / bad_config + ProbeAll order preservation + age parser for both date formats.
  • CLI smoke-tested against the real SourceTable (output above).
  • Workflow dispatch once merged to confirm issue-open/close flow.

🤖 Generated with Claude Code

…follow-up)

Adds the feedback loop that would have caught tronprotocol#161 before a user did.
The Nile S3 URL going stale was invisible from the codebase side --
nothing exercised the published URLs after a hardcoded edit, so the
table aged silently until a human tried to download.

Three pieces:

internal/snapshot/probe.go

  Probe(ctx, src, opts) HEAD-checks the actual published tarball
  URL for a source. For 'date'-strategy mirrors (nile) it walks the
  generated date list newest-to-oldest; for 'html' mirrors it
  scrapes the index, same as Download does. First 200 wins and is
  classified ok / stale based on its age. We do NOT trust the
  existing LatestBackup helper here -- that one returns the topmost
  candidate without HEAD-checking, which would tell us nothing.

  ProbeAll concurrently probes a []Source preserving input order.

cmd/snapshot/sources.go

  Existing 'trond snapshot sources' grows three flags: --probe,
  --probe-timeout, --stale-after. Probe path returns a non-nil error
  on any not-OK source so the CLI exits 1; cleaner for shell-pipe
  consumption than os.Exit() inside a Cobra RunE. JSON output still
  prints the full report before failing.

.github/workflows/snapshot-sources-probe.yml

  Weekly cron (Mon 09:00 UTC) + workflow_dispatch. Builds trond,
  runs the probe in JSON mode, and on failure opens (or comments on
  the existing) a rolling 'snapshot-probe-stale'-labelled issue.
  Auto-closes when sources recover. Probe artifact is uploaded with
  30-day retention so we can diff probe runs week-on-week.

Smoke-tested locally against the live SourceTable on this branch
(which still has the broken Nile S3 URL from before tronprotocol#161 lands):

  $ trond snapshot sources --probe
  STATUS       NETWORK  KIND  DOMAIN              LATEST          AGE
  ok           mainnet  lite  34.143.247.77       backup20260522  3d
  ok           mainnet  full  34.143.247.77       backup20260522  3d
  ok           mainnet  full  35.247.128.170      backup20260522  3d
  ok           mainnet  full  34.86.86.229        backup20260523  2d
  ok           mainnet  full  34.48.6.163         backup20260520  5d
  ok           mainnet  full  35.197.17.205       backup20260522  3d
  unreachable  nile     lite  database.nileex.io  -               -
  summary: ok=6 stale=0 unreachable=1 no_backups=0 bad_config=0
  exit code: 1

Useful side finding: all five mainnet IPs from the 2025-Q1 entry
are still healthy a year later, so the file-header staleness
worry was over-cautious. The Nile failure is real and lines up
exactly with tronprotocol#161.

Branch is intentionally off develop so this can land independent
of the shadow-fork PR. When tronprotocol#161's Nile-URL fix merges, the probe
will go fully green on the next cron tick (or trigger workflow_dispatch
manually to verify immediately).

Two CI failures on PR tronprotocol#182 after first push, both in code I added:

1. gofmt — probe.go's const block + struct field alignment differed
   from gofmt's column choice (one extra space on a comment). gofmt -w
   reflowed; no behavioural change.

2. unparam — buildProbeMirror returned (*httptest.Server, Source) but
   every caller used '_, src := ...'; the server itself is cleaned up
   via t.Cleanup(srv.Close) inside the helper, so callers never need
   the handle. Dropped the first return; updated all five callers
   from '_, src := ...' to 'src := ...'.

Not fixed in this commit (pre-existing on develop, not introduced by
this PR):
  - Vulnerability scan reports findings on internal/target/ssh.go's
    calls into golang.org/x/crypto/ssh. The vulnerable paths were
    committed before this branch was cut.