
docs(frost/signing): canonicalize the static-vs-runtime error taxonomy#3993

Merged
mswilkison merged 1 commit into feat/frost-schnorr-migration-scaffold from chore/document-static-runtime-taxonomy-2026-05-23
May 24, 2026

Conversation

@mswilkison
Contributor

Why

The RFC-21 Phase 6 review decided which orchestration errors are fallback-eligible and which must hard-fail. Static config errors are safe to fall back to the legacy retry path. Runtime per-attempt errors get no fallback, because per-participant divergence creates a split-brain fracture of the group. The rationale lived in commit messages, the RFC text, and inline comments on individual sentinels, scattered widely enough that a future maintainer reading only `roast_retry_orchestration.go` could miss the load-bearing constraint.

This PR adds a top-of-file design-rationale block that centralises the decision in the place that enforces it.

What changed

  • One file changed: `pkg/frost/signing/roast_retry_orchestration.go`
  • Pure documentation: no behavior change, no test changes, no API change
  • 49 lines added (one comment block)

What it captures

  1. STATIC vs RUNTIME classification — explicit definitions, with the sentinel (`ErrNoRoastRetryCoordinatorRegistered`) and detection mechanism (`errors.Is` in `signing_loop_roast_dispatcher.go`) named.
  2. Why static-error fallback is safe — every honest signer observes the same node-local config at startup, so the fallback decision is deterministic across the group.
  3. Why runtime-error fallback is unsafe — per-attempt protocol state errors can be observed by some participants and not others within the same attempt; fallback would put some operators on new code and others on legacy for the same attempt.
  4. Enforcement rule — any error surfaced from this package that is intended to permit fallback MUST be the sentinel; wrapping ANY runtime error in the sentinel is a safety regression that PR reviewers should reject.
  5. Historical redirect — the earlier design had `BeginAttempt` failures fall back, on the assumption that `BeginAttempt` was cheap, idempotent setup. Review identified that `BeginAttempt` mutates per-attempt state and can fail from races with concurrent receives; the taxonomy was tightened so that only true configuration errors are fallback-eligible.
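
The classification rule in items 1 to 4 can be sketched roughly as follows. This is a minimal illustration, not the code in `signing_loop_roast_dispatcher.go`: only the sentinel name and the use of `errors.Is` come from this PR, while the sentinel's message text, the `dispatchDecision` type, the `attemptError` type, and the `classify` function are hypothetical names introduced for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoRoastRetryCoordinatorRegistered is the sentinel named in this PR.
// The message text is an assumption made for this sketch.
var ErrNoRoastRetryCoordinatorRegistered = errors.New(
	"no ROAST retry coordinator registered",
)

// dispatchDecision is a hypothetical type describing what the dispatcher
// does with an orchestration result.
type dispatchDecision int

const (
	proceed          dispatchDecision = iota // no error: continue the attempt
	fallBackToLegacy                         // STATIC error: legacy retry path
	hardFail                                 // RUNTIME error: terminal
)

// attemptError is a hypothetical stand-in for a per-attempt runtime failure.
type attemptError struct{ msg string }

func (e attemptError) Error() string { return e.msg }

// classify applies the taxonomy: only an error whose chain contains the
// sentinel permits fallback; every other non-nil error is terminal.
func classify(err error) dispatchDecision {
	switch {
	case err == nil:
		return proceed
	case errors.Is(err, ErrNoRoastRetryCoordinatorRegistered):
		return fallBackToLegacy
	default:
		return hardFail
	}
}

func main() {
	// A static error keeps matching after it is wrapped with context.
	static := fmt.Errorf("setup: %w", ErrNoRoastRetryCoordinatorRegistered)
	fmt.Println(classify(static) == fallBackToLegacy)

	// A bare runtime error never matches the sentinel.
	fmt.Println(classify(attemptError{"conflicting receive"}) == hardFail)
}
```

Defaulting every unrecognised error to `hardFail` reflects the enforcement rule: fallback has to be opted into through the sentinel, so a new runtime error added later is terminal unless someone deliberately wraps it.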

Lineage

Surfaced in the cross-PR review re-evaluation following PR #3866 follow-up landings. Originally tracked as "Document static-vs-runtime classification canonically" and initially flagged as "available if you want," it is now elevated because the rationale is the most important architectural decision in the RFC-21 stack and currently the easiest piece of design context to lose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a top-of-file design-rationale block to roast_retry_orchestration.go
that captures the load-bearing decision (from RFC-21 Phase 6 review)
about which orchestration errors are fallback-eligible and which must
hard-fail.

The decision had been distributed across commit messages, the RFC text,
and inline comments on individual sentinel definitions. The
block centralises it next to the code that enforces it, so future
maintainers can find the rationale without having to reconstruct it
from spelunking history.

Key statements captured:

  STATIC errors  -> safe to fall back to the legacy retry path. Every
                    honest signer observes the same node-local config
                    at startup so fallback decisions are deterministic
                    across the group. Sentinel:
                    ErrNoRoastRetryCoordinatorRegistered, detected via
                    errors.Is in signing_loop_roast_dispatcher.go.

  RUNTIME errors -> HARD FAIL. Per-attempt protocol state errors can be
                    observed by some participants and not others within
                    the same attempt; falling back to legacy under those
                    conditions creates split-brain (some operators
                    running new code, others running legacy on the same
                    attempt). The orchestration layer returns these as
                    bare errors that the dispatcher treats as terminal.

The block also notes the historical redirect: the earlier design had
BeginAttempt failures fall back, on the assumption that BeginAttempt
was cheap, idempotent setup. Review identified that BeginAttempt mutates
per-attempt state and can fail from races with concurrent receives,
which the static-error fallback can't safely handle. Documenting the
"why" prevents the regression from being re-introduced by a maintainer
who reads only the code.
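
To make the regression concrete, here is a small sketch of how wrapping changes what `errors.Is` reports. It assumes the dispatcher's fallback check is equivalent to the hypothetical `fallbackEligible` helper below; `errAttemptRace` and both wrapping helpers are likewise invented for illustration and are not taken from the package.

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel named in this PR; the message text is an assumption.
var ErrNoRoastRetryCoordinatorRegistered = errors.New(
	"no ROAST retry coordinator registered",
)

// errAttemptRace is a hypothetical runtime failure, such as BeginAttempt
// losing a race with a concurrent receive.
var errAttemptRace = errors.New("attempt state changed by concurrent receive")

// wrapRuntimeCorrectly adds context but keeps the sentinel out of the chain,
// so the error stays terminal.
func wrapRuntimeCorrectly(err error) error {
	return fmt.Errorf("begin attempt: %w", err)
}

// wrapRuntimeInSentinel is the regression: it places the sentinel in the
// chain of a runtime error, which makes the error fallback-eligible.
func wrapRuntimeInSentinel(err error) error {
	return fmt.Errorf("%w: %v", ErrNoRoastRetryCoordinatorRegistered, err)
}

// fallbackEligible is a hypothetical stand-in for the dispatcher's check.
func fallbackEligible(err error) bool {
	return errors.Is(err, ErrNoRoastRetryCoordinatorRegistered)
}

func main() {
	fmt.Println(fallbackEligible(wrapRuntimeCorrectly(errAttemptRace)))  // false
	fmt.Println(fallbackEligible(wrapRuntimeInSentinel(errAttemptRace))) // true
}
```

The second call is the case a reviewer should reject: the underlying failure is per-attempt, yet the error now reads as a static configuration error.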

Pure documentation -- no behaviour change, no test changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mswilkison mswilkison merged commit da5f833 into feat/frost-schnorr-migration-scaffold May 24, 2026
21 of 23 checks passed
@mswilkison mswilkison deleted the chore/document-static-runtime-taxonomy-2026-05-23 branch May 24, 2026 00:15