docs(frost/signing): canonicalize the static-vs-runtime error taxonomy #3993
Merged
mswilkison merged 1 commit into feat/frost-schnorr-migration-scaffold on May 24, 2026
Adds a top-of-file design-rationale block to roast_retry_orchestration.go
that captures the load-bearing decision (from RFC-21 Phase 6 review)
about which orchestration errors are fallback-eligible and which must
hard-fail.
The decision had been distributed across commit messages, the RFC text,
and inline comments on individual sentinel definitions. The
block centralises it next to the code that enforces it, so future
maintainers can find the rationale without having to reconstruct it
from spelunking history.
Key statements captured:
  STATIC errors  -> safe to fall back to the legacy retry path. Every
                    honest signer observes the same node-local config
                    at startup so fallback decisions are deterministic
                    across the group. Sentinel:
                    ErrNoRoastRetryCoordinatorRegistered, detected via
                    errors.Is in signing_loop_roast_dispatcher.go.

  RUNTIME errors -> HARD FAIL. Per-attempt protocol state errors can be
                    observed by some participants and not others within
                    the same attempt; falling back to legacy under those
                    conditions creates split-brain (some operators
                    running new code, others running legacy on the same
                    attempt). The orchestration layer returns these as
                    bare errors that the dispatcher treats as terminal.
The block also notes the historical redirect: the earlier design had
BeginAttempt failures fall back, on the assumption that BeginAttempt
was cheap, idempotent setup. Review identified that BeginAttempt mutates
per-attempt state and can fail from races with concurrent receives,
which the static-error fallback can't safely handle. Documenting the
"why" prevents the regression from being re-introduced by a maintainer
who reads only the code.
Pure documentation -- no behaviour change, no test changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
da5f833 into feat/frost-schnorr-migration-scaffold
21 of 23 checks passed
Why
The RFC-21 Phase 6 review decided which orchestration errors are fallback-eligible and which must hard-fail. Static config errors are safe to fall back to the legacy retry path. Runtime per-attempt errors get no fallback, because participants can diverge on whether they saw the error, and that divergence splits the group (split-brain). The rationale lived in commit messages, the RFC text, and inline comments on individual sentinels: distributed enough that a future maintainer reading just `roast_retry_orchestration.go` could miss the load-bearing constraint.
This PR adds a top-of-file design-rationale block that centralises the decision in the place that enforces it.
What changed
What it captures
Lineage
Surfaced in the cross-PR review re-evaluation that followed the PR #3866 follow-up landings. Originally tracked as "Document static-vs-runtime classification canonically" and initially flagged as "available if you want"; now elevated because the rationale is the most important architectural decision in the RFC-21 stack and is currently the easiest piece of design context to lose.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>