Commit 6214eec

docs(rfc): lock RFC-21 Phase-3 design decisions (#3967)
## Summary

Promotes the Phase-3 design decisions settled in the **2026-05-22** cross-team review into a dedicated **Resolved Decisions** section of RFC-21. Doc-only; +180/-38.

## Why

The previous draft listed Phase-3 questions under "Open questions" with a *recommended-entering-Phase-3* path that turned out, on review, to have a critical safety gap: the all-to-all signed-evidence gossip recommendation silently assumed gossip is synchronously consistent across the signer set. In practice gossip is eventually consistent, so two honest signers can hold divergent evidence sets at the moment the deterministic `NextAttempt` boundary triggers, producing divergent next-attempt contexts and fracturing the group.

This PR locks the replacement design before Phase 3 implementation PRs begin landing.

## What the resolved-decisions section pins

| Decision | Resolution |
|---|---|
| Cross-process coordinator agreement | **Coordinator-proposed aggregation** on a dedicated evidence topic, signed with operator key, receiver-side bundle verification for censorship detection. All-to-all gossip + local union is rejected with rationale. |
| Source of `DkgGroupPublicKey` for seed | Extracted from FFI signer material at attempt construction time. No wallet-registry lookup on hot path. |
| `AttemptContext` ↔ `NativeExecutionFFISigningRequest` | Field on request struct; Go-side orchestration only; does not cross CGO boundary. |
| `SelectCoordinator` retention | Keep as helper; `BeginAttempt` bridges `[32]byte` seed to legacy `int64` via a sterile, named adapter. |
| Evidence-signing key | Reuse existing operator key. |
| Evidence message format | JSON wrapped in existing `pkg/net/gen/pb` envelope; routed via `net.Message`. |
| Maximum evidence-message size | Single `TransitionMessage` per transition, ~10-20 KiB at 100-signer saturation. No chunking. |
| Silence-parking transience (risk mitigation) | Strictly single-attempt skip, no escalation. A peer falsely labelled silent is reinstated by the very next attempt. |

## Layer-B exclusion-policy strengthening

The exclusion-policy list in Layer B is extended with explicit "no escalation" wording for the silence/parking case. The risk Gemini's review surfaced (late-arriving evidence weaponised into permanent exclusion) is bounded by:

- Silence parking ≤ 1 attempt.
- Permanent exclusion only fires on overflow (transport-blamable) or non-transport reject (validation-blamable). Neither can trigger on a slow-but-honest peer.
- Receiver-side bundle verification catches a coordinator that tries to censor an honest peer's signed snapshot.

## Open questions reduced to three

What remains in the Open Questions section is genuinely open:

- Persistence across signer restart (Phase 5+).
- FFI surface for future exclusion-relevant errors (follows the L5 pattern from #425 / #3961).
- `AttemptContextHash` backward-compat horizon (Phase 6+).

## Test plan

- [ ] Reviewer reads the Resolved decisions section end-to-end.
- [ ] Reviewer confirms the coordinator-aggregation flow as documented matches the agreed design.
- [ ] AsciiDoc renders cleanly (CI step `Publish contracts documentation` covers this).

No code change; no behaviour-test surface.
2 parents 266272c + 832d529 commit 6214eec

1 file changed

Lines changed: 180 additions & 38 deletions

docs/rfc/rfc-21-roast-coordinator-retry-and-transition-evidence.adoc

@@ -198,6 +198,14 @@ session inputs; it is never chosen, only derived. Any signer can
198198
recompute it from the session header and verify the coordinator's
199199
participant selection.
200200

201+
*`DkgGroupPublicKey` source.* The runtime extracts `DkgGroupPublicKey`
202+
from the FFI signer material at attempt construction time -- the same
203+
material that already carries the DKG-validated group public key and is
204+
required at signature-verification time anyway. Do not re-read it from
205+
the wallet registry: the FFI material is the canonical hot-path source,
206+
removes async/DB lookup latency, and preserves separation between the
207+
core signing protocol and application state.
208+
201209
=== Layer A: Receiver transition evidence (M4)
202210

203211
The three `select { default }` drops become:
@@ -248,8 +256,9 @@ type categoryQuota struct {
248256
The point is to produce a fixed-size attestation, not to log
249257
everything forever. Per-attempt evidence is at most
250258
`O(|IncludedSet| * sum(quotas))` bytes -- bounded, predictable, and
251-
small enough to be signed and broadcast as a single message
252-
(see open question 1).
259+
small enough to be signed and broadcast as a single message. The
260+
broadcast mechanism is the coordinator-aggregated `TransitionMessage`
261+
defined in the Resolved decisions section.
253262

254263
=== Layer B: Coordinator state (joining M4 and M7)
255264

@@ -278,6 +287,26 @@ type Coordinator interface {
278287
context from the previous attempt's evidence. It is deterministic given
279288
`(AttemptContext, TransitionEvidence)` -- two coordinators with the same
280289
verified inputs agree on the next attempt without further coordination.
290+
291+
The verified-inputs requirement is critical: gossip is eventually
292+
consistent, but `NextAttempt` is a synchronous state transition. Two
293+
honest signers fed differently-timed evidence sets produce divergent
294+
contexts. To prevent that, the *evidence input itself* is an
295+
authoritative `TransitionMessage` produced by the current attempt's
296+
coordinator (the "coordinator-aggregation" model defined in the
297+
Resolved decisions section); see that section for the full
298+
agreement-flow specification.
299+
300+
*Seed-bridging.* The legacy `pkg/frost/roast/coordinator.go::SelectCoordinator`
301+
helper accepts an `int64` seed plus an attempt number. `BeginAttempt`
302+
wraps it with a sterile bridge that folds the new `[32]byte`
303+
`AttemptSeed` into the legacy parameter shape -- for example, taking
304+
the first 8 bytes as a big-endian `int64`. The bridge is a
305+
non-cryptographic adapter for the deterministic shuffle: equivalent
306+
seed bytes must produce the same legacy `int64` on every honest
307+
signer. The bridge is named, isolated, and exhaustively tested so
308+
later edits cannot accidentally desynchronise it.
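
A minimal sketch of the seed bridge described above, assuming the "first 8 bytes as a big-endian `int64`" option that the text offers as an example. The function name `legacySeedFromAttemptSeed` is hypothetical; the RFC only requires that the adapter be named, isolated, and tested.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// legacySeedFromAttemptSeed folds a 32-byte attempt seed into the int64
// shape expected by the legacy shuffle. It reads the first 8 bytes as a
// big-endian unsigned integer and reinterprets the bits as int64, so the
// result can be negative. It is a non-cryptographic adapter: the only
// property relied on is that equal seed bytes yield an equal int64.
// The name is hypothetical and not taken from keep-core.
func legacySeedFromAttemptSeed(seed [32]byte) int64 {
	return int64(binary.BigEndian.Uint64(seed[:8]))
}

func main() {
	var seed [32]byte
	seed[7] = 0x2a // first 8 bytes: 00 00 00 00 00 00 00 2a
	fmt.Println(legacySeedFromAttemptSeed(seed)) // 42
}
```

Fixing the byte order explicitly keeps the result independent of host endianness, which is the property the bridge needs for every honest signer to derive the same legacy seed.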
309+
281310
The exclusion policy is:
282311

283312
. Senders with `OverflowCount >= overflowExclusionThreshold` during the
@@ -286,7 +315,14 @@ The exclusion policy is:
286315
reasons are moved to `ExcludedSet` (validation blamable).
287316
. Senders with deadline-expiry only -- silent peers -- are moved to a
288317
*parked* set that the next attempt skips but the attempt after that
289-
retries (to tolerate transient outages).
318+
retries (to tolerate transient outages). Silence parking is
319+
*strictly transient*: a single attempt's worth of skip, no escalation.
320+
A peer falsely labelled silent because their contribution arrived
321+
late (or because a malicious coordinator censored it) is not
322+
permanently penalised -- they are reinstated by the very next
323+
attempt. Permanent exclusion only follows from overflow or
324+
non-transport reject events, neither of which can fire on a
325+
slow-but-honest peer.
290326
. If `IncludedSet` minus exclusions drops below the threshold `t`, the
291327
coordinator returns `ErrAttemptInfeasible` and the session is
292328
declared failed for this signer set.
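
The exclusion policy above can be sketched as a single classification pass. This is an illustration under stated assumptions, not the RFC's implementation: the `senderEvidence` type, the `classify` function, the threshold value of 3, and the "any non-transport reject excludes" rule are all hypothetical, and the sketch assumes parked peers count as unavailable when checking feasibility for the next attempt.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// senderEvidence is a hypothetical per-sender summary, not a keep-core type.
type senderEvidence struct {
	OverflowCount       int  // transport-blamable overflow events
	NonTransportRejects int  // validation-blamable rejects
	DeadlineExpired     bool // silent: nothing arrived before the deadline
}

// Illustrative value only; the RFC names the threshold but this value is assumed.
const overflowExclusionThreshold = 3

var errAttemptInfeasible = errors.New("attempt infeasible")

// classify splits the included set into members of the next attempt,
// permanently excluded senders, and senders parked for exactly one attempt.
// It returns errAttemptInfeasible when fewer than t members remain.
func classify(included []string, ev map[string]senderEvidence, t int) (next, excluded, parked []string, err error) {
	ids := append([]string(nil), included...)
	sort.Strings(ids) // canonical order so every signer produces identical output
	for _, id := range ids {
		e := ev[id]
		switch {
		case e.OverflowCount >= overflowExclusionThreshold:
			excluded = append(excluded, id)
		case e.NonTransportRejects > 0: // assumed rule: any such reject excludes
			excluded = append(excluded, id)
		case e.DeadlineExpired:
			parked = append(parked, id) // skipped once, never escalated
		default:
			next = append(next, id)
		}
	}
	if len(next) < t {
		return nil, excluded, parked, errAttemptInfeasible
	}
	return next, excluded, parked, nil
}

func main() {
	ev := map[string]senderEvidence{
		"a": {OverflowCount: 3},
		"b": {DeadlineExpired: true},
	}
	next, excluded, parked, err := classify([]string{"a", "b", "c", "d"}, ev, 2)
	fmt.Println(next, excluded, parked, err) // [c d] [a] [b] <nil>
}
```

In this sketch the caller would add the `parked` senders back to the included set for the attempt after next, which is what keeps silence parking strictly transient.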
@@ -404,42 +440,148 @@ choices in their PR descriptions and reviews.
404440
only when the supporting evidence is attached. The RFC does not
405441
promise an early flip.
406442

407-
== Open questions
443+
== Resolved decisions
444+
445+
The decisions in this section were settled in a Phase-3 design review
446+
(2026-05-22) with cross-team protocol-owner input. They are listed
447+
here so subsequent implementation PRs can reference them.
448+
449+
=== Cross-process coordinator agreement
450+
451+
*Decision: coordinator-proposed aggregation on a dedicated topic,
452+
signed with the operator key, with receiver-side bundle verification
453+
for censorship detection.*
454+
455+
The earlier draft of this RFC carried "all-to-all signed-evidence
456+
gossip with local union" as the recommended path. That recommendation
457+
silently assumed gossip is synchronously consistent across the signer
458+
set; in practice gossip is eventually consistent, so two honest
459+
signers can hold divergent evidence sets at the moment the attempt
460+
times out. Applying the deterministic `NextAttempt` function to
461+
divergent inputs produces divergent next-attempt contexts and
462+
fractures the signing group.
463+
464+
The replacement flow is:
465+
466+
. *Observation.* Each signer's `EvidenceRecorder` (Phase 2)
467+
produces a per-attempt local-evidence snapshot.
468+
. *Submission.* Each signer signs its snapshot with its operator
469+
key (the same key `pkg/net` already uses to attribute network
470+
messages) and broadcasts it on a dedicated evidence topic.
471+
. *Aggregation.* The current attempt's elected coordinator
472+
(the deterministic `SelectCoordinator` output) collects the
473+
signed snapshots, builds a canonical bundle, signs the bundle,
474+
and broadcasts it as a `TransitionMessage`.
475+
. *Verification.* Every receiver validates the bundle's
476+
coordinator signature, validates each contained snapshot's
477+
operator signature, *and verifies that its own observations
478+
appear in the bundle*. A coordinator that omits an honest
479+
peer's signed snapshot is caught here.
480+
. *Transition.* Receivers feed the verified bundle into
481+
`NextAttempt`. Because the bundle is the authoritative input,
482+
all honest receivers compute the same next-attempt context.
483+
484+
A peer that signs conflicting snapshots is slashable -- the
485+
signature is the binding. A coordinator that signs an inconsistent
486+
bundle (omits observations, alters counts, etc.) is detected at
487+
verification step (4) and the next-attempt coordinator handles the
488+
exclusion.
489+
490+
Alternatives considered (rejected):
491+
492+
. *All-to-all signed-evidence gossip with local union.* Original
493+
recommendation. Rejected because gossip's eventual-consistency
494+
semantics let honest signers reach the deterministic
495+
`NextAttempt` boundary with divergent inputs, producing
496+
divergent outputs.
497+
. *Piggy-back on existing FROST broadcast channel.* Rejected
498+
because it couples evidence rate limits to protocol round-trip
499+
rate limits, and re-uses a topic with different traffic
500+
characteristics.
501+
. *Coordinator-only authoritative without aggregation.* Rejected
502+
because losing the all-signer signed attestations also loses
503+
the audit trail. The aggregation model keeps the per-signer
504+
signatures inside the bundle, so the audit trail survives.
505+
506+
Liveness: a malicious coordinator can withhold the
507+
`TransitionMessage`, stalling the transition. ROAST handles this
508+
the same way it handles a malicious signer: the attempt times
509+
out, the next attempt elects a different coordinator (the
510+
`SelectCoordinator` output is deterministic but rotates with the
511+
attempt number), and the new coordinator drives the transition.
512+
The malicious coordinator's evidence is itself parked or
513+
excluded by the new coordinator's bundle, ending the loop.
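
To illustrate how a deterministic selection can still rotate with the attempt number, here is a small sketch. It is not keep-core's `SelectCoordinator`; the function `selectCoordinatorSketch`, its signature, and the shuffle-then-index scheme are assumptions chosen only to show the property the liveness argument relies on.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// selectCoordinatorSketch is a hypothetical stand-in for the legacy helper.
// It shuffles a canonically ordered copy of the members with a seeded
// generator, then indexes the permutation by attempt number, so the
// coordinator changes from one attempt to the next while every signer
// with the same inputs computes the same answer.
func selectCoordinatorSketch(members []string, seed int64, attempt uint) string {
	if len(members) == 0 {
		return ""
	}
	ordered := append([]string(nil), members...)
	sort.Strings(ordered) // canonical order: input order must not matter
	rng := rand.New(rand.NewSource(seed))
	rng.Shuffle(len(ordered), func(i, j int) {
		ordered[i], ordered[j] = ordered[j], ordered[i]
	})
	return ordered[int(attempt)%len(ordered)]
}

func main() {
	members := []string{"op-3", "op-1", "op-2"}
	for attempt := uint(0); attempt < 3; attempt++ {
		fmt.Println(attempt, selectCoordinatorSketch(members, 42, attempt))
	}
}
```

Because consecutive attempts index adjacent positions of one permutation, a coordinator that withholds its `TransitionMessage` is not re-elected on the very next attempt whenever the group has more than one member.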
514+
515+
Safety: any honest signer that verifies a bundle and computes
516+
`NextAttempt(ctx, bundle)` produces the same context as any other
517+
honest signer that verifies the same bundle. Safety reduces to
518+
"is the bundle correctly verified" -- a local check, not a
519+
network-consistency requirement.
520+
521+
This design satisfies the formal verified-inputs requirement of
522+
the deterministic `NextAttempt` policy specified in Layer B.
523+
524+
=== Source of `DkgGroupPublicKey` for the seed
525+
526+
*Decision: extract from FFI signer material at attempt construction.*
527+
528+
The DKG-validated group public key is already present in the FFI
529+
signer material (it is required at signature-verification time
530+
anyway), so the seed derivation can take it from there. The
531+
wallet registry is *not* consulted on the hot path; doing so
532+
would introduce async lookup latency and entangle the core
533+
signing protocol with application state. See Shared types above
534+
for the derivation contract.
535+
536+
=== `AttemptContext` ↔ `NativeExecutionFFISigningRequest` binding
537+
538+
*Decision: extend the request struct with an `AttemptContext`
539+
field; the context is Go-side orchestration only.*
540+
541+
The context does not cross the CGO/Rust boundary into the
542+
`tbtc-signer` engine -- the engine remains a pure signing
543+
primitive. Go-side coordinator wiring populates the context;
544+
existing call sites construct attempt-zero contexts inline
545+
during Phase 4.
546+
547+
=== `SelectCoordinator` retention
548+
549+
*Decision: keep the existing helper; bridge the seed type inside
550+
`BeginAttempt`.*
551+
552+
The deterministic shuffle is correct in isolation. The bridge
553+
folds the new `[32]byte` `AttemptSeed` into the legacy `int64`
554+
parameter shape with a sterile, named adapter (see Layer B).
555+
556+
=== Evidence-signing key
557+
558+
*Decision: reuse the existing operator key.*
559+
560+
The operator key already binds every other gossip message a
561+
keep-core node emits via `pkg/net`. Layering a second key
562+
surface specifically for evidence signing is premature
563+
optimization given the current key model.
564+
565+
=== Evidence message format
566+
567+
*Decision: JSON payload wrapped in the existing `pkg/net/gen/pb`
568+
envelope, routed via the `net.Message` interface.*
569+
570+
This matches the FROST/tbtc-signer protocol messages (Phase 1B)
571+
and inherits the network layer's operator-key signing
572+
automatically. Raw JSON does not appear on the wire.
573+
574+
=== Maximum evidence-message size
575+
576+
*Decision: single `TransitionMessage` per transition; no
577+
chunking.*
578+
579+
Under coordinator-aggregation, the per-transition payload is
580+
`O(N)`, not `O(N^2)`. For a 100-signer group with all four
581+
quotas saturated, the JSON-encoded bundle is ~10-20 KiB,
582+
comfortably within libp2p's per-message limits.
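
The scaling argument can be made concrete with back-of-the-envelope arithmetic. The per-snapshot and envelope byte counts below are placeholders chosen only so the result lands inside the range quoted above; they are not measurements. Reading `O(N^2)` as "every signer receives every other signer's snapshot" is likewise an assumption about what the comparison refers to.

```go
package main

import "fmt"

// bundleBytes estimates a coordinator-aggregated bundle: one snapshot per
// signer plus a fixed envelope. Grows linearly in the number of signers.
func bundleBytes(signers, perSnapshot, envelope int) int {
	return signers*perSnapshot + envelope
}

// allToAllBytes estimates total snapshot bytes delivered across the group
// when each signer receives every other signer's snapshot. Grows
// quadratically in the number of signers.
func allToAllBytes(signers, perSnapshot int) int {
	return signers * (signers - 1) * perSnapshot
}

func main() {
	const perSnapshot, envelope = 150, 200 // placeholder sizes, not measured
	for _, n := range []int{10, 50, 100} {
		fmt.Printf("N=%d bundle=%d B all-to-all=%d B\n",
			n, bundleBytes(n, perSnapshot, envelope), allToAllBytes(n, perSnapshot))
	}
}
```

With these placeholder sizes, a 100-signer bundle comes to 15,200 bytes (about 14.8 KiB), while the all-to-all total is 1,485,000 bytes.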
408583

409-
. *Cross-process coordinator agreement.* Today each signer runs its own
410-
process; the coordinator state machine is per-process. We assume
411-
that two honest signers, fed the same `TransitionEvidence` from a
412-
shared gossip layer, produce the same `NextAttempt`. Without
413-
agreement on the evidence input, the deterministic function still
414-
produces divergent outputs -- node A excludes peer X (saw overflow),
415-
node B does not (didn't), and the next-attempt sets disagree. This
416-
defeats the whole point of the layered design.
417-
+
418-
*Recommended path (signed-evidence gossip):* every observer signs the
419-
evidence it produced with its operator key and broadcasts the
420-
attestation on a dedicated evidence topic. Honest signers feed only
421-
*verified attestations* into the deterministic
422-
`NextAttempt`, taking the union over signed observations and applying
423-
the same exclusion thresholds. Two honest signers thus consume the
424-
same input set and produce the same output. A peer that signs
425-
conflicting evidence is itself slashable -- the signature is the
426-
binding.
427-
+
428-
Options considered:
429-
.. Piggy-back on existing FROST broadcast channel -- simplest but
430-
couples evidence to protocol round-trips and re-uses a topic with
431-
different rate-limit characteristics.
432-
.. *Dedicated evidence broadcast topic with signed attestations
433-
(recommended).* Cleaner separation, more wiring; the wiring is
434-
what the design owes the protocol.
435-
.. Coordinator-only authoritative -- only the elected coordinator
436-
produces evidence and other signers verify but don't recompute.
437-
Closest to the paper but loses redundancy.
438-
+
439-
The recommendation is the recommended *entering* Phase 3. The final
440-
decision is still owed and is the question that most needs
441-
design-time review with threshold-network/keep-core protocol owners
442-
before Phase 3 lands.
584+
== Open questions
443585

444586
. *Persistence across signer restart.* If a signer crashes mid-attempt,
445587
does it lose its evidence? The paper assumes persistent state. For
