[leader] drive guardian committee handoff #568
from_epoch,
hashi_epoch, "Driving guardian committee handoff"
);
let signed = Self::collect_committee_transition_signatures(inner, from_epoch).await?;
So one challenge with this is that we're expecting a previous committee to ratchet the guardian up to a new epoch. Assuming this happens quickly after epoch change this should be fine, and we can always recover by having the provisioners explicitly do the ratcheting themselves. But you need to worry about the "leader" maybe not even being a part of that previous committee.
@bmwill wait i think the leader doesn't have to be part of the outgoing committee, since it only fans out the SignCommitteeTransition over peer RPCs to the outgoing committee's members and aggregates a threshold cert.
Each follower signs with the historical epoch key it kept in db.signing_keys, so a hashi-server that has rotated out of the active committee can still sign for an epoch it was in.
so as long as enough former committee members are online i think we should be fine, right?
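The fan-out described in this reply can be sketched roughly as below. This is a minimal illustration, not the PR's code: `Share`, `collect_transition_shares`, the `ask_peer` callback, and the count-based threshold are all assumptions made for the example (the real aggregation is a BLS threshold cert). The point it shows is that the caller's own membership never enters the calculation.

```rust
/// Hypothetical signature share returned by one outgoing-committee member.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Share {
    pub signer: String,
    pub epoch: u64,
}

/// Ask each member of the OUTGOING committee to sign the transition away from
/// `from_epoch`, stopping once `threshold` shares are collected.
/// `ask_peer` stands in for the SignCommitteeTransition peer RPC and returns
/// `None` when a peer is offline or no longer holds a key for that epoch.
/// Note that nothing here requires the caller (the leader) to be a member.
pub fn collect_transition_shares(
    outgoing: &[String],
    from_epoch: u64,
    threshold: usize,
    ask_peer: impl Fn(&str, u64) -> Option<Share>,
) -> Option<Vec<Share>> {
    let mut shares = Vec::new();
    for member in outgoing {
        if shares.len() >= threshold {
            break;
        }
        if let Some(share) = ask_peer(member, from_epoch) {
            shares.push(share);
        }
    }
    // Succeed only if enough former members answered.
    if shares.len() >= threshold {
        Some(shares)
    } else {
        None
    }
}
```

Under these assumptions, the handoff succeeds whenever enough former members are reachable and still hold their historical key, regardless of who is leading.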
The guardian's committee is set once at ProvisionerInit and can never change, so signature verification fails as soon as hashi rotates past the bootstrap epoch. The guardian-side fix (new `UpdateCommittee` RPC and `current_committee_epoch` reporting) lands in the stacked PR underneath this one. This PR wires the hashi-server side that drives the handoff.

- Each leader tick spawns a bounded one-shot reconcile task. It reads the guardian's `current_committee_epoch`, and for each missing step fans out `SignCommitteeTransition` across the OUTGOING committee, aggregates a BLS cert with each member's historical per-epoch BLS signing key from `db.signing_keys`, and sends an `UpdateCommittee` to advance the guardian by one epoch.
- The new committee in the transition is reconstructed by each signer from on-chain state at `from_epoch + 1`. No committee bytes travel on the inter-node wire, so the leader can't get peers to sign attacker-crafted committees.
- Idempotency lives on the guardian side, so leader churn and lost RPC results are safe: the next leader simply repeats.
- New metric `hashi_guardian_current_committee_epoch` mirrors the guardian's reported epoch.
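The reconcile loop described above could look roughly like the following. This is a simplified, synchronous sketch under stated assumptions: the `Guardian` trait, `TransitionCert`, `MockGuardian`, and the `sign_transition` callback are hypothetical stand-ins, and only the names `UpdateCommittee` and `current_committee_epoch` come from the commit message.

```rust
/// Hypothetical aggregated certificate for one committee transition.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TransitionCert {
    pub from_epoch: u64,
    pub to_epoch: u64,
}

/// Hypothetical view of the guardian; the methods stand in for its RPCs.
pub trait Guardian {
    fn current_committee_epoch(&self) -> u64;
    /// Stands in for UpdateCommittee; returns the epoch the guardian reports afterwards.
    fn update_committee(&mut self, cert: &TransitionCert) -> Result<u64, String>;
}

/// Advance the guardian one transition at a time until it reaches `hashi_epoch`.
/// `sign_transition` stands in for the fan-out that yields a cert for `from_epoch`.
pub fn reconcile_guardian<G: Guardian>(
    guardian: &mut G,
    hashi_epoch: u64,
    sign_transition: impl Fn(u64) -> Result<TransitionCert, String>,
) -> Result<u64, String> {
    let mut epoch = guardian.current_committee_epoch();
    while epoch < hashi_epoch {
        let cert = sign_transition(epoch)?;
        let reported = guardian.update_committee(&cert)?;
        // Bail out rather than spin if the guardian did not move forward.
        if reported <= epoch {
            return Err(format!("guardian did not advance past epoch {epoch}"));
        }
        epoch = reported;
    }
    Ok(epoch)
}

/// In-memory guardian used only to exercise the sketch.
pub struct MockGuardian {
    pub epoch: u64,
}

impl Guardian for MockGuardian {
    fn current_committee_epoch(&self) -> u64 {
        self.epoch
    }
    fn update_committee(&mut self, cert: &TransitionCert) -> Result<u64, String> {
        // Idempotent: a cert for a transition that is not the next one changes nothing.
        if cert.from_epoch == self.epoch {
            self.epoch = cert.to_epoch;
        }
        Ok(self.epoch)
    }
}
```

Because the mock guardian ignores replayed transitions, running the loop again after a lost result or a leader change leaves the guardian's epoch unchanged, which is the property the commit message relies on.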
Apply review feedback over the original commit: factor the duplicated "time + record outcome" pattern in `reconcile_guardian_committee` into a small helper, and shorten doc comments throughout to match neighboring RPC handlers and signing helpers. No behavior change.
Drop the `maybe_reconcile_guardian_committee` and `validate_and_sign_committee_transition` doc blocks (siblings have none), pull `reconcile_guardian_committee`, `time_guardian_rpc`, and `sign_message_proto_at_epoch` docs to one line, and shorten the inline `ProvisionerInit` and `Bail out` comments. Also trim the proto-side `SignCommitteeTransition` rationale.
- Route guardian RPC metrics through the existing RpcMetricsMakeCallbackHandler middleware via a new GuardianClient::with_metrics setter (mirrors grpc::client::Client), and drop the explicit time_guardian_rpc wrapper and the now-unused GUARDIAN_RPC_METHOD_UPDATE_COMMITTEE constant.
- Fix the sparse-epoch handling: leader and followers pick the next on-chain committee via range((from_epoch + 1)..).next() instead of from_epoch + 1. Aligns with the guardian-side fix on PR #569.
- Rename maybe_reconcile_guardian_committee to check_reconcile_guardian_committee to match the file's check_* helpers.
- Only spawn the reconcile task when the hashi epoch advances (track last_guardian_reconcile_epoch).
- Hoist the initial GetGuardianInfo out of the loop; reuse the current_committee_epoch from each UpdateCommittee response.
- Switch the inflight-task guard to is_some() so a completed-but-unconsumed reconcile result is never dropped.
- Warn when the guardian's epoch runs ahead of hashi's, and bail when UpdateCommittee omits current_committee_epoch.
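The sparse-epoch fix can be illustrated with a small sketch, assuming the on-chain committees are held in a `BTreeMap` keyed by epoch. `Committee` and `next_committee` are hypothetical names for this example; only the `range((from_epoch + 1)..).next()` lookup is taken from the commit message.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for an on-chain committee record.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Committee {
    pub members: Vec<String>,
}

/// Return the first committee recorded strictly after `from_epoch`,
/// skipping any epochs that have no committee on chain.
pub fn next_committee(
    committees: &BTreeMap<u64, Committee>,
    from_epoch: u64,
) -> Option<(u64, &Committee)> {
    // checked_add avoids overflow when from_epoch is u64::MAX.
    let start = from_epoch.checked_add(1)?;
    committees
        .range(start..)
        .next()
        .map(|(epoch, committee)| (*epoch, committee))
}
```

With committees recorded only at epochs 3 and 7, a lookup keyed on `from_epoch + 1` would find nothing at epoch 4, while the range lookup returns the committee at epoch 7.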
Clearing last_guardian_reconcile_epoch when the reconcile task errors keeps the original behavior of retrying a failed handoff on the next checkpoint (e.g. transient guardian downtime), while a successful reconcile still holds the gate until the hashi epoch advances.
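One way to picture this gate is the sketch below. It assumes the gate is a single optional epoch; `ReconcileGate`, `should_spawn`, and `on_result` are hypothetical names, and only `last_guardian_reconcile_epoch` comes from the commit message.

```rust
/// Hypothetical gate deciding when the leader spawns a reconcile task.
#[derive(Debug, Default)]
pub struct ReconcileGate {
    last_guardian_reconcile_epoch: Option<u64>,
}

impl ReconcileGate {
    /// Returns true (and records the epoch) only when `hashi_epoch` is newer
    /// than the last epoch a reconcile was spawned for.
    pub fn should_spawn(&mut self, hashi_epoch: u64) -> bool {
        match self.last_guardian_reconcile_epoch {
            Some(last) if last >= hashi_epoch => false,
            _ => {
                self.last_guardian_reconcile_epoch = Some(hashi_epoch);
                true
            }
        }
    }

    /// On failure, clear the gate so the next checkpoint retries the handoff.
    /// On success, keep the gate closed until the hashi epoch advances.
    pub fn on_result(&mut self, result: &Result<(), String>) {
        if result.is_err() {
            self.last_guardian_reconcile_epoch = None;
        }
    }
}
```

In this sketch a transient guardian outage reopens the gate immediately, while a success holds it shut for the rest of the epoch.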
Addressed comments, and the changes were approved by luke. Dismissing this review to unblock the merge, since GitHub is blocking on it.
Automated update based on: New SignCommitteeTransition RPC added where peers sign with historical epoch BLS keys. Leader now drives guardian committee handoff to match on-chain epoch.
Builds on top of #569. The leader drives the guardian's committee forward to match the on-chain epoch each tick.
Changes
- `SignCommitteeTransition` RPC: each peer reconstructs the new committee from on-chain state at `from_epoch + 1` and signs with the historical `from_epoch` BLS key.
- `Hashi::sign_message_proto_at_epoch` for signing with a historical epoch's key rather than the current one.
- `hashi_guardian_current_committee_epoch` for the guardian's reported epoch.
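Selecting a historical key rather than the current one might be sketched as follows. `KeyStore`, `SigningKey`, and `key_at_epoch` are hypothetical; the sketch only assumes that keys are retained per epoch, as the `db.signing_keys` reference in the commit message suggests, and it does not perform any real BLS signing.

```rust
use std::collections::BTreeMap;

/// Placeholder for a per-epoch BLS signing key (no real key material here).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SigningKey(pub Vec<u8>);

/// Hypothetical key store that retains one signing key per epoch.
pub struct KeyStore {
    pub signing_keys: BTreeMap<u64, SigningKey>,
    pub current_epoch: u64,
}

impl KeyStore {
    /// Look up the key this node held at `epoch`, which may be older than
    /// `current_epoch`. Fails if the node was not in that epoch's committee
    /// or no longer retains the key.
    pub fn key_at_epoch(&self, epoch: u64) -> Result<&SigningKey, String> {
        self.signing_keys
            .get(&epoch)
            .ok_or_else(|| format!("no signing key retained for epoch {epoch}"))
    }
}
```

A node that has rotated out of the active committee can still answer for an epoch it served in, as long as the lookup for that epoch succeeds.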