Fixed bugs in signature verification by etremel · Pull Request #288 · Derecho-Project/derecho

etremel · 2025-10-10T17:31:49Z

I found and fixed a bug in the way handle_verify_request read and verified signatures. Since the "signature" SST field is not actually monotonic (a later signature won't validate as an earlier signature), and signatures don't contain their version numbers, there is actually a race condition between a remote node updating signature and signed_num while the local node is trying to do handle_verify_request. Since the remote node writes in the order (1) signature (2) signed_num, while the local node reads in the order (1) signed_num (2) signature, the local node could read an earlier signed_num and then a later signature. This causes signature validation to fail because the local node retrieves the wrong signature from its log to compare (e.g. it retrieves signature 15, when the signature it got from the SST is the signature for version 17). This race condition can be mostly fixed by having the local node re-read the signed_num after reading the signature and compare it with the value it has cached. If signed_num changed while reading the signature, it means the read interleaved with the remote node's write and we need to try again.

Also, the new-view initialization of the SST leaves all of the delivery-counter fields (like persisted_num) uninitialized, assuming the first delivery predicate will update them, but that assumption does not hold for signed_num because handle_verify_request can read from a remote node's signed_num before that node has written to it. Initializing signed_num to its last value from the previous view is safer.

While testing this bug fix, I had to fix two other issues with our testing code: (1) The default values for some of our port numbers overlap with Linux's ephemeral port range, causing transient failures when a process finds that one of these ports is already "in use," and (2) Tests that use ViewManager::barrier_sync() in the main thread can cause a view-change to hang because the view change also relies on barrier_sync() and only one barrier_sync() can be in progress at a time.

Sometimes I get a "Warning: Signature did not match my signature!" message while running CascadeChain, even though everything is working fine. I think the problem is a race condition between one node updating its SST and another node reading the SST in handle_verify_request. The SST to_string method needs to be updated to display the signature- related fields so I can debug this more easily, and also needed to be fixed anyway to display the fields in the order they are declared.

The SST to_string should unpack joiner IP addresses with inet_ntoa, since the point of this string is to be human-readable for debugging. The signature race condition doesn't seem to show up in signed_log_test, just in CascadeChain, so maybe the problem is signed_log_test uses updates that are so small they complete quickly.

I discovered the source of at least one type of "Signature did not match" warning: The signed_num field in DerechoSST was initialized to 0, not -1, even though 0 is a valid version (the first version). This meant handle_verify_request would try to verify another node's signature on version 0 even if it hadn't yet put a signature in the SST signature field. In fact, handle_verify_request needs a special case to check if a peer node has not yet computed any signatures; it won't gracefully degrade because 0 is a real version number and -1 is invalid as an argument to subgroup_object->get_signature().

View changes keep getting stuck while trying to run signed_log_test.

Since the "signature" SST field is not actually monotonic (a later signature won't validate as an earlier signature), and signatures don't contain their version numbers, there is actually a race condition between a remote node updating "signature" and "signed_num" while the local node is trying to do handle_verify_request. Since the remote node writes in the order (1) signature (2) signed_num, while the local node reads in the order (1) signed_num (2) signature, the local node could read an earlier signed_num and then a later signature. This causes signature validation to fail because the local node retrieves the wrong signature from its log to compare (e.g. it retrieves signature 15, when the signature it got from the SST is the signature for version 17). This race condition can be mostly fixed by having the local node re-read the signed_num after reading the signature and compare it with the value it has cached. If signed_num changed while reading the signature, it means the read interleaved with the remote node's write and we need to try again. If signed_num has the same value twice, it's probably the correct value (matches the signature). It is still slightly possible to get an inconsistent read if the remote node overwrites the signature exactly as the local node is reading it (so it won't have changed signed_num when the local node reads it a second time), but the window for that race condition is much smaller.

Another contributor to the race condition in handle_verify_request is the fact that the new-view initialization of the SST leaves all of the delivery-counter fields (like persisted_num) at their default values, assuming the first delivery predicate will update them, but handle_verify_request can read from a remote node's signed_num before that node has written to it. It's safer to initialize signed_num and the corresponding signature to its correct value from the last view. verified_num is written before it's read, like persisted_num, so it doesn't need to be initialized. Separately, the default values for the Derecho ports cause problems when trying to test locally on Linux machines because they overlap with the Linux ephemeral port range (32768 and higher). Tests keep failing because a process accidentally connects to itself, or to another node's system-assigned outgoing port for a different connection. It's better to use lower values for these ports so they won't overlap with the ephemeral range, and these values are still high enough to be unused by any other service.

Using barrier_sync() in the main thread can "steal" the sync exchanges from a view-change's sync() and prevent either one from completing, which can happen if one node reaches the barrier_sync() while another one is still starting up and joining (causing a view change). It should work better to use a completely different socket and port for the final synchronization, like Cascade's perftest does.

KenBirman · 2025-10-10T17:35:53Z

Nice job catching this! Should we just move the Derecho port# address space to some other range?

etremel · 2025-10-21T20:16:50Z

Yes, I moved the default Derecho port numbers to the range 14480-14880 in commit ceeb679. This is large enough to still be in the unused port space but small enough to never overlap with the ephemeral port range. Of course, each other application that uses Derecho (like Cascade) will also need to update its own ports if it was using its own config files.

etremel added 7 commits July 17, 2025 11:10

Added debug logging to SST to find another bug

94094dd

View changes keep getting stuck while trying to run signed_log_test.

etremel merged commit c089ce6 into master Dec 19, 2025
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixed bugs in signature verification#288

Fixed bugs in signature verification#288
etremel merged 7 commits into
masterfrom
signature_sst_bug_fix

etremel commented Oct 10, 2025

Uh oh!

KenBirman commented Oct 10, 2025

Uh oh!

etremel commented Oct 21, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

etremel commented Oct 10, 2025

Uh oh!

KenBirman commented Oct 10, 2025

Uh oh!

etremel commented Oct 21, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants