Skip to content

Fixed bugs in signature verification#288

Merged
etremel merged 7 commits into
masterfrom
signature_sst_bug_fix
Dec 19, 2025
Merged

Fixed bugs in signature verification#288
etremel merged 7 commits into
masterfrom
signature_sst_bug_fix

Conversation

@etremel
Copy link
Copy Markdown
Contributor

@etremel etremel commented Oct 10, 2025

I found and fixed a bug in the way handle_verify_request read and verified signatures. Since the "signature" SST field is not actually monotonic (a later signature won't validate as an earlier signature), and signatures don't contain their version numbers, there is actually a race condition between a remote node updating signature and signed_num while the local node is trying to do handle_verify_request. Since the remote node writes in the order (1) signature (2) signed_num, while the local node reads in the order (1) signed_num (2) signature, the local node could read an earlier signed_num and then a later signature. This causes signature validation to fail because the local node retrieves the wrong signature from its log to compare (e.g. it retrieves signature 15, when the signature it got from the SST is the signature for version 17). This race condition can be mostly fixed by having the local node re-read the signed_num after reading the signature and compare it with the value it has cached. If signed_num changed while reading the signature, it means the read interleaved with the remote node's write and we need to try again.

Also, the new-view initialization of the SST leaves all of the delivery-counter fields (like persisted_num) uninitialized, assuming the first delivery predicate will update them, but that assumption does not hold for signed_num because handle_verify_request can read from a remote node's signed_num before that node has written to it. Initializing signed_num to its last value from the previous view is safer.

While testing this bug fix, I had to fix two other issues with our testing code: (1) The default values for some of our port numbers overlap with Linux's ephemeral port range, causing transient failures when a process finds that one of these ports is already "in use," and (2) Tests that use ViewManager::barrier_sync() in the main thread can cause a view-change to hang because the view change also relies on barrier_sync() and only one barrier_sync() can be in progress at a time.

etremel added 7 commits July 17, 2025 11:10
Sometimes I get a "Warning: Signature did not match my signature!"
message while running CascadeChain, even though everything is working
fine. I think the problem is a race condition between one node updating
its SST and another node reading the SST in handle_verify_request. The
SST to_string method needs to be updated to display the signature-
related fields so I can debug this more easily, and also needed to be
fixed anyway to display the fields in the order they are declared.
The SST to_string should unpack joiner IP addresses with inet_ntoa,
since the point of this string is to be human-readable for debugging.
The signature race condition doesn't seem to show up in signed_log_test,
just in CascadeChain, so maybe the problem is signed_log_test uses
updates that are so small they complete quickly.
I discovered the source of at least one type of "Signature did not
match" warning: The signed_num field in DerechoSST was initialized to 0,
not -1, even though 0 is a valid version (the first version). This meant
handle_verify_request would try to verify another node's signature on
version 0 even if it hadn't yet put a signature in the SST signature
field. In fact, handle_verify_request needs a special case to check if a
peer node has not yet computed any signatures; it won't gracefully
degrade because 0 is a real version number and -1 is invalid as an
argument to subgroup_object->get_signature().
View changes keep getting stuck while trying to run signed_log_test.
Since the "signature" SST field is not actually monotonic (a later
signature won't validate as an earlier signature), and signatures don't
contain their version numbers, there is actually a race condition
between a remote node updating "signature" and "signed_num" while the
local node is trying to do handle_verify_request. Since the remote node
writes in the order (1) signature (2) signed_num, while the local node
reads in the order (1) signed_num (2) signature, the local node could
read an earlier signed_num and then a later signature. This causes
signature validation to fail because the local node retrieves the wrong
signature from its log to compare (e.g. it retrieves signature 15, when
the signature it got from the SST is the signature for version 17).

This race condition can be mostly fixed by having the local node re-read
the signed_num after reading the signature and compare it with the value
it has cached. If signed_num changed while reading the signature, it
means the read interleaved with the remote node's write and we need to
try again. If signed_num has the same value twice, it's probably the
correct value (matches the signature). It is still slightly possible
to get an inconsistent read if the remote node overwrites the signature
exactly as the local node is reading it (so it won't have changed
signed_num when the local node reads it a second time), but the window
for that race condition is much smaller.
Another contributor to the race condition in handle_verify_request is
the fact that the new-view initialization of the SST leaves all of the
delivery-counter fields (like persisted_num) at their default values,
assuming the first delivery predicate will update them, but
handle_verify_request can read from a remote node's signed_num before
that node has written to it. It's safer to initialize signed_num and the
corresponding signature to its correct value from the last view.
verified_num is written before it's read, like persisted_num, so it
doesn't need to be initialized.

Separately, the default values for the Derecho ports cause problems when
trying to test locally on Linux machines because they overlap with the
Linux ephemeral port range (32768 and higher). Tests keep failing
because a process accidentally connects to itself, or to another node's
system-assigned outgoing port for a different connection. It's better to
use lower values for these ports so they won't overlap with the
ephemeral range, and these values are still high enough to be unused by
any other service.
Using barrier_sync() in the main thread can "steal" the sync exchanges
from a view-change's sync() and prevent either one from completing,
which can happen if one node reaches the barrier_sync() while another
one is still starting up and joining (causing a view change). It should
work better to use a completely different socket and port for the final
synchronization, like Cascade's perftest does.
@KenBirman
Copy link
Copy Markdown
Contributor

Nice job catching this! Should we just move the Derecho port# address space to some other range?

@etremel
Copy link
Copy Markdown
Contributor Author

etremel commented Oct 21, 2025

Yes, I moved the default Derecho port numbers to the range 14480-14880 in commit ceeb679. This is large enough to still be in the unused port space but small enough to never overlap with the ephemeral port range. Of course, each other application that uses Derecho (like Cascade) will also need to update its own ports if it was using its own config files.

@etremel etremel merged commit c089ce6 into master Dec 19, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants