Fixed bugs in signature verification#288
Merged
Merged
Conversation
Sometimes I get a "Warning: Signature did not match my signature!" message while running CascadeChain, even though everything is working fine. I think the problem is a race condition between one node updating its SST and another node reading the SST in handle_verify_request. The SST to_string method needs to be updated to display the signature- related fields so I can debug this more easily, and also needed to be fixed anyway to display the fields in the order they are declared.
The SST to_string should unpack joiner IP addresses with inet_ntoa, since the point of this string is to be human-readable for debugging. The signature race condition doesn't seem to show up in signed_log_test, just in CascadeChain, so maybe the problem is signed_log_test uses updates that are so small they complete quickly.
I discovered the source of at least one type of "Signature did not match" warning: The signed_num field in DerechoSST was initialized to 0, not -1, even though 0 is a valid version (the first version). This meant handle_verify_request would try to verify another node's signature on version 0 even if it hadn't yet put a signature in the SST signature field. In fact, handle_verify_request needs a special case to check if a peer node has not yet computed any signatures; it won't gracefully degrade because 0 is a real version number and -1 is invalid as an argument to subgroup_object->get_signature().
View changes keep getting stuck while trying to run signed_log_test.
Since the "signature" SST field is not actually monotonic (a later signature won't validate as an earlier signature), and signatures don't contain their version numbers, there is actually a race condition between a remote node updating "signature" and "signed_num" while the local node is trying to do handle_verify_request. Since the remote node writes in the order (1) signature (2) signed_num, while the local node reads in the order (1) signed_num (2) signature, the local node could read an earlier signed_num and then a later signature. This causes signature validation to fail because the local node retrieves the wrong signature from its log to compare (e.g. it retrieves signature 15, when the signature it got from the SST is the signature for version 17). This race condition can be mostly fixed by having the local node re-read the signed_num after reading the signature and compare it with the value it has cached. If signed_num changed while reading the signature, it means the read interleaved with the remote node's write and we need to try again. If signed_num has the same value twice, it's probably the correct value (matches the signature). It is still slightly possible to get an inconsistent read if the remote node overwrites the signature exactly as the local node is reading it (so it won't have changed signed_num when the local node reads it a second time), but the window for that race condition is much smaller.
Another contributor to the race condition in handle_verify_request is the fact that the new-view initialization of the SST leaves all of the delivery-counter fields (like persisted_num) at their default values, assuming the first delivery predicate will update them, but handle_verify_request can read from a remote node's signed_num before that node has written to it. It's safer to initialize signed_num and the corresponding signature to its correct value from the last view. verified_num is written before it's read, like persisted_num, so it doesn't need to be initialized. Separately, the default values for the Derecho ports cause problems when trying to test locally on Linux machines because they overlap with the Linux ephemeral port range (32768 and higher). Tests keep failing because a process accidentally connects to itself, or to another node's system-assigned outgoing port for a different connection. It's better to use lower values for these ports so they won't overlap with the ephemeral range, and these values are still high enough to be unused by any other service.
Using barrier_sync() in the main thread can "steal" the sync exchanges from a view-change's sync() and prevent either one from completing, which can happen if one node reaches the barrier_sync() while another one is still starting up and joining (causing a view change). It should work better to use a completely different socket and port for the final synchronization, like Cascade's perftest does.
Contributor
|
Nice job catching this! Should we just move the Derecho port# address space to some other range? |
Contributor
Author
|
Yes, I moved the default Derecho port numbers to the range 14480-14880 in commit ceeb679. This is large enough to still be in the unused port space but small enough to never overlap with the ephemeral port range. Of course, each other application that uses Derecho (like Cascade) will also need to update its own ports if it was using its own config files. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
I found and fixed a bug in the way handle_verify_request read and verified signatures. Since the "signature" SST field is not actually monotonic (a later signature won't validate as an earlier signature), and signatures don't contain their version numbers, there is actually a race condition between a remote node updating
signatureandsigned_numwhile the local node is trying to do handle_verify_request. Since the remote node writes in the order (1)signature(2)signed_num, while the local node reads in the order (1)signed_num(2)signature, the local node could read an earliersigned_numand then a later signature. This causes signature validation to fail because the local node retrieves the wrong signature from its log to compare (e.g. it retrieves signature 15, when the signature it got from the SST is the signature for version 17). This race condition can be mostly fixed by having the local node re-read thesigned_numafter reading the signature and compare it with the value it has cached. Ifsigned_numchanged while reading the signature, it means the read interleaved with the remote node's write and we need to try again.Also, the new-view initialization of the SST leaves all of the delivery-counter fields (like
persisted_num) uninitialized, assuming the first delivery predicate will update them, but that assumption does not hold forsigned_numbecausehandle_verify_requestcan read from a remote node's signed_num before that node has written to it. Initializingsigned_numto its last value from the previous view is safer.While testing this bug fix, I had to fix two other issues with our testing code: (1) The default values for some of our port numbers overlap with Linux's ephemeral port range, causing transient failures when a process finds that one of these ports is already "in use," and (2) Tests that use
ViewManager::barrier_sync()in the main thread can cause a view-change to hang because the view change also relies on barrier_sync() and only one barrier_sync() can be in progress at a time.