Skip to content

Commit b7e401f

Browse files
committed
Partially handled race conditions when reading the signature
Since the "signature" SST field is not actually monotonic (a later signature won't validate as an earlier signature), and signatures don't contain their version numbers, there is actually a race condition between a remote node updating "signature" and "signed_num" while the local node is trying to do handle_verify_request. Since the remote node writes in the order (1) signature (2) signed_num, while the local node reads in the order (1) signed_num (2) signature, the local node could read an earlier signed_num and then a later signature. This causes signature validation to fail because the local node retrieves the wrong signature from its log to compare (e.g. it retrieves signature 15, when the signature it got from the SST is the signature for version 17). This race condition can be mostly fixed by having the local node re-read the signed_num after reading the signature and compare it with the value it has cached. If signed_num changed while reading the signature, it means the read interleaved with the remote node's write and we need to try again. If signed_num has the same value twice, it's probably the correct value (matches the signature). It is still slightly possible to get an inconsistent read if the remote node overwrites the signature exactly as the local node is reading it (so it won't have changed signed_num when the local node reads it a second time), but the window for that race condition is much smaller.
1 parent 94094dd commit b7e401f

2 files changed

Lines changed: 22 additions & 9 deletions

File tree

src/core/git_version.cpp

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -13,8 +13,8 @@ namespace derecho {
1313
const int MAJOR_VERSION = 2;
1414
const int MINOR_VERSION = 4;
1515
const int PATCH_VERSION = 1;
16-
const int COMMITS_AHEAD_OF_VERSION = 9;
16+
const int COMMITS_AHEAD_OF_VERSION = 10;
1717
const char* VERSION_STRING = "2.4.1";
18-
const char* VERSION_STRING_PLUS_COMMITS = "2.4.1+9";
18+
const char* VERSION_STRING_PLUS_COMMITS = "2.4.1+10";
1919

2020
}

src/core/persistence_manager.cpp

Lines changed: 20 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -231,9 +231,8 @@ void PersistenceManager::handle_verify_request(subgroup_id_t subgroup_id, persis
231231
if(shard_member_rank == Vc.gmsSST->get_local_index()) {
232232
continue;
233233
}
234-
// The signature in the other node's "signatures" column corresponds to the version in its "signed_num" column
235-
const persistent::version_t other_signed_version = Vc.gmsSST->signed_num[shard_member_rank][subgroup_id];
236-
dbg_debug(persistence_logger, "PersistenceManager: Node {} has latest signed version {}, checking that version (verify request was for version {})", Vc.members[shard_member_rank], other_signed_version, version);
234+
// Read the other node's signed_num column, which should indicate what version has been signed
235+
persistent::version_t other_signed_version = Vc.gmsSST->signed_num[shard_member_rank][subgroup_id];
237236
// If this node hasn't finished signing that version yet, we won't be able to check that the other node's signature
238237
// matches the local signature. It also means the minimum signed version can't advance past my_signed_version anyway.
239238
if(other_signed_version > my_signed_version) {
@@ -245,11 +244,25 @@ void PersistenceManager::handle_verify_request(subgroup_id_t subgroup_id, persis
245244
dbg_debug(persistence_logger, "PersistenceManager: Skipping signature check for node {} because it has not signed any versions yet", Vc.members[shard_member_rank]);
246245
continue;
247246
}
248-
//Copy out the signature so it can't change during verification
247+
// Attempt to read the signature from the signature column, then check signed_num again
248+
// If the other node updated the signature while we were reading it, signed_num will change,
249+
// so we have to read both of them again
249250
std::vector<uint8_t> other_signature(signature_size);
250-
gmssst::set(other_signature.data(),
251-
&Vc.gmsSST->signatures[shard_member_rank][subgroup_id * signature_size],
252-
signature_size);
251+
bool consistent_read = false;
252+
while(!consistent_read) {
253+
gmssst::set(other_signature.data(),
254+
&Vc.gmsSST->signatures[shard_member_rank][subgroup_id * signature_size],
255+
signature_size);
256+
persistent::version_t new_signed_version = Vc.gmsSST->signed_num[shard_member_rank][subgroup_id];
257+
if(new_signed_version == other_signed_version) {
258+
consistent_read = true;
259+
} else {
260+
dbg_debug(persistence_logger, "PersistenceManager: Read signed_num {}, then {} from node {}, trying again", other_signed_version, new_signed_version, Vc.members[shard_member_rank]);
261+
other_signed_version = new_signed_version;
262+
}
263+
}
264+
dbg_debug(persistence_logger, "PersistenceManager: Got a consistent read of signed_num {} and signature from node {}", other_signed_version, Vc.members[shard_member_rank]);
265+
253266
// Retrieve this node's signature on that version
254267
std::vector<std::uint8_t> my_signature = subgroup_object->get_signature(other_signed_version);
255268
if(my_signature.size() == 0) {

0 commit comments

Comments
 (0)