Bump go.etcd.io/raft to v3.7.0-beta.0#21778
Conversation
Codecov Report❌ Patch coverage is Please upload reports for the commit 4715a54 to get more accurate results. Additional details and impacted files
... and 24 files with indirect coverage changes @@ Coverage Diff @@
## main #21778 +/- ##
==========================================
+ Coverage 69.71% 69.83% +0.12%
==========================================
Files 449 449
Lines 38186 38184 -2
==========================================
+ Hits 26621 26666 +45
+ Misses 10137 10092 -45
+ Partials 1428 1426 -2 Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
ruh-roh ... is this something similar to the race we got in #21261 (comment) where a naive shallow-copy a message is no longer sufficient when passing it as a pointer and hanging onto internal mutable fields? |
|
/retest |
|
And all tests are passing! |
|
Started testing with etcd-io/raft#435 |
The PR was just merged. Pls sync this PR. If the workflow is green, we can tag raft v3.7.0-beta.0 |
ebf6767 to
7a0420c
Compare
| if _, err := enc.w.Write(enc.uint64buf); err != nil { | ||
| return err | ||
| } | ||
| opts := proto.MarshalOptions{} |
There was a problem hiding this comment.
Looking at proto.Marshal, empty options are default.
| if n < msgAppV2BufSize { | ||
| b, err := proto.Marshal(m.Entries[i]) | ||
| if size < msgAppV2BufSize { | ||
| b, err := opts.MarshalAppend(enc.buf[:0], m.Entries[i]) |
There was a problem hiding this comment.
|
/retest |
ahrtr
left a comment
There was a problem hiding this comment.
Overall looks good, with a couple of minor comments and questions.
|
@ahrtr addressed the comments by reverted the change to confstate. Waiting for official approval to squash the commits. |
| // the validation previously) to raft on bootstrap. | ||
| if shouldApplyV3 { | ||
| *confState = *s.r.ApplyConfChange(*cc) | ||
| *confState = *s.r.ApplyConfChange(cc) |
There was a problem hiding this comment.
| *confState = *s.r.ApplyConfChange(cc) | |
| confState = s.r.ApplyConfChange(cc) |
There was a problem hiding this comment.
This causes TestIssue2746 test to fail. Like @fuweid we need to have a callback to etcdProcess.
There was a problem hiding this comment.
does this cause copylock failures? if so, is that a real issue we think will cause races due to subsequent mutations?
There was a problem hiding this comment.
I just realized that my previous comment #21778 (comment) isn't correct. I understood why passing ep now.
|
Addressed the comment and squashed commits. PTAL @fuweid |
|
/retest |
|
After last change all integration tests are consistency failing with TestIssue2746 test. |
|
After reverting #21778 (comment) test TestIssue2746 passes. |
fuweid
left a comment
There was a problem hiding this comment.
LGTM
I think we have some follow-ups (non-blockers)
We should clone first and then try to mutate it if there is no metadata.
Because it could cause race condition if we don't copy snapshot in etcd side.
WARNING: DATA RACE
Write at 0x00c0028c64d0 by goroutine 326:
go.etcd.io/raft/v3/raftpb.EnsureSnapshot()
/home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/raftpb/util.go:54 +0x345
go.etcd.io/raft/v3.(*MemoryStorage).ApplySnapshot()
/home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/storage.go:222 +0xac
go.etcd.io/etcd/server/v3/etcdserver.(*raftNode).start.func1()
/home/niuhechaofan/workspace/etcd/server/etcdserver/raft.go:277 +0x12d5
Previous read at 0x00c0028c64d0 by goroutine 184:
go.etcd.io/raft/v3/raftpb.(*Snapshot).GetMetadata()
/home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/raftpb/raft.pb.go:518 +0x5d2
go.etcd.io/raft/v3.IsEmptySnap()
/home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/node.go:128 +0x5c0
go.etcd.io/raft/v3.needStorageAppendRespMsg()
/home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/rawnode.go:215 +0x5b0
go.etcd.io/raft/v3.(*RawNode).acceptReady()
/home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/rawnode.go:419 +0x3ba
go.etcd.io/raft/v3.(*node).run()
/home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/node.go:436 +0x1494
go.etcd.io/raft/v3.StartNode.gowrap1()
/home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/node.go:273 +0x2e
- We need refactor to avoid dereferencing pb (shadow-copy) of confState.
etcdProgressshould be pass-down to apply and we can make some callback.
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ahrtr, fuweid, serathius The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
| sentAppResp := false | ||
| var messages []*raftpb.Message | ||
| for i := len(ms) - 1; i >= 0; i-- { | ||
| m := proto.Clone(ms[i]).(*raftpb.Message) // Copy by value to avoid mutating the original message concurrently read by raft. |
There was a problem hiding this comment.
we no longer need to copy before propagating to msgSnapC or returning?
There was a problem hiding this comment.
We added a defensive copy in previous PR. @fuweid suggested we don't need it.
Our testing didn't find issue with that at the moment, but this can be easily changed.
| if m.Type == raftpb.MsgAppResp { | ||
| if m.GetType() == raftpb.MsgAppResp { | ||
| if sentAppResp { | ||
| m.To = 0 |
There was a problem hiding this comment.
previously we were mutating m.To and passing it into the msgSnapC channel even if we'd already done sentAppResp?
is it going to break something downstream if we don't send to msgSnapC and continue in this case?
edit: oh, never mind... the channel adding and observing was only for MsgSnap / MsgHeartbeat types
There was a problem hiding this comment.
non-blocking: if you want to make it more obvious this is not a logic change while dropping the m.To mutation, you could make a appendMessages := true above, set that to false anywhere we were setting m.To = 0, and check that instead of m.To != 0 when appending. adding a continue here and below is a little hard to follow when there's a lot more code that follows in the loop
There was a problem hiding this comment.
This was also suggestion by @fuweid #21778 (comment)
It made sense to me, but didn't dig too deep. Depending on robustness/antithesis tests to validate.
There was a problem hiding this comment.
Yeah. I tested in my local and post that. It's fine. We can revisit that if we need to fix memory issue :)
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
|
/retest |
|
/retest |
Just raised etcd-io/raft#441 |
/cc @fuweid @Jefftree