Skip to content

Bump go.etcd.io/raft to v3.7.0-beta.0#21778

Merged
serathius merged 1 commit into
etcd-io:mainfrom
serathius:bump-raft
May 27, 2026
Merged

Bump go.etcd.io/raft to v3.7.0-beta.0#21778
serathius merged 1 commit into
etcd-io:mainfrom
serathius:bump-raft

Conversation

@serathius
Copy link
Copy Markdown
Member

@serathius serathius commented May 21, 2026

@codecov
Copy link
Copy Markdown

codecov Bot commented May 21, 2026

Codecov Report

❌ Patch coverage is 79.87013% with 62 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.83%. Comparing base (e8305de) to head (37f2e02).

⚠️ Current head 37f2e02 differs from pull request most recent head 4715a54

Please upload reports for the commit 4715a54 to get more accurate results.

Files with missing lines Patch % Lines
server/etcdserver/server.go 80.51% 8 Missing and 7 partials ⚠️
server/etcdserver/api/rafthttp/stream.go 50.00% 7 Missing ⚠️
server/etcdserver/bootstrap.go 76.00% 4 Missing and 2 partials ⚠️
server/etcdserver/api/rafthttp/msgappv2_codec.go 83.33% 1 Missing and 4 partials ⚠️
server/etcdserver/api/rafthttp/remote.go 0.00% 5 Missing ⚠️
server/storage/wal/wal.go 72.22% 5 Missing ⚠️
server/storage/storage.go 42.85% 3 Missing and 1 partial ⚠️
server/etcdserver/raft.go 89.65% 0 Missing and 3 partials ⚠️
server/storage/util.go 88.88% 2 Missing and 1 partial ⚠️
server/etcdserver/api/rafthttp/http.go 80.00% 1 Missing and 1 partial ⚠️
... and 6 more
Additional details and impacted files
Files with missing lines Coverage Δ
server/etcdserver/api/rafthttp/msg_codec.go 100.00% <100.00%> (ø)
server/etcdserver/api/rafthttp/transport.go 88.57% <100.00%> (ø)
server/etcdserver/api/snap/message.go 100.00% <100.00%> (ø)
server/etcdserver/snapshot_merge.go 78.37% <100.00%> (ø)
server/mock/mockstorage/storage_recorder.go 95.00% <100.00%> (ø)
server/storage/backend.go 80.00% <100.00%> (ø)
server/storage/hooks.go 100.00% <100.00%> (ø)
server/storage/wal/decoder.go 89.00% <100.00%> (ø)
server/storage/wal/testing/waltesting.go 60.86% <100.00%> (ø)
server/storage/wal/version.go 52.83% <100.00%> (ø)
... and 16 more

... and 24 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #21778      +/-   ##
==========================================
+ Coverage   69.71%   69.83%   +0.12%     
==========================================
  Files         449      449              
  Lines       38186    38184       -2     
==========================================
+ Hits        26621    26666      +45     
+ Misses      10137    10092      -45     
+ Partials     1428     1426       -2     

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e8305de...4715a54. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@liggitt
Copy link
Copy Markdown
Contributor

liggitt commented May 21, 2026

WARNING: DATA RACE

ruh-roh ... is this something similar to the race we got in #21261 (comment) where a naive shallow-copy a message is no longer sufficient when passing it as a pointer and hanging onto internal mutable fields?

Comment thread server/etcdserver/raft.go Outdated
@serathius
Copy link
Copy Markdown
Member Author

/retest

@serathius
Copy link
Copy Markdown
Member Author

And all tests are passing!

@serathius
Copy link
Copy Markdown
Member Author

Started testing with etcd-io/raft#435

@ahrtr
Copy link
Copy Markdown
Member

ahrtr commented May 22, 2026

Started testing with etcd-io/raft#435

The PR was just merged. Pls sync this PR. If the workflow is green, we can tag raft v3.7.0-beta.0

@serathius serathius force-pushed the bump-raft branch 7 times, most recently from ebf6767 to 7a0420c Compare May 22, 2026 22:03
if _, err := enc.w.Write(enc.uint64buf); err != nil {
return err
}
opts := proto.MarshalOptions{}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at proto.Marshal, empty options are default.

if n < msgAppV2BufSize {
b, err := proto.Marshal(m.Entries[i])
if size < msgAppV2BufSize {
b, err := opts.MarshalAppend(enc.buf[:0], m.Entries[i])
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@serathius
Copy link
Copy Markdown
Member Author

/retest

Copy link
Copy Markdown
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, with a couple of minor comments and questions.

Comment thread server/etcdserver/server.go Outdated
Comment thread server/etcdserver/server.go Outdated
@serathius
Copy link
Copy Markdown
Member Author

serathius commented May 27, 2026

@ahrtr addressed the comments by reverted the change to confstate. Waiting for official approval to squash the commits.

// the validation previously) to raft on bootstrap.
if shouldApplyV3 {
*confState = *s.r.ApplyConfChange(*cc)
*confState = *s.r.ApplyConfChange(cc)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
*confState = *s.r.ApplyConfChange(cc)
confState = s.r.ApplyConfChange(cc)

Copy link
Copy Markdown
Member Author

@serathius serathius May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This causes TestIssue2746 test to fail. Like @fuweid we need to have a callback to etcdProcess.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this cause copylock failures? if so, is that a real issue we think will cause races due to subsequent mutations?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that my previous comment #21778 (comment) isn't correct. I understood why passing ep now.

Comment thread server/etcdserver/server.go
Copy link
Copy Markdown
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one comment #21778 (comment)

@serathius
Copy link
Copy Markdown
Member Author

Addressed the comment and squashed commits. PTAL @fuweid

@serathius
Copy link
Copy Markdown
Member Author

/retest

@serathius
Copy link
Copy Markdown
Member Author

serathius commented May 27, 2026

After last change all integration tests are consistency failing with TestIssue2746 test.

    logger.go:146: 2026-05-27T13:14:20.705Z	WARN	m5.raft	2c44984ad3de1c13 attempted to restore snapshot but it is not in the ConfState auto_leave:false; should never happen	{"member": "m5"}
    logger.go:146: 2026-05-27T13:14:20.705Z	INFO	m5.raft	2c44984ad3de1c13 [commit: 0] ignored snapshot [index: 37, term: 2]	{"member": "m5"}

@serathius
Copy link
Copy Markdown
Member Author

After reverting #21778 (comment) test TestIssue2746 passes.

Copy link
Copy Markdown
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I think we have some follow-ups (non-blockers)

  1. https://github.com/etcd-io/raft/blob/67d129e88b8ff1acbfe8957fad48a127096a47f9/storage.go#L222

We should clone first and then try to mutate it if there is no metadata.
Because it could cause race condition if we don't copy snapshot in etcd side.

WARNING: DATA RACE
Write at 0x00c0028c64d0 by goroutine 326:
  go.etcd.io/raft/v3/raftpb.EnsureSnapshot()
      /home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/raftpb/util.go:54 +0x345
  go.etcd.io/raft/v3.(*MemoryStorage).ApplySnapshot()
      /home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/storage.go:222 +0xac
  go.etcd.io/etcd/server/v3/etcdserver.(*raftNode).start.func1()
      /home/niuhechaofan/workspace/etcd/server/etcdserver/raft.go:277 +0x12d5

Previous read at 0x00c0028c64d0 by goroutine 184:
  go.etcd.io/raft/v3/raftpb.(*Snapshot).GetMetadata()
      /home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/raftpb/raft.pb.go:518 +0x5d2
  go.etcd.io/raft/v3.IsEmptySnap()
      /home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/node.go:128 +0x5c0
  go.etcd.io/raft/v3.needStorageAppendRespMsg()
      /home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/rawnode.go:215 +0x5b0
  go.etcd.io/raft/v3.(*RawNode).acceptReady()
      /home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/rawnode.go:419 +0x3ba
  go.etcd.io/raft/v3.(*node).run()
      /home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/node.go:436 +0x1494
  go.etcd.io/raft/v3.StartNode.gowrap1()
      /home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/node.go:273 +0x2e
  1. We need refactor to avoid dereferencing pb (shadow-copy) of confState. etcdProgress should be pass-down to apply and we can make some callback.

@k8s-ci-robot
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, fuweid, serathius

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [ahrtr,fuweid,serathius]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment thread server/etcdserver/bootstrap.go Outdated
Comment thread server/etcdserver/raft.go
sentAppResp := false
var messages []*raftpb.Message
for i := len(ms) - 1; i >= 0; i-- {
m := proto.Clone(ms[i]).(*raftpb.Message) // Copy by value to avoid mutating the original message concurrently read by raft.
Copy link
Copy Markdown
Contributor

@liggitt liggitt May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we no longer need to copy before propagating to msgSnapC or returning?

Copy link
Copy Markdown
Member Author

@serathius serathius May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We added a defensive copy in previous PR. @fuweid suggested we don't need it.
Our testing didn't find issue with that at the moment, but this can be easily changed.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread server/etcdserver/raft.go
if m.Type == raftpb.MsgAppResp {
if m.GetType() == raftpb.MsgAppResp {
if sentAppResp {
m.To = 0
Copy link
Copy Markdown
Contributor

@liggitt liggitt May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previously we were mutating m.To and passing it into the msgSnapC channel even if we'd already done sentAppResp?

is it going to break something downstream if we don't send to msgSnapC and continue in this case?

edit: oh, never mind... the channel adding and observing was only for MsgSnap / MsgHeartbeat types

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking: if you want to make it more obvious this is not a logic change while dropping the m.To mutation, you could make a appendMessages := true above, set that to false anywhere we were setting m.To = 0, and check that instead of m.To != 0 when appending. adding a continue here and below is a little hard to follow when there's a lot more code that follows in the loop

Copy link
Copy Markdown
Member Author

@serathius serathius May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was also suggestion by @fuweid #21778 (comment)
It made sense to me, but didn't dig too deep. Depending on robustness/antithesis tests to validate.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I tested in my local and post that. It's fine. We can revisit that if we need to fix memory issue :)

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
@serathius
Copy link
Copy Markdown
Member Author

/retest

@codecov
Copy link
Copy Markdown

codecov Bot commented May 27, 2026

Codecov Report

❌ Patch coverage is 79.87013% with 62 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.84%. Comparing base (e8305de) to head (4715a54).

Files with missing lines Patch % Lines
server/etcdserver/server.go 80.51% 8 Missing and 7 partials ⚠️
server/etcdserver/api/rafthttp/stream.go 50.00% 7 Missing ⚠️
server/etcdserver/bootstrap.go 76.00% 4 Missing and 2 partials ⚠️
server/etcdserver/api/rafthttp/msgappv2_codec.go 83.33% 1 Missing and 4 partials ⚠️
server/etcdserver/api/rafthttp/remote.go 0.00% 5 Missing ⚠️
server/storage/wal/wal.go 72.22% 5 Missing ⚠️
server/storage/storage.go 42.85% 3 Missing and 1 partial ⚠️
server/etcdserver/raft.go 89.65% 0 Missing and 3 partials ⚠️
server/storage/util.go 88.88% 2 Missing and 1 partial ⚠️
server/etcdserver/api/rafthttp/http.go 80.00% 1 Missing and 1 partial ⚠️
... and 6 more
Additional details and impacted files
Files with missing lines Coverage Δ
server/etcdserver/api/rafthttp/msg_codec.go 88.88% <100.00%> (-11.12%) ⬇️
server/etcdserver/api/rafthttp/transport.go 88.57% <100.00%> (ø)
server/etcdserver/api/snap/message.go 100.00% <100.00%> (ø)
server/etcdserver/snapshot_merge.go 78.37% <100.00%> (ø)
server/mock/mockstorage/storage_recorder.go 95.00% <100.00%> (ø)
server/storage/backend.go 80.00% <100.00%> (ø)
server/storage/hooks.go 100.00% <100.00%> (ø)
server/storage/wal/decoder.go 89.00% <100.00%> (ø)
server/storage/wal/testing/waltesting.go 60.86% <100.00%> (ø)
server/storage/wal/version.go 52.83% <100.00%> (ø)
... and 16 more

... and 22 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #21778      +/-   ##
==========================================
+ Coverage   69.71%   69.84%   +0.12%     
==========================================
  Files         449      449              
  Lines       38186    38184       -2     
==========================================
+ Hits        26621    26668      +47     
+ Misses      10137    10093      -44     
+ Partials     1428     1423       -5     

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e8305de...4715a54. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@serathius
Copy link
Copy Markdown
Member Author

/retest

@ahrtr
Copy link
Copy Markdown
Member

ahrtr commented May 27, 2026

LGTM

I think we have some follow-ups (non-blockers)

  1. https://github.com/etcd-io/raft/blob/67d129e88b8ff1acbfe8957fad48a127096a47f9/storage.go#L222

We should clone first and then try to mutate it if there is no metadata. Because it could cause race condition if we don't copy snapshot in etcd side.

WARNING: DATA RACE
Write at 0x00c0028c64d0 by goroutine 326:
  go.etcd.io/raft/v3/raftpb.EnsureSnapshot()
      /home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/raftpb/util.go:54 +0x345
  go.etcd.io/raft/v3.(*MemoryStorage).ApplySnapshot()
      /home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/storage.go:222 +0xac
  go.etcd.io/etcd/server/v3/etcdserver.(*raftNode).start.func1()
      /home/niuhechaofan/workspace/etcd/server/etcdserver/raft.go:277 +0x12d5

Previous read at 0x00c0028c64d0 by goroutine 184:
  go.etcd.io/raft/v3/raftpb.(*Snapshot).GetMetadata()
      /home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/raftpb/raft.pb.go:518 +0x5d2
  go.etcd.io/raft/v3.IsEmptySnap()
      /home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/node.go:128 +0x5c0
  go.etcd.io/raft/v3.needStorageAppendRespMsg()
      /home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/rawnode.go:215 +0x5b0
  go.etcd.io/raft/v3.(*RawNode).acceptReady()
      /home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/rawnode.go:419 +0x3ba
  go.etcd.io/raft/v3.(*node).run()
      /home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/node.go:436 +0x1494
  go.etcd.io/raft/v3.StartNode.gowrap1()
      /home/niuhechaofan/go/pkg/mod/go.etcd.io/raft/v3@v3.7.0-beta.0/node.go:273 +0x2e
  1. We need refactor to avoid dereferencing pb (shadow-copy) of confState. etcdProgress should be pass-down to apply and we can make some callback.

Just raised etcd-io/raft#441

@serathius serathius merged commit e11f39c into etcd-io:main May 27, 2026
31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

6 participants