Skip to content

Add test for non-atomic vote persistence in handleRequestVoteRequest#1242

Open
Qian-Cheng-nju wants to merge 3 commits into
sofastack:masterfrom
specula-org:test/vote-persistence-bug
Open

Add test for non-atomic vote persistence in handleRequestVoteRequest#1242
Qian-Cheng-nju wants to merge 3 commits into
sofastack:masterfrom
specula-org:test/vote-persistence-bug

Conversation

@Qian-Cheng-nju
Copy link
Copy Markdown
Contributor

@Qian-Cheng-nju Qian-Cheng-nju commented Feb 16, 2026

Reproduction test for the issue described in #1241.

Summary by CodeRabbit

  • New Features
    • Added optional runtime tracing with NDJSON output for Raft events (votes, heartbeats, append entries, config changes, commit advancement).
  • Tests
    • Added tests reproducing vote-persistence/non-atomicity across restarts with simulated disk I/O failures and helpers to inject failures and build vote requests.
    • Added integration test exercising tracing in a 3-node cluster.
    • Added tests covering several pending bug scenarios (meta-file loss, apply exceptions, persistence crash).

@sofastack-cla
Copy link
Copy Markdown

sofastack-cla Bot commented Feb 16, 2026

Hi @Qian-Cheng-nju, welcome to SOFAStack community, Please sign Contributor License Agreement!

After you signed CLA, we will automatically sync the status of this pull request in 3 minutes.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Feb 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds an NDJSON Raft tracer and tracing hooks across core components, introduces tests reproducing non-atomic vote-persistence via an injectable disk-failure meta storage, and wires a Node reference into BallotBox options for tracing.

Changes

Cohort / File(s) Summary
Vote persistence test & storage injection
jraft-core/src/test/java/com/alipay/sofa/jraft/core/VotePersistenceBugTest.java
New test class reproducing non-atomic vote persistence. Adds DiskFailureMetaStorage (wraps LocalRaftMetaStorage, can induce I/O failures), DiskFailureServiceFactory for injection, buildVoteRequest helper, and two tests covering normal restart and I/O-failure scenarios.
Pending bug tests & crash simulators
jraft-core/src/test/java/com/alipay/sofa/jraft/core/PendingBugTest.java
New comprehensive tests exercising meta-file loss, apply-exception stalls, and persistence/crash timing via CrashingMetaStorage/CrashingServiceFactory. Contains helpers and multiple test cases reproducing pending bug scenarios.
Tracing utility and integration test
jraft-core/src/main/java/com/alipay/sofa/jraft/core/RaftTracer.java, jraft-core/src/test/java/com/alipay/sofa/jraft/core/RaftTraceTest.java
Adds RaftTracer NDJSON emitter (configurable via system props), ID mapping, synchronized writer, and many emit helpers; adds RaftTraceTest that runs a 3-node cluster with tracing enabled. Review file handling, concurrency, and test cleanup.
Node tracing hooks & state helper
jraft-core/src/main/java/com/alipay/sofa/jraft/core/NodeImpl.java
Inserts RaftTracer emit calls across request/response and state transitions (vote, appendEntries, config change, become candidate/leader) and adds traceStateJson() to capture node snapshot for traces. Review read-lock usage and context captured.
BallotBox wiring
jraft-core/src/main/java/com/alipay/sofa/jraft/core/BallotBox.java, jraft-core/src/main/java/com/alipay/sofa/jraft/option/BallotBoxOptions.java
Adds node field to BallotBoxOptions with getter/setter; BallotBox stores node during init and calls RaftTracer.logAdvanceCommitIndex(node, ...) when node is present. Review lifecycle and null-guarding.
Replicator tracing hooks
jraft-core/src/main/java/com/alipay/sofa/jraft/core/Replicator.java
Adds RaftTracer emits on heartbeat, appendEntries, and install-snapshot response paths without altering control flow. Verify no functional side effects and minimal overhead.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

size/L

Suggested reviewers

  • fengjiachun
  • SteNicholas

Poem

🐰 I nudged a file where stubborn bytes hide,
Made the disk grumble, then watched votes collide.
Traces spill stories in neat NDJSON lines,
Tests chase the bug through restarts and fines.
Hop—I've hopped off; the logs now sing fine.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.35% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately describes the primary change: adding a test for non-atomic vote persistence in handleRequestVoteRequest, which matches the main changeset focus.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@jraft-core/src/test/java/com/alipay/sofa/jraft/core/VotePersistenceBugTest.java`:
- Around line 235-241: The test VotePersistenceBugTest currently asserts the
buggy double-vote behavior by checking resp2.getGranted() with assertTrue;
update the test to mark this as a known-to-be-broken assertion and flip it to
assertFalse once issue `#1241` (atomic vote persistence) is fixed. Specifically,
locate the block using NodeImpl node2, method
handleRequestVoteRequest(buildVoteRequest(...)) producing resp2, and replace the
assertTrue(resp2.getGranted()) with assertFalse(resp2.getGranted()) or wrap it
with a clear FIXME comment (e.g., "// FIXME: flip to assertFalse after `#1241` is
fixed") so future maintainers know to change the assertion when the bug is
resolved. Ensure the message text referencing candidate1/candidate2 remains
informative after the flip.
🧹 Nitpick comments (3)
jraft-core/src/test/java/com/alipay/sofa/jraft/core/VotePersistenceBugTest.java (3)

47-67: LGTM on class setup and Javadoc.

The class-level Javadoc clearly explains the two-phase persistence problem being tested. The @Before setup is consistent with other JRaft tests.

Minor: the testName rule (line 57-58) is declared but never referenced in either test method. Consider removing it if not needed.

,


155-158: getMetaFilePath is unused — consider removing.

This helper is defined but never called in either test. It appears to have been superseded by the CrashPointMetaStorage.metaFilePath approach for reading on-disk state. Removing it avoids confusion.

🧹 Remove unused method
-    private String getMetaFilePath(final PeerId peer) {
-        return this.dataPath + File.separator + peer.getEndpoint().toString().replace(':', '_') + File.separator
-               + "meta" + File.separator + "raft_meta";
-    }

69-78: Teardown robustness: consider adding a try-finally around stopAll.

If c.stopAll() throws for one cluster, the remaining clusters in CLUSTERS won't be stopped, and FileUtils.deleteDirectory / NodeManager.clear() won't execute. A try-finally or try-catch inside the loop would improve test isolation.

Wrap cluster shutdown in try-catch
     `@After`
     public void teardown() throws Exception {
         if (!TestCluster.CLUSTERS.isEmpty()) {
             for (final TestCluster c : TestCluster.CLUSTERS.removeAll()) {
-                c.stopAll();
+                try {
+                    c.stopAll();
+                } catch (final Exception e) {
+                    // log and continue to ensure remaining cleanup executes
+                }
             }
         }
-        FileUtils.deleteDirectory(new File(this.dataPath));
-        NodeManager.getInstance().clear();
+        try {
+            FileUtils.deleteDirectory(new File(this.dataPath));
+        } finally {
+            NodeManager.getInstance().clear();
+        }
     }

Comment thread jraft-core/src/test/java/com/alipay/sofa/jraft/core/VotePersistenceBugTest.java Outdated
@sofastack-cla sofastack-cla Bot added cla:yes and removed cla:no labels Feb 16, 2026
@Qian-Cheng-nju Qian-Cheng-nju force-pushed the test/vote-persistence-bug branch from dbc4bf3 to 0b9a695 Compare February 16, 2026 12:41
@sofastack-cla sofastack-cla Bot added size/XL and removed size/L labels Mar 2, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (2)
jraft-core/src/main/java/com/alipay/sofa/jraft/core/RaftTracer.java (1)

170-176: Either use or remove newCommitIndex in commit-index tracing API.

logAdvanceCommitIndex(..., newCommitIndex) currently ignores the parameter, which is confusing for call sites.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jraft-core/src/main/java/com/alipay/sofa/jraft/core/RaftTracer.java` around
lines 170 - 176, logAdvanceCommitIndex currently ignores its newCommitIndex
parameter; either remove the parameter or include it in the trace output. Fix by
updating logAdvanceCommitIndex(NodeImpl, long) to pass newCommitIndex into the
tracing/logging call (e.g. include the value in the event details instead of
calling logNodeEvent("AdvanceCommitIndex", node) with no value), or if you
prefer the simpler API change, remove the newCommitIndex parameter and update
all callers; reference symbols: logAdvanceCommitIndex, newCommitIndex,
logNodeEvent, NodeImpl.
jraft-core/src/main/java/com/alipay/sofa/jraft/core/NodeImpl.java (1)

2659-2673: De-duplicate identical vote-response tracing call.

The same RaftTracer.logMsgEvent(...) appears in three branches; extracting once improves maintainability and keeps future payload changes consistent.

♻️ Proposed refactor
             if (response.getTerm() > this.currTerm) {
                 ...
-                RaftTracer.logMsgEvent("HandleRequestVoteResponse", this, RaftTracer.voteResponseMsg(peerId.toString(),
-                    this.serverId.toString(), response.getTerm(), response.getGranted()));
+                RaftTracer.logMsgEvent("HandleRequestVoteResponse", this, RaftTracer.voteResponseMsg(peerId.toString(),
+                    this.serverId.toString(), response.getTerm(), response.getGranted()));
                 return;
             }
             // check granted quorum?
             if (response.getGranted()) {
                 this.voteCtx.grant(peerId);
-                RaftTracer.logMsgEvent("HandleRequestVoteResponse", this, RaftTracer.voteResponseMsg(peerId.toString(),
-                    this.serverId.toString(), response.getTerm(), response.getGranted()));
                 if (this.voteCtx.isGranted()) {
                     becomeLeader();
                 }
-            } else {
-                RaftTracer.logMsgEvent("HandleRequestVoteResponse", this, RaftTracer.voteResponseMsg(peerId.toString(),
-                    this.serverId.toString(), response.getTerm(), response.getGranted()));
             }
+            RaftTracer.logMsgEvent("HandleRequestVoteResponse", this, RaftTracer.voteResponseMsg(peerId.toString(),
+                this.serverId.toString(), response.getTerm(), response.getGranted()));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jraft-core/src/main/java/com/alipay/sofa/jraft/core/NodeImpl.java` around
lines 2659 - 2673, The three identical RaftTracer.logMsgEvent calls in
NodeImpl.handleRequestVoteResponse should be consolidated: compute the
voteResponseMsg once into a local variable (e.g., msg =
RaftTracer.voteResponseMsg(peerId.toString(), this.serverId.toString(),
response.getTerm(), response.getGranted())) and invoke
RaftTracer.logMsgEvent("HandleRequestVoteResponse", this, msg) a single time
before the conditional branches that use response.getGranted(); then remove the
duplicate RaftTracer.logMsgEvent calls in the early-return branch and both
granted/else branches, leaving voteCtx.grant(peerId) and becomeLeader() logic
intact. Ensure references to voteCtx, becomeLeader(), peerId, response and
voteResponseMsg remain unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@jraft-core/src/main/java/com/alipay/sofa/jraft/core/RaftTracer.java`:
- Around line 108-113: shutdown() closes the PrintWriter but leaves the static
WRITER reference non-null so isEnabled() still returns true; modify shutdown()
(RaftTracer.shutdown) to close/flush the current WRITER and then set the static
WRITER field to null (ensure the field is mutable, e.g., not final) so
isEnabled() (which checks WRITER != null) will return false and tracing paths
stop executing against a closed stream.

In `@jraft-core/src/main/java/com/alipay/sofa/jraft/core/Replicator.java`:
- Around line 744-746: The trace call in Replicator uses response.getTerm()
without null-check, which will NPE on transport errors; update the
RaftTracer.logReplicatorEvent invocation in Replicator to handle a null response
from the InstallSnapshot RPC by checking if response == null and passing a safe
placeholder (e.g., -1 or "null") or by building the installSnapshotResponseMsg
with a nullable term so
installSnapshotResponseMsg(r.options.getPeerId().toString(),
r.options.getServerId().toString(), response == null ? /*safeTerm*/ :
response.getTerm(), success) is used; ensure you reference the same symbols
(Replicator, RaftTracer.logReplicatorEvent, installSnapshotResponseMsg,
response.getTerm(), r.options.getPeerId(), r.options.getServerId()) so the null
path is handled before calling getTerm().

In `@jraft-core/src/test/java/com/alipay/sofa/jraft/core/RaftTraceTest.java`:
- Line 68: In RaftTraceTest's teardown where FileUtils.deleteDirectory(new
File(this.dataPath)) is called, guard against null or uninitialized dataPath by
checking this.dataPath for null/empty (and optionally verifying the File
exists/isDirectory) before constructing the File and calling
FileUtils.deleteDirectory; if null, skip deletion and/or wrap the delete call in
a safe conditional or try/catch to avoid NullPointerException when setup failed.
- Line 104: The test currently calls latch.await() which can hang indefinitely;
replace this with a bounded wait like latch.await(5, TimeUnit.SECONDS) and
assert the boolean result so failures fail fast (e.g.
assertTrue(latch.await(...))). Update the import to include
java.util.concurrent.TimeUnit and adjust the test in RaftTraceTest where
latch.await() is used to use the timed wait and an assertion to fail the test if
the latch was not released in time.

---

Nitpick comments:
In `@jraft-core/src/main/java/com/alipay/sofa/jraft/core/NodeImpl.java`:
- Around line 2659-2673: The three identical RaftTracer.logMsgEvent calls in
NodeImpl.handleRequestVoteResponse should be consolidated: compute the
voteResponseMsg once into a local variable (e.g., msg =
RaftTracer.voteResponseMsg(peerId.toString(), this.serverId.toString(),
response.getTerm(), response.getGranted())) and invoke
RaftTracer.logMsgEvent("HandleRequestVoteResponse", this, msg) a single time
before the conditional branches that use response.getGranted(); then remove the
duplicate RaftTracer.logMsgEvent calls in the early-return branch and both
granted/else branches, leaving voteCtx.grant(peerId) and becomeLeader() logic
intact. Ensure references to voteCtx, becomeLeader(), peerId, response and
voteResponseMsg remain unchanged.

In `@jraft-core/src/main/java/com/alipay/sofa/jraft/core/RaftTracer.java`:
- Around line 170-176: logAdvanceCommitIndex currently ignores its
newCommitIndex parameter; either remove the parameter or include it in the trace
output. Fix by updating logAdvanceCommitIndex(NodeImpl, long) to pass
newCommitIndex into the tracing/logging call (e.g. include the value in the
event details instead of calling logNodeEvent("AdvanceCommitIndex", node) with
no value), or if you prefer the simpler API change, remove the newCommitIndex
parameter and update all callers; reference symbols: logAdvanceCommitIndex,
newCommitIndex, logNodeEvent, NodeImpl.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0b9a695 and a0022e3.

📒 Files selected for processing (6)
  • jraft-core/src/main/java/com/alipay/sofa/jraft/core/BallotBox.java
  • jraft-core/src/main/java/com/alipay/sofa/jraft/core/NodeImpl.java
  • jraft-core/src/main/java/com/alipay/sofa/jraft/core/RaftTracer.java
  • jraft-core/src/main/java/com/alipay/sofa/jraft/core/Replicator.java
  • jraft-core/src/main/java/com/alipay/sofa/jraft/option/BallotBoxOptions.java
  • jraft-core/src/test/java/com/alipay/sofa/jraft/core/RaftTraceTest.java

Comment on lines +108 to +113
public static void shutdown() {
final PrintWriter w = WRITER;
if (w != null) {
w.flush();
w.close();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Make shutdown actually disable tracing.

After shutdown(), WRITER is still non-null, so isEnabled() stays true and tracing paths keep executing against a closed stream.

💡 Proposed fix
     public static void shutdown() {
         final PrintWriter w = WRITER;
         if (w != null) {
             w.flush();
             w.close();
+            WRITER = null;
         }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jraft-core/src/main/java/com/alipay/sofa/jraft/core/RaftTracer.java` around
lines 108 - 113, shutdown() closes the PrintWriter but leaves the static WRITER
reference non-null so isEnabled() still returns true; modify shutdown()
(RaftTracer.shutdown) to close/flush the current WRITER and then set the static
WRITER field to null (ensure the field is mutable, e.g., not final) so
isEnabled() (which checks WRITER != null) will return false and tracing paths
stop executing against a closed stream.

Comment on lines +744 to +746
RaftTracer.logReplicatorEvent("HandleInstallSnapshotResponse", r,
RaftTracer.installSnapshotResponseMsg(r.options.getPeerId().toString(),
r.options.getServerId().toString(), response.getTerm(), success));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Prevent null dereference when tracing failed InstallSnapshot RPCs.

response can be null on transport errors, but response.getTerm() is called unconditionally.

🐛 Proposed fix
-        RaftTracer.logReplicatorEvent("HandleInstallSnapshotResponse", r,
-            RaftTracer.installSnapshotResponseMsg(r.options.getPeerId().toString(),
-                r.options.getServerId().toString(), response.getTerm(), success));
+        final long respTerm = response != null ? response.getTerm() : r.options.getTerm();
+        RaftTracer.logReplicatorEvent("HandleInstallSnapshotResponse", r,
+            RaftTracer.installSnapshotResponseMsg(r.options.getPeerId().toString(),
+                r.options.getServerId().toString(), respTerm, success));
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
RaftTracer.logReplicatorEvent("HandleInstallSnapshotResponse", r,
RaftTracer.installSnapshotResponseMsg(r.options.getPeerId().toString(),
r.options.getServerId().toString(), response.getTerm(), success));
final long respTerm = response != null ? response.getTerm() : r.options.getTerm();
RaftTracer.logReplicatorEvent("HandleInstallSnapshotResponse", r,
RaftTracer.installSnapshotResponseMsg(r.options.getPeerId().toString(),
r.options.getServerId().toString(), respTerm, success));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jraft-core/src/main/java/com/alipay/sofa/jraft/core/Replicator.java` around
lines 744 - 746, The trace call in Replicator uses response.getTerm() without
null-check, which will NPE on transport errors; update the
RaftTracer.logReplicatorEvent invocation in Replicator to handle a null response
from the InstallSnapshot RPC by checking if response == null and passing a safe
placeholder (e.g., -1 or "null") or by building the installSnapshotResponseMsg
with a nullable term so
installSnapshotResponseMsg(r.options.getPeerId().toString(),
r.options.getServerId().toString(), response == null ? /*safeTerm*/ :
response.getTerm(), success) is used; ensure you reference the same symbols
(Replicator, RaftTracer.logReplicatorEvent, installSnapshotResponseMsg,
response.getTerm(), r.options.getPeerId(), r.options.getServerId()) so the null
path is handled before calling getTerm().

}
}
RaftTracer.shutdown();
FileUtils.deleteDirectory(new File(this.dataPath));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Guard teardown path deletion when setup did not complete.

new File(this.dataPath) can throw when dataPath is null (e.g., setup failure before assignment).

💡 Proposed fix
-        FileUtils.deleteDirectory(new File(this.dataPath));
+        if (this.dataPath != null) {
+            FileUtils.deleteDirectory(new File(this.dataPath));
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jraft-core/src/test/java/com/alipay/sofa/jraft/core/RaftTraceTest.java` at
line 68, In RaftTraceTest's teardown where FileUtils.deleteDirectory(new
File(this.dataPath)) is called, guard against null or uninitialized dataPath by
checking this.dataPath for null/empty (and optionally verifying the File
exists/isDirectory) before constructing the File and calling
FileUtils.deleteDirectory; if null, skip deletion and/or wrap the delete call in
a safe conditional or try/catch to avoid NullPointerException when setup failed.

final Task task = new Task(data, new ExpectClosure(RaftError.SUCCESS, latch));
leader.apply(task);
}
latch.await();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Bound the latch wait to avoid hanging test jobs.

latch.await() without timeout can stall the suite indefinitely on failure paths, making CI non-recoverable.

💡 Proposed fix
+import java.util.concurrent.TimeUnit;
...
-        latch.await();
+        assertTrue("Timed out waiting for replicated tasks",
+            latch.await(30, TimeUnit.SECONDS));
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
latch.await();
assertTrue("Timed out waiting for replicated tasks",
latch.await(30, TimeUnit.SECONDS));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jraft-core/src/test/java/com/alipay/sofa/jraft/core/RaftTraceTest.java` at
line 104, The test currently calls latch.await() which can hang indefinitely;
replace this with a bounded wait like latch.await(5, TimeUnit.SECONDS) and
assert the boolean result so failures fail fast (e.g.
assertTrue(latch.await(...))). Update the import to include
java.util.concurrent.TimeUnit and adjust the test in RaftTraceTest where
latch.await() is used to use the timed wait and an assertion to fail the test if
the latch was not released in time.

@sofastack-cla sofastack-cla Bot added size/XXL and removed size/XL labels Mar 9, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@jraft-core/src/test/java/com/alipay/sofa/jraft/core/PendingBugTest.java`:
- Around line 485-513: The observer-term check is insufficient; instead assert a
direct signal that the observer actually processed node0's RequestVote (e.g., a
recorded vote or trace event). Update the test to access the observer as
NodeImpl and assert its voted-for state (use NodeImpl.getVotedFor() or add such
an accessor) or verify a RequestVote-handling trace/spy was recorded for node0's
id after electSelf()/requestVote(); replace the assertTrue(observerTermAfter >=
crashTerm) with an assertion asserting the observer's recorded votedFor equals
node0's id (or the equivalent explicit RPC-handled flag).
- Around line 153-180: The test never samples the restarted node's term before
heartbeats arrive; modify PendingBugTest to assert the node's startup term by
adding an injected hook into the node/meta storage initialization path:
instrument LocalRaftMetaStorage.load() (or the node start path used by
cluster.start(...)) to call a test callback or set a volatile field immediately
after load() returns but before heartbeats are processed, expose that value via
the test harness (e.g., TestCluster.getNode(...).getStartupTerm() or a
callback/latch), then in PendingBugTest assert the captured startup term (expect
0 for the reproduced bug) instead of inferring it from start() success and
waitLeader(); this verifies the transient term reset directly.
- Around line 255-259: The test creates a standalone RaftGroupService
(RaftGroupService raftGroupService = new RaftGroupService(...); node =
raftGroupService.start()) but never guarantees teardown if an assertion fails;
wrap the body that creates and starts rpcServer/raftGroupService/node in a
try/finally and in the finally block check for non-null raftGroupService and
call raftGroupService.shutdown(); followed by raftGroupService.join() so the
service is always stopped; ensure the same pattern is applied to the other
occurrence around lines 347-349.
- Around line 447-463: The test currently silently returns when leadership
transfer fails (leader == node0) which masks a setup failure; update
PendingBugTest so that after calling leader.transferLeadershipTo(peers.get(1))
you either retry deterministically (e.g., loop up to N attempts with small
sleeps and re-check leader via the same re-find loop) or explicitly fail/skip
the test instead of returning; modify the block around transferLeadershipTo(...)
/ the re-find loop to implement a bounded retry (or call an assertion like
fail(...) or a test assume to skip) so that a failed transfer does not produce a
false green test and the failure is visible.
- Around line 326-345: The test currently only asserts exceptionCount >= 1 and
that lastSuccessfulApply didn't advance, which can be satisfied by fail-fast
behavior; update the final assertions in PendingBugTest to assert the
stall-specific signals: require extraCompleted == false (extraLatch.await timed
out), extraApplied.get() == 0 (no extra entries applied), exceptionCount.get() >
1 (retries happened), and appliedAfterExtra == appliedAfterException
(lastSuccessfulApply unchanged). Use the existing variables extraCompleted,
extraApplied, exceptionCount,
lastSuccessfulApply/appliedAfterException/appliedAfterExtra to implement these
stricter checks so the test fails for fail-fast rejections but passes only when
the pipeline is truly stalled.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f47c9b00-0463-480f-899c-7bb364119ac8

📥 Commits

Reviewing files that changed from the base of the PR and between a0022e3 and 7e5ca3c.

📒 Files selected for processing (1)
  • jraft-core/src/test/java/com/alipay/sofa/jraft/core/PendingBugTest.java

Comment on lines +153 to +180
// Find the restarted node and check its term
// After restart with missing meta, the node should have started with term=0
// and then been updated by heartbeats from the leader.
// But the BUG is that it STARTED with term=0, which means there was a
// window where it could have voted in an already-decided term.
//
// We can't easily observe the transient term=0 state because heartbeats
// update it quickly. Instead, we verify the meta file was missing and
// the node still started successfully (return true from init), which
// proves load() returned true with term=0.
//
// To make this more observable, we read the meta file right after
// restart but before heartbeats arrive. But TestCluster doesn't give us
// that granularity. So we verify the structural condition:
// the node started successfully despite the meta file being absent.

// The fact that start() returned true (above) already proves the bug:
// LocalRaftMetaStorage.load() returned true for FileNotFoundException,
// initializing with term=0.

// Let's also verify that eventually the node catches up
// (proving it was running with stale state initially)
this.cluster.waitLeader();
assertNotNull(this.cluster.getLeader());

System.out.println("=== PB-2: Node restarted successfully with deleted meta file ===");
System.out.println("=== PB-2: BUG CONFIRMED — node started with term=0, no cross-validation ===");
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

This test never observes the term reset it claims to reproduce.

After restart, the only checked behavior is that cluster.start(...) succeeds and the cluster still has a leader. That matches the current LocalRaftMetaStorage.load() behavior in jraft-core/src/main/java/com/alipay/sofa/jraft/storage/impl/LocalRaftMetaStorage.java:89-104, but it never samples the victim's term before heartbeats repair it, so a fix that keeps startup successful while closing the unsafe window would still pass. Please assert the victim's startup term through an injected hook instead of inferring the bug from start().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jraft-core/src/test/java/com/alipay/sofa/jraft/core/PendingBugTest.java`
around lines 153 - 180, The test never samples the restarted node's term before
heartbeats arrive; modify PendingBugTest to assert the node's startup term by
adding an injected hook into the node/meta storage initialization path:
instrument LocalRaftMetaStorage.load() (or the node start path used by
cluster.start(...)) to call a test callback or set a volatile field immediately
after load() returns but before heartbeats are processed, expose that value via
the test harness (e.g., TestCluster.getNode(...).getStartupTerm() or a
callback/latch), then in PendingBugTest assert the captured startup term (expect
0 for the reproduced bug) instead of inferring it from start() success and
waitLeader(); this verifies the transient term reset directly.

Comment on lines +255 to +259
// Must create RPC server for the node to function (same as TestCluster)
final RpcServer rpcServer = RaftRpcServerFactory.createRaftRpcServer(peer.getEndpoint());
final RaftGroupService raftGroupService = new RaftGroupService("pb3-test", peer, nodeOptions, rpcServer);
final Node node = raftGroupService.start();
assertNotNull("Node should start", node);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Always tear down the standalone RaftGroupService.

@After only stops TestCluster instances. If any assertion above Line 348 fails, this node keeps running and can poison later tests. Wrap the test body in try/finally so shutdown()/join() always execute.

Also applies to: 347-349

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jraft-core/src/test/java/com/alipay/sofa/jraft/core/PendingBugTest.java`
around lines 255 - 259, The test creates a standalone RaftGroupService
(RaftGroupService raftGroupService = new RaftGroupService(...); node =
raftGroupService.start()) but never guarantees teardown if an assertion fails;
wrap the body that creates and starts rpcServer/raftGroupService/node in a
try/finally and in the finally block check for non-null raftGroupService and
call raftGroupService.shutdown(); followed by raftGroupService.join() so the
service is always stopped; ensure the same pattern is applied to the other
occurrence around lines 347-349.

Comment on lines +326 to +345
// Wait — but these should NOT complete because the pipeline is stalled
boolean extraCompleted = extraLatch.await(5, TimeUnit.SECONDS);

// The pipeline should be stalled — extra entries should not be applied
final long appliedAfterExtra = lastSuccessfulApply.get();
System.out.println("=== PB-3: Applied index after extra tasks: " + appliedAfterExtra + " ===");
System.out.println("=== PB-3: Extra tasks completed: " + extraCompleted + " ===");
System.out.println("=== PB-3: Extra tasks applied ok: " + extraApplied.get() + " ===");
System.out.println("=== PB-3: Total exception count: " + exceptionCount.get() + " ===");

if (appliedAfterExtra == appliedAfterException && exceptionCount.get() > 1) {
System.out.println("=== PB-3: BUG CONFIRMED — apply pipeline stalled, repeated exceptions ===");
} else if (appliedAfterExtra > appliedAfterException) {
System.out.println("=== PB-3: BUG NOT REPRODUCED — pipeline recovered (entries were applied) ===");
}

// Assert the bug: exception count should be > 1 (retried the same batch)
assertTrue("Exception should have fired at least once", exceptionCount.get() >= 1);
// The pipeline should be stalled (lastAppliedIndex unchanged)
assertEquals("Apply index should not advance after exception", appliedAfterException, appliedAfterExtra);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

The assertions don't distinguish a stall from fail-fast error handling.

You log extraCompleted, extraApplied, and exceptionCount > 1, but the test only asserts exceptionCount >= 1 and a frozen lastSuccessfulApply. A fix that rejects later apply() calls or surfaces the FSM error immediately would still pass. Please assert the stall-specific signal here instead.

Example of tightening the final assertions
-        assertTrue("Exception should have fired at least once", exceptionCount.get() >= 1);
+        assertTrue("The poison entry should be retried after the first failure", exceptionCount.get() > 1);
+        assertFalse("Later tasks should remain pending while the pipeline is stalled", extraCompleted);
+        assertFalse("Later tasks must not be applied successfully", extraApplied.get());
         assertEquals("Apply index should not advance after exception", appliedAfterException, appliedAfterExtra);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Wait — but these should NOT complete because the pipeline is stalled
boolean extraCompleted = extraLatch.await(5, TimeUnit.SECONDS);
// The pipeline should be stalled — extra entries should not be applied
final long appliedAfterExtra = lastSuccessfulApply.get();
System.out.println("=== PB-3: Applied index after extra tasks: " + appliedAfterExtra + " ===");
System.out.println("=== PB-3: Extra tasks completed: " + extraCompleted + " ===");
System.out.println("=== PB-3: Extra tasks applied ok: " + extraApplied.get() + " ===");
System.out.println("=== PB-3: Total exception count: " + exceptionCount.get() + " ===");
if (appliedAfterExtra == appliedAfterException && exceptionCount.get() > 1) {
System.out.println("=== PB-3: BUG CONFIRMED — apply pipeline stalled, repeated exceptions ===");
} else if (appliedAfterExtra > appliedAfterException) {
System.out.println("=== PB-3: BUG NOT REPRODUCED — pipeline recovered (entries were applied) ===");
}
// Assert the bug: exception count should be > 1 (retried the same batch)
assertTrue("Exception should have fired at least once", exceptionCount.get() >= 1);
// The pipeline should be stalled (lastAppliedIndex unchanged)
assertEquals("Apply index should not advance after exception", appliedAfterException, appliedAfterExtra);
// Wait — but these should NOT complete because the pipeline is stalled
boolean extraCompleted = extraLatch.await(5, TimeUnit.SECONDS);
// The pipeline should be stalled — extra entries should not be applied
final long appliedAfterExtra = lastSuccessfulApply.get();
System.out.println("=== PB-3: Applied index after extra tasks: " + appliedAfterExtra + " ===");
System.out.println("=== PB-3: Extra tasks completed: " + extraCompleted + " ===");
System.out.println("=== PB-3: Extra tasks applied ok: " + extraApplied.get() + " ===");
System.out.println("=== PB-3: Total exception count: " + exceptionCount.get() + " ===");
if (appliedAfterExtra == appliedAfterException && exceptionCount.get() > 1) {
System.out.println("=== PB-3: BUG CONFIRMED — apply pipeline stalled, repeated exceptions ===");
} else if (appliedAfterExtra > appliedAfterException) {
System.out.println("=== PB-3: BUG NOT REPRODUCED — pipeline recovered (entries were applied) ===");
}
// Assert the bug: exception count should be > 1 (retried the same batch)
assertTrue("The poison entry should be retried after the first failure", exceptionCount.get() > 1);
assertFalse("Later tasks should remain pending while the pipeline is stalled", extraCompleted);
assertFalse("Later tasks must not be applied successfully", extraApplied.get());
assertEquals("Apply index should not advance after exception", appliedAfterException, appliedAfterExtra);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jraft-core/src/test/java/com/alipay/sofa/jraft/core/PendingBugTest.java`
around lines 326 - 345, The test currently only asserts exceptionCount >= 1 and
that lastSuccessfulApply didn't advance, which can be satisfied by fail-fast
behavior; update the final assertions in PendingBugTest to assert the
stall-specific signals: require extraCompleted == false (extraLatch.await timed
out), extraApplied.get() == 0 (no extra entries applied), exceptionCount.get() >
1 (retries happened), and appliedAfterExtra == appliedAfterException
(lastSuccessfulApply unchanged). Use the existing variables extraCompleted,
extraApplied, exceptionCount,
lastSuccessfulApply/appliedAfterException/appliedAfterExtra to implement these
stricter checks so the test fails for fail-fast rejections but passes only when
the pipeline is truly stalled.

Comment on lines +447 to +463
if (leader == node0) {
// Transfer leadership to node1 first
final Status transferStatus = leader.transferLeadershipTo(peers.get(1));
Thread.sleep(2000);
// Re-find leader
leader = null;
for (final Node n : nodes) {
if (n.getLeaderId() != null && !n.getLeaderId().isEmpty()) {
if (n.getNodeId().getPeerId().equals(n.getLeaderId())) {
leader = n;
}
}
}
if (leader == node0) {
System.out.println("=== PB-1: Could not transfer leadership away from node0, skipping ===");
return;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't silently return when PB-1 setup fails.

Line 462 turns a failed leadership transfer into a green test. If node0 keeps leadership, this method never exercises the crash path at all. Please retry with a deterministic setup or fail/skip explicitly instead of returning.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jraft-core/src/test/java/com/alipay/sofa/jraft/core/PendingBugTest.java`
around lines 447 - 463, The test currently silently returns when leadership
transfer fails (leader == node0) which masks a setup failure; update
PendingBugTest so that after calling leader.transferLeadershipTo(peers.get(1))
you either retry deterministically (e.g., loop up to N attempts with small
sleeps and re-check leader via the same re-find loop) or explicitly fail/skip
the test instead of returning; modify the block around transferLeadershipTo(...)
/ the re-find loop to implement a bounded retry (or call an assertion like
fail(...) or a test assume to skip) so that a failed transfer does not produce a
false green test and the failure is visible.

Comment on lines +485 to +513
// Check the surviving non-leader, non-node0 node's term
// Find the surviving observer node (not node0, not the stopped leader)
Node observer = null;
for (int i = 0; i < nodes.size(); i++) {
if (i != 0 && i != leaderIdx) {
observer = nodes.get(i);
break;
}
}
assertNotNull("Should have an observer node", observer);

final long observerTermAfter = ((NodeImpl) observer).getCurrentTerm();
System.out
.println("=== PB-1: Observer term before=" + termBefore + ", after=" + observerTermAfter + " ===");
System.out.println("=== PB-1: Node0 tried to persist term=" + crashTerm + " ===");

if (observerTermAfter >= crashTerm) {
System.out.println("=== PB-1: BUG CONFIRMED — observer received RequestVote (term updated to "
+ observerTermAfter + ") before node0's persistence at term " + crashTerm + " ===");
} else {
System.out.println("=== PB-1: Observer term did not advance to crash term. "
+ "RPC may not have been delivered yet. ===");
}

// The bug assertion: the observer's term should have advanced to at least
// the crash term, proving the RequestVote RPC arrived before persistence.
assertTrue("Observer should have received RequestVote before persistence. " + "Observer term="
+ observerTermAfter + ", crash term=" + crashTerm, observerTermAfter >= crashTerm);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Observer term growth is an ambiguous proxy for RequestVote delivery.

After the original leader is shut down, the observer can time out and increment its own term without ever receiving node0's vote RPC. That makes observerTermAfter >= crashTerm insufficient to prove the ordering in NodeImpl.electSelf() (requestVote() before setTermAndVotedFor() at jraft-core/src/main/java/com/alipay/sofa/jraft/core/NodeImpl.java:1228-1231). Please assert a direct signal from the request-vote path on the observer instead, such as a trace event, an RPC spy, or recorded votedFor.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jraft-core/src/test/java/com/alipay/sofa/jraft/core/PendingBugTest.java`
around lines 485 - 513, The observer-term check is insufficient; instead assert
a direct signal that the observer actually processed node0's RequestVote (e.g.,
a recorded vote or trace event). Update the test to access the observer as
NodeImpl and assert its voted-for state (use NodeImpl.getVotedFor() or add such
an accessor) or verify a RequestVote-handling trace/spy was recorded for node0's
id after electSelf()/requestVote(); replace the assertTrue(observerTermAfter >=
crashTerm) with an assertion asserting the observer's recorded votedFor equals
node0's id (or the equivalent explicit RPC-handled flag).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant