Skip to content

[AI-8th] fix: add null check and graceful degradation in AppendEntries#1260

Open
jervyclaw wants to merge 2 commits into
sofastack:masterfrom
jervyclaw:pr/fix/appendentries-null-check
Open

[AI-8th] fix: add null check and graceful degradation in AppendEntries#1260
jervyclaw wants to merge 2 commits into
sofastack:masterfrom
jervyclaw:pr/fix/appendentries-null-check

Conversation

@jervyclaw
Copy link
Copy Markdown

Problem

In the AppendEntries request handling path, getOrCreatePeerRequestContext() may return null when a node is removed by a concurrent shutdown. The previous code did not properly handle this case, potentially causing NPE or incorrect routing.

Solution

  1. When ctx is null in handleRequest(), return executor() to degrade gracefully to the default executor
  2. When node is null in getOrCreatePeerRequestContext(), return null so callers can fall back to the default executor path
  3. In getAndIncrementSequence(), return 0 when ctx is null to properly route to the non-pipeline path

This ensures graceful degradation during the shutdown race condition.

jervyclaw and others added 2 commits April 4, 2026 22:31
Replace assert(node != null) with a null check in getOrCreatePeerRequestContext()
to prevent AssertionError during concurrent node shutdown.

Race condition scenario:
1. select() checks node != null via NodeManager.getInstance().get()
2. Another thread removes the node (NodeManager.getInstance().remove())
3. getOrCreatePeerRequestContext() calls NodeManager.getInstance().get() again
   inside the synchronized block - this can now return null
4. The original assert(node != null) threw AssertionError

Fix:
- getOrCreatePeerRequestContext() returns null when node is null
- select() handles null by falling back to the default executor()
- getAndIncrementSequence() handles null by returning 0

Fixes: sofastack#1091
@shihuili1218 shihuili1218 linked an issue Apr 4, 2026 that may be closed by this pull request
@nobodyiam
Copy link
Copy Markdown
Member

感谢提交!

CI 全部通过,变更范围聚焦在 AppendEntriesRequestProcessor.java(+27/-4),逻辑清晰:

  • getOrCreatePeerRequestContext() 返回 null 时回退到默认 executor,避免 NPE
  • getAndIncrementSequence() ctx 为 null 时返回 0,正确路由到非 pipeline 路径

核心降级路径已覆盖,等待维护者评审合并。

注:本回复由 AI 自动生成并自动发送,用于初步分诊;如需进一步确认,维护者会继续跟进。

Copy link
Copy Markdown
Collaborator

@shihuili1218 shihuili1218 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

方向对了,修的是真问题,降级策略也合理。但执行上留了垃圾在后面,需要清理。

核心判断

✅ 值得做。Issue #1091 报告的竞态条件是真实的:select()getOrCreatePeerRequestContext() 之间存在 TOCTOU 窗口,并发 shutdown 移除 node 后 assert(node != null) 直接炸掉。用 null 检查替代 assert、降级到非 pipeline 路径是正确的修法。

需要修复的问题

1. getAndIncrementSequence(String, PeerPair, Connection) 变成了死代码

processRequest0 改后直接调用 ctx.getAndIncrementSequence(),不再经过这个私有方法:

// 旧代码(line 458)
reqSequence = getAndIncrementSequence(groupId, pair, done.getRpcCtx().getConnection());

// PR 新代码
reqSequence = ctx.getAndIncrementSequence();

但 PR 同时还修改了这个方法来处理 null:

private int getAndIncrementSequence(final String groupId, final PeerPair pair, final Connection conn) {
    final PeerRequestContext ctx = getOrCreatePeerRequestContext(groupId, pair, conn);
    return ctx != null ? ctx.getAndIncrementSequence() : 0;
}

这个方法现在没有任何调用方了。修了一段永远不会执行的代码,应该直接删除

2. 死代码上的注释还是错的

// In this case, fall back to returning 0 as the sequence, which will cause
// the request to be handled by the default executor path.

sequence 是响应排序用的序号,跟 executor 选择没有任何关系。executor 在 select() 阶段就已经决定了。但既然方法应该删掉,这个注释也就不存在了。

3. 注释过于啰嗦

每个 null check 配了 3-4 行注释,代码本身已经非常清晰。比如:

// The node could be null due to concurrent shutdown (another thread
// removes the node from NodeManager between the check in select()
// and this second lookup inside the lock). Gracefully degrade
// to returning null so callers can fall back to the default executor.
if (node == null) {
    return null;
}

一行就够了:

// Node removed by concurrent shutdown; caller will degrade to non-pipeline path.
if (node == null) {
    return null;
}

同理 processRequest0select() 里的注释也一样冗长。

4. 缺少测试

测试文件已经有完善的 mock 基础设施(mockNode()NodeManager 操作等)。加一个测试验证 "node 被移除后 getOrCreatePeerRequestContext 返回 null" 非常简单:

@Test
public void testGetOrCreatePeerRequestContextWhenNodeRemoved() {
    final PeerId peer = mockNode();
    final AppendEntriesRequestProcessor processor = (AppendEntriesRequestProcessor) newProcessor();
    final PeerPair pair = processor.pairOf(this.peerIdStr, this.serverId);

    // Remove node to simulate concurrent shutdown
    NodeManager.getInstance().remove(peer);

    final PeerRequestContext ctx = processor.getOrCreatePeerRequestContext(
        this.groupId, pair, this.conn);
    assertNull(ctx);
}

不写这个测试,就没有人能验证这个 fix 没被未来的重构破坏。

值得肯定的部分

  • processRequest0 中先获取 ctx,null 时直接 return service.handleAppendEntriesRequest(request, done) 走非 pipeline 路径——避免了用 SequenceRpcRequestClosure 包装后响应丢失的问题。思考到位。
  • select() 中 ctx 为 null 时返回 executor()(默认线程池),降级策略一致。
  • 没有过度修改——只改了必要的路径,没有动无关代码。

总结

删掉死代码方法,精简注释,加一个测试。改动量很小,修完就可以合了。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AssertionError in AppendEntriesRequestProcessor

3 participants