[AI-8th] fix: add null check and graceful degradation in AppendEntries#1260
[AI-8th] fix: add null check and graceful degradation in AppendEntries#1260jervyclaw wants to merge 2 commits into
Conversation
Replace assert(node != null) with a null check in getOrCreatePeerRequestContext() to prevent AssertionError during concurrent node shutdown. Race condition scenario: 1. select() checks node != null via NodeManager.getInstance().get() 2. Another thread removes the node (NodeManager.getInstance().remove()) 3. getOrCreatePeerRequestContext() calls NodeManager.getInstance().get() again inside the synchronized block - this can now return null 4. The original assert(node != null) threw AssertionError Fix: - getOrCreatePeerRequestContext() returns null when node is null - select() handles null by falling back to the default executor() - getAndIncrementSequence() handles null by returning 0 Fixes: sofastack#1091
|
感谢提交! CI 全部通过,变更范围聚焦在
核心降级路径已覆盖,等待维护者评审合并。 注:本回复由 AI 自动生成并自动发送,用于初步分诊;如需进一步确认,维护者会继续跟进。 |
shihuili1218
left a comment
There was a problem hiding this comment.
Review
方向对了,修的是真问题,降级策略也合理。但执行上留了垃圾在后面,需要清理。
核心判断
✅ 值得做。Issue #1091 报告的竞态条件是真实的:select() 和 getOrCreatePeerRequestContext() 之间存在 TOCTOU 窗口,并发 shutdown 移除 node 后 assert(node != null) 直接炸掉。用 null 检查替代 assert、降级到非 pipeline 路径是正确的修法。
需要修复的问题
1. getAndIncrementSequence(String, PeerPair, Connection) 变成了死代码
processRequest0 改后直接调用 ctx.getAndIncrementSequence(),不再经过这个私有方法:
// 旧代码(line 458)
reqSequence = getAndIncrementSequence(groupId, pair, done.getRpcCtx().getConnection());
// PR 新代码
reqSequence = ctx.getAndIncrementSequence();但 PR 同时还修改了这个方法来处理 null:
private int getAndIncrementSequence(final String groupId, final PeerPair pair, final Connection conn) {
final PeerRequestContext ctx = getOrCreatePeerRequestContext(groupId, pair, conn);
return ctx != null ? ctx.getAndIncrementSequence() : 0;
}这个方法现在没有任何调用方了。修了一段永远不会执行的代码,应该直接删除。
2. 死代码上的注释还是错的
// In this case, fall back to returning 0 as the sequence, which will cause
// the request to be handled by the default executor path.sequence 是响应排序用的序号,跟 executor 选择没有任何关系。executor 在 select() 阶段就已经决定了。但既然方法应该删掉,这个注释也就不存在了。
3. 注释过于啰嗦
每个 null check 配了 3-4 行注释,代码本身已经非常清晰。比如:
// The node could be null due to concurrent shutdown (another thread
// removes the node from NodeManager between the check in select()
// and this second lookup inside the lock). Gracefully degrade
// to returning null so callers can fall back to the default executor.
if (node == null) {
return null;
}一行就够了:
// Node removed by concurrent shutdown; caller will degrade to non-pipeline path.
if (node == null) {
return null;
}同理 processRequest0 和 select() 里的注释也一样冗长。
4. 缺少测试
测试文件已经有完善的 mock 基础设施(mockNode()、NodeManager 操作等)。加一个测试验证 "node 被移除后 getOrCreatePeerRequestContext 返回 null" 非常简单:
@Test
public void testGetOrCreatePeerRequestContextWhenNodeRemoved() {
final PeerId peer = mockNode();
final AppendEntriesRequestProcessor processor = (AppendEntriesRequestProcessor) newProcessor();
final PeerPair pair = processor.pairOf(this.peerIdStr, this.serverId);
// Remove node to simulate concurrent shutdown
NodeManager.getInstance().remove(peer);
final PeerRequestContext ctx = processor.getOrCreatePeerRequestContext(
this.groupId, pair, this.conn);
assertNull(ctx);
}不写这个测试,就没有人能验证这个 fix 没被未来的重构破坏。
值得肯定的部分
processRequest0中先获取 ctx,null 时直接return service.handleAppendEntriesRequest(request, done)走非 pipeline 路径——避免了用SequenceRpcRequestClosure包装后响应丢失的问题。思考到位。select()中 ctx 为 null 时返回executor()(默认线程池),降级策略一致。- 没有过度修改——只改了必要的路径,没有动无关代码。
总结
删掉死代码方法,精简注释,加一个测试。改动量很小,修完就可以合了。
Problem
In the AppendEntries request handling path, getOrCreatePeerRequestContext() may return null when a node is removed by a concurrent shutdown. The previous code did not properly handle this case, potentially causing NPE or incorrect routing.
Solution
This ensures graceful degradation during the shutdown race condition.