fix: prevent CNClaim Finalize stuck and scale-in race during migration #592
Open
xzxiong wants to merge 1 commit into
Fixes 4 bugs that cause CNClaim Finalize to get stuck and scale-in to select migrating claims:

1. Finalize() stuck when all owned Pods are claimed by another CNClaim: now releases the claimed-by label and completes finalization
2. CNClaimSet scale-in selects claims mid-migration (spec.SourcePod != nil): now excludes migrating claims from scale-in candidates
3. Sync() Pod NotFound doesn't clear spec.PodName: the claim stays in Lost forever with a stale podName
4. watchPodChange only triggers reconcile via the Pod label: now also triggers for CNClaims referencing the pod via spec.podName

Closes #591

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
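Item 2 above (excluding migrating claims from scale-in candidates) can be illustrated with a minimal sketch. The types below are simplified stand-ins, not the operator's real API; `PodRef`, `isMigrating`, and `scaleInCandidates` are hypothetical names chosen for illustration. Only the `spec.SourcePod != nil` check is taken from the description.

```go
package main

// PodRef is a hypothetical stand-in for whatever type spec.SourcePod
// has in the real CNClaim API.
type PodRef struct {
	PodName  string
	NodeName string
}

// CNClaimSpec keeps only the fields needed for this illustration.
type CNClaimSpec struct {
	PodName   string
	SourcePod *PodRef // non-nil while a migration is in progress
}

// CNClaim is a simplified stand-in for the real CNClaim object.
type CNClaim struct {
	Name string
	Spec CNClaimSpec
}

// isMigrating reports whether the claim is mid-migration,
// using the spec.SourcePod != nil condition from the description.
func isMigrating(c CNClaim) bool {
	return c.Spec.SourcePod != nil
}

// scaleInCandidates returns the claims that scale-in may delete,
// skipping every claim that is currently migrating. Input order is kept,
// so any later sorting of candidates is unaffected.
func scaleInCandidates(claims []CNClaim) []CNClaim {
	out := make([]CNClaim, 0, len(claims))
	for _, c := range claims {
		if isMigrating(c) {
			continue
		}
		out = append(out, c)
	}
	return out
}
```

Filtering before candidate selection, rather than after, means a migrating claim can never be picked even when it would otherwise sort first for deletion.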
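Code Review (Self-Review)

Change overview

Fixes 4 CNClaim lifecycle bugs (Finalize getting stuck, the scale-in race, stale state after Pod NotFound, and the missing watch association) across 4 files in 2 controllers.

Strengths

Review findings

Suggestions (non-blocking)

Overall assessment

LGTM: the fix targets are clear, the logic is correct, and the tests cover the critical paths.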
Summary
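Fixes 4 bugs in which CNClaim Finalize gets stuck and CNClaimSet scale-in selects a claim that is mid-migration, resolving the claim mix-ups and infinite retries observed in the dev freetier-01 environment.

Related

Changes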
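Bug 1: Finalize() gets stuck (main cause)

Problem: When every owned Pod is referenced by another CNClaim (in the migration scenario, `ensureOwnership` has overwritten the label), `Finalize()` correctly skips reclaim but returns `(false, nil)`, so finalization can never complete.

Fix: When skipping, proactively clear this claim's own leftover `claimed-by` label so that the owning claim can manage the Pod normally; once all owned CNs have been processed, return `(true, nil)` to complete finalization.

Bug 2: Scale-in selects a claim that is mid-migration

Problem: `scaleIn()` does not exclude claims with `spec.SourcePod != nil`, so a claim can be deleted partway through a migration, which makes Finalize conflict with the migration.

Fix: Filter out migrating claims in `scaleIn()` so that they are not scale-in candidates.

Bug 3: Sync() does not clean up spec on Pod NotFound

Problem: When the Pod does not exist, only `status.phase=Lost` is set and `spec.podName` is not cleared, so the claim stays stuck in Lost forever and keeps a stale Pod reference.

Fix: On Pod NotFound, also clear `spec.PodName` and `spec.NodeName`.

Bug 4: watchPodChange association is incomplete

Problem: Pod deletion events are associated with a CNClaim only through the `claimed-by` label; if the label has already been cleared during migration or reclaim, the CNClaim is not reconciled.

Fix: `watchPodChange` uses the manager client to additionally look up CNClaims whose `spec.podName` matches, ensuring that every related claim is reconciled when the Pod is deleted.

Tests

- `Test_scaleIn_skipsMigratingClaims`: verifies that migrating claims are not selected as scale-in candidates
- `Test_containsRequest`: verifies the reconcile request deduplication helper
- `Test_sortClaimsToDelete` and `Test_buildPodClaimIndex` continue to pass

Checklist

- `go build ./pkg/controllers/cnclaim/ ./pkg/controllers/cnclaimset/`
- `go test ./pkg/controllers/cnclaimset/`
- (cannot be run locally because of the `-lmo` library; runs normally in the CI environment)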
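
The Bug 4 fix and the deduplication helper exercised by `Test_containsRequest` can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `Request`, `Pod`, and `Claim` are simplified stand-ins for the controller-runtime and operator types, the label key is a placeholder, and the real `containsRequest` signature may differ. Listing claims through the manager client is replaced here by a plain slice.

```go
package main

// Request is a stand-in for a reconcile request (namespace + name).
type Request struct {
	Namespace string
	Name      string
}

// Pod keeps only the fields the mapping needs.
type Pod struct {
	Namespace string
	Name      string
	Labels    map[string]string
}

// Claim is a simplified CNClaim; PodName mirrors spec.podName.
type Claim struct {
	Namespace string
	Name      string
	PodName   string
}

// claimedByLabel is a placeholder key; the real label key is not
// given in the description.
const claimedByLabel = "example.invalid/claimed-by"

// containsRequest reports whether reqs already holds r.
func containsRequest(reqs []Request, r Request) bool {
	for _, existing := range reqs {
		if existing == r {
			return true
		}
	}
	return false
}

// requestsForPod maps a Pod event to the claims that should reconcile:
// first the claim named by the claimed-by label (if any), then every
// claim in the same namespace whose spec.podName references the Pod.
// Duplicates are dropped so each claim is enqueued once.
func requestsForPod(pod Pod, claims []Claim) []Request {
	var reqs []Request
	if owner := pod.Labels[claimedByLabel]; owner != "" {
		reqs = append(reqs, Request{Namespace: pod.Namespace, Name: owner})
	}
	for _, c := range claims {
		if c.Namespace != pod.Namespace || c.PodName != pod.Name {
			continue
		}
		r := Request{Namespace: c.Namespace, Name: c.Name}
		if !containsRequest(reqs, r) {
			reqs = append(reqs, r)
		}
	}
	return reqs
}
```

With this shape, a Pod whose `claimed-by` label was already cleared still triggers reconciliation for any claim that references it by name, which is the gap the Bug 4 description identifies.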