fix: prevent CNClaim Finalize stuck and scale-in race during migration #592

Open
xzxiong wants to merge 1 commit into main from worktree-claims

@xzxiong xzxiong commented May 31, 2026

Summary

Fixes four bugs in which CNClaim Finalize gets stuck and CNClaimSet scale-in selects claims that are mid-migration, resolving the claim mix-ups and endless retries observed in the dev freetier-01 environment.

Related

Changes

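Bug 1: Finalize() gets stuck (main cause)

Problem: when every owned Pod is referenced by another CNClaim (in the migration scenario, ensureOwnership has overwritten the label), Finalize() correctly skips the reclaim but returns (false, nil), so finalization can never complete.

Fix: when skipping, proactively remove this claim's own leftover claimed-by label so that the owning claim can manage the Pod normally; once all owned CNs have been processed, return (true, nil) to complete finalization.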
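The control flow described for Bug 1 can be sketched as follows. This is a minimal illustration, not the PR's code: the `Pod` type, the `claimedByLabel` key, the `referencedBy` map, and the `reclaim` callback are all hypothetical stand-ins for the real Kubernetes objects and ownership check.

```go
package main

import (
	"errors"
	"fmt"
)

// claimedByLabel is a hypothetical label key; the real key is not shown in this PR.
const claimedByLabel = "example.io/claimed-by"

// Pod is a simplified stand-in for a CN Pod: only name and labels are modeled.
type Pod struct {
	Name   string
	Labels map[string]string
}

// finalize walks the Pods owned by claimName.
// referencedBy maps a Pod name to the claim that currently references it
// (an assumption standing in for the real ownership check).
// reclaim is invoked only for Pods that still belong to claimName.
func finalize(claimName string, owned []*Pod, referencedBy map[string]string,
	reclaim func(*Pod) error) (bool, error) {
	for _, p := range owned {
		if other, ok := referencedBy[p.Name]; ok && other != claimName {
			// Another claim owns this Pod now: skip the reclaim, but drop
			// our own stale label if it is still present.
			if p.Labels[claimedByLabel] == claimName {
				delete(p.Labels, claimedByLabel)
			}
			continue
		}
		if err := reclaim(p); err != nil {
			return false, err // not done; retry on a later reconcile
		}
	}
	// Every owned Pod was either reclaimed or handed over, so finalization is done.
	return true, nil
}

func main() {
	handedOver := &Pod{Name: "cn-0", Labels: map[string]string{claimedByLabel: "claim-a"}}
	done, err := finalize("claim-a", []*Pod{handedOver},
		map[string]string{"cn-0": "claim-b"},
		func(*Pod) error { return errors.New("reclaim should not run for a handed-over Pod") })
	fmt.Println(done, err, handedOver.Labels)
}
```

The point of the sketch is the final `return true, nil`: skipping a Pod counts as having processed it, so a claim whose Pods were all taken over no longer blocks its own deletion.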

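Bug 2: Scale-in selects a claim that is mid-migration

Problem: scaleIn() does not exclude claims with spec.SourcePod != nil, so a claim can be deleted halfway through a migration, which puts Finalize in conflict with the migration.

Fix: filter migrating claims out in scaleIn() so that they are never scale-in candidates.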
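A rough sketch of the filtering step, under simplifying assumptions: `Claim` here is a hypothetical reduced type whose `SourcePod` field mirrors spec.SourcePod, and the candidate ordering done by the real controller is left out, so candidates are taken in input order.

```go
package main

import "fmt"

// Claim is a simplified stand-in for CNClaim. SourcePod mirrors spec.SourcePod
// and is non-nil while a migration is in progress.
type Claim struct {
	Name      string
	SourcePod *string
}

// scaleIn picks up to n claims to delete and returns the remaining ones.
// Migrating claims are never candidates; they are appended to the end of left.
// The real candidate ordering is omitted in this sketch.
func scaleIn(claims []Claim, n int) (toDelete, left []Claim) {
	var candidates, migrating []Claim
	for _, c := range claims {
		if c.SourcePod != nil {
			migrating = append(migrating, c)
			continue
		}
		candidates = append(candidates, c)
	}
	if n < 0 {
		n = 0
	}
	if n > len(candidates) {
		n = len(candidates) // never dip into migrating claims to reach n
	}
	toDelete = candidates[:n]
	left = append(append([]Claim{}, candidates[n:]...), migrating...)
	return toDelete, left
}

func main() {
	src := "cn-old"
	claims := []Claim{{Name: "a"}, {Name: "b", SourcePod: &src}, {Name: "c"}}
	toDelete, left := scaleIn(claims, 3)
	fmt.Println(len(toDelete), len(left))
}
```

One consequence worth noting: when fewer non-migrating claims exist than the requested count, the sketch deletes fewer claims rather than touching a migration in flight.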

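Bug 3: Sync() does not clear spec when the Pod is NotFound

Problem: when the Pod does not exist, only status.phase=Lost is set and spec.podName is not cleared, so the claim stays stuck in Lost forever and keeps a stale Pod reference.

Fix: on Pod NotFound, also clear spec.PodName and spec.NodeName.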
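The NotFound handling might look roughly like this. Everything below is an illustrative assumption: the reduced `Claim` type, the `errPodNotFound` sentinel (standing in for an API NotFound error), and the `getPod` and `patchSpec` callbacks. The sketch persists the cleared spec before setting the phase, which is this example's choice and not necessarily the order used in the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errPodNotFound stands in for the API server's NotFound error.
var errPodNotFound = errors.New("pod not found")

// Claim is a reduced stand-in for CNClaim with only the fields used here.
type Claim struct {
	Spec   struct{ PodName, NodeName string }
	Status struct{ Phase string }
}

// syncClaim checks the Pod bound to the claim. getPod and patchSpec are
// hypothetical callbacks for the Pod lookup and the spec patch.
func syncClaim(c *Claim, getPod func(name string) error, patchSpec func(*Claim) error) error {
	if c.Spec.PodName == "" {
		return nil // nothing bound, nothing to check
	}
	err := getPod(c.Spec.PodName)
	if err == nil {
		return nil
	}
	if !errors.Is(err, errPodNotFound) {
		return err // transient error: leave the claim untouched
	}
	// The Pod is gone: clear the stale references on a copy and persist them.
	updated := *c
	updated.Spec.PodName, updated.Spec.NodeName = "", ""
	if err := patchSpec(&updated); err != nil {
		return err
	}
	*c = updated
	c.Status.Phase = "Lost"
	return nil
}

func main() {
	c := &Claim{}
	c.Spec.PodName, c.Spec.NodeName = "cn-0", "node-1"
	err := syncClaim(c,
		func(string) error { return errPodNotFound },
		func(*Claim) error { return nil })
	fmt.Printf("%v %q %q %s\n", err, c.Spec.PodName, c.Spec.NodeName, c.Status.Phase)
}
```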

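Bug 4: watchPodChange association is incomplete

Problem: a Pod deletion event is associated with a CNClaim only through the claimed-by label; if the label has already been removed during migration or reclaim, the CNClaim is not reconciled.

Fix: watchPodChange uses the manager client to additionally look up CNClaims whose spec.podName matches, ensuring that every related claim is reconciled when a Pod is deleted.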
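To illustrate the two lookup paths and the de-duplication between them, here is a self-contained sketch. The `Request`, `Pod`, and `Claim` types and the `claimedByLabel` key are hypothetical simplifications; in the real controller the claim list would come from the manager client rather than a slice, and `requestsForPod` is an invented name for the mapping step.

```go
package main

import "fmt"

// claimedByLabel is a hypothetical label key used only in this sketch.
const claimedByLabel = "example.io/claimed-by"

// Request identifies a CNClaim to reconcile.
type Request struct{ Namespace, Name string }

// Pod and Claim are reduced stand-ins for the real objects.
type Pod struct {
	Namespace, Name string
	Labels          map[string]string
}

type Claim struct {
	Namespace, Name string
	PodName         string // mirrors spec.podName
}

// containsRequest reports whether r is already in reqs.
func containsRequest(reqs []Request, r Request) bool {
	for _, x := range reqs {
		if x == r {
			return true
		}
	}
	return false
}

// requestsForPod maps a Pod event to the claims that should be reconciled:
// first the claim named by the label, then any claim whose spec references
// the Pod, skipping duplicates.
func requestsForPod(pod Pod, claims []Claim) []Request {
	var reqs []Request
	if name := pod.Labels[claimedByLabel]; name != "" {
		reqs = append(reqs, Request{Namespace: pod.Namespace, Name: name})
	}
	for _, c := range claims {
		if c.Namespace != pod.Namespace || c.PodName != pod.Name {
			continue
		}
		r := Request{Namespace: c.Namespace, Name: c.Name}
		if !containsRequest(reqs, r) {
			reqs = append(reqs, r)
		}
	}
	return reqs
}

func main() {
	pod := Pod{Namespace: "ns", Name: "cn-0"} // label already removed
	claims := []Claim{{Namespace: "ns", Name: "a", PodName: "cn-0"}}
	fmt.Println(requestsForPod(pod, claims))
}
```

The `main` function shows the case this bug is about: the label is gone, yet the claim is still found through its spec reference.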

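Testing

  • Test_scaleIn_skipsMigratingClaims: verifies that migrating claims are not selected as scale-in candidates
  • Test_containsRequest: verifies the helper that de-duplicates reconcile requests
  • Existing Test_sortClaimsToDelete and Test_buildPodClaimIndex continue to pass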

Checklist

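  • Code compiles (go build ./pkg/controllers/cnclaim/ ./pkg/controllers/cnclaimset/)
  • go vet passes
  • Unit tests pass (go test ./pkg/controllers/cnclaimset/)
  • cnclaim tests compile (they cannot be run locally because the link step needs the -lmo library, which is missing from this environment; they run normally in CI)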

Fixes 4 bugs that cause CNClaim Finalize to get stuck and scale-in to
select migrating claims:

1. Finalize() stuck when all owned Pods are claimed by another CNClaim
   — now releases the claimed-by label and completes finalization
2. CNClaimSet scale-in selects claims mid-migration (spec.SourcePod != nil)
   — now excludes migrating claims from scale-in candidates
3. Sync() Pod NotFound doesn't clear spec.PodName — claim stays in Lost
   forever with stale podName
4. watchPodChange only triggers reconcile via Pod label — now also triggers
   for CNClaims referencing the pod via spec.podName

Closes #591

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

xzxiong commented May 31, 2026

Code Review (Self-Review)

Change overview

Fixes four CNClaim lifecycle bugs (Finalize stuck, scale-in race, stale state after Pod NotFound, missing watch association), touching 4 files across 2 controllers.

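Strengths

  1. Precisely targeted: each fix is independent and minimal, with no unnecessary refactoring
  2. The Finalize fix is complete: it not only returns (true, nil) but also proactively removes the leftover label, so no dirty data is left behind
  3. Scale-in protection: migrating claims take no part in scale-in ordering, which removes the race at its source
  4. watchPodChange enhancement: capturing the manager client in a closure follows the controller-runtime pattern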

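Review findings

| # | Severity | Location | Description |
|---|----------|----------|-------------|
| 1 | | controller.go:Sync | On the path that patches spec to clear it, c.Status.Phase = Lost is assigned first and the Patch comes afterwards. If the Patch fails, status has been modified in memory but not persisted. However, the status will be re-synced by the framework on a later reconcile, so the impact is negligible |
| 2 | | watchPodChangeFn | Every Pod event Lists the CNClaims of the whole namespace. In large clusters (1000+ claims) this is a performance risk. It can be optimized later with an indexer; at the current scale it is not a problem |
| 3 | | scaleIn | Appending the migrating slice to the end of left does not affect correctness, but semantically the migrating claims should keep their original order; the current implementation already satisfies this |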

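Suggestions (non-blocking)

  • Consider adding a field indexer (spec.podName → claim name) for watchPodChangeFn to avoid a full List. This can be a follow-up optimization PR
  • In Sync(), the status assignment and the spec Patch could be reordered to Patch first, then assign; this does not block merging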
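The indexer suggestion amounts to replacing a scan over all claims with a keyed lookup. A minimal in-memory sketch of that idea follows; `podKey`, `podNameIndex`, and `newPodNameIndex` are hypothetical names, and in the real controller the cache-backed field indexer from controller-runtime would presumably play this role instead of a hand-built map.

```go
package main

import "fmt"

// Claim is a reduced stand-in for CNClaim; PodName mirrors spec.podName.
type Claim struct {
	Namespace, Name string
	PodName         string
}

// podKey identifies a Pod within a namespace.
type podKey struct{ Namespace, PodName string }

// podNameIndex maps a Pod to the names of the claims whose spec references it.
type podNameIndex map[podKey][]string

// newPodNameIndex builds the index once, so that each Pod event costs a map
// lookup instead of a scan over every claim.
func newPodNameIndex(claims []Claim) podNameIndex {
	idx := make(podNameIndex)
	for _, c := range claims {
		if c.PodName == "" {
			continue // unbound claims are not indexed
		}
		k := podKey{Namespace: c.Namespace, PodName: c.PodName}
		idx[k] = append(idx[k], c.Name)
	}
	return idx
}

// claimsFor returns the claim names referencing the given Pod, if any.
func (idx podNameIndex) claimsFor(namespace, podName string) []string {
	return idx[podKey{Namespace: namespace, PodName: podName}]
}

func main() {
	idx := newPodNameIndex([]Claim{
		{Namespace: "ns", Name: "a", PodName: "cn-0"},
		{Namespace: "ns", Name: "b", PodName: "cn-0"},
		{Namespace: "ns", Name: "c"},
	})
	fmt.Println(idx.claimsFor("ns", "cn-0"))
}
```

A hand-built map like this would need to be kept in sync with claim updates, which is the bookkeeping a cache-level indexer takes care of.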

Overall assessment

LGTM: the goals of the fix are clear, the logic is correct, and the tests cover the critical paths.

Development

Successfully merging this pull request may close these issues.

CNClaim Finalize stuck + scale-in selects a migrating claim (dev freetier-01, confirmed by logs)