Skip to content

fix: prevent duplicate CNClaim binding to same CN Pod during migration#586

Closed
xzxiong wants to merge 4 commits intomatrixorigin:mainfrom
xzxiong:fix-dup-claims
Closed

fix: prevent duplicate CNClaim binding to same CN Pod during migration#586
xzxiong wants to merge 4 commits intomatrixorigin:mainfrom
xzxiong:fix-dup-claims

Conversation

@xzxiong
Copy link
Copy Markdown
Contributor

@xzxiong xzxiong commented Apr 25, 2026

变更说明

修复 CNClaim controller 在 CN Pod 迁移流程中的竞态条件:多个 CNClaim 可以同时绑定到同一个 CN Pod,删除其中一个 claim 后 Pod 被错误回收,破坏另一个 claim 的绑定关系。核心思路是以 CNClaim.spec.podName 为 Pod 归属的 source of truth,在 selectCN()Finalize() 中增加 CNClaim 列表校验。

  • 新增 podClaimedByOthers() helper:列出 namespace 下所有 CNClaim,检查是否有其他 claim 的 spec.podName 指向目标 Pod
  • selectCN() 绑定前校验:在 ensureOwnership() 前调用 podClaimedByOthers() 跳过已被占用的 Pod
  • Finalize() 回收前校验:回收 Pod 前检查是否有其他 CNClaim 仍引用该 Pod,有则跳过

核心变更

1. CNClaim Controller (pkg/controllers/cnclaim/controller.go)

  • selectCN(): 在遍历 idle Pod 列表时,调用 podClaimedByOthers() 过滤已被其他 claim 绑定的 Pod
  • Finalize(): 回收 Pod 前检查 podClaimedByOthers(),如果有其他 claim 引用则跳过回收
  • 新增 podClaimedByOthers() 函数:列出 namespace 下所有 CNClaim,检查 spec.podName 是否匹配

调用链

selectCN()
  → sortCNByPriority(idleCNs)
  → for each idle pod:
      → podClaimedByOthers(pod.Name, claim.Name)  ← NEW: 跳过已被占用的 Pod
      → ensureOwnership(pod)                       ← 已有: optimistic lock

Finalize()
  → ListPods(Bound + ClaimedBy=self)
  → for each owned pod:
      → podClaimedByOthers(pod.Name, claim.Name)  ← NEW: 跳过被其他 claim 引用的 Pod
      → reclaimCN(pod)                             ← 已有: 设置 Draining

新增文件

文件 说明
无新增文件 所有变更在现有文件中

变更类型

  • 新功能(向后兼容)
  • 缺陷修复(向后兼容)
  • 破坏性变更(需要迁移指南)

影响范围

测试验证

  • 单元测试已添加/更新
  • 集成测试已通过
  • 手动测试已完成
  • 性能测试已进行
  • Issue修复必填:已添加能够100%复现Issue的单元测试

UT 覆盖说明

podClaimedByOthers — 5 个 case(controller_test.go

场景 输入 / mock 行为 期望结果
no claims exist 空 CNClaim 列表 返回 false
pod only claimed by self 仅自身 claim 引用该 Pod 返回 false
pod claimed by another claim 另一个 claim 的 spec.podName 指向同一 Pod 返回 true
pod not claimed by anyone 其他 claim 指向不同 Pod 返回 false
claim with empty podName pending claim 无 podName 返回 false

回归验证(Bug Fix PR 专项)

Test_podClaimedByOthers/pod_claimed_by_another_claim 精确复现了 Issue #581 的竞态场景:两个 CNClaim 的 spec.podName 指向同一 Pod。修复前 selectCN() 不检查此条件会绑定成功;修复后返回 true 并跳过。

相关链接

xzxiong added 2 commits April 25, 2026 13:47
Add podClaimedByOthers() check to selectCN() and Finalize() to verify
no other CNClaim references a Pod via spec.podName before binding or
reclaiming it. This prevents the race condition where migration causes
Pod labels to become stale while spec.podName still holds the true
binding relationship.

Fixes matrixorigin#581, matrixorigin#582, matrixorigin#585
…laims

- Extract podClaimedByOthers loop into buildPodClaimIndex() that builds
  a map[podName]claimName once per reconcile, reducing N API calls to 1
- Skip CNClaims with DeletionTimestamp != nil to prevent Pod leak when
  both claims are being deleted simultaneously
- Add holder claim name to log messages for easier debugging
- Add test for deleting claim filtering and buildPodClaimIndex
@xzxiong
Copy link
Copy Markdown
Contributor Author

xzxiong commented Apr 25, 2026

Review 修复 (7217df0)

已根据 code review 反馈修复以下问题:

P1 性能优化

  • 将循环内 N 次 List(CNClaimList) 优化为循环外 1 次 buildPodClaimIndex(),构建 map[podName]claimName 索引,循环内 O(1) 查表

P2 正确性

  • podClaimedByOthers()buildPodClaimIndex() 均跳过 DeletionTimestamp != nil 的 CNClaim,防止两个 claim 同时删除时 Pod 泄漏

P3 可观测性

  • 日志增加 holder 字段,显示持有 Pod 的 CNClaim 名称

新增测试

  • Test_podClaimedByOthers/deleting_claim_is_ignored: 验证 DeletionTimestamp 过滤
  • Test_buildPodClaimIndex: 验证索引构建(排除 self、deleting、empty podName)

@xzxiong xzxiong force-pushed the fix-dup-claims branch 2 times, most recently from d65f56d to f13d295 Compare April 25, 2026 06:36
skywalking-eyes defaults to current year when validating copyright
headers, causing all .go files to fail when the year rolls over.
Set copyright-year to '2024-2030' to accept existing and future headers.
@xzxiong
Copy link
Copy Markdown
Contributor Author

xzxiong commented Apr 26, 2026

Closing: recreating with correct head (same-repo mode, not fork mode)

@xzxiong xzxiong closed this Apr 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant