CNClaim Finalize stuck + scale-in selects a mid-migration claim (dev freetier-01, confirmed by logs) #591

## Related

- Upstream issue: #589 (prod freetier-01, same symptoms, OPEN)
- Earlier issue: #581 (dev freetier-01, 2026-04-24, CLOSED)
- PR #587 fix: `eb9b356` (2026-04-26, deployed, partial fix)
- Environment: new-dev, namespace `freetier-01`
- Operator: `nightly-eb9b356-2026-04-26` (includes the #587 fix)
- Unit-agent: `main-20260429-e153fb2` (#284)
- Analysis doc: https://github.com/xzxiong/moi-core-handbooks/blob/main/docs/analysis/20260531-dev-freetier01-cnclaim-disorder.md

## Symptoms

On 2026-05-31, CNClaim state in new-dev freetier-01 was observed to be inconsistent:

### 1. CNClaim bound to a Pod that no longer exists

```
CNClaim: ws-bf2d347f-dedicated-dedicated-mxxhz
  spec.podName: cn-zkskr     ← Pod no longer exists
  status.phase: Bound        ← never transitioned to Lost
```

### 2. Pod label and CNClaim spec disagree (double reference)

```
Pod cn-n6zb9:
  label claimed-by = ws-bf2d347f-dedicated-dedicated-mxxhz   ← Pod label
  MIGRAT = y                                                  ← migration-requested

CNClaim sys-sys-hpjqv:
  spec.podName = cn-n6zb9    ← spec references the same Pod
  status.phase = Bound
```

## Timeline confirmed by logs

Reconstructed in full from DP Loki logs (unit-agent + matrixone-operator):

| Time (UTC) | Component | Event |
|---|---|---|
| 05-30 11:20:18 | operator | `start bind cn claim` → `CN bind success`, bound to **cn-rvghl** |
| 05-30 11:23:50 | unit-agent | consolidator triggers migration: `migrate CN` pod=cn-rvghl claim=mxxhz |
| 05-30 11:23:50 | operator | `migrate claimed pod` source=cn-rvghl **target=cn-n6zb9** |
| 05-30 11:24:06 | operator | `start draining source pod` pod=cn-rvghl |
| 05-30 11:24:06~25:xx | operator | migration in progress (source draining) |
| **05-30 11:26:04** | **operator** | **`sort claims to scale-in`: CNClaimSet replicas reduced, mxxhz selected for deletion** |
| **05-30 11:26:04** | **operator** | **`finalize CNClaim` name=mxxhz** |
| 05-30 11:26:04~27:08 | operator | **Finalize retried in a loop 60+ times (1-2 times per second) without completing** |
| 05-30 11:30:05~11:46+ | unit-agent | still retrying `migrate CN` pod=cn-n6zb9 claim=mxxhz (every 5s) |
| 05-31 14:xx | unit-agent | scaling-collector keeps reporting `Failed to get Pod cn-zkskr` not found |

## Root cause

**Core problem: CNClaimSet scale-in races with CNClaim migration**

### Bug 1 (primary): `Finalize()` gets stuck when called mid-migration

`Finalize()` logic (controller.go:340-371), in simplified pseudocode:
```go
func (r *Actor) Finalize(ctx) (bool, error) {
    ownedCNs = ListPods(phase=Bound, claimed-by=self)
    if len(ownedCNs) == 0 {
        return true, nil  // finalize is complete
    }
    claimIndex = buildPodClaimIndex(...)  // #587 fix
    for each owned pod:
        if holder, ok := claimIndex[pod.Name]; ok {
            skip  // #587: another claim's spec references this Pod, do not reclaim
        } else {
            reclaimCN(pod)
        }
    return false, nil  // ← owned CNs remain unprocessed
}
```

Scenario:
1. `ensureOwnership()` labels `cn-n6zb9` with `claimed-by=mxxhz` + `phase=Bound`
2. `buildPodClaimIndex` finds `sys-sys-hpjqv.spec.podName=cn-n6zb9` → **reclaim is correctly skipped**
3. But `ownedCNs` is non-empty → `Finalize()` returns `(false, nil)` → **it can never complete**

**Gap in the #587 fix**: it correctly prevents a wrongful reclaim, but Finalize has no way to conclude "this Pod is no longer mine" and let the finalizer be removed. The skipped Pod keeps its `claimed-by` label, so every retry lists it again and returns `(false, nil)`.

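### Bug 2: CNClaimSet scale-in does not exclude claims that are migrating

The `sort claims to scale-in` logic in `cnclaimset/controller.go` does not exclude claims with `spec.sourcePod != nil`. A claim can therefore be deleted halfway through its migration, which causes:
- a conflict between Finalize and the migration
- a leftover Pod label
- endless retries by the unit-agent migrator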

### Bug 3: `Sync()` does not clear `spec.podName` on Pod NotFound (controller.go:264-269)

```go
if apierrors.IsNotFound(err) {
    c.Status.Phase = v1alpha1.CNClaimPhaseLost
    return nil  // spec.podName is not cleared, so the claim stays stuck in Lost
}
```

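### Bug 4: `watchPodChange` does not fully map Pod delete events to claims

If the Pod's `claimed-by` label has already been changed or removed by the time the Pod is deleted (for example by a `reclaimCN()` call during migration), the delete event does not trigger a reconcile of the associated claim.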

## Proposed fixes

### Fix 1 (priority): after a skip, `Finalize()` clears the Pod label and completes correctly

```go
for i := range ownedCNs {
    cn := ownedCNs[i]
    if holder, ok := claimIndex[cn.Name]; ok {
        ctx.Log.Info("skip reclaim, pod claimed by other", "pod", cn.Name, "holder", holder)
        // FIX: remove our own claimed-by label (this claim is being deleted)
        if err := ctx.Patch(&cn, func() error {
            delete(cn.Labels, v1alpha1.PodClaimedByLabel)
            return nil
        }); err != nil {
            return false, err  // retry on the next reconcile instead of dropping the error
        }
        continue  // treated as handled
    }
    // ... reclaimCN
}
// FIX: once every ownedCN has been skipped or handled, return (true, nil) so the finalizer can be removed
return true, nil
```
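The control flow proposed in Fix 1 can be checked in isolation with a small in-memory model. This is only a sketch: `Pod`, `Claim`, and `finalize` are hypothetical stand-ins for the operator's real types, the label is reduced to a single string field, and the index is assumed to ignore the claim being finalized. In this model a real reclaim is treated as asynchronous, so `finalize` reports completion only once no Pod still carries the claim's label.

```go
package main

// Pod and Claim are simplified, hypothetical stand-ins for corev1.Pod and CNClaim.
type Pod struct {
	Name      string
	ClaimedBy string // models the claimed-by label; "" means the label is absent
}

type Claim struct {
	Name    string
	PodName string // models spec.podName
}

// finalize models the proposed Finalize for the claim named self.
// It returns true when the finalizer could be removed.
func finalize(self string, pods []*Pod, claims []Claim) bool {
	// Index Pods by the other claim whose spec references them (the #587 index).
	// Assumption: the claim being finalized is excluded from the index.
	holderOf := make(map[string]string)
	for _, c := range claims {
		if c.Name != self && c.PodName != "" {
			holderOf[c.PodName] = c.Name
		}
	}

	done := true
	for _, p := range pods {
		if p.ClaimedBy != self {
			continue // not labeled as ours
		}
		if _, held := holderOf[p.Name]; held {
			// Another claim's spec owns this Pod: drop only our label and move on.
			p.ClaimedBy = ""
			continue
		}
		// Stand-in for reclaimCN: release the Pod, then ask for one more pass
		// so the next reconcile can confirm nothing is left.
		p.ClaimedBy = ""
		done = false
	}
	return done
}
```

With the incident's data (`cn-n6zb9` labeled for `mxxhz` while `sys-sys-hpjqv` references it in its spec), the model finishes on the first pass and leaves no label behind.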

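### Fix 2: CNClaimSet scale-in excludes claims that are migrating

```go
func (r *Actor) scaleIn(ctx, oc, count) error {
    // Filter out claims that are currently migrating
    candidates := filterNonMigrating(oc.claims)
    sort(candidates)
    // Fewer safe candidates than requested: delete only those, to avoid slicing out of range
    if count > len(candidates) {
        count = len(candidates)
    }
    toDelete := candidates[:count]
    ...
}

// filterNonMigrating keeps only claims with no migration in flight (spec.sourcePod unset).
func filterNonMigrating(claims []CNClaim) []CNClaim {
    var result []CNClaim
    for _, c := range claims {
        if c.Spec.SourcePod == nil {
            result = append(result, c)
        }
    }
    return result
}
```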
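Filtering can leave fewer candidates than the requested scale-in count, so the selection needs a bound check. The sketch below shows that edge case with hypothetical types: `Claim` and `pickScaleInVictims` are not operator APIs, and sorting by name is only a placeholder for the operator's real ordering.

```go
package main

import "sort"

// Claim is a hypothetical, reduced view of a CNClaim.
type Claim struct {
	Name      string
	SourcePod *string // non-nil while a migration is in flight (models spec.sourcePod)
}

// pickScaleInVictims returns the names of at most count claims to delete,
// never selecting a claim that is migrating.
func pickScaleInVictims(claims []Claim, count int) []string {
	var candidates []string
	for _, c := range claims {
		if c.SourcePod == nil {
			candidates = append(candidates, c.Name)
		}
	}
	// Placeholder ordering; the real controller applies its own sort criteria.
	sort.Strings(candidates)

	// Clamp so that a short candidate list cannot cause an out-of-range slice.
	if count < 0 {
		count = 0
	}
	if count > len(candidates) {
		count = len(candidates)
	}
	return candidates[:count]
}
```

When the clamp takes effect, the set stays above its target replica count until the migration ends; a later reconcile can then finish the scale-in.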

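### Fix 3: `Sync()` clears `spec.podName` on Pod NotFound

```go
if apierrors.IsNotFound(err) {
    // The Pod may simply not be in the informer cache yet right after binding
    if c.Status.BoundTime != nil && time.Since(c.Status.BoundTime.Time) < waitCacheTimeout {
        return recon.ErrReSync("wait cache", waitCacheTimeout)
    }
    c.Status.Phase = v1alpha1.CNClaimPhaseLost
    // Clear the stale binding so the claim no longer points at a missing Pod
    if err := ctx.Patch(c, func() error {
        c.Spec.PodName = ""
        c.Spec.NodeName = ""
        return nil
    }); err != nil {
        return err
    }
    return nil
}
```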

### Fix 4: `watchPodChange` also matches on `spec.podName`

On a Pod delete event, additionally walk the CNClaim list and match on `spec.podName`, rather than relying on the Pod label alone.
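A minimal sketch of the mapping Fix 4 describes, using plain in-memory types. `Pod`, `Claim`, `claimsForDeletedPod`, and the `claimedByLabel` key are hypothetical names chosen for illustration; the operator's real label key and watch wiring may differ.

```go
package main

// claimedByLabel is a hypothetical label key used only in this sketch.
const claimedByLabel = "claimed-by"

// Pod and Claim are reduced stand-ins for corev1.Pod and CNClaim.
type Pod struct {
	Name   string
	Labels map[string]string
}

type Claim struct {
	Name    string
	PodName string // models spec.podName
}

// claimsForDeletedPod returns the claims to reconcile when a Pod is deleted:
// the claim named in the Pod label (if any) plus every claim whose spec
// still references the Pod. Each claim appears at most once.
func claimsForDeletedPod(deleted Pod, claims []Claim) []string {
	seen := make(map[string]bool)
	var out []string
	add := func(name string) {
		if name != "" && !seen[name] {
			seen[name] = true
			out = append(out, name)
		}
	}

	// Label-based association (current behavior as described in Bug 4).
	add(deleted.Labels[claimedByLabel])

	// Spec-based association (the addition proposed in Fix 4).
	for _, c := range claims {
		if c.PodName == deleted.Name {
			add(c.Name)
		}
	}
	return out
}
```

The linear scan keeps the sketch short. A production version would probably look claims up through an index on `spec.podName` rather than listing every claim on each delete event.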

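## How to reproduce

This can be built as a unit test:
1. Create a CNClaimSet with replicas=2
2. Create 2 claims (A, B); A is bound to pod-1, B is bound to pod-2
3. Trigger a migration of A (set `sourcePod=pod-1`, `podName=pod-2`); pod-2 is now referenced by both A and B
4. Set CNClaimSet replicas=1 to trigger a scale-in that selects A
5. Verify whether Finalize gets stuck
6. Verify whether the `claimed-by` label is left behind on pod-2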
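Before writing the test against the real controller, the expected failure can be sanity-checked with an in-memory model of steps 3 to 6. Everything here is a hypothetical simplification: `finalizeCurrent` mirrors the Finalize pseudocode from Bug 1 rather than the actual source, and the index is assumed to ignore the claim being finalized.

```go
package main

// Pod and Claim are reduced, hypothetical stand-ins for the real objects.
type Pod struct {
	Name      string
	ClaimedBy string // models the claimed-by label
}

type Claim struct {
	Name      string
	SourcePod string // models spec.sourcePod; non-empty while migrating
	PodName   string // models spec.podName
}

// finalizeCurrent models Finalize as described in Bug 1 (with #587, before Fix 1).
func finalizeCurrent(self string, pods []*Pod, claims []Claim) bool {
	var owned []*Pod
	for _, p := range pods {
		if p.ClaimedBy == self {
			owned = append(owned, p)
		}
	}
	if len(owned) == 0 {
		return true
	}
	for _, p := range owned {
		held := false
		for _, c := range claims {
			if c.Name != self && c.PodName == p.Name {
				held = true // another claim's spec references this Pod: skip
			}
		}
		if !held {
			p.ClaimedBy = "" // stand-in for reclaimCN
		}
	}
	return false // owned Pods were present on this pass
}

// reproduce sets up the state after step 3, deletes claim A (step 4), and
// retries Finalize up to maxRounds times. It reports whether Finalize ever
// completed and which label value remains on pod-2.
func reproduce(maxRounds int) (finished bool, leftover string) {
	pod1 := &Pod{Name: "pod-1"}
	pod2 := &Pod{Name: "pod-2", ClaimedBy: "A"} // relabeled as A's migration target
	pods := []*Pod{pod1, pod2}
	claims := []Claim{
		{Name: "A", SourcePod: "pod-1", PodName: "pod-2"},
		{Name: "B", PodName: "pod-2"},
	}
	for i := 0; i < maxRounds; i++ {
		if finalizeCurrent("A", pods, claims) {
			return true, pod2.ClaimedBy
		}
	}
	return false, pod2.ClaimedBy
}
```

In this model `reproduce` never reports completion, however many rounds it runs, and pod-2 keeps the label `A`. Those are the outcomes steps 5 and 6 check for.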

## Environment

- Operator: `nightly-eb9b356-2026-04-26` (34d, commit `eb9b356`, includes #587)
- Unit-agent: `main-20260429-e153fb2` (32d, commit `e153fb2`)
- Cluster: ack-unit-hz-new, namespace freetier-01
- Log source: DP Loki (`unit-cn-hangzhou-loki`)
