Skip to content

fix: failed to keep s3 bucket retain#595

Merged
loveRhythm1990 merged 2 commits into
matrixorigin:mainfrom
loveRhythm1990:lr90/memoria
Jun 30, 2026
Merged

fix: failed to keep s3 bucket retain#595
loveRhythm1990 merged 2 commits into
matrixorigin:mainfrom
loveRhythm1990:lr90/memoria

Conversation

@loveRhythm1990

@loveRhythm1990 loveRhythm1990 commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

Fixes # #594

What this PR does / why we need it:

Not Available

Special notes for your reviewer:

Not Available

Additional documentation (e.g. design docs, usage docs, etc.):

Not Available

@qodo-code-review

Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes S3 data being deleted when pvcRetentionPolicy=Delete is used together with s3RetentionPolicy=Retain, by ensuring released buckets do not run the S3 cleanup job during finalization. It also adjusts LogSet retention-policy getters to better handle older objects where only one of the two policies is set.

Changes:

  • Skip S3 data deletion and directly remove the BucketDataFinalizer when a BucketClaim is in StatusReleased (retain behavior).
  • Update LogSetSpec retention-policy getters to implement PVC↔S3 inheritance when only one policy is present, defaulting to Delete when neither is set.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
pkg/controllers/bucketclaim/controller.go Avoids deleting S3 data when the bucket is already marked Released (retain semantics) by skipping the cleanup job and removing the finalizer.
api/core/v1alpha1/logset_types.go Updates retention-policy getters to infer missing PVC/S3 policies for backward compatibility with older resources.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/controllers/bucketclaim/controller.go
Comment thread api/core/v1alpha1/logset_types.go

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@xzxiong

xzxiong commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Code Review: fix: failed to keep s3 bucket retain

〇、总结(TL;DR)

本 PR 修复 pvcRetentionPolicy=Deletes3RetentionPolicy=Retain 组合下,已 Released 的 BucketClaim 被显式删除时仍触发 S3 cleanup job 导致数据被清理的问题;同时把 LogSet retention policy getter 改为无副作用读取并兼容旧对象单字段缺失场景。

一、PR 描述评审

设计文档状态

⚠️ 未找到设计文档 — PR body、关联 Issue、PR comments 均未包含完整方案说明。Issue #594 只描述了 TKE CI 环境里 S3 数据未被正常保留的现象和期望。

PR body 四维评审

  • 背景(Context):🟡 不足。Issue 说明了出问题的 policy 组合,但缺少资源生命周期、触发顺序和环境版本。
  • 原因(Why):🟡 不足。未说明为什么 StatusReleased 之后仍会触发 cleanup job,也没有给出影响范围。
  • 方案(How):🔴 缺失。PR body 未描述“Released 直接移除 finalizer,不再执行 S3 cleanup job”和 getter 兼容逻辑。
  • 结果(Expected Outcome):🟡 不足。Issue 只有“S3 数据能被正常保留”,建议补充具体验收路径:删除 LogSet 后 BucketClaim Released,再删除 BucketClaim,S3 object 仍存在。

二、方案评审

2.0 变更可视化

修改前(简化):

LogSet deletion
  └─ s3RetentionPolicy=Retain
       └─ BucketClaim.State = Released
            └─ user/e2e deletes BucketClaim
                 └─ BucketClaim Finalize
                      ├─ State != InUse
                      ├─ AnnAnyInstanceRunning == true
                      └─ create cleanup Job → aws s3 rm --recursive  ← 问题点

修改后:

LogSet deletion
  └─ s3RetentionPolicy=Retain
       └─ BucketClaim.State = Released
            └─ user/e2e deletes BucketClaim
                 └─ BucketClaim Finalize
                      └─ State == Released
                           └─ remove BucketDataFinalizer, no cleanup Job  ← 修改

Retention policy getter:

修改前:
GetPVC/GetS3 → setDefaultRetentionPolicy() → 可能写回 spec

修改后:
GetPVC/GetS3 → 优先返回显式字段 → 单字段缺失时继承另一侧 → 默认 Delete
             → 不修改 spec  ← 修改

2.1 方案合理性

总体合理pkg/controllers/logset/bucketclaim.gos3RetentionPolicy=Retain 时把 BucketClaim 置为 StatusReleased 并清空 BindTo;因此 BucketClaim controller 在 StatusReleased 删除路径上跳过 S3 cleanup job,符合“Retain 后显式删除 CR 只释放 Kubernetes 对象,不删除 S3 数据”的修复目标。

getter 兼容逻辑合理。Webhook/default helper 仍会在新对象上把缺失 policy 落到 spec,getter 改为只读后主要影响旧对象或绕过 webhook 的测试对象;新增 UT 覆盖了 nil、单字段、双字段组合,并断言 getter 不修改 spec。

🟡 测试语义需要同步更新。本 PR 改变了 Released 删除路径,已有两个 e2e “reclaim job success” 场景现在不会再触发 reclaim job,见 I-1。

2.2/2.3 不适用

本 PR 无架构重构或存量破坏性 API 变更;pvcRetentionPolicy / s3RetentionPolicy 字段未新增、删除或改名。

2.4 测试方案梳理

变更函数/方法 变更类型 覆盖判定 对应测试
LogSetSpec.GetPVCRetentionPolicy 行为变更 TestGetRetentionPolicy
LogSetSpec.GetS3RetentionPolicy 行为变更 TestGetRetentionPolicy
bucketclaim.Actor.Finalize 行为变更 ✅ / 🟡 TestActor_Finalize_ReleasedSkipsS3CleanupJob、新增 e2e;但旧 e2e 场景语义需调整

回归验证效率:

  • TestGetRetentionPolicy:本地 UT 精确覆盖 getter 组合和无副作用读取。
  • TestActor_Finalize_ReleasedSkipsS3CleanupJob:本地 UT 精确覆盖 Released 不创建 Job。
  • 🟡 新增 e2e:覆盖真实 S3 object 在 Released BucketClaim 删除后仍存在,验证价值高,但依赖 e2e 环境。

本地验证情况:

  • git diff --check origin/main...HEAD:通过。
  • go test -run TestGetRetentionPolicy -count=1 ./api/core/v1alpha1:未能执行,原因是本地模块校验失败:
    github.com/openkruise/kruise-api@v1.4.0/go.mod: checksum mismatch
  • go test -run TestActor_Finalize_ReleasedSkipsS3CleanupJob -count=1 ./pkg/controllers/bucketclaim:同上,受同一 checksum mismatch 阻塞。

三、变更概述

PR 在 BucketClaim finalizer 中新增 StatusReleased 快速路径:Released bucket 删除时直接移除 BucketDataFinalizer,不再创建 S3 cleanup Job。这样 s3RetentionPolicy=Retain 释放出的 BucketClaim 被显式删除时,S3 数据仍然保留。

PR 同时调整 LogSet 的 PVC/S3 retention policy getter:不再调用会写回 spec 的 setDefaultRetentionPolicy(),而是在读取时按“显式值优先、单字段缺失继承另一侧、默认 Delete”的逻辑返回 policy,避免 getter 产生副作用。

测试方面新增 getter UT、BucketClaim finalizer UT 和一个真实 MinIO e2e 回归场景,覆盖 #594 的关键路径。

四、代码审查(逐文件)


🟡 I-1test/e2e/bucket_test.go L392、L466

问题:Should reclaim job success when bucket not existShould reclaim job success when prefix not exist 这两个已有用例仍写着 “delete bucket, trigger reclaim job” / “bucket will been deleted only when job exits successfully”,但本 PR 在 pkg/controllers/bucketclaim/controller.go L75-L82 新增了 StatusReleased 直接移除 finalizer 的分支。两个用例前面都把 bucket 等到 StatusReleased,所以删除 BucketClaim 后不会再创建或等待 cleanup Job;现在它们只验证 Released CR 能被删除,已经不能覆盖 “bucket 不存在 / prefix 不存在时 cleanup job 成功”。

影响:回归测试名称和断言会误导后续维护者,以为缺 bucket / 缺 prefix 的 S3 cleanup job 成功路径仍有 e2e 保护;实际上相关 job 逻辑已不被这些用例触达。

建议:二选一处理:

  1. 如果 retain 语义下不再需要 Released 后 cleanup job,用例应改名并删除 “trigger reclaim job / job exits successfully” 语义,或直接移除这两个重复场景。
  2. 如果仍要保留 cleanup job 对 “bucket 不存在 / prefix 不存在” 的 e2e 覆盖,应让测试进入 StatusDeleting 路径,例如使用 s3RetentionPolicy=Delete 的 LogSet 删除流程,或显式构造一个带 BucketDataFinalizerAnnAnyInstanceRunning 且非 StatusReleased 的 BucketClaim,再断言 Job 被创建并完成。


🟡 I-2go.mod L163

问题:github.com/stretchr/testify 仍被 api/core/v1alpha1/logset_types_test.go L20 直接导入,但本 PR 把它从 direct require 移到了 indirect block。

影响:go.mod 的 direct/indirect 标记与源码实际导入关系不一致,后续 go mod tidy 或依赖维护会反复产生无关 diff。

建议:把 github.com/stretchr/testify v1.9.0 放回第一个 require block,或将 logset_types_test.go 也迁移到已有的 Gomega 断言风格后再保持 indirect。

4.5、API / 配置变更清单

类型 变更项 变更内容 向后兼容 备注
CRD 行为 LogSetSpec.GetPVCRetentionPolicy / GetS3RetentionPolicy getter 不再写回 spec,读取时兼容单字段缺失 字段 schema 未变
控制器行为 BucketClaim finalizer StatusReleased 删除时跳过 S3 cleanup Job 符合 s3RetentionPolicy=Retain 语义
依赖 github.com/stretchr/testify direct → indirect ⚠️ 与测试文件直接导入不一致,见 I-2

五、潜在风险检查

  • 并发安全:未引入共享状态变更;BucketClaim finalizer 仍以 Kubernetes object update 为边界。
  • 性能/成本StatusReleased 路径减少一次 Job/ConfigMap 创建和 S3 API 调用,成本降低。
  • 安全:未引入新的输入解析、权限或敏感信息输出。
  • 兼容性:getter 无副作用后,当前调用点只依赖返回值;新对象仍由 webhook/default helper 写默认值。
  • 测试风险:本地 go testopenkruise/kruise-api checksum mismatch 阻塞,建议以 CI 结果为准;另外需修正 I-1 中已失真的 e2e 场景。

@mergify

mergify Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Tick the box to add this pull request to the merge queue (same as @mergifyio queue).

  • Queue this pull request

@loveRhythm1990 loveRhythm1990 merged commit e6fb0df into matrixorigin:main Jun 30, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants