Add backup lifecycle management with GFS rotation to pbm#1293
Add backup lifecycle management with GFS rotation to pbm#1293daniel-percona wants to merge 7 commits intopercona:devfrom
Conversation
|
Hey @daniel-percona, thank you for contributing with this interesting feature. We’re going to discuss it during our next refinement session first, and then we'll proceed with PR review. Thanks again! Also, please create a JIRA ticket for this feature when you have a moment. Much appreciated! |
Thank you, happy to be able to contribute. I'll create the jira, I have some minor changes and tweaks to make and once I'm done I'll create the jira accordingly. |
|
@boris-ilijic , JIRA created and link to docs submitted via the jira issue https://perconadev.atlassian.net/browse/PBM-1731
|
|
Thanks for the ticket. Please pull latest |
…licies Summary of changes: - Extended the communication layer to allow PBM configuration profiles to carry nested 'lifecycle' configuration blocks. - Fixed a bug in the lifecycle evaluation engine where the 'minKeep' safety threshold was not correctly rescuing the last successful backups from the purge list. - Updated the engine to respect 'Enabled: false' states instantly, preventing accidental purges when the policy is disabled. Key Technical Modifications: - pbm/ctrl/cmd.go: Added Lifecycle struct to ProfileCmd payload. - pbm/ctrl/send.go: Updated SendAddConfigProfile to accept lifecycle data. - sdk/impl.go: Modified AddConfigProfile to pass lifecycle pointers from the parsed YAML configuration. - pbm-agent/profile.go: Updated handleAddConfigProfile to map received lifecycle metadata back into the MongoDB profile documents. - pbm/lifecycle/manager.go: Re-engineered Evaluate() to correctly enforce MinKeep logic and handle disabled policies via early exit. - pbm/lifecycle/manager_test.go: Updated unit tests to reflect and verify the corrected safety behavior for in-progress and last-base backups. Validated in a sharded lab environment with physical and logical storage profiles, verifying that retention rules are applied correctly per-profile.
…he best Physical backup for every single week and month in the schedule
61bc3b7 to
8840688
Compare
@boris-ilijic Done. I've rebased the branch onto dev and cleaned up the commit history. It should be ready for another look, thank you for your help. |
There was a problem hiding this comment.
Pull request overview
This PR adds Backup Lifecycle Management to PBM, introducing a GFS-style retention/rotation evaluator and a new pbm lifecycle CLI command to plan and (optionally) delete expired backups, with support for config profiles.
Changes:
- Introduces
pbm/lifecyclewith anEvaluate()planner and corresponding unit tests. - Extends config/profile plumbing to carry lifecycle settings (SDK → ctrl command → agent → stored profile → config retrieval).
- Adds a
pbm lifecyclecommand that filters backups by profile, prints a report, and deletes selected backups.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/impl.go | Passes lifecycle settings when adding a config profile via ctrl command. |
| pbm/lifecycle/manager.go | Implements lifecycle evaluation logic (rolling/calendar bucketing, failed handling, minKeep behavior) and report formatting. |
| pbm/lifecycle/manager_test.go | Adds unit coverage for core evaluator scenarios (rolling/calendar, leap day, failures, type-aware buckets). |
| pbm/ctrl/send.go | Extends SendAddConfigProfile to include lifecycle config in the command payload. |
| pbm/ctrl/cmd.go | Extends ProfileCmd BSON schema to store lifecycle settings. |
| pbm/config/util.go | Ensures profiled config application includes lifecycle settings. |
| pbm/config/config.go | Adds LifecycleConf, persists it on Config, and adds GetProfileConfig helper. |
| cmd/pbm/main.go | Adds new pbm lifecycle command with report output, prompting, and deletion loop. |
| cmd/pbm-agent/profile.go | Persists lifecycle settings when handling “add config profile” commands. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| Epoch primitive.Timestamp `bson:"epoch" json:"-" yaml:"-"` | ||
| Epoch primitive.Timestamp `bson:"epoch" json:"-" yaml:"-"` | ||
| Lifecycle LifecycleConf `bson:"lifecycle,omitempty" json:"lifecycle,omitempty" yaml:"lifecycle,omitempty"` |
There was a problem hiding this comment.
Config.Lifecycle is declared as a non-pointer struct with omitempty tags. For JSON/YAML/BSON encoding, omitempty does not omit non-pointer structs, so lifecycle will always be emitted/persisted (often as an empty object), which can be a backward-incompatible change for config output and stored documents. Consider making this field a *LifecycleConf (and applying defaults when nil) so omitempty works as intended.
| Lifecycle LifecycleConf `bson:"lifecycle,omitempty" json:"lifecycle,omitempty" yaml:"lifecycle,omitempty"` | |
| Lifecycle *LifecycleConf `bson:"lifecycle,omitempty" json:"lifecycle,omitempty" yaml:"lifecycle,omitempty"` |
| return res | ||
| } | ||
|
|
||
| // Replace the Evaluate function (~ line 50) with this: |
There was a problem hiding this comment.
This looks like a patch note that accidentally made it into the codebase ("Replace the Evaluate function..."). It will be confusing for future maintainers; please remove it or replace it with a doc comment describing what Evaluate does/returns.
| // Replace the Evaluate function (~ line 50) with this: | |
| // Evaluate analyzes backups according to the lifecycle configuration and | |
| // returns a report describing which backups should be kept or purged. |
| // BUG FIX: 4. Enforce Min Keep (Rescue backups from PURGE) | ||
| minKeep := 1 | ||
| if cfg.MinKeep != nil { | ||
| minKeep = *cfg.MinKeep | ||
| } | ||
|
|
||
| if minKeep > 0 && len(report.BackupsKept) < minKeep { | ||
| var rescue []backup.BackupMeta |
There was a problem hiding this comment.
Evaluate currently enforces minKeep by modifying the keep/purge decision (rescuing backups from the purge list). This effectively bypasses the CLI-side safety circuit breaker (the later len(report.BackupsKept) < minKeep check will rarely/never trigger) and changes semantics from “abort purge” to “purge anyway but keep at least N”. Consider keeping Evaluate purely as a planner (reporting what would be purged) and returning a MinKeepViolated flag / error so the caller can abort or prompt, instead of silently rewriting the plan.
| // GetProfileConfig fetches a specific configuration profile from the database | ||
| func GetProfileConfig(ctx context.Context, m connect.Client, name string) (*Config, error) { | ||
| res := m.ConfigCollection().FindOne(ctx, bson.D{{"name", name}, {"profile", true}}) | ||
| if err := res.Err(); err != nil { | ||
| if errors.Is(err, mongo.ErrNoDocuments) { | ||
| return nil, errors.Wrapf(ErrMissedConfigProfile, "profile '%s' not found", name) | ||
| } | ||
| return nil, errors.Wrap(err, "get profile") | ||
| } | ||
|
|
||
| cfg := &Config{} | ||
| if err := res.Decode(&cfg); err != nil { | ||
| return nil, errors.Wrap(err, "decode profile config") | ||
| } | ||
|
|
||
| // Apply base struct defaults | ||
| if cfg.PITR == nil { | ||
| cfg.PITR = &PITRConf{} | ||
| } | ||
| if cfg.Backup == nil { | ||
| cfg.Backup = &BackupConf{} | ||
| } | ||
| if cfg.Restore == nil { | ||
| cfg.Restore = &RestoreConf{} | ||
| } | ||
|
|
||
| return cfg, nil | ||
| } |
There was a problem hiding this comment.
GetProfileConfig largely duplicates the fetch/decode/defaulting logic from GetConfig, while a GetProfile helper already exists in profile.go. To avoid two separate code paths drifting (e.g., defaults added in one but not the other), consider consolidating this into a shared internal helper or extending GetProfile to apply the same defaults as GetConfig.
| } | ||
|
|
||
| if *dryRun || !cfg.Lifecycle.Enabled { | ||
| return lifecycleResult{Report: report}, nil |
There was a problem hiding this comment.
In text mode, the report is printed directly via fmt.Println(report.String()), but for dry-run / disabled lifecycle the returned lifecycleResult has an empty Msg, so printo() will still emit an extra blank line. Consider setting a non-empty message for these early returns, or returning nil in text mode when you’ve already printed the report (while still returning a JSON-serializable object in JSON mode).
| return lifecycleResult{Report: report}, nil | |
| if isJSON { | |
| return lifecycleResult{Report: report}, nil | |
| } | |
| return nil, nil |
|
|
||
| func (c *Client) AddConfigProfile(ctx context.Context, name string, cfg *Config) (CommandID, error) { | ||
| opid, err := ctrl.SendAddConfigProfile(ctx, c.conn, name, cfg.Storage) | ||
| opid, err := ctrl.SendAddConfigProfile(ctx, c.conn, name, cfg.Storage, &cfg.Lifecycle) |
There was a problem hiding this comment.
SendAddConfigProfile now accepts a *config.LifecycleConf, but this call always passes &cfg.Lifecycle (non-nil). If the intent is to omit lifecycle settings when they’re not specified (and rely on defaults), consider passing nil when lifecycle is unset/zero, especially since the downstream command uses omitempty on the lifecycle field.
| opid, err := ctrl.SendAddConfigProfile(ctx, c.conn, name, cfg.Storage, &cfg.Lifecycle) | |
| var lifecycle *config.LifecycleConf | |
| if cfg != nil && cfg.Lifecycle != (config.LifecycleConf{}) { | |
| lifecycle = &cfg.Lifecycle | |
| } | |
| opid, err := ctrl.SendAddConfigProfile(ctx, c.conn, name, cfg.Storage, lifecycle) |
| // Calendar Option: Find the backup closest to the targeted day | ||
| minDiff := 31 // Max possible difference | ||
| for i, bcp := range candidates { | ||
| bcpTime := time.Unix(bcp.StartTS, 0).UTC() | ||
| diff := 0 | ||
|
|
||
| if isMonthly { | ||
| diff = int(math.Abs(float64(bcpTime.Day() - targetDayInt))) | ||
| } else { | ||
| diff = int(math.Abs(float64(bcpTime.Weekday() - time.Weekday(targetDayInt)))) | ||
| } |
There was a problem hiding this comment.
findBestCandidate in "calendar" mode selects the backup with the closest weekday/day-of-month (nearest-neighbor). This means it may keep a non-Friday/non-15th backup if the exact anchored day is missing, which conflicts with the PR description’s “strict anchoring” behavior (e.g., “keep every Friday backup”). If strict anchoring is required, consider only accepting exact matches (or explicitly documenting the nearest-neighbor fallback in user-facing docs/output).
Summary
This PR introduces Backup Lifecycle Management for Percona Backup for MongoDB (PBM). This feature automates the retention and rotation of snapshots using a Grandfather-Father-Son (GFS) policy, addressing critical needs for cost-efficient storage management while ensuring long-term compliance and data durability.
Key Features
Rolling(Default): A time-decay algorithm that creates age-based buckets, ensuring resilience against skipped backup windows.Calendar: A strict anchoring strategy for compliance-heavy environments (e.g., "Keep every Friday backup").minKeep): Prevents the "Empty Storage Paradox." If a rotation would leave the cluster with fewer thanXbackups (default 1), the operation aborts to protect the last remaining recovery points.StatusErrorbackups after they age out of the daily window, while protecting in-progress backups.--dry-runcapability and a mandatory confirmation prompt (defaulttrue) to prevent accidental data loss during manual runs.Technical Implementation Details
backup.DeleteBackupfunction, inheriting PBM's internalisRequiredForOplogSlicingchecks. Deletion will be refused if a snapshot is required for an active Point-in-Time Recovery chain.donesnapshots, ensuring thatstarting,running, orstuckbackups are never targeted for deletion.PBM_MONGODB_URIenvironment variables and standard PBM connection string formats.Testing Performed
minKeepabort logic in both interactive and automated (prompt: false) modes.Ctrl+Csignal handling during the deletion loop to ensure graceful termination.User Documentation Snapshot
Configuration Parameters
lifecycle.enabledfalselifecycle.strategy"rolling""rolling"or"calendar".lifecycle.minKeep1lifecycle.prompttrue[y/N]before deletion.lifecycle.purgeFailedfalseRecommended Baseline (7 / 4 / 12)
Automation (Cron)
For background automation, set lifecycle.prompt: false. The minKeep safety net will automatically abort the purge if no new backups are detected, protecting your historical data.
Full documentation available and will be provided to engineering separately.