Skip to content

Add backup lifecycle management with GFS rotation to pbm#1293

Open
daniel-percona wants to merge 7 commits intopercona:devfrom
daniel-percona:feature/lifecycle-management
Open

Add backup lifecycle management with GFS rotation to pbm#1293
daniel-percona wants to merge 7 commits intopercona:devfrom
daniel-percona:feature/lifecycle-management

Conversation

@daniel-percona
Copy link
Copy Markdown

Summary

This PR introduces Backup Lifecycle Management for Percona Backup for MongoDB (PBM). This feature automates the retention and rotation of snapshots using a Grandfather-Father-Son (GFS) policy, addressing critical needs for cost-efficient storage management while ensuring long-term compliance and data durability.

Key Features

  • Dual Rotation Strategies:
    • Rolling (Default): A time-decay algorithm that creates age-based buckets, ensuring resilience against skipped backup windows.
    • Calendar: A strict anchoring strategy for compliance-heavy environments (e.g., "Keep every Friday backup").
  • Safety Circuit Breaker (minKeep): Prevents the "Empty Storage Paradox." If a rotation would leave the cluster with fewer than X backups (default 1), the operation aborts to protect the last remaining recovery points.
  • Intelligent Failed Backup Handling: Optional purging of StatusError backups after they age out of the daily window, while protecting in-progress backups.
  • Interactive Safeguards: Built-in --dry-run capability and a mandatory confirmation prompt (default true) to prevent accidental data loss during manual runs.
  • Chronological Purging: Deletions are executed from oldest to newest, ensuring the most recent data is preserved longest in the event of a network or process interruption.

Technical Implementation Details

  • PITR Safety: The lifecycle manager utilizes the core backup.DeleteBackup function, inheriting PBM's internal isRequiredForOplogSlicing checks. Deletion will be refused if a snapshot is required for an active Point-in-Time Recovery chain.
  • Concurrency Safe: The manager strictly filters for done snapshots, ensuring that starting, running, or stuck backups are never targeted for deletion.
  • Standardized URI Handling: Fully compatible with existing PBM_MONGODB_URI environment variables and standard PBM connection string formats.

Testing Performed

  • Validated GFS bucket logic across 2 months of synthetic backup history.
  • Verified minKeep abort logic in both interactive and automated (prompt: false) modes.
  • Confirmed Ctrl+C signal handling during the deletion loop to ensure graceful termination.
  • Dry-run verification of "Calendar" vs "Rolling" selection logic.

User Documentation Snapshot

Configuration Parameters

Option Type Default Description
lifecycle.enabled Boolean false Toggles automated lifecycle management.
lifecycle.strategy String "rolling" Options: "rolling" or "calendar".
lifecycle.minKeep Integer 1 Ensures you are never left with 0 backups.
lifecycle.prompt Boolean true Requires interactive confirmation [y/N] before deletion.
lifecycle.purgeFailed Boolean false Purges failed backups after the daily window expires.

Recommended Baseline (7 / 4 / 12)

lifecycle:
  enabled: true
  strategy: "rolling"
  minKeep: 1
  prompt: true
  purgeFailed: true
  dailyRetention: 7
  weeklyRetention: 4
  monthlyRetention: 12

Automation (Cron)

For background automation, set lifecycle.prompt: false. The minKeep safety net will automatically abort the purge if no new backups are detected, protecting your historical data.

0 3 * * * /usr/bin/pbm lifecycle >> /var/log/pbm-lifecycle.log 2>&1

Full documentation available and will be provided to engineering separately.

@it-percona-cla
Copy link
Copy Markdown

it-percona-cla commented Mar 27, 2026

CLA assistant check
All committers have signed the CLA.

@boris-ilijic
Copy link
Copy Markdown
Member

boris-ilijic commented Mar 27, 2026

Hey @daniel-percona, thank you for contributing with this interesting feature. We’re going to discuss it during our next refinement session first, and then we'll proceed with PR review. Thanks again!

Also, please create a JIRA ticket for this feature when you have a moment. Much appreciated!

@daniel-percona
Copy link
Copy Markdown
Author

Hey @daniel-percona, thank you for contributing with this interesting feature. We’re going to discuss it during our next refinement session first, and then we'll proceed with PR review. Thanks again!

Also, please create a JIRA ticket for this feature when you have a moment. Much appreciated!

Thank you, happy to be able to contribute. I'll create the jira, I have some minor changes and tweaks to make and once I'm done I'll create the jira accordingly.

@daniel-percona
Copy link
Copy Markdown
Author

@boris-ilijic , JIRA created and link to docs submitted via the jira issue https://perconadev.atlassian.net/browse/PBM-1731

Hey @daniel-percona, thank you for contributing with this interesting feature. We’re going to discuss it during our next refinement session first, and then we'll proceed with PR review. Thanks again!

Also, please create a JIRA ticket for this feature when you have a moment. Much appreciated!

@boris-ilijic
Copy link
Copy Markdown
Member

boris-ilijic commented Mar 31, 2026

Thanks for the ticket. Please pull latest dev into your branch (you merged main instead d5c4758).

…licies

Summary of changes:
- Extended the communication layer to allow PBM configuration profiles
  to carry nested 'lifecycle' configuration blocks.
- Fixed a bug in the lifecycle evaluation engine where the 'minKeep'
  safety threshold was not correctly rescuing the last successful backups
  from the purge list.
- Updated the engine to respect 'Enabled: false' states instantly,
  preventing accidental purges when the policy is disabled.

Key Technical Modifications:
- pbm/ctrl/cmd.go: Added Lifecycle struct to ProfileCmd payload.
- pbm/ctrl/send.go: Updated SendAddConfigProfile to accept lifecycle data.
- sdk/impl.go: Modified AddConfigProfile to pass lifecycle pointers from
  the parsed YAML configuration.
- pbm-agent/profile.go: Updated handleAddConfigProfile to map received
  lifecycle metadata back into the MongoDB profile documents.
- pbm/lifecycle/manager.go: Re-engineered Evaluate() to correctly enforce
  MinKeep logic and handle disabled policies via early exit.
- pbm/lifecycle/manager_test.go: Updated unit tests to reflect and
  verify the corrected safety behavior for in-progress and last-base backups.

Validated in a sharded lab environment with physical and logical storage
profiles, verifying that retention rules are applied correctly per-profile.
…he best Physical backup for every single week and month in the schedule
@daniel-percona daniel-percona force-pushed the feature/lifecycle-management branch from 61bc3b7 to 8840688 Compare March 31, 2026 17:47
@daniel-percona
Copy link
Copy Markdown
Author

daniel-percona commented Mar 31, 2026

Thanks for the ticket. Please pull latest dev into your branch (you merged main instead d5c4758).

@boris-ilijic Done. I've rebased the branch onto dev and cleaned up the commit history. It should be ready for another look, thank you for your help.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds Backup Lifecycle Management to PBM, introducing a GFS-style retention/rotation evaluator and a new pbm lifecycle CLI command to plan and (optionally) delete expired backups, with support for config profiles.

Changes:

  • Introduces pbm/lifecycle with an Evaluate() planner and corresponding unit tests.
  • Extends config/profile plumbing to carry lifecycle settings (SDK → ctrl command → agent → stored profile → config retrieval).
  • Adds a pbm lifecycle command that filters backups by profile, prints a report, and deletes selected backups.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
sdk/impl.go Passes lifecycle settings when adding a config profile via ctrl command.
pbm/lifecycle/manager.go Implements lifecycle evaluation logic (rolling/calendar bucketing, failed handling, minKeep behavior) and report formatting.
pbm/lifecycle/manager_test.go Adds unit coverage for core evaluator scenarios (rolling/calendar, leap day, failures, type-aware buckets).
pbm/ctrl/send.go Extends SendAddConfigProfile to include lifecycle config in the command payload.
pbm/ctrl/cmd.go Extends ProfileCmd BSON schema to store lifecycle settings.
pbm/config/util.go Ensures profiled config application includes lifecycle settings.
pbm/config/config.go Adds LifecycleConf, persists it on Config, and adds GetProfileConfig helper.
cmd/pbm/main.go Adds new pbm lifecycle command with report output, prompting, and deletion loop.
cmd/pbm-agent/profile.go Persists lifecycle settings when handling “add config profile” commands.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pbm/config/config.go Outdated

Epoch primitive.Timestamp `bson:"epoch" json:"-" yaml:"-"`
Epoch primitive.Timestamp `bson:"epoch" json:"-" yaml:"-"`
Lifecycle LifecycleConf `bson:"lifecycle,omitempty" json:"lifecycle,omitempty" yaml:"lifecycle,omitempty"`
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Config.Lifecycle is declared as a non-pointer struct with omitempty tags. For JSON/YAML/BSON encoding, omitempty does not omit non-pointer structs, so lifecycle will always be emitted/persisted (often as an empty object), which can be a backward-incompatible change for config output and stored documents. Consider making this field a *LifecycleConf (and applying defaults when nil) so omitempty works as intended.

Suggested change
Lifecycle LifecycleConf `bson:"lifecycle,omitempty" json:"lifecycle,omitempty" yaml:"lifecycle,omitempty"`
Lifecycle *LifecycleConf `bson:"lifecycle,omitempty" json:"lifecycle,omitempty" yaml:"lifecycle,omitempty"`

Copilot uses AI. Check for mistakes.
Comment thread pbm/lifecycle/manager.go Outdated
return res
}

// Replace the Evaluate function (~ line 50) with this:
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a patch note that accidentally made it into the codebase ("Replace the Evaluate function..."). It will be confusing for future maintainers; please remove it or replace it with a doc comment describing what Evaluate does/returns.

Suggested change
// Replace the Evaluate function (~ line 50) with this:
// Evaluate analyzes backups according to the lifecycle configuration and
// returns a report describing which backups should be kept or purged.

Copilot uses AI. Check for mistakes.
Comment thread pbm/lifecycle/manager.go Outdated
Comment on lines +192 to +199
// BUG FIX: 4. Enforce Min Keep (Rescue backups from PURGE)
minKeep := 1
if cfg.MinKeep != nil {
minKeep = *cfg.MinKeep
}

if minKeep > 0 && len(report.BackupsKept) < minKeep {
var rescue []backup.BackupMeta
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Evaluate currently enforces minKeep by modifying the keep/purge decision (rescuing backups from the purge list). This effectively bypasses the CLI-side safety circuit breaker (the later len(report.BackupsKept) < minKeep check will rarely/never trigger) and changes semantics from “abort purge” to “purge anyway but keep at least N”. Consider keeping Evaluate purely as a planner (reporting what would be purged) and returning a MinKeepViolated flag / error so the caller can abort or prompt, instead of silently rewriting the plan.

Copilot uses AI. Check for mistakes.
Comment thread pbm/config/config.go Outdated
Comment on lines +593 to +620
// GetProfileConfig fetches a specific configuration profile from the database
func GetProfileConfig(ctx context.Context, m connect.Client, name string) (*Config, error) {
res := m.ConfigCollection().FindOne(ctx, bson.D{{"name", name}, {"profile", true}})
if err := res.Err(); err != nil {
if errors.Is(err, mongo.ErrNoDocuments) {
return nil, errors.Wrapf(ErrMissedConfigProfile, "profile '%s' not found", name)
}
return nil, errors.Wrap(err, "get profile")
}

cfg := &Config{}
if err := res.Decode(&cfg); err != nil {
return nil, errors.Wrap(err, "decode profile config")
}

// Apply base struct defaults
if cfg.PITR == nil {
cfg.PITR = &PITRConf{}
}
if cfg.Backup == nil {
cfg.Backup = &BackupConf{}
}
if cfg.Restore == nil {
cfg.Restore = &RestoreConf{}
}

return cfg, nil
}
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetProfileConfig largely duplicates the fetch/decode/defaulting logic from GetConfig, while a GetProfile helper already exists in profile.go. To avoid two separate code paths drifting (e.g., defaults added in one but not the other), consider consolidating this into a shared internal helper or extending GetProfile to apply the same defaults as GetConfig.

Copilot uses AI. Check for mistakes.
Comment thread cmd/pbm/main.go Outdated
}

if *dryRun || !cfg.Lifecycle.Enabled {
return lifecycleResult{Report: report}, nil
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In text mode, the report is printed directly via fmt.Println(report.String()), but for dry-run / disabled lifecycle the returned lifecycleResult has an empty Msg, so printo() will still emit an extra blank line. Consider setting a non-empty message for these early returns, or returning nil in text mode when you’ve already printed the report (while still returning a JSON-serializable object in JSON mode).

Suggested change
return lifecycleResult{Report: report}, nil
if isJSON {
return lifecycleResult{Report: report}, nil
}
return nil, nil

Copilot uses AI. Check for mistakes.
Comment thread sdk/impl.go Outdated

func (c *Client) AddConfigProfile(ctx context.Context, name string, cfg *Config) (CommandID, error) {
opid, err := ctrl.SendAddConfigProfile(ctx, c.conn, name, cfg.Storage)
opid, err := ctrl.SendAddConfigProfile(ctx, c.conn, name, cfg.Storage, &cfg.Lifecycle)
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SendAddConfigProfile now accepts a *config.LifecycleConf, but this call always passes &cfg.Lifecycle (non-nil). If the intent is to omit lifecycle settings when they’re not specified (and rely on defaults), consider passing nil when lifecycle is unset/zero, especially since the downstream command uses omitempty on the lifecycle field.

Suggested change
opid, err := ctrl.SendAddConfigProfile(ctx, c.conn, name, cfg.Storage, &cfg.Lifecycle)
var lifecycle *config.LifecycleConf
if cfg != nil && cfg.Lifecycle != (config.LifecycleConf{}) {
lifecycle = &cfg.Lifecycle
}
opid, err := ctrl.SendAddConfigProfile(ctx, c.conn, name, cfg.Storage, lifecycle)

Copilot uses AI. Check for mistakes.
Comment thread pbm/lifecycle/manager.go
Comment on lines +252 to +262
// Calendar Option: Find the backup closest to the targeted day
minDiff := 31 // Max possible difference
for i, bcp := range candidates {
bcpTime := time.Unix(bcp.StartTS, 0).UTC()
diff := 0

if isMonthly {
diff = int(math.Abs(float64(bcpTime.Day() - targetDayInt)))
} else {
diff = int(math.Abs(float64(bcpTime.Weekday() - time.Weekday(targetDayInt))))
}
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

findBestCandidate in "calendar" mode selects the backup with the closest weekday/day-of-month (nearest-neighbor). This means it may keep a non-Friday/non-15th backup if the exact anchored day is missing, which conflicts with the PR description’s “strict anchoring” behavior (e.g., “keep every Friday backup”). If strict anchoring is required, consider only accepting exact matches (or explicitly documenting the nearest-neighbor fallback in user-facing docs/output).

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants