
Release oasis storage prune and compact subcommands #6521

Open
martintomazic wants to merge 12 commits into master from martin/feature/release-storage-prune-cmd

Conversation

@martintomazic
Contributor

@martintomazic martintomazic commented May 6, 2026

Closes #6519.

Relatively trivial to review. LOC is big because I extracted commands to separate files and added missing tests (separate commits).

Motivation

  1. Offline pruning/compaction was introduced because late pruning (a) may not reclaim disk space, and (b) may cause the node to fall behind even once registered as ready (pending compaction load).
  2. We could also use it for go/oasis-node: Enable snapshot creation with exact start version #6423, serving as an alternative to go/oasis-node/cmd/storage: Add create and import checkpoint cmd #6454.

How to test locally

For testing I used the latest Sapphire snapshot (ca. 7 months of data, keep_n = 100_000). It took 10 min to prune and 6 min to compact. For consensus I used snapshot-old (ca. 3 months of data, keep_n = 100_000), due to limited disk space on my machine. Pruning took 2 min and compaction 3 min.

Node started syncing normally after prune/compact commands.

Possible follow-up

For #6454, we may want to add an additional flag to the pruning command, e.g. retain_height, that ignores the pruning config and instead computes the corresponding round at that height for every runtime, implementing the keep_from proposal.
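The proposed retain_height semantics could be sketched as a pure helper. Everything here is illustrative: historyEntry, retainRoundAtHeight, and the sorted-history representation are assumptions for the sketch, not oasis-core API.

```go
package main

// historyEntry is a hypothetical (consensus height, runtime round) pair from
// a runtime's history, sorted ascending by height.
type historyEntry struct {
	Height uint64 // consensus height at which the runtime round was committed
	Round  uint64 // runtime round
}

// retainRoundAtHeight sketches the proposed retain_height semantics: return
// the round of the last runtime block committed at or below retainHeight,
// i.e. the round the pruner should keep from. The bool is false when no
// runtime block exists at or below that height.
func retainRoundAtHeight(history []historyEntry, retainHeight uint64) (uint64, bool) {
	var (
		round uint64
		found bool
	)
	for _, e := range history {
		if e.Height > retainHeight {
			break // History is sorted; later entries are all above the limit.
		}
		round, found = e.Round, true
	}
	return round, found
}
```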

@netlify

netlify Bot commented May 6, 2026

Deploy Preview for oasisprotocol-oasis-core canceled.

Name Link
🔨 Latest commit 32bbd4e
🔍 Latest deploy log https://app.netlify.com/projects/oasisprotocol-oasis-core/deploys/69fd9b5370919a0008f45929

@martintomazic martintomazic force-pushed the martin/feature/release-storage-prune-cmd branch 4 times, most recently from c39199f to ddd1106 Compare May 7, 2026 12:49
Comment on lines +267 to +270
// By calculating retain round from the runtime state DB latest round,
// we ensure light history is never pruned past the latest synced runtime
// round.
retainRound := latest - numKept
Contributor Author

@martintomazic martintomazic May 7, 2026


In practice we should probably also respect the executor prune handler, which should prevent pruning past the last normal round.

// runtimeLastNormalRound returns the last normal round for the given runtime.
func runtimeLastNormalRound(ctx context.Context, ndb db.NodeDB, runtimeID common.Namespace) (uint64, error) {
	latest, ok := ndb.GetLatestVersion()
	if !ok {
		return 0, fmt.Errorf("consensus node DB is empty")
	}

	roots, err := ndb.GetRootsForVersion(latest)
	if err != nil {
		return 0, fmt.Errorf("failed to get roots for consensus version %d: %w", latest, err)
	}
	if len(roots) == 0 {
		return 0, fmt.Errorf("no roots found for consensus version %d", latest)
	}

	tree := mkvs.NewWithRoot(nil, ndb, roots[0], mkvs.WithoutWriteLog())
	defer tree.Close()

	s := roothashState.NewImmutableState(tree)
	rtState, err := s.RuntimeState(ctx, runtimeID)
	if err != nil {
		return 0, fmt.Errorf("failed to get runtime state: %w", err)
	}

	return rtState.LastNormalRound, nil
}

We have two options:

  1. Pass the consensus node DB to pruneRuntimeDBs.
    • Or just open and close it there (simplest, fewest changes).
    • This feels weird, as this function should probably accept retainRound, just like pruneConsensusDBs should accept retainHeight instead of runtimeIds (solution 2).
  2. Add new consensusRetainHeight and runtimeRetainRound functions that precompute these limits, and possibly write unit tests for them. The annoying part is that these two functions consume runtime histories, node DBs, and consensus node DBs, meaning we lose resource encapsulation and need to keep them open throughout the whole command. This feels like a better direction, but requires a thorough refactor and makes logging and orchestration incredibly messy.

In addition, adding this handler requires the consensus state to always be present. Without it, you can prune runtime state without having consensus state locally (e.g. when hacking on state imported from snapshots; I don't think this is needed though).
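A minimal sketch of what the runtimeRetainRound half of option 2 might look like once the limit is precomputed, assuming the executor-prune-handler clamp discussed above; the name and signature are illustrative only, not oasis-core API.

```go
package main

// runtimeRetainRound sketches option 2's precomputed runtime limit: the
// naive retain round (latest - numKept) clamped so we never prune past the
// runtime's last normal round. Purely illustrative, not oasis-core API.
func runtimeRetainRound(latest, numKept, lastNormalRound uint64) uint64 {
	if latest < numKept {
		// Fewer rounds than we want to keep: nothing to prune.
		return 0
	}
	return min(latest-numKept, lastNormalRound)
}
```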

Comment thread go/oasis-node/cmd/storage/prune.go
Comment on lines +94 to +121
ndb, close, err := openConsensusNodeDB(dataDir)
if err != nil {
return fmt.Errorf("failed to open NodeDB: %w", err)
}
defer close()

latest, ok := ndb.GetLatestVersion()
if !ok {
logger.Info("skipping consensus pruning as state db is empty")
return nil
}

if latest < numKept {
logger.Info("skipping consensus pruning as the latest version is smaller than the number of versions to keep")
return nil
}

// In case of configured runtimes, do not prune past the earliest reindexed
// consensus height, so that light history can be populated correctly.
minReindexed, err := minReindexedHeight(dataDir, runtimes)
if err != nil {
return fmt.Errorf("failed to fetch earliest reindexed consensus height: %w", err)
}

retainHeight := min(
latest-numKept, // underflow not possible due to if above.
uint64(minReindexed),
)
Contributor Author

@martintomazic martintomazic May 7, 2026


E.g. this could be func consensusRetainHeight(ndb db.NodeDB, histories []history.Histories) (uint64, bool, error), and this function would then take retainHeight uint64 instead of the last two params, which would also allow us to unit test the business logic. As stated above, this complicates the orchestration a lot though :(
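The refactor floated here could look roughly like this: the business logic from the quoted snippet extracted into a pure, unit-testable function. The signature and names are assumptions for the sketch (the real version would derive latest and minReindexed from the node DB and histories), not the actual oasis-core code.

```go
package main

// consensusRetainHeight sketches the proposed extraction: latest is the
// newest consensus version in the node DB, numKept the configured number of
// versions to keep, and minReindexed the earliest reindexed consensus height
// across configured runtimes. The bool mirrors the "skip pruning" early
// return in the quoted snippet.
func consensusRetainHeight(latest, numKept, minReindexed uint64) (uint64, bool) {
	if latest < numKept {
		return 0, false // Skip: fewer versions than we want to keep.
	}
	// Underflow not possible due to the check above; clamp by the earliest
	// reindexed height so light history can be populated correctly.
	return min(latest-numKept, minReindexed), true
}
```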

@martintomazic martintomazic force-pushed the martin/feature/release-storage-prune-cmd branch from ddd1106 to c768a9a Compare May 7, 2026 21:38
logger.Info("Starting databases pruning. This may take a while...")

dataDir := cmdCommon.DataDir()
ctx := cmd.Context()
Contributor Author


NIT: We should probably wire

 signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)

as the cobra context does not capture these signals by default, resulting in a non-graceful shutdown (never had an issue so far though). Still, it is annoying that the BadgerDB command does not expose a cancellation API, so even with this fixed it won't be complete.

Best way to mitigate the unlikely stale state scenario. Moreover,
it enables introducing context cancellation.
Code was preserved as is, except for using the new new*Cmd()
pattern introduced in the inspect command.
Offline compaction now also works for the runtime history
and state DB.
Make it symmetric with the prune command helpers.
@martintomazic martintomazic force-pushed the martin/feature/release-storage-prune-cmd branch from c768a9a to 32bbd4e Compare May 8, 2026 08:14
@martintomazic martintomazic marked this pull request as ready for review May 8, 2026 08:58
@martintomazic
Contributor Author

Ready for a review.

Please check the PR context; the main question is how much we want to complicate our life with prune handlers - comment.

Since the PR is already rather big, I am open to pushing one PR in front that:

  • factors the prune/compact commands into separate files,
  • adds missing tests,
  • applies minor fixes,
  • prunes the block and state store in small intervals.

Then this PR becomes much smaller, focused only on the new runtime logic and releasing the commands.

Comment thread docs/oasis-node/cli.md
@@ -386,7 +386,7 @@ enabling it for the first time, or later changing it to retain less data. This
way they guarantee the node is healthy when it starts.

Following successful pruning, to release disk space, they are encouraged to run
Member


So many words :)

Suggested change
Following successful pruning, to release disk space, they are encouraged to run
After pruning, operators should run [the compaction command](#compact) to release disk space.



Development

Successfully merging this pull request may close these issues.

Release offline pruning and compact command

2 participants