Skip to content

HDDS-12791. Make Retention Service suspendable#9426

Merged
ChenSammi merged 5 commits intoapache:HDDS-8342from
xichen01:HDDS-12791
Dec 29, 2025
Merged

HDDS-12791. Make Retention Service suspendable#9426
ChenSammi merged 5 commits intoapache:HDDS-8342from
xichen01:HDDS-12791

Conversation

@xichen01
Copy link
Copy Markdown
Contributor

@xichen01 xichen01 commented Dec 4, 2025

What changes were proposed in this pull request?

This PR makes the KeyLifecycleService suspendable at runtime without requiring an OM restart.

  • Make ozone.lifecycle.service.enabled reconfigurable, the KeyLifecycleService can be stopped by reconfiguring this configuration.
  • Add KeyLifecycleService status query command
    Example:
$ ozone admin om lifecycle status --service-id <omServiceId>
========================================
          Lifecycle Service Status
========================================
IsEnabled: true
IsSuspended: false
Running Buckets:
  - /vol1/bucket1
========================================

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-12791

How was this patch tested?

unit test.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes the KeyLifecycleService suspendable at runtime without requiring an OM restart, adding the ability to dynamically enable/disable the service and query its status.

  • Makes ozone.lifecycle.service.enabled reconfigurable so the service can be stopped/started via reconfiguration
  • Adds a new ozone admin om lifecycle status CLI command to query the service state
  • Implements additional shouldRun() checks in the service to allow graceful suspension during bucket processing

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
hadoop-ozone/interface-client/src/main/proto/OmClientProtocol.proto Adds GetLifecycleServiceStatus request/response proto messages and type enum
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/OmUtils.java Marks GetLifecycleServiceStatus as a read-only operation
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocol/OzoneManagerProtocol.java Adds getLifecycleServiceStatus() method to the OM protocol interface
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/OzoneManagerProtocolClientSideTranslatorPB.java Implements client-side translation for the new getLifecycleServiceStatus() RPC
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/protocolPB/OzoneManagerRequestHandler.java Adds request handling for GetLifecycleServiceStatus command type
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java Implements getLifecycleServiceStatus() and adds reconfiguration handler for lifecycle service enabled flag
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/service/KeyLifecycleService.java Changes isServiceEnabled to AtomicBoolean, adds setServiceEnabled() and status() methods, and adds runtime suspension checks
hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/admin/om/OMAdmin.java Registers the new LifecycleSubCommand
hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/admin/om/LifecycleSubCommand.java New subcommand for lifecycle-related admin operations
hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/admin/om/LifecycleStatusSubCommand.java Implements the status query CLI command
hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/admin/om/FinalizationStatusSubCommand.java Refactored to use try-with-resources for proper client cleanup
hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/service/TestKeyLifecycleService.java Adds test case for lifecycle service status query functionality
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/shell/TestOzoneShellHA.java Adds integration test for the lifecycle status CLI command
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/reconfig/TestOmReconfiguration.java Adds lifecycle service enabled configuration to reconfigurable properties test

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jojochuang jojochuang added the s3-lifecycle HDDS-8342 label Dec 8, 2025
@jojochuang jojochuang requested a review from ChenSammi December 8, 2025 17:27
}

private String reconfKeyLifecycleServiceEnabled(String newVal) {
boolean enabled = StringUtils.isEmpty(newVal) ?
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we throw exception if newVal is null? I think "null" is an invalid input, so it's better response to user with correct info. Will the message in exception be passed to client and show as the output of CLI?

LOG.warn("Failed reconfiguration for '{}'. KeyLifecycleService is not initialized.",
OZONE_KEY_LIFECYCLE_SERVICE_ENABLED);
}
getConfiguration().setBoolean(OZONE_KEY_LIFECYCLE_SERVICE_ENABLED, enabled);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's change the configuration first, and then update the service state.

UpgradeFinalization.StatusAndMessages progress =
client.queryUpgradeFinalizationProgress(upgradeClientID, false, true);
System.out.println(progress.status());
try (OzoneManagerProtocol client = parent.createOmClient(omServiceId, omHost, false)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a relevant change ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

* Handler of ozone admin status command.
*/
@Command(
name = "status",
Copy link
Copy Markdown
Contributor

@ChenSammi ChenSammi Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have new CLI to query the service status, and leverage the reconfiguration framework for the service start/stop, can we use consolidate to one entry, for example, new CLI for both start/stop/status? Current both status CLI and reconfiguration with both ozone-site.xml content change and reconfiguration CLI, feels a little complicated for user.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added suspend subcommand.

keyTable.iterator(omMetadataManager.getBucketKey(volumeName, bucketName))) {
while (keyTblItr.hasNext()) {
if (!shouldRun()) {
return;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a log message here too about the abort of bucket key evaluation?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

return result;
}

if (!shouldRun()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When it comes here, the key evaluation is finished, the rest of work is sending the expired remaining key to delete/rename. I think we can let this run to finish, instead of return here.

The most time consuming part of DB key evaluation iteration, send keys to delete/rename doesn't take much time.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Set<String> runningBuckets = new HashSet<>(inFlight.keySet());
return GetLifecycleServiceStatusResponse.newBuilder()
.setIsRunning(!runningBuckets.isEmpty())
.setIsEnabled(isServiceEnabled.get())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we combine the IsRunning and IsEnabled to one state. If IsRuning is the result of whether there is bucket task running or not, this info is already covered by RunningBuckets fields.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the setIsRunning.

* Set isServiceEnabled.
* @param enabled whether enable the lifecycle Service
*/
public void setServiceEnabled(boolean enabled) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reuse the existing suspended variable, and suspend(), resume() function, instead of this new setServiceEnabled() function, as the current suspended variable and it involved function is actually not used?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reused the suspended variable and the the isServiceEnabled only based on the Configuration, cannot modify by user

repeated string runningBuckets = 2;
}

message SuspendLifecycleServiceRequest {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add an option in the SuspendLifecycleServiceRequest so that we can resume the service once suspend is no longer needed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a resume command. new we can suspend and resume the lifecycle service

}

message GetLifecycleServiceStatusResponse {
required bool isEnabled = 1;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add one more state, to indicate that service is suspened/paused or not?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the suspened flag

@ChenSammi
Copy link
Copy Markdown
Contributor

Thanks @xichen01 for updating the patch. The last commit looks overall good to me, +1.

@ChenSammi ChenSammi merged commit b16ea44 into apache:HDDS-8342 Dec 29, 2025
41 of 43 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

s3-lifecycle HDDS-8342

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants