HDDS-13890. Datanode supports dynamic configuration of SCM #9385

Merged
ChenSammi merged 9 commits into apache:master from ivandika3:HDDS-13890
Dec 15, 2025

@ivandika3 ivandika3 commented Nov 27, 2025

What changes were proposed in this pull request?

Currently, migrating the SCM from (scm1, scm2, scm3) to (scm4, scm5, scm6) requires restarting all the datanodes twice, each time with an updated "ozone.scm.nodes." configuration:

  1. Before migration: Update ozone.scm.nodes to (scm1, scm2, scm3, scm4, scm5, scm6)
  2. After (scm1, scm2, scm3) are decommissioned: Update ozone.scm.nodes to (scm4, scm5, scm6)

As mentioned in HDDS-12391, a rolling restart of all the datanodes takes a while. For a large datanode fleet, it can take days or even weeks.

It would be good to support dynamic reconfiguration of the SCM endpoints in the DN to avoid these restarts. A possible flow:

  1. The admin updates "ozone.scm.nodes" to the new value (with some nodes added and some removed).
  2. The DN compares the new configuration with the previous one to find the SCM endpoints to add and remove.
  3. The DN adds the new SCM endpoints (e.g. SCMConnectionManager#addSCMServer) and then removes the old ones (e.g. SCMConnectionManager#removeSCMServer).
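
The comparison in step 2 can be sketched as a plain set difference. This is a minimal illustration, not the code in this patch: the class `ScmEndpointDiff` and its methods are hypothetical names, and the comma-separated node list is an assumption about how the configuration value is written.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Ozone.
final class ScmEndpointDiff {

  /** Splits a comma-separated node list, ignoring blanks and surrounding spaces. */
  static Set<String> parse(String value) {
    Set<String> nodes = new LinkedHashSet<>();
    for (String node : value.split(",")) {
      String trimmed = node.trim();
      if (!trimmed.isEmpty()) {
        nodes.add(trimmed);
      }
    }
    return nodes;
  }

  /** Nodes present in the updated configuration but not in the current one. */
  static Set<String> toAdd(Set<String> current, Set<String> updated) {
    Set<String> result = new LinkedHashSet<>(updated);
    result.removeAll(current);
    return result;
  }

  /** Nodes present in the current configuration but not in the updated one. */
  static Set<String> toRemove(Set<String> current, Set<String> updated) {
    Set<String> result = new LinkedHashSet<>(current);
    result.removeAll(updated);
    return result;
  }

  public static void main(String[] args) {
    Set<String> current = parse("scm1,scm2,scm3");
    Set<String> updated = parse("scm1,scm2,scm3,scm4,scm5,scm6");
    System.out.println("add: " + toAdd(current, updated));       // add: [scm4, scm5, scm6]
    System.out.println("remove: " + toRemove(current, updated)); // remove: []
  }
}
```

Applying the additions before the removals, as in step 3, means the DN never drops an endpoint before its replacement has been registered.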

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-13890

How was this patch tested?

Integration test.

Clean CI: https://github.com/ivandika3/ozone/actions/runs/19727935170

@ivandika3 ivandika3 marked this pull request as ready for review November 27, 2025 08:53
@ivandika3 ivandika3 self-assigned this Nov 27, 2025
@adoroszlai adoroszlai requested a review from ChenSammi December 1, 2025 12:08
@jojochuang

@rnblough

@jojochuang jojochuang self-requested a review December 1, 2025 17:36

@jojochuang jojochuang left a comment

This PR reminds me that we don't have an administrator doc on how to operationalize OM and SCM migration. The existing decommission doc (https://ozone.apache.org/docs/edge/feature/decommission.html) covers only the usage of the command and stops short of being a complete end-to-end migration tutorial.

@peterxcli peterxcli self-requested a review December 3, 2025 03:35
@@ -234,6 +235,7 @@ public void removeSCMServer(InetSocketAddress address) throws IOException {
}

EndpointStateMachine endPoint = scmMachines.get(address);

A question about the existing logic: would it be better to remove the endpoint from scmMachines first, and then shut it down?

@ivandika3 ivandika3 Dec 14, 2025

Thanks for the review. Although scmMachines is protected by a read-write lock, I think it's better to remove the endpoint from scmMachines first. Without the lock, SCMConnectionManager#getValues might return an EndpointStateMachine that has already been shut down, which might trigger a DN shutdown (see RunningDatanodeState#execute).
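
The ordering discussed here can be sketched with a small generic registry. This is an illustration under stated assumptions, not the SCMConnectionManager code: `EndpointRegistry` and its methods are hypothetical names, and `Closeable` stands in for the endpoint type.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a connection manager; not the Ozone class.
final class EndpointRegistry<K, E extends Closeable> {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<K, E> endpoints = new HashMap<>();

  void add(K key, E endpoint) {
    lock.writeLock().lock();
    try {
      endpoints.put(key, endpoint);
    } finally {
      lock.writeLock().unlock();
    }
  }

  /** Removes the endpoint from the map first, then closes it. Returns false for an unknown key. */
  boolean remove(K key) throws IOException {
    lock.writeLock().lock();
    try {
      E endpoint = endpoints.remove(key); // unpublish before shutting down
      if (endpoint == null) {
        return false;
      }
      endpoint.close();
      return true;
    } finally {
      lock.writeLock().unlock();
    }
  }

  /** Returns a snapshot of the registered endpoints. */
  List<E> values() {
    lock.readLock().lock();
    try {
      return new ArrayList<>(endpoints.values());
    } finally {
      lock.readLock().unlock();
    }
  }

  public static void main(String[] args) throws IOException {
    EndpointRegistry<String, Closeable> registry = new EndpointRegistry<>();
    registry.add("scm1", () -> System.out.println("scm1 endpoint closed"));
    registry.remove("scm1");
    System.out.println("visible endpoints: " + registry.values().size());
  }
}
```

With this order, the map never holds an endpoint that has already been closed, so the invariant does not depend solely on the lock.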

@ChenSammi

@ivandika3, thanks for adding this useful new capability to Ozone. The patch overall looks good to me. From the test case, it looks like adding the new SCMs and removing the decommissioned SCMs is done in two steps instead of one, so that is the recommended way, right?

Can we have a document about the recommended steps for SCM migration (add/remove) and the DN configuration change using this dynamic configuration, so that community users can follow the examples? Either open a new JIRA for the doc or include it in this patch; both are fine with me.

@jojochuang jojochuang left a comment

We would also need to deal with the Primordial SCM change, per the Decommission doc.

@jojochuang

> Can we have a document about the recommended steps for SCM migration (add/remove) and the DN configuration change using this dynamic configuration, so that community users can follow the examples? Either open a new JIRA for the doc or include it in this patch; both are fine with me.

Something like this: https://github.com/jojochuang/ozone/pull/new/SCMMigrationDoc

# Conflicts:
#	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/HddsDatanodeService.java
#	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeStateMachine.java
@ivandika3

ivandika3 commented Dec 14, 2025

Thanks @ChenSammi and @jojochuang for the reviews.

> From the test case, it looks like adding the new SCMs and removing the decommissioned SCMs is done in two steps instead of one, so that is the recommended way, right?

Yes, this is to reduce risk and allow easier rollback. We add the new SCMs first; once they are working fine (i.e. they have exited safe mode), we transfer leadership from an old SCM to a new SCM and then remove the old SCMs.

Regarding the docs, I have raised HDDS-14167 and HDDS-14168 for OM and SCM migrations respectively.

@ChenSammi ChenSammi left a comment

Thanks @ivandika3 for updating the patch. The last one LGTM, +1.

@ChenSammi ChenSammi merged commit fd89481 into apache:master Dec 15, 2025
43 checks passed
@ivandika3 ivandika3 deleted the HDDS-13890 branch December 15, 2025 09:14