HDDS-13890. Datanode supports dynamic configuration of SCM#9385
ChenSammi merged 9 commits into apache:master
This PR reminds me that we don't have an administrator doc on how to operationalize OM and SCM migration. The existing decommission doc (https://ozone.apache.org/docs/edge/feature/decommission.html) covers only the usage of the command and stops short of being a complete end-to-end tutorial for migration.
@@ -234,6 +235,7 @@ public void removeSCMServer(InetSocketAddress address) throws IOException {
    }

    EndpointStateMachine endPoint = scmMachines.get(address);
A question about the existing logic: would it be better to remove the endpoint from scmMachines first, and then shut it down?
Thanks for the review. Although scmMachines is protected by a read-write lock, I agree it's better to remove the endpoint from scmMachines first. Without the lock, SCMConnectionManager#getValues could return an EndpointStateMachine that has already been shut down, which might trigger a DN shutdown (see RunningDatanodeState#execute).
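To illustrate the ordering discussed here, the following is a minimal sketch and not the code from this patch. `EndpointRegistrySketch` and its nested `Endpoint` class are hypothetical stand-ins for SCMConnectionManager and EndpointStateMachine. The assumption is that the endpoint is unregistered before it is closed, both under the write lock, so a reader taking a snapshot never sees an endpoint that is registered but already shut down.

```java
import java.io.Closeable;
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified stand-in for the connection manager discussed above.
class EndpointRegistrySketch {

  // Hypothetical stand-in for EndpointStateMachine; it only records whether it was closed.
  static class Endpoint implements Closeable {
    private volatile boolean closed;

    @Override
    public void close() {
      closed = true;
    }

    boolean isClosed() {
      return closed;
    }
  }

  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<InetSocketAddress, Endpoint> endpoints = new HashMap<>();

  void add(InetSocketAddress address) {
    lock.writeLock().lock();
    try {
      endpoints.putIfAbsent(address, new Endpoint());
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Unregister the endpoint first, then shut it down, all under the write lock.
  // Returns the removed endpoint, or null if the address was not registered.
  Endpoint remove(InetSocketAddress address) {
    lock.writeLock().lock();
    try {
      Endpoint removed = endpoints.remove(address);
      if (removed != null) {
        removed.close();
      }
      return removed;
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Returns a snapshot taken under the read lock, so it cannot interleave with remove().
  Collection<Endpoint> getValues() {
    lock.readLock().lock();
    try {
      return new ArrayList<>(endpoints.values());
    } finally {
      lock.readLock().unlock();
    }
  }
}
```

Even if the lock were dropped later, removing before closing keeps the window in which a registered endpoint is already shut down as small as possible.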
@ivandika3, thanks for enabling Ozone with this useful new capability. The patch overall looks good to me. From the test case, it looks like adding the new SCMs and removing the decommissioned SCMs are done in two separate steps rather than one. Is that the recommended way? Can we have a document describing the recommended steps for SCM migration (add/remove) and the DN configuration change using this dynamic configuration, so that community users can follow the examples? Either opening a new JIRA for the doc or including it in this patch is fine with me.
jojochuang
left a comment
We would also need to deal with the Primordial SCM change, per the Decommission doc.
Something like this: https://github.com/jojochuang/ozone/pull/new/SCMMigrationDoc
# Conflicts:
#   hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/HddsDatanodeService.java
#   hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeStateMachine.java
Thanks @ChenSammi and @jojochuang for the reviews.
Yes, this is to reduce risk and allow an easier rollback. We will add the new SCMs first, and once they are working fine (i.e. they have exited safe mode), we will transfer leadership from the old SCM to the new SCM and then remove the old SCMs. Regarding the docs, I have raised HDDS-14167 and HDDS-14168 for OM and SCM migrations respectively.
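The two-step flow described in this reply can be sketched as two successive set comparisons. This is only an illustration under stated assumptions: `ScmMigrationPlanSketch`, `Diff`, `diff`, and `union` are hypothetical names that do not appear in the patch, and the node IDs are the placeholder names from the PR description rather than real configuration values.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper that plans the add-then-remove migration as two reconfiguration steps.
class ScmMigrationPlanSketch {

  // Endpoints a datanode would have to add and remove to move from one set to another.
  static class Diff {
    final Set<String> toAdd;
    final Set<String> toRemove;

    Diff(Set<String> toAdd, Set<String> toRemove) {
      this.toAdd = toAdd;
      this.toRemove = toRemove;
    }
  }

  // Computes which SCMs are new in target and which ones are no longer present.
  static Diff diff(Set<String> current, Set<String> target) {
    Set<String> toAdd = new LinkedHashSet<>(target);
    toAdd.removeAll(current);
    Set<String> toRemove = new LinkedHashSet<>(current);
    toRemove.removeAll(target);
    return new Diff(toAdd, toRemove);
  }

  static Set<String> union(Set<String> a, Set<String> b) {
    Set<String> result = new LinkedHashSet<>(a);
    result.addAll(b);
    return result;
  }

  public static void main(String[] args) {
    Set<String> oldScms = new LinkedHashSet<>(Arrays.asList("scm1", "scm2", "scm3"));
    Set<String> newScms = new LinkedHashSet<>(Arrays.asList("scm4", "scm5", "scm6"));

    // Step 1: datanodes talk to both old and new SCMs, so a rollback only needs to drop the new ones.
    Set<String> phase1 = union(oldScms, newScms);
    Diff step1 = diff(oldScms, phase1);
    System.out.println("step 1 add=" + step1.toAdd + " remove=" + step1.toRemove);

    // Step 2: once the new SCMs are healthy and leadership has moved, drop the old SCMs.
    Diff step2 = diff(phase1, newScms);
    System.out.println("step 2 add=" + step2.toAdd + " remove=" + step2.toRemove);
  }
}
```

Running `main` prints `step 1 add=[scm4, scm5, scm6] remove=[]` followed by `step 2 add=[] remove=[scm1, scm2, scm3]`, which shows that the first step only adds endpoints and the second step only removes them.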
ChenSammi
left a comment
Thanks @ivandika3 for updating the patch. The last one LGTM, +1.
What changes were proposed in this pull request?
Currently, if we would like to migrate the SCMs from (scm1, scm2, scm3) to (scm4, scm5, scm6), all the datanodes need to be restarted twice with an updated "ozone.scm.nodes." configuration.
As mentioned in HDDS-12391, a rolling restart of all the datanodes might take a while. For a large datanode fleet, this might take a lot of time (days or even weeks).
It might be good to support dynamic reconfiguration of SCM endpoints in the DN to avoid restarts. A possible flow
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13890
How was this patch tested?
Integration test.
Clean CI: https://github.com/ivandika3/ozone/actions/runs/19727935170