HDDS-15327. Proactively clear failed replication commands in SCM#10540
HDDS-15327. Proactively clear failed replication commands in SCM#10540chihsuan wants to merge 10 commits into
Conversation
Follower SCMs no longer clear pending ops on stray command-status reports, matching the existing leader guard in DeletedBlockLogImpl.
|
Thank @chihsuan for your PR and the proposal. However, the primary scope of this specific ticket is strictly to adjust the default value of Modifying the ReplicationManager is quite complex and sensitive, we'd love to see a formal design doc under a new ticket to discuss this deeply. |
|
Thanks @chungen0126 for the review! My reading was that the config change is only the workaround mentioned in the ticket, and the actual goal of HDDS-15327 is to fix the root cause. That's why this PR goes after the proactive-clear path. cc @jojochuang since you filed this, could you confirm the intended scope of HDDS-15327? 🙏 Is it the proactive-clear fix this PR implements, or just bumping the default Just for context, this PR reuses the exact command-status feedback path that That said, I agree this touches a sensitive, larger area. Happy to write up a more details doc and discuss the scope further in the Jira so we align before landing anything. Thanks! 🙂 |
|
Added a design doc for this change, as suggested: https://docs.google.com/document/d/1bXr-ULcdCxAkwftdlaJJFsibz3wjXsG4XxZnqFIn1oE/edit?usp=sharing Feedback welcome here or on the doc. |
What changes were proposed in this pull request?
Problem: When a replication or EC reconstruction command fails on a datanode (a transient network blip, a busy node, etc.), SCM is never told. The pending "ADD" op keeps counting against the inflight replication quota until its deadline expires, which defaults to 12 minutes (
hdds.scm.replication.event.timeout).That stale entry causes two problems:
During decommission this stalls progress badly: thousands of commands are issued, and even a small failure rate leaks enough stale entries to block the cluster for up to 12 minutes at a time.
Fix: Let the datanode report when a replication/reconstruction command finishes, so SCM clears the op immediately instead of waiting for the timeout. This re-introduces the command-status feedback path that was removed in HDDS-1368, with no Protobuf/wire change (the
CommandStatusmessage already hasFAILED,cmdId, andtype).EXECUTED/FAILEDforreplicateContainerCommandandreconstructECContainersCommand, exactly the waydeleteBlocksCommandalready does. Tasks that are skipped, ignored before running (deadline passed, not in service, stale SCM term), or dropped before queueing (queue full ->FAILED, duplicate ->EXECUTED) also report a terminal status, so noPENDINGentry is left behind in the datanode's command-status map. Tasks with no backing SCM command (e.g. reconcile) are unaffected.CommandStatusReportHandlerfires a newREPLICATION_STATUSevent for failed commands; a dedicatedReplicationStatusHandler(leader-only, mirroring the delete-block path) consumes it and clears the matching pending op via the newContainerReplicaPendingOps#onReplicationCommandFailed(cmdId), decrementing the inflight counter and freeing the scheduled size. Both problems above are resolved at once.Compatibility is graceful: an old datanode against a new SCM just never sends the report and falls back to the 12-minute timeout; a new datanode against an old SCM has the status ignored, as before.
Follow-up (separate HDDS Jira): every failure is currently rescheduled like a timeout. A later change can carry a failure reason in the existing
CommandStatus.msgfield (no wire change) so transient failures retry promptly while queue-full failures back off, avoiding a resend/drop loop under heavy load. This touches the ReplicationManager retry policy, so it is kept out of this surgical change.Flow
sequenceDiagram autonumber participant SCMcmd as SCM (ReplicationManager) participant DN as Datanode (StateContext, ReplicationSupervisor) participant Pub as CommandStatusReportPublisher participant H as CommandStatusReportHandler participant Ops as ContainerReplicaPendingOps Note over SCMcmd,Ops: replicateContainerCommand / reconstructECContainersCommand SCMcmd->>DN: send ADD command (cmdId) Note right of SCMcmd: pendingOps records ADD<br/>commandIdToContainer[cmdId] = container<br/>inflight quota consumed DN->>DN: addCmdStatus registers PENDING(cmdId) DN->>DN: TaskRunner.run() executes task alt task FAILED or early-return DN->>DN: updateCommandStatus(cmdId, markAsFailed) Pub->>H: heartbeat report {cmdId: FAILED} H->>H: filter FAILED replicate/reconstruct, fire REPLICATION_STATUS H->>Ops: ReplicationStatusHandler (leader only) calls onReplicationCommandFailed(cmdId) Ops->>Ops: remove ADD op, decrement inflight,<br/>release scheduled size, drop index entry Ops-->>SCMcmd: notifySubscribers(timedOut=true), op freed and rescheduled next RM cycle else task DONE or SKIPPED (success) DN->>DN: updateCommandStatus(cmdId, markAsExecuted) Pub->>H: heartbeat report {cmdId: EXECUTED} H->>H: EXECUTED not routed (debug log only) Note over Ops: op already cleared earlier by<br/>container report completeOp() end Note over Pub: PENDING entries stay in the map and are<br/>re-sent each interval until resolved, so every<br/>exit path must mark EXECUTED or FAILEDWhat is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-15327
How was this patch tested?
New and updated unit tests:
TestContainerReplicaPendingOps: a failed command removes the matching ADD op and decrements the inflight counter; an unknown command id is a no-op.TestCommandStatusReportHandler: a FAILED replication status firesREPLICATION_STATUS.TestStateContext: replicate/reconstruct commands register a PENDING status.TestReplicationSupervisor: a finished task reportsEXECUTEDon success andFAILEDon failure; skipped, deadline-passed, queue-full, and duplicate tasks all drain their status entry (no leftoverPENDING).TestReplicationStatusHandler: the leader clears the pending op on a failed status; a follower does not.Local CI-aligned checks all pass:
checkstyle.sh,rat.sh,author.sh.Generated-by: Claude Code (Claude Opus 4.8)