HDDS-15327. Proactively clear failed replication commands in SCM by chihsuan · Pull Request #10540 · apache/ozone

chihsuan · 2026-06-18T11:06:18Z

What changes were proposed in this pull request?

Problem: When a replication or EC reconstruction command fails on a datanode (a transient network blip, a busy node, etc.), SCM is never told. The pending "ADD" op keeps counting against the inflight replication quota until its deadline expires, which defaults to 12 minutes (hdds.scm.replication.event.timeout).

That stale entry causes two problems:

- The cluster-wide inflight count fills up with dead entries, so SCM stops scheduling new replication even while the datanodes sit idle.
- The affected container is not retried, because the health check still thinks an ADD is in flight for that replica.

During decommission this stalls progress badly: thousands of commands are issued, and even a small failure rate leaks enough stale entries to block the cluster for up to 12 minutes at a time.
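
To make the stall concrete, here is a toy model of a quota whose entries leave either by timeout or by an explicit failure report. It is only a sketch: `InflightQuotaSketch` and all of its methods are hypothetical names for illustration, not Ozone classes, and the real accounting in SCM is more involved.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of an inflight replication quota; NOT the Ozone implementation. */
class InflightQuotaSketch {
  private final int limit;
  private final long timeoutMillis;
  // cmdId -> deadline in ms; an entry counts against the quota until removed.
  private final Map<Long, Long> pendingDeadlines = new HashMap<>();

  InflightQuotaSketch(int limit, long timeoutMillis) {
    this.limit = limit;
    this.timeoutMillis = timeoutMillis;
  }

  /** Returns true if the command was scheduled, false if the quota is full. */
  boolean trySchedule(long cmdId, long nowMillis) {
    if (pendingDeadlines.size() >= limit) {
      return false;
    }
    pendingDeadlines.put(cmdId, nowMillis + timeoutMillis);
    return true;
  }

  /** Timeout-only cleanup: drops entries whose deadline has passed. */
  void expire(long nowMillis) {
    pendingDeadlines.values().removeIf(deadline -> deadline <= nowMillis);
  }

  /** Proactive cleanup: drops the entry as soon as a failure is reported. */
  boolean onCommandFailed(long cmdId) {
    return pendingDeadlines.remove(cmdId) != null;
  }

  int inflight() {
    return pendingDeadlines.size();
  }

  public static void main(String[] args) {
    long twelveMinutes = 12 * 60 * 1000L;
    InflightQuotaSketch quota = new InflightQuotaSketch(2, twelveMinutes);
    quota.trySchedule(1L, 0L);
    quota.trySchedule(2L, 0L);
    // Both commands fail on the datanode at once, but the model is not told,
    // so one minute later the quota is still full.
    System.out.println(quota.trySchedule(3L, 60_000L)); // prints false
    // With a failure report, the slot is freed without waiting for the deadline.
    quota.onCommandFailed(1L);
    System.out.println(quota.trySchedule(3L, 60_000L)); // prints true
  }
}
```

In this model, a failed command that is never reported holds its slot for the full timeout, which is the behaviour the PR sets out to remove.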

Fix: Let the datanode report when a replication/reconstruction command finishes, so SCM clears the op immediately instead of waiting for the timeout. This re-introduces the command-status feedback path that was removed in HDDS-1368, with no Protobuf/wire change (the CommandStatus message already has FAILED, cmdId, and type).

- Datanode: reports EXECUTED/FAILED for replicateContainerCommand and reconstructECContainersCommand, exactly the way deleteBlocksCommand already does. Tasks that are skipped, ignored before running (deadline passed, not in service, stale SCM term), or dropped before queueing (queue full -> FAILED, duplicate -> EXECUTED) also report a terminal status, so no PENDING entry is left behind in the datanode's command-status map. Tasks with no backing SCM command (e.g. reconcile) are unaffected.
- SCM: CommandStatusReportHandler fires a new REPLICATION_STATUS event for failed commands; a dedicated ReplicationStatusHandler (leader-only, mirroring the delete-block path) consumes it and clears the matching pending op via the new ContainerReplicaPendingOps#onReplicationCommandFailed(cmdId), decrementing the inflight counter and freeing the scheduled size. Both problems above are resolved at once.
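
The SCM-side clear described above can be sketched roughly as follows. This is a simplified stand-in, not the patch itself: `PendingOpsSketch`, `AddOp`, `scheduleAdd`, `handleStatus`, and `toReschedule` are hypothetical names, and only `onReplicationCommandFailed(cmdId)` and the `commandIdToContainer` index are taken from the description. The point it illustrates is that a status report carries only a cmdId, so an index from cmdId to container is needed to find and remove the right op.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for SCM's pending-op bookkeeping; NOT the Ozone class. */
class PendingOpsSketch {
  /** One pending ADD op. Field names are illustrative. */
  static final class AddOp {
    final long containerId;
    final long cmdId;
    final long sizeBytes;

    AddOp(long containerId, long cmdId, long sizeBytes) {
      this.containerId = containerId;
      this.cmdId = cmdId;
      this.sizeBytes = sizeBytes;
    }
  }

  private final Map<Long, List<AddOp>> opsByContainer = new HashMap<>();
  // Index so that a status report, which only carries cmdId, can find its op.
  private final Map<Long, Long> commandIdToContainer = new HashMap<>();
  private int inflightAdds;
  private long scheduledBytes;
  // Ops handed back for rescheduling (stands in for subscriber notification).
  final List<AddOp> toReschedule = new ArrayList<>();

  void scheduleAdd(long containerId, long cmdId, long sizeBytes) {
    opsByContainer.computeIfAbsent(containerId, k -> new ArrayList<>())
        .add(new AddOp(containerId, cmdId, sizeBytes));
    commandIdToContainer.put(cmdId, containerId);
    inflightAdds++;
    scheduledBytes += sizeBytes;
  }

  /** Clears the op for a failed command; an unknown cmdId is a no-op. */
  boolean onReplicationCommandFailed(long cmdId) {
    Long containerId = commandIdToContainer.remove(cmdId); // drop index entry
    if (containerId == null) {
      return false;
    }
    List<AddOp> ops = opsByContainer.get(containerId);
    if (ops == null) {
      return false;
    }
    Iterator<AddOp> it = ops.iterator();
    while (it.hasNext()) {
      AddOp op = it.next();
      if (op.cmdId == cmdId) {
        it.remove();                    // remove the ADD op
        inflightAdds--;                 // free the inflight slot
        scheduledBytes -= op.sizeBytes; // release the scheduled size
        toReschedule.add(op);           // let the next cycle retry it
        return true;
      }
    }
    return false;
  }

  /** Leader-only guard in front of the clear; followers ignore the report. */
  boolean handleStatus(boolean isLeader, String status, long cmdId) {
    return isLeader && "FAILED".equals(status)
        && onReplicationCommandFailed(cmdId);
  }

  int inflightAdds() {
    return inflightAdds;
  }

  long scheduledBytes() {
    return scheduledBytes;
  }
}
```

Keeping the leader check outside `onReplicationCommandFailed` follows the description's split between the handler and the pending-ops class; whether the real code draws the line in the same place is an assumption.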

Compatibility is graceful: an old datanode against a new SCM just never sends the report and falls back to the 12-minute timeout; a new datanode against an old SCM has the status ignored, as before.

Follow-up (separate HDDS Jira): every failure is currently rescheduled like a timeout. A later change can carry a failure reason in the existing CommandStatus.msg field (no wire change) so transient failures retry promptly while queue-full failures back off, avoiding a resend/drop loop under heavy load. This touches the ReplicationManager retry policy, so it is kept out of this surgical change.
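
As a rough illustration of what that follow-up policy could look like, the sketch below maps a failure reason to a retry delay. Everything in it is hypothetical: the `QUEUE_FULL` reason string, the `RetryPolicySketch` class, and the 30-second base with a 10-minute cap are placeholders chosen for the example, not values from this PR or from Ozone.

```java
import java.time.Duration;

/** Hypothetical retry policy keyed on a failure reason; illustration only. */
class RetryPolicySketch {
  // Assumed reason string that a datanode might place in CommandStatus.msg.
  static final String QUEUE_FULL = "QUEUE_FULL";

  /**
   * Queue-full failures back off exponentially (30s, 60s, ... capped at 10 min);
   * any other failure is treated as transient and retried without delay.
   */
  static Duration retryDelay(String failureMsg, int attempt) {
    if (QUEUE_FULL.equals(failureMsg)) {
      int shift = Math.max(0, Math.min(attempt, 5));
      long seconds = Math.min(600L, 30L << shift);
      return Duration.ofSeconds(seconds);
    }
    return Duration.ZERO;
  }

  public static void main(String[] args) {
    System.out.println(retryDelay(QUEUE_FULL, 0).getSeconds()); // prints 30
    System.out.println(retryDelay(QUEUE_FULL, 5).getSeconds()); // prints 600
    System.out.println(retryDelay("network blip", 3).isZero()); // prints true
  }
}
```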

Flow

sequenceDiagram
    autonumber
    participant SCMcmd as SCM (ReplicationManager)
    participant DN as Datanode (StateContext, ReplicationSupervisor)
    participant Pub as CommandStatusReportPublisher
    participant H as CommandStatusReportHandler
    participant Ops as ContainerReplicaPendingOps

    Note over SCMcmd,Ops: replicateContainerCommand / reconstructECContainersCommand

    SCMcmd->>DN: send ADD command (cmdId)
    Note right of SCMcmd: pendingOps records ADD<br/>commandIdToContainer[cmdId] = container<br/>inflight quota consumed
    DN->>DN: addCmdStatus registers PENDING(cmdId)

    DN->>DN: TaskRunner.run() executes task

    alt task FAILED or early-return
        DN->>DN: updateCommandStatus(cmdId, markAsFailed)
        Pub->>H: heartbeat report {cmdId: FAILED}
        H->>H: filter FAILED replicate/reconstruct, fire REPLICATION_STATUS
        H->>Ops: ReplicationStatusHandler (leader only) calls onReplicationCommandFailed(cmdId)
        Ops->>Ops: remove ADD op, decrement inflight,<br/>release scheduled size, drop index entry
        Ops-->>SCMcmd: notifySubscribers(timedOut=true), op freed and rescheduled next RM cycle
    else task DONE or SKIPPED (success)
        DN->>DN: updateCommandStatus(cmdId, markAsExecuted)
        Pub->>H: heartbeat report {cmdId: EXECUTED}
        H->>H: EXECUTED not routed (debug log only)
        Note over Ops: op already cleared earlier by<br/>container report completeOp()
    end

    Note over Pub: PENDING entries stay in the map and are<br/>re-sent each interval until resolved, so every<br/>exit path must mark EXECUTED or FAILED
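
The last note in the diagram is the invariant that matters on the datanode: once a command is registered as PENDING, every exit path has to end in EXECUTED or FAILED. A minimal sketch of that shape is below. The class and method names are hypothetical, and mapping a passed deadline to FAILED is my reading of the "early-return" branch rather than something the description states outright.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy command-status map showing that every exit path is terminal. */
class TerminalStatusSketch {
  enum Status { PENDING, EXECUTED, FAILED }

  private final Map<Long, Status> cmdStatus = new HashMap<>();

  /** Mirrors the "register PENDING when the command arrives" step. */
  void register(long cmdId) {
    cmdStatus.put(cmdId, Status.PENDING);
  }

  /** Decides a terminal status on every path, then records it exactly once. */
  Status process(long cmdId, boolean queueFull, boolean duplicate,
      boolean deadlinePassed, Runnable work) {
    Status result;
    if (queueFull) {
      result = Status.FAILED;      // dropped before queueing
    } else if (duplicate) {
      result = Status.EXECUTED;    // an equivalent task is already queued
    } else if (deadlinePassed) {
      result = Status.FAILED;      // assumed mapping for the early return
    } else {
      try {
        work.run();
        result = Status.EXECUTED;
      } catch (RuntimeException e) {
        result = Status.FAILED;
      }
    }
    cmdStatus.put(cmdId, result);
    return result;
  }

  boolean hasPending() {
    return cmdStatus.containsValue(Status.PENDING);
  }

  public static void main(String[] args) {
    TerminalStatusSketch dn = new TerminalStatusSketch();
    dn.register(1L);
    dn.register(2L);
    dn.register(3L);
    System.out.println(dn.process(1L, false, false, false, () -> { })); // prints EXECUTED
    System.out.println(dn.process(2L, true, false, false, () -> { }));  // prints FAILED
    System.out.println(dn.process(3L, false, false, false,
        () -> { throw new IllegalStateException("copy failed"); }));    // prints FAILED
    System.out.println(dn.hasPending());                                 // prints false
  }
}
```

Computing the result in one place and writing it once at the end is one way to make a forgotten exit path hard to introduce; the actual patch may instead mark the status at each return site.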

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15327

How was this patch tested?

New and updated unit tests:

- TestContainerReplicaPendingOps: a failed command removes the matching ADD op and decrements the inflight counter; an unknown command id is a no-op.
- TestCommandStatusReportHandler: a FAILED replication status fires REPLICATION_STATUS.
- TestStateContext: replicate/reconstruct commands register a PENDING status.
- TestReplicationSupervisor: a finished task reports EXECUTED on success and FAILED on failure; skipped, deadline-passed, queue-full, and duplicate tasks all drain their status entry (no leftover PENDING).
- TestReplicationStatusHandler: the leader clears the pending op on a failed status; a follower does not.

Local CI-aligned checks all pass: checkstyle.sh, rat.sh, author.sh.

Generated-by: Claude Code (Claude Opus 4.8)


chungen0126 · 2026-06-20T03:07:10Z

Thanks @chihsuan for your PR and the proposal.

However, the primary scope of this specific ticket is strictly to adjust the default value of hdds.scm.replication.event.timeout. Since the internal logic of the ReplicationManager is quite complex and a change like this involves deeper architectural considerations, we shouldn't mix it into this ticket.

Modifying the ReplicationManager is complex and sensitive, so we'd love to see a formal design doc under a new ticket where this can be discussed in depth.

chihsuan · 2026-06-20T03:35:00Z

Thanks @chungen0126 for the review!

My reading was that the config change is only the workaround mentioned in the ticket, and the actual goal of HDDS-15327 is to fix the root cause. That's why this PR goes after the proactive-clear path.

cc @jojochuang since you filed this, could you confirm the intended scope of HDDS-15327? 🙏 Is it the proactive-clear fix this PR implements, or just bumping the default event.timeout as a workaround, with the real fix tracked under a new design-doc ticket? Happy either way.

Just for context, this PR reuses the exact command-status feedback path that deleteBlocksCommand already uses, with no Protobuf/wire change, and degrades gracefully against mixed old/new SCM and DN.

That said, I agree this touches a sensitive, larger area. Happy to write up a more detailed doc and discuss the scope further in the Jira so we align before landing anything. Thanks! 🙂

chihsuan · 2026-06-21T05:34:50Z

Added a design doc for this change, as suggested: https://docs.google.com/document/d/1bXr-ULcdCxAkwftdlaJJFsibz3wjXsG4XxZnqFIn1oE/edit?usp=sharing

Feedback welcome here or on the doc.

chihsuan added 10 commits June 17, 2026 21:11

HDDS-15327. Clear pending replication op when datanode reports command failure

c745ed7

HDDS-15327. Drop command index entry for failed delete ops

99df4da

HDDS-15327. Route failed replication command status to ContainerReplicaPendingOps

ffe96f7

HDDS-15327. Datanode reports replication command success and failure to SCM

48aeadb

HDDS-15327. Fix checkstyle and test style for failed replication command handling

cd8eaac

HDDS-15327. Clarify failed-command Javadoc and executed-status reporting comment

bb83cf0

HDDS-15327. Drain command status on skipped and early-return replication paths

4611edb

HDDS-15327. Collapse over-wrapped assertions in replication status tests

fa2abce

HDDS-15327. Extract ReplicationStatusHandler with leader guard

5b87c59

Follower SCMs no longer clear pending ops on stray command-status reports, matching the existing leader guard in DeletedBlockLogImpl.

HDDS-15327. Drain command status when replication task is dropped before running

af5d8a2

chihsuan marked this pull request as ready for review June 18, 2026 12:34

chungen0126 requested a review from jojochuang June 19, 2026 13:27

chihsuan wants to merge 10 commits into apache:master from chihsuan:HDDS-15327
