[fix][broker] Handle synchronous schema lookup failures in replication by Denovo1998 · Pull Request #26108 · apache/pulsar

Denovo1998 · 2026-06-29T11:24:59Z

Motivation

Geo replication pauses and rewinds the cursor when a replicated message needs schema information that is not immediately available. However, if the local schema lookup throws synchronously before returning a future, the current entry is not cleaned up through the schema-fetch path.

This can leave the in-flight task permit incomplete and skip releasing the entry resources for the failed message.

Modifications

Catch synchronous failures from getSchemaInfo(msg) in GeoPersistentReplicator.
Release the current entry, retained payload buffer, and recycled message on that failure path.
Mark the current in-flight entry as completed before rewinding the cursor.
Add a regression test covering synchronous schema lookup failure cleanup.

Verifying this change

Make sure that the change passes the CI checks.
`gradlew :pulsar-broker:test --tests org.apache.pulsar.broker.service.persistent.GeoPersistentReplicatorTest -PtestRetryCount=0`

void-ptr974

Thanks for the fix. The cleanup path makes sense to me.

I left a few comments around exception handling, retry behavior, and test coverage.

void-ptr974 · 2026-06-29T14:18:33Z

+                CompletableFuture<SchemaInfo> schemaFuture;
+                try {
+                    schemaFuture = getSchemaInfo(msg);
+                } catch (Exception e) {


Would it be better to narrow this catch to the expected exception type? Since getSchemaInfo only declares ExecutionException, catching all Exceptions could accidentally turn unrelated bugs into schema retry loops. Another option might be to normalize getSchemaInfo to return a failed future and then reuse the existing schemaFuture.isCompletedExceptionally() path.

Denovo1998 · 2026-06-30

Good point. I narrowed the catch to ExecutionException and normalized that synchronous schema lookup failure into a failed future. Unexpected exceptions are no longer converted into schema retry loops, and the existing schema future cleanup path is reused.
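For readers following the thread, the normalization described here can be sketched roughly as below. This is a standalone illustration, not the code from the PR: `SchemaLookupSketch`, `SchemaLookup`, and `lookupOrFailedFuture` are hypothetical names, and the real `getSchemaInfo(msg)` is replaced by a lambda.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

/** Illustrative only: every name in this class is hypothetical, not a Pulsar API. */
final class SchemaLookupSketch {

    /** Stand-in for a lookup that may throw before it returns a future. */
    @FunctionalInterface
    interface SchemaLookup<T> {
        CompletableFuture<T> get() throws ExecutionException;
    }

    /**
     * Converts a synchronous ExecutionException into a failed future so the
     * caller can handle it on the same path as an asynchronous failure.
     * Any other exception type still propagates to the caller.
     */
    static <T> CompletableFuture<T> lookupOrFailedFuture(SchemaLookup<T> lookup) {
        try {
            return lookup.get();
        } catch (ExecutionException e) {
            return CompletableFuture.failedFuture(e);
        }
    }

    public static void main(String[] args) {
        CompletableFuture<String> failed = SchemaLookupSketch.<String>lookupOrFailedFuture(() -> {
            throw new ExecutionException(new IllegalStateException("schema cache unavailable"));
        });
        System.out.println(failed.isCompletedExceptionally()); // prints true

        CompletableFuture<String> ok =
                lookupOrFailedFuture(() -> CompletableFuture.completedFuture("schema-v1"));
        System.out.println(ok.join()); // prints schema-v1
    }
}
```

Because only ExecutionException is caught, an unrelated bug such as a NullPointerException inside the lookup would still surface instead of being retried as a schema failure.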

void-ptr974 · 2026-06-29T14:18:47Z

+                    headersAndPayload.release();
+                    msg.recycle();
+                    skipRemainingMessages = true;
+                    doRewindCursor(false);


This path can immediately rewind and re-read the same entry if the synchronous schema lookup failure persists. replicateEntries() returns false, so readEntriesComplete() may call readMoreEntries() right away.

One way to avoid a tight retry loop is to keep the replicator in the cursor-rewinding wait state and schedule doRewindCursor(true) after a small backoff instead of rewinding immediately.

Denovo1998 · 2026-06-30

Agreed. The exceptional schema future path now keeps the replicator in the cursor-rewinding wait state and schedules doRewindCursor(true) after MESSAGE_RATE_BACKOFF_MS. Successful schema fetches still rewind immediately.
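The difference between the two paths can be sketched as follows. This is a toy model under stated assumptions, not the replicator code: `RewindBackoffSketch`, `doRewind`, and the 200 ms delay are hypothetical, and the delay only stands in for whatever MESSAGE_RATE_BACKOFF_MS is in the broker.

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only: a toy model of "rewind now on success, rewind later on failure". */
final class RewindBackoffSketch {
    final AtomicInteger rewinds = new AtomicInteger();
    private final ScheduledExecutorService scheduler;
    private final long backoffMs; // assumed value, not the broker's constant

    RewindBackoffSketch(ScheduledExecutorService scheduler, long backoffMs) {
        this.scheduler = scheduler;
        this.backoffMs = backoffMs;
    }

    void doRewind() {
        rewinds.incrementAndGet();
    }

    /** Rewinds right away when the schema fetch succeeded, after a delay when it failed. */
    void onSchemaFuture(CompletableFuture<?> schemaFuture) {
        schemaFuture.whenComplete((schema, error) -> {
            if (error == null) {
                doRewind();
            } else {
                scheduler.schedule(this::doRewind, backoffMs, TimeUnit.MILLISECONDS);
            }
        });
    }

    /** Returns {rewinds seen immediately, rewinds seen after the backoff elapsed}. */
    static int[] demo(CompletableFuture<?> schemaFuture) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        RewindBackoffSketch replicator = new RewindBackoffSketch(scheduler, 200);
        replicator.onSchemaFuture(schemaFuture);
        int immediately = replicator.rewinds.get();
        scheduler.shutdown(); // delayed tasks that are already scheduled still run by default
        try {
            scheduler.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] {immediately, replicator.rewinds.get()};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo(CompletableFuture.failedFuture(new RuntimeException("lookup failed")))));
        System.out.println(Arrays.toString(demo(CompletableFuture.completedFuture("schema-v1"))));
    }
}
```

In this model a persistent failure costs at most one rewind per backoff interval, which is the property the review comment asked for.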

void-ptr974 · 2026-06-29T14:19:04Z

+            return null;
+        }).when(entry).release();
+
+        List<Entry> entries = List.of(entry);


This test only covers the current entry cleanup. It does not verify the batch behavior after skipRemainingMessages is set.

Please extend it to use a multi-entry batch and verify that the remaining entries are skipped/released, completedEntries reaches the full batch size, and the cursor is rewound.

Denovo1998 · 2026-06-30

I extended the regression test to use a multi-entry batch. It now verifies that the remaining entry is skipped and released, that completedEntries reaches the full batch size, and that cursor rewind and read retry are triggered only by the scheduled backoff task.
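The batch behavior being tested can be modeled in plain Java as below. This is only a sketch of the invariants named in the review (remaining entries released, completedEntries equal to the batch size, rewind requested); `BatchSkipSketch` and `FakeEntry` are hypothetical, and the actual test in the PR uses the broker's own Entry type and mocks.

```java
import java.util.List;

/** Illustrative only: a toy model of skipping the rest of a batch after a schema failure. */
final class BatchSkipSketch {

    /** Hypothetical stand-in for a ledger entry that records whether it was released. */
    static final class FakeEntry {
        final boolean schemaLookupFails;
        boolean released;

        FakeEntry(boolean schemaLookupFails) {
            this.schemaLookupFails = schemaLookupFails;
        }

        void release() {
            released = true;
        }
    }

    int completedEntries;
    int replicated;
    boolean rewindRequested;

    /** Returns false when the batch was cut short and the cursor must be rewound. */
    boolean replicateEntries(List<FakeEntry> entries) {
        boolean skipRemainingMessages = false;
        for (FakeEntry entry : entries) {
            if (!skipRemainingMessages && entry.schemaLookupFails) {
                // First failure: stop replicating and ask for a rewind.
                skipRemainingMessages = true;
                rewindRequested = true;
            } else if (!skipRemainingMessages) {
                replicated++;
            }
            // Every entry is released and counted, whether replicated or skipped.
            entry.release();
            completedEntries++;
        }
        return !rewindRequested;
    }

    public static void main(String[] args) {
        List<FakeEntry> batch = List.of(new FakeEntry(false), new FakeEntry(true), new FakeEntry(false));
        BatchSkipSketch replicator = new BatchSkipSketch();
        System.out.println(replicator.replicateEntries(batch)); // prints false
        System.out.println(replicator.completedEntries);        // prints 3
        System.out.println(replicator.replicated);              // prints 1
    }
}
```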

void-ptr974 · 2026-07-01T14:09:43Z

Thanks for the update. The main concerns look addressed.

One small follow-up: after this path calls beforeTerminateOrCursorRewinding(...), replicateEntries() returns false, so readEntriesComplete() will still call readMoreEntries(). That call should not start a read while waitForCursorRewindingRefCnf > 0, but it may schedule another delayed retry. Could we skip the outer readMoreEntries() when the replicator is already waiting for cursor rewind, and let the scheduled doRewindCursor(true) resume reads instead?

Denovo1998 · 2026-07-02T12:38:48Z

I added a guard in readEntriesComplete() to skip the outer readMoreEntries() while the replicator is already waiting for cursor rewind. This leaves the scheduled doRewindCursor(true) as the path that resumes reads. I also updated the regression test to exercise readEntriesComplete() end-to-end and verify no extra read is triggered before the scheduled rewind runs.
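A minimal sketch of the guard, assuming a single wait counter and a read method that would otherwise schedule a delayed retry while waiting. `ReadGuardSketch` and all of its members are hypothetical stand-ins for the replicator state discussed above, not the broker code.

```java
/** Illustrative only: a single-threaded toy model of the readEntriesComplete() guard. */
final class ReadGuardSketch {
    int waitingForRewind;        // hypothetical stand-in for the rewind wait counter
    int readsStarted;
    int delayedRetriesScheduled;

    /** While waiting for a rewind, no read starts, but a delayed retry would be scheduled. */
    void readMoreEntries() {
        if (waitingForRewind > 0) {
            delayedRetriesScheduled++;
            return;
        }
        readsStarted++;
    }

    /** schemaLookupFailed models the path where replicateEntries() returns false. */
    void readEntriesComplete(boolean schemaLookupFailed) {
        if (schemaLookupFailed) {
            waitingForRewind++; // enter the cursor-rewinding wait state
        }
        if (waitingForRewind > 0) {
            return; // guard: leave resuming reads to the scheduled rewind
        }
        readMoreEntries();
    }

    /** Models the scheduled rewind task that ends the wait state and resumes reads. */
    void scheduledRewind() {
        waitingForRewind--;
        readMoreEntries();
    }

    public static void main(String[] args) {
        ReadGuardSketch replicator = new ReadGuardSketch();
        replicator.readEntriesComplete(true);
        System.out.println(replicator.readsStarted);            // prints 0
        System.out.println(replicator.delayedRetriesScheduled); // prints 0
        replicator.scheduledRewind();
        System.out.println(replicator.readsStarted);            // prints 1
    }
}
```

Without the early return in readEntriesComplete(), the same run would schedule one extra delayed retry before the rewind task fires, which is the redundant work the follow-up comment pointed out.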

void-ptr974

LGTM. Thanks for addressing the comments.

[fix][broker] Handle synchronous schema lookup failures in replication

9283a8e

void-ptr974 reviewed Jun 29, 2026

add test for schema lookup failures in replication

ef55f81

handle synchronous schema lookup failures in replication

4a549ae

void-ptr974 approved these changes Jul 2, 2026

Denovo1998 wants to merge 3 commits into apache:master from Denovo1998:handle_synchronous_schema_lookup_failures_in_replication
