[Bug] BE may return empty publish success when PUBLISH_VERSION arrives before local close finishes #62057

@xiaobijuan2026

### Search before asking

- [x] I had searched in the issues and found no similar issues.

### Version

4.0.4 (observed on the Doris 4.0 branch)

### What's Wrong?

In Unique Key MoW / partial update workloads, a slow BE may receive `PUBLISH_VERSION` before its local `TabletsChannel::close()` fully finishes.

At that time, the BE may still be in local close stages such as:
- `wait_flush`
- `build_rowset`
- delete bitmap related work
- `commit_txn`

As a result, in `EnginePublishVersionTask`, the call

```cpp
txn_manager()->get_txn_related_tablets(transaction_id, partition_id, &tablet_related_rs)
```

may return an empty result.

The problematic part is that this path may still end up returning a successful publish task result:

- `taskStatus = OK`
- `succ_tablets = 0`
- `error_tablet_ids = 0`

So FE treats the publish task as finished, but the replica actually neither published any tablet nor explicitly reported itself as a failed replica.

This looks like a silent "empty publish success".
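
To make the failure mode concrete, here is a minimal, self-contained model of that path. It is not Doris source code: `PublishResult`, `publish_version`, and the map type are hypothetical stand-ins. The only behavior taken from this report is that an empty lookup result can end in an OK status with both lists empty.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; not the real Doris types.
using TabletId = int64_t;
using RowsetName = std::string;

struct PublishResult {
    bool ok = true;                       // models taskStatus = OK
    std::vector<TabletId> succ_tablets;   // tablets published on this replica
    std::vector<TabletId> error_tablets;  // tablets reported as failed
};

// Models a publish loop that only iterates over what the local txn manager
// already knows. If close() has not reached commit_txn yet, the map is empty
// and the loop body never runs.
PublishResult publish_version(const std::map<TabletId, RowsetName>& tablet_related_rs) {
    PublishResult result;
    for (const auto& [tablet_id, rowset] : tablet_related_rs) {
        if (rowset.empty()) {
            result.ok = false;
            result.error_tablets.push_back(tablet_id);
        } else {
            result.succ_tablets.push_back(tablet_id);
        }
    }
    return result;  // empty input: ok stays true, both lists stay empty
}
```

Calling `publish_version` with an empty map yields a result that looks the same as a task that had nothing to do, which is why FE cannot tell the two cases apart.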

### What You Expected?

If a BE receives `PUBLISH_VERSION` but still has no local publishable txn state for the current transaction/partition, it should not silently look like a successful publish with empty results.

Expected behavior should be one of:

- recover the local committed rowset and continue publish, or
- explicitly report affected tablets as failed replicas, so FE can continue quorum-based handling correctly.

But it should not silently return success with both empty `succ_tablets` and empty `error_tablet_ids`.
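
A rough sketch of the second option (explicitly reporting failed replicas) could look like the following. Everything here is an assumption for illustration: `publish_version_checked`, `PublishResult`, and especially the `expected_tablets` input are hypothetical, and this report does not say how a real BE would obtain the set of tablets it is expected to publish.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

using TabletId = int64_t;

// Hypothetical result type; not the real Doris structure.
struct PublishResult {
    bool ok = true;
    std::vector<TabletId> succ_tablets;
    std::vector<TabletId> error_tablets;
};

// expected_tablets: hypothetical input, the tablets this replica hosts for
// the partition being published.
// tablet_related_rs: tablets for which local txn state already exists.
PublishResult publish_version_checked(
        const std::set<TabletId>& expected_tablets,
        const std::map<TabletId, std::string>& tablet_related_rs) {
    PublishResult result;
    for (TabletId tablet_id : expected_tablets) {
        auto it = tablet_related_rs.find(tablet_id);
        if (it == tablet_related_rs.end()) {
            // No local publishable state yet: report the tablet as failed
            // instead of silently skipping it.
            result.error_tablets.push_back(tablet_id);
        } else {
            result.succ_tablets.push_back(tablet_id);
        }
    }
    result.ok = result.error_tablets.empty();
    return result;
}
```

The design point is that the loop is driven by what the replica is expected to publish rather than by what the local txn manager happens to contain, so an empty lookup produces a non-empty error list.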

### How to Reproduce?

This issue is easier to trigger in slow replica scenarios, especially with Unique Key MoW fixed partial update where local delete bitmap work is heavy.

A typical sequence is:

1. enough replicas succeed so FE can commit the transaction
2. FE marks txn as `COMMITTED`
3. FE sends `PUBLISH_VERSION`
4. a slow BE is still finishing local `close()`
5. that BE sees empty `txn_related_tablets`
6. publish task may still report `OK` with no success tablets and no error tablets

### Logs

The issue is usually accompanied by logs showing:

- FE-side publish task submission happened before the slow BE finished local close
- BE-side publish found no local txn-related tablets
- much later, the same BE finally logs closed tablets_channel

For example, local close may finish much later than publish submission on the slow BE, which indicates FE transaction-level `COMMITTED` and BE local publish-input-ready are not the same timing point.

### Why This Happens?

My current understanding is:

- FE transaction-level `COMMITTED` is based on quorum success per tablet
- but BE local publish readiness depends on whether local `close()` has already completed and `commit_txn()` has populated `txn_manager`
- these two readiness points are not identical

So it is possible that FE legitimately sends `PUBLISH_VERSION`, while some slow BE still has no local publishable txn state.
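
The gap between the two readiness points can be modeled with two separate predicates. This is a simplified sketch under stated assumptions: `ReplicaState`, `quorum_reached`, and `locally_publishable` are hypothetical names, and the quorum rule is assumed to be a simple majority, which may not match the exact FE logic.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-replica state for one tablet; names are illustrative.
struct ReplicaState {
    bool write_acked = false;        // replica counted as successful by FE
    bool local_commit_done = false;  // close() finished and commit_txn() ran
};

// FE-side view (assumed simple majority): enough replicas acknowledged.
bool quorum_reached(const std::vector<ReplicaState>& replicas) {
    std::size_t acked = 0;
    for (const auto& r : replicas) {
        if (r.write_acked) ++acked;
    }
    return acked >= replicas.size() / 2 + 1;
}

// BE-side view: this particular replica has publishable txn state.
bool locally_publishable(const ReplicaState& replica) {
    return replica.local_commit_done;
}
```

With three replicas where two have acknowledged and one is still closing, `quorum_reached` is true while `locally_publishable` is false for the slow replica, which is the window in which `PUBLISH_VERSION` can arrive too early.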

### Anything Else?

_No response_

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
