[Bug](load) INSERT SELECT data invisible after quorum success with cancelled node channel #61916

@xiaobijuan2026

Description

Problem

When INSERT INTO ... SELECT ... writes to multiple replicas, if one node channel is slow and times out during close_wait, it gets cancelled but NOT marked as failed. This causes:

  1. close_wait returns OK even though a node was cancelled
  2. FE is unaware of the failure, commits the transaction
  3. PUBLISH_VERSION task is sent to ALL nodes including the cancelled one
  4. Cancelled node can't find the rowset → publish fails
  5. Data stays COMMITTED but not VISIBLE for a long time (30+ minutes until retry)

Root Cause

In IndexChannel::close_wait() (vtablet_writer.cpp), when unfinished node channels are cancelled due to timeout, mark_as_failed() is not called. FE receives no error tablet info for the cancelled replicas.

Fix

After cancelling unfinished node channels in close_wait timeout:

  1. Call mark_as_failed() to record failed tablets
  2. Call check_intolerable_failure() - if failures exceed tolerance, fail the entire load
  3. Call set_error_tablet_in_state() to propagate error info to FE
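
The three steps above can be sketched as a small, self-contained model. This is not the actual vtablet_writer.cpp code: `NodeChannel`, `CloseResult`, `close_wait_after_timeout()`, and all signatures here are hypothetical simplifications. The majority-quorum tolerance rule is an assumption, chosen because it matches the expected behavior for 3 replicas.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-in for a node channel; not the real Doris class.
struct NodeChannel {
    int64_t node_id;
    std::vector<int64_t> tablet_ids;  // tablets with a replica on this node
    bool finished;                    // true if close completed before the deadline
    bool cancelled;

    NodeChannel(int64_t id, std::vector<int64_t> tablets, bool done)
            : node_id(id), tablet_ids(std::move(tablets)), finished(done), cancelled(false) {}
};

// What this sketch reports back to FE: a status plus failed replicas per tablet.
struct CloseResult {
    bool ok;
    std::map<int64_t, std::set<int64_t>> error_tablets;  // tablet_id -> failed node ids
};

class IndexChannel {
public:
    IndexChannel(int num_replicas, std::vector<NodeChannel> channels)
            : num_replicas_(num_replicas), channels_(std::move(channels)) {
        assert(num_replicas_ > 0);
    }

    // Models the path taken once the close_wait deadline has passed.
    CloseResult close_wait_after_timeout() {
        for (NodeChannel& ch : channels_) {
            if (ch.finished) continue;
            ch.cancelled = true;
            mark_as_failed(ch);  // step 1: the call the issue reports as missing
        }
        CloseResult result;
        result.ok = !check_intolerable_failure();        // step 2
        result.error_tablets = failed_nodes_by_tablet_;  // step 3: stands in for set_error_tablet_in_state()
        return result;
    }

private:
    // Records every tablet replica hosted on the cancelled node as failed.
    void mark_as_failed(const NodeChannel& ch) {
        for (int64_t tablet_id : ch.tablet_ids) {
            failed_nodes_by_tablet_[tablet_id].insert(ch.node_id);
        }
    }

    // Assumption: a tablet is still fine while a majority of its replicas succeeded.
    bool check_intolerable_failure() const {
        const int max_failures = num_replicas_ - (num_replicas_ / 2 + 1);
        for (const auto& entry : failed_nodes_by_tablet_) {
            if (static_cast<int>(entry.second.size()) > max_failures) return true;
        }
        return false;
    }

    int num_replicas_;
    std::vector<NodeChannel> channels_;
    std::map<int64_t, std::set<int64_t>> failed_nodes_by_tablet_;
};
```

In this model, one unfinished channel out of three yields `ok == true` with that node listed under `error_tablets`, while two unfinished channels yield `ok == false`. Without the `mark_as_failed()` call, `error_tablets` would stay empty and the caller could not tell a cancelled node from a healthy one.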

This allows FE to skip failed replicas during PUBLISH_VERSION, so that:

  • Data becomes visible immediately on the healthy replicas
  • The background TabletScheduler auto-repairs the failed replica
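
To illustrate why the error tablet info matters on the FE side, here is a rough model of the publish decision. The real FE logic is not shown in this issue, so `PublishPlan`, `plan_publish()`, and the majority-quorum visibility rule are all hypothetical and only meant to show the effect of knowing (or not knowing) which replicas failed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical result of deciding where PUBLISH_VERSION tasks go.
struct PublishPlan {
    std::vector<int64_t> publish_targets;  // nodes that receive PUBLISH_VERSION
    std::vector<int64_t> repair_targets;   // replicas left for background repair
    bool can_become_visible = false;       // enough healthy replicas to publish
};

// Splits a tablet's replicas into publish targets and repair targets,
// based on the error nodes reported by the load.
PublishPlan plan_publish(const std::vector<int64_t>& replica_nodes,
                         const std::set<int64_t>& error_nodes) {
    assert(!replica_nodes.empty());
    PublishPlan plan;
    for (int64_t node : replica_nodes) {
        if (error_nodes.count(node) > 0) {
            plan.repair_targets.push_back(node);   // skipped during publish
        } else {
            plan.publish_targets.push_back(node);
        }
    }
    // Assumption: the version can become visible once a majority can publish.
    const std::size_t quorum = replica_nodes.size() / 2 + 1;
    plan.can_become_visible = plan.publish_targets.size() >= quorum;
    return plan;
}
```

Calling `plan_publish({1, 2, 3}, {3})` publishes to nodes 1 and 2 only and leaves node 3 for repair. Calling it with an empty error set, which corresponds to the behavior before the fix, sends the task to all three nodes, including the cancelled one that has no rowset.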

Behavior after fix

| Scenario | Replicas | Result |
|---|---|---|
| 3 replicas, 1 timeout | 2/3 success | ✅ Publish succeeds, failed replica auto-repairs |
| 3 replicas, 2 timeout | 1/3 success | ❌ Load fails, user gets error |
