MSQ Compaction Not Working (Overlord Metadata Cache Syncing Forever) #19112

@GWphua

Affected Version

Master Branch
Between v36~v37: (HEAD at commit 8307d8a, PR#19030)

Encountered when trying out MSQ Compaction using MM-less ingestion.
The test datasource is ingested via Kafka supervisor.

Relevant Overlord settings:

    # Required for Supervisor compaction in v36
    druid_supervisor_compaction_enabled: true
    druid_supervisor_compaction_engine: "msq"
    druid_manager_segments_useIncrementalCache: "always"

    # Loadlist for MSQ extension + MM-less

Description

I have set up supervisor auto-compaction and created a supervisor to compact my datasource.

However, the supervisor keeps showing 'RUNNING' and no tasks are ever created. Looking at the Overlord logs, there seems to be a problem with the metadata cache: the leader has been trying to complete its first sync for an hour.

2026-03-09T09:38:12,533 INFO [qtp1741034833-54] org.apache.druid.metadata.segment.cache.HeapMemorySegmentMetadataCache - Wait complete. Cache is now in state[LEADER_FIRST_SYNC_PENDING].
...
2026-03-09T09:44:12,325 INFO [CompactionScheduler-0] org.apache.druid.metadata.segment.cache.HeapMemorySegmentMetadataCache - Wait complete. Cache is now in state[LEADER_FIRST_SYNC_PENDING].
... // Same logs till Overlord restart
2026-03-09T10:16:50,529 INFO [CompactionScheduler-0] org.apache.druid.metadata.segment.cache.HeapMemorySegmentMetadataCache - Wait complete. Cache is now in state[FOLLOWER].
... // Pending logs after restart again
2026-03-09T10:36:58,682 INFO [CompactionScheduler-0] org.apache.druid.metadata.segment.cache.HeapMemorySegmentMetadataCache - Wait complete. Cache is now in state[LEADER_FIRST_SYNC_PENDING]

I am testing this on a test cluster with only 1 datasource of 6065 segments, so the sync should not take an hour. I'm not sure whether a deadlock is preventing the sync from completing.
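To put a number on how long the cache stays pending, a minimal sketch like the following could be run over the Overlord log. It is only an illustration: the regex is written against the log lines quoted above, the helper names (`parse_cache_states`, `longest_pending_run`) are made up for this example, and `SAMPLE_LINES` contains only the four lines shown in this issue (the elided lines are not reconstructed).

```python
import re
from datetime import datetime, timedelta

# Pattern assumed from the Overlord log lines quoted in this issue.
LOG_PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}) \w+ \[[^\]]+\] "
    r"\S*HeapMemorySegmentMetadataCache - Wait complete\. "
    r"Cache is now in state\[(?P<state>\w+)\]"
)

# Only the lines quoted in the issue; the "..." gaps are left out.
SAMPLE_LINES = [
    "2026-03-09T09:38:12,533 INFO [qtp1741034833-54] org.apache.druid.metadata.segment.cache.HeapMemorySegmentMetadataCache - Wait complete. Cache is now in state[LEADER_FIRST_SYNC_PENDING].",
    "2026-03-09T09:44:12,325 INFO [CompactionScheduler-0] org.apache.druid.metadata.segment.cache.HeapMemorySegmentMetadataCache - Wait complete. Cache is now in state[LEADER_FIRST_SYNC_PENDING].",
    "2026-03-09T10:16:50,529 INFO [CompactionScheduler-0] org.apache.druid.metadata.segment.cache.HeapMemorySegmentMetadataCache - Wait complete. Cache is now in state[FOLLOWER].",
    "2026-03-09T10:36:58,682 INFO [CompactionScheduler-0] org.apache.druid.metadata.segment.cache.HeapMemorySegmentMetadataCache - Wait complete. Cache is now in state[LEADER_FIRST_SYNC_PENDING]",
]


def parse_cache_states(lines):
    """Return (timestamp, state) pairs for every cache-state log line."""
    events = []
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S,%f")
            events.append((ts, match.group("state")))
    return events


def longest_pending_run(events, pending="LEADER_FIRST_SYNC_PENDING"):
    """Longest span between the first and last consecutive pending log lines."""
    longest = timedelta(0)
    run_start = None
    for ts, state in events:
        if state == pending:
            if run_start is None:
                run_start = ts
            longest = max(longest, ts - run_start)
        else:
            # Any other state (e.g. FOLLOWER after a restart) ends the run.
            run_start = None
    return longest


if __name__ == "__main__":
    events = parse_cache_states(SAMPLE_LINES)
    print("Longest pending run in sample:", longest_pending_run(events))
```

On the full log, replacing `SAMPLE_LINES` with the lines of the Overlord log file would show how long each leadership term stayed in `LEADER_FIRST_SYNC_PENDING` before the restart.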

Coordinator

INFO [org.apache.druid.metadata.SqlSegmentsMetadataManager-Exec--0] org.apache.druid.metadata.SqlSegmentsMetadataManager - Polled and found [6,065] segments in the database in [152]ms.

Overlord

WARN [qtp1741034833-73] org.apache.druid.metadata.segment.SqlSegmentMetadataTransactionFactory - Starting read-write transaction for datasource[test_ds]. Reads will be done directly from metadata store since cache is not synced yet.
