Parallel indexing with deepstore intermediary data does not clean up shuffle-data in deep storage #19163

@FrankChen021

Description

Background

Druid supports druid.processing.intermediaryData.storage.type=deepstore for native parallel indexing, introduced to address reliability issues such as rolling updates / worker loss during shuffle.

In this mode, phase-1 tasks write shuffle/intermediary data under the shuffle-data prefix in deep storage.

Problem

These deep-storage shuffle artifacts are not cleaned up after the parallel indexing task completes.

The current local intermediary data path has cleanup behavior, but the deepstore path does not appear to have an equivalent end-to-end cleanup flow. In particular, cleanup cannot rely on the producing worker/task process still being alive, since the task may exit before deletion is triggered.

As a result, completed parallel indexing tasks can leave behind many residual files/directories under shuffle-data, which accumulate over time in deep storage.

Although the documentation calls out that automatic cleanup is not implemented and that a retention policy must be configured on the deep storage side, I don't think we should leave this responsibility to deep storage.
Moreover, HDFS deep storage has no such built-in automatic cleanup policy.

Proposal

Trigger deepstore intermediary-data cleanup from the supervisor task after the parallel indexing flow completes.

This would make cleanup deterministic and consistent with the lifecycle of the overall parallel indexing task.
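
To make the proposal concrete, here is a minimal sketch of the idea, not Druid's actual code. Everything named in it is hypothetical: `IntermediaryStore`, `LocalStore`, `shufflePrefix`, and `runWithCleanup` are illustrative stand-ins, and the `shuffle-data/<supervisorTaskId>` layout is an assumption about how the prefix is scoped per supervisor task. A local temp directory plays the role of deep storage so the example runs without S3 or HDFS.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.concurrent.Callable;
import java.util.stream.Stream;

public class ShuffleCleanupSketch {

    /** Hypothetical stand-in for a deep-storage client; not a Druid interface. */
    interface IntermediaryStore {
        void deleteRecursively(String prefix) throws IOException;
    }

    /** Local-filesystem implementation so the sketch runs without real deep storage. */
    static final class LocalStore implements IntermediaryStore {
        private final Path root;

        LocalStore(Path root) {
            this.root = root;
        }

        @Override
        public void deleteRecursively(String prefix) throws IOException {
            Path dir = root.resolve(prefix);
            if (!Files.exists(dir)) {
                return; // nothing to delete, so repeated cleanup calls are harmless
            }
            try (Stream<Path> paths = Files.walk(dir)) {
                // Reverse order visits children before their parent directories.
                paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                    try {
                        Files.delete(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            }
        }
    }

    /** Assumed layout: one sub-prefix per supervisor task under shuffle-data. */
    static String shufflePrefix(String supervisorTaskId) {
        return "shuffle-data/" + supervisorTaskId;
    }

    /**
     * Runs the indexing phases, then deletes the task's shuffle prefix.
     * The supervisor owns cleanup, so it does not depend on any producing
     * worker still being alive.
     */
    static <T> T runWithCleanup(IntermediaryStore store, String supervisorTaskId, Callable<T> phases)
            throws Exception {
        try {
            return phases.call();
        } finally {
            try {
                store.deleteRecursively(shufflePrefix(supervisorTaskId));
            } catch (IOException | UncheckedIOException e) {
                // A cleanup failure is logged and must not change the task's outcome.
                System.err.println("Failed to clean up " + shufflePrefix(supervisorTaskId) + ": " + e);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path root = Files.createTempDirectory("deepstore");
        LocalStore store = new LocalStore(root);
        String taskId = "index_parallel_example";

        String status = runWithCleanup(store, taskId, () -> {
            // Simulate a phase-1 sub-task writing one shuffle partition file.
            Path part = root.resolve(shufflePrefix(taskId)).resolve("subtask_0/partition_0.zip");
            Files.createDirectories(part.getParent());
            Files.write(part, new byte[] {1, 2, 3});
            return "SUCCESS";
        });

        System.out.println(status + ", prefix exists: " + Files.exists(root.resolve(shufflePrefix(taskId))));
    }
}
```

Putting the delete in a `finally` block means it also runs when the phases fail, which is one possible reading of "after the parallel indexing flow completes"; whether failed tasks should keep their shuffle data for debugging is left open here.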
