Parallel indexing with deepstore intermediary data does not clean up shuffle-data in deep storage

### Background

Druid supports `druid.processing.intermediaryData.storage.type=deepstore` for native parallel indexing. This mode was introduced to address reliability issues such as rolling updates or worker loss during shuffle.

In this mode, phase-1 tasks write shuffle/intermediary data under the `shuffle-data` prefix in deep storage.

### Problem

These deep-storage shuffle artifacts are not cleaned up after the parallel indexing task completes.

The local intermediary data path has cleanup behavior, but the deepstore path does not appear to have an equivalent end-to-end cleanup flow. In particular, cleanup cannot rely on the producing worker/task process still being alive, since that task may exit before deletion is triggered.

As a result, completed parallel indexing tasks can leave behind many residual files/directories under `shuffle-data`, which accumulate over time in deep storage.

**Although the documentation notes that automatic cleanup is not implemented and that a retention policy must be configured on the deep storage side, I don't think we should leave this responsibility to deep storage.
For HDFS deep storage in particular, there is no such built-in automatic cleanup policy.**

<img width="1024" height="451" alt="Image" src="https://github.com/user-attachments/assets/715a2e73-71a4-4997-a2bc-ca3460fc72e6" />

### Proposal

Trigger deepstore intermediary-data cleanup from the supervisor task after the parallel indexing flow completes.

This would make cleanup deterministic and consistent with the lifecycle of the overall parallel indexing task.
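
To make the proposed lifecycle concrete, here is a minimal sketch of the idea, not Druid code. The `DeepStorage` interface, `LocalDeepStorage`, and `runWithCleanup` are hypothetical names, a local directory stands in for deep storage, and the `shuffle-data/<supervisorTaskId>` layout is an assumption made for illustration. The point is only that the supervisor deletes the prefix in a `finally` block, so cleanup does not depend on any phase-1 worker still being alive.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.concurrent.Callable;
import java.util.stream.Stream;

final class ShuffleCleanupSketch {

    /** Hypothetical stand-in for a deep-storage client; not a Druid interface. */
    interface DeepStorage {
        void deleteRecursively(String prefix) throws IOException;
    }

    /** Local-filesystem implementation, used only to make the sketch runnable. */
    static final class LocalDeepStorage implements DeepStorage {
        private final Path root;

        LocalDeepStorage(Path root) {
            this.root = root;
        }

        @Override
        public void deleteRecursively(String prefix) throws IOException {
            Path target = root.resolve(prefix);
            if (!Files.exists(target)) {
                return; // nothing was written, or it was already cleaned up
            }
            try (Stream<Path> walk = Files.walk(target)) {
                // Reverse order visits children before their parent directories.
                walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                    try {
                        Files.delete(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            }
        }
    }

    /**
     * Runs the parallel indexing phases, then removes this supervisor's shuffle data
     * whether the phases succeeded or failed. The prefix layout is an assumption.
     */
    static <T> T runWithCleanup(DeepStorage storage, String supervisorTaskId, Callable<T> phases)
            throws Exception {
        try {
            return phases.call();
        } finally {
            try {
                storage.deleteRecursively("shuffle-data/" + supervisorTaskId);
            } catch (IOException | UncheckedIOException e) {
                // A cleanup failure should not mask the outcome of the indexing task itself.
                System.err.println("Failed to clean shuffle data for " + supervisorTaskId + ": " + e);
            }
        }
    }
}
```

Scoping the delete to the supervisor task's own prefix keeps concurrent parallel indexing tasks from removing each other's intermediary data, and treating a cleanup failure as non-fatal avoids turning a successful ingestion into a failed task.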


