Is your feature request related to a problem? Please describe
When async operations (update-by-query, reindex, delete-by-query, force-merge, etc.) are submitted with wait_for_completion=false, their results are persisted to the .tasks system index via TaskResultsService.storeResult(). However, there is no mechanism to remove these results after they have been consumed or after a configurable retention period.
In production clusters that heavily use async operations (e.g., automated UBQ pipelines, scheduled reindex jobs), the .tasks index grows unbounded over time. This leads to:
- Wasted storage: completed task results that will never be read again accumulate indefinitely
- Cluster state overhead: the .tasks index metadata contributes to cluster state size
- Operational burden: users must manually run delete-by-query against .tasks or set up external ISM policies as a workaround
Describe the solution you'd like
A two-layer cleanup strategy:
Layer 1: Delete on retrieval (primary mechanism)
When a client retrieves a completed task result via GET /_tasks/<taskId>, the document is deleted from the .tasks index after the response is returned. This is the natural lifecycle — the result exists only so the client can poll for it; once read, it's served its purpose.
- Default behavior: delete completed results on retrieval (cleanup=true by default)
- Opt-out: GET /_tasks/<taskId>?cleanup=false to retain the result for re-reading
- Safe: only deletes completed results; in-progress tasks are unaffected
- Non-blocking: deletion happens asynchronously after the response is sent; failure to delete doesn't affect the client response
Layer 2: TTL-based fallback (safety net)
A background service catches orphaned results that are never retrieved (client crashed, fire-and-forget patterns, etc.):
| Setting | Default | Description |
| --- | --- | --- |
| task.result.ttl | -1 (disabled) | Time-to-live for completed task results. Results older than this are eligible for deletion. Example: 12h, 7d |
| task.result.cleanup.interval | 1h | How often the cleanup service runs |
| task.result.cleanup.batch_size | 1000 | Maximum documents deleted per cleanup cycle |
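To make the intended semantics of task.result.ttl concrete, here is a minimal standalone sketch of how values such as 12h, 7d, and -1 could be interpreted. The class TaskResultTtlSketch and its method are hypothetical names used only for illustration; an actual implementation would presumably use the time-value setting support that already exists in the codebase rather than a hand-written parser.

```java
import java.time.Duration;
import java.util.Optional;

/** Hypothetical helper, for illustration only. */
final class TaskResultTtlSketch {
    private TaskResultTtlSketch() {}

    /** Parses values such as "12h" or "7d"; "-1" means the TTL is disabled. */
    static Optional<Duration> parseTtl(String raw) {
        String value = raw.trim();
        if (value.equals("-1")) {
            return Optional.empty(); // disabled: nothing ever expires
        }
        if (value.length() < 2) {
            throw new IllegalArgumentException("Expected <number><unit>, got: " + raw);
        }
        long amount = Long.parseLong(value.substring(0, value.length() - 1));
        if (amount <= 0) {
            throw new IllegalArgumentException("TTL must be positive or -1, got: " + raw);
        }
        char unit = value.charAt(value.length() - 1);
        switch (unit) {
            case 's': return Optional.of(Duration.ofSeconds(amount));
            case 'm': return Optional.of(Duration.ofMinutes(amount));
            case 'h': return Optional.of(Duration.ofHours(amount));
            case 'd': return Optional.of(Duration.ofDays(amount));
            default: throw new IllegalArgumentException("Unknown time unit in: " + raw);
        }
    }
}
```

An empty Optional means the Layer 2 service has nothing to do, which matches the proposed default of -1.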
Combined behavior
Client submits async UBQ (wait_for_completion=false)
→ Task runs, completes, result stored in .tasks
→ Client polls GET /_tasks/<id>
→ Response returned, .tasks document deleted ← Layer 1
If client never polls:
→ After TTL expires, background service deletes it ← Layer 2
Implementation sketch
Layer 1: Delete on retrieval
The change is in TransportGetTaskAction.onGetFinishedTaskFromIndex(). After successfully parsing and returning the TaskResult, issue an async delete:
    // After returning the response to the client:
    if (cleanup && result.isCompleted()) {
        client.prepareDelete(TaskResultsService.TASK_INDEX, response.getId())
            .execute(ActionListener.wrap(
                r -> logger.trace("Cleaned up completed task result [{}]", response.getId()),
                e -> logger.debug("Failed to clean up task result [{}]: {}", response.getId(), e.getMessage())
            ));
    }
Changes needed:
- GetTaskRequest: add boolean cleanup field (version-gated for wire compat)
- RestGetTaskAction: parse ?cleanup=true|false parameter (default: true)
- Cluster-level override: task.result.cleanup_on_get setting (default: true)
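The proposal does not spell out how the per-request parameter and the cluster-level setting interact. One possible resolution is sketched below, assuming that an explicit ?cleanup= value on the request takes precedence and that the cluster setting only supplies the default when the parameter is absent. CleanupDecision is a hypothetical helper, not an existing class.

```java
import java.util.Optional;

/** Hypothetical helper, for illustration only. */
final class CleanupDecision {
    private CleanupDecision() {}

    /**
     * @param requestCleanup value of ?cleanup= if the client sent it, otherwise empty
     * @param clusterDefault current value of task.result.cleanup_on_get
     * @param completed      whether the stored task result is completed
     */
    static boolean shouldDelete(Optional<Boolean> requestCleanup, boolean clusterDefault, boolean completed) {
        if (!completed) {
            return false; // in-progress tasks are never deleted
        }
        // Assumption: an explicit request parameter wins over the cluster setting.
        return requestCleanup.orElse(clusterDefault);
    }
}
```

Under this assumed precedence, an operator can switch the cluster default off while individual clients can still opt in with ?cleanup=true.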
Layer 2: TTL-based cleanup service (TaskResultsCleanupService)
- Runs on elected cluster manager only (implements ClusterStateListener)
- Periodically runs delete-by-query against .tasks for completed results older than TTL
- Disabled by default (task.result.ttl=-1)
- Batched deletions to avoid resource spikes
- Requires adding a completion_time field to the task result mapping (version bump)
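The selection rule for one cleanup cycle can be sketched in plain Java as follows. This is an in-memory illustration under stated assumptions, not the proposed service: the real service would issue a delete-by-query against .tasks, and TtlCleanupSketch, StoredResult, and their field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory model of one Layer 2 cleanup cycle. */
final class TtlCleanupSketch {
    private TtlCleanupSketch() {}

    /** Minimal stand-in for a stored task result; field names are illustrative. */
    static final class StoredResult {
        final String id;
        final boolean completed;
        final long completionTimeMillis;

        StoredResult(String id, boolean completed, long completionTimeMillis) {
            this.id = id;
            this.completed = completed;
            this.completionTimeMillis = completionTimeMillis;
        }
    }

    /** Returns the ids to delete in this cycle; empty when the TTL is disabled. */
    static List<String> selectExpired(List<StoredResult> results, long ttlMillis, long nowMillis, int batchSize) {
        List<String> expired = new ArrayList<>();
        if (ttlMillis < 0) {
            return expired; // task.result.ttl = -1 disables the service
        }
        for (StoredResult r : results) {
            if (expired.size() >= batchSize) {
                break; // remaining documents wait for the next cycle
            }
            // Only completed results older than the TTL are eligible.
            if (r.completed && nowMillis - r.completionTimeMillis > ttlMillis) {
                expired.add(r.id);
            }
        }
        return expired;
    }
}
```

Capping each cycle at batchSize means a large backlog is drained over several intervals instead of in one burst.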
Settings
task.result.cleanup_on_get = true # Layer 1 default
task.result.ttl = -1 # Layer 2 disabled by default
task.result.cleanup.interval = 1h # Layer 2 frequency
task.result.cleanup.batch_size = 1000 # Layer 2 batch limit
Backward Compatibility
- Layer 1 defaults to true but is overridable per-request (?cleanup=false) and per-cluster
- Layer 2 is disabled by default
- Wire format: cleanup field in GetTaskRequest is version-gated; older nodes ignore it
- Existing .tasks documents work fine: Layer 1 doesn't need any mapping change, and Layer 2 can fall back to start_time_in_millis for old documents
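The version gating of the cleanup field can be illustrated with plain java.io streams. This sketch does not use the real transport serialization classes. The version constant, the class name, and the choice of default when the field is absent are all assumptions; the default is left as a parameter because the proposal does not say what a node should assume for a request coming from an older peer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for version-gated request serialization. */
final class VersionGatedCleanupFlag {
    // Hypothetical id of the first wire version that carries the cleanup flag.
    static final int CLEANUP_FLAG_VERSION = 2;

    private VersionGatedCleanupFlag() {}

    /** Serializes the request; the flag is written only if the stream version supports it. */
    static byte[] write(String taskId, boolean cleanup, int streamVersion) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(taskId);
            if (streamVersion >= CLEANUP_FLAG_VERSION) {
                out.writeBoolean(cleanup);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the flag, or returns defaultWhenAbsent when the stream version predates it. */
    static boolean readCleanup(byte[] payload, int streamVersion, boolean defaultWhenAbsent) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            in.readUTF(); // task id
            return streamVersion >= CLEANUP_FLAG_VERSION ? in.readBoolean() : defaultWhenAbsent;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because both sides branch on the same stream version, an older node never sees bytes it cannot parse, which is what keeps a rolling upgrade safe.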
Affected operations
Any operation that supports wait_for_completion=false:
- Update by query
- Delete by query
- Reindex
- Force merge
- Open index
- Resize (split/shrink/clone)
Related component
Cluster Manager
Related issues