Workflows sometimes don't resume after child/sub-workflows finish in a distributed environment. #7397

## Description
DispatchWorkflow with WaitForCompletion = true intermittently leaves the parent workflow suspended forever after the child completes. No error is thrown; the parent simply never resumes.

## Expected Behavior
The parent workflow resumes after the child workflow reaches Finished status.

## Actual Behavior
The parent remains in Suspended state indefinitely. The child shows Finished. No faulted state, no log error.

## Environment
- Elsa Package Version: 3.x (distributed runtime, UseDistributedRuntime())
- Deployment: Multiple application instances (Azure App Service, ≥2 instances)
- Persistence: EF Core / SQL Server for workflow instances, bookmark store, and bookmark queue
- Distributed lock: SqlDistributedSynchronizationProvider (Medallion.Threading.SqlServer)
- Distributed cache/dispatch: MassTransit + Azure Service Bus

## Preliminary Cause Analysis

The failure is a race between DistributedBookmarkQueueWorker, BookmarkQueueSignaler, and DefaultBookmarkQueuePurger:

1. Child completes on Node B. ResumeDispatchWorkflowActivity.HandleAsync enqueues a BookmarkQueueItem into the DB, then calls bookmarkQueueSignaler.TriggerAsync() — which writes to Node B's in-memory Channel<T> ([BookmarkQueueSignaler.cs:27](https://github.com/elsa-workflows/elsa-core/blob/main/src/modules/Elsa.Workflows.Runtime/Services/BookmarkQueueSignaler.cs#L27)). Node A's channel is never written to.

2. Node B's DistributedBookmarkQueueWorker wakes and calls TryAcquireLockAsync with TimeSpan.Zero ([DistributedBookmarkQueueWorker.cs:15](https://github.com/elsa-workflows/elsa-core/blob/main/src/modules/Elsa.Workflows.Runtime.Distributed/Services/DistributedBookmarkQueueWorker.cs#L15)). If Node A holds the lock (e.g. its periodic trigger just fired), handle is null and the method returns immediately without processing — no retry, no re-signal.

3. The default BookmarkQueuePurgeOptions.Ttl is 1 minute; PurgeBookmarkQueueRecurringTask runs every 10 seconds; TriggerBookmarkQueueRecurringTask runs every 1 minute. If the purge task wins the ~60-second race against the next periodic trigger, the queue item is deleted before any node processes it. The parent never receives its resume signal.
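The skip-without-retry behaviour in step 2 can be illustrated with a minimal, self-contained sketch. This is not Elsa's code: a `SemaphoreSlim` stands in for the SQL-backed distributed lock, and `TryProcessQueueAsync` and `pendingItems` are hypothetical names introduced only for illustration. It shows how a zero-timeout acquire leaves the queue item untouched when another node holds the lock.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Stand-in for the distributed lock (assumption: the real one is SQL-backed).
var distributedLock = new SemaphoreSlim(1, 1);

// One queue item waiting in the simulated database.
var pendingItems = 1;

// Hypothetical worker body: try the lock without waiting and skip if it is taken.
async Task<bool> TryProcessQueueAsync(string node)
{
    // A zero timeout returns immediately with false when the lock is held.
    if (!await distributedLock.WaitAsync(TimeSpan.Zero))
    {
        Console.WriteLine($"{node}: lock busy, skipping (no retry, no re-signal)");
        return false;
    }

    try
    {
        Console.WriteLine($"{node}: processing {pendingItems} item(s)");
        pendingItems = 0;
        return true;
    }
    finally
    {
        distributedLock.Release();
    }
}

// Node A holds the lock at the moment Node B's local signal arrives.
await distributedLock.WaitAsync();
var processed = await TryProcessQueueAsync("Node B");
distributedLock.Release();

// The item is still pending; only a later trigger can pick it up.
Console.WriteLine($"processed={processed}, pendingItems={pendingItems}");
// prints "processed=False, pendingItems=1"
```

In this model the item is only recovered if some later trigger acquires the lock before the purge deletes the record, which is the window described in step 3.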

## Troubleshooting Attempts
Workaround: Setting BookmarkQueuePurgeOptions.Ttl to a value significantly larger than the periodic trigger interval (e.g. 1 hour) reduces permanent failures by ensuring the item survives until the next retry sweep.
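A sketch of how the workaround might be applied in `Program.cs`, using the standard .NET options pattern. Two things here are assumptions rather than confirmed API details: the namespace of `BookmarkQueuePurgeOptions`, and that the purger reads its settings through `IOptions<BookmarkQueuePurgeOptions>`. Adjust to match the installed Elsa version.

```csharp
using System;
using Elsa.Workflows.Runtime.Options; // assumed namespace for BookmarkQueuePurgeOptions
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// ... existing AddElsa(...) registration stays unchanged ...

// Assumption: the purger resolves IOptions<BookmarkQueuePurgeOptions>.
// Keep queue items for 1 hour so they outlive many periodic trigger sweeps.
builder.Services.Configure<BookmarkQueuePurgeOptions>(options =>
{
    options.Ttl = TimeSpan.FromHours(1);
});
```

A longer TTL only widens the window in which a periodic trigger can pick up the item; it does not remove the underlying race.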

## Additional Context
Occurs intermittently (~5–10% of sub-workflow executions under production load with 2+ instances).

