Description
DispatchWorkflow with WaitForCompletion = true intermittently leaves the parent workflow suspended forever after the child completes. No error is thrown; the parent simply never resumes.
Expected Behavior
The parent workflow resumes after the child workflow reaches Finished status.
Actual Behavior
The parent remains in Suspended state indefinitely. The child shows Finished. No faulted state, no log error.
Environment
- Elsa Package Version: 3.x (distributed runtime, UseDistributedRuntime())
- Deployment: Multiple application instances (Azure App Service, ≥2 instances)
- Persistence: EF Core / SQL Server for workflow instances, bookmark store, and bookmark queue
- Distributed lock: SqlDistributedSynchronizationProvider (Medallion.Threading.SqlServer)
- Distributed cache/dispatch: MassTransit + Azure Service Bus
Preliminary Cause Analysis
The failure is a race between DistributedBookmarkQueueWorker, BookmarkQueueSignaler, and DefaultBookmarkQueuePurger:
- Child completes on Node B. ResumeDispatchWorkflowActivity.HandleAsync enqueues a BookmarkQueueItem into the DB, then calls bookmarkQueueSignaler.TriggerAsync(), which writes only to Node B's in-memory Channel (BookmarkQueueSignaler.cs:27). Node A's channel is never written to.
- Node B's DistributedBookmarkQueueWorker wakes and calls TryAcquireLockAsync with TimeSpan.Zero (DistributedBookmarkQueueWorker.cs:15). If Node A holds the lock (e.g. its periodic trigger just fired), handle is null and the method returns immediately without processing: no retry, no re-signal.
- The default BookmarkQueuePurgeOptions.Ttl is 1 minute; PurgeBookmarkQueueRecurringTask runs every 10 seconds; TriggerBookmarkQueueRecurringTask runs every 1 minute. If the purge task wins the ~60-second race against the next periodic trigger, the queue item is deleted before any node processes it. The parent never receives its resume signal.
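To make the second step concrete, here is a minimal sketch of the non-blocking acquire pattern described above. It is not the Elsa source. The class name, lock name, and processQueue delegate are hypothetical. The only library call used is Medallion.Threading's TryAcquireLockAsync extension on IDistributedLockProvider.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Medallion.Threading;

// Illustrative only: mirrors the behavior described in this issue, not Elsa's actual code.
public sealed class NonBlockingQueueSweep
{
    // Hypothetical lock name, chosen for this example.
    private const string LockName = "bookmark-queue-sweep";

    private readonly IDistributedLockProvider _lockProvider;

    public NonBlockingQueueSweep(IDistributedLockProvider lockProvider)
    {
        _lockProvider = lockProvider;
    }

    // Returns true if the queue was processed, false if the wake-up was dropped.
    public async Task<bool> TryProcessAsync(
        Func<CancellationToken, Task> processQueue,
        CancellationToken cancellationToken = default)
    {
        // TimeSpan.Zero means "do not wait": the call yields null if another node holds the lock.
        await using var handle = await _lockProvider.TryAcquireLockAsync(
            LockName, TimeSpan.Zero, cancellationToken);

        if (handle == null)
        {
            // No retry and no re-signal: the item stays in the DB until some
            // later trigger picks it up, or until the purge task deletes it.
            return false;
        }

        await processQueue(cancellationToken);
        return true;
    }
}
```

When this returns false on Node B and no other node was signaled, only the periodic trigger can still pick the item up. That is why the purge TTL matters.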
Troubleshooting Attempts
Workaround: Setting BookmarkQueuePurgeOptions.Ttl to a value significantly larger than the periodic trigger interval (e.g. 1 hour) reduces permanent failures by ensuring the item survives until the next retry sweep.
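A minimal sketch of the workaround. It assumes Elsa reads BookmarkQueuePurgeOptions through the standard Microsoft.Extensions.Options pattern and that the type lives in the namespace shown; adjust both if your Elsa version differs. The extension method name is hypothetical.

```csharp
using System;
using Elsa.Workflows.Runtime.Options; // assumed namespace for BookmarkQueuePurgeOptions
using Microsoft.Extensions.DependencyInjection;

public static class BookmarkQueuePurgeWorkaround
{
    // Hypothetical helper: call it after Elsa is registered so this value is applied last.
    public static IServiceCollection ExtendBookmarkQueueTtl(this IServiceCollection services)
    {
        // Keep queue items for 1 hour instead of the 1-minute default, so an item
        // survives many periodic trigger cycles before it becomes eligible for purging.
        return services.Configure<BookmarkQueuePurgeOptions>(
            options => options.Ttl = TimeSpan.FromHours(1));
    }
}
```

This only widens the window. It does not remove the underlying race: a dropped wake-up is still only recovered by the periodic trigger.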
Additional Context
Occurs intermittently (~5–10% of sub-workflow executions under production load with 2+ instances).