Workflows sometimes don't resume after child/sub-workflows finish in a distributed environment. #7397

@cobrafast

Description

DispatchWorkflow with WaitForCompletion = true intermittently leaves the parent workflow suspended forever after the child completes. No error is thrown; the parent simply never resumes.

Expected Behavior

The parent workflow resumes after the child workflow reaches Finished status.

Actual Behavior

The parent remains in Suspended state indefinitely. The child shows Finished. No faulted state, no log error.

Environment

  • Elsa Package Version: 3.x (distributed runtime, UseDistributedRuntime())
  • Deployment: Multiple application instances (Azure App Service, ≥2 instances)
  • Persistence: EF Core / SQL Server for workflow instances, bookmark store, and bookmark queue
  • Distributed lock: SqlDistributedSynchronizationProvider (Medallion.Threading.SqlServer)
  • Distributed cache/dispatch: MassTransit + Azure Service Bus

Preliminary Cause Analysis

The failure is a race between DistributedBookmarkQueueWorker, BookmarkQueueSignaler, and DefaultBookmarkQueuePurger:

  1. Child completes on Node B. ResumeDispatchWorkflowActivity.HandleAsync enqueues a BookmarkQueueItem into the DB, then calls bookmarkQueueSignaler.TriggerAsync() — which writes to Node B's in-memory Channel (BookmarkQueueSignaler.cs:27). Node A's channel is never written to.

  2. Node B's DistributedBookmarkQueueWorker wakes and calls TryAcquireLockAsync with TimeSpan.Zero (DistributedBookmarkQueueWorker.cs:15). If Node A holds the lock (e.g. its periodic trigger just fired), handle is null and the method returns immediately without processing — no retry, no re-signal.

  3. The default BookmarkQueuePurgeOptions.Ttl is 1 minute; PurgeBookmarkQueueRecurringTask runs every 10 seconds; TriggerBookmarkQueueRecurringTask runs every 1 minute. If the purge task wins the ~60-second race against the next periodic trigger, the queue item is deleted before any node processes it. The parent never receives its resume signal.
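The zero-timeout lock pattern from step 2 can be sketched roughly as follows. This is a simplified illustration, not the actual Elsa source: the class name, the method name, the lock name "bookmark-queue-sketch", and the processQueueAsync delegate are all hypothetical. It assumes the Medallion.Threading IDistributedLockProvider abstraction, whose TryAcquireLockAsync returns null when the lock is already held.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Medallion.Threading;

// Hypothetical sketch of the pattern described in step 2 (not Elsa's code).
public static class BookmarkQueueLockSketch
{
    // Returns true if the queue was processed, false if the attempt was skipped.
    public static async Task<bool> TryProcessOnceAsync(
        IDistributedLockProvider lockProvider,
        Func<CancellationToken, Task> processQueueAsync, // hypothetical queue-processing callback
        CancellationToken cancellationToken = default)
    {
        // TimeSpan.Zero means "do not wait": if another node holds the lock,
        // the handle is null right away.
        await using var handle = await lockProvider.TryAcquireLockAsync(
            "bookmark-queue-sketch", TimeSpan.Zero, cancellationToken);

        if (handle == null)
        {
            // Nothing is retried or re-signaled here, so the queued item
            // waits for whichever sweep runs next.
            return false;
        }

        await processQueueAsync(cancellationToken);
        return true;
    }
}
```

When the method returns false, the queued item's fate depends entirely on whether a later sweep reaches it before the purge task does, which is the window described in step 3.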

Troubleshooting Attempts

Workaround: Setting BookmarkQueuePurgeOptions.Ttl to a value significantly larger than the periodic trigger interval (e.g. 1 hour) reduces permanent failures by ensuring the item survives until the next retry sweep.
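A minimal sketch of how the workaround might be registered, assuming BookmarkQueuePurgeOptions is consumed through the standard Microsoft.Extensions.Options pattern and lives in the Elsa.Workflows.Runtime.Options namespace (both are assumptions; adjust to your Elsa version). The extension method name is hypothetical.

```csharp
using System;
using Elsa.Workflows.Runtime.Options; // assumed namespace for BookmarkQueuePurgeOptions
using Microsoft.Extensions.DependencyInjection;

public static class BookmarkQueueTtlWorkaround
{
    // Hypothetical helper: raises the purge TTL well above the periodic trigger interval.
    public static IServiceCollection AddBookmarkQueueTtlWorkaround(this IServiceCollection services)
    {
        services.Configure<BookmarkQueuePurgeOptions>(options =>
        {
            // 1 hour, as in the workaround above; the default is reported to be 1 minute.
            options.Ttl = TimeSpan.FromHours(1);
        });

        return services;
    }
}
```

Calling this after the Elsa registration should let the override run after any defaults, since options configuration delegates are applied in registration order.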

Additional Context

Occurs intermittently (~5–10% of sub-workflow executions under production load with 2+ instances).
