
Flaky component test: BulkDispatchWorkflows should wait for all child workflows to complete #7404

@Fatilov

Summary

The component test "BulkDispatchWorkflows should wait for all child workflows to complete" in test/component/Elsa.Workflows.ComponentTests/Scenarios/Activities/Composition/BulkDispatchWorkflows/BulkDispatchWorkflowsTests.cs is non-deterministic: two runs of the same commit on the same branch yielded different outcomes.

Evidence

Running on an internal fork of main @ commit 5e97e9130 (main + a workflow-only additive change, no source modification):

| Run | Workflow | Outcome for BulkDispatchWorkflows_should_wait... |
| --- | --- | --- |
| A | Internal ci-mediawan.yml (unit + component, dotnet test per-project, net10.0) | PASS: 188 passed / 3 skipped / 191 total in Elsa.Workflows.ComponentTests.dll |
| B | Upstream packages.yml (Test with coverage job, same per-project pattern, net10.0) | FAIL: Assert.Equal() Failure: Values differ, Expected: 4, Actual: 3 |

Both runs executed on the same commit SHA, on ubuntu-latest, with .NET SDK 10.0.1xx, Release config, /p:CollectCoverage=true. The only difference is which other test projects ran alongside in the same job (and therefore the overall test duration and resource contention).

Hypothesis

The assertion reads Expected: 4 / Actual: 3, which suggests that one of the four dispatched child workflows hadn't reported completion at the moment the parent asserts. This looks like a classic race: a shared completion signal, a missing await, or a premature read of a counter that child callbacks increment asynchronously.

Possible culprits (to be confirmed by someone with the runtime context):

  1. The test polls/awaits on a fixed timeout rather than a deterministic signal
  2. A child workflow completion event is dispatched fire-and-forget and may be observed out of order
  3. Shared state (counter, dictionary) accessed without a memory barrier, reminiscent of the pattern that #7284 fixed (Fix BulkDispatchWorkflows sharing input dictionary across dispatches) but on a different surface
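
To make the first culprit concrete, here is a minimal shell analogy, not Elsa code: background processes stand in for child workflows, and a line count in a temp file stands in for the completion counter. The names child and dispatch_and_wait are hypothetical. The point is only that an explicit barrier (wait) yields 4 every time, whereas replacing it with a fixed sleep shorter than the slowest child could yield 3, which is the shape of the failure above.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for one child workflow: finish after a delay,
# then record completion by appending a line to a shared file.
child() {
  sleep "$1"
  echo done >> "$2"
}

# Start four children in the background, block until all have exited,
# then print how many completions were recorded.
dispatch_and_wait() {
  local log delay
  log=$(mktemp)
  for delay in 0 1 0 1; do
    child "$delay" "$log" &
  done
  wait   # explicit barrier: returns only after every background child exits
  wc -l < "$log" | tr -d ' '
  rm -f "$log"
}
```

Whether the real test uses a timeout, a polling loop, or something else entirely still needs to be confirmed in BulkDispatchWorkflowsTests.cs.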

Suggested next steps

  • Reproduce locally with dotnet test --filter "BulkDispatchWorkflows should wait for all child workflows to complete" in a loop (for i in {1..30}; do ...; done) to quantify flake rate
  • Consider an explicit barrier / completion waiter instead of a sleep-then-assert
  • If the root cause is time-sensitive, [Retry(3)] is a temporary mitigation but not a fix
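
For the first step, a small helper along these lines could tally the flake rate. It is a sketch: flake_rate is a hypothetical name, and it treats any non-zero exit status as a failed run, so build errors would be counted as flakes too.

```shell
#!/usr/bin/env bash
# Run a command repeatedly and report how often it fails.
# Usage: flake_rate <runs> <command> [args...]
flake_rate() {
  local runs=$1; shift
  local failures=0 i
  for ((i = 1; i <= runs; i++)); do
    # Discard output; only the exit status matters for the tally.
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
  done
  echo "$failures/$runs runs failed"
}
```

It would be invoked as flake_rate 30 dotnet test --filter "BulkDispatchWorkflows should wait for all child workflows to complete". Since run B only failed with other test projects in the same job, a loop over the single test may under-report the rate unless the machine is under comparable load.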

I'm happy to file a follow-up PR if you have a direction in mind. Opening this primarily to flag the flake before it masks a future regression.

cc @sfmskywalker
