## Summary
The component test `BulkDispatchWorkflows should wait for all child workflows to complete` in `test/component/Elsa.Workflows.ComponentTests/Scenarios/Activities/Composition/BulkDispatchWorkflows/BulkDispatchWorkflowsTests.cs` is non-deterministic: two runs of the same commit on the same branch yield different outcomes.
## Evidence

Running on an internal fork of `main` @ commit `5e97e9130` (main plus a workflow-only additive change, no source modification):
| Run | Workflow | Outcome for `BulkDispatchWorkflows_should_wait...` |
|-----|----------|----------------------------------------------------|
| A | Internal `ci-mediawan.yml` (unit + component, `dotnet test` per-project, net10.0) | PASS — 188 passed / 3 skipped / 191 total in `Elsa.Workflows.ComponentTests.dll` |
| B | Upstream `packages.yml` (`Test with coverage` job, same per-project pattern, net10.0) | FAIL — `Assert.Equal() Failure: Values differ, Expected: 4, Actual: 3` |
Both runs executed on the same commit SHA, on `ubuntu-latest`, with .NET SDK 10.0.1xx, Release configuration, and `/p:CollectCoverage=true`. The only difference: other test projects running alongside in the same job (and therefore overall test duration / resource contention).
## Hypothesis
The assertion reads `Expected: 4 / Actual: 3`, which suggests one of the four dispatched child workflows had not yet reported completion at the moment the parent asserted. This is a classic race: a shared completion signal, a missing await, or a premature read of a counter incremented asynchronously by child callbacks.
Possible culprits (to be confirmed by someone with the runtime context):
- The test polls/awaits on a fixed timeout rather than a deterministic signal
- A child workflow completion event is dispatched fire-and-forget and may be observed out of order
- Shared state (counter, dictionary) accessed without a memory barrier — reminiscent of the pattern that #7284 fixed ("Fix BulkDispatchWorkflows sharing input dictionary across dispatches") but on a different surface
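The sleep-then-assert vs. deterministic-signal distinction hypothesized above can be sketched language-neutrally (Python here purely for illustration — the real test is C#/xUnit, and all names below are hypothetical, not Elsa APIs):

```python
import threading

# Hypothetical stand-in for the parent observing child completions.
# Four "child workflows" report completion asynchronously; the parent
# must not read the counter until all four signals have arrived.

NUM_CHILDREN = 4
completed = 0
lock = threading.Lock()
all_done = threading.Event()  # deterministic signal, not a sleep

def on_child_completed() -> None:
    """Callback fired (possibly fire-and-forget) when a child finishes."""
    global completed
    with lock:
        completed += 1
        if completed == NUM_CHILDREN:
            all_done.set()

# Simulate four asynchronous completion callbacks.
threads = [threading.Thread(target=on_child_completed) for _ in range(NUM_CHILDREN)]
for t in threads:
    t.start()

# Flaky version:   time.sleep(0.5); assert completed == 4   # may observe 3
# Deterministic version: block on the event (with a generous safety timeout).
assert all_done.wait(timeout=10), "children did not all complete in time"
for t in threads:
    t.join()
print(f"completed={completed}")
```

If the test currently does the equivalent of the "flaky version", a co-located test run (as in run B) only has to delay one callback past the sleep to produce `Actual: 3`.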
## Suggested next steps
- Reproduce locally with `dotnet test --filter "BulkDispatchWorkflows should wait for all child workflows to complete"` in a loop (`for i in {1..30}; do ...; done`) to quantify the flake rate
- Consider an explicit barrier / completion waiter instead of a sleep-then-assert
- If the root cause is time-sensitive, `[Retry(3)]` is a temporary mitigation, not a fix
I'm happy to file a follow-up PR if you have a direction in mind. Opening this primarily to flag the flake before it masks a future regression.
cc @sfmskywalker