Skip to content

Fix Flowable workflow termination message lookup and async-job race on deletion#27117

Merged
yan-3005 merged 5 commits intomainfrom
ram/fix-flowable-termination-message
Apr 8, 2026
Merged

Fix Flowable workflow termination message lookup and async-job race on deletion#27117
yan-3005 merged 5 commits intomainfrom
ram/fix-flowable-termination-message

Conversation

@yan-3005
Copy link
Copy Markdown
Contributor

@yan-3005 yan-3005 commented Apr 7, 2026

Summary

  • Production bug fix: WorkflowHandler.findTerminationMessageName was splitting task definition keys on _ and searching for messages named <key>_terminateProcess. Workflow.getFlowableElementId joins with a ., so actual task keys are <subProcess>.approvalTask and messages are <subProcess>.terminateProcess. Every conflict-resolution path (a newer workflow instance replacing an older one for the same entity) silently failed since the original Custom Workflows commit — the old subprocess was never terminated via message, left dangling until force-delete. Fixed by splitting on the last . and looking up <subProcessId>.terminateProcess directly.
  • Test fix: waitForWorkflowIdle now waits for async jobs on the trigger process definition IDs rather than waiting for process instances to drain. The async-after jobs on PeriodicBatchEntityTrigger's fetch task are ephemeral (<100ms); the old approach waited for the entire batch loop to finish (potentially minutes with many entities), causing test suite time to blow up.
  • Removed waitForWorkflowIdle from cleanupCreatedWorkflows @AfterEach — it ran a Flowable query per workflow in every test regardless of whether the workflow was ever triggered.

Test plan

  • WorkflowDefinitionResourceIT#test_DataCompletenessWorkflow_SDK — zero NullPointerException / JOB_EXECUTION_FAILURE lines
  • WorkflowDefinitionResourceIT#test_TagApprovalWorkflow_SDK — zero No termination message found warnings; no 60s ConditionTimeoutException timeout
  • Full WorkflowDefinitionResourceIT suite — no regressions, test time back to baseline (~2m30s)

…n deletion

WorkflowHandler.findTerminationMessageName was splitting task definition keys
on underscore and looking for messages named '<key>_terminateProcess'. But
Workflow.getFlowableElementId joins with a dot, so actual task keys are
'<subProcess>.approvalTask' and termination messages are
'<subProcess>.terminateProcess'. The bug has been latent since the original
Custom Workflows commit - every conflict-resolution path silently failed to
terminate the old subprocess, leaving it dangling until force-delete.

Fix: split on the last dot, derive the subprocess ID, and check for the
'<subProcessId>.terminateProcess' subscription directly. Remove the dead
'terminateChangeReviewProcess' pattern and the debug scaffolding loops.

Also add getManagementService/getRepositoryService getters to WorkflowHandler
for use in integration tests.

In WorkflowDefinitionResourceIT, replace waitForWorkflowIdle to check pending
async jobs on the trigger process definition IDs instead of waiting for
process instances to drain. Async-after jobs on the fetch task are ephemeral
(<100ms) - waiting for them has near-zero cost, whereas the old approach
waited for the entire batch loop (potentially minutes with many entities).
Also remove the waitForWorkflowIdle call from cleanupCreatedWorkflows @AfterEach
since it ran a query per workflow in every test unnecessarily.
Copilot AI review requested due to automatic review settings April 7, 2026 06:22
@yan-3005 yan-3005 added safe to test Add this label to run secure Github workflows on PRs backend labels Apr 7, 2026
@yan-3005 yan-3005 self-assigned this Apr 7, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a long-standing Flowable custom-workflow termination bug by aligning termination message lookup with the actual element/message naming scheme, and hardens integration-test cleanup by waiting for short-lived async jobs to drain before deleting workflow deployments.

Changes:

  • Update WorkflowHandler.findTerminationMessageName(...) to derive <subProcessId>.terminateProcess from task definition keys using the last . separator and validate via message subscription query.
  • Add ManagementService / RepositoryService accessors on WorkflowHandler to support test-side Flowable queries.
  • Update WorkflowDefinitionResourceIT cleanup paths to wait for trigger process async jobs to drain (instead of waiting for process instances to finish), and avoid per-test unconditional idle-waits.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/WorkflowHandler.java Correct termination message name resolution to match Flowable element IDs; add Flowable service getters for test usage.
openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/WorkflowDefinitionResourceIT.java Add waitForWorkflowIdle to avoid async-job vs deletion race during workflow cleanup and reduce test suite latency.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

🟡 Playwright Results — all passed (23 flaky)

✅ 3595 passed · ❌ 0 failed · 🟡 23 flaky · ⏭️ 207 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 456 0 1 2
🟡 Shard 2 640 0 2 32
🟡 Shard 3 647 0 4 26
🟡 Shard 4 615 0 7 47
🟡 Shard 5 605 0 2 67
🟡 Shard 6 632 0 7 33
🟡 23 flaky test(s) (passed on retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/DataQuality/TableLevelTests.spec.ts › Table Column Count To Be Between (shard 2, 1 retry)
  • Features/Permissions/EntityPermissions.spec.ts › DashboardDataModel allow common operations permissions (shard 3, 1 retry)
  • Features/Permissions/GlossaryPermissions.spec.ts › Team-based permissions work correctly (shard 3, 1 retry)
  • Features/UserProfileOnlineStatus.spec.ts › Should show "Active recently" for users active within last hour (shard 3, 1 retry)
  • Flow/ExploreDiscovery.spec.ts › Should display deleted assets when showDeleted is checked and deleted is not present in queryFilter (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Store Procedure (shard 4, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Description Rule Is_Set (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Rename domain with tags and glossary terms preserves associations (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Multiple consecutive domain renames preserve all associations (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Certification Add Remove (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Glossary Term Add, Update and Remove (shard 4, 1 retry)
  • Pages/ExploreTree.spec.ts › Verify Database and Database Schema available in explore tree (shard 5, 1 retry)
  • Pages/Glossary.spec.ts › Add and Remove Assets (shard 5, 2 retries)
  • Pages/HyperlinkCustomProperty.spec.ts › should display URL when no display text is provided (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Container (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/Lineage/PlatformLineage.spec.ts › Verify domain platform view (shard 6, 1 retry)
  • Pages/ProfilerConfigurationPage.spec.ts › Non admin user (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

…ocesses

WorkflowInstanceListener and WorkflowInstanceStageListener were firing on
Flowable process instances with numeric keys (e.g. 239376, 239373) that have
no corresponding OM workflow definition, causing EntityNotFoundException noise
in AUT runs.

- WorkflowInstanceListener: skip addWorkflowInstance/updateWorkflowInstance
  when getMainWorkflowDefinitionNameFromTrigger returns the raw process key
  unchanged (no 'Trigger' suffix to strip = not an OM trigger workflow)
- WorkflowInstanceStageListener: skip addNewStage/updateStage when the
  process key is purely numeric (isUnknownFlowableProcess guard)
- WorkflowDefinitionResourceIT: fix waitForWorkflowIdle to use
  TriggerFactory.getTriggerWorkflowId(fqn) instead of name+"Trigger%"
  so it correctly queries by FQN-based process definition key
Copilot AI review requested due to automatic review settings April 7, 2026 14:04
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Comment on lines +238 to +240
private boolean isUnknownFlowableProcess(String processKey) {
return processKey.matches("\\d+");
}
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isUnknownFlowableProcess treats any numeric-only process key as “not an OM workflow”. WorkflowDefinition name/fullyQualifiedName uses entityName, which allows numeric-only values, so a workflow named e.g. "123" would have stage tracking silently disabled. Consider detecting non-OM processes via presence/format of the expected businessKey/variables (or validating/disallowing numeric-only workflow names) rather than processKey.matches("\\d+").

Copilot uses AI. Check for mistakes.
* picked up and executed. These jobs are ephemeral (complete in &lt;100ms) but if the deployment
* is deleted while one is in flight the Flowable job executor NPEs when it tries to resolve the
* process definition. Returns immediately for event-based workflows that have no
* "&lt;name&gt;Trigger-*" process definitions deployed, or for periodic-batch workflows whose
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Javadoc says this returns immediately for event-based workflows that have no "Trigger-*" process definitions, but the code queries processDefinitionKeyLike(... + "%"), which matches both "Trigger" and "Trigger-...". Update the comment to reflect the actual matching criteria ("Trigger%") so future readers don’t infer the wrong behavior.

Suggested change
* "&lt;name&gt;Trigger-*" process definitions deployed, or for periodic-batch workflows whose
* "&lt;name&gt;Trigger%" process definitions deployed, or for periodic-batch workflows whose

Copilot uses AI. Check for mistakes.
Comment on lines +1971 to +1975
List<ProcessDefinition> triggerPDs =
repositoryService
.createProcessDefinitionQuery()
.processDefinitionKeyLike(TriggerFactory.getTriggerWorkflowId(workflowFqn) + "%")
.list();
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

processDefinitionKeyLike(...) uses SQL LIKE semantics where _ and % are wildcards. Since workflow names/FQNs in tests can include _ (e.g., via ns.prefix(...)), this query may pick up unrelated process definitions and make the wait slower/flaky. Consider filtering the returned list by pd.getKey().startsWith(baseKey) (or otherwise escaping LIKE wildcards) before collecting PD IDs.

Suggested change
List<ProcessDefinition> triggerPDs =
repositoryService
.createProcessDefinitionQuery()
.processDefinitionKeyLike(TriggerFactory.getTriggerWorkflowId(workflowFqn) + "%")
.list();
String triggerWorkflowId = TriggerFactory.getTriggerWorkflowId(workflowFqn);
List<ProcessDefinition> triggerPDs =
repositoryService
.createProcessDefinitionQuery()
.processDefinitionKeyLike(triggerWorkflowId + "%")
.list()
.stream()
.filter(pd -> pd.getKey() != null && pd.getKey().startsWith(triggerWorkflowId))
.toList();

Copilot uses AI. Check for mistakes.
Comment on lines +1068 to +1072
.createExecutionQuery()
.processInstanceId(task.getProcessInstanceId())
.messageEventSubscriptionName(messageName)
.count()
> 0;
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

findTerminationMessageName queries Flowable for the message subscription via .count(), and terminateTaskProcessInstance immediately runs a second query (.singleResult()) for the same messageEventSubscriptionName. You can avoid the extra round-trip by having one of these queries serve both purposes (e.g., fetch the Execution once and branch on null).

Copilot uses AI. Check for mistakes.
- WorkflowInstanceStageListener: replace fragile numeric-key guard with
  businessKey-based guard — skip stage tracking when no business key is
  present (non-OM-managed process instance) instead of using processKey
  format heuristic that could misfire on numeric-named workflows
- WorkflowDefinitionResourceIT: fix Javadoc comment ("Trigger-*" ->
  "Trigger%") and add startsWith filter after processDefinitionKeyLike
  query to prevent SQL LIKE wildcard '_' from matching unrelated PDs
- WorkflowHandler: eliminate double Flowable round-trip in termination
  logic by merging findTerminationMessageName into terminateTask helper
  that does a single singleResult() query per task
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

The Java checkstyle failed.

Please run mvn spotless:apply in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Java code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@yan-3005
Copy link
Copy Markdown
Contributor Author

yan-3005 commented Apr 7, 2026

@gitar-bot Fix the Edge Case: Silent skip on null businessKey uses DEBUG; WARN would aid diagnosis and do mvn spotless:apply on the root of the repository and push it

…ceStageListener

Co-authored-by: Ram Narayan Balaji <81347100+yan-3005@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 7, 2026 16:44
@yan-3005 yan-3005 review requested due to automatic review settings April 7, 2026 16:44
@gitar-bot
Copy link
Copy Markdown

gitar-bot bot commented Apr 7, 2026

@gitar-bot Fix the Edge Case: Silent skip on null businessKey uses DEBUG; WARN would aid diagnosis and do mvn spotless:apply on the root of the repository and push it

Done. Changed LOG.debug to LOG.warn in WorkflowInstanceStageListener.java for the null business key skip case, then ran mvn spotless:apply on the repository root. Changes committed.

@gitar-bot
Copy link
Copy Markdown

gitar-bot bot commented Apr 7, 2026

Code Review ✅ Approved 2 resolved / 2 findings

Fixes Flowable workflow termination message lookup and async-job race on deletion by correcting variable references and improving logging verbosity. Two issues were addressed: waitForWorkflowIdle now uses the correct workflowFqn parameter, and null businessKey skips now log at WARN level for better diagnostics.

✅ 2 resolved
Bug: waitForWorkflowIdle uses undefined workflowName instead of workflowFqn

📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/WorkflowDefinitionResourceIT.java:1982
In waitForWorkflowIdle(String workflowFqn), line 1982 references workflowName in the await() description string, but the method parameter is named workflowFqn. There is no workflowName field on the class. This should either fail to compile or resolve to a variable from an unexpected scope. The intended variable is clearly workflowFqn.

Edge Case: Silent skip on null businessKey uses DEBUG; WARN would aid diagnosis

📄 openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/WorkflowInstanceStageListener.java:140-144 📄 openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/WorkflowInstanceListener.java:100-106 📄 openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/WorkflowInstanceListener.java:126-132
In both WorkflowInstanceListener.java and WorkflowInstanceStageListener.java, when a non-OM-managed process instance is detected (null business key or non-trigger process key), the code logs at DEBUG level and silently returns. While this is correct for known orphaned Flowable processes, if a legitimate OM-managed workflow ever reached these paths without a business key due to a bug elsewhere, the issue would be invisible at typical log levels.

The WorkflowInstanceStageListener change (line 140) is the more impactful one — it previously threw IllegalStateException, providing a hard failure signal. Now it silently skips stage tracking entirely. Consider logging at WARN level for the WorkflowInstanceStageListener case specifically, since missing stage tracking for a real workflow would be a user-visible problem.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud bot commented Apr 7, 2026

@yan-3005 yan-3005 merged commit 804f551 into main Apr 8, 2026
50 checks passed
@yan-3005 yan-3005 deleted the ram/fix-flowable-termination-message branch April 8, 2026 13:46
SaaiAravindhRaja pushed a commit to SaaiAravindhRaja/OpenMetadata that referenced this pull request Apr 12, 2026
…n deletion (open-metadata#27117)

* Fix Flowable workflow termination message lookup and async-job race on deletion

WorkflowHandler.findTerminationMessageName was splitting task definition keys
on underscore and looking for messages named '<key>_terminateProcess'. But
Workflow.getFlowableElementId joins with a dot, so actual task keys are
'<subProcess>.approvalTask' and termination messages are
'<subProcess>.terminateProcess'. The bug has been latent since the original
Custom Workflows commit - every conflict-resolution path silently failed to
terminate the old subprocess, leaving it dangling until force-delete.

Fix: split on the last dot, derive the subprocess ID, and check for the
'<subProcessId>.terminateProcess' subscription directly. Remove the dead
'terminateChangeReviewProcess' pattern and the debug scaffolding loops.

Also add getManagementService/getRepositoryService getters to WorkflowHandler
for use in integration tests.

In WorkflowDefinitionResourceIT, replace waitForWorkflowIdle to check pending
async jobs on the trigger process definition IDs instead of waiting for
process instances to drain. Async-after jobs on the fetch task are ephemeral
(<100ms) - waiting for them has near-zero cost, whereas the old approach
waited for the entire batch loop (potentially minutes with many entities).
Also remove the waitForWorkflowIdle call from cleanupCreatedWorkflows @AfterEach
since it ran a query per workflow in every test unnecessarily.

* Fix spurious EntityNotFoundException errors from orphaned Flowable processes

WorkflowInstanceListener and WorkflowInstanceStageListener were firing on
Flowable process instances with numeric keys (e.g. 239376, 239373) that have
no corresponding OM workflow definition, causing EntityNotFoundException noise
in AUT runs.

- WorkflowInstanceListener: skip addWorkflowInstance/updateWorkflowInstance
  when getMainWorkflowDefinitionNameFromTrigger returns the raw process key
  unchanged (no 'Trigger' suffix to strip = not an OM trigger workflow)
- WorkflowInstanceStageListener: skip addNewStage/updateStage when the
  process key is purely numeric (isUnknownFlowableProcess guard)
- WorkflowDefinitionResourceIT: fix waitForWorkflowIdle to use
  TriggerFactory.getTriggerWorkflowId(fqn) instead of name+"Trigger%"
  so it correctly queries by FQN-based process definition key

* Fix compilation issues

* Address Copilot review feedback on PR open-metadata#27117

- WorkflowInstanceStageListener: replace fragile numeric-key guard with
  businessKey-based guard — skip stage tracking when no business key is
  present (non-OM-managed process instance) instead of using processKey
  format heuristic that could misfire on numeric-named workflows
- WorkflowDefinitionResourceIT: fix Javadoc comment ("Trigger-*" ->
  "Trigger%") and add startsWith filter after processDefinitionKeyLike
  query to prevent SQL LIKE wildcard '_' from matching unrelated PDs
- WorkflowHandler: eliminate double Flowable round-trip in termination
  logic by merging findTerminationMessageName into terminateTask helper
  that does a single singleResult() query per task

* fix: change DEBUG to WARN for null businessKey skip in WorkflowInstanceStageListener

Co-authored-by: Ram Narayan Balaji <81347100+yan-3005@users.noreply.github.com>

---------

Co-authored-by: Gitar <noreply@gitar.ai>
SaaiAravindhRaja pushed a commit to SaaiAravindhRaja/OpenMetadata that referenced this pull request Apr 12, 2026
…n deletion (open-metadata#27117)

* Fix Flowable workflow termination message lookup and async-job race on deletion

WorkflowHandler.findTerminationMessageName was splitting task definition keys
on underscore and looking for messages named '<key>_terminateProcess'. But
Workflow.getFlowableElementId joins with a dot, so actual task keys are
'<subProcess>.approvalTask' and termination messages are
'<subProcess>.terminateProcess'. The bug has been latent since the original
Custom Workflows commit - every conflict-resolution path silently failed to
terminate the old subprocess, leaving it dangling until force-delete.

Fix: split on the last dot, derive the subprocess ID, and check for the
'<subProcessId>.terminateProcess' subscription directly. Remove the dead
'terminateChangeReviewProcess' pattern and the debug scaffolding loops.

Also add getManagementService/getRepositoryService getters to WorkflowHandler
for use in integration tests.

In WorkflowDefinitionResourceIT, replace waitForWorkflowIdle to check pending
async jobs on the trigger process definition IDs instead of waiting for
process instances to drain. Async-after jobs on the fetch task are ephemeral
(<100ms) - waiting for them has near-zero cost, whereas the old approach
waited for the entire batch loop (potentially minutes with many entities).
Also remove the waitForWorkflowIdle call from cleanupCreatedWorkflows @AfterEach
since it ran a query per workflow in every test unnecessarily.

* Fix spurious EntityNotFoundException errors from orphaned Flowable processes

WorkflowInstanceListener and WorkflowInstanceStageListener were firing on
Flowable process instances with numeric keys (e.g. 239376, 239373) that have
no corresponding OM workflow definition, causing EntityNotFoundException noise
in AUT runs.

- WorkflowInstanceListener: skip addWorkflowInstance/updateWorkflowInstance
  when getMainWorkflowDefinitionNameFromTrigger returns the raw process key
  unchanged (no 'Trigger' suffix to strip = not an OM trigger workflow)
- WorkflowInstanceStageListener: skip addNewStage/updateStage when the
  process key is purely numeric (isUnknownFlowableProcess guard)
- WorkflowDefinitionResourceIT: fix waitForWorkflowIdle to use
  TriggerFactory.getTriggerWorkflowId(fqn) instead of name+"Trigger%"
  so it correctly queries by FQN-based process definition key

* Fix compilation issues

* Address Copilot review feedback on PR open-metadata#27117

- WorkflowInstanceStageListener: replace fragile numeric-key guard with
  businessKey-based guard — skip stage tracking when no business key is
  present (non-OM-managed process instance) instead of using processKey
  format heuristic that could misfire on numeric-named workflows
- WorkflowDefinitionResourceIT: fix Javadoc comment ("Trigger-*" ->
  "Trigger%") and add startsWith filter after processDefinitionKeyLike
  query to prevent SQL LIKE wildcard '_' from matching unrelated PDs
- WorkflowHandler: eliminate double Flowable round-trip in termination
  logic by merging findTerminationMessageName into terminateTask helper
  that does a single singleResult() query per task

* fix: change DEBUG to WARN for null businessKey skip in WorkflowInstanceStageListener

Co-authored-by: Ram Narayan Balaji <81347100+yan-3005@users.noreply.github.com>

---------

Co-authored-by: Gitar <noreply@gitar.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants