Skip to content

fix(workflows): make Flowable schema upgrades idempotent to survive partial migrations#27234

Merged
yan-3005 merged 10 commits intomainfrom
ram/flowable-schema-update-fix
Apr 16, 2026
Merged

fix(workflows): make Flowable schema upgrades idempotent to survive partial migrations#27234
yan-3005 merged 10 commits intomainfrom
ram/flowable-schema-update-fix

Conversation

@yan-3005
Copy link
Copy Markdown
Contributor

@yan-3005 yan-3005 commented Apr 10, 2026

Performance Note

The idempotent DDL wrapper only activates during migrate — never on server startup. Flowable was upgraded in a single OpenMetadata release, so the upgrade scripts run exactly once per installation. After the schema reaches 7.2.0.2, Flowable skips all upgrade scripts on subsequent migrations and never calls execute() on the statement — the wrapper is created but never intercepted. There is no ongoing performance cost.

Fixes #26048

Background

The original reproduction path described in the ticket (wrong Elasticsearch password causing startup failure) no longer works on current `main`. Two indexing refactor commits changed this:

  • `b59aa7fc44` / `7b6360a9ed` (Mar 3–10, 2026) — split ES startup into phases
  • `09b89d2229` (Mar 12, 2026) — introduced `createMissingIndexes()` which catches every ES connection error per-index and logs a warning instead of throwing. A wrong ES password now produces a wall of `WARN` lines but startup continues normally, so `WorkflowHandler.initialize()` is always reached.

To reproduce the stuck state today, the DB must be tampered directly — the underlying race condition is still real and can still be triggered by a crash mid-Flowable-upgrade.

How to Reproduce (tampered data method)

  1. Start fresh DB and run `./openmetadata-ops.sh migrate` — server starts fine, Flowable schema at `7.2.0.2`
  2. Connect to the DB and run:
    UPDATE ACT_GE_PROPERTY SET VALUE_ = '7.1.0.0' WHERE NAME_ = 'common.schema.version';
    This simulates the stuck state left by a mid-upgrade crash: the version row says old version but all the new DDL objects already exist.
  3. Restart the server → fails: Flowable tries to re-run upgrade DDL, hits "relation/index already exists"
  4. Run `./openmetadata-ops.sh migrate --force` → same failure, DB permanently wedged

Root Cause

`WorkflowHandler` always set `DB_SCHEMA_UPDATE_TRUE`, so the webserver ran Flowable DDL on every startup. A crash between DDL execution and the version row update left the DB in a state where neither the server nor migrate could recover — because Flowable's DDL is not idempotent (no `IF NOT EXISTS`).

Design Tradeoffs — Approaches Considered and Rejected

Option 1: Correct the version row directly

Idea: When migrate detects a stuck state, just `UPDATE ACT_GE_PROPERTY SET VALUE_ = '7.2.0.2'` to make Flowable think migration is complete.

Why rejected: Flowable sees the version as current and skips all upgrade scripts entirely. If the crash happened mid-migration — some DDL executed, some not — the missing DDL never runs. The DB is now permanently inconsistent with no way to recover without manual intervention. This trades one failure mode for a worse, silent one.

Option 2: Catch SQL state codes

Idea: Wrap Flowable's DDL execution, catch `SQLException`, inspect `getSQLState()` — `42P07` on PostgreSQL ("relation already exists"), `1061` on MySQL ("Duplicate key name") — and skip.

Why rejected: SQL state codes vary across DB versions, JDBC driver versions, and sometimes across minor releases. A code that works today can change in a patch release. Fragile across the two DBs (MySQL + PostgreSQL) this project supports.

Option 3: Match error message strings

Idea: Catch the exception and check if the message contains "already exists" / "relation already exists".

Why rejected: Error messages are not part of any SQL standard. They differ by DB engine, locale settings, and driver version. String matching on exception messages is the most fragile contract available.

Option 4: Rewrite DDL with IF NOT EXISTS

Idea: Intercept the SQL before execution and rewrite `CREATE INDEX idx ON t` → `CREATE INDEX IF NOT EXISTS idx ON t`.

Why rejected: `CREATE INDEX IF NOT EXISTS` is not supported on many MySQL versions. Would require two different rewrite paths (MySQL vs PostgreSQL) and still doesn't handle `ALTER TABLE ADD COLUMN` idempotently.

Option 5 (chosen): DatabaseMetaData pre-checks

Idea: Before executing any DDL statement, query the DB's own catalog via standard JDBC `DatabaseMetaData` to check whether the object already exists. If yes, skip. If no, execute.

Why this works:

  • `getIndexInfo()`, `getTables()`, `getColumns()` are part of the JDBC standard — identical API on MySQL and PostgreSQL
  • No SQL state codes, no error message matching, no non-portable syntax
  • Handles partial migrations correctly: objects that exist are skipped, objects that are missing are created — the migration resumes exactly where it left off
  • Only active in migration context; webserver is unaffected

Fix — 4 changes

1. Webserver no longer runs Flowable DDL

`WorkflowHandler.setDatabaseSchemaUpdate` is now conditional:

  • Webserver → `DB_SCHEMA_UPDATE_FALSE` (validate only, never run DDL)
  • Migrate CLI → `DB_SCHEMA_UPDATE_TRUE`

2. Migrate CLI guarantees Flowable DDL runs

Explicit `WorkflowHandler.initialize(config, true)` added in `OpenMetadataOperations` after `validateAndRunSystemDataMigrations`.

3. Operator-friendly error on schema mismatch

Webserver now catches `FlowableWrongDbException` and rethrows with:

"Flowable schema is not initialized or is at an unexpected version. Run `openmetadata-ops.sh migrate` before starting the server."

4. Idempotent DDL wrapper (new: `IdempotentDdlDataSource` + `IdempotentDdlStatement`)

Used exclusively in migration context. Wraps the JDBC `DataSource` via a JDK proxy. `IdempotentDdlStatement` intercepts `execute(sql)` for three DDL patterns:

Statement Pre-check
`CREATE [UNIQUE] INDEX idx ON table(...)` `DatabaseMetaData.getIndexInfo()`
`CREATE TABLE name(...)` `DatabaseMetaData.getTables()`
`ALTER TABLE name ADD [COLUMN] col ...` `DatabaseMetaData.getColumns()`

If the object already exists → log skip, return. Otherwise → execute normally. All other SQL passes through unchanged.

Verification

After the fix, repeating the tamper-and-recover steps above:

  • Step 3 (server start without migrate): fails with the actionable error message ✓
  • Step 4 (`migrate --force`): wrapper skips already-existing DDL, applies any missing DDL, version corrects to `7.2.0.2`, server starts cleanly ✓

Test Plan

  • Unit tests for schema-update mode selection (`WorkflowHandlerSchemaUpdateTest`) — 3 tests, all passing
  • Integration: fresh DB → migrate → server start
  • Integration: tamper version → migrate --force → server start (stuck-state recovery)

…artial migrations

Fixes #26048.

When the server crashed mid-startup during a Flowable schema upgrade, the DB
was left in a partially-migrated state. On restart, Flowable re-ran the same
DDL and failed on already-existing objects (indexes, tables, columns), permanently
wedging both the server and migrate --force.

Changes:
1. WorkflowHandler: webserver now uses DB_SCHEMA_UPDATE_FALSE — it validates the
   schema but never runs DDL. Only migrate CLI uses DB_SCHEMA_UPDATE_TRUE.
2. OpenMetadataOperations: explicit WorkflowHandler.initialize(config, true) inside
   the migrate command so Flowable DDL always runs during migration.
3. WorkflowHandler: catches FlowableWrongDbException on webserver startup and
   rethrows with an actionable message directing the operator to run migrate.
4. IdempotentDdlDataSource + IdempotentDdlStatement: JDBC DataSource wrapper used
   exclusively in migration context. Intercepts execute(sql) for CREATE INDEX,
   CREATE TABLE, and ALTER TABLE ADD COLUMN and pre-checks existence via standard
   DatabaseMetaData (getIndexInfo, getTables, getColumns) before executing. If the
   object already exists it logs a skip and returns — no SQL state codes, no string
   matching, works on MySQL and PostgreSQL.

Unit tests cover schema-update mode selection in both contexts.
Copilot AI review requested due to automatic review settings April 10, 2026 09:01
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@yan-3005 yan-3005 self-assigned this Apr 10, 2026
@yan-3005 yan-3005 added safe to test Add this label to run secure Github workflows on PRs To release Will cherry-pick this PR into the release branch backend labels Apr 10, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to prevent OpenMetadata from getting permanently stuck after a partial Flowable schema upgrade by (1) stopping the webserver from running Flowable DDL on startup and (2) making Flowable’s upgrade DDL resilient/idempotent during openmetadata-ops.sh migrate.

Changes:

  • Make Flowable databaseSchemaUpdate mode conditional: runtime validates only; migration runs upgrades.
  • Ensure the migrate CLI explicitly initializes Flowable in migration mode.
  • Add an idempotent JDBC wrapper (IdempotentDdlDataSource/IdempotentDdlStatement) to skip already-existing DDL objects during migration, plus unit tests for schema-update mode selection.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/WorkflowHandler.java Switch Flowable schema-update behavior based on migration vs runtime; wrap schema mismatch with a more actionable exception.
openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/IdempotentDdlDataSource.java Introduce a migration-only DataSource that proxies Connections so statements can be made idempotent.
openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/IdempotentDdlStatement.java Implement JDBC Statement wrapper to pre-check and skip specific Flowable DDL patterns.
openmetadata-service/src/main/java/org/openmetadata/service/util/OpenMetadataOperations.java Run Flowable initialization explicitly during migrate to guarantee schema upgrades happen in CLI context.
openmetadata-service/src/test/java/org/openmetadata/service/governance/workflows/WorkflowHandlerSchemaUpdateTest.java Add unit tests verifying schema-update mode selection and runtime error wrapping.

- Extract shouldSkip() helper; apply idempotency checks to all execute()
  and executeUpdate() overloads, not just execute(String)
- Tighten ALTER TABLE regex with negative lookahead to exclude SQL keywords
  (CONSTRAINT, PRIMARY, UNIQUE, FOREIGN, CHECK, INDEX, KEY) from being
  matched as column names
- IdempotentDdlDataSource now wraps a DataSource delegate instead of calling
  DriverManager directly; uses migrationDataSource() helper in WorkflowHandler
  to resolve from existing DataSource or JDBC params
- Fix InvocationTargetException wrapping in Connection proxy — unwrap cause
  so callers receive the original SQLException
- Wrap all createStatement() variants in the proxy, not just the no-arg form
- Contextual error message in WorkflowHandler — distinguish between server
  startup and migration context
- Add IdempotentDdlStatementTest: 11 tests covering skip/execute for
  CREATE INDEX, CREATE UNIQUE INDEX, CREATE TABLE, ALTER TABLE ADD COLUMN,
  keyword-guarded ALTER TABLE, executeUpdate overload, and pass-through
Copilot AI review requested due to automatic review settings April 10, 2026 10:05
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comment on lines +51 to +62
private Connection wrapConnection(Connection real) {
return (Connection)
Proxy.newProxyInstance(
real.getClass().getClassLoader(),
new Class<?>[] {Connection.class},
(proxy, method, args) -> {
if ("createStatement".equals(method.getName())) {
Statement stmt = (Statement) invokeDelegate(method, real, args);
return new IdempotentDdlStatement(stmt, real);
}
return invokeDelegate(method, real, args);
});
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IdempotentDdlDataSource is central to the idempotent-migration behavior (proxying Connection and wrapping all createStatement* results), but there are no tests validating that all createStatement overloads are wrapped and that exceptions from invokeDelegate are surfaced as the original SQLException (not UndeclaredThrowableException/InvocationTargetException). Adding a focused unit test (mock DataSource/Connection/Statement) would help prevent regressions in the proxy interception logic.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added IdempotentDdlDataSourceTest (5 tests): createStatement() no-arg, two-arg, and three-arg overloads all return IdempotentDdlStatement; SQLException from createStatement surfaces as SQLException directly (not UndeclaredThrowableException/InvocationTargetException); same for non-createStatement methods. All passing.

…robust test indexing

- shouldSkip() returns false immediately for null SQL, preserving JDBC contract
  (delegate handles null and throws the driver's own error)
- drop-create command now calls WorkflowHandler.initialize(config, true) after
  native migrations so it produces a fully startable DB including Flowable tables
- WorkflowHandlerSchemaUpdateTest: replace brittle get(1) with getLast() so the
  test is not sensitive to how many StandaloneProcessEngineConfiguration instances
  are constructed before initializeNewProcessEngine runs
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 10, 2026

🟡 Playwright Results — all passed (23 flaky)

✅ 3637 passed · ❌ 0 failed · 🟡 23 flaky · ⏭️ 89 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 478 0 2 4
🟡 Shard 2 644 0 3 7
🟡 Shard 3 645 0 8 1
🟡 Shard 4 620 0 4 27
✅ Shard 5 616 0 0 42
🟡 Shard 6 634 0 6 8
🟡 23 flaky test(s) (passed on retry)
  • Features/DataAssetRulesDisabled.spec.ts › Verify the ApiEndpoint entity item action after rules disabled (shard 1, 1 retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/ActivityFeed.spec.ts › emoji reactions can be added when feed messages exist (shard 2, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/ChangeSummaryBadge.spec.ts › Automated badge should appear on entity description with Automated source (shard 2, 1 retry)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 1 retry)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 1 retry)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 2 retries)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 1 retry)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 1 retry)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 1 retry)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Create DataProducts and add remove assets (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Rename domain with tags and glossary terms preserves associations (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Multiple consecutive domain renames preserve all associations (shard 4, 1 retry)
  • Pages/HyperlinkCustomProperty.spec.ts › should display URL when no display text is provided (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Search Index (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/ServiceEntity.spec.ts › Announcement create, edit & delete (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

private Connection wrapConnection(Connection real) {
return (Connection)
Proxy.newProxyInstance(
real.getClass().getClassLoader(),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could be tricky: we are manipulating via reflection

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mohityadav766 The alternative is a 250-line concrete Connection implementation that delegates all 40+ methods just to override 3, which is more code with no functional difference, we can go by this approach as well, ill go with your take

Copilot AI review requested due to automatic review settings April 15, 2026 18:29
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Apr 16, 2026

Code Review ✅ Approved 5 resolved / 5 findings

Flowable schema upgrade logic is now idempotent to handle partial migrations safely, with five issues resolved including execute() overload fixes, connection pooling correction, ALTER TABLE regex refinement, new DDL interception unit tests, and improved exception assertion coverage.

✅ 5 resolved
Bug: execute() overloads bypass idempotent DDL checks

📄 openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/IdempotentDdlStatement.java:272-283 📄 openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/IdempotentDdlDataSource.java:75
The IdempotentDdlStatement only intercepts the single-arg execute(String sql) method. The three other execute overloads (execute(String, int), execute(String, int[]), execute(String, String[])) and all executeUpdate overloads delegate directly to the underlying statement without any idempotency checks. If Flowable (or a future Flowable version) uses any of these alternative methods for DDL execution, the idempotent protection is silently bypassed and the migration will fail with 'already exists' errors — exactly the scenario this PR is meant to fix.

Similarly, ConnectionHandler.invoke() only wraps createStatement() with zero args. The overloads createStatement(int, int) and createStatement(int, int, int) return unwrapped statements.

Edge Case: IdempotentDdlDataSource creates unpooled connections

📄 openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/IdempotentDdlDataSource.java:49-51 📄 openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/IdempotentDdlDataSource.java:54-56
The IdempotentDdlDataSource uses raw DriverManager.getConnection() on every getConnection() call with no connection pooling. During a Flowable schema upgrade, Flowable may open and close many connections. Each one creates a new TCP connection to the database, which is expensive and may exhaust connection limits on constrained environments.

Additionally, there is an implicit dependency on the JDBC driver already being registered via a previous class load — this is not guaranteed if the migration CLI code path changes in the future.

Edge Case: ALTER TABLE regex may mis-capture reserved words as column names

📄 openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/IdempotentDdlStatement.java:41-42
The ALTER_TABLE_ADD_COLUMN_PATTERN regex alter\s+table\s+(\S+)\s+add\s+(?:column\s+)?(\S+)\s captures the first non-whitespace token after ADD [COLUMN] as the column name. If Flowable ever emits DDL like ALTER TABLE t ADD CONSTRAINT ... or ALTER TABLE t ADD PRIMARY KEY ..., the regex would match with 'CONSTRAINT' or 'PRIMARY' captured as a column name, leading to a bogus columnExists check that likely returns false — so the DDL executes normally, which is correct by accident. However, it could also match ALTER TABLE t ADD INDEX ... on MySQL, where the captured 'INDEX' would pass the column check and the DDL would execute (also correct). So this is low-risk but worth noting for maintainability.

Quality: No unit tests for IdempotentDdlStatement DDL interception logic

📄 openmetadata-service/src/test/java/org/openmetadata/service/governance/workflows/WorkflowHandlerSchemaUpdateTest.java
The new test class WorkflowHandlerSchemaUpdateTest verifies the schema-update mode selection (TRUE vs FALSE) and the FlowableWrongDbException wrapping, but there are no tests for the core idempotency logic in IdempotentDdlStatement — i.e., that CREATE INDEX/TABLE/ALTER TABLE ADD COLUMN statements are correctly detected and skipped when objects already exist. This is the most critical new behavior in the PR and the most likely place for regex or metadata-query bugs to hide.

Quality: Exception identity not asserted in exception-surfacing tests

📄 openmetadata-service/src/test/java/org/openmetadata/service/governance/workflows/IdempotentDdlDataSourceTest.java:85-93 📄 openmetadata-service/src/test/java/org/openmetadata/service/governance/workflows/IdempotentDdlDataSourceTest.java:95-103
Both sqlExceptionFromConnectionSurfacesDirectly (line 85) and sqlExceptionFromNonCreateStatementMethodSurfacesDirectly (line 95) declare an expected SQLException but never assert that the caught exception is the same instance (or has the same message). The tests only verify that some SQLException is thrown, not that the original exception passes through unwrapped. If the proxy were to accidentally wrap the exception in another SQLException or UndeclaredThrowableException, these tests would still pass.

Additionally, assertInstanceOf(SQLException.class, actual) on line 92 is redundant — assertThrows(SQLException.class, ...) already guarantees the type.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud
Copy link
Copy Markdown

@github-actions
Copy link
Copy Markdown
Contributor

Changes have been cherry-picked to the 1.12.6 branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs To release Will cherry-pick this PR into the release branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flowable migration breaks if things go wrong during the fresh install

4 participants