fix(workflows): make Flowable schema upgrades idempotent to survive partial migrations#27234
fix(workflows): make Flowable schema upgrades idempotent to survive partial migrations#27234
Conversation
…artial migrations Fixes #26048. When the server crashed mid-startup during a Flowable schema upgrade, the DB was left in a partially-migrated state. On restart, Flowable re-ran the same DDL and failed on already-existing objects (indexes, tables, columns), permanently wedging both the server and migrate --force. Changes: 1. WorkflowHandler: webserver now uses DB_SCHEMA_UPDATE_FALSE — it validates the schema but never runs DDL. Only migrate CLI uses DB_SCHEMA_UPDATE_TRUE. 2. OpenMetadataOperations: explicit WorkflowHandler.initialize(config, true) inside the migrate command so Flowable DDL always runs during migration. 3. WorkflowHandler: catches FlowableWrongDbException on webserver startup and rethrows with an actionable message directing the operator to run migrate. 4. IdempotentDdlDataSource + IdempotentDdlStatement: JDBC DataSource wrapper used exclusively in migration context. Intercepts execute(sql) for CREATE INDEX, CREATE TABLE, and ALTER TABLE ADD COLUMN and pre-checks existence via standard DatabaseMetaData (getIndexInfo, getTables, getColumns) before executing. If the object already exists it logs a skip and returns — no SQL state codes, no string matching, works on MySQL and PostgreSQL. Unit tests cover schema-update mode selection in both contexts.
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
There was a problem hiding this comment.
Pull request overview
This PR aims to prevent OpenMetadata from getting permanently stuck after a partial Flowable schema upgrade by (1) stopping the webserver from running Flowable DDL on startup and (2) making Flowable’s upgrade DDL resilient/idempotent during openmetadata-ops.sh migrate.
Changes:
- Make Flowable
databaseSchemaUpdatemode conditional: runtime validates only; migration runs upgrades. - Ensure the migrate CLI explicitly initializes Flowable in migration mode.
- Add an idempotent JDBC wrapper (
IdempotentDdlDataSource/IdempotentDdlStatement) to skip already-existing DDL objects during migration, plus unit tests for schema-update mode selection.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/WorkflowHandler.java | Switch Flowable schema-update behavior based on migration vs runtime; wrap schema mismatch with a more actionable exception. |
| openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/IdempotentDdlDataSource.java | Introduce a migration-only DataSource that proxies Connections so statements can be made idempotent. |
| openmetadata-service/src/main/java/org/openmetadata/service/governance/workflows/IdempotentDdlStatement.java | Implement JDBC Statement wrapper to pre-check and skip specific Flowable DDL patterns. |
| openmetadata-service/src/main/java/org/openmetadata/service/util/OpenMetadataOperations.java | Run Flowable initialization explicitly during migrate to guarantee schema upgrades happen in CLI context. |
| openmetadata-service/src/test/java/org/openmetadata/service/governance/workflows/WorkflowHandlerSchemaUpdateTest.java | Add unit tests verifying schema-update mode selection and runtime error wrapping. |
- Extract shouldSkip() helper; apply idempotency checks to all execute() and executeUpdate() overloads, not just execute(String) - Tighten ALTER TABLE regex with negative lookahead to exclude SQL keywords (CONSTRAINT, PRIMARY, UNIQUE, FOREIGN, CHECK, INDEX, KEY) from being matched as column names - IdempotentDdlDataSource now wraps a DataSource delegate instead of calling DriverManager directly; uses migrationDataSource() helper in WorkflowHandler to resolve from existing DataSource or JDBC params - Fix InvocationTargetException wrapping in Connection proxy — unwrap cause so callers receive the original SQLException - Wrap all createStatement() variants in the proxy, not just the no-arg form - Contextual error message in WorkflowHandler — distinguish between server startup and migration context - Add IdempotentDdlStatementTest: 11 tests covering skip/execute for CREATE INDEX, CREATE UNIQUE INDEX, CREATE TABLE, ALTER TABLE ADD COLUMN, keyword-guarded ALTER TABLE, executeUpdate overload, and pass-through
| private Connection wrapConnection(Connection real) { | ||
| return (Connection) | ||
| Proxy.newProxyInstance( | ||
| real.getClass().getClassLoader(), | ||
| new Class<?>[] {Connection.class}, | ||
| (proxy, method, args) -> { | ||
| if ("createStatement".equals(method.getName())) { | ||
| Statement stmt = (Statement) invokeDelegate(method, real, args); | ||
| return new IdempotentDdlStatement(stmt, real); | ||
| } | ||
| return invokeDelegate(method, real, args); | ||
| }); |
There was a problem hiding this comment.
IdempotentDdlDataSource is central to the idempotent-migration behavior (proxying Connection and wrapping all createStatement* results), but there are no tests validating that all createStatement overloads are wrapped and that exceptions from invokeDelegate are surfaced as the original SQLException (not UndeclaredThrowableException/InvocationTargetException). Adding a focused unit test (mock DataSource/Connection/Statement) would help prevent regressions in the proxy interception logic.
There was a problem hiding this comment.
Added IdempotentDdlDataSourceTest (5 tests): createStatement() no-arg, two-arg, and three-arg overloads all return IdempotentDdlStatement; SQLException from createStatement surfaces as SQLException directly (not UndeclaredThrowableException/InvocationTargetException); same for non-createStatement methods. All passing.
…robust test indexing - shouldSkip() returns false immediately for null SQL, preserving JDBC contract (delegate handles null and throws the driver's own error) - drop-create command now calls WorkflowHandler.initialize(config, true) after native migrations so it produces a fully startable DB including Flowable tables - WorkflowHandlerSchemaUpdateTest: replace brittle get(1) with getLast() so the test is not sensitive to how many StandaloneProcessEngineConfiguration instances are constructed before initializeNewProcessEngine runs
🟡 Playwright Results — all passed (23 flaky)✅ 3637 passed · ❌ 0 failed · 🟡 23 flaky · ⏭️ 89 skipped
🟡 23 flaky test(s) (passed on retry)
How to debug locally# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip # view trace |
| private Connection wrapConnection(Connection real) { | ||
| return (Connection) | ||
| Proxy.newProxyInstance( | ||
| real.getClass().getClassLoader(), |
There was a problem hiding this comment.
this could be tricky: we are manipulating via reflection
There was a problem hiding this comment.
@mohityadav766 The alternative is a 250-line concrete Connection implementation that delegates all 40+ methods just to override 3, which is more code with no functional difference, we can go by this approach as well, ill go with your take
Code Review ✅ Approved 5 resolved / 5 findingsFlowable schema upgrade logic is now idempotent to handle partial migrations safely, with five issues resolved including execute() overload fixes, connection pooling correction, ALTER TABLE regex refinement, new DDL interception unit tests, and improved exception assertion coverage. ✅ 5 resolved✅ Bug: execute() overloads bypass idempotent DDL checks
✅ Edge Case: IdempotentDdlDataSource creates unpooled connections
✅ Edge Case: ALTER TABLE regex may mis-capture reserved words as column names
✅ Quality: No unit tests for IdempotentDdlStatement DDL interception logic
✅ Quality: Exception identity not asserted in exception-surfacing tests
OptionsDisplay: compact → Showing less information. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
|
|
Changes have been cherry-picked to the 1.12.6 branch. |



Performance Note
The idempotent DDL wrapper only activates during
migrate— never on server startup. Flowable was upgraded in a single OpenMetadata release, so the upgrade scripts run exactly once per installation. After the schema reaches7.2.0.2, Flowable skips all upgrade scripts on subsequent migrations and never callsexecute()on the statement — the wrapper is created but never intercepted. There is no ongoing performance cost.Fixes #26048
Background
The original reproduction path described in the ticket (wrong Elasticsearch password causing startup failure) no longer works on current `main`. Two indexing refactor commits changed this:
To reproduce the stuck state today, the DB must be tampered directly — the underlying race condition is still real and can still be triggered by a crash mid-Flowable-upgrade.
How to Reproduce (tampered data method)
Root Cause
`WorkflowHandler` always set `DB_SCHEMA_UPDATE_TRUE`, so the webserver ran Flowable DDL on every startup. A crash between DDL execution and the version row update left the DB in a state where neither the server nor migrate could recover — because Flowable's DDL is not idempotent (no `IF NOT EXISTS`).
Design Tradeoffs — Approaches Considered and Rejected
Option 1: Correct the version row directly
Idea: When migrate detects a stuck state, just `UPDATE ACT_GE_PROPERTY SET VALUE_ = '7.2.0.2'` to make Flowable think migration is complete.
Why rejected: Flowable sees the version as current and skips all upgrade scripts entirely. If the crash happened mid-migration — some DDL executed, some not — the missing DDL never runs. The DB is now permanently inconsistent with no way to recover without manual intervention. This trades one failure mode for a worse, silent one.
Option 2: Catch SQL state codes
Idea: Wrap Flowable's DDL execution, catch `SQLException`, inspect `getSQLState()` — `42P07` on PostgreSQL ("relation already exists"), `1061` on MySQL ("Duplicate key name") — and skip.
Why rejected: SQL state codes vary across DB versions, JDBC driver versions, and sometimes across minor releases. A code that works today can change in a patch release. Fragile across the two DBs (MySQL + PostgreSQL) this project supports.
Option 3: Match error message strings
Idea: Catch the exception and check if the message contains "already exists" / "relation already exists".
Why rejected: Error messages are not part of any SQL standard. They differ by DB engine, locale settings, and driver version. String matching on exception messages is the most fragile contract available.
Option 4: Rewrite DDL with IF NOT EXISTS
Idea: Intercept the SQL before execution and rewrite `CREATE INDEX idx ON t` → `CREATE INDEX IF NOT EXISTS idx ON t`.
Why rejected: `CREATE INDEX IF NOT EXISTS` is not supported on many MySQL versions. Would require two different rewrite paths (MySQL vs PostgreSQL) and still doesn't handle `ALTER TABLE ADD COLUMN` idempotently.
Option 5 (chosen): DatabaseMetaData pre-checks
Idea: Before executing any DDL statement, query the DB's own catalog via standard JDBC `DatabaseMetaData` to check whether the object already exists. If yes, skip. If no, execute.
Why this works:
Fix — 4 changes
1. Webserver no longer runs Flowable DDL
`WorkflowHandler.setDatabaseSchemaUpdate` is now conditional:
2. Migrate CLI guarantees Flowable DDL runs
Explicit `WorkflowHandler.initialize(config, true)` added in `OpenMetadataOperations` after `validateAndRunSystemDataMigrations`.
3. Operator-friendly error on schema mismatch
Webserver now catches `FlowableWrongDbException` and rethrows with:
4. Idempotent DDL wrapper (new: `IdempotentDdlDataSource` + `IdempotentDdlStatement`)
Used exclusively in migration context. Wraps the JDBC `DataSource` via a JDK proxy. `IdempotentDdlStatement` intercepts `execute(sql)` for three DDL patterns:
If the object already exists → log skip, return. Otherwise → execute normally. All other SQL passes through unchanged.
Verification
After the fix, repeating the tamper-and-recover steps above:
Test Plan