Skip to content

fix(ingestion): replace MetaData.reflect() with direct SQL in Greenplum DDL extraction#27445

Open
ShivamChavan01 wants to merge 2 commits intoopen-metadata:mainfrom
ShivamChavan01:fix/greenplum-ddl-metadata-reflect-27405
Open

fix(ingestion): replace MetaData.reflect() with direct SQL in Greenplum DDL extraction#27445
ShivamChavan01 wants to merge 2 commits intoopen-metadata:mainfrom
ShivamChavan01:fix/greenplum-ddl-metadata-reflect-27405

Conversation

@ShivamChavan01
Copy link
Copy Markdown

Fixes #27405

  • Replace MetaData.reflect() in get_all_table_ddls() with single pg_catalog SQL query, reduces ~16,000 catalog queries per schema down to 1
  • Use pg_get_expr(adbin, adrelid) for defaults, adsrc removed in PG12+
  • Add GREENPLUM_TABLE_DDLS in queries.py with DISTRIBUTED BY via gp_distribution_policy and WITH (reloptions) for Greenplum-specific DDL
  • Wire query through utils.py following the get_view_definition pattern
  • Remove dead CreateTable and MetaData imports from sqlalchemy_utils.py

Describe your changes:

Fixes #27405

  • get_table_ddl in sqlalchemy_utils.py was passing query=None into get_table_ddl_wrapper, which fell through to MetaData.reflect().This fires ~8 catalog queries per table, so ~16,000 per schema with 2000 tables. With 24+ schemas and the connection never committed, AccessShareLock piles up for hours blocking other users.

  • Fixed by wiring up the query param that was already in the signature but swallowed by **kw. Added GREENPLUM_TABLE_DDLS in queries.py and passed it through utils.py following the same pattern get_view_definition uses.

  • Used pg_get_expr(d.adbin, d.adrelid) instead of d.adsrc (removed in PG12). GREENPLUM_TABLE_DDLS includes WITH (reloptions) and DISTRIBUTED BY via gp_distribution_policy. Removed dead CreateTable and MetaData imports.

Tested against local PostgreSQL 14. @KrasnovidKE confirmed pg_get_expr works on their live Greenplum 6 instance.

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes #27405: Fix Greenplum DDL extraction causing thousands of catalog queries per schema
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.
  • I have added a test that covers the exact scenario we are fixing. For complex issues, comment the issue number in the test for future reference.
  • I have added tests around the new logic.
  • For connector/ingestion changes: I updated the documentation.

…m DDL

Fixes open-metadata#27405

- Replace MetaData.reflect() in get_all_table_ddls() with single pg_catalog
  SQL query, reduces ~16,000 catalog queries per schema down to 1
- Use pg_get_expr(adbin, adrelid) for defaults, adsrc removed in PG12+
- Add GREENPLUM_TABLE_DDLS in queries.py with DISTRIBUTED BY via
  gp_distribution_policy and WITH (reloptions) for Greenplum-specific DDL
- Wire query through utils.py following the get_view_definition pattern
- Remove dead CreateTable and MetaData imports from sqlalchemy_utils.py
@ShivamChavan01 ShivamChavan01 requested a review from a team as a code owner April 16, 2026 22:01
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@ShivamChavan01 ShivamChavan01 changed the title fix(ingestion): replace MetaData.reflect() with bulk SQL for Greenplu… fix(ingestion): replace MetaData.reflect() with direct SQL in Greenplum DDL extraction Apr 16, 2026
athena, mysql, exasol pass query=None via the default get_table_ddl in
sqlalchemy_utils.py. Without this guard, connection.execute(None) crashes.
Early return preserves existing behaviour for connectors without a custom
DDL query.
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Apr 16, 2026

Code Review ✅ Approved 1 resolved / 1 findings

Refactors Greenplum DDL extraction to use direct SQL instead of MetaData.reflect(), resolving the regression that previously broke DDL generation for other connectors.

✅ 1 resolved
Bug: Removing MetaData.reflect() fallback breaks DDL for other connectors

📄 ingestion/src/metadata/utils/sqlalchemy_utils.py:146-160 📄 ingestion/src/metadata/utils/sqlalchemy_utils.py:171-180
The default get_table_ddl in sqlalchemy_utils.py:171 passes query=None to get_table_ddl_wrapper, which forwards it to get_all_table_ddls. With this PR, get_all_table_ddls now calls connection.execute(text(query)) — but when query is None, isinstance(None, str) is False, so it falls to connection.execute(None), which will crash.

Three connectors import this default get_table_ddl from sqlalchemy_utils: athena, mysql, and exasol. Previously, get_all_table_ddls used MetaData.reflect() as a fallback (ignoring the query param entirely), so DDL extraction worked for these connectors. After this change, DDL extraction silently fails for all of them (the caller in common_db_source.py:486 catches the exception and returns None).

This is a regression for any connector that doesn't provide its own get_table_ddl override with a custom query.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@ShivamChavan01
Copy link
Copy Markdown
Author

PR up at #27445 addresses the MetaData.reflect() regression and the query=None fallback for other connectors (athena, mysql, exasol). Ready for review when you get a chance @PubChimps

@ShivamChavan01
Copy link
Copy Markdown
Author

Hey @PubChimps, I've addressed the query=None regression flagged by the bot (fallback added for athena/mysql/exasol). Could you add the safe to test label when you get a chance so CI can run? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] MetaData.reflect() in get_all_table_ddls causes excessive catalog queries and long-lived idle transactions when includeDDL is enabled

1 participant