Fixes #24348: Strip URL scheme from hostPort to prevent ValueError#27191
Fixes #24348: Strip URL scheme from hostPort to prevent ValueError#27191RajdeepKushwaha5 wants to merge 8 commits intoopen-metadata:mainfrom
Conversation
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
1 similar comment
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
There was a problem hiding this comment.
Pull request overview
This PR prevents ingestion crashes when users mistakenly enter a full URL (e.g. http://localhost:3306) in hostPort by introducing a shared sanitization helper and applying it before hostPort parsing/URL construction, along with added test coverage.
Changes:
- Added
clean_host_port()utility to strip URL schemes (and trailing slashes) before parsinghostPort. - Updated multiple connectors and shared builders to call
clean_host_port()before splitting/usinghostPort. - Added/extended unit tests to cover scheme-prefixed
hostPortinputs and connection URL generation.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| ingestion/src/metadata/utils/db_utils.py | Adds clean_host_port() and uses it in get_host_from_host_port() to sanitize scheme-prefixed inputs. |
| ingestion/src/metadata/ingestion/connections/builders.py | Applies clean_host_port() when generating DB URLs and when splitting host/port for IAM token generation. |
| ingestion/src/metadata/ingestion/source/database/cassandra/connection.py | Uses clean_host_port() before splitting Cassandra hostPort. |
| ingestion/src/metadata/ingestion/source/database/db2/connection.py | Uses clean_host_port() for DB2 host extraction and port parsing. |
| ingestion/src/metadata/ingestion/source/database/databricks/client.py | Uses clean_host_port() to derive the Databricks workspace base host. |
| ingestion/src/metadata/ingestion/source/database/databricks/auth.py | Uses clean_host_port() when building Databricks SDK host for auth flows. |
| ingestion/src/metadata/ingestion/source/database/redshift/connection.py | Uses clean_host_port() when deriving host for IAM flow type detection. |
| ingestion/tests/unit/test_db_utils.py | Adds tests for clean_host_port() and scheme-prefixed get_host_from_host_port() inputs. |
| ingestion/tests/unit/test_build_connection_url.py | Adds integration-style tests ensuring MySQL/Postgres URL building works with scheme-prefixed hostPort. |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
8cec367 to
fcfd026
Compare
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
fcfd026 to
0b955fb
Compare
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
0b955fb to
6a779c3
Compare
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
ingestion/src/metadata/ingestion/source/database/db2/connection.py:90
- In
_get_ibmi_connection_args(),port_str = host_port.split(":")[1]is fragile: it will fail for bracketed IPv6 ([::1]:50000) and will also misbehave ifclean_host_port()ever returns an unbracketed IPv6 host (e.g. fromhttp://[::1]:50000). Prefer splitting from the right once (e.g.rsplit(':', 1)) and/or explicitly handling[...]:portto reliably extract the port.
args = get_connection_args_common(connection)
host_port = clean_host_port(connection.hostPort)
if ":" in host_port:
port_str = host_port.split(":")[1]
try:
args["port"] = int(port_str)
|
ping @harshach |
| url = f"{connection.scheme.value}://" | ||
| url += f"{quote_plus(username)}:{quote_plus(password)}@" | ||
| url += connection.hostPort | ||
| url += clean_host_port(connection.hostPort) |
There was a problem hiding this comment.
what happens to other connectors? why are we making changes only to few connectors that to in a connection.py?
6a779c3 to
8ed98aa
Compare
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
@harshach thanks for the feedback, you're absolutely right. Patching individual connector files was the wrong approach. I've reworked this completely. The fix is now centralised in What changed:
The diff is now 4 files, 145 additions, 6 deletions (down from 9 files). Verified locally: |
c654b9e to
31c851f
Compare
|
|
||
| # HTTP prefix is stripped | ||
| self.assertEqual(clean_host_port("http://localhost:3306"), "localhost:3306") | ||
| self.assertEqual(clean_host_port("http://example.com:8080"), "example.com:8080") |
There was a problem hiding this comment.
New assertions in this test exceed Black's default formatting width and will cause make py_format_check (black --check) to fail. Please run make py_format (or manually wrap these calls) so Black can reformat them consistently.
| self.assertEqual(clean_host_port("http://example.com:8080"), "example.com:8080") | |
| self.assertEqual( | |
| clean_host_port("http://example.com:8080"), "example.com:8080" | |
| ) |
| export const ComboBox = ({ | ||
| placeholder = 'Search', | ||
| shortcut = true, | ||
| size = 'sm', | ||
| fontSize = 'md', |
There was a problem hiding this comment.
The PR description says this change is limited to 4 ingestion files, but this PR also modifies UI core components (openmetadata-ui-core-components/...). Please either update the PR description to reflect the UI changes (and why they are needed for this fix) or split the UI changes into a separate PR to keep the scope focused.
| Raises ValueError if the scheme carries a non-numeric port so the user | ||
| gets a clear error instead of a silently broken hostPort. | ||
| """ |
There was a problem hiding this comment.
The _strip_hostport_scheme docstring says it “Raises ValueError if the scheme carries a non-numeric port”, but the JDBC-style not hostname fallback returns the stripped tail without port validation. Please either validate the port in that branch as well, or clarify the docstring to match the behavior.
| for sep in ("/", "?", "#"): | ||
| tail = tail.split(sep, 1)[0] | ||
| if "@" in tail: | ||
| tail = tail.rsplit("@", 1)[-1] |
There was a problem hiding this comment.
In the not hostname fallback branch, the function returns the stripped tail without validating the port component. For inputs like jdbc:postgresql://host:abc/db, this will return host:abc and the original int(port) failure can still happen later, which undermines the goal of producing a clearer error for non-numeric ports (and also makes the docstring claim about raising on non-numeric ports incomplete). Consider parsing tail for a host:port pattern and raising a ValueError when the port is present but non-numeric.
| tail = tail.rsplit("@", 1)[-1] | |
| tail = tail.rsplit("@", 1)[-1] | |
| fallback_port = None | |
| if tail.startswith("["): | |
| closing_bracket = tail.find("]") | |
| if closing_bracket != -1 and len(tail) > closing_bracket + 1: | |
| if tail[closing_bracket + 1] == ":": | |
| fallback_port = tail[closing_bracket + 2 :] | |
| elif ":" in tail: | |
| _, fallback_port = tail.rsplit(":", 1) | |
| if fallback_port and not fallback_port.isdigit(): | |
| raise ValueError( | |
| f"Invalid hostPort '{safe_label}'. Expected format is " | |
| "'hostname[:port]' (e.g. 'localhost:3306')." | |
| ) |
The broad 'except Exception' in model_post_init() was swallowing the intentional ValueError that clean_host_port raises for invalid ports (e.g. 'http://localhost:abc'), leaving a malformed hostPort on the connection object and turning a clear user error into an obscure downstream failure. The local import of clean_host_port is from a sibling module that always exists, so the try/except was unnecessary defence. Removed it so pydantic surfaces validation errors where they belong.
…r import The model_post_init hook added in the previous commit imported `clean_host_port` from `metadata.utils.db_utils` at runtime. That module has a heavy dependency graph (OpenMetadata client, LineageParser, generated schemas), so every *Connection class construction would drag it in and cause circular-import failures during integration test bootstrap. The file docstring explicitly requires BaseModel to be 'self-sufficient with only pydantic at import time', so the hostPort sanitisation is now handled by a stdlib-only helper (_strip_hostport_scheme) colocated with BaseModel. `clean_host_port` remains the public API in db_utils and delegates to the same helper, preserving behaviour for all existing callers and unit tests.
- Renames _strip_hostport_scheme to public strip_hostport_scheme so cross-module imports (db_utils.py) no longer depend on a private symbol; private alias kept for back-compat. - Validates the port in the JDBC-style fallback branch so inputs like 'jdbc:postgresql://host:abc/db' raise ValueError with a clear message, matching the docstring contract instead of silently passing a broken hostPort downstream. - Extends test_clean_host_port with the new fallback raise cases.
| # Validate the port in the fallback path so the same ValueError | ||
| # contract holds for JDBC-style URLs (e.g. 'jdbc:postgresql://host:abc'). | ||
| fallback_port: Optional[str] = None | ||
| if tail.startswith("["): | ||
| closing = tail.find("]") | ||
| if closing != -1 and len(tail) > closing + 1 and tail[closing + 1] == ":": | ||
| fallback_port = tail[closing + 2 :] | ||
| elif ":" in tail: | ||
| fallback_port = tail.rsplit(":", 1)[1] | ||
| if fallback_port and not fallback_port.isdigit(): | ||
| raise ValueError( | ||
| f"Invalid hostPort '{safe_label}'. Expected format is " | ||
| "'hostname[:port]' (e.g. 'localhost:3306')." | ||
| ) |
There was a problem hiding this comment.
In the hostname-empty fallback branch, malformed inputs like http:// or http:///localhost:3306 will produce an empty tail and be returned as an empty hostPort, which can lead to confusing downstream errors. It would be safer to validate that the extracted tail is non-empty (and ideally contains a hostname) and raise the same ValueError if it can't be parsed into hostname[:port].
| # Validate the port in the fallback path so the same ValueError | |
| # contract holds for JDBC-style URLs (e.g. 'jdbc:postgresql://host:abc'). | |
| fallback_port: Optional[str] = None | |
| if tail.startswith("["): | |
| closing = tail.find("]") | |
| if closing != -1 and len(tail) > closing + 1 and tail[closing + 1] == ":": | |
| fallback_port = tail[closing + 2 :] | |
| elif ":" in tail: | |
| fallback_port = tail.rsplit(":", 1)[1] | |
| if fallback_port and not fallback_port.isdigit(): | |
| raise ValueError( | |
| f"Invalid hostPort '{safe_label}'. Expected format is " | |
| "'hostname[:port]' (e.g. 'localhost:3306')." | |
| ) | |
| if not tail: | |
| raise ValueError( | |
| f"Invalid hostPort '{safe_label}'. Expected format is " | |
| "'hostname[:port]' (e.g. 'localhost:3306')." | |
| ) | |
| # Validate the fallback path so the same ValueError contract holds for | |
| # JDBC-style URLs and malformed inputs that urlparse could not map to | |
| # a standard hostname. | |
| fallback_parsed = urlparse(f"//{tail}") | |
| try: | |
| fallback_hostname = fallback_parsed.hostname or "" | |
| _ = fallback_parsed.port | |
| except ValueError as exc: | |
| raise ValueError( | |
| f"Invalid hostPort '{safe_label}'. Expected format is " | |
| "'hostname[:port]' (e.g. 'localhost:3306')." | |
| ) from exc | |
| if not fallback_hostname: | |
| raise ValueError( | |
| f"Invalid hostPort '{safe_label}'. Expected format is " | |
| "'hostname[:port]' (e.g. 'localhost:3306')." | |
| ) |
…common in builders.py The model_post_init hook in custom_pydantic.py was too broad — it stripped URL schemes from ALL *Connection classes including AirflowConnection and OpenMetadataConnection where http://localhost:8080 is a legitimate value. This broke all py-tests integration tests. Fix: remove stripping from model_post_init; apply strip_hostport_scheme() only in get_connection_url_common() in builders.py, which is called exclusively by database source connectors. Also address two Copilot review issues in strip_hostport_scheme: - safe_label now uses a sanitised tail for JDBC/empty-hostname inputs instead of the generic 'URL with scheme' string - Empty tail after scheme strip raises ValueError instead of silently returning an empty hostPort - Fallback port validation now uses urlparse('//tail') for consistent IPv6 handling, matching the Copilot suggested fix
…by stripping in model_post_init Addresses gitar-bot review feedback on PR open-metadata#27191: moving strip_hostport_scheme to get_connection_url_common left several connectors that parse hostPort manually still vulnerable to the original crash (open-metadata#24348): - cassandra/connection.py: host, port = connection.hostPort.split(':') - db2/connection.py: connection.hostPort.split(':')[0] - databricks/auth.py: connection.hostPort.split(':')[0] - redshift/connection.py: connection.hostPort.split(':')[0] - microsoftfabric/connection.py: connection.hostPort.split(':')[0] - builders.py (IAM auth): connection.hostPort.split(':') Restore strip_hostport_scheme call to model_post_init so scheme stripping happens at construction time for ALL *Connection classes, protecting every downstream split site automatically. Exclude AirflowConnection and OpenMetadataConnection via _HOSTPORT_URL_ALLOWED since their hostPort field legitimately stores a full URL (e.g. http://localhost:8080 and http://localhost:8585/api). The call in get_connection_url_common is kept for defence-in-depth; it is idempotent because strip_hostport_scheme short-circuits when '://' is absent. Add three new unit tests to test_connection_builders.py covering: - URL scheme stripping at model construction - ValueError raised for invalid port at construction time - URL-allowed connections are not stripped
| # (e.g. "http://airflow:8080" or "http://localhost:8585/api") and must NOT | ||
| # have their scheme stripped by model_post_init. | ||
| _HOSTPORT_URL_ALLOWED: frozenset = frozenset( | ||
| {"AirflowConnection", "OpenMetadataConnection"} | ||
| ) |
There was a problem hiding this comment.
_HOSTPORT_URL_ALLOWED is far too narrow for model_post_init scheme stripping. Many non-database *Connection models define hostPort as a full URL (schema format: "uri"), e.g. GrafanaConnection ("URL to the Grafana instance"), AirbyteConnection ("Pipeline Service Management/UI URL"), ElasticSearchConnection, etc. With the current allowlist, constructing those connections will silently mutate hostPort by stripping the scheme and potentially breaking downstream clients that require a URL. Consider restricting scheme stripping to database connection models only (e.g. self.class.module contains ".services.connections.database.") or switching to an explicit allowlist of models where hostPort is not a URL.
| if ( | ||
| hasattr(self, "hostPort") | ||
| and self.__class__.__name__ not in _HOSTPORT_URL_ALLOWED | ||
| ): | ||
| self.hostPort = strip_hostport_scheme(self.hostPort) |
There was a problem hiding this comment.
model_post_init calls strip_hostport_scheme(self.hostPort) whenever the model has a hostPort attribute. If hostPort is optional and left unset/None (several schemas do not list hostPort in "required"), this will raise an AttributeError at model construction time. Guard the call with an isinstance(self.hostPort, str) check (and ideally only run when '://' is present) so Connection models without hostPort can still be instantiated.
| if ( | |
| hasattr(self, "hostPort") | |
| and self.__class__.__name__ not in _HOSTPORT_URL_ALLOWED | |
| ): | |
| self.hostPort = strip_hostport_scheme(self.hostPort) | |
| host_port = getattr(self, "hostPort", None) | |
| if ( | |
| isinstance(host_port, str) | |
| and "://" in host_port | |
| and self.__class__.__name__ not in _HOSTPORT_URL_ALLOWED | |
| ): | |
| self.hostPort = strip_hostport_scheme(host_port) |
…or hostPort stripping - Replace _HOSTPORT_URL_ALLOWED frozenset (only 2 entries) with _DATABASE_CONNECTION_MODULE_MARKER that checks the class module path. Only classes under .services.connections.database. are subject to scheme stripping; dashboard, pipeline, metadata, and all other connector categories pass through unchanged because their hostPort legitimately holds a full URL (e.g. http://grafana:3000). - Add isinstance(host_port, str) guard in model_post_init to prevent AttributeError when hostPort is Optional and left unset (None). - Update test_connection_builders.py: replace _HOSTPORT_URL_ALLOWED import with _DATABASE_CONNECTION_MODULE_MARKER, rename and update the allowlist test to reflect the new module-path strategy, and add a regression test for the None hostPort crash (CassandraConnection with default hostPort). Fixes gitar-bot Bug 1 (None crash) and Bug 2 (incomplete class-name allowlist that failed to cover non-database URL-based connectors).
Code Review ✅ Approved 8 resolved / 8 findingsConsolidates hostPort parsing logic to safely strip URL schemes and prevent crashes. All reported exceptions in the JDBC, Cassandra, and IPv6 connector paths have been resolved. ✅ 8 resolved✅ Edge Case: Cassandra split(":") crashes when hostPort has no port
✅ Edge Case: JDBC fallback path preserves userinfo credentials in output
✅ Edge Case: Unguarded clean_host_port in model_post_init can crash model construction
✅ Edge Case: clean_host_port drops IPv6 brackets, breaking downstream splits
✅ Edge Case: Broad except swallows intentional ValueError for invalid ports
...and 3 more resolved from earlier reviews OptionsDisplay: compact → Showing less information. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
|
The Python checkstyle failed. Please run You can install the pre-commit hooks with |
|



Describe your changes:
Fixes #24348
When users enter a full URL like
http://localhost:3306orhttps://mydb.example.com:5432in thehostPortconnection field instead of justlocalhost:3306, the ingestion framework crashes with an unhelpfulValueError: invalid literal for int() with base 10.Root cause: Code throughout the ingestion framework calls
hostPort.split(":")expecting at most 2 parts (host:port), but a URL likehttp://localhost:3306splits into 3 parts (["http", "//localhost", "3306"]), causing either unpacking errors orint("//localhost")failures.Fix: Centralised
hostPortsanitisation inBaseModel.model_post_init()(custom_pydantic.py), the base class for every generated Pydantic model. When any*Connectionclass is constructed with ahostPortcontaining://, the scheme prefix is automatically stripped before any connector code ever accesses it.This covers all 38+ connectors in a single place — no per-connector patches needed, and future connectors are automatically protected.
Changes (4 files):
custom_pydantic.py— AddedhostPortsanitisation tomodel_post_init()(~15 lines)db_utils.py— Addedclean_host_port()utility and updatedget_host_from_host_port()to use it (defence-in-depth)test_db_utils.py— Unit tests forclean_host_port()(15+ assertions) and extendedget_host_from_host_port()teststest_build_connection_url.py— Integration tests for MySQL and Postgres with scheme-prefixedhostPortHow I tested:
CassandraConnection(hostPort='http://localhost:9042')→hostPortbecomes'localhost:9042'✓hostPortvalues pass through unchanged ✓http://localhost:abc,jdbc:postgresql://host:abc/db,jdbc:postgresql://[::1]:abc) raiseValueError✓py_format_check(Black) clean ✓Type of change:
Checklist:
Fixes <issue-number>: <short explanation>Summary by Gitar
model_post_initapproach with a centralizedstrip_hostport_scheme()call inget_connection_url_common()to prevent side effects on non-database connections.clean_host_port()including IPv6 handling and raisingValueErrorfor invalid hostports instead of silent failure.db_utils.py,custom_pydantic.py,builders.py, and test files to improve readability.This will update automatically on new commits.