Skip to content

Fixes #24348: Strip URL scheme from hostPort to prevent ValueError#27191

Open
RajdeepKushwaha5 wants to merge 8 commits intoopen-metadata:mainfrom
RajdeepKushwaha5:fix/hostport-url-scheme-valueerror-24348
Open

Fixes #24348: Strip URL scheme from hostPort to prevent ValueError#27191
RajdeepKushwaha5 wants to merge 8 commits intoopen-metadata:mainfrom
RajdeepKushwaha5:fix/hostport-url-scheme-valueerror-24348

Conversation

@RajdeepKushwaha5
Copy link
Copy Markdown
Contributor

@RajdeepKushwaha5 RajdeepKushwaha5 commented Apr 9, 2026

Describe your changes:

Fixes #24348

When users enter a full URL like http://localhost:3306 or https://mydb.example.com:5432 in the hostPort connection field instead of just localhost:3306, the ingestion framework crashes with an unhelpful ValueError: invalid literal for int() with base 10.

Root cause: Code throughout the ingestion framework calls hostPort.split(":") expecting at most 2 parts (host:port), but a URL like http://localhost:3306 splits into 3 parts (["http", "//localhost", "3306"]), causing either unpacking errors or int("//localhost") failures.

Fix: Centralised hostPort sanitisation in BaseModel.model_post_init() (custom_pydantic.py), the base class for every generated Pydantic model. When any *Connection class is constructed with a hostPort containing ://, the scheme prefix is automatically stripped before any connector code ever accesses it.

This covers all 38+ connectors in a single place — no per-connector patches needed, and future connectors are automatically protected.

Changes (4 files):

  • custom_pydantic.py — Added hostPort sanitisation to model_post_init() (~15 lines)
  • db_utils.py — Added clean_host_port() utility and updated get_host_from_host_port() to use it (defence-in-depth)
  • test_db_utils.py — Unit tests for clean_host_port() (15+ assertions) and extended get_host_from_host_port() tests
  • test_build_connection_url.py — Integration tests for MySQL and Postgres with scheme-prefixed hostPort

How I tested:

  • Verified locally: CassandraConnection(hostPort='http://localhost:9042')hostPort becomes 'localhost:9042'
  • Clean hostPort values pass through unchanged ✓
  • Warning logged when scheme is detected ✓
  • All 24 helper cases pass; non-numeric ports (http://localhost:abc, jdbc:postgresql://host:abc/db, jdbc:postgresql://[::1]:abc) raise ValueError
  • make py_format_check (Black) clean ✓
  • All existing tests continue to pass ✓

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • I have added a test that covers the exact scenario we are fixing. For complex issues, comment the issue number in the test for future reference.

Summary by Gitar

  • Refined URL scheme stripping:
    • Replaced the initial model_post_init approach with a centralized strip_hostport_scheme() call in get_connection_url_common() to prevent side effects on non-database connections.
    • Added robust validation to clean_host_port() including IPv6 handling and raising ValueError for invalid hostports instead of silent failure.
  • Code maintenance:
    • Cleaned up extensive formatting and whitespace changes across db_utils.py, custom_pydantic.py, builders.py, and test files to improve readability.

This will update automatically on new commits.

@RajdeepKushwaha5 RajdeepKushwaha5 requested a review from a team as a code owner April 9, 2026 05:43
Copilot AI review requested due to automatic review settings April 9, 2026 05:43
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 9, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 9, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR prevents ingestion crashes when users mistakenly enter a full URL (e.g. http://localhost:3306) in hostPort by introducing a shared sanitization helper and applying it before hostPort parsing/URL construction, along with added test coverage.

Changes:

  • Added clean_host_port() utility to strip URL schemes (and trailing slashes) before parsing hostPort.
  • Updated multiple connectors and shared builders to call clean_host_port() before splitting/using hostPort.
  • Added/extended unit tests to cover scheme-prefixed hostPort inputs and connection URL generation.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
ingestion/src/metadata/utils/db_utils.py Adds clean_host_port() and uses it in get_host_from_host_port() to sanitize scheme-prefixed inputs.
ingestion/src/metadata/ingestion/connections/builders.py Applies clean_host_port() when generating DB URLs and when splitting host/port for IAM token generation.
ingestion/src/metadata/ingestion/source/database/cassandra/connection.py Uses clean_host_port() before splitting Cassandra hostPort.
ingestion/src/metadata/ingestion/source/database/db2/connection.py Uses clean_host_port() for DB2 host extraction and port parsing.
ingestion/src/metadata/ingestion/source/database/databricks/client.py Uses clean_host_port() to derive the Databricks workspace base host.
ingestion/src/metadata/ingestion/source/database/databricks/auth.py Uses clean_host_port() when building Databricks SDK host for auth flows.
ingestion/src/metadata/ingestion/source/database/redshift/connection.py Uses clean_host_port() when deriving host for IAM flow type detection.
ingestion/tests/unit/test_db_utils.py Adds tests for clean_host_port() and scheme-prefixed get_host_from_host_port() inputs.
ingestion/tests/unit/test_build_connection_url.py Adds integration-style tests ensuring MySQL/Postgres URL building works with scheme-prefixed hostPort.

Comment thread ingestion/src/metadata/utils/db_utils.py Outdated
Comment thread ingestion/src/metadata/ingestion/connections/builders.py
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 9, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copilot AI review requested due to automatic review settings April 9, 2026 07:55
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 9, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Comment thread ingestion/src/metadata/ingestion/connections/builders.py
Comment thread ingestion/src/metadata/utils/db_utils.py Outdated
Comment thread ingestion/src/metadata/utils/db_utils.py Outdated
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copilot AI review requested due to automatic review settings April 11, 2026 20:00
@RajdeepKushwaha5 RajdeepKushwaha5 force-pushed the fix/hostport-url-scheme-valueerror-24348 branch from 8cec367 to fcfd026 Compare April 11, 2026 20:00
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment thread ingestion/src/metadata/ingestion/source/database/cassandra/connection.py Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Comment thread ingestion/src/metadata/utils/db_utils.py Outdated
Comment thread ingestion/src/metadata/utils/db_utils.py Outdated
@RajdeepKushwaha5 RajdeepKushwaha5 force-pushed the fix/hostport-url-scheme-valueerror-24348 branch from fcfd026 to 0b955fb Compare April 11, 2026 20:15
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copilot AI review requested due to automatic review settings April 11, 2026 23:17
@RajdeepKushwaha5 RajdeepKushwaha5 force-pushed the fix/hostport-url-scheme-valueerror-24348 branch from 0b955fb to 6a779c3 Compare April 11, 2026 23:17
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

ingestion/src/metadata/ingestion/source/database/db2/connection.py:90

  • In _get_ibmi_connection_args(), port_str = host_port.split(":")[1] is fragile: it will fail for bracketed IPv6 ([::1]:50000) and will also misbehave if clean_host_port() ever returns an unbracketed IPv6 host (e.g. from http://[::1]:50000). Prefer splitting from the right once (e.g. rsplit(':', 1)) and/or explicitly handling [...]:port to reliably extract the port.
    args = get_connection_args_common(connection)
    host_port = clean_host_port(connection.hostPort)
    if ":" in host_port:
        port_str = host_port.split(":")[1]
        try:
            args["port"] = int(port_str)

Comment thread ingestion/src/metadata/utils/db_utils.py Outdated
Comment thread ingestion/src/metadata/utils/db_utils.py Outdated
Comment thread ingestion/src/metadata/utils/db_utils.py Outdated
@RajdeepKushwaha5
Copy link
Copy Markdown
Contributor Author

ping @harshach

url = f"{connection.scheme.value}://"
url += f"{quote_plus(username)}:{quote_plus(password)}@"
url += connection.hostPort
url += clean_host_port(connection.hostPort)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens to other connectors? why are we making changes only to few connectors that to in a connection.py?

@RajdeepKushwaha5 RajdeepKushwaha5 force-pushed the fix/hostport-url-scheme-valueerror-24348 branch from 6a779c3 to 8ed98aa Compare April 12, 2026 04:05
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@RajdeepKushwaha5
Copy link
Copy Markdown
Contributor Author

RajdeepKushwaha5 commented Apr 12, 2026

@harshach thanks for the feedback, you're absolutely right. Patching individual connector files was the wrong approach.

I've reworked this completely. The fix is now centralised in BaseModel.model_post_init() (custom_pydantic.py), which is the base class for every generated Pydantic model. When any *Connection class is constructed with a hostPort that contains ://, the scheme is automatically stripped before any connector code ever sees it.

What changed:

  • Removed all 6 per-connector patches (builders.py, cassandra, databricks, db2, redshift)
  • Added ~15 lines to model_post_init() in custom_pydantic.py — this covers all 38+ connectors automatically
  • Kept clean_host_port() in db_utils.py as a public utility (also used by get_host_from_host_port() as defence-in-depth)
  • Tests unchanged — they verify the end-to-end behaviour

The diff is now 4 files, 145 additions, 6 deletions (down from 9 files).

Verified locally: CassandraConnection(hostPort='http://localhost:9042')hostPort becomes 'localhost:9042'

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


# HTTP prefix is stripped
self.assertEqual(clean_host_port("http://localhost:3306"), "localhost:3306")
self.assertEqual(clean_host_port("http://example.com:8080"), "example.com:8080")
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New assertions in this test exceed Black's default formatting width and will cause make py_format_check (black --check) to fail. Please run make py_format (or manually wrap these calls) so Black can reformat them consistently.

Suggested change
self.assertEqual(clean_host_port("http://example.com:8080"), "example.com:8080")
self.assertEqual(
clean_host_port("http://example.com:8080"), "example.com:8080"
)

Copilot uses AI. Check for mistakes.
Comment thread ingestion/src/metadata/utils/db_utils.py Outdated
Comment on lines 138 to +142
export const ComboBox = ({
placeholder = 'Search',
shortcut = true,
size = 'sm',
fontSize = 'md',
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says this change is limited to 4 ingestion files, but this PR also modifies UI core components (openmetadata-ui-core-components/...). Please either update the PR description to reflect the UI changes (and why they are needed for this fix) or split the UI changes into a separate PR to keep the scope focused.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comment on lines +46 to +48
Raises ValueError if the scheme carries a non-numeric port so the user
gets a clear error instead of a silently broken hostPort.
"""
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _strip_hostport_scheme docstring says it “Raises ValueError if the scheme carries a non-numeric port”, but the JDBC-style not hostname fallback returns the stripped tail without port validation. Please either validate the port in that branch as well, or clarify the docstring to match the behavior.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

for sep in ("/", "?", "#"):
tail = tail.split(sep, 1)[0]
if "@" in tail:
tail = tail.rsplit("@", 1)[-1]
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the not hostname fallback branch, the function returns the stripped tail without validating the port component. For inputs like jdbc:postgresql://host:abc/db, this will return host:abc and the original int(port) failure can still happen later, which undermines the goal of producing a clearer error for non-numeric ports (and also makes the docstring claim about raising on non-numeric ports incomplete). Consider parsing tail for a host:port pattern and raising a ValueError when the port is present but non-numeric.

Suggested change
tail = tail.rsplit("@", 1)[-1]
tail = tail.rsplit("@", 1)[-1]
fallback_port = None
if tail.startswith("["):
closing_bracket = tail.find("]")
if closing_bracket != -1 and len(tail) > closing_bracket + 1:
if tail[closing_bracket + 1] == ":":
fallback_port = tail[closing_bracket + 2 :]
elif ":" in tail:
_, fallback_port = tail.rsplit(":", 1)
if fallback_port and not fallback_port.isdigit():
raise ValueError(
f"Invalid hostPort '{safe_label}'. Expected format is "
"'hostname[:port]' (e.g. 'localhost:3306')."
)

Copilot uses AI. Check for mistakes.
Comment thread ingestion/src/metadata/ingestion/models/custom_pydantic.py Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

The broad 'except Exception' in model_post_init() was swallowing the

intentional ValueError that clean_host_port raises for invalid ports

(e.g. 'http://localhost:abc'), leaving a malformed hostPort on the

connection object and turning a clear user error into an obscure

downstream failure.

The local import of clean_host_port is from a sibling module that

always exists, so the try/except was unnecessary defence. Removed it

so pydantic surfaces validation errors where they belong.
…r import

The model_post_init hook added in the previous commit imported
`clean_host_port` from `metadata.utils.db_utils` at runtime. That
module has a heavy dependency graph (OpenMetadata client, LineageParser,
generated schemas), so every *Connection class construction would drag
it in and cause circular-import failures during integration test
bootstrap.

The file docstring explicitly requires BaseModel to be 'self-sufficient
with only pydantic at import time', so the hostPort sanitisation is now
handled by a stdlib-only helper (_strip_hostport_scheme) colocated with
BaseModel. `clean_host_port` remains the public API in db_utils and
delegates to the same helper, preserving behaviour for all existing
callers and unit tests.
- Renames _strip_hostport_scheme to public strip_hostport_scheme so cross-module imports (db_utils.py) no longer depend on a private symbol; private alias kept for back-compat.

- Validates the port in the JDBC-style fallback branch so inputs like 'jdbc:postgresql://host:abc/db' raise ValueError with a clear message, matching the docstring contract instead of silently passing a broken hostPort downstream.

- Extends test_clean_host_port with the new fallback raise cases.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread ingestion/src/metadata/ingestion/models/custom_pydantic.py Outdated
Comment on lines +84 to +97
# Validate the port in the fallback path so the same ValueError
# contract holds for JDBC-style URLs (e.g. 'jdbc:postgresql://host:abc').
fallback_port: Optional[str] = None
if tail.startswith("["):
closing = tail.find("]")
if closing != -1 and len(tail) > closing + 1 and tail[closing + 1] == ":":
fallback_port = tail[closing + 2 :]
elif ":" in tail:
fallback_port = tail.rsplit(":", 1)[1]
if fallback_port and not fallback_port.isdigit():
raise ValueError(
f"Invalid hostPort '{safe_label}'. Expected format is "
"'hostname[:port]' (e.g. 'localhost:3306')."
)
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the hostname-empty fallback branch, malformed inputs like http:// or http:///localhost:3306 will produce an empty tail and be returned as an empty hostPort, which can lead to confusing downstream errors. It would be safer to validate that the extracted tail is non-empty (and ideally contains a hostname) and raise the same ValueError if it can't be parsed into hostname[:port].

Suggested change
# Validate the port in the fallback path so the same ValueError
# contract holds for JDBC-style URLs (e.g. 'jdbc:postgresql://host:abc').
fallback_port: Optional[str] = None
if tail.startswith("["):
closing = tail.find("]")
if closing != -1 and len(tail) > closing + 1 and tail[closing + 1] == ":":
fallback_port = tail[closing + 2 :]
elif ":" in tail:
fallback_port = tail.rsplit(":", 1)[1]
if fallback_port and not fallback_port.isdigit():
raise ValueError(
f"Invalid hostPort '{safe_label}'. Expected format is "
"'hostname[:port]' (e.g. 'localhost:3306')."
)
if not tail:
raise ValueError(
f"Invalid hostPort '{safe_label}'. Expected format is "
"'hostname[:port]' (e.g. 'localhost:3306')."
)
# Validate the fallback path so the same ValueError contract holds for
# JDBC-style URLs and malformed inputs that urlparse could not map to
# a standard hostname.
fallback_parsed = urlparse(f"//{tail}")
try:
fallback_hostname = fallback_parsed.hostname or ""
_ = fallback_parsed.port
except ValueError as exc:
raise ValueError(
f"Invalid hostPort '{safe_label}'. Expected format is "
"'hostname[:port]' (e.g. 'localhost:3306')."
) from exc
if not fallback_hostname:
raise ValueError(
f"Invalid hostPort '{safe_label}'. Expected format is "
"'hostname[:port]' (e.g. 'localhost:3306')."
)

Copilot uses AI. Check for mistakes.
…common in builders.py The model_post_init hook in custom_pydantic.py was too broad — it stripped URL schemes from ALL *Connection classes including AirflowConnection and OpenMetadataConnection where http://localhost:8080 is a legitimate value. This broke all py-tests integration tests. Fix: remove stripping from model_post_init; apply strip_hostport_scheme() only in get_connection_url_common() in builders.py, which is called exclusively by database source connectors. Also address two Copilot review issues in strip_hostport_scheme: - safe_label now uses a sanitised tail for JDBC/empty-hostname inputs instead of the generic 'URL with scheme' string - Empty tail after scheme strip raises ValueError instead of silently returning an empty hostPort - Fallback port validation now uses urlparse('//tail') for consistent IPv6 handling, matching the Copilot suggested fix
Comment thread ingestion/src/metadata/ingestion/connections/builders.py
…by stripping in model_post_init

Addresses gitar-bot review feedback on PR open-metadata#27191: moving strip_hostport_scheme
to get_connection_url_common left several connectors that parse hostPort manually
still vulnerable to the original crash (open-metadata#24348):

  - cassandra/connection.py: host, port = connection.hostPort.split(':')
  - db2/connection.py: connection.hostPort.split(':')[0]
  - databricks/auth.py: connection.hostPort.split(':')[0]
  - redshift/connection.py: connection.hostPort.split(':')[0]
  - microsoftfabric/connection.py: connection.hostPort.split(':')[0]
  - builders.py (IAM auth): connection.hostPort.split(':')

Restore strip_hostport_scheme call to model_post_init so scheme stripping
happens at construction time for ALL *Connection classes, protecting every
downstream split site automatically.

Exclude AirflowConnection and OpenMetadataConnection via _HOSTPORT_URL_ALLOWED
since their hostPort field legitimately stores a full URL (e.g.
http://localhost:8080 and http://localhost:8585/api).

The call in get_connection_url_common is kept for defence-in-depth; it is
idempotent because strip_hostport_scheme short-circuits when '://' is absent.

Add three new unit tests to test_connection_builders.py covering:
- URL scheme stripping at model construction
- ValueError raised for invalid port at construction time
- URL-allowed connections are not stripped
Comment thread ingestion/src/metadata/ingestion/models/custom_pydantic.py Outdated
Comment thread ingestion/src/metadata/ingestion/models/custom_pydantic.py Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comment on lines +126 to +130
# (e.g. "http://airflow:8080" or "http://localhost:8585/api") and must NOT
# have their scheme stripped by model_post_init.
_HOSTPORT_URL_ALLOWED: frozenset = frozenset(
{"AirflowConnection", "OpenMetadataConnection"}
)
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_HOSTPORT_URL_ALLOWED is far too narrow for model_post_init scheme stripping. Many non-database *Connection models define hostPort as a full URL (schema format: "uri"), e.g. GrafanaConnection ("URL to the Grafana instance"), AirbyteConnection ("Pipeline Service Management/UI URL"), ElasticSearchConnection, etc. With the current allowlist, constructing those connections will silently mutate hostPort by stripping the scheme and potentially breaking downstream clients that require a URL. Consider restricting scheme stripping to database connection models only (e.g. self.class.module contains ".services.connections.database.") or switching to an explicit allowlist of models where hostPort is not a URL.

Copilot uses AI. Check for mistakes.
Comment on lines +162 to +166
if (
hasattr(self, "hostPort")
and self.__class__.__name__ not in _HOSTPORT_URL_ALLOWED
):
self.hostPort = strip_hostport_scheme(self.hostPort)
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

model_post_init calls strip_hostport_scheme(self.hostPort) whenever the model has a hostPort attribute. If hostPort is optional and left unset/None (several schemas do not list hostPort in "required"), this will raise an AttributeError at model construction time. Guard the call with an isinstance(self.hostPort, str) check (and ideally only run when '://' is present) so Connection models without hostPort can still be instantiated.

Suggested change
if (
hasattr(self, "hostPort")
and self.__class__.__name__ not in _HOSTPORT_URL_ALLOWED
):
self.hostPort = strip_hostport_scheme(self.hostPort)
host_port = getattr(self, "hostPort", None)
if (
isinstance(host_port, str)
and "://" in host_port
and self.__class__.__name__ not in _HOSTPORT_URL_ALLOWED
):
self.hostPort = strip_hostport_scheme(host_port)

Copilot uses AI. Check for mistakes.
RajdeepKushwaha5 and others added 2 commits April 27, 2026 21:25
…or hostPort stripping

- Replace _HOSTPORT_URL_ALLOWED frozenset (only 2 entries) with
  _DATABASE_CONNECTION_MODULE_MARKER that checks the class module path.
  Only classes under .services.connections.database. are subject to
  scheme stripping; dashboard, pipeline, metadata, and all other connector
  categories pass through unchanged because their hostPort legitimately
  holds a full URL (e.g. http://grafana:3000).

- Add isinstance(host_port, str) guard in model_post_init to prevent
  AttributeError when hostPort is Optional and left unset (None).

- Update test_connection_builders.py: replace _HOSTPORT_URL_ALLOWED import
  with _DATABASE_CONNECTION_MODULE_MARKER, rename and update the allowlist
  test to reflect the new module-path strategy, and add a regression test
  for the None hostPort crash (CassandraConnection with default hostPort).

Fixes gitar-bot Bug 1 (None crash) and Bug 2 (incomplete class-name
allowlist that failed to cover non-database URL-based connectors).
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Apr 27, 2026

Code Review ✅ Approved 8 resolved / 8 findings

Consolidates hostPort parsing logic to safely strip URL schemes and prevent crashes. All reported exceptions in the JDBC, Cassandra, and IPv6 connector paths have been resolved.

✅ 8 resolved
Edge Case: Cassandra split(":") crashes when hostPort has no port

📄 ingestion/src/metadata/ingestion/source/database/cassandra/connection.py:75 📄 ingestion/src/metadata/ingestion/connections/builders.py:158-163 📄 ingestion/src/metadata/utils/db_utils.py:76
clean_host_port() can return just a hostname without a colon (e.g., "localhost") when no port is present — see db_utils.py:76. At cassandra/connection.py:75, the tuple unpacking host, port = clean_host_port(...).split(":") will raise ValueError: not enough values to unpack in that case.

This is the same class of crash the PR aims to fix. The builders.py change at line 158-162 correctly handles this by checking ":" not in cleaned before splitting, and the Databricks client uses *_ to absorb extra parts. The Cassandra path should follow the same defensive pattern.

Edge Case: JDBC fallback path preserves userinfo credentials in output

📄 ingestion/src/metadata/utils/db_utils.py:82-88
The JDBC-style URL fallback (lines 81-88) uses simple string splitting (rsplit("://", 1)[-1]) which does not strip the userinfo portion of the authority. For an input like jdbc:postgresql://user:pass@host:5432/db, the function returns user:pass@host:5432 instead of host:5432.

While JDBC URLs with embedded credentials in the hostPort field are unlikely in practice, this is a defense-in-depth path that should handle the case correctly.

Edge Case: Unguarded clean_host_port in model_post_init can crash model construction

📄 ingestion/src/metadata/ingestion/models/custom_pydantic.py:55-60
The clean_host_port() call in model_post_init (line 60) is not wrapped in a try/except, unlike the FilterPattern logic below it (line 62+). If a user provides an unusual hostPort value that causes clean_host_port to raise (e.g. http://localhost:abc raises ValueError), Pydantic model construction will fail entirely.

While failing fast on clearly invalid input is reasonable, this is inconsistent with the defensive pattern already established in the same method for FilterPattern. An unexpected edge-case input could prevent model instantiation when the existing connector code might still have handled it (or produced a more contextual error).

Edge Case: clean_host_port drops IPv6 brackets, breaking downstream splits

📄 ingestion/src/metadata/utils/db_utils.py:92
For a scheme-prefixed IPv6 hostPort like http://[::1]:3306, urlparse(...).hostname returns ::1 (without brackets), so clean_host_port returns ::1:3306. Downstream code that does split(":")[0] (e.g. get_host_from_host_port) would then extract an empty string instead of the IPv6 address.

This is a pre-existing limitation of the split(":") approach, but the new code could avoid making it worse by re-adding brackets when an IPv6 hostname is detected.

Edge Case: Broad except swallows intentional ValueError for invalid ports

📄 ingestion/src/metadata/ingestion/models/custom_pydantic.py:58-66 📄 ingestion/src/metadata/utils/db_utils.py:75-79
In model_post_init, the except Exception block (line 62) catches all exceptions from clean_host_port, including the intentional ValueError it raises when the port is genuinely invalid (e.g. http://localhost:abc). The handler only logs a warning and leaves the malformed hostPort value in place. This means downstream connector code will still crash with the same unhelpful ValueError: invalid literal for int() that this PR aims to fix.

For truly invalid input, the user should get a clear error at construction time rather than a confusing one later in the pipeline.

...and 3 more resolved from earlier reviews

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@github-actions
Copy link
Copy Markdown
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ValueError: invalid literal for int() with base 10

4 participants