Skip to content

Fixes #25721: Resolve Pinot JSON ingestion crash using custom PinotJSONType#27722

Open
mohitjeswani01 wants to merge 3 commits intoopen-metadata:mainfrom
mohitjeswani01:fix/25721-pinot-json-ingestion-crash
Open

Fixes #25721: Resolve Pinot JSON ingestion crash using custom PinotJSONType#27722
mohitjeswani01 wants to merge 3 commits intoopen-metadata:mainfrom
mohitjeswani01:fix/25721-pinot-json-ingestion-crash

Conversation

@mohitjeswani01
Copy link
Copy Markdown

@mohitjeswani01 mohitjeswani01 commented Apr 24, 2026

Description:

Fixes #25721

I worked on resolving the TypeError: expected bytes, str found that occurs when ingesting sample data from Apache Pinot tables containing JSON columns.

The Problem:
Apache Pinot (specifically the Multistage Engine) returns JSON columns as raw Python str. However, SQLAlchemy 2.0's built-in JSON type expects a byte stream for its internal result processor. This leads to a crash during the sample data fetch before the ingestion framework can even process the row.

The Fix:
I implemented a custom PinotJSONType using a TypeDecorator with impl = String.

  • Bypassing the Crash: By using String as the base implementation, we completely sidestep SQLAlchemy's default JSON bytes-processor.
  • Engine Compatibility: The process_result_value logic is designed to handle both raw JSON strings (Multistage Engine) and already-deserialized dictionaries/lists (Single-stage Engine).
  • Defensive Handling: Added specific checks for Pinot-specific edge cases where null or empty columns might return False or empty strings, ensuring they map safely to None without killing the ingestion pipeline.
  • Framework Integration: Updated column_type_parser.py with a guarded import to ensure the Pinot-specific logic is only loaded when the connector is active, maintaining the architectural integrity of the framework.

How did you test your changes?
I wrote an exhaustive regression suite in ingestion/tests/unit/topology/database/test_pinotdb.py covering 21 new test cases. These verify:

  • Correct parsing of JSON objects and arrays from raw strings.
  • Pass-through for already-deserialized containers.
  • Decoding for binary/bytearray inputs.
  • Safe fallbacks for malformed JSON strings (no-crash fallback).
  • Correct framework-level type resolution in column_type_parser.

Test Results:
The suite is 100% green with 31/31 tests passed (including the original mappings and new regression cases).

Screenshots:
Successful test run showing exhaustive coverage of Pinot JSON scenarios:
image

Type of change:

  • Bug fix

Checklist:

  • I have read the CONTRIBUTING document.

  • My PR title is Fixes #25721: Resolve Pinot JSON ingestion crash using custom PinotJSONType

  • I have commented on my code, particularly in hard-to-understand areas.

  • I have added a test that covers the exact scenario we are fixing. For complex issues, comment the issue number in the test for future reference.


Summary by Gitar

  • Code cleanup and linting:
    • Refactored column_type_parser.py and pinotdb/metadata.py to include multiple Ruff and PEP-8 compliant linting fixes (noqa annotations, type hint updates).

This will update automatically on new commits.

Copilot AI review requested due to automatic review settings April 24, 2026 16:45
@mohitjeswani01 mohitjeswani01 requested a review from a team as a code owner April 24, 2026 16:45
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment thread ingestion/tests/unit/topology/database/test_pinotdb.py Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Resolves a Pinot connector sample-data ingestion crash for JSON columns by introducing a Pinot-specific SQLAlchemy TypeDecorator that bypasses SQLAlchemy JSON’s bytes-oriented result processor while still normalizing Pinot driver outputs.

Changes:

  • Added PinotJSONType (TypeDecorator over String) to safely deserialize/normalize Pinot JSON column values.
  • Updated Pinot type mapping to return PinotJSONType for json columns.
  • Registered PinotJSONType in ColumnTypeParser and expanded Pinot unit tests to cover JSON normalization + framework wiring.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
ingestion/src/metadata/ingestion/source/database/pinotdb/custom_types.py Introduces PinotJSONType to normalize Pinot JSON values and avoid SQLAlchemy JSON bytes-processing.
ingestion/src/metadata/ingestion/source/database/pinotdb/metadata.py Switches Pinot json type mapping from types.JSON to PinotJSONType.
ingestion/src/metadata/ingestion/source/database/column_type_parser.py Registers PinotJSONType to resolve to OM JSON at the framework parser layer.
ingestion/tests/unit/topology/database/test_pinotdb.py Adds/extends unit coverage for Pinot JSON normalization behavior and mapping integration.

Comment thread ingestion/src/metadata/ingestion/source/database/pinotdb/custom_types.py Outdated
Comment thread ingestion/tests/unit/topology/database/test_pinotdb.py Outdated
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@mohitjeswani01
Copy link
Copy Markdown
Author

Hi @harshach and @PubChimps Sir
I’ve implemented a complete fix for the Pinot JSON crash with a full regression suite (31/31 tests passed). I’m aware of the hackathon assignment guidelines, but since the current assignee has been inactive and this is a blocker for sample data, I wanted to provide a ready-to-merge solution. Could you please review and add the safe to test label? thanks you !🙏

Copilot AI review requested due to automatic review settings April 30, 2026 09:55
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@mohitjeswani01
Copy link
Copy Markdown
Author

Hi @PubChimps @harshach sir resolved the merge conflict here successfully could you please review and add a safe to test label so i can monitor the ci ? thanks 🙏

@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Apr 30, 2026

Code Review ✅ Approved 1 resolved / 1 findings

Custom PinotJSONType implementation fixes the JSON ingestion crash by correcting the skip logic in pytest. No issues found.

✅ 1 resolved
Edge Case: pytest.importorskip skips PinotJSONType tests unnecessarily

📄 ingestion/tests/unit/topology/database/test_pinotdb.py:31-38
The module-level pytest.importorskip("pinotdb") at line 31 causes the entire test file to be skipped when pinotdb is not installed. However, PinotJSONType (from custom_types.py) has no dependency on pinotdb — it only uses sqlalchemy and json. This means TestPinotJSONTypeContract, TestPinotJSONResultProcessor, and TestPinotJSONFrameworkIntegration could all run in CI environments that lack pinotdb, but are currently being skipped unnecessarily.

The importorskip guard is only needed for the get_type_custom import from metadata.py, which transitively imports pinotdb.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment on lines +111 to +113
def test_python_type_contract(self):
assert PinotJSONType().python_type is Any

Comment on lines +28 to +30
bytes-processor is never invoked. The pinotdb driver always delivers
JSON values as Python str (multistage engine) or already-deserialized
containers (single-stage engine). We normalize both shapes here.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Error while ingesting sample data for tables with JSON columns in Apache Pinot

2 participants