feat: complete Arrow type mapping for parametric and missing types#16
Merged
Conversation
Add support for all common Arrow type strings returned by Hotdata's information_schema when tables are loaded from Parquet/Arrow sources. Simple additions to _ARROW_TYPE_MAP: - int8 → dt.Int8 (Postgres parser wrongly returns Int64 for "int8") - halffloat → dt.Float16 (PyArrow's str() name for float16 - large_string → dt.String (PyArrow large-offset string variant) New _parse_parametric_arrow_type() for types with embedded parameters: - timestamp[us] / timestamp[us, tz=UTC] → dt.Timestamp(timezone=...) - duration[ms] → dt.Interval(unit='ms') - decimal128(10, 3) / decimal(5, 2) → dt.Decimal(precision, scale) - list<item: T> / large_list<item: T> → dt.Array(T) (recursive) Adds 27 new test cases covering all new patterns including timezone variants, all 4 time units, case-insensitivity, and recursive list element type resolution. EOF )
There was a problem hiding this comment.
Review
Blocking Issues
src/ibis_hotdata/types.py:63-66—_TIMESTAMP_REcaptures the Arrow time unit in group 1 but the handler discards it, so every parametric timestamp (timestamp[s],timestamp[ms],timestamp[us],timestamp[ns]) is constructed asdt.Timestamp()with default scale.dt.Timestampis a parametric type — differentscalevalues produce different types — so the unit information this PR claims to recover is silently lost. The existing tests pass only because they don't assert onscale.
Action Required
- Map the captured unit to an Ibis
scale(s→0, ms→3, us→6, ns→9) and pass it todt.Timestamp(...). - Extend
test_dtype_from_hotdata_arrow_timestampto assert the resultingscaleso this regression is caught.
See inline comments for the suggested patch and two non-blocking nits (decimal regex shape, silent unknown-unit fallback).
Two bugs found by Codex review: 1. Timestamp scale was discarded: Arrow timestamp[ms] should map to dt.Timestamp(scale=3), not dt.Timestamp(). Add unit-to-scale mapping s=0, ms=3, us=6, ns=9, matching PyArrow convention. 2. Non-nullable list items mis-parsed: PyArrow emits 'list<item: int32 not null>' for non-nullable item fields. Strip the ' not null' suffix and pass nullable=False to the recursive call so the element type is correctly typed instead of falling back to Unknown.
Two nits from PR review: - decimal regex 'decimal128?' made only the trailing 8 optional, so it matched the non-existent 'decimal12' form. Changed to 'decimal(?:128|256)?' to accept decimal, decimal128, or decimal256. - _ARROW_UNIT_TO_IBIS.get(..., 's') silently mapped any unrecognised duration unit (e.g. duration[foo]) to Interval(s). Now returns None so the caller falls through to the Postgres parser / String fallback.
Instead of constructing Ibis types directly (with hand-coded unit->scale mappings etc.), resolve Arrow type strings to PyArrow types first and convert via ibis.formats.pyarrow.PyArrowType.to_ibis(). - Replace _ARROW_TYPE_MAP (str -> Ibis class) with _PA_TYPE_MAP (str -> pa.DataType instance), covering all scalar types that can appear as list element types - Merge _parse_parametric_arrow_type() into _pa_type_from_arrow_str() which returns a pa.DataType or None - Remove _ARROW_UNIT_TO_IBIS and _ARROW_UNIT_TO_TIMESTAMP_SCALE — PyArrow encodes unit/scale semantics natively - Fix decimal256: pa.decimal128 rejects precision > 38; fall back to pa.decimal256 for wider values - dtype_from_hotdata_sql_type() simplified to: try Arrow path via PyArrow bridge, then Postgres parser, then String fallback
- backend.py: replace walrus-then-overwrite pattern with plain assignments in _to_catalog_db_tuple - backend.py: getattr(self, '_http', None) is not None -> hasattr(self, '_http') - backend.py: add explanatory comments to the silent pass in _resolve_database_connection_id and the no-op _register_in_memory_table - http.py: replace magic HTTP status integers with http.HTTPStatus constants - http.py: replace range(len(columns)) index loop with direct field iteration - types.py: cache m.group(2) to avoid calling it twice - Remove managed.py (single-constant module); inline 'public' directly at the two call sites in backend.py and http.py
- Correct ibis-framework version requirement to >=10,<11 - Document all connect() optional parameters with inline comments - Add URL query string example with optional parameters - Document schema-only create_table (empty table from schema) - Document force=True on drop operations - Note to_pyarrow_batches() downloads full result then splits locally - Add in-memory tables (unsupported) and Arrow type mapping to feature table
- Change default schema public to main to match runtimedb DEFAULT_SCHEMA_NAME;
runtimedb auto-inserts main into every managed database, so using public
silently created a spurious empty main schema alongside the declared one
- Trim _IN_FLIGHT to {running} — runtimedb QueryRunStatus only emits
running, succeeded, and failed; queued and pending are result statuses
- Update README examples to use main schema throughout
There was a problem hiding this comment.
Prior review feedback addressed:
- Timestamp scale now preserved via the PyArrow→Ibis bridge (
pa.timestamp(unit, tz=tz)+PyArrowType.to_ibis()), with tests assertingout.scaleper unit. - Decimal regex tightened to
decimal(?:128|256)?, with a regression test fordecimal12. - Unknown duration units now return None (via
pa.durationraising) and fall through instead of silently mapping to seconds, with a regression test.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Before this change, Parquet-loaded tables with parametric Arrow column types would silently fall back to `String` or `Unknown`, losing type information.
Test plan