Skip to content

Commit 0c10d7b

Browse files
feat(backend/kernel): honour _use_arrow_native_complex_types via kernel's complex_types_as_json post-processor (#795)
feat(backend/kernel): plumb _use_arrow_native_complex_types to kernel's complex_types_as_json The connector's `_use_arrow_native_complex_types` toggle is honoured by the Thrift backend (forwarded server-side as `complexTypesAsArrow`) but was silently ignored by the kernel backend — the kernel always returned native Arrow `List` / `Map` / `Struct` regardless. This was the root cause of the 5 `THRIFT_VS_KERNEL_COMPLEX_DISABLED` diffs in the comparator's COMPLEX_TYPES suite. The kernel side gained an opt-in `complex_types_as_json` post- processor (kernel PR #36) that rewrites complex columns to `Utf8` columns of compact JSON text, matching the Thrift wire format byte-for-byte. This change wires the connector's existing kwarg through to that flag: - `session.py`: pass `_use_arrow_native_complex_types` to the kernel client (it was being dropped on the floor for the kernel branch). - `backend/kernel/client.py`: read it from kwargs (default `True`, matching the connector-wide default), invert at the boundary, and set `complex_types_as_json=not _use_arrow_native_complex_types` on the kernel `Session()` constructor. - `backend/kernel/type_mapping.py`: extend `_databricks_type_for_field` to honour `databricks.type_name` for `ARRAY` / `MAP` / `STRUCT` (it already did this for `VARIANT`). When the kernel JSON path is on, the columns arrive as `Utf8` but the kernel preserves the original SQL type name in metadata; `description` should report `array` / `map` / `struct`, matching what the Thrift backend reports under `complexTypesAsArrow=False`. Verified end-to-end against the pecotesting comparator workspace: the `THRIFT_VS_KERNEL_COMPLEX_DISABLED` suite drops from 5 type-shape diffs + 1 row diff to 1 row diff. The remaining row diff is a Thrift server-side bug — Thrift emits invalid JSON for map values containing embedded `"` characters (`{"k":"val with "quote""}` — unescaped inner quote), while the kernel emits the correctly-escaped form (`{"k":"val with \"quote\""}`). The kernel is right here; matching Thrift would mean deliberately producing un-parseable output. Unit tests: - Parametrised test of `_use_arrow_native_complex_types` (default / True / False) → kernel `Session(complex_types_as_json=…)`. - Parametrised test of `description_from_arrow_schema` recovering `array` / `map` / `struct` from metadata, case-insensitively. - Negative test that an unknown `databricks.type_name` defers to the Arrow type rather than corrupting the description. 85 → 94 kernel unit tests; full suite green; black-formatted. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
1 parent c3cae63 commit 0c10d7b

5 files changed

Lines changed: 128 additions & 5 deletions

File tree

src/databricks/sql/backend/kernel/client.py

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -98,6 +98,17 @@ def __init__(
9898
self._auth_provider = auth_provider
9999
self._catalog = catalog
100100
self._schema = schema
101+
# ``_use_arrow_native_complex_types`` is the connector-side
102+
# toggle for whether complex columns (ARRAY / MAP / STRUCT)
103+
# are surfaced as native Arrow shapes or as compact JSON
104+
# strings. The Thrift backend forwards it server-side
105+
# (``complexTypesAsArrow``); the kernel doesn't have a wire
106+
# equivalent, so we flip the kernel's client-side
107+
# ``complex_types_as_json`` post-processor to match. Default
108+
# ``True`` mirrors the connector's existing default.
109+
self._use_arrow_native_complex_types = kwargs.get(
110+
"_use_arrow_native_complex_types", True
111+
)
101112
# NB: don't call ``kernel_auth_kwargs`` here. That call
102113
# materialises the bearer token in-process; keeping a
103114
# cleartext copy on a long-lived connector object that may
@@ -155,6 +166,7 @@ def open_session(
155166
catalog=catalog or self._catalog,
156167
schema=schema or self._schema,
157168
session_conf=session_conf,
169+
complex_types_as_json=not self._use_arrow_native_complex_types,
158170
**auth_kwargs,
159171
)
160172
except Exception as exc:

src/databricks/sql/backend/kernel/type_mapping.py

Lines changed: 25 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -123,15 +123,35 @@ def _databricks_type_for_field(field: pyarrow.Field) -> str:
123123
Consults the field's Arrow metadata under
124124
``databricks.type_name`` (written by the kernel from the SEA
125125
response's column type) so types that collapse onto a generic
126-
Arrow shape can still be distinguished. Today only ``VARIANT``
127-
is mapped; everything else delegates to
128-
``_arrow_type_to_dbapi_string``.
126+
Arrow shape can still be distinguished. This matters in two
127+
cases:
128+
129+
- ``VARIANT`` (always ``Utf8`` on the wire — no Arrow shape
130+
distinguishes it from ``STRING``).
131+
- The ``complex_types_as_json`` post-processor rewrites
132+
``ARRAY`` / ``MAP`` / ``STRUCT`` columns to ``Utf8`` carrying
133+
compact JSON text. The Thrift backend reports the original
134+
SQL type in ``description`` even when ``complexTypesAsArrow``
135+
is off and the wire payload is a JSON string; we match that
136+
by recovering the type name from manifest metadata.
129137
"""
130138
md = field.metadata or {}
131139
# `databricks.type_name` is bytes (Arrow metadata is always
132140
# bytes); compare against bytes to avoid one encode per field.
133-
if md.get(b"databricks.type_name") == b"VARIANT":
134-
return "variant"
141+
type_name = md.get(b"databricks.type_name")
142+
if type_name is not None:
143+
# Lowercase to match the canonical SqlType slugs the Thrift
144+
# backend produces (``"array"`` / ``"map"`` / ``"struct"`` /
145+
# ``"variant"``). Other server-reported names (``"INT"`` etc.)
146+
# would also pass through this branch but we deliberately
147+
# don't honour them — the Arrow shape is the authoritative
148+
# source for primitives, and the kernel's own type-name
149+
# mapping (`map_databricks_type`) is conservative on some
150+
# types (e.g. ``DECIMAL`` arrives as ``decimal`` on the
151+
# Arrow side, which matches Thrift).
152+
decoded = type_name.decode("ascii", errors="replace").lower()
153+
if decoded in {"variant", "array", "map", "struct"}:
154+
return decoded
135155
return _arrow_type_to_dbapi_string(field.type)
136156

137157

src/databricks/sql/session.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -146,6 +146,7 @@ def _create_backend(
146146
http_client=self.http_client,
147147
catalog=kwargs.get("catalog"),
148148
schema=kwargs.get("schema"),
149+
_use_arrow_native_complex_types=_use_arrow_native_complex_types,
149150
)
150151

151152
databricks_client_class: Type[DatabricksClient]

tests/unit/test_kernel_client.py

Lines changed: 38 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -241,6 +241,44 @@ def test_open_session_rejects_double_open(monkeypatch):
241241
c.open_session(session_configuration=None, catalog=None, schema=None)
242242

243243

244+
@pytest.mark.parametrize(
245+
"kwargs, expected_flag",
246+
[
247+
({}, False), # default → arrow-native → kernel JSON off
248+
({"_use_arrow_native_complex_types": True}, False),
249+
({"_use_arrow_native_complex_types": False}, True),
250+
],
251+
)
252+
def test_open_session_passes_complex_types_as_json_to_kernel(
253+
monkeypatch, kwargs, expected_flag
254+
):
255+
"""``_use_arrow_native_complex_types=False`` flips the kernel's
256+
``complex_types_as_json`` post-processor on; the default and
257+
explicit ``True`` both leave it off. The flag is inverted at the
258+
boundary because the connector's option is "native Arrow"-shaped
259+
and the kernel's is "rewrite to JSON strings"-shaped."""
260+
captured = {}
261+
262+
def fake_session(**kw):
263+
captured.update(kw)
264+
sess = MagicMock()
265+
sess.session_id = "sess-id"
266+
return sess
267+
268+
monkeypatch.setattr(kernel_client._kernel, "Session", fake_session)
269+
270+
c = kernel_client.KernelDatabricksClient(
271+
server_hostname="example.cloud.databricks.com",
272+
http_path="/sql/1.0/warehouses/abc",
273+
auth_provider=AccessTokenAuthProvider("dapi-test"),
274+
ssl_options=None,
275+
**kwargs,
276+
)
277+
c.open_session(session_configuration=None, catalog=None, schema=None)
278+
279+
assert captured.get("complex_types_as_json") is expected_flag
280+
281+
244282
def test_execute_command_forwards_parameters_to_bind_param():
245283
"""``execute_command(parameters=[...])`` routes each parameter
246284
through ``bind_tspark_params`` onto the kernel statement before

tests/unit/test_kernel_type_mapping.py

Lines changed: 52 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -114,6 +114,58 @@ def test_description_uses_databricks_type_name_for_variant():
114114
assert desc[1][1] == "string"
115115

116116

117+
@pytest.mark.parametrize(
118+
"metadata_value, expected",
119+
[
120+
(b"ARRAY", "array"),
121+
(b"MAP", "map"),
122+
(b"STRUCT", "struct"),
123+
# Lowercase / mixed case both fine — server may report either.
124+
(b"array", "array"),
125+
(b"Struct", "struct"),
126+
],
127+
)
128+
def test_description_recovers_complex_type_name_from_metadata(metadata_value, expected):
129+
"""When ``complex_types_as_json`` rewrites a complex column to
130+
``Utf8``, the kernel preserves the original SQL type name under
131+
``databricks.type_name``. ``description`` must report that name
132+
(matching the Thrift backend's behaviour with
133+
``complexTypesAsArrow=False``), not the post-processed ``string``.
134+
"""
135+
schema = pa.schema(
136+
[
137+
pa.field(
138+
"c",
139+
pa.string(),
140+
metadata={b"databricks.type_name": metadata_value},
141+
),
142+
]
143+
)
144+
desc = description_from_arrow_schema(schema)
145+
assert desc[0][1] == expected
146+
147+
148+
def test_description_passes_through_unknown_databricks_type_name():
149+
"""Server-reported names other than the handful we explicitly
150+
recognise (VARIANT / ARRAY / MAP / STRUCT) defer to the Arrow
151+
shape — the Arrow type is the authoritative source for primitives
152+
and the kernel's own type mapping is conservative there. Confirms
153+
we don't accidentally claim ``int`` from metadata when the Arrow
154+
column is something concrete like ``int64``."""
155+
schema = pa.schema(
156+
[
157+
pa.field(
158+
"n",
159+
pa.int64(),
160+
metadata={b"databricks.type_name": b"INT"},
161+
),
162+
]
163+
)
164+
desc = description_from_arrow_schema(schema)
165+
# `int64` Arrow → "bigint" via the existing arrow-type mapper.
166+
assert desc[0][1] == "bigint"
167+
168+
117169
# ─── bind_tspark_params ──────────────────────────────────────────────────
118170

119171

0 commit comments

Comments
 (0)