test: UTF-8 round-trip regression for dbt seed #1416

Merged
sd-db merged 2 commits into main from sd-db/test/seed-utf8-roundtrip
Apr 27, 2026

@sd-db (Collaborator) commented Apr 21, 2026

Summary

Adds a functional test that seeds a CSV containing Arabic, Greek, Chinese, emoji, and accented Latin, then asserts every cell round-trips byte-exact through the seed path. Covers both V1 and V2 materialization.
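A minimal, standard-library sketch of the byte-exact comparison described above. It is not the PR's test: the sample strings and the helper names `build_seed_csv`, `read_back`, and `mismatched_cells` are hypothetical, and the round trip here goes through an in-memory CSV rather than a Databricks table.

```python
import csv
import io

# Hypothetical sample values; the PR's actual fixture contents are not shown here.
SAMPLES = {
    "arabic": "مرحبا",
    "greek": "Καλημέρα",
    "chinese": "你好",
    "emoji": "🎉",
    "accented_latin": "café",
}


def build_seed_csv(samples):
    """Render the samples as CSV text with an id, a label, and a value column."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id", "label", "value"])
    for i, (label, value) in enumerate(samples.items(), start=1):
        writer.writerow([i, label, value])
    return buf.getvalue()


def read_back(csv_text):
    """Parse CSV text into a label -> value mapping."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["label"]: row["value"] for row in reader}


def mismatched_cells(expected, actual):
    """Return the labels whose UTF-8 bytes differ between expected and actual."""
    return [
        label
        for label, value in expected.items()
        if actual.get(label, "").encode("utf-8") != value.encode("utf-8")
    ]


# Encode to bytes and decode again, standing in for the seed path.
seed_bytes = build_seed_csv(SAMPLES).encode("utf-8")
round_tripped = read_back(seed_bytes.decode("utf-8"))
assert mismatched_cells(SAMPLES, round_tripped) == []
```

Comparing encoded bytes rather than rendered text means a value that only looks right, such as a differently normalized accent, would still be reported.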

This guards the fix for #332 (non-ASCII characters being mangled on seed, reported against 1.4.3). The current parameterized-insert path (adapter.add_query(sql, bindings=bindings, ...) in dbt/include/databricks/macros/materializations/seeds/helpers.sql) handles UTF-8 correctly — this test locks that in.
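To illustrate why a parameterized insert tends to preserve non-ASCII text, here is a small sketch that uses Python's built-in `sqlite3` as a stand-in for the warehouse connection. It does not call the adapter; the table name `seed_utf8` and both helper functions are assumptions made for the example.

```python
import sqlite3


def insert_with_bindings(conn, rows):
    # Placeholders keep the values out of the SQL text, so they are never
    # re-encoded or escaped as part of a string-built statement.
    conn.executemany("INSERT INTO seed_utf8 (id, value) VALUES (?, ?)", rows)


def fetch_values(conn):
    # Read the stored values back in insertion order.
    cursor = conn.execute("SELECT value FROM seed_utf8 ORDER BY id")
    return [row[0] for row in cursor]


# Hypothetical rows covering Greek, Chinese, emoji, and accented Latin.
rows = [(1, "Ελληνικά"), (2, "中文"), (3, "🙂"), (4, "naïve")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seed_utf8 (id INTEGER PRIMARY KEY, value TEXT)")
insert_with_bindings(conn, rows)
stored = fetch_values(conn)

# Byte-exact comparison of what was bound against what was stored.
assert [s.encode("utf-8") for s in stored] == [v.encode("utf-8") for _, v in rows]
```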

Test plan

  • hatch run pytest tests/functional/adapter/simple_seed/test_utf8_seed.py -v — 2 passed in ~36s against a live Databricks cluster (V1 + V2)

Closes #332.

This pull request and its description were written by Isaac.

Seeds a CSV containing Arabic, Greek, Chinese, emoji, and accented Latin
then asserts each cell round-trips byte-exact through the seed path.
Covers both V1 and V2 materialization.

Guards the fix for #332, which reported non-ASCII characters being
mangled on seed in 1.4.x. The current parameterized-insert path
(adapter.add_query + bindings) handles UTF-8 correctly; this test
locks that in.

Co-authored-by: Isaac

@github-actions

Coverage report

This PR does not seem to contain any modification to coverable code.

@tejassp-db (Collaborator) left a comment

Is it possible to use the exact words mentioned in issue #332?

The words are in this image: https://user-images.githubusercontent.com/11973991/235822152-baedfa2c-81d6-4d22-9789-afbed5184e97.png

@sd-db (Collaborator, Author) commented Apr 24, 2026

> Is it possible to use the exact words mentioned in issue #332?
>
> The words are in this image: https://user-images.githubusercontent.com/11973991/235822152-baedfa2c-81d6-4d22-9789-afbed5184e97.png

We need to test UTF-8 support overall, not one specific string. In that regard, the current test is good enough and has sufficient coverage, so we do not need to use the exact words from the issue.
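One rough way to gauge that kind of coverage is to check that the sample strings exercise every UTF-8 encoding width, from one to four bytes. The sketch below is an assumption about how such a check could look, not something taken from the PR; the samples and the `utf8_widths` helper are hypothetical.

```python
def utf8_widths(text):
    """Return the set of per-character UTF-8 byte lengths found in text."""
    return {len(ch.encode("utf-8")) for ch in text}


# Hypothetical samples: ASCII, accented Latin, Greek, Arabic, Chinese, emoji.
samples = ["abc", "é", "Ω", "ب", "中", "🎉"]

covered = set()
for sample in samples:
    covered |= utf8_widths(sample)

# All four UTF-8 encoding widths are represented by the samples.
assert covered == {1, 2, 3, 4}
```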

@sd-db merged commit ff4f80b into main on Apr 27, 2026
12 of 13 checks passed
@sd-db deleted the sd-db/test/seed-utf8-roundtrip branch on April 27, 2026 05:10
@jdbodyfelt
Thanks for resolving!

Development

Successfully merging this pull request may close these issues.

Character encoding changes upon seeding
