Character encoding changes upon seeding #332

@jdbodyfelt

Describe the bug

A UTF-8 encoded CSV file is seeded with dbt seed. On reviewing the loaded table, the character encoding of the string columns appears to have changed.

Steps To Reproduce

Create a UTF-8 CSV containing non-Latin characters (Arabic, Greek, etc.) and seed it with dbt seed.
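A minimal sketch of how such a file could be generated, so the input is known to be valid UTF-8 before dbt touches it. The file name `encoding_check.csv`, the column names, and the sample strings are assumptions for illustration; they are not taken from the original report.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample rows; any non-Latin text should serve the same purpose.
ROWS = [
    {"id": "1", "greeting": "مرحبا"},
    {"id": "2", "greeting": "Γειά σου"},
]


def write_seed(path: Path) -> None:
    """Write ROWS as a UTF-8 CSV without a byte order mark."""
    with path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "greeting"])
        writer.writeheader()
        writer.writerows(ROWS)


def read_seed(path: Path) -> list[dict[str, str]]:
    """Read the CSV back as UTF-8 to confirm the file itself is intact."""
    with path.open("r", encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f))


# Written to a temporary directory here; copy it into the project's seeds folder to test.
seed_path = Path(tempfile.mkdtemp()) / "encoding_check.csv"
write_seed(seed_path)
round_tripped = read_seed(seed_path)
```

If `round_tripped` matches `ROWS`, the file on disk is fine, and any difference seen after dbt seed was introduced during or after the load.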

Expected behavior

I expect a CSV to be seeded exactly as written, ESPECIALLY strings.

Screenshots and log output

CSV:
image
Injection Result:
image
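Since the screenshots are not reproduced here, the exact form of the corruption is unknown. One common pattern is UTF-8 bytes being decoded with a single-byte codec somewhere along the load path. The sketch below tests only that hypothesis; the helper name `looks_like_mojibake` and the default codec `latin-1` are assumptions, not something the report confirms.

```python
def looks_like_mojibake(original: str, loaded: str, wrong_codec: str = "latin-1") -> bool:
    """Return True if `loaded` is what you get when the UTF-8 bytes of
    `original` are decoded with `wrong_codec` instead of UTF-8."""
    try:
        return original.encode("utf-8").decode(wrong_codec) == loaded
    except UnicodeDecodeError:
        # Some codecs (e.g. cp1252) cannot decode every byte value.
        return False


# Simulate the suspected failure: UTF-8 bytes read back as Latin-1.
original = "Γειά"
garbled = original.encode("utf-8").decode("latin-1")
```

Comparing a value from the CSV against the value queried from the seeded table with this helper would show whether the result is consistent with that kind of misdecoding. A `False` result only rules out this one codec.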

System information

The output of dbt --version:

Core:
  - installed: 1.4.6
Plugins:
  - databricks: 1.4.3 

The operating system you're using:
Ubuntu 22.04.1 LTS

The output of python --version:
Python 3.10.6

Additional context

It would be great to have a seeds configuration option for column encoding, e.g.

seeds:
  - name: <tableName>
    config:
      columns:
        - name: <columnName>
          dtype: <columnDatatype>
          encoding: <columnEncoding if STRING or VARCHAR>

Labels

bug: Something isn't working
