fix(teradata): validate destination table schema in precheck and upload #686

Closed

tabossert wants to merge 5 commits into main from fix/teradata-schema-validation
Conversation

@tabossert (Contributor) commented Apr 9, 2026

Summary

  • Adds schema validation to TeradataUploader.precheck() — queries DBC.ColumnsV to verify an existing table has all required columns (id, record_id, element_id, text, type) before the workflow starts
  • Adds the same validation at the top of upload_dataframe() as a runtime guard, so even if precheck is bypassed or the table is altered mid-run, the error is caught and surfaced cleanly
  • Both raise DestinationConnectionError (a recognized error type) with a clear message naming the missing columns and suggesting remediation
  • Previously, an incomplete table schema (e.g., only id and text) would fail deep in the SQL INSERT and surface as the generic "Unexpected job failure after 3 retries" in the UI
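The column check the summary describes can be sketched as follows. This is a minimal illustration, not the exact connector code: the function name `check_columns` is hypothetical, and only `REQUIRED_DESTINATION_COLUMNS` and `DestinationSchemaError` come from the PR itself.

```python
# Sketch of the schema check described above (names other than the
# required-column set and the error type are illustrative).
REQUIRED_DESTINATION_COLUMNS = {"id", "record_id", "element_id", "text", "type"}


class DestinationSchemaError(Exception):
    """Raised when the destination table is missing required columns."""


def check_columns(existing_columns):
    # Teradata reports column names uppercased and space-padded, so
    # normalize before comparing; extra columns are tolerated.
    existing = {c.strip().lower() for c in existing_columns}
    missing = REQUIRED_DESTINATION_COLUMNS - existing
    if missing:
        raise DestinationSchemaError(
            f"Destination table is missing required columns: {sorted(missing)}. "
            "Add them or drop the table so it can be recreated."
        )
```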

Test plan

  • Added test_teradata_uploader_precheck_rejects_missing_columns — precheck raises on incomplete schema
  • Added test_teradata_uploader_precheck_schema_check_skipped_when_table_name_none — auto-create mode skips validation
  • Added test_teradata_uploader_precheck_schema_check_tolerates_extra_columns — extra columns are fine
  • Added test_teradata_uploader_precheck_schema_check_case_insensitive — handles Teradata uppercase
  • Added test_teradata_uploader_precheck_skips_schema_check_when_table_missing — nonexistent table defers to create_destination
  • Added test_teradata_uploader_upload_dataframe_rejects_missing_columns — runtime upload guard
  • All 66 existing tests pass

Note

Medium Risk
Adds new fail-fast schema validation for Teradata uploads, which can block workflows that previously proceeded (and failed later) when destination tables are missing required columns. The code-surface change is small, but it affects runtime behavior and user-facing errors for existing Teradata destinations.

Overview
Teradata destination uploads now validate the destination table schema before a workflow runs and again right before inserting rows.

TeradataUploader.precheck() queries DBC.ColumnsV for an existing table and raises a new DestinationSchemaError if required columns (id, record_id, element_id, text, type) are missing; it skips validation when the table is missing (defer to create_destination) or when table_name is None (auto-create mode). upload_dataframe() adds the same guard using discovered table columns to fail with a clear, actionable message instead of a late SQL failure.

Also bumps version to 1.4.21 and expands unit tests to cover missing/extra columns, case-insensitivity, and skip conditions.

Reviewed by Cursor Bugbot for commit a4dfe43.

When an existing table is missing required columns (e.g., only has id
and text), the upload silently fails with a raw Teradata SQL error that
surfaces as "Unexpected job failure after 3 retries" in the UI.

Add schema validation in both precheck() and upload_dataframe() so
users see a clear error message naming the missing columns and
suggesting remediation.
@tabossert tabossert requested a review from a team as a code owner April 9, 2026 04:01
… import

- Bump version to 1.4.21 (1.4.20 already taken on main)
- Add CHANGELOG.md entry for schema validation feature
- Remove unused REQUIRED_DESTINATION_COLUMNS import in test file (ruff F401)
Comment thread unstructured_ingest/processes/connectors/sql/teradata.py Outdated
…DatabaseName

Teradata doesn't support LIMIT syntax — the precheck schema validation
was silently failing due to a broad except clause. Also adds DatabaseName
filter to DBC.ColumnsV queries to avoid false matches in multi-database
environments.

Addresses Cursor Bugbot findings (high: invalid LIMIT syntax,
medium: missing DatabaseName scope).
Comment thread unstructured_ingest/processes/connectors/sql/teradata.py
…validation

Eliminate unnecessary database round trip by using a single SELECT ColumnName
query instead of a separate existence check followed by a column fetch.
An empty result already indicates the table doesn't exist.
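The single-query pattern this commit describes might look like the sketch below. The function name and cursor handling are illustrative (plain DB-API qmark style), not the exact connector implementation; the `DBC.ColumnsV` query with a `DatabaseName` filter is what the earlier commit message describes.

```python
def table_columns(cursor, database_name, table_name):
    # One round trip: an empty result means the table does not exist
    # (validation is then deferred to create_destination); otherwise the
    # rows are the table's columns, normalized for comparison.
    cursor.execute(
        "SELECT ColumnName FROM DBC.ColumnsV "
        "WHERE DatabaseName = ? AND TableName = ?",
        (database_name, table_name),
    )
    return [row[0].strip().lower() for row in cursor.fetchall()]
```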
@tabossert (Contributor, Author):
failed tests unrelated to changes

Comment thread unstructured_ingest/processes/connectors/sql/teradata.py
@potter-potter (Contributor):

I don't think we usually go this deep on precheck. What was the motivation behind this?

@potter-potter (Contributor):

Users could set a custom record id key. I am not real psyched on this thing in general.


```python
missing = REQUIRED_DESTINATION_COLUMNS - existing
if missing:
    raise DestinationConnectionError(
        f"Destination table is missing required columns: {sorted(missing)}"
    )
```
Recommend swapping out the name of this error to avoid the word "Connection" which could lead to a network-related hypothesis of the issue.

Perhaps "DestinationConfigurationMismatchError" or something like that

```python
with self.get_cursor() as cursor:
    cursor.execute("SELECT DATABASE")
    current_db = cursor.fetchone()[0].strip()
    cursor.execute(
        "SELECT ColumnName FROM DBC.ColumnsV "
        "WHERE DatabaseName = ? AND TableName = ?",
        (current_db, self.upload_config.table_name),
    )
```
Do we want to also check for columns in the schema which are NOT NULL if we don't have a way to fill them? It's not a perfectly easy check because there could be NOT NULL with a default specified also, in which case we don't want to fail (this pattern is common with last_updated columns, for example).
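The check this reviewer proposes could be sketched from the same `DBC.ColumnsV` view, which exposes `Nullable` ('Y'/'N') and `DefaultValue` per column. The function below is a hypothetical illustration of the idea (flag NOT NULL columns with no default, since those would break inserts), not code from the PR:

```python
def not_null_columns_without_default(cursor, database_name, table_name):
    # A NOT NULL column with a default (e.g. a last_updated timestamp
    # defaulting to CURRENT_TIMESTAMP) is fine; only NOT NULL columns
    # with no default and no way to fill them should fail precheck.
    cursor.execute(
        "SELECT ColumnName, Nullable, DefaultValue FROM DBC.ColumnsV "
        "WHERE DatabaseName = ? AND TableName = ?",
        (database_name, table_name),
    )
    return [
        name.strip().lower()
        for name, nullable, default in cursor.fetchall()
        if nullable.strip() == "N" and default is None
    ]
```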

@ctrahey (Contributor) commented Apr 12, 2026

> I don't think we usually go this deep on precheck. What was the motivation behind this?

We want to move toward richer errors, especially if they are able to deterministically predict failure of the job. In fact, I'd say that's the entire purpose of the precheck. Also note that precheck-on-save is just one case. We also want to (soon? Eventually?) support a preflight of jobs when they startup in case some condition has changed (think column added or removed).

Schema validation errors (missing required columns) were raised as
DestinationConnectionError which surfaces as "Error in connecting to
downstream data source" in the UI — misleading for a schema mismatch.
Added DestinationSchemaError as a new recognized error type with a
clear "Destination table schema error" message.

@cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

```python
# Columns that must exist in the destination table for uploads to succeed.
# record_id is needed for delete-before-upsert; element_id, text, and type
# carry the core document content.
REQUIRED_DESTINATION_COLUMNS = {"id", "record_id", "element_id", "text", "type"}
```
Hardcoded record_id ignores configurable record_id_key setting

Medium Severity

REQUIRED_DESTINATION_COLUMNS hardcodes "record_id", but SQLUploaderConfig.record_id_key is a configurable field that users can set to a different column name. Both _validate_table_schema and upload_dataframe compare against this hardcoded set, so a table with a valid custom record-ID column (e.g. "my_record_id") will be falsely rejected as missing "record_id". The required set needs to dynamically include self.upload_config.record_id_key instead.
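A minimal sketch of the fix Bugbot suggests: derive the required set from the configurable key instead of hardcoding it. The function name is illustrative; the real change would read `self.upload_config.record_id_key` inside the uploader.

```python
def required_destination_columns(record_id_key: str = "record_id") -> set:
    # Build the required set from the configurable record_id_key so a
    # table with a valid custom record-ID column is not falsely rejected.
    return {"id", record_id_key, "element_id", "text", "type"}
```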


@tabossert tabossert closed this Apr 13, 2026