fix(teradata): validate destination table schema in precheck and upload #686

Closed

tabossert wants to merge 5 commits into main from fix/teradata-schema-validation
Conversation

@tabossert (Contributor) commented Apr 9, 2026

Summary

  • Adds schema validation to TeradataUploader.precheck() — queries DBC.ColumnsV to verify an existing table has all required columns (id, record_id, element_id, text, type) before the workflow starts
  • Adds the same validation at the top of upload_dataframe() as a runtime guard, so even if precheck is bypassed or the table is altered mid-run, the error is caught and surfaced cleanly
  • Both raise DestinationConnectionError (a recognized error type) with a clear message naming the missing columns and suggesting remediation
  • Previously, an incomplete table schema (e.g., only id and text) would fail deep in the SQL INSERT and surface as the generic "Unexpected job failure after 3 retries" in the UI
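The column check the summary describes can be sketched as follows. This is a minimal illustration, not the exact connector code: the function name `check_columns` is hypothetical, and only `REQUIRED_DESTINATION_COLUMNS` and `DestinationSchemaError` come from the PR itself.

```python
# Sketch of the schema check described above (names other than the
# required-column set and the error type are illustrative).
REQUIRED_DESTINATION_COLUMNS = {"id", "record_id", "element_id", "text", "type"}


class DestinationSchemaError(Exception):
    """Raised when the destination table is missing required columns."""


def check_columns(existing_columns):
    # Teradata reports column names uppercased and space-padded, so
    # normalize before comparing; extra columns are tolerated.
    existing = {c.strip().lower() for c in existing_columns}
    missing = REQUIRED_DESTINATION_COLUMNS - existing
    if missing:
        raise DestinationSchemaError(
            f"Destination table is missing required columns: {sorted(missing)}. "
            "Add them or drop the table so it can be recreated."
        )
```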

Test plan

  • Added test_teradata_uploader_precheck_rejects_missing_columns — precheck raises on incomplete schema
  • Added test_teradata_uploader_precheck_schema_check_skipped_when_table_name_none — auto-create mode skips validation
  • Added test_teradata_uploader_precheck_schema_check_tolerates_extra_columns — extra columns are fine
  • Added test_teradata_uploader_precheck_schema_check_case_insensitive — handles Teradata uppercase
  • Added test_teradata_uploader_precheck_skips_schema_check_when_table_missing — nonexistent table defers to create_destination
  • Added test_teradata_uploader_upload_dataframe_rejects_missing_columns — runtime upload guard
  • All 66 existing tests pass

Note

Medium Risk
Adds new fail-fast schema validation for Teradata uploads, which can block workflows that previously proceeded (and failed later) when destination tables are missing required columns. The code-surface change is small, but it affects runtime behavior and user-facing errors for existing Teradata destinations.

Overview
Teradata destination uploads now validate the destination table schema before a workflow runs and again right before inserting rows.

TeradataUploader.precheck() queries DBC.ColumnsV for an existing table and raises a new DestinationSchemaError if required columns (id, record_id, element_id, text, type) are missing; it skips validation when the table is missing (defer to create_destination) or when table_name is None (auto-create mode). upload_dataframe() adds the same guard using discovered table columns to fail with a clear, actionable message instead of a late SQL failure.

Also bumps version to 1.4.21 and expands unit tests to cover missing/extra columns, case-insensitivity, and skip conditions.

Reviewed by Cursor Bugbot for commit a4dfe43.

When an existing table is missing required columns (e.g., only has id
and text), the upload silently fails with a raw Teradata SQL error that
surfaces as "Unexpected job failure after 3 retries" in the UI.

Add schema validation in both precheck() and upload_dataframe() so
users see a clear error message naming the missing columns and
suggesting remediation.
@tabossert tabossert requested a review from a team as a code owner April 9, 2026 04:01
… import

- Bump version to 1.4.21 (1.4.20 already taken on main)
- Add CHANGELOG.md entry for schema validation feature
- Remove unused REQUIRED_DESTINATION_COLUMNS import in test file (ruff F401)
Comment thread unstructured_ingest/processes/connectors/sql/teradata.py Outdated
…DatabaseName

Teradata doesn't support LIMIT syntax — the precheck schema validation
was silently failing due to a broad except clause. Also adds DatabaseName
filter to DBC.ColumnsV queries to avoid false matches in multi-database
environments.

Addresses Cursor Bugbot findings (high: invalid LIMIT syntax,
medium: missing DatabaseName scope).
Comment thread unstructured_ingest/processes/connectors/sql/teradata.py
…validation

Eliminate unnecessary database round trip by using a single SELECT ColumnName
query instead of a separate existence check followed by a column fetch.
An empty result already indicates the table doesn't exist.
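The single-query pattern this commit describes might look like the sketch below. The function name and cursor handling are illustrative (plain DB-API qmark style), not the exact connector implementation; the `DBC.ColumnsV` query with a `DatabaseName` filter is what the earlier commit message describes.

```python
def table_columns(cursor, database_name, table_name):
    # One round trip: an empty result means the table does not exist
    # (validation is then deferred to create_destination); otherwise the
    # rows are the table's columns, normalized for comparison.
    cursor.execute(
        "SELECT ColumnName FROM DBC.ColumnsV "
        "WHERE DatabaseName = ? AND TableName = ?",
        (database_name, table_name),
    )
    return [row[0].strip().lower() for row in cursor.fetchall()]
```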
@tabossert (Contributor, Author):
failed tests unrelated to changes

Comment thread unstructured_ingest/processes/connectors/sql/teradata.py
@potter-potter (Contributor):

I don't think we usually go this deep on precheck. What was the motivation behind this?

@potter-potter (Contributor):

Users could set a custom record id key. I am not real psyched on this thing in general.


```python
missing = REQUIRED_DESTINATION_COLUMNS - existing
if missing:
    raise DestinationConnectionError(
        f"Destination table is missing required columns: {sorted(missing)}"
    )
```
Recommend swapping out the name of this error to avoid the word "Connection" which could lead to a network-related hypothesis of the issue.

Perhaps "DestinationConfigurationMismatchError" or something like that

```python
with self.get_cursor() as cursor:
    cursor.execute("SELECT DATABASE")
    current_db = cursor.fetchone()[0].strip()
    cursor.execute(
        "SELECT ColumnName FROM DBC.ColumnsV "
        "WHERE DatabaseName = ? AND TableName = ?",
        (current_db, self.upload_config.table_name),
    )
```
Do we want to also check for columns in the schema which are NOT NULL if we don't have a way to fill them? It's not a perfectly easy check because there could be NOT NULL with a default specified also, in which case we don't want to fail (this pattern is common with last_updated columns, for example).
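The check this reviewer proposes could be sketched from the same `DBC.ColumnsV` view, which exposes `Nullable` ('Y'/'N') and `DefaultValue` per column. The function below is a hypothetical illustration of the idea (flag NOT NULL columns with no default, since those would break inserts), not code from the PR:

```python
def not_null_columns_without_default(cursor, database_name, table_name):
    # A NOT NULL column with a default (e.g. a last_updated timestamp
    # defaulting to CURRENT_TIMESTAMP) is fine; only NOT NULL columns
    # with no default and no way to fill them should fail precheck.
    cursor.execute(
        "SELECT ColumnName, Nullable, DefaultValue FROM DBC.ColumnsV "
        "WHERE DatabaseName = ? AND TableName = ?",
        (database_name, table_name),
    )
    return [
        name.strip().lower()
        for name, nullable, default in cursor.fetchall()
        if nullable.strip() == "N" and default is None
    ]
```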

@ctrahey (Contributor) commented Apr 12, 2026

> I don't think we usually go this deep on precheck. What was the motivation behind this?

We want to move toward richer errors, especially if they are able to deterministically predict failure of the job. In fact, I'd say that's the entire purpose of the precheck. Also note that precheck-on-save is just one case. We also want to (soon? Eventually?) support a preflight of jobs when they startup in case some condition has changed (think column added or removed).

Schema validation errors (missing required columns) were raised as
DestinationConnectionError which surfaces as "Error in connecting to
downstream data source" in the UI — misleading for a schema mismatch.
Added DestinationSchemaError as a new recognized error type with a
clear "Destination table schema error" message.

@cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

```python
# Columns that must exist in the destination table for uploads to succeed.
# record_id is needed for delete-before-upsert; element_id, text, and type
# carry the core document content.
REQUIRED_DESTINATION_COLUMNS = {"id", "record_id", "element_id", "text", "type"}
```
Hardcoded record_id ignores configurable record_id_key setting

Medium Severity

REQUIRED_DESTINATION_COLUMNS hardcodes "record_id", but SQLUploaderConfig.record_id_key is a configurable field that users can set to a different column name. Both _validate_table_schema and upload_dataframe compare against this hardcoded set, so a table with a valid custom record-ID column (e.g. "my_record_id") will be falsely rejected as missing "record_id". The required set needs to dynamically include self.upload_config.record_id_key instead.
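A minimal sketch of the fix Bugbot suggests: derive the required set from the configurable key instead of hardcoding it. The function name is illustrative; the real change would read `self.upload_config.record_id_key` inside the uploader.

```python
def required_destination_columns(record_id_key: str = "record_id") -> set:
    # Build the required set from the configurable record_id_key so a
    # table with a valid custom record-ID column is not falsely rejected.
    return {"id", record_id_key, "element_id", "text", "type"}
```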


@tabossert tabossert closed this Apr 13, 2026