
feat: support LOAD DATA INFILE with IGNORE to skip error rows#24031

Closed
ck89119 wants to merge 11 commits into matrixorigin:main from ck89119:feature/load-data-ignore-error-rows


Conversation

Contributor

@ck89119 ck89119 commented Mar 31, 2026

What type of PR is this?

  • Feature

Which issue(s) does this PR fix or relate to?

Fixes #23937

What does this PR do?

Add support for using the existing IGNORE keyword in LOAD DATA INFILE to skip bad CSV records:

LOAD DATA INFILE '/path/to/file.csv' IGNORE INTO TABLE t1 FIELDS TERMINATED BY ',';

Current scope of IGNORE in this PR: only CSV parse errors from csvReader.Read() are skipped, each recorded as a warning. JSON parse errors and type conversion errors still fail fast.

Changes

  • Add IgnoreError flag to tree.ExParam and external.ExParamConst
  • Set IgnoreError in both bindExternalScan and buildLoad when DuplicateHandling is DuplicateKeyIgnore
  • Add ignore_error field to pipeline.proto ExternalScan message
  • Propagate IgnoreError through operator construction and remoterun serialization/deserialization
  • In CsvReader.makeBatchRows, apply row-skip only to CSV parse errors from csvReader.Read()
  • On skipped CSV parse errors, recover parser state by discarding the broken row tail to avoid treating row fragments as new records
  • Keep JSON parsing (transJson2Lines) and type conversion (getOneRowData) errors as hard failures
  • Add warning_count in remoterun end message and merge remote warnings on client side
  • Fix prepare-notify forwarding path (receiveMsgAndForward) to merge remote warning counts on end message
  • Update tests to match the current scope (CSV parse errors skipped; type conversion errors still fail)

…origin#23937)

Add support for ignoring error rows during LOAD DATA INFILE by using
the existing IGNORE keyword:

  LOAD DATA INFILE '...' IGNORE INTO TABLE t1 ...

When IGNORE is specified, rows that fail CSV parsing, JSON parsing,
or type conversion are skipped with a warning log instead of aborting
the entire load operation. Partially appended column data is rolled
back to maintain batch consistency.

Changes:
- Add IgnoreError flag to tree.ExParam and external.ExParamConst
- Set IgnoreError in both bindExternalScan and buildLoad when
  DuplicateHandling is DuplicateKeyIgnore
- Add ignore_error field to pipeline.proto ExternalScan message
- Propagate IgnoreError through operator construction and remoterun
  serialization/deserialization
- In CsvReader.makeBatchRows, skip error rows and rollback partial
  vector appends when IgnoreError is true

Copilot AI left a comment


Pull request overview

This PR adds an “ignore error rows” behavior to LOAD DATA INFILE ... IGNORE, plumbing a new IgnoreError flag from the SQL AST through plan/compile/pipeline serialization to the external CSV/JSONLINE reader so malformed/convert-failing rows can be skipped instead of aborting the load.

Changes:

  • Introduces and propagates IgnoreError across parser AST (tree.ExParam), plan binding, operator construction, and pipeline proto / (de)serialization.
  • Implements row-skipping behavior in CsvReader.makeBatchRows for CSV read errors, JSONLINE parse errors, and type conversion errors (with partial-append rollback).
  • Adds distributed BVT coverage and a sample CSV containing invalid rows; surfaces ignored-row counts as MySQL warnings via a new process.BaseProcess.Warnings counter.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

Summary per file:

  • test/distributed/resources/load_data/ignore_error.csv: Test data with invalid rows to validate skipping behavior.
  • test/distributed/cases/load_data/load_data_ignore_error.sql: Adds a BVT case covering LOAD DATA ... IGNORE behavior.
  • test/distributed/cases/load_data/load_data_ignore_error.result: Expected output for the new distributed test.
  • proto/pipeline.proto: Adds ignore_error to ExternalScan for remoterun transport.
  • pkg/pb/pipeline/pipeline.pb.go: Regenerated protobuf bindings for the new field.
  • pkg/sql/parsers/tree/update.go: Adds IgnoreError to tree.ExParam (LOAD DATA parameters).
  • pkg/sql/plan/bind_load.go: Sets IgnoreError when DuplicateHandling is DuplicateKeyIgnore (i.e., IGNORE).
  • pkg/sql/compile/operator.go: Propagates IgnoreError into constructed ExternalScan operators.
  • pkg/sql/compile/remoterun.go: Serializes/deserializes IgnoreError across remote pipeline execution.
  • pkg/sql/colexec/external/types.go: Adds IgnoreError to external.ExParamConst.
  • pkg/sql/colexec/external/reader_csv.go: Skips erroneous rows when IgnoreError is true and rolls back partial vector appends.
  • pkg/vm/process/types.go: Adds a per-statement atomic warnings counter on BaseProcess.
  • pkg/frontend/status_stmt.go: Maps proc.Base.Warnings to the MySQL OK-packet warning count for LOAD statements.


… fallback

- Clamp warnings count to math.MaxUint16 to prevent uint16 wrap-around
  when ignored rows exceed 65535.
- Restore IgnoreError setting in buildLoad() fallback path so IGNORE
  semantics are not silently lost when bindLoad returns ErrUnsupportedDML.
Contributor

@XuPeng-SH XuPeng-SH left a comment


Review Summary

Overall a well-structured feature addition with clean propagation through the pipeline. Found 1 critical bug and 2 minor issues.

🔴 Critical: SetLength() does NOT clear null bitmap — data corruption risk

In reader_csv.go (lines 197-203), the rollback mechanism uses vec.SetLength(rowIdx) to undo partial row appends. However, SetLength() only updates v.length — it does NOT clear stale null bits in the null bitmap (v.nsp).

Corruption scenario:

  1. Row 5 starts appending. Column A appends a NULL value → null bit at index 5 is SET
  2. Column B conversion fails → getOneRowData returns error
  3. Rollback calls vec.SetLength(5) on all vectors — but null bit 5 remains set in column A's bitmap
  4. Next row reuses index 5. Column A appends a non-null value "hello" — but the stale null bit makes it appear NULL

Fix: Clear null bits after truncation:

for j := 0; j < bat.VectorCount(); j++ {
    vec := bat.GetVector(int32(j))
    if vec.Length() > rowIdx {
        // clear stale null bits before truncating the vector
        nulls.RemoveRange(vec.GetNulls(), uint64(rowIdx), uint64(vec.Length()))
        vec.SetLength(rowIdx)
    }
}

🟡 Minor: curBatchSize accumulates even for skipped rows

At line 170-171, curBatchSize += uint64(len(row[j].Val)) is accumulated before the error check at line 193. Skipped rows inflate curBatchSize, potentially causing early batch breaks via curBatchSize >= param.maxBatchSize. This reduces batch efficiency but is not a correctness bug.

🟡 Minor: Warning count could mislead for CSV read errors

For CSV read errors (line 138-142), i-- is used but the CSV reader position has already advanced past the bad row. The i-- correctly keeps the output row index aligned, but the warning log doesn't include row number information, making debugging harder. Consider logging the approximate row number.

✅ Good aspects:

  • Clean flag propagation: ExParam.IgnoreError → ExParamConst → pipeline.proto → runtime
  • Proto field number 11 is correct (follows field 10)
  • atomic.Uint32 for warnings is appropriate and overflow-safe
  • SetStmtProfile correctly resets warnings to 0 for each statement
  • math.MaxUint16 cap in status_stmt.go matches MySQL protocol
  • BVT test covers both error (without IGNORE) and success (with IGNORE) paths
  • The DuplicateKeyIgnore check is present in both bind_load.go and build_load.go (covers both code paths)


Labels

kind/feature, size/M (denotes a PR that changes [100,499] lines)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature Request]: LOAD DATA INFILE with CSV file supports ignoring error rows

8 participants