feat: support LOAD DATA INFILE with IGNORE to skip error rows#24031
feat: support LOAD DATA INFILE with IGNORE to skip error rows#24031ck89119 wants to merge 11 commits intomatrixorigin:mainfrom
Conversation
…origin#23937) Add support for ignoring error rows during LOAD DATA INFILE by using the existing IGNORE keyword: LOAD DATA INFILE '...' IGNORE INTO TABLE t1 ... When IGNORE is specified, rows that fail CSV parsing, JSON parsing, or type conversion are skipped with a warning log instead of aborting the entire load operation. Partially appended column data is rolled back to maintain batch consistency. Changes: - Add IgnoreError flag to tree.ExParam and external.ExParamConst - Set IgnoreError in both bindExternalScan and buildLoad when DuplicateHandling is DuplicateKeyIgnore - Add ignore_error field to pipeline.proto ExternalScan message - Propagate IgnoreError through operator construction and remoterun serialization/deserialization - In CsvReader.makeBatchRows, skip error rows and rollback partial vector appends when IgnoreError is true
There was a problem hiding this comment.
Pull request overview
This PR adds an “ignore error rows” behavior to LOAD DATA INFILE ... IGNORE, plumbing a new IgnoreError flag from the SQL AST through plan/compile/pipeline serialization to the external CSV/JSONLINE reader so malformed/convert-failing rows can be skipped instead of aborting the load.
Changes:
- Introduces and propagates
IgnoreErroracross parser AST (tree.ExParam), plan binding, operator construction, and pipeline proto / (de)serialization. - Implements row-skipping behavior in
CsvReader.makeBatchRowsfor CSV read errors, JSONLINE parse errors, and type conversion errors (with partial-append rollback). - Adds distributed BVT coverage and a sample CSV containing invalid rows; surfaces ignored-row counts as MySQL warnings via a new
process.BaseProcess.Warningscounter.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| test/distributed/resources/load_data/ignore_error.csv | Test data with invalid rows to validate skipping behavior. |
| test/distributed/cases/load_data/load_data_ignore_error.sql | Adds a BVT case covering LOAD DATA ... IGNORE behavior. |
| test/distributed/cases/load_data/load_data_ignore_error.result | Expected output for the new distributed test. |
| proto/pipeline.proto | Adds ignore_error to ExternalScan for remoterun transport. |
| pkg/pb/pipeline/pipeline.pb.go | Regenerated protobuf bindings for the new field. |
| pkg/sql/parsers/tree/update.go | Adds IgnoreError to tree.ExParam (LOAD DATA parameters). |
| pkg/sql/plan/bind_load.go | Sets IgnoreError when DuplicateHandling is DuplicateKeyIgnore (i.e., IGNORE). |
| pkg/sql/compile/operator.go | Propagates IgnoreError into constructed ExternalScan operators. |
| pkg/sql/compile/remoterun.go | Serializes/deserializes IgnoreError across remote pipeline execution. |
| pkg/sql/colexec/external/types.go | Adds IgnoreError to external.ExParamConst. |
| pkg/sql/colexec/external/reader_csv.go | Skips erroneous rows when IgnoreError is true and rolls back partial vector appends. |
| pkg/vm/process/types.go | Adds a per-statement atomic warnings counter on BaseProcess. |
| pkg/frontend/status_stmt.go | Maps proc.Base.Warnings to MySQL OK-packet warning count for LOAD statements. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
… fallback - Clamp warnings count to math.MaxUint16 to prevent uint16 wrap-around when ignored rows exceed 65535. - Restore IgnoreError setting in buildLoad() fallback path so IGNORE semantics are not silently lost when bindLoad returns ErrUnsupportedDML.
XuPeng-SH
left a comment
There was a problem hiding this comment.
Review Summary
Overall a well-structured feature addition with clean propagation through the pipeline. Found 1 critical bug and 2 minor issues.
🔴 Critical: SetLength() does NOT clear null bitmap — data corruption risk
In reader_csv.go (lines 197-203), the rollback mechanism uses vec.SetLength(rowIdx) to undo partial row appends. However, SetLength() only updates v.length — it does NOT clear stale null bits in the null bitmap (v.nsp).
Corruption scenario:
- Row 5 starts appending. Column A appends a NULL value → null bit at index 5 is SET
- Column B conversion fails →
getOneRowDatareturns error - Rollback calls
vec.SetLength(5)on all vectors — but null bit 5 remains set in column A's bitmap - Next row reuses index 5. Column A appends a non-null value "hello" — but the stale null bit makes it appear NULL
Fix: Clear null bits after truncation:
for j := 0; j < bat.VectorCount(); j++ {
vec := bat.GetVector(int32(j))
if vec.Length() > rowIdx {
nulls.RemoveRange(&vec.GetNulls(), uint64(rowIdx), uint64(vec.Length()))
vec.SetLength(rowIdx)
}
}🟡 Minor: curBatchSize accumulates even for skipped rows
At line 170-171, curBatchSize += uint64(len(row[j].Val)) is accumulated before the error check at line 193. Skipped rows inflate curBatchSize, potentially causing early batch breaks via curBatchSize >= param.maxBatchSize. This reduces batch efficiency but is not a correctness bug.
🟡 Minor: Warning count could mislead for CSV read errors
For CSV read errors (line 138-142), i-- is used but the CSV reader position has already advanced past the bad row. The i-- correctly keeps the output row index aligned, but the warning log doesn't include row number information, making debugging harder. Consider logging the approximate row number.
✅ Good aspects:
- Clean flag propagation:
ExParam.IgnoreError→ExParamConst→pipeline.proto→ runtime - Proto field number 11 is correct (follows field 10)
atomic.Uint32for warnings is appropriate and overflow-safeSetStmtProfilecorrectly resets warnings to 0 for each statementmath.MaxUint16cap instatus_stmt.gomatches MySQL protocol- BVT test covers both error (without IGNORE) and success (with IGNORE) paths
DuplicateKeyIgnoretype assertion is set in bothbind_load.goandbuild_load.go(covers both code paths)
What type of PR is this?
Which issue(s) does this PR fix or relate to?
Fixes #23937
What does this PR do?
Add support for using the existing
IGNOREkeyword inLOAD DATA INFILEto skip bad CSV records:Current scope of
IGNOREin this PR: only CSV parse errors fromcsvReader.Read()are skipped with warnings.JSON parse errors and type conversion errors are still fail-fast.
Changes
IgnoreErrorflag totree.ExParamandexternal.ExParamConstIgnoreErrorin bothbindExternalScanandbuildLoadwhenDuplicateHandlingisDuplicateKeyIgnoreignore_errorfield topipeline.protoExternalScanmessageIgnoreErrorthrough operator construction and remoterun serialization/deserializationCsvReader.makeBatchRows, apply row-skip only to CSV parse errors fromcsvReader.Read()transJson2Lines) and type conversion (getOneRowData) errors as hard failureswarning_countin remoterun end message and merge remote warnings on client sidereceiveMsgAndForward) to merge remote warning counts on end message