
fix: concurrent INSERT with NULL unique key causes false duplicate errors#24054

Merged
mergify[bot] merged 6 commits into matrixorigin:main from XuPeng-SH:fix-null-unique-key-proper
Apr 5, 2026

Conversation

@XuPeng-SH
Contributor

@XuPeng-SH XuPeng-SH commented Apr 3, 2026

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24030

What this PR does / why we need it:

Two root causes:

1. NULL handling bug (logic error)
SQL defines NULL != NULL, but the duplicate checking code treated NULL values as comparable:

  • fuzzyCheck.firstlyCheck() / genCollsionKeys(): NULL Varlena serialized as empty string → multiple NULLs counted as duplicates
  • checkPKDupGeneric() / checkPKDup(): NULL entries not skipped for any type
  • PKPersistedBetween(): InplaceSort() reorders data but not the null bitmap → null positions become misaligned after sort

2. Concurrent vector sharing (data race)

  • PreInsert.constructColBuf() used SetVector() (pointer alias) instead of Dup() (deep copy)
  • Concurrent MergeRun() scopes cause reads/writes on shared bitmap structures

Changes:

| File | Fix |
| --- | --- |
| fuzzyCheck.go | Skip NULL rows in firstlyCheck/genCollsionKeys; fix vectorToString per-row null check |
| txn.go | Skip NULL rows in checkPKDupGeneric (all types) and checkPKDup (string/array types) |
| txn_table.go | Add cloneNonNullKeys() to filter NULLs and break aliasing; clone before InplaceSort in PKPersistedBetween |

What this PR does NOT include (compared to #24046):

  • bitmap.RecalculateCount() — unnecessary once concurrency is fixed
  • EmptyByFlag removal from Nulls.Contains() — masks deeper issue
  • Range() Clone — performance hit on every Window() call
  • Varlena fast-path removal from GetUnionAllFunction — performance regression
  • appendOneFixed TryExpand — hot-path overhead

Test coverage

Tests in fuzzyCheck_test.go (27 sub-tests) and txn_test.go (32 sub-tests):

  • TestVectorToStringNullHandling — 13 types including per-row null check
  • TestFirstlyCheckNullValues — NULL-only, mixed, no-null, all-null batches
  • TestGenCollisionKeysNullHandling — NULL skip in collision key generation
  • TestCheckPKDupGenericNullHandling — all fixed-width types + workspace/rawrows
  • TestCheckPKDupNullHandling — string/array types null skip
| Function | Coverage |
| --- | --- |
| checkPKDupGeneric | 100% |
| checkPKDup | 85.7% |
| vectorToString | 93% |
| firstlyCheck | 78% |
| genCollisionKeys | 80% |

@qodo-code-review

Review Summary by Qodo

Fix concurrent INSERT with NULL unique key false duplicate errors

🐞 Bug fix


Walkthroughs

Description
• Fix concurrent INSERT with NULL unique key causing false duplicate errors
• Skip NULL values in duplicate checking (fuzzyCheck, checkPKDup) per SQL semantics
• Replace vector aliasing with deep copy to prevent concurrent bitmap corruption
• Serialize non-Connector PreScopes in TP queries to prevent data races
• Add prepared flag to ValueScan to prevent double evaluation on nested MergeRun
Diagram
```mermaid
flowchart LR
  A["NULL Handling Bug"] -->|Skip NULLs in fuzzyCheck| B["fuzzyCheck.go"]
  A -->|Skip NULLs in checkPKDup| C["txn.go"]
  A -->|Clone before sort in PKPersistedBetween| D["txn_table.go"]
  E["Concurrent Vector Sharing"] -->|Use Dup instead of SetVector| F["preinsert.go"]
  E -->|Serialize non-Connector PreScopes| G["scope.go"]
  E -->|Add prepared flag| H["value_scan.go"]
  B --> I["Fixed Duplicate Checking"]
  C --> I
  D --> I
  F --> J["Fixed Data Races"]
  G --> J
  H --> J
```


File Changes

1. pkg/sql/colexec/preinsert/preinsert.go 🐞 Bug fix +6/-2

Use deep copy instead of vector aliasing

• Replace SetVector() pointer aliasing with Dup() deep copy to prevent vector sharing between
 concurrent scopes
• Move canFreeVecIdx flag assignment before const vector check for consistency

pkg/sql/colexec/preinsert/preinsert.go


2. pkg/sql/colexec/value_scan/types.go 🐞 Bug fix +5/-3

Add prepared flag to prevent double evaluation

• Add prepared boolean flag to container struct to track if ValueScan batch has been evaluated
• Reset prepared flag to false in Reset method

pkg/sql/colexec/value_scan/types.go


3. pkg/sql/colexec/value_scan/value_scan.go 🐞 Bug fix +7/-0

Prevent double evaluation on nested MergeRun

• Skip evalRowsetData if already prepared to prevent bitmap corruption when same scope started
 multiple times
• Set prepared flag after first evaluation to guard against nested MergeRun calls

pkg/sql/colexec/value_scan/value_scan.go


4. pkg/sql/compile/fuzzyCheck.go 🐞 Bug fix +18/-4

Skip NULL values in duplicate checking

• Skip NULL rows in firstlyCheck() by checking null bitmap before counting duplicates
• Skip NULL rows in genCollsionKeys() for both simple and compound keys
• Fix vectorToString() to check per-row null status instead of checking if any nulls exist
• Remove unused nulls import

pkg/sql/compile/fuzzyCheck.go


5. pkg/sql/compile/scope.go 🐞 Bug fix +43/-10

Serialize non-Connector PreScopes and add nil guards

• Add nil guards in release(), Reset(), FreeOperator(), InitAllDataSource(), and
 SetOperatorInfoRecursively() to handle cleaned-up PreScopes
• Serialize non-Connector PreScopes in TP queries by running them sequentially before concurrent
 execution
• Track concurrent PreScope count separately to avoid counting nil entries
• Free and release non-Connector PreScopes after sequential execution

pkg/sql/compile/scope.go


6. pkg/vm/engine/disttae/txn.go 🐞 Bug fix +42/-24

Skip NULL rows in all primary key duplicate checks

• Add pk vector parameter to checkPKDupGeneric() to access null bitmap
• Skip NULL rows in checkPKDupGeneric() by checking null bitmap before duplicate check
• Skip NULL rows in checkPKDup() for string/array types (T_char, T_varchar, T_array_float32,
 T_array_float64)
• Pass pk vector to all checkPKDupGeneric() calls for all numeric and special types

pkg/vm/engine/disttae/txn.go


7. pkg/vm/engine/disttae/txn_table.go 🐞 Bug fix +46/-8

Clone and filter NULL keys before sort operations

• Add cloneNonNullKeys() helper function to deep copy vector and filter out NULL rows
• Clone keys vector before InplaceSort() in PKPersistedBetween() to prevent corrupting caller's
 batch
• Use cloneNonNullKeys() in PrimaryKeysMayBeUpserted() and PrimaryKeysMayBeModified() to break
 aliasing and remove NULL entries

pkg/vm/engine/disttae/txn_table.go



@matrix-meow matrix-meow added the size/M Denotes a PR that changes [100,499] lines label Apr 3, 2026
@qodo-code-review

qodo-code-review Bot commented Apr 3, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX Issues (0)



Action required

1. Fuzzy check empty keys 🐞 Bug ≡ Correctness
Description
fuzzyCheck.genCollsionKeys now drops NULL rows, but fuzzyCheck.fill still sets f.cnt=bat.RowCount()
and compile.go always runs backgroundSQLCheck when f.cnt>0, even when the non-NULL key list is
empty. This can generate invalid SQL like ... where pk in () ... for non-compound keys or panic
for compound keys due to lastRow := len(collision[0]) - 1 becoming -1.
Code

pkg/sql/compile/fuzzyCheck.go[R279-296]

```diff
+			// Skip NULL values - they cannot be duplicates
+			for i, k := range pkey {
+				if !toCheck.GetNulls().Contains(uint64(i)) {
+					keys[0] = append(keys[0], k)
+				}
+			}
		} else {
			scales := make([]int32, len(f.compoundCols))
			for i, c := range f.compoundCols {
				scales[i] = c.Typ.Scale
			}
			for i := 0; i < toCheck.Length(); i++ {
+				if toCheck.GetNulls().Contains(uint64(i)) {
+					continue
+				}
				b := toCheck.GetRawBytesAt(i)
				t, err := types.Unpack(b)
				if err != nil {
```
Evidence
After the PR, genCollsionKeys filters out NULL rows, which can produce an empty collision key slice;
fill and backgroundSQLCheck do not guard against this and will still attempt to build and run
background SQL (or compute lastRow for compound keys). The SQL template explicitly requires a
non-empty IN list, so an empty condition becomes syntactically invalid.

pkg/sql/compile/fuzzyCheck.go[121-167]
pkg/sql/compile/fuzzyCheck.go[265-314]
pkg/sql/compile/compile.go[566-577]
pkg/sql/compile/util.go[71-75]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`genCollsionKeys` now filters out NULL rows, which can yield **zero** collision keys (e.g., INSERT rows where all unique/PK values are NULL). `fill` still sets `f.cnt` to the batch row count and `Compile.Run` always calls `backgroundSQLCheck` when `f.cnt > 0`, producing invalid SQL (`... IN () ...`) and/or panicking for compound keys (`lastRow := len(collision[0]) - 1`).

### Issue Context
This is on the hot path for unique/PK duplicate validation; NULL-heavy inputs are a common legal case under SQL semantics (multiple NULLs allowed for UNIQUE).

### Fix Focus Areas
- Ensure `fill`/`genCollsionKeys` compute `f.cnt` as **non-NULL key count** (not batch row count).
- If the resulting key list is empty, set `f.cnt = 0` and return early so `backgroundSQLCheck` is skipped.
- Apply the same NULL filtering behavior consistently for the `onlyInsertHidden` path as well.

- pkg/sql/compile/fuzzyCheck.go[121-210]
- pkg/sql/compile/fuzzyCheck.go[265-314]
- pkg/sql/compile/compile.go[566-577]
- pkg/sql/compile/util.go[71-75]



2. Early prescope operator free 🐞 Bug ☼ Reliability
Description
Scope.MergeRun now runs non-connector PreScopes sequentially for TP queries and immediately calls
ps.FreeOperator/ps.release, which invokes op.Free on all operators in those scopes. For join build
scopes (e.g., HashBuild), op.Free deletes spill files and frees build resources that HashJoin later
reads from the JoinMap message, causing runtime failures or incorrect results under spill.
Code

pkg/sql/compile/scope.go[R294-307]

```diff
+	if c.IsTpQuery() {
+		hasConnector := false
		for i := len(s.PreScopes) - 1; i >= 0; i-- {
-			err := s.PreScopes[i].MergeRun(c)
-			if err != nil {
-				return err
+			ps := s.PreScopes[i]
+			if ps != nil && ps.RootOp != nil && ps.RootOp.OpType() == vm.Connector {
+				hasConnector = true
+			} else if ps != nil {
+				if err := ps.MergeRun(c); err != nil {
+					return err
+				}
+				ps.FreeOperator(c)
+				ps.release()
+				s.PreScopes[i] = nil
			}
```
Evidence
Broadcast-join build pipelines are added as PreScopes and their RootOp is a HashBuild operator
(non-Connector), so they fall into the new TP sequential/free path. Scope.FreeOperator calls op.Free
on every operator; HashBuild.Free unconditionally removes spill files, but HashJoin explicitly reads
the spilled build bucket files after receiving the JoinMap message, so freeing the build scope
before the probe side runs breaks spill-mode joins.

pkg/sql/compile/scope.go[289-312]
pkg/sql/compile/scope.go[221-233]
pkg/sql/compile/compile.go[2626-2641]
pkg/sql/compile/operator.go[1515-1531]
pkg/sql/colexec/hashbuild/types.go[138-142]
pkg/sql/colexec/hashjoin/join.go[258-285]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
For TP queries, `Scope.MergeRun` now executes non-connector PreScopes sequentially and immediately calls `FreeOperator`/`release` on them. This can free/delete resources that the main pipeline still needs (notably spilled HashBuild files referenced by the JoinMap message and consumed by HashJoin later).

### Issue Context
- Join build scopes are `PreScopes` whose `RootOp` is typically `HashBuild` (non-Connector).
- `Scope.FreeOperator` calls `op.Free(...)` for every operator in that scope.
- `HashBuild.Free` unconditionally removes spill files, while `HashJoin` later reads those build spill buckets after `ReceiveJoinMap`.

### Fix Focus Areas
- Do **not** call `ps.FreeOperator(c)` / `ps.release()` immediately after `ps.MergeRun(c)` for non-connector PreScopes.
- If you must prevent re-running these scopes in the later concurrent phase, separate the scheduling list (e.g., collect connector PreScopes into a separate slice for concurrent execution) while keeping the completed PreScopes available for end-of-query cleanup.
- Ensure build-side spill files remain until HashJoin finishes (i.e., cleanup happens after probe completes / normal scope cleanup).

- pkg/sql/compile/scope.go[289-312]
- pkg/sql/compile/scope.go[221-233]
- pkg/sql/compile/compile.go[2626-2641]
- pkg/sql/compile/operator.go[1515-1531]
- pkg/sql/colexec/hashbuild/types.go[138-142]
- pkg/sql/colexec/hashjoin/join.go[258-285]




The race detector's type checker rejects uint16 on T_enum vectors.
Use the proper types.Enum alias in both vectorToString and tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

@gouhongshen gouhongshen left a comment


The fix direction looks right, but I can't approve this yet because the regression proof is still too local to the helpers.

What's still missing before approval:

  1. A SQL-level regression that reproduces the historical false Duplicate entry and proves the real user-visible path is fixed.
  2. A concurrent INSERT/UPDATE case. The PR title is about concurrent insert with NULL unique keys, but the new tests stay single-threaded.
  3. Coverage for the full primaryKeysMayBeChanged -> PKExistInMemBetween -> PKPersistedBetween path, including a flushed/persisted-object case.
  4. Coverage for the hidden unique-index/backfill path if that path is part of the intended fix surface.

Right now the tests prove the helper-level NULL filtering, but they do not yet prove that the full production path no longer misreports duplicates while still rejecting real non-NULL duplicates. That's the only blocking issue from this review.

Contributor Author

@XuPeng-SH XuPeng-SH left a comment


Response to Qodo Review

Both issues raised are false positives after careful investigation:

Issue 1: Fuzzy check empty keys — ✅ Already handled

fill() at lines 144-153 already guards against this:

```go
if len(collision) > 0 {
    f.cnt = len(collision[0])
} else {
    f.cnt = 0
}
if f.cnt == 0 {
    return nil  // ← early return, backgroundSQLCheck never called
}
```

When all rows are NULL, genCollsionKeys() produces empty collision slices → f.cnt = 0 → backgroundSQLCheck is skipped by the guard in compile.go line 568 (f.cnt > 0).

Issue 2: Early prescope operator free — ✅ Ownership transfer protects files

File ownership chain prevents premature deletion:

  1. HashBuild transfers spill fds to JoinMap at SendSucceed: ctr.spilledFds = nil (build.go:102)
  2. HashJoin takes ownership via TakeSpillBuildFds() (joinMapMsg.go:239-246)

Sequential TP execution ensures HashBuild completes and sends JoinMap before HashJoin starts reading.

- TestFillOnlyInsertHidden: hidden unique index (onlyInsertHidden) path
- TestFillEndToEndConditionGeneration: full fill→condition pipeline
- TestConcurrentFirstlyCheck: multi-goroutine concurrent firstlyCheck
- TestConcurrentCheckPKDup: concurrent checkPKDup with NULLs
- TestConcurrentCheckPKDup_RealDupWithNulls: dup detection under concurrency
- TestDupVectorWithoutNulls_ConcurrentSafety: concurrent InplaceSort safety

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@XuPeng-SH
Contributor Author

Integration tests added (commit 87b2c0d)

Addressing all review feedback:

1. Full production path coverage (fill → condition → SQL generation)

  • TestFillEndToEndConditionGeneration: Exercises the complete fill() → firstlyCheck() → genCollsionKeys() → condition string pipeline
    • non_compound_nulls_skipped_real_dup_caught: NULLs + real dup → firstlyCheck catches it
    • non_compound_nulls_only_skips_background_sql: All NULLs → cnt=0 → backgroundSQLCheck skipped
    • compound_mixed_generates_valid_condition: Compound key with NULLs → valid condition string

2. Concurrent INSERT cases

  • TestConcurrentCheckPKDup: 8 goroutines × 50 rows (20% NULLs) — no false duplicates
  • TestConcurrentCheckPKDup_RealDupWithNulls: 4 goroutines each with [1, NULL, 2, NULL, 1] — all detect real dup
  • TestConcurrentFirstlyCheck: 8 goroutines × 100 rows with NULLs — safe concurrent access
  • TestDupVectorWithoutNulls_ConcurrentSafety: 4 goroutines dupVectorWithoutNulls + InplaceSort — original vector unchanged

3. Hidden unique index path

  • TestFillOnlyInsertHidden: Covers onlyInsertHidden=true path (CREATE UNIQUE INDEX backfill)
    • hidden_with_values: Generates correct condition
    • hidden_dup_caught_by_firstly_check_skipped: Confirms firstlyCheck is skipped for hidden tables

4. primaryKeysMayBeChanged path

  • TestDupVectorWithoutNulls (already existed): Covers dupVectorWithoutNulls() — the NULL filtering + deep-copy used by primaryKeysMayBeChanged() before PKExistInMemBetween
  • TestDupVectorWithoutNulls_ConcurrentSafety: Proves filtered copies are independent (safe for concurrent InplaceSort)

All 16 new sub-tests pass.

@mergify
Contributor

mergify Bot commented Apr 5, 2026

Merge Queue Status

  • Entered queue 2026-04-05 02:19 UTC · Rule: main
  • Checks skipped · PR is already up-to-date
  • Merged 2026-04-05 02:19 UTC · at c32e2ca9c3dc6494541c372acc5635c1b88be426

This pull request spent 10 seconds in the queue, including 1 second running CI.

Required conditions to merge
  • #approved-reviews-by >= 1 [🛡 GitHub branch protection]
  • #changes-requested-reviews-by = 0 [🛡 GitHub branch protection]
  • #review-threads-unresolved = 0 [🛡 GitHub branch protection]
  • branch-protection-review-decision = APPROVED [🛡 GitHub branch protection]
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-neutral = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-skipped = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-neutral = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-skipped = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-neutral = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-skipped = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI / SCA Test on Ubuntu/x86
    • check-neutral = Matrixone CI / SCA Test on Ubuntu/x86
    • check-skipped = Matrixone CI / SCA Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI / UT Test on Ubuntu/x86
    • check-neutral = Matrixone CI / UT Test on Ubuntu/x86
    • check-skipped = Matrixone CI / UT Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-neutral = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-skipped = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-neutral = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-skipped = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-neutral = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-skipped = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Utils CI / Coverage
    • check-neutral = Matrixone Utils CI / Coverage
    • check-skipped = Matrixone Utils CI / Coverage


Labels

kind/bug Something isn't working size/XL Denotes a PR that changes [1000, 1999] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants