Skip to content

fix: 4 CN memory leak/growth issues (#24058 #24059 #24060 #24061)#24062

Merged
mergify[bot] merged 2 commits intomatrixorigin:mainfrom
XuPeng-SH:fix-cn-memory-leaks
Apr 6, 2026
Merged

fix: 4 CN memory leak/growth issues (#24058 #24059 #24060 #24061)#24062
mergify[bot] merged 2 commits intomatrixorigin:mainfrom
XuPeng-SH:fix-cn-memory-leaks

Conversation

@XuPeng-SH
Copy link
Copy Markdown
Contributor

@XuPeng-SH XuPeng-SH commented Apr 4, 2026

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

Fixes #24058 #24059 #24060 #24061

What this PR does / why we need it:

Fixes 4 CN memory leak/growth issues:

  1. aggState.go: Free argCnt on Alloc() failure path — prevents mpool leak when arena allocation fails after MakeSlice succeeds (scope.go: missing ctx.Done() in select loop causes goroutine leak on query cancellation #24061)
  2. evalExpression.go: Defer cleanup for constant-fold ivecs in GetExprZoneMap — prevents vector leak in constant folding (stats.go: statsInfoMap entries set to nil but never deleted — unbounded map growth #24060)
  3. engine.go: Move RemoveTid call outside Engine lock to prevent deadlock between Engine.mu and GlobalStats.mu, and uncomment the previously disabled RemoveTid call (aggState: mpool.MakeSlice leak when subsequent Alloc fails #24058)
  4. stats.go: Add RemoveTid() method — cleans up statsInfoMap entries when tables are unsubscribed, preventing unbounded map growth (evalExpression: partial vector leak in constant-folding Dup() loop #24059)

Tests added:

  • aggexec_test.go: 3 sub-tests — normal init/free, real Alloc failure path (limited mpool), non-savearg path
  • stats_test.go: 3 sub-tests — remove existing entries, nonexistent table, broadcast wakeup
  • stats_test.go: 2 sub-tests — TestCleanMemoryTableWithTable (partition+stats removal, missing partition)
  • evalExpression_test.go: TestGetExprZoneMapConstantFold — exercises the constant-fold defer cleanup via abs(-42)

Not included (reverted):

@qodo-code-review
Copy link
Copy Markdown

Review Summary by Qodo

Fix 4 CN memory leak and goroutine leak issues

🐞 Bug fix

Grey Divider

Walkthroughs

Description
• Fix aggState argCnt memory leak on allocation failure
• Defer cleanup of duplicated vectors in constant-folding expressions
• Add RemoveTid() method to prevent statsInfoMap unbounded growth
• Add context cancellation check to select loop preventing goroutine leaks
Diagram
flowchart LR
  A["Memory Leaks Identified"] --> B["aggState argCnt leak"]
  A --> C["evalExpression Dup'd vectors"]
  A --> D["statsInfoMap unbounded growth"]
  A --> E["scope.go goroutine leak"]
  B --> B1["Free argCnt on Alloc failure"]
  C --> C1["Defer cleanup of ivecs"]
  D --> D1["Add RemoveTid method"]
  E --> E1["Add ctx.Done case to select"]
  B1 --> F["Memory leaks fixed"]
  C1 --> F
  D1 --> F
  E1 --> F
Loading

Grey Divider

File Changes

1. pkg/sql/colexec/aggexec/aggState.go 🐞 Bug fix +2/-0

Free argCnt slice on allocation failure

• Free previously allocated argCnt slice when subsequent mp.Alloc() fails
• Prevents memory leak on error path in aggState.init()

pkg/sql/colexec/aggexec/aggState.go


2. pkg/sql/colexec/aggexec/aggexec_test.go 🧪 Tests +55/-0

Test argCnt cleanup in aggState initialization

• Add TestAggStateInitSaveArgCleanup with 3 sub-tests covering normal init, error path cleanup,
 and non-savearg path
• Verify argCnt is properly freed in all initialization scenarios
• Ensure memory pool has zero bytes after test completion

pkg/sql/colexec/aggexec/aggexec_test.go


3. pkg/sql/colexec/evalExpression.go 🐞 Bug fix +11/-1

Defer cleanup of constant-folded vectors

• Add defer block to cleanup duplicated vectors when constant-folding occurs
• Free vectors in ivecs slice to prevent memory leak on all exit paths
• Store isAllConst() result in variable to enable deferred cleanup

pkg/sql/colexec/evalExpression.go


View more (5)
4. pkg/sql/compile/scope.go 🐞 Bug fix +3/-0

Add context cancellation check to select loop

• Add case <-s.Proc.Ctx.Done() to select loop in MergeRun()
• Return context error immediately when query context is cancelled
• Prevents goroutines from leaking when context is cancelled

pkg/sql/compile/scope.go


5. pkg/sql/compile/scope_test.go 🧪 Tests +148/-0

Test select loop context cancellation handling

• Add TestSelectLoopContextCancellation with 3 sub-tests
• Test context cancellation unblocks prescope wait, remote wait, and normal completion
• Verify select loop exits promptly on context cancellation

pkg/sql/compile/scope_test.go


6. pkg/vm/engine/disttae/engine.go 🐞 Bug fix +1/-1

Enable RemoveTid call during table cleanup

• Uncomment call to e.globalStats.RemoveTid(tblId) in cleanMemoryTableWithTable()
• Enable removal of statsInfoMap entries during table garbage collection

pkg/vm/engine/disttae/engine.go


7. pkg/vm/engine/disttae/stats.go ✨ Enhancement +15/-0

Add RemoveTid method to clean statsInfoMap entries

• Add new RemoveTid() method to GlobalStats
• Iterate through statsInfoMap and delete all entries matching the given tableID
• Broadcast condition variable to wake waiting goroutines
• Prevents unbounded map growth when tables are dropped or unsubscribed

pkg/vm/engine/disttae/stats.go


8. pkg/vm/engine/disttae/stats_test.go 🧪 Tests +79/-0

Test RemoveTid functionality and goroutine wakeup

• Add TestRemoveTid with 3 sub-tests covering entry removal, non-existent table, and goroutine
 wakeup
• Verify entries for specific tableID are removed while others remain
• Test that RemoveTid broadcasts condition variable to wake waiting goroutines

pkg/vm/engine/disttae/stats_test.go


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review bot commented Apr 4, 2026

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX Issues (0)

Grey Divider


Action required

1. Engine/stats lock deadlock 🐞 Bug ☼ Reliability
Description
Engine.cleanMemoryTableWithTable holds the Engine mutex and calls GlobalStats.RemoveTid (locks
GlobalStats.mu), while GlobalStats.Get holds GlobalStats.mu and can call into
PushClient.isSubscribed which locks the Engine via GetOrCreateLatestPart. This lock-order inversion
can deadlock table GC against concurrent stats lookups.
Code

pkg/vm/engine/disttae/engine.go[894]

+	e.globalStats.RemoveTid(tblId)
Evidence
The PR makes Engine.cleanMemoryTableWithTable call GlobalStats.RemoveTid while holding Engine.Lock.
Separately, GlobalStats.Get holds GlobalStats.mu while calling pClient.toSubscribeTable(), which may
call isSubscribed(), which locks Engine via GetOrCreateLatestPart(). This creates the cycle
Engine.Lock → GlobalStats.mu and GlobalStats.mu → Engine.Lock.

pkg/vm/engine/disttae/engine.go[884-896]
pkg/vm/engine/disttae/stats.go[344-367]
pkg/vm/engine/disttae/stats.go[325-334]
pkg/vm/engine/disttae/logtail_consumer.go[1208-1236]
pkg/vm/engine/disttae/db.go[630-648]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`Engine.cleanMemoryTableWithTable()` currently holds `Engine`'s mutex while calling `GlobalStats.RemoveTid()`, which takes `GlobalStats.mu`. Elsewhere, `GlobalStats.Get()` holds `GlobalStats.mu` while calling into `PushClient.isSubscribed()` which locks the `Engine` via `GetOrCreateLatestPart()`. This lock inversion can deadlock.

### Issue Context
A consistent lock ordering (or avoiding holding multiple locks at once) is required because `GlobalStats.Get()` can acquire Engine locks while holding `GlobalStats.mu`.

### Fix Focus Areas
- pkg/vm/engine/disttae/engine.go[884-896]

### Suggested change
Restructure `cleanMemoryTableWithTable` to not call `RemoveTid` while holding `e.Lock()`.
Example approach:
1) Acquire `e.Lock()`, delete from `e.partitions`, then unlock.
2) Call `e.globalStats.RemoveTid(tblId)` after releasing `e.Lock()`.

(Alternative: otherwise ensure lock ordering is always `GlobalStats.mu` -> `Engine.Lock`, but avoiding nested locks here is simplest.)

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. Unconditional defer in ZM 🐞 Bug ➹ Performance
Description
GetExprZoneMap installs a defer even when constant-folding is false, adding avoidable overhead in a
potentially hot expression-evaluation path. The defer should only be registered when constFold is
true.
Code

pkg/sql/colexec/evalExpression.go[R1393-1404]

				ivecs := make([]*vector.Vector, len(args))
-				if isAllConst(args) { // constant fold
+				constFold := isAllConst(args)
+				defer func() {
+					if constFold {
+						for _, v := range ivecs {
+							if v != nil {
+								v.Free(proc.Mp())
+							}
+						}
+					}
+				}()
+				if constFold { // constant fold
Evidence
The new code creates constFold := isAllConst(args) then immediately registers a deferred cleanup
closure that checks if constFold { ... }. This defer setup occurs on every execution of the
default branch, including non-constant-folding cases.

pkg/sql/colexec/evalExpression.go[1392-1404]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`GetExprZoneMap` now registers a deferred function unconditionally in the default-case, even when `constFold` is false. This adds overhead on the non-constant-folding path.

### Issue Context
The cleanup is only needed for the constant-folding path where `ivecs[i]` holds `Dup()` copies.

### Fix Focus Areas
- pkg/sql/colexec/evalExpression.go[1392-1418]

### Suggested change
Move the `defer` inside the `if constFold { ... }` block:
- compute `constFold := isAllConst(args)`
- `if constFold { defer cleanup(ivecs); ... } else { ... }`
This preserves the leak fix without adding defer overhead to the non-const path.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Tests bypass fixed paths 🐞 Bug ⚙ Maintainability
Description
Some newly added tests reimplement the select/locking logic instead of invoking the production code
paths they claim to validate, so they can pass even if the production fix regresses. This reduces
confidence in the regression coverage for ctx cancellation and cond.Broadcast behavior.
Code

pkg/sql/compile/scope_test.go[R636-672]

+func TestSelectLoopContextCancellation(t *testing.T) {
+	t.Run("cancel_unblocks_prescope_wait", func(t *testing.T) {
+		ctx, cancel := context.WithCancel(context.Background())
+		preScopeChan := make(chan error, 1)
+		remoteChan := make(chan notifyMessageResult, 1)
+		proc := testutil.NewProcess(t)
+		proc.BuildPipelineContext(ctx)
+
+		done := make(chan error, 1)
+		go func() {
+			preScopeCount := 1
+			remoteScopeCount := 0
+			for {
+				select {
+				case <-proc.Ctx.Done():
+					done <- proc.Ctx.Err()
+					return
+				case e := <-preScopeChan:
+					if e != nil {
+						done <- e
+						return
+					}
+					preScopeCount--
+				case result := <-remoteChan:
+					result.clean(proc)
+					if result.err != nil {
+						done <- result.err
+						return
+					}
+					remoteScopeCount--
+				}
+				if preScopeCount == 0 && remoteScopeCount == 0 {
+					done <- nil
+					return
+				}
+			}
+		}()
Evidence
TestSelectLoopContextCancellation spins up its own goroutine containing a select loop and never
calls Scope.MergeRun, so it does not fail if the production select loop lacks ctx.Done().
Similarly, TestRemoveTid/remove_wakes_waiting_goroutines does not call cond.Wait(); the
goroutine only blocks on gs.mu.Lock(), so the test is not actually validating that Broadcast wakes
condition waiters.

pkg/sql/compile/scope_test.go[633-779]
pkg/vm/engine/disttae/stats_test.go[1693-1718]
pkg/sql/compile/scope.go[365-388]
pkg/vm/engine/disttae/stats.go[399-428]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The new tests for ctx cancellation and RemoveTid broadcast do not actually exercise the production implementations they are meant to protect.

### Issue Context
- `TestSelectLoopContextCancellation` should fail if `Scope.MergeRun` does not select on `Proc.Ctx.Done()`.
- `TestRemoveTid/remove_wakes_waiting_goroutines` should involve a goroutine blocked in `cond.Wait()` to validate `Broadcast()`.

### Fix Focus Areas
- pkg/sql/compile/scope_test.go[633-779]
- pkg/vm/engine/disttae/stats_test.go[1693-1718]

### Suggested change
- Update the scope test to construct a minimal `Scope` and call `MergeRun` (or the smallest extracted helper containing the select loop) in a goroutine, then cancel the context and assert the call returns.
- Update the stats test to have a goroutine acquire `gs.mu`, then call `gs.mu.cond.Wait()` (with a loop predicate), and assert `RemoveTid()`'s broadcast wakes it.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

@matrix-meow matrix-meow added the size/M Denotes a PR that changes [100,499] lines label Apr 4, 2026
Comment thread pkg/vm/engine/disttae/engine.go
1. aggState.go: Free argCnt on Alloc() failure path (matrixorigin#24061)
2. evalExpression.go: Defer cleanup for constant-fold ivecs (matrixorigin#24060)
3. engine.go: Move RemoveTid outside Engine lock to prevent deadlock (matrixorigin#24058)
4. stats.go: Add RemoveTid() to clean up statsInfoMap entries (matrixorigin#24059)

Note: scope.go ctx.Done() fix was reverted — it caused ISCP executor
test failures (TestISCPExecutor1/4/5, TestUpdateJobSpec, TestInitSql).
The goroutine leak concern remains valid but needs a more nuanced fix.

Fixes matrixorigin#24058 matrixorigin#24059 matrixorigin#24060 matrixorigin#24061

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Apr 6, 2026

Merge Queue Status

  • Entered queue2026-04-06 13:08 UTC · Rule: main
  • Checks passed · in-place
  • Merged2026-04-06 14:06 UTC · at 69aa5674ec0cad5101565ac215317d3d2e62787c

This pull request spent 58 minutes 9 seconds in the queue, including 57 minutes 53 seconds running CI.

Required conditions to merge
  • #approved-reviews-by >= 1 [🛡 GitHub branch protection]
  • #changes-requested-reviews-by = 0 [🛡 GitHub branch protection]
  • #review-threads-unresolved = 0 [🛡 GitHub branch protection]
  • branch-protection-review-decision = APPROVED [🛡 GitHub branch protection]
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-neutral = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-skipped = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-neutral = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-skipped = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-neutral = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-skipped = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI / SCA Test on Ubuntu/x86
    • check-neutral = Matrixone CI / SCA Test on Ubuntu/x86
    • check-skipped = Matrixone CI / SCA Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI / UT Test on Ubuntu/x86
    • check-neutral = Matrixone CI / UT Test on Ubuntu/x86
    • check-skipped = Matrixone CI / UT Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-neutral = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-skipped = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-neutral = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-skipped = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-neutral = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-skipped = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Utils CI / Coverage
    • check-neutral = Matrixone Utils CI / Coverage
    • check-skipped = Matrixone Utils CI / Coverage

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Something isn't working kind/enhancement size/M Denotes a PR that changes [100,499] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

aggState: mpool.MakeSlice leak when subsequent Alloc fails

4 participants