Skip to content

fix: cnservice bootstrap panics on context.Canceled instead of returning error#24065

Merged
mergify[bot] merged 5 commits intomatrixorigin:mainfrom
XuPeng-SH:fix-bootstrap-panic-on-context-cancel
Apr 5, 2026
Merged

fix: cnservice bootstrap panics on context.Canceled instead of returning error#24065
mergify[bot] merged 5 commits intomatrixorigin:mainfrom
XuPeng-SH:fix-bootstrap-panic-on-context-cancel

Conversation

@XuPeng-SH
Copy link
Copy Markdown
Contributor

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

Fixes #24064

What this PR does / why we need it:

Problem

cnservice.(*service).bootstrap() panics on ALL bootstrap errors, including context.Canceled and context.DeadlineExceeded. This crashes the entire process when the 5-minute bootstrap timeout expires, which frequently happens during embedded cluster tests with multiple CNs.

Additionally, embed/cluster.go:doStartLocked() panics on CN Start() errors instead of propagating them to the caller.

Root Cause

// server.go:944 — panics on context.Canceled!
if err := s.bootstrapService.Bootstrap(ctx); err != nil {
    panic(moerr.AttachCause(ctx, err))
}

Context cancellation is NOT a bootstrap failure — it's an external signal that the operation should stop. The same file already handles context.Canceled gracefully for BootstrapUpgrade (line 955) but not for initial Bootstrap.

Fix

1. pkg/cnservice/server.go: Check for context cancellation/timeout before panicking. Return error instead of panicking for context-related failures.

2. pkg/embed/cluster.go: Change doStartLocked() from panic(err) to proper error propagation via atomic.Value, returning the first error to the caller.

Impact

  • Prevents cascading test failures when embedded cluster bootstrap times out
  • Production benefit: graceful shutdown won't crash a CN process that's still bootstrapping
  • Consistent with existing BootstrapUpgrade context handling pattern

Risk Assessment

Very low risk. Both changes are purely additive:

  • server.go: Only changes the error handling for context-related errors; real bootstrap failures still panic as before
  • cluster.go: Replaces panic(err) with return err — strictly more graceful behavior

…ing error (matrixorigin#24064)

Two changes:

1. pkg/cnservice/server.go: bootstrap() now checks for context
   cancellation/timeout before panicking. context.Canceled and
   context.DeadlineExceeded are external signals, not bootstrap
   failures — return the error so the caller can handle gracefully.

2. pkg/embed/cluster.go: doStartLocked() propagates errors instead
   of panicking. CN Start() failures are returned via atomic.Value
   and the first error is surfaced to the caller.

Previously, any bootstrap timeout during embedded cluster tests
would crash the entire test process with 'panic: context canceled',
causing cascading failures across unrelated test packages.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@qodo-code-review
Copy link
Copy Markdown

Review Summary by Qodo

Fix bootstrap panic on context cancellation and error propagation

🐞 Bug fix

Grey Divider

Walkthroughs

Description
• Prevent bootstrap panic on context cancellation/timeout
  - Check for context.Cause, context.Canceled, context.DeadlineExceeded before panicking
  - Return error gracefully instead of crashing process
• Replace panic with error propagation in embedded cluster startup
  - Use atomic.Value to capture first CN Start() error
  - Return error to caller instead of panicking
• Fixes cascading test failures during embedded cluster bootstrap timeout
Diagram
flowchart LR
  A["Bootstrap Context Timeout/Cancel"] --> B["Check Context Status"]
  B --> C["Return Error Gracefully"]
  D["CN Service Start Error"] --> E["Capture via atomic.Value"]
  E --> F["Return Error to Caller"]
  C --> G["Prevent Process Crash"]
  F --> G
Loading

Grey Divider

File Changes

1. pkg/cnservice/server.go 🐞 Bug fix +7/-0

Handle context cancellation gracefully in bootstrap

• Added context cancellation/timeout check before panicking in bootstrap()
• Returns error for context.Canceled and context.DeadlineExceeded instead of panicking
• Preserves panic behavior for actual bootstrap failures
• Aligns with existing BootstrapUpgrade context handling pattern

pkg/cnservice/server.go


2. pkg/embed/cluster.go 🐞 Bug fix +12/-8

Propagate startup errors instead of panicking

• Changed doStartLocked() return type from void to error
• Replaced panic(err) with error return for non-CN services
• Replaced panic(err) with atomic.Value capture for CN service goroutines
• Updated Start() and StartNewCNService() to handle returned errors

pkg/embed/cluster.go


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented Apr 5, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX Issues (0)

Grey Divider


Action required

1. Partial start leaks services 🐞 Bug ☼ Reliability
Description
embed.(*cluster).doStartLocked can now return an error after some services have already started, but
Start/StartNewCNService do not roll back started services. cluster.Close can then fail immediately
on a never-started operator (operator.Close returns "service already stopped"), potentially leaving
already-started services running and blocking cleanup/retry.
Code

pkg/embed/cluster.go[R114-123]

+func (c *cluster) doStartLocked(from int) error {
var wg sync.WaitGroup
-	errC := make(chan error, 1)
-	defer close(errC)
+	var startErr atomic.Value
for _, s := range c.services[from:] {
if s.serviceType != metadata.ServiceType_CN {
  if err := s.Start(); err != nil {
-				panic(err)
+				return err
  }
  continue
}
Evidence
doStartLocked returns the first CN Start() error after wg.Wait with no rollback/cleanup, and
cluster.Close returns on the first Close() error. operator.Start only sets state=started on success,
while operator.Close errors if the operator is already stopped; therefore, after a partial start
failure, Close() can error on a stopped operator and abort without stopping other already-started
services.

pkg/embed/cluster.go[99-154]
pkg/embed/operator.go[120-178]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`embed.(*cluster).doStartLocked` can now return an error, leaving the cluster partially started. A subsequent `cluster.Close()` may fail early because `operator.Close()` errors when the operator is already stopped, which can leave other already-started services running.
## Issue Context
This PR intentionally replaces panics with error returns. That makes partial-start states reachable, so shutdown/cleanup paths must be robust and (ideally) rollback on start failure.
## Fix Focus Areas
- pkg/embed/cluster.go[99-154]
- pkg/embed/operator.go[120-178]
## Suggested fix approach
- In `doStartLocked(from int)`, if any start error is observed, **rollback** services started in this call (iterate `c.services[from:]` in reverse; call `Close()` on operators that reached `started`).
- Make cleanup resilient:
- Option A (preferred): make `cluster.Close()` continue closing remaining services even if one `Close()` fails; aggregate errors (e.g., `errors.Join`).
- Option B: make `operator.Close()` idempotent (return `nil` when `op.state == stopped`) so `cluster.Close()` can always attempt to close all services.
- For `StartNewCNService`, if starting newly-added CNs fails, consider removing/rolling back the newly appended operators and restoring `c.options.cn` to avoid leaving broken operators in `c.services`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. Bootstrap cause not attached🐞 Bug ≡ Correctness
Description
cnservice.(*service).bootstrap returns the raw Bootstrap error on timeout/cancellation, so a
DeadlineExceeded error will not include the WithTimeoutCause cause (moerr.CauseBootstrap) in the
returned error chain. This reduces observability compared to the non-timeout path, which attaches
the cause via moerr.AttachCause(ctx, err).
Code

pkg/cnservice/server.go[R944-953]

if err := s.bootstrapService.Bootstrap(ctx); err != nil {
+		// Context cancellation or timeout is not a bootstrap failure — the
+		// service is being shut down or the deadline was exceeded.  Return
+		// the error so the caller can handle it gracefully instead of
+		// crashing the process.
+		if context.Cause(ctx) != nil || err == context.Canceled || err == context.DeadlineExceeded {
+			return err
+		}
panic(moerr.AttachCause(ctx, err))
}
Evidence
bootstrap() uses context.WithTimeoutCause(..., moerr.CauseBootstrap) but returns err directly for
timeout/cancellation. moerr.AttachCause explicitly joins context.Cause(ctx) into the error when the
error is DeadlineExceeded, but that is only called on the panic path; thus, returned
DeadlineExceeded errors won’t carry the bootstrap cause in their error chain.

pkg/cnservice/server.go[938-953]
pkg/common/moerr/cause.go[301-309]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
On bootstrap timeout/cancellation, `cnservice.(*service).bootstrap()` returns the raw error and does not attach the `WithTimeoutCause` cause (`moerr.CauseBootstrap`) into the returned error chain. This makes timeout failures harder to diagnose from logs/error propagation.
## Issue Context
`moerr.AttachCause(ctx, err)` is already the project’s helper for joining `context.Cause(ctx)` into `context.DeadlineExceeded` errors.
## Fix Focus Areas
- pkg/cnservice/server.go[938-953]
- pkg/common/moerr/cause.go[301-309]
## Suggested fix approach
- In the timeout/cancellation return path, return an attached error instead of the raw error, e.g.:
- `return moerr.AttachCause(ctx, err)` (safe: it’s a no-op unless the error is DeadlineExceeded)
- Optionally, prefer `errors.Is(err, context.Canceled/DeadlineExceeded)` over direct equality if Bootstrap may wrap these errors.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

@matrix-meow matrix-meow added the size/S Denotes a PR that changes [10,99] lines label Apr 5, 2026
@mergify mergify Bot added the kind/bug Something isn't working label Apr 5, 2026
The previous check (context.Cause(ctx) != nil || err == context.Canceled ||
err == context.DeadlineExceeded) was too broad. The bootstrap context is
created with context.WithTimeoutCause(context.Background(), 5min,
moerr.CauseBootstrap), so context.Cause(ctx) is always non-nil when the
5-minute deadline fires. That deadline IS a legitimate bootstrap failure
that should panic.

Only context.Canceled indicates external shutdown — consistent with the
pattern at line 963 for BootstrapUpgrade.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread pkg/embed/cluster.go
- Use errors.Is(err, context.Canceled) instead of == for robustness
  against wrapped errors (consistent with 70% of codebase usage)
- Document first-error-only semantics in cluster.go CompareAndSwap

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@matrix-meow matrix-meow added size/M Denotes a PR that changes [100,499] lines and removed size/S Denotes a PR that changes [10,99] lines labels Apr 5, 2026
Add 5 sub-tests in TestDoStartLockedErrorPaths to exercise the error
paths in cluster.go doStartLocked that were previously uncovered:
- Non-CN service Start() error returns immediately
- CN service Start() error captured via atomic.Value
- Start() propagates doStartLocked error
- Double-start rejection
- Happy path with empty service list

This brings diff coverage from 53.8% (7/13 blocks) to 92.3% (12/13),
exceeding the 75% threshold.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mergify mergify Bot added the queued label Apr 5, 2026
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Apr 5, 2026

Merge Queue Status

  • Entered queue2026-04-05 12:16 UTC · Rule: main
  • Checks skipped · PR is already up-to-date
  • Merged2026-04-05 12:16 UTC · at 49772d259e7aca2f6b6b599868c47aa92f3f91ae

This pull request spent 9 seconds in the queue, including 1 second running CI.

Required conditions to merge
  • #approved-reviews-by >= 1 [🛡 GitHub branch protection]
  • #changes-requested-reviews-by = 0 [🛡 GitHub branch protection]
  • #review-threads-unresolved = 0 [🛡 GitHub branch protection]
  • branch-protection-review-decision = APPROVED [🛡 GitHub branch protection]
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-neutral = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-skipped = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-neutral = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-skipped = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-neutral = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-skipped = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI / SCA Test on Ubuntu/x86
    • check-neutral = Matrixone CI / SCA Test on Ubuntu/x86
    • check-skipped = Matrixone CI / SCA Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI / UT Test on Ubuntu/x86
    • check-neutral = Matrixone CI / UT Test on Ubuntu/x86
    • check-skipped = Matrixone CI / UT Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-neutral = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-skipped = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-neutral = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-skipped = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-neutral = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-skipped = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Utils CI / Coverage
    • check-neutral = Matrixone Utils CI / Coverage
    • check-skipped = Matrixone Utils CI / Coverage

@mergify mergify Bot merged commit 78f7b42 into matrixorigin:main Apr 5, 2026
24 of 25 checks passed
@mergify mergify Bot removed the queued label Apr 5, 2026
@XuPeng-SH
Copy link
Copy Markdown
Contributor Author

All changes from this PR have been cherry-picked into PR #24062 (fix-cn-memory-leaks), which now includes both the CN memory fixes and the bootstrap panic fix. Closing this as duplicate.

The bootstrap fix includes:

  • handleBootstrapErr extraction with errors.Is(err, context.Canceled) check
  • cluster.go doStartLocked returns error instead of panicking
  • 11 test cases (6 for handleBootstrapErr + 5 for doStartLocked error paths)
  • 92.3% diff coverage

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Something isn't working size/M Denotes a PR that changes [100,499] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cnservice bootstrap panics on context.Canceled instead of returning error

3 participants