fix: cnservice bootstrap panics on context.Canceled instead of returning error#24065
Conversation
…ing error (matrixorigin#24064) Two changes: 1. pkg/cnservice/server.go: bootstrap() now checks for context cancellation/timeout before panicking. context.Canceled and context.DeadlineExceeded are external signals, not bootstrap failures — return the error so the caller can handle gracefully. 2. pkg/embed/cluster.go: doStartLocked() propagates errors instead of panicking. CN Start() failures are returned via atomic.Value and the first error is surfaced to the caller. Previously, any bootstrap timeout during embedded cluster tests would crash the entire test process with 'panic: context canceled', causing cascading failures across unrelated test packages. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Review Summary by QodoFix bootstrap panic on context cancellation and error propagation
WalkthroughsDescription• Prevent bootstrap panic on context cancellation/timeout - Check for context.Cause, context.Canceled, context.DeadlineExceeded before panicking - Return error gracefully instead of crashing process • Replace panic with error propagation in embedded cluster startup - Use atomic.Value to capture first CN Start() error - Return error to caller instead of panicking • Fixes cascading test failures during embedded cluster bootstrap timeout Diagramflowchart LR
A["Bootstrap Context Timeout/Cancel"] --> B["Check Context Status"]
B --> C["Return Error Gracefully"]
D["CN Service Start Error"] --> E["Capture via atomic.Value"]
E --> F["Return Error to Caller"]
C --> G["Prevent Process Crash"]
F --> G
File Changes1. pkg/cnservice/server.go
|
Code Review by Qodo
1. Partial start leaks services
|
The previous check (context.Cause(ctx) != nil || err == context.Canceled || err == context.DeadlineExceeded) was too broad. The bootstrap context is created with context.WithTimeoutCause(context.Background(), 5min, moerr.CauseBootstrap), so context.Cause(ctx) is always non-nil when the 5-minute deadline fires. That deadline IS a legitimate bootstrap failure that should panic. Only context.Canceled indicates external shutdown — consistent with the pattern at line 963 for BootstrapUpgrade. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use errors.Is(err, context.Canceled) instead of == for robustness against wrapped errors (consistent with 70% of codebase usage) - Document first-error-only semantics in cluster.go CompareAndSwap Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add 5 sub-tests in TestDoStartLockedErrorPaths to exercise the error paths in cluster.go doStartLocked that were previously uncovered: - Non-CN service Start() error returns immediately - CN service Start() error captured via atomic.Value - Start() propagates doStartLocked error - Double-start rejection - Happy path with empty service list This brings diff coverage from 53.8% (7/13 blocks) to 92.3% (12/13), exceeding the 75% threshold. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Merge Queue Status
This pull request spent 9 seconds in the queue, including 1 second running CI. Required conditions to merge
|
|
All changes from this PR have been cherry-picked into PR #24062 (fix-cn-memory-leaks), which now includes both the CN memory fixes and the bootstrap panic fix. Closing this as duplicate. The bootstrap fix includes:
|
What type of PR is this?
Which issue(s) this PR fixes:
Fixes #24064
What this PR does / why we need it:
Problem
cnservice.(*service).bootstrap()panics on ALL bootstrap errors, includingcontext.Canceledandcontext.DeadlineExceeded. This crashes the entire process when the 5-minute bootstrap timeout expires, which frequently happens during embedded cluster tests with multiple CNs.Additionally,
embed/cluster.go:doStartLocked()panics on CN Start() errors instead of propagating them to the caller.Root Cause
Context cancellation is NOT a bootstrap failure — it's an external signal that the operation should stop. The same file already handles
context.Canceledgracefully forBootstrapUpgrade(line 955) but not for initialBootstrap.Fix
1.
pkg/cnservice/server.go: Check for context cancellation/timeout before panicking. Return error instead of panicking for context-related failures.2.
pkg/embed/cluster.go: ChangedoStartLocked()frompanic(err)to proper error propagation viaatomic.Value, returning the first error to the caller.Impact
BootstrapUpgradecontext handling patternRisk Assessment
Very low risk. Both changes are purely additive:
server.go: Only changes the error handling for context-related errors; real bootstrap failures still panic as beforecluster.go: Replacespanic(err)withreturn err— strictly more graceful behavior