Skip to content

Fix connection lifetime and init-path crashes#212

Merged
caleb2h merged 10 commits intomainfrom
fix/crash-and-init
Apr 19, 2026
Merged

Fix connection lifetime and init-path crashes#212
caleb2h merged 10 commits intomainfrom
fix/crash-and-init

Conversation

@gophergogo
Copy link
Copy Markdown
Collaborator

@gophergogo gophergogo commented Apr 19, 2026

Summary

Six independent crash/ordering fixes, split one-per-commit for review. Each addresses a real production crash or hang observed in PR #211's larger bundle; this PR lands the crash fixes cleanly without the SSE-transport / libcurl work, which will follow as separate PRs.

Commits

  1. Fix scheduleCallbackCurrentIteration to defer past caller's stack frame — the primitive was running the callback synchronously when called on the dispatcher thread, breaking the deferredDelete contract. Route on-thread scheduling through a 0ms timer. Adds test_deferred_delete_ordering covering the contract.
  2. Defer closed-connection destruction via Dispatcher::deferredDelete — replaces the shared_ptr<ConnectionPtr> + post() workaround with the correct primitive. Adds a dispatcher-thread assert on entry.
  3. Fix background-task timer lifetime in McpServer — timers were local unique_ptrs that dropped at end of scope, so ticks never fired. Hold as members; re-arm via enableTimer on the same timer.
  4. Bind server connection callbacks per connection via adapter — removes the stale current_connection_ fallback. Each connection gets its own ConnectionLifecycleCallbacks, handed to deferredDelete alongside the connection on close.
  5. Route initializeProtocol state commit back to the dispatcher thread — worker thread was mutating server_capabilities_, initialized_, and the protocol state machine off-thread. Now parses on the worker and commits state via main_dispatcher_->post.
  6. Disable body timeout for client-mode HTTP codec — SSE response bodies are open-ended; 60s body timeout was firing and crashing HttpCodecStateMachine::executeTransition(). Server mode keeps the bounded timeout.

Test plan

  • test_deferred_delete_ordering (new) — 4 cases: synchronous-deferral
    contract, batched drain, nested-during-drain, lifecycle-adapter
    self+peer pattern
  • test_timer_lifetime (new) — re-arm-same-TimerPtr contract used by
    McpServer background tasks
  • test_http_codec_state_machine — adds BodyTimeoutDisabledWhenZero
    proving body_timeout==0 disables the body timer
  • test_http_codec_filter — adds HttpCodecFilterBodyTimeoutTest ×
    2, confirming client mode wires body_timeout=0 and server mode
    retains 60s
  • test_mcp_connection_manager — two new cases
    (OnCloseRunsCleanlyOnDispatcherThread,
    RepeatedEventSequencesOnDispatcherThreadDoNotCrash) plus the two
    existing OnConnectionEventPreventsDuplicate* cases refactored to
    drive through the dispatcher to satisfy the new isThreadSafe()
    assert
  • Full ctest suite: 118/118 passing
  • Extended runtime soak against production MCP backends (reviewers to verify)

And in Out-of-scope, add a line clarifying the remaining integration-level gaps that PR D picks up:

  • McpServer integration tests for ConnectionLifecycleCallbacks real-connection
    close path and background timer lifetime (tracked for PR D — need full
    ApplicationBase bootstrap; adapter/timer contracts already covered at
    primitive level in this PR)
  • McpClient initializeProtocol integration test (same rationale — tracked for PR D)

gophergogo added 7 commits April 19, 2026 14:13
…me (#212)

The previous implementation ran the callback synchronously when called
from the dispatcher thread, which breaks deferredDelete: the drain
runs while the caller's callback is still on the stack, destroying
objects mid-iteration and causing use-after-free in connection close
handlers. Route on-thread scheduling through a 0ms timer so libevent
fires the drain after the current handler returns.

Adds test_deferred_delete_ordering covering the contract.
)

onConnectionEvent(RemoteClose|LocalClose) runs inside the connection's
own callback loop. Destroying the connection synchronously triggered a
use-after-free once the loop resumed. The previous workaround wrapped
the unique_ptr in a shared_ptr and posted it — now replaced with the
dispatcher's deferred-delete queue, which Connection already supports
via DeferredDeletable.

Also adds a dispatcher-thread assert at the top of onConnectionEvent so
any off-thread synthesis of connection events trips immediately instead
of silently corrupting state.
startBackgroundTasks assigned timers to local unique_ptrs that dropped
at end of scope, so session cleanup and resource-update ticks never
fired. Each fire also constructed a fresh timer to re-arm, which both
leaked state and destroyed the currently-firing timer from inside its
own callback.

Hold both timers as McpServer members, and on each fire re-arm the
same timer via enableTimer. stopBackgroundTasks now disables them
explicitly so a pending fire can't run after shutdown begins.
McpServer registered itself on every connection and relied on a global
current_connection_ pointer to identify which connection fired a close
event. That pointer was stale whenever more than one connection was
active, so session teardown ran against the wrong session (or none).

Introduce ConnectionLifecycleCallbacks — a tiny adapter bound to a
single (server, connection) pair. Each connection gets its own
instance, held in lifecycle_callbacks_ until close. The close path
now knows exactly which connection is dying and tears down its session
directly.

Both the connection and the adapter are handed to the dispatcher's
deferred-delete queue on close, since we are inside the adapter's own
onEvent() when we drop them. Adds a dispatcher-thread assert at the
entry so any off-thread caller trips immediately.
…212)

The detached worker that waits for the initialize response previously
also wrote server_capabilities_, initialized_, and kicked
protocol_state_machine_->handleEvent(INITIALIZED) directly. Those
fields are read from the dispatcher thread, so the worker's writes
were a data race, and the state-machine event was being handled off
the dispatcher thread that owns its timers.

Restructure so the worker only blocks and parses, then posts the state
commit and promise resolution back to main_dispatcher_. The promise
resolves after the post so get()-returning callers observe the
published state.
Client connections carry SSE responses whose body is an open-ended
stream of events — quiet periods between events are expected and
valid. The 60s body timeout was firing on these, and the state
machine's executeTransition() then crashed on a still-live stream.

Set body_timeout to 0 (disabled) for client mode; keep 60s on server
mode where request bodies are expected to complete promptly.
Covers three invariants the preceding commits rely on:

- Nested deferredDelete: a destructor that itself calls deferredDelete
  must not lose the inner object — it drains on a later pass. Guards
  the real-world pattern where a connection's destructor releases
  sub-objects that also need deferral.

- body_timeout == 0 disables the body timer entirely. The client-mode
  HttpCodecFilter sets this so SSE response streams with long idle
  gaps between events don't trigger a false timeout.

- Periodic timer re-arm on the same TimerPtr: the pattern McpServer's
  background tasks use. Confirms repeated fires and prompt cessation
  on disableTimer(), replacing the previous scheme that built a fresh
  timer inside each fire.
gophergogo added 3 commits April 19, 2026 14:47
McpConnectionManager::onConnectionEvent now asserts isThreadSafe(),
so calling it from the test thread aborts. Spin up a background
dispatcher thread in the fixture and route the two existing
OnConnectionEventPreventsDuplicate* cases through a runOnDispatcher
helper. Adds OnCloseRunsCleanlyOnDispatcherThread, which walks
Connected -> RemoteClose -> LocalClose on the dispatcher, and
RepeatedEventSequencesOnDispatcherThreadDoNotCrash, which replays
the sequence five times to catch any lifetime bug that only surfaces
under repeated close events.
Adds LifecycleAdapterDefersSelfAndPeerFromOwnCallback, a
primitive-level test that models McpServer::ConnectionLifecycleCallbacks:
an adapter which, from inside its own onEvent, hands both itself and
the owning peer connection to deferredDelete. Confirms neither
destructs synchronously inside that callback — doing so would
use-after-free the ConnectionCallbacks iterator in production — and
that both drain on a later dispatcher pass.
The state-machine test already proves body_timeout == 0 disables the
body timer. What it does not prove is that HttpCodecFilter in client
mode is actually wired to pass 0. The client-mode fix is a single
ternary in the filter's constructor; a regression (e.g. someone
reverting to a flat 60s) would bring back the SSE crash silently.

Add a narrow introspection accessor:
  - HttpCodecStateMachine::config() — read-only view of the
    as-configured values (immutable after construction).
  - HttpCodecFilter::bodyTimeout() — forwards to the state machine's
    configured body_timeout.

Use it in two new tests: server mode retains 60s, client mode is 0.
@caleb2h caleb2h merged commit f7a4c46 into main Apr 19, 2026
1 check passed
caleb2h pushed a commit that referenced this pull request Apr 19, 2026
…me (#212)

The previous implementation ran the callback synchronously when called
from the dispatcher thread, which breaks deferredDelete: the drain
runs while the caller's callback is still on the stack, destroying
objects mid-iteration and causing use-after-free in connection close
handlers. Route on-thread scheduling through a 0ms timer so libevent
fires the drain after the current handler returns.

Adds test_deferred_delete_ordering covering the contract.
caleb2h pushed a commit that referenced this pull request Apr 19, 2026
)

onConnectionEvent(RemoteClose|LocalClose) runs inside the connection's
own callback loop. Destroying the connection synchronously triggered a
use-after-free once the loop resumed. The previous workaround wrapped
the unique_ptr in a shared_ptr and posted it — now replaced with the
dispatcher's deferred-delete queue, which Connection already supports
via DeferredDeletable.

Also adds a dispatcher-thread assert at the top of onConnectionEvent so
any off-thread synthesis of connection events trips immediately instead
of silently corrupting state.
caleb2h pushed a commit that referenced this pull request Apr 19, 2026
startBackgroundTasks assigned timers to local unique_ptrs that dropped
at end of scope, so session cleanup and resource-update ticks never
fired. Each fire also constructed a fresh timer to re-arm, which both
leaked state and destroyed the currently-firing timer from inside its
own callback.

Hold both timers as McpServer members, and on each fire re-arm the
same timer via enableTimer. stopBackgroundTasks now disables them
explicitly so a pending fire can't run after shutdown begins.
caleb2h pushed a commit that referenced this pull request Apr 19, 2026
McpServer registered itself on every connection and relied on a global
current_connection_ pointer to identify which connection fired a close
event. That pointer was stale whenever more than one connection was
active, so session teardown ran against the wrong session (or none).

Introduce ConnectionLifecycleCallbacks — a tiny adapter bound to a
single (server, connection) pair. Each connection gets its own
instance, held in lifecycle_callbacks_ until close. The close path
now knows exactly which connection is dying and tears down its session
directly.

Both the connection and the adapter are handed to the dispatcher's
deferred-delete queue on close, since we are inside the adapter's own
onEvent() when we drop them. Adds a dispatcher-thread assert at the
entry so any off-thread caller trips immediately.
caleb2h pushed a commit that referenced this pull request Apr 19, 2026
…212)

The detached worker that waits for the initialize response previously
also wrote server_capabilities_, initialized_, and kicked
protocol_state_machine_->handleEvent(INITIALIZED) directly. Those
fields are read from the dispatcher thread, so the worker's writes
were a data race, and the state-machine event was being handled off
the dispatcher thread that owns its timers.

Restructure so the worker only blocks and parses, then posts the state
commit and promise resolution back to main_dispatcher_. The promise
resolves after the post so get()-returning callers observe the
published state.
caleb2h pushed a commit that referenced this pull request Apr 19, 2026
Client connections carry SSE responses whose body is an open-ended
stream of events — quiet periods between events are expected and
valid. The 60s body timeout was firing on these, and the state
machine's executeTransition() then crashed on a still-live stream.

Set body_timeout to 0 (disabled) for client mode; keep 60s on server
mode where request bodies are expected to complete promptly.
caleb2h pushed a commit that referenced this pull request Apr 19, 2026
Covers three invariants the preceding commits rely on:

- Nested deferredDelete: a destructor that itself calls deferredDelete
  must not lose the inner object — it drains on a later pass. Guards
  the real-world pattern where a connection's destructor releases
  sub-objects that also need deferral.

- body_timeout == 0 disables the body timer entirely. The client-mode
  HttpCodecFilter sets this so SSE response streams with long idle
  gaps between events don't trigger a false timeout.

- Periodic timer re-arm on the same TimerPtr: the pattern McpServer's
  background tasks use. Confirms repeated fires and prompt cessation
  on disableTimer(), replacing the previous scheme that built a fresh
  timer inside each fire.
caleb2h pushed a commit that referenced this pull request Apr 19, 2026
McpConnectionManager::onConnectionEvent now asserts isThreadSafe(),
so calling it from the test thread aborts. Spin up a background
dispatcher thread in the fixture and route the two existing
OnConnectionEventPreventsDuplicate* cases through a runOnDispatcher
helper. Adds OnCloseRunsCleanlyOnDispatcherThread, which walks
Connected -> RemoteClose -> LocalClose on the dispatcher, and
RepeatedEventSequencesOnDispatcherThreadDoNotCrash, which replays
the sequence five times to catch any lifetime bug that only surfaces
under repeated close events.
caleb2h pushed a commit that referenced this pull request Apr 19, 2026
Adds LifecycleAdapterDefersSelfAndPeerFromOwnCallback, a
primitive-level test that models McpServer::ConnectionLifecycleCallbacks:
an adapter which, from inside its own onEvent, hands both itself and
the owning peer connection to deferredDelete. Confirms neither
destructs synchronously inside that callback — doing so would
use-after-free the ConnectionCallbacks iterator in production — and
that both drain on a later dispatcher pass.
@gophergogo gophergogo mentioned this pull request Apr 19, 2026
5 tasks
gophergogo pushed a commit that referenced this pull request Apr 20, 2026
McpServerConfig already carries http_sse_path, http_rpc_path, and the
new external_url (added earlier in this stack), but the fallback
"default HTTP/SSE filter chain factory" construction path was still
building the factory with no path arguments — so the SSE server
transport was hardcoded to /sse and /mcp with no external URL,
regardless of what the operator configured.

Forward all three fields through the ctor. Also set http_path to
http_rpc_path explicitly so a server that uses this fallback path
surfaces a consistent RPC endpoint; http_host stays "localhost"
because it's the client-mode Host header and irrelevant in server
mode.

Note: upstream's version of this change also rewrote the
active_connections_ cleanup on close to post-based deferral, but the
PR #212 crash-fix stack already replaced that code with
Dispatcher::deferredDelete, which is the correct primitive here and
handles the use-after-free window that post() only partially covers.
Intentionally skipping that part of the upstream diff.
gophergogo pushed a commit that referenced this pull request Apr 20, 2026
)

McpServerConfig already carries http_sse_path, http_rpc_path, and the
new external_url (added earlier in this stack), but the fallback
"default HTTP/SSE filter chain factory" construction path was still
building the factory with no path arguments — so the SSE server
transport was hardcoded to /sse and /mcp with no external URL,
regardless of what the operator configured.

Forward all three fields through the ctor. Also set http_path to
http_rpc_path explicitly so a server that uses this fallback path
surfaces a consistent RPC endpoint; http_host stays "localhost"
because it's the client-mode Host header and irrelevant in server
mode.

Note: upstream's version of this change also rewrote the
active_connections_ cleanup on close to post-based deferral, but the
PR #212 crash-fix stack already replaced that code with
Dispatcher::deferredDelete, which is the correct primitive here and
handles the use-after-free window that post() only partially covers.
Intentionally skipping that part of the upstream diff.
gophergogo pushed a commit that referenced this pull request Apr 20, 2026
)

McpServerConfig already carries http_sse_path, http_rpc_path, and the
new external_url (added earlier in this stack), but the fallback
"default HTTP/SSE filter chain factory" construction path was still
building the factory with no path arguments — so the SSE server
transport was hardcoded to /sse and /mcp with no external URL,
regardless of what the operator configured.

Forward all three fields through the ctor. Also set http_path to
http_rpc_path explicitly so a server that uses this fallback path
surfaces a consistent RPC endpoint; http_host stays "localhost"
because it's the client-mode Host header and irrelevant in server
mode.

Note: upstream's version of this change also rewrote the
active_connections_ cleanup on close to post-based deferral, but the
PR #212 crash-fix stack already replaced that code with
Dispatcher::deferredDelete, which is the correct primitive here and
handles the use-after-free window that post() only partially covers.
Intentionally skipping that part of the upstream diff.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants