fix(engine): release worker buffers in coordinated close to avoid JVM crash on binding fault #1809

Merged
jfallows merged 1 commit into develop from fix/engine-controlled-shutdown
May 29, 2026

@jfallows
Contributor

Problem

A hot-path binding exception (e.g. a null-exchange NPE in a client stream factory) is caught by EngineWorker.doWork and rethrown wrapped in AgentTerminationException, terminating the worker agent. Agrona's AgentRunner then runs the agent's onClose on the agent thread, which previously released the worker's memory-mapped buffers early. The engine-coordinated Engine.close() then calls drain(), reading that worker's already-unmapped streams ring buffer via Unsafe.getLongVolatile → SIGSEGV / JVM abort (confirmed via the hs_err_pid fatal-error log).

The intended policy is preserved: an uncaught binding exception should stop the engine — but it must not also crash the JVM.
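The termination path described above can be sketched roughly as follows. This is a single-file model, not the engine's code: `TerminationException`, `Agent`, `run`, and `FaultyWorker` are hypothetical stand-ins for Agrona's `AgentTerminationException`, `Agent`, and `AgentRunner`, and for `EngineWorker`. It only illustrates that `onClose` runs on the agent thread once `doWork` rethrows.

```java
public final class AgentTerminationSketch
{
    // Hypothetical stand-in for Agrona's AgentTerminationException.
    static final class TerminationException extends RuntimeException
    {
        TerminationException(Throwable cause)
        {
            super(cause);
        }
    }

    // Hypothetical stand-in for the agent contract.
    interface Agent
    {
        int doWork();

        void onClose();
    }

    // Simplified runner loop (assumption): stop on termination, then call
    // onClose on the same thread that was running doWork.
    static Throwable run(Agent agent)
    {
        Throwable cause = null;
        try
        {
            while (true)
            {
                agent.doWork();
            }
        }
        catch (TerminationException ex)
        {
            cause = ex.getCause();
        }
        finally
        {
            agent.onClose();
        }
        return cause;
    }

    static final class FaultyWorker implements Agent
    {
        volatile String closedOnThread;
        volatile Throwable cause;

        @Override
        public int doWork()
        {
            try
            {
                Object exchange = null;
                return exchange.hashCode(); // simulated null-exchange NPE
            }
            catch (Throwable ex)
            {
                // Wrap and rethrow so the runner terminates this agent.
                throw new TerminationException(ex);
            }
        }

        @Override
        public void onClose()
        {
            closedOnThread = Thread.currentThread().getName();
        }
    }

    // Runs the faulty worker on a dedicated thread and waits for it to stop.
    static FaultyWorker runOnAgentThread(String threadName)
    {
        FaultyWorker worker = new FaultyWorker();
        Thread thread = new Thread(() -> worker.cause = run(worker), threadName);
        thread.start();
        try
        {
            thread.join();
        }
        catch (InterruptedException ex)
        {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        }
        return worker;
    }

    public static void main(String[] args)
    {
        FaultyWorker worker = runOnAgentThread("worker-0");
        System.out.println("cause=" + worker.cause + ", onClose ran on " + worker.closedOnThread);
    }
}
```

Because `onClose` executes on the agent thread at the moment of the fault, anything it unmaps disappears while the engine thread may still hold references to it.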

Fix

  • EngineWorker.onClose no longer releases any memory-mapped resources. All releases (targets, streamsLayout, bufferPoolLayout, debitors, creditor, eventWriter) are deferred to doClose's finally, which runs on the engine thread after runner.close() has stopped the agent. A self-terminated worker therefore no longer unmaps buffers that the engine, or peer workers (via shared creditor/debitor budget buffers), still reference.
  • Engine.close skips draining a worker whose agent already terminated, draining only when !worker.runner().isClosed(); such a worker will never consume again, so draining it would otherwise spin until the 30s timeout.
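The two changes above can be modelled with a minimal, single-threaded sketch. All names here (`CoordinatedCloseSketch`, `MappedResource`, `Worker`, `close`) are hypothetical and are not the engine's classes. The sketch assumes that the engine drains before closing workers, and it replaces the real failure mode (a SIGSEGV on an unmapped read) with an exception so that it is observable.

```java
import java.util.List;

public final class CoordinatedCloseSketch
{
    // Models a memory-mapped resource that must not be read after release.
    static final class MappedResource
    {
        private boolean released;

        long read()
        {
            if (released)
            {
                // The real engine would crash the JVM here instead.
                throw new IllegalStateException("read after release");
            }
            return 0L;
        }

        void release()
        {
            released = true;
        }

        boolean isReleased()
        {
            return released;
        }
    }

    static final class Worker
    {
        final MappedResource streams = new MappedResource();
        private boolean agentClosed;

        // Runs when the agent stops (on the agent thread in the real engine).
        // It deliberately releases nothing.
        void onClose()
        {
            agentClosed = true;
        }

        boolean isClosed()
        {
            return agentClosed;
        }

        // Runs under engine control: stop the agent first, release afterwards.
        void doClose()
        {
            try
            {
                if (!agentClosed)
                {
                    onClose(); // stands in for runner.close()
                }
            }
            finally
            {
                streams.release();
            }
        }
    }

    // Engine-coordinated close: drain only live workers, then close them all.
    static int close(List<Worker> workers)
    {
        int drained = 0;
        for (Worker worker : workers)
        {
            if (!worker.isClosed())
            {
                worker.streams.read(); // stands in for drain()
                drained++;
            }
        }
        for (Worker worker : workers)
        {
            worker.doClose();
        }
        return drained;
    }

    public static void main(String[] args)
    {
        Worker faulted = new Worker();
        Worker healthy = new Worker();
        faulted.onClose(); // simulate self-termination after a binding fault
        int drained = close(List.of(faulted, healthy));
        System.out.println("drained=" + drained +
            ", faulted released=" + faulted.streams.isReleased() +
            ", healthy released=" + healthy.streams.isReleased());
    }
}
```

Keeping every release in `doClose`'s `finally` means resource lifetime depends only on the engine's close sequence, regardless of whether a worker stopped on its own or was stopped by the engine.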

Validation

  • Repro (NPE injected in a binding newStream): pre-fix → JVM abort (exit 134, hs_err_pid); post-fix → no crash, no hs_err, ~1.0s (no drain hang). The worker stops and the engine closes cleanly.
  • ./mvnw verify -pl runtime/engine (unit + EngineIT + jacoco + checkstyle): green on the develop base.

🤖 Generated with Claude Code

@jfallows force-pushed the fix/engine-controlled-shutdown branch from fecc804 to a7b42c4 on May 29, 2026 20:19
… crash on binding fault

A hot-path binding exception is wrapped by EngineWorker.doWork as AgentTerminationException, terminating the worker agent. Agrona then runs onClose on the agent thread, which released the worker's memory-mapped buffers in advance. The engine-coordinated Engine.close() -> drain() then read that worker's unmapped streams ring buffer via Unsafe, causing a SIGSEGV / JVM abort.

Defer all worker memory-mapped resource releases (targets, streamsLayout, bufferPoolLayout, debitors, creditor, eventWriter) from onClose to doClose's finally, which runs after the agent has stopped, so a self-terminated worker no longer unmaps buffers the engine or peer workers still reference. Engine.close also skips draining workers whose agent already terminated, which would otherwise spin until the drain timeout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jfallows force-pushed the fix/engine-controlled-shutdown branch from a7b42c4 to 6413eee on May 29, 2026 20:20
Contributor Author

@jfallows left a comment

LGTM

@jfallows merged commit b5d9f61 into develop on May 29, 2026
4 of 5 checks passed