Add JmsFailoverWatchdog and bound failover reconnect #1681

Merged
jcschaff merged 2 commits into master from fix-jms-failover-wedge on May 7, 2026

Conversation

@jcschaff (Member) commented on May 6, 2026

Summary

  • Problem. On 2026-05-06 the prod data pod wedged its ActiveMQ Classic client connection: IllegalStateException: Timer already cancelled. repeated in the FailoverTransport reconnect loop, JMS traffic stopped, and a manual kubectl rollout restart was the only recovery. The root cause is in org.apache.activemq.transport.AbstractInactivityMonitor: a JVM-wide static Timer is shared by every InactivityMonitor and refcounted by CHECKER_COUNTER. If any TimerTask throws an unchecked exception, the TimerThread self-terminates silently while the static fields still point at the corpse Timer; since CHECKER_COUNTER > 0 blocks the lazy-init guard, every subsequent reconnect's schedule() throws forever (see the sketch after this list). The data pod's high JMS-connection churn (~30 fresh Connection + InactivityMonitor pairs per minute, since MessageProducerSessionJms constructs one per RPC reply) makes this latent failure essentially inevitable on long-running pods.
  • Defense-in-depth fix. Bound the failover reconnect budget and attach a TransportListener that exits the JVM on terminal IOException so K8s recycles the pod. Default behavior is log-only — only SimDataServer (the pod that wedged) opts into JVM exit. The other daemons (DatabaseServer, SimulationDispatcher, HtcSimulationWorker, the vcell-rest Guice binding) are intentionally not opted in here; review each individually before adding.
  • Version bump (does not fix the wedge). activemq-broker/-client 5.18.3 → 5.18.7 picks up unrelated OpenWire validation hardening and dependency updates. The InactivityMonitor / FailoverTransport files are byte-identical between 5.18.3 and 5.18.7 (verified by diff against upstream), and the commit message explicitly disclaims a wedge fix.
  • Note for reviewers: the last commit (6d1f44bd8d) is a doc-only update to .claude/commands/loki-query.md cherry-picked from #loki-query-kubectl-direct. Whichever PR merges first will absorb it; the second will show no overlap.
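
For reviewers unfamiliar with this java.util.Timer failure mode, here is a minimal standalone reproduction (plain JDK code, not VCell or ActiveMQ source; the class and thread names are illustrative):

```java
import java.util.Timer;
import java.util.TimerTask;

// Standalone reproduction of the wedge mechanism: one throwing TimerTask
// kills the TimerThread, and every later schedule() on that Timer instance
// throws IllegalStateException: Timer already cancelled.
public class TimerWedgeDemo {
    public static void main(String[] args) throws InterruptedException {
        // Stands in for the JVM-wide static heartbeat Timer in AbstractInactivityMonitor.
        Timer sharedTimer = new Timer("heartbeat", true);

        sharedTimer.schedule(new TimerTask() {
            @Override public void run() {
                // Any unchecked throw escapes TimerThread.mainLoop(); the thread
                // dies and the Timer behaves as if cancel() had been called.
                throw new RuntimeException("boom");
            }
        }, 10L);

        Thread.sleep(200L); // let the task fire and the TimerThread die

        // Equivalent to each FailoverTransport reconnect attempt on the wedged pod:
        sharedTimer.schedule(new TimerTask() {
            @Override public void run() { }
        }, 10L); // throws IllegalStateException: Timer already cancelled.
    }
}
```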

Investigation evidence (in case you want to verify)

Reconstructed via /loki-query for the data pod and kubectl exec against the broker:

| Time (UTC) | Event |
| --- | --- |
| 2026-05-05 15:16 | Old data pod data-55b4888797-gm82j starts |
| sometime in the next 15 h | A TimerTask on the JVM-wide static heartbeat Timer throws unchecked → TimerThread dies silently (no log) |
| 2026-05-06 06:41:05 | Broker WARN `Transport Connection to: tcp://10.42.3.55:40214 failed: java.io.EOFException` — exposes the latent wedge |
| 06:41:05+ | FailoverTransport tries to reconnect; schedule() on the dead Timer throws; 10 instant-fail attempts in 32 s |
| 06:41:37 | First "after: 10 attempt(s) with Timer already cancelled." surfaces in pod logs |

The wedge isn't caused by the network event — that's just what exposes it. In steady state the bug is invisible.

Test plan

  • mvn test -pl vcell-server -Dtest=JmsFailoverWatchdogTest — passes (~0.7s) using embedded BrokerService + latch handler
  • mvn compile test-compile -pl vcell-server,vcell-rest -am — BUILD SUCCESS
  • Run Fast group: mvn test -Dgroups="Fast" — sanity-check that the version bump causes no regressions in JMS paths
  • Stage rollout: kill activemqint for >10 min, observe the data pod log for `FATAL JMS transport unrecoverable` followed by a K8s pod recycle within ~5-8 minutes (maxReconnectAttempts=20 × exponential backoff capped at 30s; see the worked timing after this list)
  • Confirm only SimDataServer opts in — the other daemons should log `JMS transport interrupted, failover reconnecting` but not exit when activemqint is unavailable
  • After deploy, monitor for any unexpected exits (the watchdog should fire only when the failover transport genuinely gives up)
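
On the ~5-8 minute window above: assuming the failover transport's default doubling multiplier, a 1 s initial delay growing to the 30 s cap gives 1 + 2 + 4 + 8 + 16 = 31 s for the first five retries and 15 × 30 = 450 s for the remaining fifteen, so the failover layer gives up roughly 481 s (≈ 8 min) after the first failed reconnect, plus whatever time the attempts themselves take; that matches the upper end of the estimate.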

Follow-ups not in this PR

  • AMQP 1.0 migration of OpenWire daemons (data, db, sched, submit) — folds into the in-flight Quarkus 3.27 + Artemis work; eliminates the bug class entirely (different heartbeat mechanism, no shared static Timer).
  • Reconfigure activemqint/activemqsim/artemismq to log to stdout (currently supervisord-PID-1 swallows broker logs — see the doc commit).

🤖 Generated with Claude Code

jcschaff and others added 2 commits May 7, 2026 00:22
Bound the FailoverTransport reconnect budget with maxReconnectAttempts=20
and exponential backoff (1s → 30s) in VCMessagingServiceActiveMQ; keep
startupMaxReconnectAttempts=-1 so pod boot tolerates a slow broker.
Without this the FailoverTransport reconnects forever, which is wrong
behavior in K8s where pod restart is the right response to a sustained
broker outage.
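
For concreteness, a sketch of the broker URL this configuration implies (the real assembly lives in VCMessagingServiceActiveMQ; the host and port are placeholders, while the query options are standard ActiveMQ failover-transport parameters):

```java
import javax.jms.ConnectionFactory;
import org.apache.activemq.ActiveMQConnectionFactory;

class FailoverUrlSketch {
    // Illustrative only: host/port are placeholders, and the exact URL
    // assembly in VCMessagingServiceActiveMQ may differ.
    static ConnectionFactory boundedFailoverFactory() {
        String brokerUrl = "failover:(tcp://activemqint:61616)"
                + "?maxReconnectAttempts=20"        // terminal IOException after 20 failed reconnects
                + "&startupMaxReconnectAttempts=-1" // but retry forever while the pod boots
                + "&initialReconnectDelay=1000"     // 1 s first backoff...
                + "&maxReconnectDelay=30000"        // ...doubling up to the 30 s cap
                + "&useExponentialBackOff=true";
        return new ActiveMQConnectionFactory(brokerUrl);
    }
}
```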

Add JmsFailoverWatchdog: a TransportListener attached to each
ActiveMQConnection that runs a caller-supplied Runnable when the failover
layer reports a terminal IOException. The terminal action is constructor-
injected so production wiring stays visible at the composition root and
tests can substitute their own handler. Two factory methods: logOnly()
(the default — log lifecycle events but take no further action) and
jvmExitOnTerminal() (escape hatch for any future service that wants K8s
pod recycle on terminal transport failure).
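
A minimal sketch of that shape, assuming the factory names above; the method bodies and exit code are illustrative, not the merged source:

```java
import java.io.IOException;
import org.apache.activemq.ActiveMQConnection;
import org.apache.activemq.transport.TransportListener;

// Minimal sketch, not the merged source: factory names match the commit
// message; logging and exact wiring are illustrative.
public final class JmsFailoverWatchdog implements TransportListener {
    private final Runnable terminalAction; // constructor-injected, visible at the composition root

    public JmsFailoverWatchdog(Runnable terminalAction) {
        this.terminalAction = terminalAction;
    }

    /** Default: log lifecycle events, take no further action. */
    public static JmsFailoverWatchdog logOnly() {
        return new JmsFailoverWatchdog(() -> { });
    }

    /** Escape hatch: let K8s recycle the pod on terminal transport failure. */
    public static JmsFailoverWatchdog jvmExitOnTerminal() {
        return new JmsFailoverWatchdog(() -> System.exit(1));
    }

    public void attach(ActiveMQConnection connection) {
        connection.addTransportListener(this);
    }

    // With a failover URL, onException fires only once the reconnect budget
    // is exhausted; transient drops surface as transportInterupted() instead.
    @Override public void onException(IOException error) {
        terminalAction.run();
    }

    @Override public void onCommand(Object command) { }
    @Override public void transportInterupted() { /* (sic, ActiveMQ's spelling) failover reconnecting; log here */ }
    @Override public void transportResumed() { }
}
```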

VCMessagingServiceJms holds a watchdog field with a setter; the default
is JmsFailoverWatchdog.logOnly(). No service opts into jvmExitOnTerminal
in this change — the setter is the escape hatch for any future need.

Wired into MessageProducerSessionJms and ConsumerContextJms — the two
long-lived JMS connection sites. Short-lived batch processes
(OptimizationBatchServer, JavaSimulationExecutable) intentionally skipped.

Why this is defense-in-depth, not the primary fix: the OOM-driven wedge
mechanism that originally motivated this work — a TimerTask hitting
OutOfMemoryError on the JVM-wide static InactivityMonitor heartbeat
Timer, killing the TimerThread silently and corrupting the failover
transport for the rest of the JVM lifetime — is closed off by the
-XX:+ExitOnOutOfMemoryError flag added in PR #1683 (the JVM aborts on
the first OOM before the InactivityMonitor TimerThread can be touched).
The watchdog covers non-OOM terminal-failover paths: sustained network
partition, broker maintenance > 8 min, or any future client regression
in the static-Timer / static-counter design (still present in 5.18.x).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JmsFailoverWatchdogTest spins up an embedded ActiveMQ BrokerService on a
random TCP port, opens a connection through a tightly-bounded failover
URL (maxReconnectAttempts=2, 50ms initial / 100ms max backoff), attaches
a watchdog with a CountDownLatch terminal handler, stops the broker, and
asserts the latch fires within 10s. Tagged Fast; runs in ~0.7s.
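
A sketch of the test's core flow under those settings (JUnit 5 style assumed; the watchdog's constructor-injection surface is inferred from the commit message above, not copied from the merged test):

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.apache.activemq.ActiveMQConnection;
import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.broker.BrokerService;
import org.junit.jupiter.api.Test;

class JmsFailoverWatchdogTestSketch {
    @Test
    void firesTerminalHandlerWhenFailoverGivesUp() throws Exception {
        BrokerService broker = new BrokerService();
        broker.setPersistent(false);
        broker.setUseJmx(false);
        broker.addConnector("tcp://localhost:0"); // random free TCP port
        broker.start();
        String uri = broker.getTransportConnectors().get(0).getPublishableConnectString();

        // Tightly bounded failover so the test gives up fast.
        String url = "failover:(" + uri + ")?maxReconnectAttempts=2"
                + "&initialReconnectDelay=50&maxReconnectDelay=100";
        ActiveMQConnection conn =
                (ActiveMQConnection) new ActiveMQConnectionFactory(url).createConnection();
        conn.start();

        // Latch handler instead of the production System.exit handler.
        CountDownLatch terminal = new CountDownLatch(1);
        new JmsFailoverWatchdog(terminal::countDown).attach(conn);

        broker.stop(); // force the failover layer to exhaust its 2 attempts

        assertTrue(terminal.await(10, TimeUnit.SECONDS),
                "watchdog terminal handler should fire without killing the test JVM");
    }
}
```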

Validates that JmsFailoverWatchdog.attach correctly registers with
ActiveMQConnection and that the injected Runnable runs when the failover
transport reports a terminal IOException — without killing the test JVM
via the production System.exit handler.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jcschaff force-pushed the fix-jms-failover-wedge branch from 6d1f44b to 58e0bc8 on May 7, 2026 04:48
@jcschaff changed the title from "Recover ActiveMQ failover wedge via JmsFailoverWatchdog" to "Add JmsFailoverWatchdog and bound failover reconnect" on May 7, 2026
@jcschaff merged commit 210bb01 into master on May 7, 2026
13 checks passed
@jcschaff deleted the fix-jms-failover-wedge branch on May 7, 2026 04:58