Add JmsFailoverWatchdog and bound failover reconnect #1681

Merged
jcschaff merged 2 commits into master from fix-jms-failover-wedge on May 7, 2026

Conversation

@jcschaff (Member) commented on May 6, 2026

Summary

  • Problem. On 2026-05-06 the prod data pod wedged its ActiveMQ Classic client connection: IllegalStateException: Timer already cancelled. repeated in the FailoverTransport reconnect loop, JMS traffic stopped, and a manual kubectl rollout restart was the only recovery. The root cause is in org.apache.activemq.transport.AbstractInactivityMonitor: a JVM-wide static Timer is shared by every InactivityMonitor and refcounted by CHECKER_COUNTER. If any TimerTask throws an unchecked exception, the TimerThread self-terminates silently while the static fields still point at the corpse Timer; since CHECKER_COUNTER > 0 blocks the lazy-init guard, every subsequent reconnect's schedule() throws forever (see the sketch after this list). The data pod's high JMS-connection churn (~30 fresh Connection + InactivityMonitor pairs per minute, since MessageProducerSessionJms constructs one per RPC reply) makes this latent failure essentially inevitable on long-running pods.
  • Defense-in-depth fix. Bound the failover reconnect budget and attach a TransportListener that exits the JVM on terminal IOException so K8s recycles the pod. Default behavior is log-only — only SimDataServer (the pod that wedged) opts into JVM exit. The other daemons (DatabaseServer, SimulationDispatcher, HtcSimulationWorker, the vcell-rest Guice binding) are intentionally not opted in here; review each individually before adding.
  • Version bump (does not fix the wedge). activemq-broker/-client 5.18.3 → 5.18.7 picks up unrelated OpenWire validation hardening and dependency updates. The InactivityMonitor / FailoverTransport files are byte-identical between 5.18.3 and 5.18.7 (verified by diff against upstream), and the commit message explicitly disclaims a wedge fix.
  • Note for reviewers: the last commit (6d1f44bd8d) is a doc-only update to .claude/commands/loki-query.md cherry-picked from #loki-query-kubectl-direct. Whichever PR merges first will absorb it; the second will show no overlap.
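
For reviewers unfamiliar with this java.util.Timer failure mode, here is a minimal standalone reproduction (plain JDK code, not VCell or ActiveMQ source; the class and thread names are illustrative):

```java
import java.util.Timer;
import java.util.TimerTask;

// Standalone reproduction of the wedge mechanism: one throwing TimerTask
// kills the TimerThread, and every later schedule() on that Timer instance
// throws IllegalStateException: Timer already cancelled.
public class TimerWedgeDemo {
    public static void main(String[] args) throws InterruptedException {
        // Stands in for the JVM-wide static heartbeat Timer in AbstractInactivityMonitor.
        Timer sharedTimer = new Timer("heartbeat", true);

        sharedTimer.schedule(new TimerTask() {
            @Override public void run() {
                // Any unchecked throw escapes TimerThread.mainLoop(); the thread
                // dies and the Timer behaves as if cancel() had been called.
                throw new RuntimeException("boom");
            }
        }, 10L);

        Thread.sleep(200L); // let the task fire and the TimerThread die

        // Equivalent to each FailoverTransport reconnect attempt on the wedged pod:
        sharedTimer.schedule(new TimerTask() {
            @Override public void run() { }
        }, 10L); // throws IllegalStateException: Timer already cancelled.
    }
}
```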

Investigation evidence (in case you want to verify)

Reconstructed via /loki-query for the data pod and kubectl exec against the broker:

| Time (UTC) | Event |
| --- | --- |
| 2026-05-05 15:16 | Old data pod data-55b4888797-gm82j starts |
| sometime in the next 15 h | A TimerTask on the JVM-wide static heartbeat Timer throws unchecked → TimerThread dies silently (no log) |
| 2026-05-06 06:41:05 | Broker WARN `Transport Connection to: tcp://10.42.3.55:40214 failed: java.io.EOFException` — exposes the latent wedge |
| 06:41:05+ | FailoverTransport tries to reconnect; schedule() on the dead Timer throws; 10 instant-fail attempts in 32 s |
| 06:41:37 | First "after: 10 attempt(s) with Timer already cancelled." surfaces in pod logs |

The wedge isn't caused by the network event — that's just what exposes it. In steady state the bug is invisible.

Test plan

  • mvn test -pl vcell-server -Dtest=JmsFailoverWatchdogTest — passes (~0.7s) using embedded BrokerService + latch handler
  • mvn compile test-compile -pl vcell-server,vcell-rest -am — BUILD SUCCESS
  • Run Fast group: mvn test -Dgroups="Fast" — sanity-check that the version bump causes no regressions in JMS paths
  • Stage rollout: kill activemqint for >10 min, observe the data pod log for `FATAL JMS transport unrecoverable` followed by a K8s pod recycle within ~5-8 minutes (maxReconnectAttempts=20 × exponential backoff capped at 30s; see the worked timing after this list)
  • Confirm only SimDataServer opts in — the other daemons should log `JMS transport interrupted, failover reconnecting` but not exit when activemqint is unavailable
  • After deploy, monitor for any unexpected exits (the watchdog should fire only when the failover transport genuinely gives up)
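
On the ~5-8 minute window above: assuming the failover transport's default doubling multiplier, a 1 s initial delay growing to the 30 s cap gives 1 + 2 + 4 + 8 + 16 = 31 s for the first five retries and 15 × 30 = 450 s for the remaining fifteen, so the failover layer gives up roughly 481 s (≈ 8 min) after the first failed reconnect, plus whatever time the attempts themselves take; that matches the upper end of the estimate.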

Follow-ups not in this PR

  • AMQP 1.0 migration of OpenWire daemons (data, db, sched, submit) — folds into the in-flight Quarkus 3.27 + Artemis work; eliminates the bug class entirely (different heartbeat mechanism, no shared static Timer).
  • Reconfigure activemqint/activemqsim/artemismq to log to stdout (currently supervisord-PID-1 swallows broker logs — see the doc commit).

🤖 Generated with Claude Code

jcschaff and others added 2 commits May 7, 2026 00:22
Bound the FailoverTransport reconnect budget with maxReconnectAttempts=20
and exponential backoff (1s → 30s) in VCMessagingServiceActiveMQ; keep
startupMaxReconnectAttempts=-1 so pod boot tolerates a slow broker.
Without this the FailoverTransport reconnects forever, which is wrong
behavior in K8s where pod restart is the right response to a sustained
broker outage.
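
For concreteness, a sketch of the broker URL this configuration implies (the real assembly lives in VCMessagingServiceActiveMQ; the host and port are placeholders, while the query options are standard ActiveMQ failover-transport parameters):

```java
import javax.jms.ConnectionFactory;
import org.apache.activemq.ActiveMQConnectionFactory;

class FailoverUrlSketch {
    // Illustrative only: host/port are placeholders, and the exact URL
    // assembly in VCMessagingServiceActiveMQ may differ.
    static ConnectionFactory boundedFailoverFactory() {
        String brokerUrl = "failover:(tcp://activemqint:61616)"
                + "?maxReconnectAttempts=20"        // terminal IOException after 20 failed reconnects
                + "&startupMaxReconnectAttempts=-1" // but retry forever while the pod boots
                + "&initialReconnectDelay=1000"     // 1 s first backoff...
                + "&maxReconnectDelay=30000"        // ...doubling up to the 30 s cap
                + "&useExponentialBackOff=true";
        return new ActiveMQConnectionFactory(brokerUrl);
    }
}
```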

Add JmsFailoverWatchdog: a TransportListener attached to each
ActiveMQConnection that runs a caller-supplied Runnable when the failover
layer reports a terminal IOException. The terminal action is constructor-
injected so production wiring stays visible at the composition root and
tests can substitute their own handler. Two factory methods: logOnly()
(the default — log lifecycle events but take no further action) and
jvmExitOnTerminal() (escape hatch for any future service that wants K8s
pod recycle on terminal transport failure).
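
A minimal sketch of that shape, assuming the factory names above; the method bodies and exit code are illustrative, not the merged source:

```java
import java.io.IOException;
import org.apache.activemq.ActiveMQConnection;
import org.apache.activemq.transport.TransportListener;

// Minimal sketch, not the merged source: factory names match the commit
// message; logging and exact wiring are illustrative.
public final class JmsFailoverWatchdog implements TransportListener {
    private final Runnable terminalAction; // constructor-injected, visible at the composition root

    public JmsFailoverWatchdog(Runnable terminalAction) {
        this.terminalAction = terminalAction;
    }

    /** Default: log lifecycle events, take no further action. */
    public static JmsFailoverWatchdog logOnly() {
        return new JmsFailoverWatchdog(() -> { });
    }

    /** Escape hatch: let K8s recycle the pod on terminal transport failure. */
    public static JmsFailoverWatchdog jvmExitOnTerminal() {
        return new JmsFailoverWatchdog(() -> System.exit(1));
    }

    public void attach(ActiveMQConnection connection) {
        connection.addTransportListener(this);
    }

    // With a failover URL, onException fires only once the reconnect budget
    // is exhausted; transient drops surface as transportInterupted() instead.
    @Override public void onException(IOException error) {
        terminalAction.run();
    }

    @Override public void onCommand(Object command) { }
    @Override public void transportInterupted() { /* (sic, ActiveMQ's spelling) failover reconnecting; log here */ }
    @Override public void transportResumed() { }
}
```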

VCMessagingServiceJms holds a watchdog field with a setter; the default
is JmsFailoverWatchdog.logOnly(). No service opts into jvmExitOnTerminal
in this change — the setter is the escape hatch for any future need.

Wired into MessageProducerSessionJms and ConsumerContextJms — the two
long-lived JMS connection sites. Short-lived batch processes
(OptimizationBatchServer, JavaSimulationExecutable) intentionally skipped.

Why this is defense-in-depth, not the primary fix: the OOM-driven wedge
mechanism that originally motivated this work — a TimerTask hitting
OutOfMemoryError on the JVM-wide static InactivityMonitor heartbeat
Timer, killing the TimerThread silently and corrupting the failover
transport for the rest of the JVM lifetime — is closed off by the
-XX:+ExitOnOutOfMemoryError flag added in PR #1683 (the JVM aborts on
the first OOM before the InactivityMonitor TimerThread can be touched).
The watchdog covers non-OOM terminal-failover paths: sustained network
partition, broker maintenance > 8 min, or any future client regression
in the static-Timer / static-counter design (still present in 5.18.x).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JmsFailoverWatchdogTest spins up an embedded ActiveMQ BrokerService on a
random TCP port, opens a connection through a tightly-bounded failover
URL (maxReconnectAttempts=2, 50ms initial / 100ms max backoff), attaches
a watchdog with a CountDownLatch terminal handler, stops the broker, and
asserts the latch fires within 10s. Tagged Fast; runs in ~0.7s.
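
A sketch of the test's core flow under those settings (JUnit 5 style assumed; the watchdog's constructor-injection surface is inferred from the commit message above, not copied from the merged test):

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.apache.activemq.ActiveMQConnection;
import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.broker.BrokerService;
import org.junit.jupiter.api.Test;

class JmsFailoverWatchdogTestSketch {
    @Test
    void firesTerminalHandlerWhenFailoverGivesUp() throws Exception {
        BrokerService broker = new BrokerService();
        broker.setPersistent(false);
        broker.setUseJmx(false);
        broker.addConnector("tcp://localhost:0"); // random free TCP port
        broker.start();
        String uri = broker.getTransportConnectors().get(0).getPublishableConnectString();

        // Tightly bounded failover so the test gives up fast.
        String url = "failover:(" + uri + ")?maxReconnectAttempts=2"
                + "&initialReconnectDelay=50&maxReconnectDelay=100";
        ActiveMQConnection conn =
                (ActiveMQConnection) new ActiveMQConnectionFactory(url).createConnection();
        conn.start();

        // Latch handler instead of the production System.exit handler.
        CountDownLatch terminal = new CountDownLatch(1);
        new JmsFailoverWatchdog(terminal::countDown).attach(conn);

        broker.stop(); // force the failover layer to exhaust its 2 attempts

        assertTrue(terminal.await(10, TimeUnit.SECONDS),
                "watchdog terminal handler should fire without killing the test JVM");
    }
}
```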

Validates that JmsFailoverWatchdog.attach correctly registers with
ActiveMQConnection and that the injected Runnable runs when the failover
transport reports a terminal IOException — without killing the test JVM
via the production System.exit handler.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jcschaff force-pushed the fix-jms-failover-wedge branch from 6d1f44b to 58e0bc8 on May 7, 2026 04:48
@jcschaff changed the title from "Recover ActiveMQ failover wedge via JmsFailoverWatchdog" to "Add JmsFailoverWatchdog and bound failover reconnect" on May 7, 2026
@jcschaff merged commit 210bb01 into master on May 7, 2026
13 checks passed
@jcschaff deleted the fix-jms-failover-wedge branch on May 7, 2026 04:58