Add JmsFailoverWatchdog and bound failover reconnect #1681
Merged
Bound the FailoverTransport reconnect budget with maxReconnectAttempts=20 and exponential backoff (1s → 30s) in VCMessagingServiceActiveMQ; keep startupMaxReconnectAttempts=-1 so pod boot tolerates a slow broker. Without this bound the FailoverTransport reconnects forever, which is the wrong behavior in K8s, where a pod restart is the right response to a sustained broker outage.

Add JmsFailoverWatchdog: a TransportListener attached to each ActiveMQConnection that runs a caller-supplied Runnable when the failover layer reports a terminal IOException. The terminal action is constructor-injected so production wiring stays visible at the composition root and tests can substitute their own handler. Two factory methods: logOnly() (the default: log lifecycle events but take no further action) and jvmExitOnTerminal() (escape hatch for any future service that wants a K8s pod recycle on terminal transport failure).

VCMessagingServiceJms holds a watchdog field with a setter; the default is JmsFailoverWatchdog.logOnly(). No service opts into jvmExitOnTerminal in this change; the setter is the escape hatch for any future need.

Wired into MessageProducerSessionJms and ConsumerContextJms, the two long-lived JMS connection sites. Short-lived batch processes (OptimizationBatchServer, JavaSimulationExecutable) are intentionally skipped.

Why this is defense-in-depth, not the primary fix: the OOM-driven wedge mechanism that originally motivated this work (a TimerTask hitting OutOfMemoryError on the JVM-wide static InactivityMonitor heartbeat Timer, killing the TimerThread silently and corrupting the failover transport for the rest of the JVM lifetime) is closed off by the -XX:+ExitOnOutOfMemoryError flag added in PR #1683 (the JVM aborts on the first OOM before the InactivityMonitor TimerThread can be touched). The watchdog covers non-OOM terminal-failover paths: sustained network partition, broker maintenance > 8 min, or any future client regression in the static-Timer / static-counter design (still present in 5.18.x).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
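As a rough check on the "> 8 min" figure, the sleep budget implied by maxReconnectAttempts=20 with a 1s → 30s backoff can be summed directly. This is a minimal sketch, not ActiveMQ code: it assumes the delay doubles after each failed attempt, and it ignores the time spent inside each connect call. The class and method names are hypothetical.

```java
final class ReconnectBudget {
    /**
     * Sums the sleeps between reconnect attempts.
     * Assumption: the delay is multiplied by `multiplier` after every attempt
     * and capped at `maxMs`; time spent in the connect call itself is ignored.
     */
    static long totalBackoffMillis(int attempts, long initialMs, long maxMs, double multiplier) {
        long total = 0;
        double delay = initialMs;
        for (int i = 0; i < attempts; i++) {
            total += (long) Math.min(delay, maxMs);
            delay = Math.min(delay * multiplier, maxMs);
        }
        return total;
    }

    public static void main(String[] args) {
        // 1 + 2 + 4 + 8 + 16 seconds, then 15 sleeps at the 30-second cap.
        long ms = totalBackoffMillis(20, 1_000, 30_000, 2.0);
        System.out.println(ms / 1000 + " s of backoff"); // prints "481 s of backoff"
    }
}
```

Under these assumptions the budget is 481 seconds, just over 8 minutes, so a broker outage shorter than that should be absorbed by reconnects rather than reported as terminal.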
JmsFailoverWatchdogTest spins up an embedded ActiveMQ BrokerService on a random TCP port, opens a connection through a tightly-bounded failover URL (maxReconnectAttempts=2, 50ms initial / 100ms max backoff), attaches a watchdog with a CountDownLatch terminal handler, stops the broker, and asserts the latch fires within 10s. Tagged Fast; runs in ~0.7s.

Validates that JmsFailoverWatchdog.attach correctly registers with ActiveMQConnection and that the injected Runnable runs when the failover transport reports a terminal IOException, without killing the test JVM via the production System.exit handler.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
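The injected-handler design described above can be illustrated without a broker. The sketch below is not the PR's JmsFailoverWatchdog: `TransportListenerLike` is a local stand-in that mirrors the four methods of ActiveMQ's TransportListener so the example compiles on its own, and `WatchdogSketch` and its log messages are hypothetical. It shows only why a constructor-injected Runnable lets a test swap in a latch where production wiring could choose a different action.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Local stand-in for org.apache.activemq.transport.TransportListener (assumption: same four methods). */
interface TransportListenerLike {
    void onCommand(Object command);
    void onException(IOException error);
    void transportInterupted(); // spelling follows the ActiveMQ interface
    void transportResumed();
}

/** Hypothetical watchdog: the terminal action is injected, never hard-coded. */
final class WatchdogSketch implements TransportListenerLike {
    private final Runnable terminalAction;

    WatchdogSketch(Runnable terminalAction) {
        this.terminalAction = terminalAction;
    }

    /** Default wiring: report the failure and do nothing else. */
    static WatchdogSketch logOnly() {
        return new WatchdogSketch(() -> System.err.println("terminal transport failure (log only)"));
    }

    @Override public void onCommand(Object command) { /* ordinary traffic, ignored here */ }

    @Override public void onException(IOException error) {
        System.err.println("transport unrecoverable: " + error.getMessage());
        terminalAction.run();
    }

    @Override public void transportInterupted() { System.err.println("transport interrupted, reconnecting"); }

    @Override public void transportResumed() { System.err.println("transport resumed"); }
}

final class WatchdogSketchDemo {
    public static void main(String[] args) throws InterruptedException {
        // A test substitutes a latch for whatever production would do.
        CountDownLatch terminal = new CountDownLatch(1);
        WatchdogSketch watchdog = new WatchdogSketch(terminal::countDown);
        watchdog.transportInterupted();
        watchdog.onException(new IOException("simulated: reconnect attempts exhausted"));
        System.out.println("terminal handler fired: " + terminal.await(1, TimeUnit.SECONDS));
    }
}
```

Keeping the exit decision outside the listener is what allows the real test to exercise the terminal path without taking down the test JVM.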
6d1f44b to 58e0bc8
Summary
- … `IllegalStateException: Timer already cancelled.` repeating in the FailoverTransport reconnect loop. JMS traffic stopped; manual `kubectl rollout restart` was the only recovery. The root cause is in `org.apache.activemq.transport.AbstractInactivityMonitor`: a JVM-wide static `Timer` shared by every `InactivityMonitor`, refcounted by `CHECKER_COUNTER`. If any TimerTask throws an unchecked exception, `TimerThread` self-terminates silently and the static fields still point at the corpse Timer; `CHECKER_COUNTER > 0` blocks the lazy-init guard, so every subsequent reconnect's `schedule()` throws forever. The data pod's high JMS-connection churn (~30 fresh `Connection` + `InactivityMonitor` per minute, since `MessageProducerSessionJms` constructs one per RPC reply) makes this latent failure essentially inevitable on long-running pods.
- … `TransportListener` that exits the JVM on terminal IOException so K8s recycles the pod. Default behavior is log-only; only `SimDataServer` (the pod that wedged) opts into JVM exit. Other daemons (DatabaseServer, SimulationDispatcher, HtcSimulationWorker, the vcell-rest Guice binding) are intentionally not opted in here; review individually before adding.
- … `6d1f44bd8d`) is a doc-only update to `.claude/commands/loki-query.md` cherry-picked from #loki-query-kubectl-direct. Whichever PR merges first will absorb it; the second will show no overlap.

Investigation evidence (in case you want to verify)
Reconstructed via `/loki-query` for the data pod and `kubectl exec` against the broker:

`data-55b4888797-gm82j` starts

The wedge isn't caused by the network event; that's just what exposes it. In steady state the bug is invisible.
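The root-cause mechanism from the Summary can be reproduced with the JDK alone. The sketch below is an illustration, not ActiveMQ code: it uses a plain `java.util.Timer` in place of the shared static Timer, and the class name and thread name are hypothetical. One task throws an unchecked exception, the timer thread dies, and every later `schedule()` call on the same Timer is rejected.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;

final class DeadTimerDemo {
    /**
     * Kills the timer thread with an unchecked exception, then returns the
     * exception thrown by a later schedule() call (or null if none was seen).
     */
    static IllegalStateException scheduleAfterTaskFailure() throws InterruptedException {
        Timer timer = new Timer("shared-heartbeat-timer", true);
        CountDownLatch ran = new CountDownLatch(1);
        timer.schedule(new TimerTask() {
            @Override public void run() {
                ran.countDown();
                throw new RuntimeException("simulated unchecked failure in a TimerTask");
            }
        }, 0);
        ran.await();

        // The timer thread marks the Timer dead just after the exception escapes,
        // so poll for up to about 5 seconds instead of assuming it is immediate.
        for (int i = 0; i < 500; i++) {
            try {
                timer.schedule(new TimerTask() {
                    @Override public void run() { }
                }, 60_000);
            } catch (IllegalStateException e) {
                return e;
            }
            Thread.sleep(10);
        }
        return null;
    }

    public static void main(String[] args) throws InterruptedException {
        IllegalStateException e = scheduleAfterTaskFailure();
        System.out.println(e == null ? "timer still alive" : e.getMessage());
    }
}
```

The only sign of the original failure is a stack trace from the timer thread on stderr; after that, callers see only the `IllegalStateException`, which is why the fault stays invisible until something forces a new `schedule()` call.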
Test plan
- `mvn test -pl vcell-server -Dtest=JmsFailoverWatchdogTest`: passes (~0.7s) using embedded BrokerService + latch handler
- `mvn compile test-compile -pl vcell-server,vcell-rest -am`: BUILD SUCCESS
- `mvn test -Dgroups="Fast"`: sanity check for no regression in JMS paths from the version bump
- … `activemqint` for >10min, observe `data` pod log for `FATAL JMS transport unrecoverable` followed by K8s pod recycle within ~5-8 minutes (maxReconnectAttempts=20 × exponential backoff capped at 30s)
- … `JMS transport interrupted, failover reconnecting` but not exit when activemqint is unavailable

Follow-ups not in this PR
🤖 Generated with Claude Code