Skip to content

DATAGO-134580: Recreate JCSMP producer on unsolicited CloseFlow#475

Open
sunil-solace wants to merge 14 commits into
SolaceProducts:masterfrom
SolaceDev:DATAGO-134580
Open

DATAGO-134580: Recreate JCSMP producer on unsolicited CloseFlow#475
sunil-solace wants to merge 14 commits into
SolaceProducts:masterfrom
SolaceDev:DATAGO-134580

Conversation

@sunil-solace
Copy link
Copy Markdown

When the broker fans out an unsolicited CloseFlow on a publisher flow (message-spool maintenance, DR failover, "503: Service Unavailable" on GD), JCSMP terminally closes the per-binding XMLMessageProducer. The JCSMP session stays connected, but every subsequent producer.send(...) throws StaleSessionException until the application is restarted - customer-reported in DATAGO-132760.

JCSMPOutboundMessageHandler now detects StaleSessionException and ClosedFacilityException in the send-path catch block, arms a volatile producerNeedsRecreation flag, and at the top of the next handleMessage rebuilds the producer (and the TransactedSession when configured) under a double-checked-lock before publishing. Recreation failures are routed through the existing error channel and leave the flag armed so the next inbound message retries. The flag is reset on start() and closeResources() to keep stop/start lifecycles clean.

Tests:

  • Unit (JCSMPOutboundMessageHandlerTest): parameterized stale-recovery across plain and transacted sessions, ClosedFacilityException variant, recreation-failure-then-retry, and lifecycle-driven flag reset.
  • Broker IT (JCSMPProducerCloseFlowRecoveryIT, new): five scenarios against PubSubPlusExtension - three controls that document broker disruptions which do NOT reproduce the bug (spool-quota toggle on persistent topic, direct topic, queue ingress/egress toggle), the customer-reported reproducer driven via the broker's CLI (hardware message-spool shutdown over docker exec with TTY for confirmation prompts), and a repeated-cycles variant that proves the recreate path resets cleanly across consecutive stale-flow events on the same binding.

sunil-solace and others added 11 commits May 13, 2026 17:31
When the broker fans out an unsolicited CloseFlow on a publisher flow
(message-spool maintenance, DR failover, "503: Service Unavailable" on
GD), JCSMP terminally closes the per-binding XMLMessageProducer. The
JCSMP session stays connected, but every subsequent producer.send(...)
throws StaleSessionException until the application is restarted -
customer-reported in DATAGO-132760.

JCSMPOutboundMessageHandler now detects StaleSessionException and
ClosedFacilityException in the send-path catch block, arms a volatile
producerNeedsRecreation flag, and at the top of the next handleMessage
rebuilds the producer (and the TransactedSession when configured) under
a double-checked-lock before publishing. Recreation failures are routed
through the existing error channel and leave the flag armed so the next
inbound message retries. The flag is reset on start() and closeResources()
to keep stop/start lifecycles clean.

Tests:
- Unit (JCSMPOutboundMessageHandlerTest): parameterized stale-recovery
  across plain and transacted sessions, ClosedFacilityException variant,
  recreation-failure-then-retry, and lifecycle-driven flag reset.
- Broker IT (JCSMPProducerCloseFlowRecoveryIT, new): five scenarios
  against PubSubPlusExtension - three controls that document broker
  disruptions which do NOT reproduce the bug (spool-quota toggle on
  persistent topic, direct topic, queue ingress/egress toggle), the
  customer-reported reproducer driven via the broker's CLI
  (hardware message-spool shutdown over docker exec with TTY for
  confirmation prompts), and a repeated-cycles variant that proves
  the recreate path resets cleanly across consecutive stale-flow
  events on the same binding.
Five round-2 fixes from PR #141 review (Copilot):

1. Broaden the recreate trigger in JCSMPOutboundMessageHandler to also arm
   the producerNeedsRecreation flag on JCSMPTransportException. The broker
   IT explicitly documents that an unsolicited CloseFlow can surface on
   send(...) as either StaleSessionException (the typical form) or the
   raw JCSMPTransportException (when the CloseFlow event reaches send()
   before JCSMP's stale-marker has propagated); both are now treated as
   recreation triggers. New unit test
   test_producerRecreatedAfterJCSMPTransportException covers the new arm.

2. Fix BrokerConfiguratorBuilder.disableMsgSpoolForVpn Javadoc which
   incorrectly claimed zeroing maxMsgSpoolUsage tears down already-bound
   publisher / consumer flows. The IT documents the opposite: the broker
   keeps existing flows alive and only NACKs new GD publishes. Javadoc
   now matches the empirical behaviour and points at IT test 1.

3. JCSMPProducerCloseFlowRecoveryIT comment-vs-code mismatch: the
   "TWO sequential confirmation prompts" comment was inaccurate - the
   `shutdown` command commits after a single `y` (the second prompt
   visible in the broker's TTY output is decorative; the broker accepts
   our `y` on the first answerable prompt). Comment realigned to a
   single confirmation.

4. runSolaceCliCommands now checks the boolean from awaitCompletion(...)
   and throws an IllegalStateException with the full script and captured
   stdout/stderr if the docker exec does not complete in time, instead
   of silently letting downstream assertions fail with misleading
   timeouts. The Frame callback was extended to buffer the CLI output
   for inclusion in the diagnostic.

5. CLI script terminator extended from `end\nexit\n` to `end\nexit\nexit\n`.
   The single `exit` only popped from privileged-exec `#` to user-exec `>`
   and then sat there waiting for input; cli -A never terminated and
   awaitCompletion returned false. The second `exit` closes the user-exec
   session so the cli -A process actually exits and awaitCompletion can
   observe it. Without this, the new defensive throw in (4) would fire
   on every CLI invocation and turn the IT into a hard failure.

All 411 binder-core unit tests remain green; all 5 broker IT scenarios
turn green again with these changes.
Five changes from mayur-solace's PR #141 review, all in
JCSMPOutboundMessageHandler:

- M1 / producer.isClosed() defensive check: stale-detection in the
  send-path catch now also arms the recreation flag when
  producer.isClosed() returns true. Guards against any future JCSMP
  exception subclass we don't enumerate explicitly - if JCSMP has
  marked the producer closed, recreation is always the right move.

- M2 / single-line if braces: all single-line if guards in
  recreateProducerIfNeeded() are now braced for consistency with house
  style.

- M3 / field rename: producerNeedsRecreation -> recreateProducer.
  Test Javadocs and assertion messages updated to match.

- M4 / catch-block ordering: stale-detection moved to the top of the
  catch in handleMessage(), before the transactional rollback handling.
  Behaviour is unchanged (the flag is consumed by the next
  handleMessage, not by anything in this catch) but the structure now
  reads as "establish facts about the producer first, then handle
  transactional bookkeeping".

- M5 / comment cleanup: stripped the verbose DATAGO-134580 comments
  and the Javadocs I added on createProducerInternal and
  recreateProducerIfNeeded. Method names are self-documenting; the
  ticket reference now lives on a single line above the field
  declaration, matching the reviewer's suggestion.

411 binder-core unit tests remain green (no behavioural change from
M1-M5; only one new code path is M1's isClosed() OR-arm).
Extend the recreate-on-stale guard with a proactive
producer.isClosed() pre-check at the top of handleMessage(...). The
reactive (catch-block) detection stays as-is and continues to cover the
race window where the broker tears down the producer between our
pre-check and the actual send(...) call, plus any future JCSMP
exception subclass we don't enumerate.

Customer-visible effect: the first publish after the broker fans out an
unsolicited CloseFlow now succeeds rather than being surfaced to the
error channel. Previously the reactive-only path always lost that first
message - the catch armed the flag and the next message recovered, but
the in-flight message that hit the dead producer always failed.

Unit test test_producerRecreatedProactivelyWhenIsClosedDetectedBeforeSend
mocks a producer that reports isClosed() == true on the first
handleMessage, with no exception thrown by send, and verifies:
- recreation happens (createProducer called twice: once at start, once
  proactively)
- the fresh producer services the publish
- the closed original is never sent through (Mockito.never())
- the closed original is closed during the recreate

Integration test updates: the bug-witness assertions in tests 4 and 5
of JCSMPProducerCloseFlowRecoveryIT previously expected the first
post-shutdown publish to throw a JCSMP-rooted MessagingException - that
was the master-branch failure mode they documented. With the proactive
check, the handler intercepts the dead producer before send(...) is
ever called, so the first publish now succeeds transparently. The bug-
witness phase is replaced with a doesNotThrowAnyException assertion on
the first publish, plus a steady-state assertion on a follow-up
publish. The new shape is strictly stricter than the old: a regression
that removed the proactive check would fail because reactive-only
recovery surfaces an exception on that first attempt; a regression that
removed both halves would fail outright; a regression on cycle 2+ in
test 5 would fail on the second cycle. Test-class Javadoc, per-test
headers, and the test list have all been updated to drop the
"reproducer"/"bug witness" framing - the bug doesn't manifest as a
publish failure anymore, the IT now characterises the recovery contract
instead.

411 binder-core unit tests still green (75 in
JCSMPOutboundMessageHandlerTest, +1 from previous commit for the new
proactive test).
The reactive + proactive recreate-on-stale logic added in PR #141
(commits 931f09c..134e7ef) protects each binding's per-binding
XMLMessageProducer in JCSMPOutboundMessageHandler. The error-queue
republish path in ErrorQueueInfrastructure has the same exposure but
on a different producer: it borrows the session-default producer from
JCSMPSessionProducerManager via producerManager.get(producerKey) and
historically had no recovery logic when the broker tore that producer
down via unsolicited CloseFlow.

Failure mode without this fix: when the broker fans out CloseFlow
(message-spool maintenance, DR failover, "503: Service Unavailable"),
the shared session-default producer is marked closed by JCSMP. Every
subsequent error-queue republish in ErrorQueueInfrastructure.send()
throws StaleSessionException / JCSMPTransportException /
ClosedFacilityException; ErrorQueueRepublishCorrelationKey.handleError()
catches at the message-retry level and re-attempts up to
maxDeliveryAttempts - all attempts re-using the same dead producer
reference, all doomed to fail. After max attempts the message is
re-queued onto the original consumer queue, the consumer redelivers
it, fails again, hits the error-queue path again, fails again. Net
effect: failed-consumer messages disappear from the system after a
DR failover or spool maintenance event.

The fix mirrors the outbound-handler approach:

- Proactive: at the top of send(), after producerManager.get(...),
  check producer.isClosed(). If true, call the new
  producerManager.forceRecreate() to rebuild the shared producer
  before send is attempted.

- Reactive: wrap producer.send(...) in a try-catch. On
  StaleSessionException, JCSMPTransportException,
  ClosedFacilityException, or post-failure producer.isClosed(), call
  forceRecreate() so the next ErrorQueueRepublishCorrelationKey
  retry-loop iteration picks up a fresh producer. The original
  exception still propagates so the retry caller can do its
  errorQueueDeliveryAttempt++ bookkeeping.

The shared producer is reference-counted across the entire session
(JCSMPOutboundMessageHandler also registers itself for ref-count
purposes even though it uses its own per-binding producer for sends).
release() + get() does NOT work as a recovery primitive in production
because it only closes the resource when registeredIds.size() <= 1 -
in any deployment with at least one outbound binding, the ref-count
stays > 1 and release() leaves the dead resource in place. The new
forceRecreate() in SharedResourceManager sidesteps the ref-count: it
unconditionally closes the current resource and create()s a new one
under the existing lock, leaving registrations intact so every
already-registered caller picks up the fresh resource on their next
get(). Added as a generic method on SharedResourceManager since the
recovery contract is independent of the JCSMP specifics.

Tests (ErrorQueueInfrastructureTest, new):
- test_errorQueueProducerRecreatedProactivelyOnIsClosed: closed
  producer detected before send -> forceRecreate -> fresh producer
  services the publish; stale producer never sent through
  (Mockito.never()).
- test_errorQueueProducerRecreatedReactivelyOnStaleSendException:
  @CartesianTest over Stale / JCSMPTransport / ClosedFacility -
  verifies all three exception types trigger forceRecreate AND
  propagate to the retry caller (so handleError can drive its loop).
- test_errorQueueProducerNotRecreatedOnUnrelatedJCSMPException:
  negative control - a non-stale JCSMPException (e.g. malformed
  message) propagates normally and does NOT churn the shared
  producer, guarding against an over-broad reactive arm.

417 binder-core unit tests green (was 411 + 6 new from this commit).

This branch is layered on DATAGO-134580 (PR #141) so the new
SharedResourceManager.forceRecreate() and the ErrorQueueInfrastructure
changes can be reviewed alongside the related outbound-handler work.
Three PR #142 review items:

C1 (Copilot) + C3 (mayur-solace) - race in forceRecreate(). The
original unconditional implementation could have two callers both
observe the same stale shared resource, both enter forceRecreate(),
and have the second caller close a healthy replacement that the
first caller just installed. Fix: compare-and-swap. forceRecreate now
takes an `expected` argument - the reference the caller observed.
Under the lock, the manager recreates only if `sharedResource ==
expected`; otherwise it returns whatever a concurrent caller already
installed without closing or re-creating anything. The caller-visible
contract is now: pass what you observed, use what's returned.

C2 (mayur-solace) - Javadoc on SharedResourceManager.forceRecreate
referenced the broker / CloseFlow concern specifically. Since
SharedResourceManager is generic and could host non-broker resources
in the future, the docs are rewritten to describe the CAS contract
generically without naming the JCSMP/broker context.

ErrorQueueInfrastructure.send() updated at both call sites to pass
the observed producer reference and use the value returned by
forceRecreate (which may be the fresh one we requested, or the
already-installed replacement another caller put in place).

New unit test
testErrorQueueProducerUsesManagerReturnedReferenceAfterForceRecreate
simulates the exact race C1 flagged: stale producer observed,
manager's CAS returns an already-installed replacement, send must
use the replacement rather than the locally-observed stale one.
Existing tests updated to pass the observed reference and verify CAS
arguments.

Also aligned the test method names to drop the test_ underscore
form, matching the no-underscore convention used elsewhere in the
binder-core test suite (e.g. SolaceErrorMessageHandlerTest).

418 binder-core unit tests green (was 417 + 1 new CAS-race test).
Per PR #142 follow-up: the previous Javadoc (24 lines, two paragraphs of
explanation) was verbose for an IDE hover. Reduced to a single sentence
describing the CAS contract plus the standard param/return/throws.
Collapse multi-paragraph Javadoc and inline narrative on the
DATAGO-134580 test additions to one-line summaries; drop the two
unused imports that the trimmed Javadoc had been carrying.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Rename test methods to repo's testCamelCase convention.
- Hoist BrokerConfigurator, vpnName, original spool quota, and
  cleanup of producer binding / queue deprovision / spool restore
  into @beforeeach + @AfterEach so each test focuses on the
  scenario under test.
- Extract createPersistentProducerChannel helper for the shared
  binder.bindProducer setup.
- Drop the two SEMP-polling helpers (awaitVpnMaxMsgSpoolUsage,
  awaitQueueIngressEgress); SEMPv2 calls are synchronous so the
  toggles are committed by the time the call returns. Keep
  awaitBrokerSempResponsive for the CLI shutdown path where the
  broker management surface itself stutters.
- Collapse BrokerConfiguratorBuilder's disable/restore spool pair
  into setMsgVpnSpool, and add getMsgVpnSpool accessor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DATAGO-134580: Recover error-queue producer from unsolicited CloseFlow
Align the six DATAGO-134580 tests in JCSMPOutboundMessageHandlerTest
with the repo's dominant testCamelCase convention. Pre-existing
test_xxx_yyy methods in the same file are not part of this PR and
are left alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Outbound handler / error-queue (PR SolaceProducts#475 review):
- Drop orphan section comment above ErrorQueueInfrastructure.send;
  the recreate logic in the catch arm is self-documenting.
- Rename recreateLock -> lifecycleLock and take it inside start(),
  stop(), and recreateProducerIfNeeded() so start/stop no longer race
  the recreate path.
- Collapse recreateProducerIfNeeded body to closeResources() +
  producerManager.get(id) + createProducerInternal(). Keep the
  recreateProducer flag re-armed on failure so the next-message
  retry contract still holds.
- Demote the hot-path "Detected stale ..." / "Recreating ..." logs
  from warn to debug in both the outbound handler and the error
  queue path to avoid flooding user logs on the message path.

Integration test robustness:
- findSolaceContainerId now matches the broker container by the SMF
  host port the JCSMP session is connected to, parsed from
  JCSMPProperties.HOST. Filtering by image name alone was unsafe
  with Ryuk disabled or with leftover containers across runs, which
  silently sent CLI commands to the wrong broker and made the
  recovery test pass vacuously.
- After 'no shutdown', wait on session.isCapable(PUB_GUARANTEED)
  via Awaitility before the recovery publish. SEMP being responsive
  is not a proxy for JCSMP's cached capability set, which the
  per-binding createProducer call checks; without this wait the
  publish races the capability refresh and intermittently fails
  with "Router does not support guaranteed publisher flows".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sunil-solace sunil-solace requested a review from Nephery May 21, 2026 22:57
…eProducts#475)

Apply latest Nephery review feedback on the outbound handler:

- Move producerManager.get(id) into createProducerInternal so callers
  no longer have to remember the prelude.
- Move the catch (with closeResources + RuntimeException wrap) into
  createProducerInternal so cleanup occurs whenever creation fails,
  regardless of which caller invoked it.
- Add defensive synchronized(lifecycleLock) inside createProducerInternal
  and closeResources to keep them safe if future callers don't already
  hold the lock (intrinsic locks are reentrant, so current call sites
  are unaffected).
- Drop now-unreachable null checks on the producer field in the
  handleMessage catch arm and recreateProducerIfNeeded pre-checks;
  once the isRunning gate has passed, producer is guaranteed non-null
  and is never re-nulled. The null check in closeResources stays
  because it can be called from start()'s catch where producer is
  still unset.
- Knock-on: start() outer try/catch now only wraps the header check;
  recreateProducerIfNeeded no longer needs the inline get(id).
- One unit test updated: the underlying JCSMPException is now the
  ROOT cause (wrapped once by createProducerInternal's RuntimeException),
  so testProducerRecreationFailurePropagatesAndRetriesNext switches
  hasCauseInstanceOf -> hasRootCauseInstanceOf.

All 75 outbound-handler unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sunil-solace sunil-solace requested a review from Nephery May 22, 2026 20:24
…(PR SolaceProducts#475)

Apply Nephery review feedback and round out test coverage:

- Merge testProducerRecreatedAfterClosedFacilityException and
  testProducerRecreatedAfterJCSMPTransportException into
  testProducerRecreatedAfterUnsolicitedCloseFlow as a Cartesian
  parameter (transacted x exceptionType). 4 tests -> 6 instances,
  same coverage, less duplication.
- Drop the class-level Javadoc on ErrorQueueInfrastructureTest;
  test names are self-documenting.
- Add `transacted` as a Cartesian dimension to the three remaining
  DATAGO-134580 outbound tests that previously only exercised the
  non-transacted path:
    * testProducerRecreatedProactivelyWhenIsClosedDetectedBeforeSend
    * testProducerRecreationFailurePropagatesAndRetriesNext
    * testRecreationFlagResetAcrossStopStartCycle
  This catches transacted-specific regressions in the proactive
  pre-check, the recreate-failure-then-retry path, and the
  stop/start flag-reset semantic.

All 80 JCSMPOutboundMessageHandlerTest cases and 6
ErrorQueueInfrastructureTest cases pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sunil-solace sunil-solace requested a review from Nephery May 22, 2026 21:01
sunil-solace added a commit to SolaceDev/solace-spring-cloud that referenced this pull request May 22, 2026
Backport of PR SolaceProducts#475 (SolaceProducts) / commits 931f09c..ee61040 on
master to stage-4.11.1.

When the broker fans out an unsolicited CloseFlow on a publisher flow
(message-spool maintenance, DR failover, "503: Service Unavailable" on
GD), JCSMP marks the per-binding XMLMessageProducer terminally closed.
The JCSMP session stays connected, but every subsequent producer.send
throws StaleSessionException until the application is restarted.

Outbound handler (JCSMPOutboundMessageHandler):
- New volatile recreateProducer flag + lifecycleLock that covers
  start(), stop(), closeResources(), createProducerInternal(), and
  recreateProducerIfNeeded().
- Catch arm in handleMessage detects StaleSessionException /
  JCSMPTransportException / ClosedFacilityException (and a closed
  producer post-send), arms the flag, and surfaces the original
  exception via the error channel.
- Pre-check at the top of every handleMessage rebuilds the producer
  proactively when producer.isClosed() returns true.
- createProducerInternal is now self-contained: locks, gets the
  shared session-default producer from JCSMPSessionProducerManager,
  creates the per-binding producer (+ transacted session when
  configured), and on failure closes whatever was partially built and
  wraps in a RuntimeException.
- Recreate failure stays armed so the next inbound message retries.

Shared producer manager:
- JCSMPSessionProducerManager.forceRecreate(expected) added. CAS
  semantics: only recreates if the manager still holds the supplied
  reference; otherwise returns the currently-installed one.

Error-queue path (ErrorQueueInfrastructure):
- Proactive isClosed() check on the shared session-default producer
  before send.
- Reactive forceRecreate(observed) on stale-flow / transport / closed
  send exceptions. Recovery is single-shot here because
  ErrorQueueRepublishCorrelationKey.handleError() already loops up to
  errorQueueMaxDeliveryAttempts.

Tests:
- Unit (JCSMPOutboundMessageHandlerTest, ErrorQueueInfrastructureTest):
  Cartesian coverage over transacted x stale-flow exception type for
  the recovery paths; proactive isClosed pre-check; recreate-failure
  retry; stop/start flag-reset; CAS noop for forceRecreate.
- Broker IT (JCSMPProducerCloseFlowRecoveryIT, new): three control
  scenarios that document broker disruptions which do NOT reproduce
  the bug (spool-quota toggle on persistent topic, direct topic,
  queue ingress/egress toggle), plus the customer-reported reproducer
  driven via broker CLI (hardware message-spool shutdown over docker
  exec with TTY for confirmation prompts) and a repeated-cycles
  variant. Container is selected by SMF host port to avoid targeting
  leftover containers. After re-enable, the test waits for JCSMP's
  PUB_GUARANTEED capability to refresh before driving recovery
  publishes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants