DATAGO-134580: Recreate JCSMP producer on unsolicited CloseFlow (backport to 4.11.1) #481

Open
sunil-solace wants to merge 1 commit into SolaceProducts:stage-4.11.1 from SolaceDev:DATAGO-134580-4.11.1

Backport of PR #475 (SolaceProducts) / commits 931f09c..ee61040 on master to stage-4.11.1.

When the broker fans out an unsolicited CloseFlow on a publisher flow (message-spool maintenance, DR failover, "503: Service Unavailable" on GD), JCSMP marks the per-binding XMLMessageProducer terminally closed. The JCSMP session stays connected, but every subsequent producer.send throws StaleSessionException until the application is restarted.

Outbound handler (JCSMPOutboundMessageHandler):

  • New volatile recreateProducer flag + lifecycleLock that covers start(), stop(), closeResources(), createProducerInternal(), and recreateProducerIfNeeded().
  • Catch arm in handleMessage detects StaleSessionException / JCSMPTransportException / ClosedFacilityException (and a closed producer post-send), arms the flag, and surfaces the original exception via the error channel.
  • Pre-check at the top of every handleMessage rebuilds the producer proactively when producer.isClosed() returns true.
  • createProducerInternal is now self-contained: it takes the lock, gets the shared session-default producer from JCSMPSessionProducerManager, and creates the per-binding producer (+ transacted session when configured). On failure it closes whatever was partially built and wraps the cause in a RuntimeException.
  • If recreation fails, the recreateProducer flag stays armed so the next inbound message retries.
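
The flag-plus-lock pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `Producer`, `StaleProducerException`, `FakeProducer`, and `Handler` are hypothetical stand-ins so the example runs without the JCSMP library, and only the names `recreateProducer`, `lifecycleLock`, `handleMessage`, and `recreateProducerIfNeeded` are taken from the description.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

public class RecreateOnStaleSketch {
    /** Hypothetical stand-in for a producer; NOT the JCSMP XMLMessageProducer API. */
    interface Producer {
        void send(String payload);
        boolean isClosed();
    }

    /** Hypothetical stand-in for the stale-flow / transport / closed-facility exceptions. */
    static class StaleProducerException extends RuntimeException {
        StaleProducerException(String message) { super(message); }
    }

    /** Fake producer for the demo: can be closed, or told to fail its next send. */
    static class FakeProducer implements Producer {
        boolean closed;
        boolean failNextSend;
        int sent;

        @Override public void send(String payload) {
            if (closed || failNextSend) {
                closed = true;
                throw new StaleProducerException("flow closed");
            }
            sent++;
        }

        @Override public boolean isClosed() { return closed; }
    }

    static class Handler {
        private final ReentrantLock lifecycleLock = new ReentrantLock();
        private final Supplier<Producer> factory;
        private volatile boolean recreateProducer;
        private volatile Producer producer;

        Handler(Supplier<Producer> factory) {
            this.factory = factory;
            this.producer = factory.get();
        }

        void handleMessage(String payload) {
            // Proactive pre-check: rebuild before sending if armed or already closed.
            if (recreateProducer || producer.isClosed()) {
                recreateProducerIfNeeded();
            }
            try {
                producer.send(payload);
                if (producer.isClosed()) {
                    recreateProducer = true; // closed right after the send: arm for next message
                }
            } catch (StaleProducerException e) {
                recreateProducer = true; // arm the flag for the next message
                throw e;                 // surface the original exception to the caller
            }
        }

        private void recreateProducerIfNeeded() {
            lifecycleLock.lock();
            try {
                if (!recreateProducer && !producer.isClosed()) {
                    return; // another thread already rebuilt the producer
                }
                recreateProducer = true;   // stays armed if the factory throws below
                producer = factory.get();
                recreateProducer = false;  // disarm only after a successful rebuild
            } finally {
                lifecycleLock.unlock();
            }
        }
    }
}
```

In this sketch the failing message is not resent by the handler itself; the exception propagates and the rebuild happens on the next call, which mirrors the "arm now, recreate on the next inbound message" behaviour described in the list.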

Shared producer manager:

  • JCSMPSessionProducerManager.forceRecreate(expected) added. CAS semantics: only recreates if the manager still holds the supplied reference; otherwise returns the currently-installed one.
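
A minimal sketch of the compare-then-recreate semantics, assuming a generic holder rather than the real JCSMPSessionProducerManager (the class name `ForceRecreateSketch` and its `Supplier`-based constructor are hypothetical). It compares by reference under a lock, which is one way to get CAS-like behaviour without building a throwaway resource when a caller loses the race; the PR may implement this differently.

```java
import java.util.function.Supplier;

public class ForceRecreateSketch<T> {
    private final Supplier<T> factory;
    private volatile T current;

    public ForceRecreateSketch(Supplier<T> factory) {
        this.factory = factory;
        this.current = factory.get();
    }

    public T get() {
        return current;
    }

    /**
     * Recreates the resource only if the manager still holds {@code expected}
     * (reference comparison). Otherwise another caller already replaced it,
     * so the currently installed instance is returned unchanged.
     */
    public synchronized T forceRecreate(T expected) {
        if (current != expected) {
            return current; // no-op: the caller's reference is stale
        }
        current = factory.get();
        return current;
    }
}
```

With this shape, several threads that all observed the same dead producer can call `forceRecreate(observed)` concurrently and only the first one triggers a rebuild; the rest receive the replacement.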

Error-queue path (ErrorQueueInfrastructure):

  • Proactive isClosed() check on the shared session-default producer before send.
  • Reactive forceRecreate(observed) on stale-flow / transport / closed send exceptions. Recovery is single-shot here because ErrorQueueRepublishCorrelationKey.handleError() already loops up to errorQueueMaxDeliveryAttempts.
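
The division of labour described above (one recovery attempt per send, with retries left to an outer loop) might look like the following sketch. All types here are hypothetical stand-ins, and `sendWithAttempts` only imitates the role that the description assigns to ErrorQueueRepublishCorrelationKey.handleError(); it is not that method.

```java
public class ErrorQueueSendSketch {
    /** Hypothetical stand-in for the shared session-default producer. */
    interface Producer {
        void send(String message);
        boolean isClosed();
    }

    /** Hypothetical stand-in for the shared producer manager. */
    interface ProducerManager {
        Producer get();
        Producer forceRecreate(Producer expected);
    }

    /** Hypothetical stand-in for stale-flow / transport / closed send exceptions. */
    static class StaleProducerException extends RuntimeException { }

    static class FakeProducer implements Producer {
        boolean closed;
        boolean failNextSend;
        int sent;

        @Override public void send(String message) {
            if (closed || failNextSend) {
                closed = true;
                throw new StaleProducerException();
            }
            sent++;
        }

        @Override public boolean isClosed() { return closed; }
    }

    static class FakeManager implements ProducerManager {
        FakeProducer current = new FakeProducer();
        int recreations;

        @Override public Producer get() { return current; }

        @Override public Producer forceRecreate(Producer expected) {
            if (expected == current) { // only replace if the caller's view is still current
                current = new FakeProducer();
                recreations++;
            }
            return current;
        }
    }

    /** One send with single-shot recovery: no retry loop at this level. */
    static void sendOnce(ProducerManager manager, String message) {
        Producer observed = manager.get();
        if (observed.isClosed()) {
            observed = manager.forceRecreate(observed); // proactive path
        }
        try {
            observed.send(message);
        } catch (StaleProducerException e) {
            manager.forceRecreate(observed); // reactive path: fix it for the next attempt
            throw e;
        }
    }

    /** Outer loop standing in for the caller that already retries up to a maximum. */
    static boolean sendWithAttempts(ProducerManager manager, String message, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                sendOnce(manager, message);
                return true;
            } catch (StaleProducerException e) {
                // fall through to the next attempt, which sees the recreated producer
            }
        }
        return false;
    }
}
```

Keeping `sendOnce` free of its own loop avoids multiplying retries: the total number of sends stays bounded by the outer attempt limit.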

Tests:

  • Unit (JCSMPOutboundMessageHandlerTest, ErrorQueueInfrastructureTest): Cartesian coverage over transacted × stale-flow exception type for the recovery paths; proactive isClosed pre-check; recreate-failure retry; stop/start flag reset; CAS no-op for forceRecreate.
  • Broker IT (JCSMPProducerCloseFlowRecoveryIT, new): three control scenarios that document broker disruptions which do NOT reproduce the bug (spool-quota toggle on persistent topic, direct topic, queue ingress/egress toggle), plus the customer-reported reproducer driven via broker CLI (hardware message-spool shutdown over docker exec with TTY for confirmation prompts) and a repeated-cycles variant. Container is selected by SMF host port to avoid targeting leftover containers. After re-enable, the test waits for JCSMP's PUB_GUARANTEED capability to refresh before driving recovery publishes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Nephery requested review from Nephery and mayur-solace on May 22, 2026 21:37