fix(DATAGO-137322): add timeout to slack client http call by zhenyu369 · Pull Request #169 · SolaceLabs/solace-agent-mesh-core-plugins

zhenyu369 · 2026-05-22T15:10:08Z

This PR fixes the recurring Slack-gateway hung-consumer bug (DATAGO-137322) that has wedged the gateway 4 times in 16 days, each requiring a pod restart. Root cause and load-bearing fix described below.

Root cause

The Slack gateway's file-upload path uses requests.post (synchronous, run via asyncio.to_thread) to send file bytes to Slack's temporary upload URL:

# message_queue.py:_handle_file_upload (pre-fix)
upload_response = await asyncio.to_thread(
    requests.post, upload_url, data=op.content_bytes
)

requests has no default HTTP timeout, so a stalled connection to Slack's upload URL hangs the queue worker indefinitely. Unlike chat.postMessage / chat.update (which inherit slack-sdk's 30s AsyncWebClient default), this path is unbounded. When it hangs, the per-task SlackMessageQueue worker never calls task_done(), handle_task_complete() never returns, the broker message never gets ACKed, and eventually the broker's max-delivered-unacked-msgs-per-flow=10000 ceiling is hit and delivery stops for ALL Slack tasks across all users — system-wide wedge until pod restart.

This matches the production signature: zero SlackApiError / TimeoutError log lines before the wedge (slack-sdk would have logged them), 300s task-inactivity timeout fires across many users, queue depth grows to 77K+, only a restart recovers.

Changes

Load-bearing fix

requests.post for file uploads now uses timeout=60 (message_queue.py:_handle_file_upload). Bounds the most likely production hang source. 60s is generous for large-artifact uploads but still finite.

Defense-in-depth

wait_for(queue.wait_until_complete(), timeout=60.0) in handle_task_complete (adapter.py). Even if a hang ever bypasses the HTTP-layer timeouts (new failure mode, deadlock, etc.), this ensures the broker message gets ACKed within 60s and the consumer flow keeps moving.

Bookkeeping / clarity

Explicit timeout=30 on AsyncWebClient at the wire level — matches the slack-sdk default (so no behavioral change), but makes the bound visible at the call site.
Restored user_agent_prefix="Bolt-Async/{version}" so Slack telemetry continues to identify the adapter as a Bolt client (the explicit AsyncWebClient construction was dropping the prefix that slack_bolt.util.async_utils.create_async_web_client() adds).

Tests

6 new tests in tests/unit/test_adapter_timeouts.py:

requests.post for file uploads is invoked with a positive numeric timeout=
AsyncWebClient is constructed with timeout=30
AsyncWebClient's User-Agent header includes Bolt-Async/
handle_task_complete doesn't hang when the queue stalls (with caplog assertion on the warning)
handle_task_complete invokes wait_for with the documented QUEUE_DRAIN_TIMEOUT_SEC constant
Happy path emits no spurious timeout warning

2 new tests in tests/integration/test_adapter_timeouts_integration.py using a real SlackMessageQueue + real worker + Slack client mocked to hang on asyncio.sleep(3600) to verify real-time recovery via the wait_for bound.

All 153 existing tests still pass.

Reviewer feedback addressed

@amir-ghasemi caught that the original timeout=30 on AsyncWebClient was a no-op at the wire level and that the file-upload requests.post was the actual unbounded hang source. Both are addressed in 58bdac0.

Signed-off-by: zhenyu369 <job.zhenyuchen@gmail.com>

Signed-off-by: zhenyu369 <zhenyu.chen@solace.com>

amir-ghasemi

Nice catch on the hang scenario — the asyncio.wait_for defense in handle_task_complete is the right call and the integration test that exercises a real queue + real worker + hung client is exactly the right shape (it would actually fail if the fix were reverted, which is the test you want).

One consistency note that doesn't fit inline:

handle_error still uses a hard-coded 10.0 (adapter.py:628). After this PR there are three drain-timeout values in the file: 10s in handle_error, 60s in handle_task_complete, and 60s inside SlackMessageQueue.stop(). Either switch handle_error to QUEUE_DRAIN_TIMEOUT_SEC, or add a second named constant (e.g. ERROR_DRAIN_TIMEOUT_SEC = 10.0) so the asymmetry is at least explicit. If 10s is deliberate (error path shouldn't wait as long), naming it would document that intent.

Also a smaller doc nit: the _resolve_citations_final_pass docstring at adapter.py:1067 claims "Concurrency safety: this method is called AFTER queue.wait_until_complete() returns. At that point the queue processor has finished..." — after this PR that's no longer strictly true on the timeout path. The worker is still alive and could still be mutating text_buffer / current_text_message_ts. In practice the worker is probably wedged on the same hung call so the read-side is consistent, but the docstring's reasoning doesn't hold anymore — worth a one-line addendum acknowledging the timeout case.

amir-ghasemi · 2026-05-22T19:13:30Z

+        slack_web_client = AsyncWebClient(
+            token=adapter_config.slack_bot_token,
+            timeout=30,
+        )


Question on the actual production cause: slack_sdk.AsyncWebClient.__init__ already defaults to timeout: int = 30, and slack_bolt 1.25.0's create_async_web_client() (used when you pass token= to AsyncApp) doesn't override that default. So the pre-PR code was already running with a 30s HTTP timeout — this explicit timeout=30 doesn't change the wire-level behaviour.

Which made me wonder what actually hung in production. The one Slack-bound HTTP call in this adapter that currently has no timeout is the file upload step:

# message_queue.py:1184 upload_response = await asyncio.to_thread( requests.post, upload_url, data=op.content_bytes )

That's requests.post to the temporary upload URL, no timeout= argument, so it can block the queue worker indefinitely. If that's the call that wedged the consumer in the original incident, this PR's wait_for defense in handle_task_complete does prevent the consumer-flow wedge, but the underlying hang at the HTTP layer is still unbounded.

Do you have logs from the incident that show which call was stuck? If it was chat.update then the wait_for is the load-bearing fix and the explicit timeout=30 here is belt-and-braces (fine to keep, just worth noting in the PR description that it doesn't change runtime behaviour). If it was the file-upload requests.post, this PR doesn't actually bound the original hang — worth adding timeout=30 to that requests.post call too.

Great catch — verified both points and addressed them in 58bdac0.

On the production cause — I dug into the Datadog logs for the May 20 incident (and confirmed the same shape in the three prior recurrences since May 4). In the window before the wedge there are zero SlackApiError or TimeoutError log lines from the slack-sdk path — no Failed to send Slack message... from utils.send_slack_message's except Exception clause either. If the existing 30s HTTP timeout on chat.postMessage/chat.update had been firing, we'd see them.

The silent-hang signature matches your requests.post hypothesis exactly: a sync call run via asyncio.to_thread, no default timeout in the requests library, hangs the queue worker without ever raising. It also explains why this has recurred 4 times in 16 days, each fixed only by pod restart — restart discards the wedged worker; the bug returns the next time an upload hits a stalled connection.

Added timeout=60 to the requests.post call at message_queue.py:_handle_file_upload. 60s was chosen over 30s because file uploads are I/O-heavy and we don't want to false-positive on legitimate slow uploads of larger artifacts — but it's still bounded.

So the load-bearing pieces of this PR are now:

timeout=60 on requests.post (new) — actually bounds the most likely hang.

wait_for(queue.wait_until_complete(), 60s) in handle_task_complete — structural defense ensuring the broker ACK fires regardless of what's stuck inside.

timeout=30 on AsyncWebClient — explicit documentation; same value the SDK uses by default. Kept for clarity.

I'll update the PR description to reflect this.

amir-ghasemi · 2026-05-22T19:13:30Z

+            token=adapter_config.slack_bot_token,
+            timeout=30,
+        )
+        self.slack_app = AsyncApp(client=slack_web_client)


Minor regression: when you let AsyncApp(token=...) build its own client, Bolt's create_async_web_client passes user_agent_prefix=f"Bolt-Async/{bolt_version}". Constructing AsyncWebClient(token=..., timeout=30) directly drops that prefix, so Slack's telemetry stops seeing this adapter as a Bolt user — makes it slightly harder for Slack support to identify if you ever open a case.

One-line fix:

from slack_bolt import version as bolt_version slack_web_client = AsyncWebClient( token=adapter_config.slack_bot_token, timeout=30, user_agent_prefix=f"Bolt-Async/{bolt_version.__version__}", )

You're right — restored the prefix in 58bdac0.

Used the same import pattern slack_bolt.util.async_utils.create_async_web_client() uses internally so it'll stay in sync with whatever version of slack_bolt is installed:

from slack_bolt.version import __version__ as bolt_version slack_web_client = AsyncWebClient( token=adapter_config.slack_bot_token, timeout=30, user_agent_prefix=f"Bolt-Async/{bolt_version}", )

Added a unit test that asserts Bolt-Async/ appears in the AsyncWebClient's User-Agent header so it can't silently regress again.

…prefix Address review feedback on PR #169: - Restore the user_agent_prefix="Bolt-Async/{version}" that slack_bolt's create_async_web_client() would have set when AsyncApp(token=...) built its own client. Constructing AsyncWebClient explicitly was dropping this, making it harder for Slack support to identify this adapter as a Bolt client. - Add timeout=60 to the synchronous requests.post used for Slack file uploads (message_queue.py:_handle_file_upload). The slack-sdk's async methods default to timeout=30 at the HTTP layer, but the file-upload path uses requests.post (run via asyncio.to_thread), which has NO default timeout. Without an explicit timeout, a stalled connection to Slack's upload URL hangs the queue worker indefinitely — consistent with the silent-hang signature observed in the production logs (no SlackApiError or TimeoutError raised before the 300s task-inactivity timeout fired), and consistent with the recurring nature of the incident (4 occurrences in 16 days, each only fixed by pod restart). - Add unit tests covering both: the Bolt UA prefix must appear in the AsyncWebClient's User-Agent header, and requests.post must be invoked with a positive numeric timeout= kwarg. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Strip DATAGO-137322 mentions from source comments and test docstrings. The fix's intent stays clear from the comments themselves; ticket references belong in commit messages and PR descriptions, not in the tree. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-22T20:26:43Z

✅ SonarQube Quality Gate - All Passed

Project	Quality Gate Status	Analysis
`sam-slack-gateway-adapter`	✅ Passed	See analysis details on SonarQube

Quality gate checks run for each changed project.

github-actions · 2026-05-22T20:26:47Z

✅ FOSSA Guard (PR Diff) - No New Issues Introduced

Diff mode compares PR head against base revision and reports only newly introduced issues.

Check	Scope	Status
FOSSA Vulnerabilities	Per-Project (PR Diff)	✅ Passed
License Check	Per-Project (PR Diff)	✅ Passed

Projects With New Issues

None

Project	FOSSA Vulnerabilities	License Check	FOSSA Report
`sam-slack-gateway-adapter`	✅ Passed	✅ Passed	View report

Only newly introduced issues are shown in this report.

amir-ghasemi

LGTM

add timeout to slack client http call

94df955

zhenyu369 requested a review from amir-ghasemi May 22, 2026 15:10

github-actions Bot added the sam-slack-gateway-adapter label May 22, 2026

Update adapter.py

62fd4b8

Signed-off-by: zhenyu369 <job.zhenyuchen@gmail.com>

solace-mdupls reviewed May 22, 2026

View reviewed changes

Comment thread sam-slack-gateway-adapter/src/sam_slack_gateway_adapter/adapter.py Outdated

Update adapter.py

e477daf

Signed-off-by: zhenyu369 <zhenyu.chen@solace.com>

zhenyu369 requested a review from solace-mdupls May 22, 2026 15:35

zhenyu369 enabled auto-merge (squash) May 22, 2026 16:00

amir-ghasemi reviewed May 22, 2026

View reviewed changes

zhenyu369 and others added 2 commits May 22, 2026 16:17

zhenyu369 requested a review from amir-ghasemi May 22, 2026 20:27

amir-ghasemi approved these changes May 23, 2026

View reviewed changes

zhenyu369 merged commit 305ae0d into main May 23, 2026
12 checks passed

zhenyu369 deleted the DATAGO-137322/zhenyu/slack_gateway_hung_bug_fix branch May 23, 2026 00:00

solace-re-bot-1 mentioned this pull request May 23, 2026

chore: release main #170

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(DATAGO-137322): add timeout to slack client http call#169

fix(DATAGO-137322): add timeout to slack client http call#169
zhenyu369 merged 5 commits into
mainfrom
DATAGO-137322/zhenyu/slack_gateway_hung_bug_fix

zhenyu369 commented May 22, 2026 •

edited

Loading

Uh oh!

Uh oh!

amir-ghasemi left a comment

Uh oh!

amir-ghasemi May 22, 2026

Uh oh!

zhenyu369 May 22, 2026

Uh oh!

amir-ghasemi May 22, 2026

Uh oh!

zhenyu369 May 22, 2026

Uh oh!

github-actions Bot commented May 22, 2026

Uh oh!

github-actions Bot commented May 22, 2026

Uh oh!

amir-ghasemi left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

zhenyu369 commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Root cause

Changes

Load-bearing fix

Defense-in-depth

Bookkeeping / clarity

Tests

Reviewer feedback addressed

Uh oh!

Uh oh!

amir-ghasemi left a comment

Choose a reason for hiding this comment

Uh oh!

amir-ghasemi May 22, 2026

Choose a reason for hiding this comment

Uh oh!

zhenyu369 May 22, 2026

Choose a reason for hiding this comment

Uh oh!

amir-ghasemi May 22, 2026

Choose a reason for hiding this comment

Uh oh!

zhenyu369 May 22, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 22, 2026

✅ SonarQube Quality Gate - All Passed

Uh oh!

github-actions Bot commented May 22, 2026

✅ FOSSA Guard (PR Diff) - No New Issues Introduced

Projects With New Issues

Uh oh!

amir-ghasemi left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

zhenyu369 commented May 22, 2026 •

edited

Loading