Skip to content

fix(DATAGO-137322): add timeout to slack client http call#169

Merged
zhenyu369 merged 5 commits into
mainfrom
DATAGO-137322/zhenyu/slack_gateway_hung_bug_fix
May 23, 2026
Merged

fix(DATAGO-137322): add timeout to slack client http call#169
zhenyu369 merged 5 commits into
mainfrom
DATAGO-137322/zhenyu/slack_gateway_hung_bug_fix

Conversation

@zhenyu369

@zhenyu369 zhenyu369 commented May 22, 2026

Copy link
Copy Markdown
Contributor

This PR fixes the recurring Slack-gateway hung-consumer bug (DATAGO-137322) that has wedged the gateway 4 times in 16 days, each requiring a pod restart. Root cause and load-bearing fix described below.

Root cause

The Slack gateway's file-upload path uses requests.post (synchronous, run via asyncio.to_thread) to send file bytes to Slack's temporary upload URL:

# message_queue.py:_handle_file_upload (pre-fix)
upload_response = await asyncio.to_thread(
    requests.post, upload_url, data=op.content_bytes
)

requests has no default HTTP timeout, so a stalled connection to Slack's upload URL hangs the queue worker indefinitely. Unlike chat.postMessage / chat.update (which inherit slack-sdk's 30s AsyncWebClient default), this path is unbounded. When it hangs, the per-task SlackMessageQueue worker never calls task_done(), handle_task_complete() never returns, the broker message never gets ACKed, and eventually the broker's max-delivered-unacked-msgs-per-flow=10000 ceiling is hit and delivery stops for ALL Slack tasks across all users — system-wide wedge until pod restart.

This matches the production signature: zero SlackApiError / TimeoutError log lines before the wedge (slack-sdk would have logged them), 300s task-inactivity timeout fires across many users, queue depth grows to 77K+, only a restart recovers.

Changes

Load-bearing fix

  • requests.post for file uploads now uses timeout=60 (message_queue.py:_handle_file_upload). Bounds the most likely production hang source. 60s is generous for large-artifact uploads but still finite.

Defense-in-depth

  • wait_for(queue.wait_until_complete(), timeout=60.0) in handle_task_complete (adapter.py). Even if a hang ever bypasses the HTTP-layer timeouts (new failure mode, deadlock, etc.), this ensures the broker message gets ACKed within 60s and the consumer flow keeps moving.

Bookkeeping / clarity

  • Explicit timeout=30 on AsyncWebClient at the wire level — matches the slack-sdk default (so no behavioral change), but makes the bound visible at the call site.
  • Restored user_agent_prefix="Bolt-Async/{version}" so Slack telemetry continues to identify the adapter as a Bolt client (the explicit AsyncWebClient construction was dropping the prefix that slack_bolt.util.async_utils.create_async_web_client() adds).

Tests

6 new tests in tests/unit/test_adapter_timeouts.py:

  • requests.post for file uploads is invoked with a positive numeric timeout=
  • AsyncWebClient is constructed with timeout=30
  • AsyncWebClient's User-Agent header includes Bolt-Async/
  • handle_task_complete doesn't hang when the queue stalls (with caplog assertion on the warning)
  • handle_task_complete invokes wait_for with the documented QUEUE_DRAIN_TIMEOUT_SEC constant
  • Happy path emits no spurious timeout warning

2 new tests in tests/integration/test_adapter_timeouts_integration.py using a real SlackMessageQueue + real worker + Slack client mocked to hang on asyncio.sleep(3600) to verify real-time recovery via the wait_for bound.

All 153 existing tests still pass.

Reviewer feedback addressed

  • @amir-ghasemi caught that the original timeout=30 on AsyncWebClient was a no-op at the wire level and that the file-upload requests.post was the actual unbounded hang source. Both are addressed in 58bdac0.

Signed-off-by: zhenyu369 <job.zhenyuchen@gmail.com>
Comment thread sam-slack-gateway-adapter/src/sam_slack_gateway_adapter/adapter.py Outdated
Signed-off-by: zhenyu369 <zhenyu.chen@solace.com>
@zhenyu369 zhenyu369 requested a review from solace-mdupls May 22, 2026 15:35
@zhenyu369 zhenyu369 enabled auto-merge (squash) May 22, 2026 16:00

@amir-ghasemi amir-ghasemi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch on the hang scenario — the asyncio.wait_for defense in handle_task_complete is the right call and the integration test that exercises a real queue + real worker + hung client is exactly the right shape (it would actually fail if the fix were reverted, which is the test you want).

One consistency note that doesn't fit inline:

handle_error still uses a hard-coded 10.0 (adapter.py:628). After this PR there are three drain-timeout values in the file: 10s in handle_error, 60s in handle_task_complete, and 60s inside SlackMessageQueue.stop(). Either switch handle_error to QUEUE_DRAIN_TIMEOUT_SEC, or add a second named constant (e.g. ERROR_DRAIN_TIMEOUT_SEC = 10.0) so the asymmetry is at least explicit. If 10s is deliberate (error path shouldn't wait as long), naming it would document that intent.

Also a smaller doc nit: the _resolve_citations_final_pass docstring at adapter.py:1067 claims "Concurrency safety: this method is called AFTER queue.wait_until_complete() returns. At that point the queue processor has finished..." — after this PR that's no longer strictly true on the timeout path. The worker is still alive and could still be mutating text_buffer / current_text_message_ts. In practice the worker is probably wedged on the same hung call so the read-side is consistent, but the docstring's reasoning doesn't hold anymore — worth a one-line addendum acknowledging the timeout case.

slack_web_client = AsyncWebClient(
token=adapter_config.slack_bot_token,
timeout=30,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question on the actual production cause: slack_sdk.AsyncWebClient.__init__ already defaults to timeout: int = 30, and slack_bolt 1.25.0's create_async_web_client() (used when you pass token= to AsyncApp) doesn't override that default. So the pre-PR code was already running with a 30s HTTP timeout — this explicit timeout=30 doesn't change the wire-level behaviour.

Which made me wonder what actually hung in production. The one Slack-bound HTTP call in this adapter that currently has no timeout is the file upload step:

# message_queue.py:1184
upload_response = await asyncio.to_thread(
    requests.post, upload_url, data=op.content_bytes
)

That's requests.post to the temporary upload URL, no timeout= argument, so it can block the queue worker indefinitely. If that's the call that wedged the consumer in the original incident, this PR's wait_for defense in handle_task_complete does prevent the consumer-flow wedge, but the underlying hang at the HTTP layer is still unbounded.

Do you have logs from the incident that show which call was stuck? If it was chat.update then the wait_for is the load-bearing fix and the explicit timeout=30 here is belt-and-braces (fine to keep, just worth noting in the PR description that it doesn't change runtime behaviour). If it was the file-upload requests.post, this PR doesn't actually bound the original hang — worth adding timeout=30 to that requests.post call too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch — verified both points and addressed them in 58bdac0.

On the production cause — I dug into the Datadog logs for the May 20 incident (and confirmed the same shape in the three prior recurrences since May 4). In the window before the wedge there are zero SlackApiError or TimeoutError log lines from the slack-sdk path — no Failed to send Slack message... from utils.send_slack_message's except Exception clause either. If the existing 30s HTTP timeout on chat.postMessage/chat.update had been firing, we'd see them.

The silent-hang signature matches your requests.post hypothesis exactly: a sync call run via asyncio.to_thread, no default timeout in the requests library, hangs the queue worker without ever raising. It also explains why this has recurred 4 times in 16 days, each fixed only by pod restart — restart discards the wedged worker; the bug returns the next time an upload hits a stalled connection.

Added timeout=60 to the requests.post call at message_queue.py:_handle_file_upload. 60s was chosen over 30s because file uploads are I/O-heavy and we don't want to false-positive on legitimate slow uploads of larger artifacts — but it's still bounded.

So the load-bearing pieces of this PR are now:

  1. timeout=60 on requests.post (new) — actually bounds the most likely hang.
  2. wait_for(queue.wait_until_complete(), 60s) in handle_task_complete — structural defense ensuring the broker ACK fires regardless of what's stuck inside.
  3. timeout=30 on AsyncWebClient — explicit documentation; same value the SDK uses by default. Kept for clarity.

I'll update the PR description to reflect this.

token=adapter_config.slack_bot_token,
timeout=30,
)
self.slack_app = AsyncApp(client=slack_web_client)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor regression: when you let AsyncApp(token=...) build its own client, Bolt's create_async_web_client passes user_agent_prefix=f"Bolt-Async/{bolt_version}". Constructing AsyncWebClient(token=..., timeout=30) directly drops that prefix, so Slack's telemetry stops seeing this adapter as a Bolt user — makes it slightly harder for Slack support to identify if you ever open a case.

One-line fix:

from slack_bolt import version as bolt_version

slack_web_client = AsyncWebClient(
    token=adapter_config.slack_bot_token,
    timeout=30,
    user_agent_prefix=f"Bolt-Async/{bolt_version.__version__}",
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right — restored the prefix in 58bdac0.

Used the same import pattern slack_bolt.util.async_utils.create_async_web_client() uses internally so it'll stay in sync with whatever version of slack_bolt is installed:

from slack_bolt.version import __version__ as bolt_version

slack_web_client = AsyncWebClient(
    token=adapter_config.slack_bot_token,
    timeout=30,
    user_agent_prefix=f"Bolt-Async/{bolt_version}",
)

Added a unit test that asserts Bolt-Async/ appears in the AsyncWebClient's User-Agent header so it can't silently regress again.

zhenyu369 and others added 2 commits May 22, 2026 16:17
…prefix

Address review feedback on PR #169:

- Restore the user_agent_prefix="Bolt-Async/{version}" that slack_bolt's
  create_async_web_client() would have set when AsyncApp(token=...) built
  its own client. Constructing AsyncWebClient explicitly was dropping
  this, making it harder for Slack support to identify this adapter as
  a Bolt client.

- Add timeout=60 to the synchronous requests.post used for Slack file
  uploads (message_queue.py:_handle_file_upload). The slack-sdk's async
  methods default to timeout=30 at the HTTP layer, but the file-upload
  path uses requests.post (run via asyncio.to_thread), which has NO
  default timeout. Without an explicit timeout, a stalled connection to
  Slack's upload URL hangs the queue worker indefinitely — consistent
  with the silent-hang signature observed in the production logs (no
  SlackApiError or TimeoutError raised before the 300s task-inactivity
  timeout fired), and consistent with the recurring nature of the
  incident (4 occurrences in 16 days, each only fixed by pod restart).

- Add unit tests covering both: the Bolt UA prefix must appear in the
  AsyncWebClient's User-Agent header, and requests.post must be invoked
  with a positive numeric timeout= kwarg.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Strip DATAGO-137322 mentions from source comments and test docstrings.
The fix's intent stays clear from the comments themselves; ticket
references belong in commit messages and PR descriptions, not in the
tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

✅ SonarQube Quality Gate - All Passed

Project Quality Gate Status Analysis
sam-slack-gateway-adapter ✅ Passed See analysis details on SonarQube

Quality gate checks run for each changed project.

@github-actions

Copy link
Copy Markdown

✅ FOSSA Guard (PR Diff) - No New Issues Introduced

Diff mode compares PR head against base revision and reports only newly introduced issues.

Check Scope Status
FOSSA Vulnerabilities Per-Project (PR Diff) ✅ Passed
License Check Per-Project (PR Diff) ✅ Passed

Projects With New Issues

  • None
Project FOSSA Vulnerabilities License Check FOSSA Report
sam-slack-gateway-adapter ✅ Passed ✅ Passed View report

Only newly introduced issues are shown in this report.

@zhenyu369 zhenyu369 requested a review from amir-ghasemi May 22, 2026 20:27

@amir-ghasemi amir-ghasemi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zhenyu369 zhenyu369 merged commit 305ae0d into main May 23, 2026
12 checks passed
@zhenyu369 zhenyu369 deleted the DATAGO-137322/zhenyu/slack_gateway_hung_bug_fix branch May 23, 2026 00:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants