fix(DATAGO-137322): add timeout to slack client http call#169
Conversation
Signed-off-by: zhenyu369 <job.zhenyuchen@gmail.com>
Signed-off-by: zhenyu369 <zhenyu.chen@solace.com>
amir-ghasemi
left a comment
There was a problem hiding this comment.
Nice catch on the hang scenario — the asyncio.wait_for defense in handle_task_complete is the right call and the integration test that exercises a real queue + real worker + hung client is exactly the right shape (it would actually fail if the fix were reverted, which is the test you want).
One consistency note that doesn't fit inline:
handle_error still uses a hard-coded 10.0 (adapter.py:628). After this PR there are three drain-timeout values in the file: 10s in handle_error, 60s in handle_task_complete, and 60s inside SlackMessageQueue.stop(). Either switch handle_error to QUEUE_DRAIN_TIMEOUT_SEC, or add a second named constant (e.g. ERROR_DRAIN_TIMEOUT_SEC = 10.0) so the asymmetry is at least explicit. If 10s is deliberate (error path shouldn't wait as long), naming it would document that intent.
Also a smaller doc nit: the _resolve_citations_final_pass docstring at adapter.py:1067 claims "Concurrency safety: this method is called AFTER queue.wait_until_complete() returns. At that point the queue processor has finished..." — after this PR that's no longer strictly true on the timeout path. The worker is still alive and could still be mutating text_buffer / current_text_message_ts. In practice the worker is probably wedged on the same hung call so the read-side is consistent, but the docstring's reasoning doesn't hold anymore — worth a one-line addendum acknowledging the timeout case.
| slack_web_client = AsyncWebClient( | ||
| token=adapter_config.slack_bot_token, | ||
| timeout=30, | ||
| ) |
There was a problem hiding this comment.
Question on the actual production cause: slack_sdk.AsyncWebClient.__init__ already defaults to timeout: int = 30, and slack_bolt 1.25.0's create_async_web_client() (used when you pass token= to AsyncApp) doesn't override that default. So the pre-PR code was already running with a 30s HTTP timeout — this explicit timeout=30 doesn't change the wire-level behaviour.
Which made me wonder what actually hung in production. The one Slack-bound HTTP call in this adapter that currently has no timeout is the file upload step:
# message_queue.py:1184
upload_response = await asyncio.to_thread(
requests.post, upload_url, data=op.content_bytes
)That's requests.post to the temporary upload URL, no timeout= argument, so it can block the queue worker indefinitely. If that's the call that wedged the consumer in the original incident, this PR's wait_for defense in handle_task_complete does prevent the consumer-flow wedge, but the underlying hang at the HTTP layer is still unbounded.
Do you have logs from the incident that show which call was stuck? If it was chat.update then the wait_for is the load-bearing fix and the explicit timeout=30 here is belt-and-braces (fine to keep, just worth noting in the PR description that it doesn't change runtime behaviour). If it was the file-upload requests.post, this PR doesn't actually bound the original hang — worth adding timeout=30 to that requests.post call too.
There was a problem hiding this comment.
Great catch — verified both points and addressed them in 58bdac0.
On the production cause — I dug into the Datadog logs for the May 20 incident (and confirmed the same shape in the three prior recurrences since May 4). In the window before the wedge there are zero SlackApiError or TimeoutError log lines from the slack-sdk path — no Failed to send Slack message... from utils.send_slack_message's except Exception clause either. If the existing 30s HTTP timeout on chat.postMessage/chat.update had been firing, we'd see them.
The silent-hang signature matches your requests.post hypothesis exactly: a sync call run via asyncio.to_thread, no default timeout in the requests library, hangs the queue worker without ever raising. It also explains why this has recurred 4 times in 16 days, each fixed only by pod restart — restart discards the wedged worker; the bug returns the next time an upload hits a stalled connection.
Added timeout=60 to the requests.post call at message_queue.py:_handle_file_upload. 60s was chosen over 30s because file uploads are I/O-heavy and we don't want to false-positive on legitimate slow uploads of larger artifacts — but it's still bounded.
So the load-bearing pieces of this PR are now:
timeout=60onrequests.post(new) — actually bounds the most likely hang.wait_for(queue.wait_until_complete(), 60s)inhandle_task_complete— structural defense ensuring the broker ACK fires regardless of what's stuck inside.timeout=30onAsyncWebClient— explicit documentation; same value the SDK uses by default. Kept for clarity.
I'll update the PR description to reflect this.
| token=adapter_config.slack_bot_token, | ||
| timeout=30, | ||
| ) | ||
| self.slack_app = AsyncApp(client=slack_web_client) |
There was a problem hiding this comment.
Minor regression: when you let AsyncApp(token=...) build its own client, Bolt's create_async_web_client passes user_agent_prefix=f"Bolt-Async/{bolt_version}". Constructing AsyncWebClient(token=..., timeout=30) directly drops that prefix, so Slack's telemetry stops seeing this adapter as a Bolt user — makes it slightly harder for Slack support to identify if you ever open a case.
One-line fix:
from slack_bolt import version as bolt_version
slack_web_client = AsyncWebClient(
token=adapter_config.slack_bot_token,
timeout=30,
user_agent_prefix=f"Bolt-Async/{bolt_version.__version__}",
)There was a problem hiding this comment.
You're right — restored the prefix in 58bdac0.
Used the same import pattern slack_bolt.util.async_utils.create_async_web_client() uses internally so it'll stay in sync with whatever version of slack_bolt is installed:
from slack_bolt.version import __version__ as bolt_version
slack_web_client = AsyncWebClient(
token=adapter_config.slack_bot_token,
timeout=30,
user_agent_prefix=f"Bolt-Async/{bolt_version}",
)Added a unit test that asserts Bolt-Async/ appears in the AsyncWebClient's User-Agent header so it can't silently regress again.
…prefix Address review feedback on PR #169: - Restore the user_agent_prefix="Bolt-Async/{version}" that slack_bolt's create_async_web_client() would have set when AsyncApp(token=...) built its own client. Constructing AsyncWebClient explicitly was dropping this, making it harder for Slack support to identify this adapter as a Bolt client. - Add timeout=60 to the synchronous requests.post used for Slack file uploads (message_queue.py:_handle_file_upload). The slack-sdk's async methods default to timeout=30 at the HTTP layer, but the file-upload path uses requests.post (run via asyncio.to_thread), which has NO default timeout. Without an explicit timeout, a stalled connection to Slack's upload URL hangs the queue worker indefinitely — consistent with the silent-hang signature observed in the production logs (no SlackApiError or TimeoutError raised before the 300s task-inactivity timeout fired), and consistent with the recurring nature of the incident (4 occurrences in 16 days, each only fixed by pod restart). - Add unit tests covering both: the Bolt UA prefix must appear in the AsyncWebClient's User-Agent header, and requests.post must be invoked with a positive numeric timeout= kwarg. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Strip DATAGO-137322 mentions from source comments and test docstrings. The fix's intent stays clear from the comments themselves; ticket references belong in commit messages and PR descriptions, not in the tree. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
✅ SonarQube Quality Gate - All Passed
Quality gate checks run for each changed project. |
✅ FOSSA Guard (PR Diff) - No New Issues IntroducedDiff mode compares PR head against base revision and reports only newly introduced issues.
Projects With New Issues
Only newly introduced issues are shown in this report. |
This PR fixes the recurring Slack-gateway hung-consumer bug (DATAGO-137322) that has wedged the gateway 4 times in 16 days, each requiring a pod restart. Root cause and load-bearing fix described below.
Root cause
The Slack gateway's file-upload path uses
requests.post(synchronous, run viaasyncio.to_thread) to send file bytes to Slack's temporary upload URL:requestshas no default HTTP timeout, so a stalled connection to Slack's upload URL hangs the queue worker indefinitely. Unlikechat.postMessage/chat.update(which inherit slack-sdk's 30sAsyncWebClientdefault), this path is unbounded. When it hangs, the per-taskSlackMessageQueueworker never callstask_done(),handle_task_complete()never returns, the broker message never gets ACKed, and eventually the broker'smax-delivered-unacked-msgs-per-flow=10000ceiling is hit and delivery stops for ALL Slack tasks across all users — system-wide wedge until pod restart.This matches the production signature: zero
SlackApiError/TimeoutErrorlog lines before the wedge (slack-sdk would have logged them), 300s task-inactivity timeout fires across many users, queue depth grows to 77K+, only a restart recovers.Changes
Load-bearing fix
requests.postfor file uploads now usestimeout=60(message_queue.py:_handle_file_upload). Bounds the most likely production hang source. 60s is generous for large-artifact uploads but still finite.Defense-in-depth
wait_for(queue.wait_until_complete(), timeout=60.0)inhandle_task_complete(adapter.py). Even if a hang ever bypasses the HTTP-layer timeouts (new failure mode, deadlock, etc.), this ensures the broker message gets ACKed within 60s and the consumer flow keeps moving.Bookkeeping / clarity
timeout=30onAsyncWebClientat the wire level — matches the slack-sdk default (so no behavioral change), but makes the bound visible at the call site.user_agent_prefix="Bolt-Async/{version}"so Slack telemetry continues to identify the adapter as a Bolt client (the explicitAsyncWebClientconstruction was dropping the prefix thatslack_bolt.util.async_utils.create_async_web_client()adds).Tests
6 new tests in
tests/unit/test_adapter_timeouts.py:requests.postfor file uploads is invoked with a positive numerictimeout=AsyncWebClientis constructed withtimeout=30AsyncWebClient's User-Agent header includesBolt-Async/handle_task_completedoesn't hang when the queue stalls (with caplog assertion on the warning)handle_task_completeinvokeswait_forwith the documentedQUEUE_DRAIN_TIMEOUT_SECconstant2 new tests in
tests/integration/test_adapter_timeouts_integration.pyusing a realSlackMessageQueue+ real worker + Slack client mocked to hang onasyncio.sleep(3600)to verify real-time recovery via thewait_forbound.All 153 existing tests still pass.
Reviewer feedback addressed
timeout=30onAsyncWebClientwas a no-op at the wire level and that the file-uploadrequests.postwas the actual unbounded hang source. Both are addressed in 58bdac0.