Hardcoded aiohttp timeout causes failure on large SaaS controllers during Agent extraction #185

@thiagogabriell

Description

The config-assessment-tool (v1.7.2) consistently fails with asyncio.exceptions.TimeoutError during the "Extracting Agent Details" phase (getAppServerAgents) when run against large SaaS controllers. The root cause is an effectively hardcoded 5-minute timeout: aiohttp.ClientSession() in backend/api/appd/AuthMethod.py is created without a timeout argument, so aiohttp's 5-minute default applies.

Environment

  • Tool version: v1.7.2 (Windows executable bundle)
  • Controller: Large SaaS controller
  • Controller size: 500+ APM applications, 4300+ servers, 1048 dashboards
  • Auth method: API Client (token)
  • OS: Windows

Steps to Reproduce

  1. Configure DefaultJob.json pointing to a large SaaS controller
  2. Run: config-assessment-tool.exe -j DefaultJob -t DefaultThresholds -c 1
  3. Tool authenticates successfully, extracts APM/EUM/MRUM applications, servers, and dashboards without issue
  4. Tool reaches "Extracting Agent Details" phase and calls getAppServerAgents
  5. The controller takes longer than 5 minutes to respond due to the large number of agents
  6. Tool crashes with asyncio.exceptions.TimeoutError

Error Output

[ERROR] root run: Traceback (most recent call last):
File "uplink/clients/io/asyncio_strategy.py", line 17, in invoke
File "uplink/hooks.py", line 109, in handle_exception
File "six.py", line 719, in reraise
File "uplink/clients/io/asyncio_strategy.py", line 17, in invoke
File "uplink/clients/aiohttp_.py", line 135, in send
File "aiohttp/client.py", line 544, in _request
File "aiohttp/client_reqrep.py", line 905, in start
File "aiohttp/helpers.py", line 656, in __exit__
asyncio.exceptions.TimeoutError

Root Cause

In backend/api/appd/AuthMethod.py, aiohttp.ClientSession() is created without a custom timeout parameter (around lines 64 and 371), so it falls back to aiohttp's built-in default of a 5-minute total timeout (total=300s).

For large controllers with thousands of agents, the controller can take 30+ minutes to respond to the getAppServerAgents API call, so the 5-minute default is insufficient.
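One possible direction is to make the timeout configurable and pass it explicitly when the session is created. The sketch below is not the tool's actual code: the helper names `resolve_total_timeout` and `make_session` and the environment variable `CAT_HTTP_TIMEOUT_SECONDS` are hypothetical, and only `aiohttp.ClientTimeout(total=...)` and the `timeout=` argument of `aiohttp.ClientSession` are real aiohttp API.

```python
import os
from typing import Optional

DEFAULT_TOTAL_TIMEOUT_S = 300.0  # matches aiohttp's built-in default (total=300s)


def resolve_total_timeout(raw: Optional[str]) -> Optional[float]:
    """Map a user-supplied setting to a value for ClientTimeout(total=...).

    None or "" keeps the 300 s default; "0" or "none" disables the total limit.
    """
    if raw is None or raw.strip() == "":
        return DEFAULT_TOTAL_TIMEOUT_S
    text = raw.strip().lower()
    if text in ("0", "none"):
        return None  # aiohttp treats total=None as "no overall timeout"
    value = float(text)
    if value < 0:
        raise ValueError("timeout must not be negative")
    return value


def make_session(raw_timeout: Optional[str] = None):
    """Create a ClientSession with an explicit total timeout.

    Call this from inside a coroutine. CAT_HTTP_TIMEOUT_SECONDS is a
    hypothetical setting name, not an existing option of the tool.
    """
    import aiohttp  # imported here so the parsing helper works without aiohttp

    if raw_timeout is None:
        raw_timeout = os.environ.get("CAT_HTTP_TIMEOUT_SECONDS")
    timeout = aiohttp.ClientTimeout(total=resolve_total_timeout(raw_timeout))
    return aiohttp.ClientSession(timeout=timeout)
```

With something like this in place, setting the value to 3600 would allow a single request up to an hour, while leaving it unset would preserve today's behavior.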

Additional Observations

  • All phases prior to Agent extraction complete successfully (APM apps, EUM, MRUM, Servers, Dashboards)
  • With -c 1 (single connection), the agent extraction request takes 30+ minutes waiting for the controller to respond before timing out
  • With higher concurrency (e.g. -c 50, the default for SaaS), the controller becomes unresponsive, likely because it only has about 10 threads dedicated to REST API processing
  • With -c 5, the same long wait behavior is observed during agent extraction
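To illustrate why the concurrency setting matters, the following is a minimal sketch of capping in-flight requests with `asyncio.Semaphore`. It is not how the tool implements `-c`; `run_bounded` and `fake_controller_call` are hypothetical names, and the controller call is simulated with a short sleep.

```python
import asyncio


async def run_bounded(jobs, limit):
    """Run coroutine factories with at most `limit` in flight.

    Returns (results, peak), where peak is the highest observed concurrency.
    """
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with semaphore:  # blocks once `limit` jobs are already running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, peak


async def fake_controller_call(i):
    # Stand-in for a REST call to the controller (hypothetical).
    await asyncio.sleep(0.01)
    return i


def demo():
    jobs = [lambda i=i: fake_controller_call(i) for i in range(20)]
    return asyncio.run(run_bounded(jobs, limit=5))
```

Keeping the cap below the number of threads the controller is believed to dedicate to REST processing would avoid queuing requests on the server side, although it does not by itself shorten the slow getAppServerAgents response.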
