Follow-up to #6
After landing the max_pool_connections=50 fix in #7 / released as v0.4.2, the pool-saturation class of failure should be gone. If a residual error rate remains under load, the next likely culprit is the boto3 default request timeouts: connect_timeout=60s and read_timeout=60s.
Why this matters
A hanging S3 connection (bad network, throttling, endpoint hiccup) currently keeps a Thumbor worker blocked for up to 60 s. Under a brief S3-side blip this cascades into:
asyncio.to_thread worker threads all stuck on S3 I/O.
- Incoming Thumbor requests queue behind them.
- Varnish/CDN keepalive begins retrying — amplifying the stuck load.
- Overall frontend slowdown proportional to
connect_timeout + read_timeout per hung request.
A tighter budget bounds the blast radius: fail fast, free the worker, let the client retry (or fall through to the PG path if both are available).
Proposal
Extend the botocore.Config already introduced in #7 with timeout settings, both env-var overridable:
from botocore.config import Config
max_pool = int(os.environ.get("PGTHUMBOR_S3_MAX_POOL_CONNECTIONS", "50"))
connect_timeout = float(os.environ.get("PGTHUMBOR_S3_CONNECT_TIMEOUT", "5"))
read_timeout = float(os.environ.get("PGTHUMBOR_S3_READ_TIMEOUT", "30"))
kwargs["config"] = Config(
max_pool_connections=max_pool,
connect_timeout=connect_timeout,
read_timeout=read_timeout,
retries={"max_attempts": 2, "mode": "adaptive"},
)
Suggested defaults:
| Setting |
Default |
Rationale |
connect_timeout |
5 s |
Fast fail on unreachable endpoints; legitimate TLS handshakes are well under this |
read_timeout |
30 s |
Covers typical blob sizes (KB–low-MB) on sane connections |
retries.max_attempts |
2 |
One retry for truly transient errors; keep latency bounded |
retries.mode |
adaptive |
Honours throttling signals from S3 |
All three env-var overridable for ops tuning.
Interaction with existing 4xx/5xx microcache (#5)
If a timeout causes the loader to return None → Thumbor emits a 400/5xx → the PGTHUMBOR_CACHE_CONTROL_ERROR microcache (10 s) absorbs the repeat-requests, so even during an S3 incident the backend doesn't get hammered. The two fixes compound nicely.
Test plan
- Unit: patch
boto3.client, assert kwargs["config"].connect_timeout == 5 / read_timeout == 30 with env unset.
- Unit: env override for each, asserted individually.
- Integration (optional): use
moto with a deliberate slow response and assert the loader returns None within the configured budget rather than hanging.
Not yet prioritised because
The pool fix alone already addressed the observed symptoms on aaf-6 prod in the scope of #6. Open this as a standing task for the next time an S3-side hiccup causes visible latency spikes — at which point these defaults become the mitigation.
Follow-up to #6
After landing the
max_pool_connections=50fix in #7 / released as v0.4.2, the pool-saturation class of failure should be gone. If a residual error rate remains under load, the next likely culprit is the boto3 default request timeouts:connect_timeout=60sandread_timeout=60s.Why this matters
A hanging S3 connection (bad network, throttling, endpoint hiccup) currently keeps a Thumbor worker blocked for up to 60 s. Under a brief S3-side blip this cascades into:
asyncio.to_threadworker threads all stuck on S3 I/O.connect_timeout + read_timeoutper hung request.A tighter budget bounds the blast radius: fail fast, free the worker, let the client retry (or fall through to the PG path if both are available).
Proposal
Extend the
botocore.Configalready introduced in #7 with timeout settings, both env-var overridable:Suggested defaults:
connect_timeout5 sread_timeout30 sretries.max_attempts2retries.modeadaptiveAll three env-var overridable for ops tuning.
Interaction with existing 4xx/5xx microcache (#5)
If a timeout causes the loader to return
None→ Thumbor emits a 400/5xx → thePGTHUMBOR_CACHE_CONTROL_ERRORmicrocache (10 s) absorbs the repeat-requests, so even during an S3 incident the backend doesn't get hammered. The two fixes compound nicely.Test plan
boto3.client, assertkwargs["config"].connect_timeout == 5/read_timeout == 30with env unset.motowith a deliberate slow response and assert the loader returnsNonewithin the configured budget rather than hanging.Not yet prioritised because
The pool fix alone already addressed the observed symptoms on aaf-6 prod in the scope of #6. Open this as a standing task for the next time an S3-side hiccup causes visible latency spikes — at which point these defaults become the mitigation.