Skip to content

Tighten boto3 S3 timeouts (connect/read) #9

@jensens

Description

@jensens

Follow-up to #6

After landing the max_pool_connections=50 fix in #7 / released as v0.4.2, the pool-saturation class of failure should be gone. If a residual error rate remains under load, the next likely culprit is the boto3 default request timeouts: connect_timeout=60s and read_timeout=60s.

Why this matters

A hanging S3 connection (bad network, throttling, endpoint hiccup) currently keeps a Thumbor worker blocked for up to 60 s. Under a brief S3-side blip this cascades into:

  • asyncio.to_thread worker threads all stuck on S3 I/O.
  • Incoming Thumbor requests queue behind them.
  • Varnish/CDN keepalive begins retrying — amplifying the stuck load.
  • Overall frontend slowdown proportional to connect_timeout + read_timeout per hung request.

A tighter budget bounds the blast radius: fail fast, free the worker, let the client retry (or fall through to the PG path if both are available).

Proposal

Extend the botocore.Config already introduced in #7 with timeout settings, both env-var overridable:

from botocore.config import Config

max_pool = int(os.environ.get("PGTHUMBOR_S3_MAX_POOL_CONNECTIONS", "50"))
connect_timeout = float(os.environ.get("PGTHUMBOR_S3_CONNECT_TIMEOUT", "5"))
read_timeout = float(os.environ.get("PGTHUMBOR_S3_READ_TIMEOUT", "30"))

kwargs["config"] = Config(
    max_pool_connections=max_pool,
    connect_timeout=connect_timeout,
    read_timeout=read_timeout,
    retries={"max_attempts": 2, "mode": "adaptive"},
)

Suggested defaults:

Setting Default Rationale
connect_timeout 5 s Fast fail on unreachable endpoints; legitimate TLS handshakes are well under this
read_timeout 30 s Covers typical blob sizes (KB–low-MB) on sane connections
retries.max_attempts 2 One retry for truly transient errors; keep latency bounded
retries.mode adaptive Honours throttling signals from S3

All three env-var overridable for ops tuning.

Interaction with existing 4xx/5xx microcache (#5)

If a timeout causes the loader to return None → Thumbor emits a 400/5xx → the PGTHUMBOR_CACHE_CONTROL_ERROR microcache (10 s) absorbs the repeat-requests, so even during an S3 incident the backend doesn't get hammered. The two fixes compound nicely.

Test plan

  • Unit: patch boto3.client, assert kwargs["config"].connect_timeout == 5 / read_timeout == 30 with env unset.
  • Unit: env override for each, asserted individually.
  • Integration (optional): use moto with a deliberate slow response and assert the loader returns None within the configured budget rather than hanging.

Not yet prioritised because

The pool fix alone already addressed the observed symptoms on aaf-6 prod in the scope of #6. Open this as a standing task for the next time an S3-side hiccup causes visible latency spikes — at which point these defaults become the mitigation.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions