Skip to content

fix: microcache error responses instead of pinning long-TTL (closes #5)#8

Merged
jensens merged 1 commit into
mainfrom
fix/no-cache-error-responses
Apr 20, 2026
Merged

fix: microcache error responses instead of pinning long-TTL (closes #5)#8
jensens merged 1 commit into
mainfrom
fix/no-cache-error-responses

Conversation

@jensens
Copy link
Copy Markdown
Member

@jensens jensens commented Apr 20, 2026

Summary

  • AuthImagingHandler.finish() applied the long-TTL Cache-Control override to every response, so a transient Thumbor 400 (e.g. PIL decompression-bomb rejection) inherited public, max-age=31536000, immutable and got pinned in Varnish/CDN for a year — one bad fetch persistently hid an image on every shard that cached it.
  • Closes #5.

Change

Gate the success override on 2xx. For 4xx/5xx apply a short microcache header instead (PGTHUMBOR_CACHE_CONTROL_ERROR, default public, max-age=10). 3xx is left untouched (Thumbor default).

Status Before After
2xx long-TTL override (e.g. max-age=31536000, immutable) long-TTL override (unchanged)
3xx long-TTL override ❌ untouched (Thumbor default)
4xx / 5xx long-TTL override ❌ public, max-age=10 (configurable)

Why microcache instead of no-store

The issue proposed no-store for errors. Using a short-TTL microcache (default 10 s) instead is strictly better under load:

  1. Decouples transient errors from long-term poisoning — same goal as no-store, achieved with a 10-s TTL instead of 0.
  2. Absorbs request floods for broken URLs — downstream Varnish/CDN serves the error themselves for the next few seconds instead of every request hitting Thumbor → PG → S3. Cheap DoS-amplification defense on top of the primary fix.
  3. Operators can dial to no-store by setting PGTHUMBOR_CACHE_CONTROL_ERROR="no-store" if they explicitly don't want the absorption behaviour.

Test plan

  • New test: 4xx gets microcache, long-TTL override MUST NOT leak.
  • New test: 4xx without any _cache_control_override still gets microcache.
  • New test: 5xx handled same as 4xx.
  • New test: PGTHUMBOR_CACHE_CONTROL_ERROR env var override is honoured.
  • New test: 3xx leaves Cache-Control untouched.
  • Existing 2xx-path tests still pass unchanged.
  • Full non-integration test suite: 77/77 passing (integration test has a pre-existing DB-state flake unrelated to this change).
  • Production verification on aaf-6: feed a known-bad image URL, confirm Varnish Age header on repeated GETs tops out at ~10 s instead of the year-long pinning observed today.

Docs

  • README.md env-var table updated with PGTHUMBOR_CACHE_CONTROL_ERROR.
  • thumbor.conf template updated with default and rationale comment.

🤖 Generated with Claude Code

AuthImagingHandler.finish() applied _cache_control_override to every
response regardless of status, so a transient Thumbor 400 (e.g. PIL
decompression-bomb rejection) inherited "public, max-age=31536000,
immutable" and got pinned in Varnish/CDN for a year — one bad fetch
persistently hid the image on every shard that cached it.

Gate the long-TTL override on 2xx. For 4xx/5xx apply a short
microcache header instead (PGTHUMBOR_CACHE_CONTROL_ERROR, default
"public, max-age=10"): decouples transient errors from long-term
cache poisoning AND lets downstream caches absorb request floods
for broken URLs — a cheap DoS amplification defense. 3xx is left
untouched.

Closes #5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jensens jensens merged commit 75c954f into main Apr 20, 2026
7 of 8 checks passed
@jensens jensens deleted the fix/no-cache-error-responses branch April 20, 2026 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AuthImagingHandler applies long Cache-Control to error responses (400/403/404)

1 participant