The Household API returned 500s and empty responses to authenticated partners from about 19:00 ET on April 22 to 14:30 ET on April 23. The trigger was the weekly policyengine_us bump (#1457, 1.626.1 → 1.663.0) that @hua7450 merged on April 22 at 18:45 ET. The bump itself was harmless; the unpinned production build picked up Authlib 1.7.0 alongside it. Authlib 1.7 routes JWT decoding through joserfc, whose guess_key() rejected the KeySet our auth code built via authlib.jose.rfc7517.jwk.JsonWebKey.import_key_set(...) — every authenticated request raised ValueError: Invalid key.
Tests missed it: the auth integration suite used StaticBearerTokenValidator, which matched fake tokens and never invoked Auth0 or JWKS code. The production Docker build had no lock file, so any weekly model bump could pull new transitive dependencies. Partners flagged the outage before monitoring did.
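Timeline (ET)
Apr 22, 18:45 - @hua7450 merges the weekly policyengine_us bump (#1457, 1.626.1 → 1.663.0).
Apr 23, 14:21 - The Authlib<1.7.0 pin (Pin Authlib below 1.7.0 #1478) deploys and restores service.
Apr 23, 14:26 - @anth-volk confirms recovery to Impactica.
Apr 23, 15:39 - @PavelMakarchuk confirms recovery to Amplifi with the full outage window.
Total impact window: ~19.5 hours.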
Root cause
#1457 changed only policyengine_us, but the production Docker build resolved transitive dependencies fresh and pulled Authlib 1.7.0. Authlib 1.7 delegates JWT decoding to joserfc, and joserfc's guess_key() rejected the KeySet our auth code built via authlib.jose.rfc7517.jwk.JsonWebKey.import_key_set(...). JWTBearerTokenValidator.authenticate_token() raised ValueError: Invalid key on every authenticated request.
CI passed because the authenticated tests used StaticBearerTokenValidator, which matched static fake tokens and never invoked Auth0 or JWKS code. Nothing exercised the real validator against the deployed service after release, so the failure surfaced only when partners called production.
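The gap can be illustrated with a toy example. This is a sketch, not the project's code: StaticTokenValidator and SignatureValidator below are hypothetical stand-ins, and HS256 (HMAC from the standard library) substitutes for the RS256 tokens Auth0 issues so the block stays self-contained. A test that drives only the static double passes regardless of how key handling behaves; only a test that drives the signature path can fail when the key is wrong.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Build a compact HS256 JWT (assumed stand-in for an RS256 token)."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode("ascii")
    signature = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(signature)}"


class StaticTokenValidator:
    """Hypothetical test double: a dict lookup, no key material involved."""

    def __init__(self, known_tokens: dict):
        self.known_tokens = known_tokens

    def authenticate(self, token: str) -> dict:
        if token not in self.known_tokens:
            raise ValueError("unknown token")
        return self.known_tokens[token]


class SignatureValidator:
    """Hypothetical real validator: checks the signature against a key."""

    def __init__(self, secret: bytes):
        self.secret = secret

    def authenticate(self, token: str) -> dict:
        header, payload, signature = token.split(".")
        signing_input = f"{header}.{payload}".encode("ascii")
        expected = hmac.new(self.secret, signing_input, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _b64url_decode(signature)):
            raise ValueError("signature verification failed")
        return json.loads(_b64url_decode(payload))


token = sign_hs256({"sub": "partner"}, secret=b"issuer-secret")

# The double accepts its fake token no matter how key handling behaves.
static = StaticTokenValidator({"fake-token": {"sub": "partner"}})
assert static.authenticate("fake-token") == {"sub": "partner"}

# Only a test that drives the signature path notices a bad key.
assert SignatureValidator(b"issuer-secret").authenticate(token) == {"sub": "partner"}
try:
    SignatureValidator(b"wrong-key").authenticate(token)
except ValueError as exc:
    print("real validator failed:", exc)
```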
Resolution
@anth-volk recovered service in two steps:
Rolled the code back to 0.13.13 (Roll back API to 0.13.13 and pin core/urllib3 #1476, 13:24 ET deploy) to narrow the regression. The rollback also reverted the April 17 audit-fix batch (CORS, request validation, analytics JWT handling, Auth0 JWKS loading, config parsing, GCP error handling, household variable flattening), but errors persisted on the baseline, which pointed at the dependency rather than the code.
Pinned Authlib<1.7.0 (Pin Authlib below 1.7.0 #1478, 14:21 ET deploy) after reproducing the failure against 1.7.0 in an isolated virtualenv and confirming 1.6.11 was clean. The pin restored service.
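A cap like this can also be checked at runtime. The sketch below is illustrative only and is not part of #1478: the helper names are hypothetical, and the version parser is a deliberately naive standard-library stand-in (a real check would more likely use packaging.version). It reports whether the installed Authlib distribution falls below a given cap.

```python
from importlib import metadata


def parse_release(version: str) -> tuple:
    """Turn '1.6.11' into (1, 6, 11). Naive: drops pre-release/local suffixes."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_below(version: str, cap: str) -> bool:
    """True if `version` sorts strictly below `cap` (e.g. '1.6.11' < '1.7.0')."""
    a, b = parse_release(version), parse_release(cap)
    width = max(len(a), len(b))
    # Pad so that '1.7' and '1.7.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


def installed_version(distribution: str):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


found = installed_version("Authlib")
if found is None:
    print("Authlib is not installed in this environment")
elif is_below(found, "1.7.0"):
    print(f"Authlib {found} satisfies the <1.7.0 cap")
else:
    print(f"Authlib {found} violates the <1.7.0 cap")
```

A check like this only detects drift after the fact; the lock file and frozen installs described in the follow-up are what prevent it.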
Follow-up landed across three PRs:
Add JWT validation test and dependency lock file #1479 — @PavelMakarchuk authored and @anth-volk merged real JWT validator tests against self-signed RSA keys, pyproject.toml + uv.lock, frozen installs (uv sync --frozen) in production Docker builds, deployed HTTP integration tests, and a staging-first deployment workflow.
A second PR pinned policyengine_us==1.663.0 once the lock file made builds reproducible.
Reapply audit fixes and migrate Authlib 1.7 JWKS handling #1488 — @anth-volk migrated JWKS parsing to joserfc.jwk.KeySet, made JWKS loading lazy and time-bounded so transient Auth0 failures no longer crash startup, removed the Authlib cap and locked 1.7.0 alongside joserfc, added RS256 regression coverage for the real validator path, and re-applied the rolled-back audit fixes.
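The "lazy and time-bounded" behaviour described for #1488 could look roughly like the following. This is a sketch under assumptions, not the merged code: the class name, the TTL cache, and the serve-stale-on-failure policy are all hypothetical, and the URL is a placeholder. Nothing is fetched at construction, each fetch carries a timeout, and a failed refresh falls back to the last good key set when one exists.

```python
import json
import time
import urllib.request


class LazyJWKS:
    """Hypothetical lazy JWKS holder with a fetch timeout and a cache TTL."""

    def __init__(self, url, timeout=3.0, ttl=600.0, fetch=None, clock=time.monotonic):
        self.url = url
        self.timeout = timeout      # seconds allowed for one fetch
        self.ttl = ttl              # seconds a fetched key set stays fresh
        self._fetch = fetch or self._http_fetch
        self._clock = clock
        self._keys = None           # nothing is loaded at startup
        self._fetched_at = None

    def _http_fetch(self, url, timeout):
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)

    def get(self):
        """Return the list of JWKs, fetching on first use or after the TTL."""
        now = self._clock()
        if self._keys is not None and now - self._fetched_at < self.ttl:
            return self._keys
        try:
            document = self._fetch(self.url, self.timeout)
            self._keys = document["keys"]
            self._fetched_at = now
        except (OSError, ValueError, KeyError):
            # With no cached keys there is nothing to fall back on.
            if self._keys is None:
                raise
            # Otherwise keep serving the stale keys until a refresh succeeds.
        return self._keys


# Usage with an injected fake fetcher, so the example needs no network.
def fake_fetch(url, timeout):
    return {"keys": [{"kid": "example-key", "kty": "RSA"}]}


jwks = LazyJWKS("https://example.invalid/.well-known/jwks.json", fetch=fake_fetch)
print(jwks.get()[0]["kid"])
```

Injecting the fetcher and the clock keeps the loader testable without reaching the identity provider, which is the same property the real-validator tests rely on.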
Still open
Production alerting on Household API 5xx rate and empty-response rate. Partners detected this outage before we did; that's the gap we haven't closed.
Deployed authenticated partner-style checks as a post-deploy or traffic-shift gate, not just in CI and staging.
A partner incident comms template — start time, restoration time, current status, follow-up owner — so we aren't drafting from scratch under pressure.
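As a starting point for the post-deploy gate, a partner-style probe might look like the sketch below. It is an assumption-laden illustration, not an existing script: the function names, the POST method, the JSON request and response shapes, and the bearer-token header are all hypothetical. It flags the two symptoms partners saw, 5xx statuses and empty bodies.

```python
import json
import urllib.error
import urllib.request


def check_response(status: int, body: bytes) -> list:
    """Return a list of problems; an empty list means the probe passed."""
    problems = []
    if status >= 500:
        problems.append(f"server error: HTTP {status}")
    elif status != 200:
        problems.append(f"unexpected status: HTTP {status}")
    if not body.strip():
        problems.append("empty response body")
    else:
        try:
            json.loads(body)
        except ValueError:
            problems.append("response body is not valid JSON")
    return problems


def probe(url: str, token: str, payload: dict, timeout: float = 10.0) -> list:
    """Send one authenticated request and evaluate the response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return check_response(response.status, response.read())
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx; the body is still readable from the error.
        return check_response(error.code, error.read())


# Offline demonstration of the evaluation step only.
print(check_response(200, b'{"result": {}}'))
print(check_response(500, b""))
```

A deploy pipeline could call probe() with a real token after the traffic shift and fail the release when the returned list is non-empty.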