Skip to content

Commit 1e3a73b

Browse files
MaxGhenisclaude
andauthored
Fix 11 bugs from audit (security + correctness) (#1473)
* Fix indentation bug in flatten_variables_from_household Move flattened_variables.append(new_pair) inside the period loop so every (entity, variable, period) tuple is captured. Previously only the last period's entry was appended. Fixes #1462 * Tighten rate limit on public /calculate_demo endpoint The demo endpoint is intentionally public (documented in config/README.md). Its previous "1 per second" limit allowed up to 86,400 free calls per IP per day, which is cheap abuse. Tighten to 1 per 10 seconds. Auth protection is still applied to /calculate and /ai-analysis via @require_auth_if_enabled(). Fixes #1463 * Fix sqlite URI and protect analytics DB from wipe on init Two related issues in debug-mode analytics init: 1. The SQLALCHEMY_DATABASE_URI was built with five slashes ("sqlite:////" + "/private/...") which breaks SQLAlchemy URL parsing. Use f"sqlite:///{db_url}" so absolute paths produce the canonical four-slash URI. 2. The unconditional Path(db_url).unlink() wiped the captured analytics on every process start in debug mode. Gate that on RESET_ANALYTICS=1 (or analytics.reset config) so a developer must explicitly opt in to clearing the SQLite file. Fixes #1464 * Restrict CORS to PolicyEngine origins by default CORS(app) allowed every origin, so any third-party site could invoke /calculate and (when auth is disabled) read user-supplied household responses. Default to the PolicyEngine production domains and allow overrides via CORS_ALLOWED_ORIGINS (comma-separated env) or the new cors.allowed_origins config entry. Regex patterns are used so that *.policyengine.org subdomains match without extra glue. Fixes #1465 * Replace invalid ConnectionError kwargs with GCPError Python's built-in Exception/ConnectionError do not accept a `description` keyword, so every error path in GoogleCloudStorageManager raised `TypeError: __init__() got an unexpected keyword argument 'description'` instead of the intended GCP error. Introduce a small GCPError class that preserves `.description` as an attribute and use it in the four call sites that were previously broken. No call site reads `.args[0]` so switching exception class is safe. Fixes #1466 * Stop coercing "0"/"1" env vars to booleans The bool branch in ConfigLoader._convert_value ran first and included "0" and "1", so every numeric config set via env var (PORT, pool sizes, counts, ...) silently collapsed into True/False. Reorder to try int -> float -> bool words, and only treat the literal words true/false/yes/no as booleans. "0" now stays int 0, "1" stays int 1. Fixture expectations and FLASK_DEBUG test value are updated to match. Fixes #1467 * Verify JWT signature before trusting sub claim in analytics The analytics decorator decoded incoming bearer tokens with options={"verify_signature": False} and stored the resulting sub claim directly as Visit.client_id. That meant anyone could set the Authorization header to a self-signed JWT and have their chosen identity recorded against every request. Also datetime.utcnow() is deprecated in Python 3.12+ and returns naive UTC. Changes: * Add _verified_sub_claim() that fetches the Auth0 JWKS via PyJWKClient and runs jwt.decode with verify_signature=True. * If Auth0 isn't configured, or verification fails, store None for client_id instead of an attacker-controlled string. * Replace datetime.utcnow() with datetime.now(timezone.utc). * Update analytics fixtures to mock _verified_sub_claim (verified path) and add a regression test for the unverified-signature case. Fixes #1468 * Re-raise tracer failures instead of implicit None return PolicyEngineCountry.calculate() swallowed exceptions from the tracer block with a bare `except Exception as e: print(...)`, after which the function fell through and implicitly returned None. Callers in endpoints/household.py unpack the return value into (result, computation_tree_uuid), so a tracer failure surfaced as `TypeError: cannot unpack non-iterable NoneType` instead of a meaningful 500. Re-raise so the endpoint's own try/except can return a proper error response. Fixes #1469 * Validate /calculate payloads and cap axes scans The /calculate endpoint accepted arbitrary JSON and passed it straight to the simulation, exposing three abuse vectors: * Malformed household payloads reached the compute layer and produced opaque 500s instead of clear 400s. * `axes` scans multiply cost by count_0 * count_1 * ..., so an unbounded axes list could pin the compute pool. * Per-endpoint rate limiting was absent. Fixes: * Validate the household payload against the country-specific HouseholdModel{US,UK,Generic} before calling country.calculate(). * Cap `axes` at MAX_AXES_ENTRIES=10 and each `count` at MAX_AXES_COUNT=100, returning 400 on violation. * Add `@limiter.limit("60 per minute")` to /calculate. Fixes #1470 * Time-bound and lazy-load the Auth0 JWKS fetch Auth0JWTBearerTokenValidator called urlopen() on the JWKS URL at construction time with no timeout and no error handling. In practice: * Any Auth0 outage at boot bricked the API (process wedged or crashed, depending on the failure mode). * Module-level construction made every import pay the network cost. Refactor to: * Wrap urlopen in a try/except that logs on failure and returns None. * Always pass `timeout=JWKS_FETCH_TIMEOUT` (10s). * Cache successful fetches via functools.lru_cache so repeat validator constructions don't re-fetch the key set. * Override authenticate_token() to lazily retry the JWKS fetch when the initial load failed. Fixes #1471 * Use dpath.search instead of deprecated dpath.util.search The dpath.util namespace emits DeprecationWarning and is slated for removal in dpath 3.x. All util functions are accessible at the dpath package top level. Switch country.get_requested_computations over to the new import path. Fixes #1472 * Add changelog entry for bug-audit batch * Anchor default CORS regex with $ to block suffix-match bypass Flask-CORS matches origins with re.match (prefix match), so https://.*\.policyengine\.org without a trailing anchor admits hostile origins like https://evil.policyengine.org.attacker.com. Anchor the wildcard with $, add localhost / 127.0.0.1 regexes for dev, and document CORS_ALLOWED_ORIGINS and RESET_ANALYTICS in .env-example and config/README.md. Adds unit tests replicating flask-cors' re.match/probably_regex semantics to guard the bypass. Refs #1465, #1464. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Cache JWKS successes only so the lazy retry actually retries The previous fix wrapped _fetch_jwks in functools.lru_cache, which memoised the None return on failure. The "lazy retry" in Auth0JWTBearerTokenValidator.authenticate_token therefore kept getting the cached None back and never hit the network again — so a transient Auth0 outage at import time bricked the validator until the process restarted. Replace with a module-level success cache plus a per-issuer last-failure timestamp. On failure we do not memoise the None; on the next call we re-fetch, throttled by JWKS_RETRY_INTERVAL_SECONDS so we don't hammer Auth0 while it is degraded. Construction is still non-blocking. Add a regression test that patches _fetch_jwks_uncached to return None once and then a sentinel, and asserts the lazy retry actually calls the network twice and swaps in the sentinel key. Refs #1471. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Align analytics error path to NULL client_id Missing Authorization and JWT-decode errors previously set client_id to the sentinel string "unknown", diverging from the signature-verification-fail path which already stored None. The downstream analytics layer has to filter one or the other out, and inconsistency is a bug magnet. Collapse both paths to None so any unverifiable identity is a SQL NULL. Updates the two affected tests to assert NULL. Refs #1468. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Drop backward-looking "previously..." comments Code comments describe current behavior, not history — git log is the history. Trim the narrative lead-ins from the tracer re-raise and the analytics-reset gate so each comment states only what the code does today. Refs #1469. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Require people on HouseholdModel and cap JSON body size A request with no people has nothing to compute against; making people required in HouseholdModelGeneric surfaces this at the validation layer instead of deep in the solver. households stays optional to keep axes-scan payloads (which omit it) valid. Also cap total request body size via MAX_CONTENT_LENGTH (default 10 MiB, overridable by env var) so a single oversize payload cannot force the API to allocate unbounded memory before any view runs. Refs #1470. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Cover zero-period variable in flatten_variables_from_household Regression guard for the indentation-era bug where new_pair was referenced outside the period loop. With an empty period dict the inner loop never runs; the test asserts we return no entries and do not raise UnboundLocalError or leak a previous variable's tuple. Refs #1462. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 0f9fe32 commit 1e3a73b

24 files changed

Lines changed: 964 additions & 67 deletions

.env-example

Lines changed: 15 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,2 +1,16 @@
11
FLASK_DEBUG=1
2-
CACHE_REDIS_HOST=redis
2+
CACHE_REDIS_HOST=redis
3+
4+
# Optional: wipe the local sqlite analytics DB on startup. Only
5+
# consulted when FLASK_DEBUG=1 and analytics is enabled. Default off
6+
# so captured debug data is not lost across restarts.
7+
# RESET_ANALYTICS=1
8+
9+
# Optional: comma-separated list of origins (or regex patterns) allowed
10+
# by CORS. If unset, the default allowlist is:
11+
# - https://policyengine.org
12+
# - https://*.policyengine.org (anchored regex)
13+
# - http://localhost[:port] (any port)
14+
# - http://127.0.0.1[:port]
15+
# Example override:
16+
# CORS_ALLOWED_ORIGINS=https://foo.example.com,https://bar.example.com

changelog_entry.yaml

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,14 @@
1+
- bump: patch
2+
changes:
3+
fixed:
4+
- Flatten every (entity, variable, period) triple in flatten_variables_from_household (#1462).
5+
- Tighten /calculate_demo rate limit from 1/second to 1/10 seconds (#1463).
6+
- Stop unconditionally wiping the analytics SQLite DB and fix the sqlite:// URI (#1464).
7+
- Restrict CORS to PolicyEngine origins by default, anchored so attacker subdomains can't bypass (#1465).
8+
- Replace invalid ConnectionError(description=...) with a GCPError class (#1466).
9+
- Keep "0"/"1" env-var values as integers instead of collapsing to False/True (#1467).
10+
- Verify JWT signatures in the analytics decorator and drop datetime.utcnow (#1468).
11+
- Re-raise tracer failures in PolicyEngineCountry.calculate so the endpoint can return a real 500 (#1469).
12+
- Validate /calculate payloads and cap axes scans; add per-endpoint rate limit (#1470).
13+
- Time-bound and lazy-load the Auth0 JWKS fetch so a startup outage doesn't crash the API, caching only successes so the lazy retry actually retries (#1471).
14+
- Replace deprecated dpath.util.search with dpath.search (#1472).

config/README.md

Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -270,6 +270,39 @@ The following endpoints remain unprotected:
270270
- When enabled, all protected endpoints validate JWT tokens against Auth0's JWKS
271271
- The Auth0 domain and audience must match the configured values
272272

273+
## CORS Configuration
274+
275+
Browsers enforce CORS against the API. The default allowlist accepts:
276+
277+
- `https://policyengine.org`
278+
- Any `https://*.policyengine.org` host (anchored regex)
279+
- `http://localhost` on any port (dev servers)
280+
- `http://127.0.0.1` on any port
281+
282+
Override with `CORS_ALLOWED_ORIGINS` (comma-separated strings or
283+
regexes) or `cors.allowed_origins` in YAML:
284+
285+
```bash
286+
CORS_ALLOWED_ORIGINS=https://app.example.com,https://admin.example.com
287+
```
288+
289+
```yaml
290+
cors:
291+
allowed_origins:
292+
- https://app.example.com
293+
- 'https://.*\.example\.com$'
294+
```
295+
296+
Always terminate regex patterns with `$` — Flask-CORS matches with
297+
`re.match`, so an unanchored pattern like `https://.*\.example\.com`
298+
would accept `https://example.com.attacker.com`.
299+
300+
## Analytics reset (debug only)
301+
302+
`RESET_ANALYTICS=1` (or `analytics.reset: true` in YAML) wipes the
303+
local SQLite analytics DB on startup. This is **only** consulted when
304+
`FLASK_DEBUG=1`; production never resets the analytics DB.
305+
273306
## Usage Examples
274307

275308
### Production Deployment (Current)

config/default.yaml

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -46,3 +46,11 @@ ai:
4646
enabled: false
4747
anthropic:
4848
api_key: "" # Override with ANTHROPIC_API_KEY
49+
50+
# CORS configuration
51+
cors:
52+
# List of allowed origins (strings or regex patterns). If left null
53+
# the API defaults to PolicyEngine production domains. Override with
54+
# CORS_ALLOWED_ORIGINS (comma-separated) in environments that serve
55+
# additional frontends.
56+
allowed_origins: null

policyengine_household_api/api.py

Lines changed: 52 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,7 @@
2424
from policyengine_household_api.decorators.analytics import (
2525
log_analytics_if_enabled,
2626
)
27+
from policyengine_household_api.utils.config_loader import get_config_value
2728

2829
# Endpoints
2930
from .endpoints import (
@@ -40,7 +41,52 @@
4041

4142
app = application = flask.Flask(__name__)
4243

43-
CORS(app)
44+
# Reject absurdly large request bodies before any view runs. 10 MiB is
45+
# well above the largest legitimate household payload we have seen
46+
# (axes scans push a few hundred KiB) while still capping the memory a
47+
# single attacker can force us to allocate. Overridable via the
48+
# ``MAX_CONTENT_LENGTH`` env var (bytes).
49+
app.config["MAX_CONTENT_LENGTH"] = int(
50+
os.getenv("MAX_CONTENT_LENGTH", 10 * 1024 * 1024)
51+
)
52+
53+
54+
def _resolve_cors_origins():
55+
"""
56+
Resolve the CORS allowed origins list.
57+
58+
Priority:
59+
1. CORS_ALLOWED_ORIGINS env var (comma-separated list)
60+
2. config value "cors.allowed_origins" (list or comma string)
61+
3. Safe default: the PolicyEngine production domains
62+
63+
Use regex patterns so that wildcard subdomains work with
64+
Flask-CORS's `origins` kwarg.
65+
"""
66+
raw = os.getenv("CORS_ALLOWED_ORIGINS") or get_config_value(
67+
"cors.allowed_origins", None
68+
)
69+
70+
if raw is None:
71+
# Flask-CORS uses re.match, which is a prefix match; anchor with
72+
# ``$`` so a hostile host like ``policyengine.org.attacker.com``
73+
# cannot satisfy the wildcard pattern. Include ``localhost:*``
74+
# so local dev servers can hit the API without extra setup.
75+
origins = [
76+
"https://policyengine.org",
77+
r"https://.*\.policyengine\.org$",
78+
r"http://localhost(:[0-9]+)?$",
79+
r"http://127\.0\.0\.1(:[0-9]+)?$",
80+
]
81+
elif isinstance(raw, str):
82+
origins = [o.strip() for o in raw.split(",") if o.strip()]
83+
else:
84+
origins = list(raw)
85+
86+
return origins
87+
88+
89+
CORS(app, origins=_resolve_cors_origins())
4490

4591
# Use in-memory storage for rate limiting
4692
# Note that this provides limits per-instance;
@@ -59,6 +105,7 @@
59105

60106
@app.route("/<country_id>/calculate", methods=["POST"])
61107
@require_auth_if_enabled()
108+
@limiter.limit("60 per minute")
62109
@log_analytics_if_enabled
63110
def calculate(country_id):
64111
return get_calculate(country_id)
@@ -84,8 +131,11 @@ def readiness_check():
84131
)
85132

86133

134+
# Note: `/calculate_demo` is intentionally public (documented in
135+
# config/README.md). It is guarded by a conservative rate limit rather
136+
# than JWT authentication.
87137
@app.route("/<country_id>/calculate_demo", methods=["POST"])
88-
@limiter.limit("1 per second")
138+
@limiter.limit("1 per 10 seconds")
89139
def calculate_demo(country_id):
90140
return get_calculate(country_id)
91141

Lines changed: 98 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,18 +1,114 @@
11
import json
2+
import logging
3+
import time
4+
from threading import Lock
25
from urllib.request import urlopen
36

47
from authlib.oauth2.rfc7523 import JWTBearerTokenValidator
58
from authlib.jose.rfc7517.jwk import JsonWebKey
69

10+
logger = logging.getLogger(__name__)
11+
12+
JWKS_FETCH_TIMEOUT = 10 # seconds
13+
# Minimum wait between back-to-back lazy retries after a failure.
14+
# Keeps us from hammering Auth0 when it is actively degraded.
15+
JWKS_RETRY_INTERVAL_SECONDS = 30
16+
17+
18+
# Module-level cache of successful JWKS fetches, keyed by issuer. Only
19+
# successes are cached so that a transient failure is retried on the
20+
# next authenticated request (``lru_cache`` would have memoised the
21+
# ``None`` return, making the "lazy retry" dead code).
22+
_jwks_cache: dict = {}
23+
# Records the monotonic timestamp of the most recent *failed* fetch
24+
# per-issuer so we can rate-limit retries without caching the failure
25+
# itself.
26+
_jwks_last_failure: dict = {}
27+
_jwks_lock = Lock()
28+
29+
30+
def _fetch_jwks_uncached(issuer: str):
31+
"""Fetch the JWKS for an Auth0 issuer, bypassing the cache.
32+
33+
Returns an authlib key set on success, ``None`` on failure. Errors
34+
are logged rather than raised so that a transient Auth0 outage
35+
doesn't crash the process at import time.
36+
"""
37+
jwks_url = f"{issuer}.well-known/jwks.json"
38+
try:
39+
with urlopen(jwks_url, timeout=JWKS_FETCH_TIMEOUT) as response:
40+
return JsonWebKey.import_key_set(json.loads(response.read()))
41+
except Exception as e:
42+
logger.warning(f"Failed to fetch JWKS from {jwks_url}: {e}")
43+
return None
44+
45+
46+
def _fetch_jwks(issuer: str):
47+
"""Fetch JWKS, caching only successful results.
48+
49+
On failure we record the time but do not memoise the ``None`` — a
50+
later call will retry (subject to ``JWKS_RETRY_INTERVAL_SECONDS``
51+
backoff) so that the validator self-heals once Auth0 recovers.
52+
"""
53+
with _jwks_lock:
54+
cached = _jwks_cache.get(issuer)
55+
if cached is not None:
56+
return cached
57+
last_failure = _jwks_last_failure.get(issuer)
58+
if (
59+
last_failure is not None
60+
and time.monotonic() - last_failure < JWKS_RETRY_INTERVAL_SECONDS
61+
):
62+
# Too soon after the last failure — don't hammer Auth0.
63+
return None
64+
65+
# Fetch outside the lock so a slow network call doesn't block
66+
# other threads that might be serving requests with a cached key.
67+
key_set = _fetch_jwks_uncached(issuer)
68+
69+
with _jwks_lock:
70+
if key_set is not None:
71+
_jwks_cache[issuer] = key_set
72+
_jwks_last_failure.pop(issuer, None)
73+
else:
74+
_jwks_last_failure[issuer] = time.monotonic()
75+
return key_set
76+
77+
78+
def _clear_jwks_cache():
79+
"""Test helper: wipe the success/failure caches."""
80+
with _jwks_lock:
81+
_jwks_cache.clear()
82+
_jwks_last_failure.clear()
83+
784

885
class Auth0JWTBearerTokenValidator(JWTBearerTokenValidator):
986
def __init__(self, domain, audience):
1087
issuer = f"https://{domain}/"
11-
jsonurl = urlopen(f"{issuer}.well-known/jwks.json")
12-
public_key = JsonWebKey.import_key_set(json.loads(jsonurl.read()))
88+
89+
public_key = _fetch_jwks(issuer)
90+
if public_key is None:
91+
# Retry on next token validation rather than failing hard
92+
# at construction time. A missing key set means token
93+
# validation will fail cleanly inside authlib.
94+
logger.warning(
95+
"JWKS unavailable at construction; will retry on first "
96+
"token validation."
97+
)
98+
1399
super(Auth0JWTBearerTokenValidator, self).__init__(public_key)
100+
self._issuer = issuer
14101
self.claims_options = {
15102
"exp": {"essential": True},
16103
"aud": {"essential": True, "value": audience},
17104
"iss": {"essential": True, "value": issuer},
18105
}
106+
107+
def authenticate_token(self, token_string):
108+
# Lazy-refresh the JWKS if the initial fetch failed. Because
109+
# ``_fetch_jwks`` only caches successes, this call will retry
110+
# the network fetch (subject to a short backoff) until Auth0
111+
# responds.
112+
if self.public_key is None:
113+
self.public_key = _fetch_jwks(self._issuer)
114+
return super().authenticate_token(token_string)

policyengine_household_api/country.py

Lines changed: 8 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
11
import importlib
2+
import logging
23
from flask import Response
34
import json
45
from policyengine_core.taxbenefitsystems import TaxBenefitSystem
@@ -432,8 +433,12 @@ def calculate(
432433

433434
return household, None
434435

435-
except Exception as e:
436-
print(f"Error computing tracer output: {e}")
436+
except Exception:
437+
# Re-raise so endpoints/household.py (which unpacks
438+
# ``(result, computation_tree_uuid)``) can surface a real
439+
# 500 instead of a TypeError on ``None`` unpacking.
440+
logging.exception("Tracer failed while computing household")
441+
raise
437442

438443

439444
def create_policy_reform(policy_data: dict) -> dict:
@@ -478,7 +483,7 @@ def apply(self):
478483

479484

480485
def get_requested_computations(household: dict):
481-
requested_computations = dpath.util.search(
486+
requested_computations = dpath.search(
482487
household,
483488
"*/*/*/*",
484489
afilter=lambda t: t is None,

policyengine_household_api/data/analytics_setup.py

Lines changed: 12 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -47,11 +47,21 @@ def initialize_analytics_db_if_enabled(app):
4747
db_url = (
4848
REPO / "policyengine_household_api" / "data" / "policyengine.db"
4949
)
50-
if Path(db_url).exists():
50+
# Only wipe the analytics DB when explicitly requested via
51+
# RESET_ANALYTICS=1 (or the ``analytics.reset`` config flag).
52+
should_reset = os.getenv("RESET_ANALYTICS", "").lower() in (
53+
"1",
54+
"true",
55+
"yes",
56+
) or get_config_value("analytics.reset", False)
57+
if should_reset and Path(db_url).exists():
5158
Path(db_url).unlink()
5259
if not Path(db_url).exists():
5360
Path(db_url).touch()
54-
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:////" + str(db_url)
61+
# sqlite: absolute paths require exactly three slashes plus the
62+
# leading "/" from the absolute path (=> "sqlite:////tmp/x.db").
63+
# db_url here is already absolute, so use an f-string.
64+
app.config["SQLALCHEMY_DATABASE_URI"] = f"sqlite:///{db_url}"
5565
else:
5666
app.config["SQLALCHEMY_DATABASE_URI"] = "mysql+pymysql://"
5767
app.config["SQLALCHEMY_ENGINE_OPTIONS"] = {"creator": getconn}

0 commit comments

Comments
 (0)