Boston fork: security hardening, upstream-portal protection, AWS hosting docs#1
Open
brendanbabb wants to merge 4 commits into main from
… pin
Move the deployment to us-west-2, add reserved Lambda concurrency as the primary brake on fan-out into the upstream CKAN portal, and pin Lambda packaging to cp311/manylinux wheels so the ZIP works regardless of build host Python version.
- terraform/aws: add lambda_reserved_concurrency (default 10) wired to aws_lambda_function.reserved_concurrent_executions. Extract the S3 backend config out of main.tf; real backend.tf is gitignored because the bucket name embeds the deployer's AWS account ID. Ship backend.tf.example as the template.
- prod/staging tfvars: aws_region=us-west-2, api_quota_limit=3000 (was 1000), lambda_reserved_concurrency=10. Prod custom domain is boston-data.codeforanchorage.org; staging has no custom domain.
- scripts/deploy.sh + .github/workflows/release.yml: force cp311 manylinux wheel resolution on every pip/uv install (without this, a Python 3.14 build host produces a ZIP that 502s at Lambda cold start). Detect python3/python cross-platform. Build the ZIP with stdlib zipfile instead of the `zip` binary so the packaging step works on CI images and Windows.
- scripts/setup-backend.sh: fix malformed bucket name (boston-opencontext-opendataterraform-state-... → boston-opencontext-tfstate-...).
- config.yaml: replace symlink-to-example with a concrete Boston CKAN config targeting data.boston.gov. ArcGIS kept disabled for reference.
- local_server.py: accept POSTs on both / and /mcp so the same local server works with Claude Desktop stdio bridges and MCP Inspector.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
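For reviewers who have not looked at the packaging step, the switch from the `zip` binary to stdlib `zipfile` might look roughly like the sketch below. This is not the code in scripts/deploy.sh; `build_zip` and its parameters are hypothetical names chosen for illustration, and the only assumption is that the Lambda package is a directory whose contents are archived with paths relative to its root.

```python
import zipfile
from pathlib import Path


def build_zip(source_dir: str, zip_path: str) -> list[str]:
    """Archive every file under source_dir; return the stored names.

    Hypothetical helper: paths are stored relative to source_dir so the
    handler module sits at the root of the archive.
    """
    root = Path(source_dir)
    target = Path(zip_path).resolve()
    stored = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            # Skip directories, and skip the archive itself if it lives
            # inside the directory being packaged.
            if not path.is_file() or path.resolve() == target:
                continue
            arcname = path.relative_to(root).as_posix()
            archive.write(path, arcname)
            stored.append(arcname)
    return stored
```

Because nothing here shells out, the same step should behave the same on a Linux CI image and on Windows, which is the portability argument the commit makes.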
…size
Tighten the two attack surfaces that directly forward user-controlled
input into upstream CKAN: the execute_sql path and the aggregate_data
path. Add a request-body-size cap on the HTTP handler to bound the
work a single JSON-RPC call can cost. See docs/SECURITY.md for the
full threat model.
- plugins/ckan/sql_validator.py
* SQLValidator: shrink MAX_SQL_LENGTH 50000 → 8192; strip SQL
comments before keyword/function scans so /* ... */ and -- ...
obfuscation can't smuggle forbidden tokens past the checks;
expand FORBIDDEN_KEYWORDS with PREPARE/COPY/LISTEN/NOTIFY/VACUUM/
ANALYZE/CLUSTER/REINDEX/LOAD/DO; add FORBIDDEN_FUNCTIONS
(xp_cmdshell, pg_sleep, pg_read_file, pg_ls_dir, pg_stat_file,
lo_import, lo_export, current_setting, set_config, dblink);
walk the sqlparse AST to require every FROM/JOIN target to be a
UUID-quoted resource or a CTE alias (rejects schema-qualified
targets like pg_catalog.pg_class); match INTO OUTFILE/DUMPFILE.
* New enforce_row_limit: appends LIMIT 10000 to any validated SQL
that lacks a top-level LIMIT so a caller can't trigger an
unbounded scan on a multi-million-row CKAN DataStore table.
* New SafeSQLBuilder: typed, allowlist-only builder for the
aggregate_data path. Identifiers must match [A-Za-z_]\w*, metric
expressions must be count(*) or {count|sum|avg|min|max|stddev}
([DISTINCT] <ident>), filter values coerced per type with '
escaping, order_by parsed and quoted, limit clamped to 10000,
HAVING values must be numeric.
- plugins/ckan/plugin.py: route aggregate_data through
SafeSQLBuilder (was string concatenation); call
SQLValidator.enforce_row_limit after validate_query.
- server/http_handler.py: reject JSON-RPC bodies > 65 KB with
HTTP 413 before parsing. The MCP surface fits in a few KB; a
megabyte payload is either a bug or abuse.
- tests: cover body-size cap at and over the boundary, each new
forbidden keyword/function, comment obfuscation, schema-qualified
FROM rejection, enforce_row_limit behavior, and every
SafeSQLBuilder method.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
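To make the allowlist idea concrete, here is a minimal sketch of the identifier, metric, and limit rules described in the commit above. It is not the `SafeSQLBuilder` in plugins/ckan/sql_validator.py: the function names `quote_identifier` and `build_metric` are hypothetical, the ASCII-only matching and the lower bound of 1 in `clamp_limit` are assumptions, and filter, `order_by`, and HAVING handling are left out.

```python
import re

MAX_LIMIT = 10000
# Identifier rule from the commit message: [A-Za-z_]\w* (ASCII-only is an assumption).
IDENT_RE = re.compile(r"[A-Za-z_]\w*", re.ASCII)
COUNT_STAR_RE = re.compile(r"count\(\s*\*\s*\)", re.IGNORECASE)
METRIC_RE = re.compile(
    r"(count|sum|avg|min|max|stddev)\(\s*(distinct\s+)?([A-Za-z_]\w*)\s*\)",
    re.IGNORECASE | re.ASCII,
)


def quote_identifier(name: str) -> str:
    """Return the identifier double-quoted, or raise if it is not allowlisted."""
    if not IDENT_RE.fullmatch(name):
        raise ValueError(f"invalid identifier: {name!r}")
    return f'"{name}"'


def build_metric(expr: str) -> str:
    """Rebuild a metric from its validated parts; never echo caller text."""
    text = expr.strip()
    if COUNT_STAR_RE.fullmatch(text):
        return "COUNT(*)"
    match = METRIC_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported metric expression: {expr!r}")
    func, distinct, column = match.groups()
    prefix = "DISTINCT " if distinct else ""
    return f"{func.upper()}({prefix}{quote_identifier(column)})"


def clamp_limit(limit: int) -> int:
    """Clamp to [1, MAX_LIMIT]; the lower bound is an assumption of this sketch."""
    return max(1, min(int(limit), MAX_LIMIT))
```

The point of the design is that the output string is assembled only from matched groups and fixed templates, so anything that fails the pattern is rejected rather than escaped.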
Document the Boston fork's AWS hosting and security posture, with portal protection as the top design constraint. Add stdio_bridge.py as a Python alternative to the Go stdio client for Claude Desktop.
- docs/AWS_DEPLOYMENT.md: how this fork is hosted (us-west-2, custom domain, reserved concurrency, cp311 packaging), what changed vs. upstream's single-region default, and how to operate the stack. Leads with the portal-protection design constraint.
- docs/SECURITY.md: the full rationale behind the hardening changes, organized around who is being protected: upstream portal first, deployment second, end users third. Covers the SQL validator and SafeSQLBuilder surface, rate limits and body-size cap, privacy posture (stateless, no PII, 14-day log retention, SQL truncated to 500 chars), and known gaps.
- README.md: link both new docs from the documentation table.
- stdio_bridge.py: minimal Python stdio-to-HTTP bridge. Reads JSON-RPC messages from stdin, POSTs them to the local/remote MCP server, writes responses to stdout. Useful where the Go client is impractical (Windows, no Go toolchain).
- CLAUDE.md: repo guidance for Claude Code sessions: commands, request flow, architecture notes, single-plugin rule.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds Method 4 to docs/TESTING.md covering how to wire the local HTTP server to Claude Desktop (claude_desktop_config.json) and Claude Code (.mcp.json) via stdio_bridge.py. The bridge was previously only mentioned in CLAUDE.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
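The stdio-to-HTTP bridge described above is small enough to sketch. The version below is an illustration, not the contents of stdio_bridge.py: it assumes one JSON-RPC message per input line, and `DEFAULT_URL` (including the port) is a made-up placeholder. The HTTP call is passed in as `send` so the loop can be exercised without a running server.

```python
import json
import sys
import urllib.request
from typing import Callable, TextIO

DEFAULT_URL = "http://localhost:8000/mcp"  # assumption: placeholder host and port


def post_json(url: str, body: bytes) -> str:
    """POST a JSON body and return the response text."""
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


def run_bridge(stdin: TextIO, stdout: TextIO, send: Callable[[bytes], str]) -> None:
    """Forward each JSON line from stdin via send(); write replies to stdout."""
    for line in stdin:
        message = line.strip()
        if not message:
            continue
        try:
            json.loads(message)  # only forward well-formed JSON
        except json.JSONDecodeError:
            print("skipping non-JSON input line", file=sys.stderr)
            continue
        reply = send(message.encode("utf-8")).strip()
        if reply:  # an empty body (e.g. for a notification) produces no output
            # Re-serialize so each reply occupies exactly one line;
            # a non-JSON reply raises here rather than being passed through.
            stdout.write(json.dumps(json.loads(reply)) + "\n")
            stdout.flush()
```

Wiring it to a real server would amount to `run_bridge(sys.stdin, sys.stdout, lambda body: post_json(DEFAULT_URL, body))`.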
Summary
Three logical changes, committed separately so they can be reviewed
independently:
- … cap on fan-out into the upstream CKAN portal, pin Lambda packaging to cp311/manylinux wheels so the ZIP imports correctly regardless of the build host's Python version, and extract the S3 backend config so the deployer's AWS account ID does not live in the public repo.
- … `aggregate_data` SQL builder to be allowlist-only and AST-aware, so that caller-supplied strings cannot escape into the generated query; cap HTTP request bodies at 64 KB with a 413 response before JSON parsing.
- `docs/AWS_DEPLOYMENT.md` and `docs/SECURITY.md` describe the above for future operators and for anyone forking this to run against another open-data portal. A Python `stdio_bridge.py` and a `CLAUDE.md` for repo guidance come along for the ride.
Why
The top design constraint is not overwhelming the upstream open-data
portal.
`data.boston.gov` is a shared civic resource, and an MCP server in front of it can easily become its noisiest client: one Claude
conversation can translate into dozens of SQL queries, each hitting CKAN's
DataStore. Four overlapping layers protect the portal:
- … the API Gateway rate limit via the Lambda Function URL, they cannot drive more than 10 parallel upstream queries.
- … quota (3000 requests/day per API key).
- `LIMIT` on `execute_sql`: `SQLValidator.enforce_row_limit` appends `LIMIT 10000` to any validated SQL lacking a top-level `LIMIT`.
- `LIMIT` on `aggregate_data`: `SafeSQLBuilder.clamp_limit` enforces `MAX_LIMIT = 10000`.
The SQL surface was also tightened against injection and DoS-via-expensive-query regardless of portal load:
- … `/* */` obfuscation).
- … `PREPARE`, `COPY`, `LISTEN`, `NOTIFY`, `VACUUM`, `ANALYZE`, `CLUSTER`, `REINDEX`, `LOAD`, `DO`).
- … `xp_cmdshell`, `pg_sleep`, `pg_read_file`, `pg_ls_dir`, `pg_stat_file`, `lo_import`, `lo_export`, `current_setting`, `set_config`, `dblink`).
- `FROM`/`JOIN` targets are AST-validated: each must be a UUID-quoted resource or a CTE alias; schema-qualified targets like `pg_catalog.pg_class` are rejected.
- `aggregate_data` path no longer builds SQL by string concatenation: every identifier, metric expression, filter value, and LIMIT goes through `SafeSQLBuilder` allowlist validation.
Privacy posture
Stateless: no database, no accounts, no sessions. CloudWatch logs retain
`request_id`, method/path, duration, status, and truncated SQL (500 chars) for 14 days. All data returned is public open data from
`data.boston.gov`.
See `docs/SECURITY.md` §4 for the full rationale.
Test plan
- `./scripts/deploy.sh --environment staging` produces a cp311 ZIP that cold-starts without 502
- `terraform plan` in `terraform/aws/` shows `reserved_concurrent_executions = 10` as the only Lambda diff for an already-deployed stack
- `DROP TABLE` rejected; comment-obfuscated `DROP` rejected; schema-qualified FROM rejected
- `SELECT * FROM "<uuid>"` without an explicit LIMIT is automatically clamped to 10000 rows
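The SQL checks in the test plan can be tried locally with a sketch like the one below. It is a simplified, regex-based stand-in and not the validator in this PR, which walks the sqlparse AST; `scannable`, `find_forbidden`, and `has_top_level_limit` are hypothetical names, `DROP` is included because the test plan expects it to be rejected, and schema-qualified FROM checking is out of scope here.

```python
import re

MAX_ROWS = 10000
FORBIDDEN = {"DROP", "PREPARE", "COPY", "LISTEN", "NOTIFY", "VACUUM",
             "ANALYZE", "CLUSTER", "REINDEX", "LOAD", "DO"}
# One pass over string literals, quoted identifiers, and both comment
# styles, so a "--" inside a literal is not mistaken for a comment.
_NOISE = re.compile(r"""'(?:[^']|'')*'|"[^"]*"|/\*.*?\*/|--[^\n]*""", re.DOTALL)


def scannable(sql: str) -> str:
    """Blank out literals, quoted identifiers, and comments (scan use only)."""
    return _NOISE.sub(" ", sql)


def find_forbidden(sql: str) -> list[str]:
    """Return forbidden keywords that appear as whole words."""
    words = set(re.findall(r"[A-Za-z_]+", scannable(sql).upper()))
    return sorted(words & FORBIDDEN)


def has_top_level_limit(sql: str) -> bool:
    """True if LIMIT appears outside every pair of parentheses."""
    depth = 0
    for token in re.findall(r"\(|\)|[A-Za-z_]+", scannable(sql)):
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
        elif depth == 0 and token.upper() == "LIMIT":
            return True
    return False


def enforce_row_limit(sql: str, max_rows: int = MAX_ROWS) -> str:
    """Append LIMIT when no top-level LIMIT exists; otherwise return sql as-is."""
    if has_top_level_limit(sql):
        return sql
    body = sql.rstrip().rstrip(";").rstrip()
    # The newline keeps the LIMIT out of any trailing "--" comment.
    return f"{body}\nLIMIT {max_rows}"
```

Note that this sketch leaves an existing top-level `LIMIT` untouched, matching the "lacking a top-level LIMIT" wording; it does not lower a caller-supplied limit above 10000.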
🤖 Generated with Claude Code