feat: harden Cloud Run security with Secret Manager, VPC connector, and private backend by ChingEnLin · Pull Request #32 · ChingEnLin/QueryPal

ChingEnLin · 2026-05-15T07:50:34Z

Summary

Secret Manager: All sensitive runtime secrets (Azure credentials, Gemini API key, DB credentials) are removed from GitHub Secrets and moved into GCP Secret Manager. Cloud Run reads them natively at startup via --set-secrets — they never appear in workflow logs, build args, or env var dumps.
VPC Connector: A Serverless VPC Access connector is provisioned by Terraform (terraform/network.tf), connecting both Cloud Run services to the project's default VPC. This is the foundation for private Cloud SQL access and internal service-to-service routing.
Private backend: The backend Cloud Run service now runs with --ingress=internal — it is completely unreachable from the public internet. The frontend nginx container proxies all /api/* requests to the backend's internal URL (injected at runtime as BACKEND_URL). Browsers call /api/... on the frontend, which routes internally.
Terraform IaC (terraform/): Manages the VPC connector, Secret Manager secret structures, a dedicated Cloud Run service account (least-privilege vs default Compute SA), and Cloud SQL (importable via terraform/import.sh — no data migration needed for the existing instance). The CI pipeline continues to own image builds and Cloud Run deployments.
Data migration script (scripts/migrate_db.sh): Ready-to-use script to migrate PostgreSQL data between Cloud SQL instances via Cloud SQL Auth Proxy, in case the database ever needs to be rebuilt.

Architecture after this PR

Internet → Frontend Cloud Run (public, --ingress=all)
                │  nginx proxy /api/* → BACKEND_URL (via VPC connector)
                ↓
           Backend Cloud Run (--ingress=internal, VPC only)
                │  Cloud SQL Proxy (unix socket)
                ↓
           Cloud SQL PostgreSQL

Pre-deploy checklist (one-time infrastructure setup)

Before merging to production, a GCP admin must run Terraform once:

cd terraform
cp terraform.tfvars.example terraform.tfvars
terraform init
./import.sh          # import existing Cloud SQL into Terraform state
terraform apply

Then populate Secret Manager with actual values:

echo -n "VALUE" | gcloud secrets versions add querypal-azure-tenant-id --data-file=-
echo -n "VALUE" | gcloud secrets versions add querypal-azure-client-id --data-file=-
echo -n "VALUE" | gcloud secrets versions add querypal-azure-client-secret --data-file=-
echo -n "VALUE" | gcloud secrets versions add querypal-gemini-api-key --data-file=-
echo -n "VALUE" | gcloud secrets versions add querypal-db-user --data-file=-
echo -n "VALUE" | gcloud secrets versions add querypal-db-pass --data-file=-

Test plan

CI backend tests (flake8, black, pytest) pass
CI frontend tests (vitest) pass
CI build verification passes
After infra setup: deploy to production and confirm frontend loads
Confirm /api/health responds through the nginx proxy
Confirm direct requests to the backend Cloud Run URL are rejected (403/404 from GFE)
Confirm secrets are not visible in Cloud Run environment variable UI (they show as secret references)
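
The two network checks in the test plan can be scripted. The following is a minimal sketch, not part of the PR: the helper names `http_status` and `backend_is_rejected` are hypothetical, and the URLs in the usage comment are placeholders for the real frontend and backend addresses.

```shell
# Return the HTTP status code for a URL (000 if the request fails outright).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# Succeed only for the status codes the test plan treats as "rejected".
backend_is_rejected() {
  case "$1" in
    403|404) return 0 ;;
    *)       return 1 ;;
  esac
}

# Usage (placeholder URLs, run from a machine outside the VPC):
#   [ "$(http_status "https://FRONTEND_URL/api/health")" = "200" ] && echo "proxy OK"
#   backend_is_rejected "$(http_status "https://BACKEND_URL/")" && echo "backend is private"
```

Keeping the classification separate from the `curl` call makes it easy to adjust if the rejection status observed in practice differs.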

https://claude.ai/code/session_01SRRzCWrpwgMpdYFurMVn7m

Generated by Claude Code

…nd private backend

- Secret Manager: move all sensitive env vars (Azure credentials, Gemini key, DB credentials) out of GitHub Secrets and into GCP Secret Manager; Cloud Run reads them at runtime via --set-secrets, so secrets are never exposed in workflow logs or build args.
- VPC Connector: add Serverless VPC Access connector (terraform/network.tf) so Cloud Run services can reach Cloud SQL and each other over the private VPC network.
- Private backend: set backend Cloud Run ingress to 'internal', blocking all public internet access. Frontend nginx now proxies /api/* to the backend's internal URL (with BACKEND_URL injected as a runtime env var), so the browser never needs a direct connection to the backend.
- Terraform IaC: terraform/ directory manages the VPC connector, Secret Manager secrets, Cloud Run service account, and Cloud SQL (importable via import.sh). CI continues to own image builds and Cloud Run deployments.
- Data migration script: scripts/migrate_db.sh migrates PostgreSQL data between Cloud SQL instances via Cloud SQL Auth Proxy if the database ever needs to be rebuilt.

https://claude.ai/code/session_01SRRzCWrpwgMpdYFurMVn7m

…ITE_API_BASE_URL

GitHub Actions does not interpolate ${{ env.X }} inside the top-level env: block, so the full SA email could not reference PROJECT_ID there. Replaced CLOUD_RUN_SA with CLOUD_RUN_SA_NAME and build the email inline in the flags blocks where expression context is available.

Added a comment explaining VITE_API_BASE_URL=/api: it is the nginx location prefix, not a full URL, because the browser calls the frontend's own origin and nginx proxies /api/* to the internal backend.

https://claude.ai/code/session_01SRRzCWrpwgMpdYFurMVn7m

ChingEnLin · 2026-05-16T20:21:45Z

Migration Guide: Moving to Full IaC-managed Infrastructure

Before merging this to production, a one-time setup is needed. Steps must be run in order — the sequence matters.

What Terraform manages vs what CI still owns

| Resource | Terraform action | Impact on running app |
| --- | --- | --- |
| VPC connector | CREATE (new) | None until next CI deploy |
| Secret Manager secrets | CREATE (new, empty) | None until values are added |
| Cloud Run service account | CREATE (new) | None until next CI deploy |
| IAM bindings | CREATE (new) | None |
| Cloud SQL instance `querypal-db` | IMPORT (existing) | Zero (instance is untouched) |
| Cloud SQL database `querypal` | IMPORT (existing) | Zero (database is untouched) |
| Cloud Run services | ❌ not managed by Terraform | CI pipeline owns deploys |

Step 1 — Check that `database.tf` matches the real Cloud SQL instance

Before importing, verify the config reflects what actually exists, especially tier, database_version, and backup_configuration. A mismatch caught before import is just a config fix. A mismatch caught after import shows up as a planned change — if that change is a replacement, that means destroy + recreate, which would wipe the database.

gcloud sql instances describe querypal-db --format=json

Fix any discrepancies in terraform/database.tf before proceeding.

Step 2 — Import existing Cloud SQL into Terraform state

This registers the existing instance with Terraform without modifying it. The application stays live throughout.

cd terraform
cp terraform.tfvars.example terraform.tfvars
terraform init
./import.sh

After import, run terraform plan and read it carefully. The plan should show no changes, or only safe in-place modifications (tier, flags, backup settings). If you see any resource marked for replacement (-/+), stop — do not apply. Fix the config until the plan is clean.
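
Reading the plan by eye can be backed up with a mechanical check. This is a small sketch under stated assumptions: `plan_has_replacement` is a hypothetical helper, and it relies on Terraform's human-readable plan output using the phrase "must be replaced" for any resource it intends to destroy and recreate. It needs uncolored output, hence `-no-color` in the usage comment.

```shell
# Read plain-text `terraform plan` output on stdin and succeed if any
# resource is scheduled for replacement (destroy + recreate).
plan_has_replacement() {
  grep -q 'must be replaced'
}

# Usage (from the terraform/ directory, after ./import.sh):
#   terraform plan -no-color > plan.txt
#   if plan_has_replacement < plan.txt; then
#     echo "STOP: the plan replaces a resource; fix database.tf first" >&2
#   fi
```

The check is a safety net rather than a substitute for reviewing the plan, since in-place changes can also be unwanted.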

Step 3 — Apply Terraform to create new resources

`terraform apply`

Creates: VPC connector (takes 2–5 min to provision), Secret Manager secret shells, dedicated Cloud Run SA, and IAM bindings. Nothing here touches the running application.

Step 4 — Populate Secret Manager with actual values

⚠️ Do this before triggering any CI deployment. If Cloud Run tries to mount a secret with no versions it refuses to start. The old revision stays live (Cloud Run only cuts traffic after the new revision passes health checks), so the app won't go down — but the deploy will time out.

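# Requires bash (read -s is not POSIX).
for SECRET_ID in \
  querypal-azure-tenant-id \
  querypal-azure-client-id \
  querypal-azure-client-secret \
  querypal-gemini-api-key \
  querypal-db-user \
  querypal-db-pass; do
  # Prompt without echoing the value to the terminal or shell history.
  printf 'Enter value for %s: ' "${SECRET_ID}"
  IFS= read -rs VALUE && echo
  # printf writes the value verbatim with no trailing newline, unlike echo.
  printf '%s' "${VALUE}" | gcloud secrets versions add "${SECRET_ID}" --data-file=-
done
unset VALUE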

Verify each secret has at least one version:

gcloud secrets versions list querypal-gemini-api-key
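
The command above checks a single secret. To cover all six before deploying, a loop along these lines may help. It is a sketch, not part of the PR: `secrets_missing_versions` is a hypothetical helper, and it assumes that an ENABLED version is what Cloud Run needs in order to mount the secret.

```shell
# Print every secret ID that has no ENABLED version.
# Returns non-zero if at least one secret is missing a version.
secrets_missing_versions() {
  local id found missing
  missing=0
  for id in "$@"; do
    # --limit=1 is enough: we only need to know whether any version exists.
    found=$(gcloud secrets versions list "$id" \
      --filter='state=ENABLED' --format='value(name)' --limit=1 2>/dev/null)
    if [ -z "$found" ]; then
      echo "$id"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (requires an authenticated gcloud):
#   secrets_missing_versions \
#     querypal-azure-tenant-id querypal-azure-client-id \
#     querypal-azure-client-secret querypal-gemini-api-key \
#     querypal-db-user querypal-db-pass \
#     || echo "populate the secrets listed above before deploying" >&2
```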

Step 5 — Trigger a CI deployment (push to `production`)

The updated workflow now applies:

--set-secrets → secrets come from Secret Manager, not env vars
--vpc-connector → both services join the VPC
--ingress=internal on backend → backend is unreachable from the public internet
--service-account → uses the new least-privilege SA instead of the default Compute SA

The frontend remains publicly accessible. Direct requests to the backend's .run.app URL from the internet will be rejected by Google's frontend with a 403 or 404 (the test plan accepts either); only the frontend nginx proxy (coming through the VPC connector) can reach it.

Monitor the rollout:

gcloud run services describe querypal-backend --region=europe-west1
gcloud run services describe querypal-frontend --region=europe-west1

Step 6 — Delete the old GitHub Secrets

Once the app is verified working, delete these from repository Settings → Secrets:

AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, GEMINI_API_KEY, DB_USER, DB_PASS

Risks to be aware of

Cloud SQL replacement in the plan — deletion_protection = true in database.tf will block a destroy, but it's not a substitute for reading the plan. If you see -/+ on the Cloud SQL instance, fix the config discrepancy first.

VPC connector CIDR conflict — the connector reserves 10.8.0.0/28. If that overlaps an existing subnet in your VPC the apply will fail. Check first:

gcloud compute networks subnets list --filter="region:europe-west1"

If there's a conflict, change vpc_connector_cidr in variables.tf before applying.
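
Comparing the connector range against each listed subnet by hand is error-prone, so the overlap test can be sketched in plain shell arithmetic. The function names here are hypothetical and nothing in it is specific to GCP; it only treats each IPv4 CIDR as an integer interval and checks whether the two intervals intersect.

```shell
# Convert a dotted IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Print "first last" (as integers) for a CIDR such as 10.8.0.0/28.
cidr_range() {
  local base len mask first last
  base=$(ip_to_int "${1%/*}")
  len=${1#*/}
  mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  first=$(( base & mask ))
  last=$(( first | (~mask & 0xFFFFFFFF) ))
  echo "$first $last"
}

# Succeed if the two CIDR blocks share at least one address.
cidrs_overlap() {
  set -- $(cidr_range "$1") $(cidr_range "$2")
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

# Usage: feed it the ranges printed by the subnets list command, e.g.
#   cidrs_overlap 10.8.0.0/28 "$SUBNET_RANGE" && echo "conflict with $SUBNET_RANGE"
```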

IAM propagation delay — new IAM bindings can take up to ~60 seconds to take effect. If the first CI deploy immediately after terraform apply fails with a permission error, wait a minute and re-run.
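
The "wait a minute and re-run" advice can be wrapped in a generic retry helper. This is a sketch only: `retry` is a hypothetical function, and the command in the usage comment is a placeholder for however the deploy is re-triggered.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, sleeping DELAY_SECONDS between tries.
# Returns 1 if COMMAND still fails after ATTEMPTS tries.
retry() {
  local attempts delay n
  attempts=$1
  delay=$2
  shift 2
  n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage (placeholder command): try up to 3 times, 60 seconds apart.
#   retry 3 60 ./redeploy.sh
```

A fixed delay is enough here because the failure is expected to clear once the binding has propagated; it is not meant to mask other permission errors.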

Terraform state is local by default — terraform.tfstate is gitignored. Losing it means Terraform loses track of what it manages and will try to recreate everything on the next apply. Enable the GCS backend (commented out in main.tf) before running this in a team or from CI.

Generated by Claude Code

ChingEnLin · 2026-05-16T20:24:52Z

@claude review this pr, and check if the step described in this comment #32 (comment) is legit

claude · 2026-05-16T20:25:03Z

Claude finished @ChingEnLin's task in 2m 30s

PR Review + Migration Guide Validation

Tasks:

Gather context and understand request
Read changed files (workflow, Terraform, nginx, scripts)
Review PR code quality and security
Validate migration guide steps against actual code
Post comprehensive review

PR Code Review

Overall this is a solid, well-structured security hardening PR. The architecture — public frontend nginx proxy → internal backend via VPC connector → Cloud SQL — is the right pattern for Cloud Run. The code is clean and well-commented. Below are specific findings by file.

`.github/workflows/google-cloudrun-docker.yml`

Issues:

--allow-unauthenticated on the backend is surprising alongside --ingress=internal (line 90). These two flags don't conflict — --ingress=internal restricts which network path can reach the service, and --allow-unauthenticated disables Cloud Run's IAM auth token check. With this combo, any VPC-internal caller can hit the backend without presenting a service account token. That's intentional here (nginx doesn't add auth tokens), but it means if anything else lands on your VPC it can call the backend freely. For stronger defense-in-depth, consider having nginx pass a shared secret header that the backend validates, or use Cloud Run service identity auth between frontend and backend.
--vpc-egress=private-ranges-only on the backend (line 88). This routes only RFC-1918 traffic through the VPC connector; public internet calls still go via Cloud Run's default internet egress. The backend calls Azure AD (login.microsoftonline.com) and Gemini API — those will continue to work since they go out the non-VPC path. This is fine functionally but worth documenting explicitly so future maintainers don't wonder why the VPC connector is attached but internet calls still work.
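--vpc-egress=all-traffic on the frontend (line 122). This is correct: the frontend needs to route traffic to $BACKEND_URL through the VPC connector so that it reaches the backend as internal traffic. A .run.app hostname resolves to public addresses, so with private-ranges-only those requests would bypass the connector and be rejected by --ingress=internal.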
No .terraform.lock.hcl committed. This is minor, but provider lock files should be in version control to get reproducible terraform init runs.

`terraform/iam.tf`

Issue — over-broad secretmanager.secretAccessor at project level (lines 10–14 and 32–36):

resource "google_project_iam_member" "cloudrun_secret_accessor" {
  role    = "roles/secretmanager.secretAccessor"
  member  = "serviceAccount:${google_service_account.cloudrun_sa.email}"
}

This grants the Cloud Run SA access to all secrets in the project. For least privilege, prefer per-secret bindings using google_secret_manager_secret_iam_member. For a small single-app project this is a reasonable trade-off, but worth knowing.

Issue: the github_actions_secret_accessor binding is unnecessary (lines 32–36). The CI workflow never reads secret values; it only references secret names in the secrets: block, and Cloud Run fetches them at runtime using the Cloud Run SA. The GitHub Actions SA doesn't need secretmanager.secretAccessor.

`terraform/database.tf`

Good: deletion_protection = true and ssl_mode = "ENCRYPTED_ONLY" are both correct defensive settings.

Potential import mismatch risk: The database.tf hardcodes tier = "db-f1-micro", POSTGRES_15, specific backup settings, and log_connections = on. If the real instance differs on any of these, terraform plan after import will show a diff. The migration guide correctly warns about this — the -/+ replace risk is real.

`terraform/network.tf`

Cost note: min_instances = 2 means two e2-micro VMs are always running (~$10–15/month). This is the minimum allowed for VPC connectors. Acceptable, but worth being aware of.

`frontend/nginx.conf`

Good: The envsubst '$PORT $BACKEND_URL' scope in docker-entrypoint.sh is correctly limited to avoid clobbering nginx's own $uri, $proxy_host, etc. variables. The trailing slash on proxy_pass $BACKEND_URL/ correctly strips the /api prefix. The 300s timeouts are appropriate for long-running AI queries.

Minor: proxy_set_header X-Forwarded-Proto $http_x_forwarded_proto; passes through the incoming value rather than hardcoding https. Since Cloud Run's load balancer sets this header, it will be https in production. Correct behavior.

`scripts/migrate_db.sh`

Good: Clean argument parsing, proper trap cleanup EXIT, and temp file handling.

Minor — sleep 3 after starting proxies (line 85) is brittle. If the proxies are slow to start (cold network, IAM propagation) the dump will fail with a connection error. A retry loop with pg_isready would be more robust, but for an infrequently-run migration script this is acceptable.
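
A retry loop of the kind suggested above could look like the following. It is a sketch, not the script's actual code: `wait_for_pg` is a hypothetical helper, and the host and ports in the usage comment are assumptions about where the two proxies listen.

```shell
# wait_for_pg HOST PORT [TIMEOUT_SECONDS]
# Polls pg_isready once per second until the server accepts connections.
# Returns 1 if it is still not ready after TIMEOUT_SECONDS (default 30).
wait_for_pg() {
  local host port timeout waited
  host=$1
  port=$2
  timeout=${3:-30}
  waited=0
  until pg_isready -q -h "$host" -p "$port"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Usage in place of `sleep 3` (ports are assumptions):
#   wait_for_pg 127.0.0.1 5432 || { echo "source proxy not ready" >&2; exit 1; }
#   wait_for_pg 127.0.0.1 5433 || { echo "target proxy not ready" >&2; exit 1; }
```

`pg_isready` only confirms that the server is accepting connections, not that the credentials are valid, so the dump can still fail on authentication.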

Migration Guide Validation

Bottom line: the steps are legitimate and safe. The sequence (verify → import → plan carefully → populate secrets → deploy → clean up old secrets) is the correct order. Here are specific things to note:

✅ What's correct

Step 1 (verify database.tf before import) — essential step, correctly explained. The -/+ replacement risk is real.
Step 2 (import.sh) — the script itself is correct. The || echo "Already imported" guards make it re-runnable safely.
Step 4 (populate secrets before deploying) — the warning is accurate. Cloud Run refuses to start a revision if --set-secrets references a secret with no versions. The old revision stays live (GFE holds traffic until health check passes), so the app won't go down — but the deploy will sit in "deploying" state until it times out.
Step 5 (push to production to trigger CI) — matches the workflow trigger on line 12 of the workflow file.
Step 6 (delete old GitHub Secrets) — correct cleanup.
All risk warnings (Cloud SQL replacement, CIDR conflict, IAM propagation, local state) are accurate and actionable.

⚠️ Minor inconsistencies to be aware of

Redundant terraform init in Step 2. The guide says to run terraform init manually, then ./import.sh — but import.sh already runs terraform init internally (line 17 of import.sh). Running it twice is harmless (it's idempotent), just slightly redundant.
import.sh saves a plan file (tfplan), but Step 3 runs plain terraform apply. import.sh ends with terraform plan -out=tfplan and suggests terraform apply tfplan. The guide's Step 3 says just terraform apply (without using the saved plan file). If time passes between the import and the apply, the saved plan could be stale. Using terraform apply tfplan would be slightly safer since it guarantees applying exactly what was reviewed — or just re-run terraform plan before applying.
terraform.tfvars is optional. All variables in variables.tf already have defaults that match the example file. The cp terraform.tfvars.example terraform.tfvars step is good practice for documentation purposes, but it won't change any behavior since the defaults are identical.

🔴 One gap not covered in the guide

The GitHub Actions SA needs iam.serviceAccountUser on the Cloud Run SA to deploy with --service-account. This is covered by google_service_account_iam_member.github_actions_act_as_cloudrun_sa in iam.tf — but this binding is only created after terraform apply. The guide doesn't explicitly warn that the first post-Terraform CI deploy might fail with an IAM error if the binding hasn't propagated (60s window mentioned in the guide). The mitigation is already there (wait a minute and retry), but it's worth calling out explicitly in the guide.

Summary

The PR is well-architected and production-ready from a security standpoint. The migration guide steps are legitimate — follow them in order. The two most important actions before merging to production:

Run gcloud sql instances describe querypal-db --format=json and reconcile any differences with terraform/database.tf before running ./import.sh.
Populate all 6 Secret Manager secrets before triggering the first CI deployment.

…README

- Add production architecture Mermaid diagram showing Cloud Run services, VPC connector, Secret Manager, Cloud SQL, and external dependencies
- Add network security model table (frontend public / backend internal)
- Add secret management table listing all Secret Manager secrets
- Add IaC ownership table (Terraform vs CI pipeline)
- Add CI/CD pipeline Mermaid flowchart showing Workload Identity auth, image build/push, and deploy steps with secret injection
- Add Terraform to technology stack table

Also remove overly broad github_actions_secret_accessor IAM binding from terraform/iam.tf: the GitHub Actions SA never reads secret values directly; Cloud Run reads them at startup using the Cloud Run SA identity.

https://claude.ai/code/session_01SRRzCWrpwgMpdYFurMVn7m

github-actions · 2026-05-25T19:18:12Z

🎉 This PR is included in version 2.11.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

claude added 2 commits May 15, 2026 07:50

ChingEnLin force-pushed the feat/infra-security-hardening branch from 3a55fe4 to cdb3ddb Compare May 16, 2026 20:21

ChingEnLin assigned ChingEnLin and unassigned ChingEnLin May 16, 2026

ChingEnLin merged commit 245c14c into dev May 17, 2026
3 checks passed

ChingEnLin deleted the feat/infra-security-hardening branch May 17, 2026 09:32

ChingEnLin mentioned this pull request May 17, 2026

chore: merge dev into production #34

Merged

8 tasks

ChingEnLin mentioned this pull request May 25, 2026

Release: QueryArgus audits, model selection, cross-collection $lookup #38

Merged

7 tasks

github-actions Bot added the released label May 25, 2026

ChingEnLin merged 3 commits into dev from feat/infra-security-hardening
