feat: harden Cloud Run security with Secret Manager, VPC connector, and private backend #32

Merged
ChingEnLin merged 3 commits into dev from feat/infra-security-hardening
May 17, 2026
Conversation

@ChingEnLin
Owner

Summary

  • Secret Manager: All sensitive runtime secrets (Azure credentials, Gemini API key, DB credentials) are removed from GitHub Secrets and moved into GCP Secret Manager. Cloud Run reads them natively at startup via --set-secrets — they never appear in workflow logs, build args, or env var dumps.

  • VPC Connector: A Serverless VPC Access connector is provisioned by Terraform (terraform/network.tf), connecting both Cloud Run services to the project's default VPC. This is the foundation for private Cloud SQL access and internal service-to-service routing.

  • Private backend: The backend Cloud Run service now runs with --ingress=internal — it is completely unreachable from the public internet. The frontend nginx container proxies all /api/* requests to the backend's internal URL (injected at runtime as BACKEND_URL). Browsers call /api/... on the frontend, which routes internally.

  • Terraform IaC (terraform/): Manages the VPC connector, Secret Manager secret structures, a dedicated Cloud Run service account (least-privilege vs default Compute SA), and Cloud SQL (importable via terraform/import.sh — no data migration needed for the existing instance). The CI pipeline continues to own image builds and Cloud Run deployments.

  • Data migration script (scripts/migrate_db.sh): Ready-to-use script to migrate PostgreSQL data between Cloud SQL instances via Cloud SQL Auth Proxy, in case the database ever needs to be rebuilt.

Architecture after this PR

Internet → Frontend Cloud Run (public, --ingress=all)
                │  nginx proxy /api/* → BACKEND_URL (via VPC connector)
                ↓
           Backend Cloud Run (--ingress=internal, VPC only)
                │  Cloud SQL Proxy (unix socket)
                ↓
           Cloud SQL PostgreSQL

Pre-deploy checklist (one-time infrastructure setup)

Before merging to production, a GCP admin must run Terraform once:

cd terraform
cp terraform.tfvars.example terraform.tfvars
terraform init
./import.sh          # import existing Cloud SQL into Terraform state
terraform apply

Then populate Secret Manager with actual values:

echo -n "VALUE" | gcloud secrets versions add querypal-azure-tenant-id --data-file=-
echo -n "VALUE" | gcloud secrets versions add querypal-azure-client-id --data-file=-
echo -n "VALUE" | gcloud secrets versions add querypal-azure-client-secret --data-file=-
echo -n "VALUE" | gcloud secrets versions add querypal-gemini-api-key --data-file=-
echo -n "VALUE" | gcloud secrets versions add querypal-db-user --data-file=-
echo -n "VALUE" | gcloud secrets versions add querypal-db-pass --data-file=-

Test plan

  • CI backend tests (flake8, black, pytest) pass
  • CI frontend tests (vitest) pass
  • CI build verification passes
  • After infra setup: deploy to production and confirm frontend loads
  • Confirm /api/health responds through the nginx proxy
  • Confirm direct requests to the backend Cloud Run URL are rejected (403/404 from GFE)
  • Confirm secrets are not visible in Cloud Run environment variable UI (they show as secret references)

https://claude.ai/code/session_01SRRzCWrpwgMpdYFurMVn7m


Generated by Claude Code

claude added 2 commits May 15, 2026 07:50
…nd private backend

- Secret Manager: move all sensitive env vars (Azure credentials, Gemini key,
  DB credentials) out of GitHub Secrets and into GCP Secret Manager; Cloud Run
  reads them at runtime via --set-secrets, so secrets are never exposed in
  workflow logs or build args.

- VPC Connector: add Serverless VPC Access connector (terraform/network.tf) so
  Cloud Run services can reach Cloud SQL and each other over the private VPC
  network.

- Private backend: set backend Cloud Run ingress to 'internal', blocking all
  public internet access. Frontend nginx now proxies /api/* to the backend's
  internal URL (with BACKEND_URL injected as a runtime env var), so the browser
  never needs a direct connection to the backend.

- Terraform IaC: terraform/ directory manages the VPC connector, Secret Manager
  secrets, Cloud Run service account, and Cloud SQL (importable via import.sh).
  CI continues to own image builds and Cloud Run deployments.

- Data migration script: scripts/migrate_db.sh migrates PostgreSQL data between
  Cloud SQL instances via Cloud SQL Auth Proxy if the database ever needs to be
  rebuilt.

https://claude.ai/code/session_01SRRzCWrpwgMpdYFurMVn7m
…ITE_API_BASE_URL

GitHub Actions does not interpolate ${{ env.X }} inside the top-level env:
block, so the full SA email could not reference PROJECT_ID there. Replaced
CLOUD_RUN_SA with CLOUD_RUN_SA_NAME and build the email inline in the flags
blocks where expression context is available.

Added a comment explaining VITE_API_BASE_URL=/api — it is the nginx location
prefix, not a full URL, because the browser calls the frontend's own origin
and nginx proxies /api/* to the internal backend.

https://claude.ai/code/session_01SRRzCWrpwgMpdYFurMVn7m
Owner Author

Migration Guide: Moving to Full IaC-managed Infrastructure

Before merging this to production, a one-time setup is needed. Steps must be run in order — the sequence matters.


What Terraform manages vs what CI still owns

| Resource | Terraform action | Impact on running app |
| --- | --- | --- |
| VPC connector | CREATE (new) | None until next CI deploy |
| Secret Manager secrets | CREATE (new, empty) | None until values are added |
| Cloud Run service account | CREATE (new) | None until next CI deploy |
| IAM bindings | CREATE (new) | None |
| Cloud SQL instance querypal-db | IMPORT (existing) | Zero (instance is untouched) |
| Cloud SQL database querypal | IMPORT (existing) | Zero (database is untouched) |
| Cloud Run services | ❌ not managed by Terraform | CI pipeline owns deploys |

Step 1 — Check that database.tf matches the real Cloud SQL instance

Before importing, verify the config reflects what actually exists, especially tier, database_version, and backup_configuration. A mismatch caught before import is just a config fix. A mismatch caught after import shows up as a planned change — if that change is a replacement, that means destroy + recreate, which would wipe the database.

gcloud sql instances describe querypal-db --format=json

Fix any discrepancies in terraform/database.tf before proceeding.
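The comparison can be done field by field rather than by reading the full JSON. The following is a minimal sketch, not part of this PR: `compare_field` is a hypothetical helper, and the expected values are assumed to be copied by hand from terraform/database.tf.

```shell
# Hypothetical helper: compare one value declared in database.tf with the
# live value. Prints the result and returns non-zero on a mismatch.
compare_field() {
  name="$1"; expected="$2"; actual="$3"
  if [ "$expected" = "$actual" ]; then
    echo "OK ${name}=${actual}"
  else
    echo "MISMATCH ${name}: database.tf=${expected} live=${actual}"
    return 1
  fi
}

# Example usage (needs gcloud auth; replace the expected values with
# whatever database.tf actually declares):
#   compare_field tier "db-f1-micro" \
#     "$(gcloud sql instances describe querypal-db --format='value(settings.tier)')"
#   compare_field database_version "POSTGRES_15" \
#     "$(gcloud sql instances describe querypal-db --format='value(databaseVersion)')"
```

Checking one field at a time keeps each mismatch visible on its own line, which makes it easier to decide whether the fix belongs in the config or is an intended change.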


Step 2 — Import existing Cloud SQL into Terraform state

This registers the existing instance with Terraform without modifying it. The application stays live throughout.

cd terraform
cp terraform.tfvars.example terraform.tfvars
terraform init
./import.sh

After import, run terraform plan and read it carefully. The plan should show no changes, or only safe in-place modifications (tier, flags, backup settings). If you see any resource marked for replacement (-/+), stop — do not apply. Fix the config until the plan is clean.
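As a safety net on top of reading the plan, the replacement check can be scripted. This is a rough sketch under one assumption: it matches the human-readable wording of the plan ("must be replaced" and the `-/+` marker), which may differ between Terraform versions. `plan_has_replacement` is a hypothetical name; `terraform show -json` on a saved plan would be the more stable interface.

```shell
# Reads `terraform plan -no-color` output on stdin.
# Returns 0 if any resource is planned for replacement, 1 otherwise.
plan_has_replacement() {
  grep -Eq 'must be replaced|^[[:space:]]*-/\+ '
}

# Example usage:
#   if terraform plan -no-color | plan_has_replacement; then
#     echo "Replacement planned: do not apply" >&2
#   fi
```

A text match like this only catches the obvious case. It does not replace reviewing the plan, especially for the Cloud SQL instance.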


Step 3 — Apply Terraform to create new resources

`terraform apply`

Creates: VPC connector (takes 2–5 min to provision), Secret Manager secret shells, dedicated Cloud Run SA, and IAM bindings. Nothing here touches the running application.


Step 4 — Populate Secret Manager with actual values

⚠️ Do this before triggering any CI deployment. If Cloud Run tries to mount a secret with no versions it refuses to start. The old revision stays live (Cloud Run only cuts traffic after the new revision passes health checks), so the app won't go down — but the deploy will time out.

for SECRET_ID in \
  querypal-azure-tenant-id \
  querypal-azure-client-id \
  querypal-azure-client-secret \
  querypal-gemini-api-key \
  querypal-db-user \
  querypal-db-pass; do
  echo -n "Enter value for ${SECRET_ID}: "
  read -rs VALUE && echo
  echo -n "${VALUE}" | gcloud secrets versions add "${SECRET_ID}" --data-file=-
done

Verify each secret has at least one version:

gcloud secrets versions list querypal-gemini-api-key
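To check all six secrets in one pass instead of one at a time, something like the following could work. It is a sketch, not part of this PR: `secrets_without_versions` is a hypothetical helper, and it assumes a secret counts as populated once it has at least one version in the ENABLED state.

```shell
# Prints the ID of every listed secret that has no ENABLED version.
# Returns non-zero if at least one secret is still empty.
secrets_without_versions() {
  missing=0
  for secret_id in "$@"; do
    versions="$(gcloud secrets versions list "$secret_id" \
      --filter='state=ENABLED' --format='value(name)')"
    if [ -z "$versions" ]; then
      echo "$secret_id"
      missing=1
    fi
  done
  return "$missing"
}

# Example usage:
#   secrets_without_versions \
#     querypal-azure-tenant-id querypal-azure-client-id querypal-azure-client-secret \
#     querypal-gemini-api-key querypal-db-user querypal-db-pass
```

An empty output with exit status 0 means every secret has a usable version and the CI deployment can be triggered.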

Step 5 — Trigger a CI deployment (push to production)

The updated workflow now applies:

  • --set-secrets → secrets come from Secret Manager, not env vars
  • --vpc-connector → both services join the VPC
  • --ingress=internal on backend → backend is unreachable from the public internet
  • --service-account → uses the new least-privilege SA instead of the default Compute SA

The frontend remains publicly accessible. Direct requests to the backend's .run.app URL from the internet will receive a 403 from Google's frontend — only the frontend nginx proxy (coming through the VPC connector) can reach it.

Monitor the rollout:

gcloud run services describe querypal-backend --region=europe-west1
gcloud run services describe querypal-frontend --region=europe-west1
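Once the rollout has finished, the ingress restriction can be spot-checked from a machine outside the VPC. A minimal sketch, assuming a 403 (or the 404 that the PR's test plan also lists) means the request was rejected before reaching the backend; `status_means_blocked` and `BACKEND_RUN_APP_URL` are hypothetical names.

```shell
# Succeeds when the HTTP status indicates the request was rejected before
# reaching the backend (the 403/404 responses the test plan expects).
status_means_blocked() {
  case "$1" in
    403|404) return 0 ;;
    *)       return 1 ;;
  esac
}

# Example usage (BACKEND_RUN_APP_URL is a placeholder for the backend's
# .run.app URL; run this from outside the VPC):
#   code="$(curl -s -o /dev/null -w '%{http_code}' "$BACKEND_RUN_APP_URL")"
#   if status_means_blocked "$code"; then
#     echo "backend is private"
#   else
#     echo "backend answered with $code" >&2
#   fi
```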

Step 6 — Delete the old GitHub Secrets

Once the app is verified working, delete these from repository Settings → Secrets:

AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, GEMINI_API_KEY, DB_USER, DB_PASS


Risks to be aware of

Cloud SQL replacement in the plan: deletion_protection = true in database.tf will block a destroy, but it's not a substitute for reading the plan. If you see -/+ on the Cloud SQL instance, fix the config discrepancy first.

VPC connector CIDR conflict — the connector reserves 10.8.0.0/28. If that overlaps an existing subnet in your VPC the apply will fail. Check first:

gcloud compute networks subnets list --filter="region:europe-west1"

If there's a conflict, change vpc_connector_cidr in variables.tf before applying.
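The overlap test itself is simple arithmetic: two CIDR blocks overlap exactly when they agree on the shorter of the two prefixes. The sketch below illustrates that for IPv4 only; `ip_to_int` and `cidr_overlaps` are hypothetical helpers, not part of this PR.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  saved_ifs="$IFS"
  IFS=.
  set -- $1          # split a.b.c.d into four positional parameters
  IFS="$saved_ifs"
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Returns 0 if the two CIDR blocks overlap, 1 otherwise.
cidr_overlaps() {
  a_ip="${1%/*}"; a_len="${1#*/}"
  b_ip="${2%/*}"; b_len="${2#*/}"
  # Compare both addresses on the shorter prefix length.
  if [ "$a_len" -lt "$b_len" ]; then len="$a_len"; else len="$b_len"; fi
  shift_bits=$(( 32 - len ))
  [ $(( $(ip_to_int "$a_ip") >> shift_bits )) -eq $(( $(ip_to_int "$b_ip") >> shift_bits )) ]
}

# Example usage against the subnets returned by gcloud:
#   for range in $(gcloud compute networks subnets list \
#       --filter="region:europe-west1" --format='value(ipCidrRange)'); do
#     cidr_overlaps 10.8.0.0/28 "$range" && echo "conflict with $range"
#   done
```

This only covers the primary ranges of the listed subnets. Secondary ranges and peered networks would need to be checked separately.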

IAM propagation delay — new IAM bindings can take up to ~60 seconds to take effect. If the first CI deploy immediately after terraform apply fails with a permission error, wait a minute and re-run.

Terraform state is local by default: terraform.tfstate is gitignored. Losing it means Terraform loses track of what it manages and will try to recreate everything on the next apply. Enable the GCS backend (commented out in main.tf) before running this in a team or from CI.


Generated by Claude Code

@ChingEnLin ChingEnLin force-pushed the feat/infra-security-hardening branch from 3a55fe4 to cdb3ddb Compare May 16, 2026 20:21
@ChingEnLin ChingEnLin assigned ChingEnLin and unassigned ChingEnLin May 16, 2026
@ChingEnLin
Owner Author

@claude review this PR, and check whether the steps described in this comment #32 (comment) are legit

@claude
claude Bot commented May 16, 2026

Claude finished @ChingEnLin's task in 2m 30s


PR Review + Migration Guide Validation

Tasks:

  • Gather context and understand request
  • Read changed files (workflow, Terraform, nginx, scripts)
  • Review PR code quality and security
  • Validate migration guide steps against actual code
  • Post comprehensive review

PR Code Review

Overall this is a solid, well-structured security hardening PR. The architecture — public frontend nginx proxy → internal backend via VPC connector → Cloud SQL — is the right pattern for Cloud Run. The code is clean and well-commented. Below are specific findings by file.

.github/workflows/google-cloudrun-docker.yml

Issues:

  1. --allow-unauthenticated on the backend is surprising alongside --ingress=internal (line 90). These two flags don't conflict — --ingress=internal restricts which network path can reach the service, and --allow-unauthenticated disables Cloud Run's IAM auth token check. With this combo, any VPC-internal caller can hit the backend without presenting a service account token. That's intentional here (nginx doesn't add auth tokens), but it means if anything else lands on your VPC it can call the backend freely. For stronger defense-in-depth, consider having nginx pass a shared secret header that the backend validates, or use Cloud Run service identity auth between frontend and backend.

  2. --vpc-egress=private-ranges-only on the backend (line 88). This routes only RFC-1918 traffic through the VPC connector; public internet calls still go via Cloud Run's default internet egress. The backend calls Azure AD (login.microsoftonline.com) and Gemini API — those will continue to work since they go out the non-VPC path. This is fine functionally but worth documenting explicitly so future maintainers don't wonder why the VPC connector is attached but internet calls still work.

  3. --vpc-egress=all-traffic on the frontend (line 122). This is correct — the frontend needs to route traffic to $BACKEND_URL (an internal .run.app URL) through the VPC connector. All egress goes via VPC connector here, which is right.

  4. No .terraform.lock.hcl committed. Minor, but provider lock files should be in version control to get reproducible terraform init runs.

terraform/iam.tf

Issue — over-broad secretmanager.secretAccessor at project level (lines 10–14 and 32–36):

resource "google_project_iam_member" "cloudrun_secret_accessor" {
  role    = "roles/secretmanager.secretAccessor"
  member  = "serviceAccount:${google_service_account.cloudrun_sa.email}"
}

This grants the Cloud Run SA access to all secrets in the project. For least privilege, prefer per-secret bindings using google_secret_manager_secret_iam_member. For a small single-app project this is a reasonable trade-off, but worth knowing.
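To illustrate the narrower scope, here is the gcloud equivalent of a per-secret binding. This is only a sketch of what "per-secret" means: in this PR IAM is managed by Terraform, so the durable fix would be the Terraform resource named above, and bindings created by hand would be unknown to Terraform state. `grant_secret_access` and the service account email are hypothetical.

```shell
# Grants roles/secretmanager.secretAccessor on each listed secret only,
# instead of on the whole project. First argument: the Cloud Run SA email.
grant_secret_access() {
  sa_email="$1"; shift
  for secret_id in "$@"; do
    gcloud secrets add-iam-policy-binding "$secret_id" \
      --member="serviceAccount:${sa_email}" \
      --role="roles/secretmanager.secretAccessor" || return 1
  done
}

# Example usage (the SA email is a placeholder):
#   grant_secret_access "cloudrun-sa@PROJECT_ID.iam.gserviceaccount.com" \
#     querypal-gemini-api-key querypal-db-user querypal-db-pass
```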

Issue: the github_actions_secret_accessor binding is unnecessary (lines 32–36). The CI workflow never reads secret values; it only references secret names in the secrets: block, and Cloud Run fetches them at runtime using the Cloud Run SA. The GitHub Actions SA doesn't need secretmanager.secretAccessor.

terraform/database.tf

Good: deletion_protection = true and ssl_mode = "ENCRYPTED_ONLY" are both correct defensive settings.

Potential import mismatch risk: The database.tf hardcodes tier = "db-f1-micro", POSTGRES_15, specific backup settings, and log_connections = on. If the real instance differs on any of these, terraform plan after import will show a diff. The migration guide correctly warns about this — the -/+ replace risk is real.

terraform/network.tf

Cost note: min_instances = 2 means two e2-micro VMs are always running (~$10–15/month). This is the minimum allowed for VPC connectors. Acceptable, but worth being aware of.

frontend/nginx.conf

Good: The envsubst '$PORT $BACKEND_URL' scope in docker-entrypoint.sh is correctly limited to avoid clobbering nginx's own $uri, $proxy_host, etc. variables. The trailing slash on proxy_pass $BACKEND_URL/ correctly strips the /api prefix. The 300s timeouts are appropriate for long-running AI queries.

Minor: proxy_set_header X-Forwarded-Proto $http_x_forwarded_proto; passes through the incoming value rather than hardcoding https. Since Cloud Run's load balancer sets this header, it will be https in production. Correct behavior.

scripts/migrate_db.sh

Good: Clean argument parsing, proper trap cleanup EXIT, and temp file handling.

Minor — sleep 3 after starting proxies (line 85) is brittle. If the proxies are slow to start (cold network, IAM propagation) the dump will fail with a connection error. A retry loop with pg_isready would be more robust, but for an infrequently-run migration script this is acceptable.
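A retry loop along those lines might look like the sketch below. It is an illustration, not the script's actual code: `wait_for_postgres` is a hypothetical helper, and the host, port, and attempt count are assumptions that would need to match how migrate_db.sh starts its proxies.

```shell
# Polls pg_isready until the server accepts connections or attempts run out.
# Arguments: host, port, max attempts (default 30, one second apart).
wait_for_postgres() {
  host="$1"; port="$2"; attempts="${3:-30}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if pg_isready -h "$host" -p "$port" -q; then
      return 0
    fi
    sleep 1
    i=$(( i + 1 ))
  done
  echo "postgres on ${host}:${port} not ready after ${attempts} attempts" >&2
  return 1
}

# Example usage in place of a fixed sleep (port is a placeholder):
#   wait_for_postgres 127.0.0.1 5432 30 || exit 1
```

Compared with a fixed sleep, this returns as soon as the proxy is ready and fails with a clear message when it never becomes ready.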


Migration Guide Validation

Bottom line: the steps are legitimate and safe. The sequence (verify → import → plan carefully → populate secrets → deploy → clean up old secrets) is the correct order. Here are specific things to note:

✅ What's correct

  • Step 1 (verify database.tf before import) — essential step, correctly explained. The -/+ replacement risk is real.
  • Step 2 (import.sh) — the script itself is correct. The || echo "Already imported" guards make it re-runnable safely.
  • Step 4 (populate secrets before deploying) — the warning is accurate. Cloud Run refuses to start a revision if --set-secrets references a secret with no versions. The old revision stays live (GFE holds traffic until health check passes), so the app won't go down — but the deploy will sit in "deploying" state until it times out.
  • Step 5 (push to production to trigger CI) — matches the workflow trigger on line 12 of the workflow file.
  • Step 6 (delete old GitHub Secrets) — correct cleanup.
  • All risk warnings (Cloud SQL replacement, CIDR conflict, IAM propagation, local state) are accurate and actionable.

⚠️ Minor inconsistencies to be aware of

  1. Redundant terraform init in Step 2. The guide says to run terraform init manually, then ./import.sh — but import.sh already runs terraform init internally (line 17 of import.sh). Running it twice is harmless (it's idempotent), just slightly redundant.

  2. import.sh saves a plan file (tfplan), but Step 3 runs plain terraform apply. import.sh ends with terraform plan -out=tfplan and suggests terraform apply tfplan. The guide's Step 3 says just terraform apply (without using the saved plan file). If time passes between the import and the apply, the saved plan could be stale. Using terraform apply tfplan would be slightly safer since it guarantees applying exactly what was reviewed — or just re-run terraform plan before applying.

  3. terraform.tfvars is optional. All variables in variables.tf already have defaults that match the example file. The cp terraform.tfvars.example terraform.tfvars step is good practice for documentation purposes, but it won't change any behavior since the defaults are identical.

🔴 One gap not covered in the guide

The GitHub Actions SA needs iam.serviceAccountUser on the Cloud Run SA to deploy with --service-account. This is covered by google_service_account_iam_member.github_actions_act_as_cloudrun_sa in iam.tf — but this binding is only created after terraform apply. The guide doesn't explicitly warn that the first post-Terraform CI deploy might fail with an IAM error if the binding hasn't propagated (60s window mentioned in the guide). The mitigation is already there (wait a minute and retry), but it's worth calling out explicitly in the guide.


Summary

The PR is well-architected and production-ready from a security standpoint. The migration guide steps are legitimate — follow them in order. The two most important actions before merging to production:

  1. Run gcloud sql instances describe querypal-db --format=json and reconcile any differences with terraform/database.tf before running ./import.sh.
  2. Populate all 6 Secret Manager secrets before triggering the first CI deployment.

…README

- Add production architecture Mermaid diagram showing Cloud Run services,
  VPC connector, Secret Manager, Cloud SQL, and external dependencies
- Add network security model table (frontend public / backend internal)
- Add secret management table listing all Secret Manager secrets
- Add IaC ownership table (Terraform vs CI pipeline)
- Add CI/CD pipeline Mermaid flowchart showing Workload Identity auth,
  image build/push, and deploy steps with secret injection
- Add Terraform to technology stack table

Also remove overly broad github_actions_secret_accessor IAM binding from
terraform/iam.tf — the GitHub Actions SA never reads secret values directly;
Cloud Run reads them at startup using the Cloud Run SA identity.

https://claude.ai/code/session_01SRRzCWrpwgMpdYFurMVn7m
@ChingEnLin ChingEnLin merged commit 245c14c into dev May 17, 2026
3 checks passed
@ChingEnLin ChingEnLin deleted the feat/infra-security-hardening branch May 17, 2026 09:32
@ChingEnLin ChingEnLin mentioned this pull request May 17, 2026
8 tasks