docs: refresh infra docs for post-Hetzner architecture by abtreece · Pull Request #57 · fullstaq-ruby/infra

abtreece · 2026-04-29T02:08:58Z

Closes #55.

Summary

Refreshes docs/ to describe the current infrastructure (single Hetzner VM running Caddy + Sinatra/Puma API server + Prometheus, provisioned by Ansible) instead of the pre-July-2024 architecture (GKE Autopilot + Nginx Ingress + Cloud Run apiserver). The infrastructure overview diagram is also refreshed: the stale infrastructure-overview.drawio.svg is replaced by a Mermaid block embedded in infrastructure-overview.md so future diagram changes are reviewable as text diffs.

Files changed

- docs/infrastructure-overview.md: rewritten section by section against current IaC.
  - The two pre-existing GCP-service-account sections are folded into a single CI/CD authentication section, split per-caller:
    - fullstaq-ruby/server-edition → GCP via Workload Identity Federation (APT/YUM repo buckets + GCS CI artifacts bucket).
    - fullstaq-ruby/server-edition → Azure via Federated Identity Credentials (Azure Blob CI artifacts + CI cache containers + Key Vault GPG key).
    - fullstaq-ruby/infra → API server only, via a GitHub-issued OIDC JWT (audience backend.fullstaqruby.org) sent to POST /admin/upgrade_apiserver. The infra workflow does not authenticate to GCP or Azure APIs.
  - Caddy section: there is no backend.fullstaqruby.org vhost; both apt. and yum. vhosts handle /admin/* via reverse_proxy to the apiserver Unix socket (per ansible/files/Caddyfile). CI calls /admin/* via https://apt.fullstaqruby.org.
  - Google Cloud project section: corrected to a single project (fsruby-server-edition2, display name "Fullstaq Ruby Server Edition"), provisioned by terraform-hisec/gcloud_project.tf and populated by terraform/. The hisec/non-hisec boundary lives at the Terraform-state and access-group layer, not at a GCP project boundary.
  - API server section: Sinatra/Puma on a Unix socket under systemd; sibling apiserver-deployer.service performs self-update from a tarball attached to a GitHub Release.
  - VM (Hetzner) section: Terraform-managed forward DNS (backend.fullstaqruby.org, apt.fullstaqruby.org, yum.fullstaqruby.org) is distinguished from the manually-set Hetzner PTR record.
  - CI artifacts / cache sections: artifacts are dual-cloud (public GCS + private Azure container); cache is Azure-only.
  - Container registry section: dropped (no registry resources are managed in this repo).
  - GPG private key section: Key Vault name uses the templated form ${var.key_vault_prefix}infraowners (currently fsruby2infraowners).
- docs/infrastructure-overview.drawio.svg: deleted. Replaced by the Mermaid block in infrastructure-overview.md.
- docs/editing-diagrams.md: deleted. Mermaid is edited inline; no diagrams.net round-trip is needed.
- docs/deploy.md: replaces the gcloud container clusters get-credentials + kubectl apply -k ../kubernetes steps with a single ansible-playbook step matching Step 11 of the bootstrapping guide. Adds a callout that apiserver code changes deploy via the GitHub Actions workflow.
- docs/infrastructure-as-code.md: drops Kustomize and the kubernetes/ directory bullet; adds Ansible to the tools list and an ansible/ directory bullet.
- docs/infrastructure-bootstrapping.md: intro updated to mention Terraform + Ansible (not Kubernetes/Kustomize). The rest of the file already reflected the post-migration setup.
- docs/pull_request_template.md: diagram-update checkbox now points to the Mermaid block.
- README.md: drops the link to the deleted editing-diagrams.md.
- .editorconfig: removes the duplicate [config.ru] block (the tab/4 one); only the correct space/2 rule remains.

Verification

eclint check $(git ls-files) passes.
grep -rin 'kubernetes\|kustomize\|kubectl\|gke\|nginx\|cloud run' docs/ returns only intentional historical mentions (e.g. "the previous GKE Autopilot setup was replaced by this VM in the July 2024 rearchitecture").
Each rewritten claim is traceable to current IaC: terraform/{dns,gcloud_auth,backend,repo_buckets,ci_storage}.tf, terraform-hisec/{gcloud_project,key_vault,backend}.tf, ansible/main.yml, ansible/files/{Caddyfile,apiserver.service}, .github/workflows/apiserver.yml.

Note: PAT-based CI bot

The "Github CI bot account" section describes a PAT-based bot. Retiring/converting that account is already tracked in #18 ("Change fullstaq-ruby-ci-bot account into a Github app") and is therefore intentionally not in scope here. The text remains as-is so the doc reflects the current state until #18 lands.


FooBarWidget · 2026-05-15T15:20:40Z

I'll have a good look. So far my first impression is that the new diagram lacks a lot of detail that was in the older diagram. I'm also not sure whether a detailed but automatically rendered diagram is still readable compared to a manually drawn one.

FooBarWidget · 2026-05-20T13:51:23Z

 # Infrastructure bootstrapping

-We try to codify infrastructure as much as possible using Terraform and Kubernetes YAML. However:
+We try to codify infrastructure as much as possible using Terraform and Ansible. However:


There should be an instruction step in this document for deploying the API server.

FooBarWidget · 2026-05-20T13:53:58Z

+ * `ansible/` — Configuration of the backend VM (Caddy, the API server, Prometheus, and OS hardening). Administered by [Infra Maintainers](roles.md) and applied manually; see [Deployment guide](deploy.md).

- * `.github/workflows/apiserver.yml` — Deploys the API server.
+ * `.github/workflows/apiserver.yml` — Builds and deploys the API server.


Nowadays it's .github/workflows/ (multiple workflows that together do the build and deployment).

FooBarWidget · 2026-05-20T13:54:45Z

-    ~~~bash
-    kubectl apply --context=gke_fullstaq-ruby_us-east4_fullstaq-ruby-autopilot -k ../kubernetes
-    ~~~
+> The API server itself is not deployed by this playbook. Code changes under `apiserver/` are released by the `.github/workflows/apiserver.yml` workflow, which packages a tarball, attaches it to a GitHub Release, and triggers `POST /admin/upgrade_apiserver` on the live host.


Nowadays it's the entire .github/workflows/ folder (multiple workflows that together do the build and deployment)

FooBarWidget · 2026-05-20T14:03:07Z

+
+All Google Cloud resources live in a single project, `fsruby-server-edition2` (display name "Fullstaq Ruby Server Edition"). The `google_project` resource itself is provisioned in `terraform-hisec/gcloud_project.tf` so that creating/deleting the project requires Infra Owner access, but resources _inside_ the project (buckets, IAM, Workload Identity Federation) are managed in `terraform/` by Infra Maintainers.
+
+The hisec / non-hisec separation is enforced at the **Terraform state and access-group layer**, not via separate GCP projects. See [Terraform state (normal)](#terraform-state-normal) and [Terraform state (hisec)](#terraform-state-hisec).


The section is good but I don't really understand this latter sentence. Maybe it makes sense for a reader who was familiar with the previous situation (two GCP projects) but the docs should be optimized for readers who will only be familiar with the current situation going forward, with no regard to past state.

FooBarWidget · 2026-05-20T14:09:31Z

- Can deploy new versions of the API server.
+- **`fullstaq-ruby/server-edition` → Google Cloud** uses [Workload Identity Federation](https://cloud.google.com/iam/docs/workload-identity-federation) (defined in `terraform/gcloud_auth.tf`). Two pools (`github-ci-test`, `github-ci-deploy`) gate access by GitHub repository owner and Actions environment. Through these pools, server-edition CI jobs gain write access to the APT/YUM repo buckets and the GCP CI artifacts bucket — see `terraform/repo_buckets.tf` and `terraform/ci_storage.tf`. The CI cache lives in Azure (see below), not on GCP.
+- **`fullstaq-ruby/server-edition` → Azure** uses [Federated Identity Credentials](https://learn.microsoft.com/en-us/entra/workload-id/workload-identity-federation) on Entra ID applications (defined in `terraform-hisec/`). These authenticate workflows that read or write Azure Blob Storage (the CI artifacts and CI cache containers) and Azure Key Vault (the GPG signing key).
+- **`fullstaq-ruby/infra` → API server** uses a GitHub-issued OIDC JWT (audience `backend.fullstaqruby.org`) sent as a bearer token to `POST /admin/upgrade_apiserver`. The infra repo's `apiserver.yml` workflow does **not** authenticate to GCP or Azure APIs — the rollout mechanism is entirely on the VM (see [API server](#api-server)). The same JWT mechanism is used by `server-edition` to call `/admin/restart_web_server` after a publish.


Accurate section. But note apiserver.yml has changed.
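As a rough illustration of the per-caller split in the third bullet, the sketch below checks the claims of a GitHub OIDC token whose signature is assumed to have been verified already. The audience and the endpoint-to-repository mapping come from the quoted text; the function name, the `ALLOWED_CALLERS` table, and the choice to check only `aud` and `repository` are assumptions for illustration, not the API server's actual code (which is Ruby/Sinatra).

```python
# Hypothetical sketch: authorize an admin endpoint call from decoded OIDC claims.
# Signature verification against GitHub's published keys is assumed to have
# happened before this function is called.

EXPECTED_AUDIENCE = "backend.fullstaqruby.org"

# Endpoint -> repository allowed to call it, as described in the quoted text.
ALLOWED_CALLERS = {
    "/admin/upgrade_apiserver": "fullstaq-ruby/infra",
    "/admin/restart_web_server": "fullstaq-ruby/server-edition",
}


def is_authorized(path: str, claims: dict) -> bool:
    """Return True if the (already verified) claims may call the given path."""
    expected_repo = ALLOWED_CALLERS.get(path)
    if expected_repo is None:
        return False  # unknown admin endpoint
    aud = claims.get("aud")
    # The JWT spec allows "aud" to be a single string or a list of strings.
    audiences = aud if isinstance(aud, list) else [aud]
    if EXPECTED_AUDIENCE not in audiences:
        return False
    return claims.get("repository") == expected_repo


infra_claims = {"aud": "backend.fullstaqruby.org", "repository": "fullstaq-ruby/infra"}
print(is_authorized("/admin/upgrade_apiserver", infra_claims))   # True
print(is_authorized("/admin/restart_web_server", infra_claims))  # False
```

A real check would likely also constrain the Actions environment, which this sketch leaves out because the exact rules are not spelled out in the quoted section.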

FooBarWidget · 2026-05-20T14:18:14Z

 - Administered by role: Infra Maintainers

-The Kubernetes cluster runs our Nginx web server. This cluster is in Autopilot mode.
+A single Ubuntu (≥ 24.04) VPS hosted at Hetzner runs every backend service (Caddy, the API server, the API server deployer, Prometheus + node_exporter, fail2ban, AppArmor, unattended-upgrades, ufw). Its forward DNS records (`backend.fullstaqruby.org`, `apt.fullstaqruby.org`, `yum.fullstaqruby.org`) are managed in `terraform/dns.tf`; its static IPs are referenced from `terraform/variables.tf`. The PTR record (`backend.fullstaqruby.org`) is set manually at the Hetzner provider during bootstrapping (see [bootstrapping](infrastructure-bootstrapping.md) Step 7), not via Terraform.


Calling it "every backend service" while including things like Prometheus, fail2ban, etc. is overstating it.

FooBarWidget · 2026-05-20T14:20:33Z

+A single Ubuntu (≥ 24.04) VPS hosted at Hetzner runs every backend service (Caddy, the API server, the API server deployer, Prometheus + node_exporter, fail2ban, AppArmor, unattended-upgrades, ufw). Its forward DNS records (`backend.fullstaqruby.org`, `apt.fullstaqruby.org`, `yum.fullstaqruby.org`) are managed in `terraform/dns.tf`; its static IPs are referenced from `terraform/variables.tf`. The PTR record (`backend.fullstaqruby.org`) is set manually at the Hetzner provider during bootstrapping (see [bootstrapping](infrastructure-bootstrapping.md) Step 7), not via Terraform.

-## DNS, static IPs, Ingresses
+The VM is configured entirely by Ansible (`ansible/main.yml`). The playbook covers OS hardening (SSH, fail2ban, AppArmor, ufw, autoreboot, unattended-upgrades) and the service stack (Prometheus, Caddy, apiserver-deployer, apiserver). There is no Kubernetes — the previous GKE Autopilot setup was replaced by this VM in the July 2024 rearchitecture.


We should exclude the specifics of OS hardening (SSH, fail2ban, etc.) because they make the text too easy to drift from the playbook. The specifics are also not that relevant in this document. As for the "service stack", just mentioning Caddy and the API server is enough. Splitting the API server into 'apiserver' vs 'apiserver-deployer' is too fine-grained for this document. We should also not mention Kubernetes.

FooBarWidget · 2026-05-20T14:28:49Z

-The Server Edition's CI/CD system stores artifacts in this bucket, for the purpose of implementing [resumption](https://github.com/fullstaq-ruby/server-edition/blob/main/dev-handbook/ci-cd-resumption.md). Objects in this bucket only live for 30 days.
+The Server Edition's CI/CD system stores artifacts for [CI/CD resumption](https://github.com/fullstaq-ruby/server-edition/blob/main/dev-handbook/ci-cd-resumption.md) in two buckets (see `terraform/ci_storage.tf`):
+
+- A **GCS bucket** (`${var.gcloud_bucket_prefix}-server-edition-ci-artifacts`) — publicly readable; the `test` environment writes via WIF, the `deploy` environment reads. Objects expire after 30 days.


1. We should just mention the interpolated name directly rather than putting the interpolation variable in this document.

2. WIF is not a commonly understood abbreviation, so it should be spelled out.

3. The expiration policy should not be mentioned in detail, to avoid drift from Terraform. Just saying that objects do expire is enough. Important: on the Azure Blob container we expire based on access time, not modification time.

4. If I recall correctly, the Azure Blob container here is not used... yet. We're still writing artifacts to the GCS bucket. The idea was to one day migrate that away to Azure, but no work has been done on that front so far. The network bandwidth and latency from GitHub-hosted runners to Azure are expected to be better than to GCS, but it is unclear whether the tooling is fast enough. The Azure CLI's startup time is quite long.

Addressed 1, 2, and 4 in 2b7f036.

Interpolated names: now fsruby-server-edition-ci-artifacts and fsruby2seredci1.

WIF: spelled out as "Workload Identity Federation (WIF)" in this section.

Azure container not in use: verified against server-edition — upload-artifact.sh/download-artifact.sh use gsutil cp gs://$CI_ARTIFACTS_BUCKET/... (GCS only); grepping all 8 workflows finds zero azcopy/az storage blob/server-edition-ci-artifacts references. Bullet now states the container is provisioned for a future migration but unused. Added a note that the cache container in the same storage account is in active use (sccache, via CACHE_CONTAINER consumed by internal-scripts/ci-cd/build-jemalloc-binaries/build.sh).

On expiration policy — want to confirm before changing. You wrote "we expire based on access time, not modification time", but in terraform/ci_storage.tf the artifacts container uses delete_after_days_since_creation_greater_than = "30" (creation time, line 46), while delete_after_days_since_last_access_time_greater_than = "90" is only applied to the cache container (line 60). So creation-vs-access differs per container today. Did you mean:

(a) Just remove the "30 days" detail from the doc (drift avoidance), keeping Terraform as-is?

(b) Also switch the artifacts container to access-time in Terraform?

Just remove the "30 days" detail.

And what I meant is: the cache container expires based on access time. The artifacts container doesn't.
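To make the distinction concrete, here is a small sketch of the two expiry rules as clarified above: the artifacts container expires objects by age since creation, the cache container by time since last access. The 30 and 90 day thresholds are the values quoted from terraform/ci_storage.tf earlier in this thread; the function names and the date-based model are illustrative assumptions, not a description of how Azure evaluates lifecycle policies internally.

```python
from datetime import date, timedelta


def artifact_expired(created: date, today: date, max_age_days: int = 30) -> bool:
    """Creation-based rule: reads never extend the object's lifetime."""
    return today - created > timedelta(days=max_age_days)


def cache_expired(last_access: date, today: date, max_idle_days: int = 90) -> bool:
    """Access-based rule: every read resets the clock."""
    return today - last_access > timedelta(days=max_idle_days)


today = date(2026, 5, 20)
created = date(2026, 1, 1)      # object is 139 days old
last_access = date(2026, 5, 1)  # object was last read 19 days ago

print(artifact_expired(created, today))   # True: older than 30 days
print(cache_expired(last_access, today))  # False: accessed within 90 days
```

The same object can therefore be expired under one rule and live under the other, which is why the doc should not describe both containers with a single retention statement.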

FooBarWidget · 2026-05-20T14:30:00Z

 - Administered by role: Infra Owners, Infra Maintainers

-The GPG private key is used to sign APT and YUM repositories. We store the canonical copy in Secrets Manager in the `fullstaq-ruby-hisec` Google Cloud project. We store a secondary copy in the Secret Manager in the `fullstaq-ruby` Google Cloud project.
+The GPG private key is used to sign APT and YUM repositories. It is stored in the Azure Key Vault for Infra Owners — `${var.key_vault_prefix}infraowners`, currently `fsruby2infraowners` (see `terraform-hisec/key_vault.tf`). GitHub Actions in the `test` and `deploy` environments are granted read access via Entra ID Federated Identity Credentials.


For consistency with earlier text we should call this Github OIDC.

FooBarWidget · 2026-05-20T14:34:48Z

-## Google Cloud service account for Infrastructure CI/CD
+The following diagram shows the major infrastructure components and how they relate to each other. The role that administers each component is given in the section heading below.
+
+```mermaid


I rendered the Mermaid diagram but I don't think it's easier to read (nor better-looking) than a hand-drawn one, so I prefer an update based on the hand-drawn diagram.

FooBarWidget · 2026-05-20T14:55:02Z

+- **`fullstaq-ruby/server-edition` → Azure** uses [Federated Identity Credentials](https://learn.microsoft.com/en-us/entra/workload-id/workload-identity-federation) on Entra ID applications (defined in `terraform-hisec/`). These authenticate workflows that read or write Azure Blob Storage (the CI artifacts and CI cache containers) and Azure Key Vault (the GPG signing key).
+- **`fullstaq-ruby/infra` → API server** uses a GitHub-issued OIDC JWT (audience `backend.fullstaqruby.org`) sent as a bearer token to `POST /admin/upgrade_apiserver`. The infra repo's `apiserver.yml` workflow does **not** authenticate to GCP or Azure APIs — the rollout mechanism is entirely on the VM (see [API server](#api-server)). The same JWT mechanism is used by `server-edition` to call `/admin/restart_web_server` after a publish.

 ## API server


apiserver/README.md also needs to be updated. It still mentions Google Cloud but it's no longer hosted there.

Addressed in 109b1b4. Rewrote apiserver/README.md end-to-end since most of it was stale, not just the GCP reference:

Hosting: now states the service is a systemd-managed Puma process on the Hetzner backend host, on /run/apiserver/server.sock, reverse-proxied by Caddy under the /admin/* paths of apt.fullstaqruby.org and yum.fullstaqruby.org — no dedicated apiserver.* hostname.

Endpoints: corrected the count (3, not 1) and documented GET /, POST /admin/upgrade_apiserver, and POST /admin/restart_web_server with their respective GitHub repo + environment scoping.

Auth: replaced the GCP identity-token section with the GitHub Actions OIDC mechanism (audience backend.fullstaqruby.org, with repository/sub/runner_environment/environment claim checks). Also noted that the endpoints are no longer callable from a human/local machine.

CD: replaced "deployed by the Infrastructure project's CI" with the actual flow — .github/workflows/apiserver.yml builds + releases, then calls /admin/upgrade_apiserver with an OIDC token; the on-host apiserver-deployer oneshot unit pulls the GH release and swaps the /opt/apiserver/versions/latest symlink.
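The symlink swap in the last step can be sketched as follows. This is the generic POSIX pattern (create a temporary symlink, then rename it over the old one so readers never observe a missing link), not the actual apiserver-deployer implementation; the function name, the `.tmp` suffix, and the throwaway directory standing in for /opt/apiserver/versions are assumptions for illustration.

```python
import os
import tempfile


def switch_latest(versions_dir: str, new_version: str) -> None:
    """Atomically repoint versions_dir/latest at new_version (POSIX only)."""
    latest = os.path.join(versions_dir, "latest")
    tmp_link = latest + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clean up a leftover from an interrupted run
    os.symlink(new_version, tmp_link)  # relative target inside versions_dir
    os.replace(tmp_link, latest)       # rename: replaces the old link in one step


# Demo in a temporary directory standing in for the real versions directory.
with tempfile.TemporaryDirectory() as versions:
    for name in ("v1", "v2"):
        os.mkdir(os.path.join(versions, name))
    switch_latest(versions, "v1")
    switch_latest(versions, "v2")
    print(os.readlink(os.path.join(versions, "latest")))  # v2
```

Removing the old link and then creating the new one would leave a short window with no `latest` at all; renaming over it avoids that window.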

There's now a merge conflict on apiserver/README.md.

FooBarWidget · 2026-05-21T08:59:42Z

As discussed in #47 (review), this document should be updated to cover the changes made in #47.

Going to handle this differently — rather than rolling #47's docs into #57, I'll add the docs to #47 itself before it merges. Rationale: each PR should ship with the docs that describe its own post-merge state, so a reader of main is never looking at docs that describe infra not yet on main, and merge ordering between feature PRs and doc PRs stays independent. #57 will stay scoped to true-ups for things that are already on main (post-Hetzner refresh).

Will close the loop here once #47's doc commit lands.

- Use interpolated bucket/storage account names instead of Terraform vars
- Spell out Workload Identity Federation on first use in the section
- Note that the Azure Blob artifacts container is provisioned but not yet used by CI; artifacts still go to the GCS bucket. The cache container in the same storage account is actively used (sccache).

Addresses FooBarWidget review on PR fullstaq-ruby#57.

The previous README described a Google Cloud Run-hosted API server authenticated with Google Cloud identity tokens. Post-Hetzner, the service runs as a systemd-managed Puma process on the backend host behind Caddy on a Unix socket, authenticates via GitHub Actions OIDC, and is deployed via the .github/workflows/apiserver.yml workflow plus the apiserver-deployer oneshot unit. The README now reflects all four. Addresses FooBarWidget review on PR fullstaq-ruby#57.

Documents the new APT/YUM archive infrastructure introduced in this PR: the two new public-read GCS buckets, the deliberate absence of CI write access (frozen-mirror invariant enforced in IAM), the Azure DNS zones and apex NS delegation for the archive subdomains, and the 404 fallback behavior in query-latest-repo-versions.rb that lets the web server start cleanly before the first migration runs. Addresses FooBarWidget's note on PR fullstaq-ruby#57 that fullstaq-ruby#47's changes should ship with their own documentation.

abtreece force-pushed the docs/post-hetzner-refresh branch from 66e915d to cfbe8a1 Compare April 29, 2026 02:56

abtreece requested a review from FooBarWidget May 15, 2026 03:30

abtreece force-pushed the docs/post-hetzner-refresh branch from cfbe8a1 to 85b83bf Compare May 15, 2026 03:48

FooBarWidget requested changes May 20, 2026

FooBarWidget reviewed May 20, 2026

FooBarWidget mentioned this pull request May 21, 2026

feat: add archive infrastructure for EOL distribution packages #47

FooBarWidget reviewed May 21, 2026

abtreece added 2 commits May 22, 2026 09:20


