
docs: refresh infra docs for post-Hetzner architecture #57

Open
abtreece wants to merge 3 commits into fullstaq-ruby:main from abtreece:docs/post-hetzner-refresh


abtreece commented Apr 29, 2026

Collaborator

Closes #55.

Summary

Refreshes docs/ to describe the current infrastructure (single Hetzner VM running Caddy + Sinatra/Puma API server + Prometheus, provisioned by Ansible) instead of the pre-July-2024 architecture (GKE Autopilot + Nginx Ingress + Cloud Run apiserver). The infrastructure overview diagram is also refreshed: the stale infrastructure-overview.drawio.svg is replaced by a Mermaid block embedded in infrastructure-overview.md so future diagram changes are reviewable as text diffs.

Files changed

  • docs/infrastructure-overview.md — rewritten section by section against current IaC.

    • The two pre-existing GCP-service-account sections are folded into a single CI/CD authentication section, split per-caller:
      • fullstaq-ruby/server-edition → GCP via Workload Identity Federation (APT/YUM repo buckets + GCS CI artifacts bucket).
      • fullstaq-ruby/server-edition → Azure via Federated Identity Credentials (Azure Blob CI artifacts + CI cache containers + Key Vault GPG key).
      • fullstaq-ruby/infra → API server only, via a GitHub-issued OIDC JWT (audience backend.fullstaqruby.org) sent to POST /admin/upgrade_apiserver. The infra workflow does not authenticate to GCP or Azure APIs.
    • Caddy section: there is no backend.fullstaqruby.org vhost; both apt. and yum. vhosts handle /admin/* via reverse_proxy to the apiserver Unix socket (per ansible/files/Caddyfile). CI calls /admin/* via https://apt.fullstaqruby.org.
    • Google Cloud project section: corrected to a single project (fsruby-server-edition2, display name "Fullstaq Ruby Server Edition"), provisioned by terraform-hisec/gcloud_project.tf and populated by terraform/. The hisec/non-hisec boundary lives at the Terraform-state and access-group layer, not at a GCP project boundary.
    • API server section: Sinatra/Puma on a Unix socket under systemd; sibling apiserver-deployer.service performs self-update from a tarball attached to a GitHub Release.
    • VM (Hetzner) section: Terraform-managed forward DNS (backend.fullstaqruby.org, apt.fullstaqruby.org, yum.fullstaqruby.org) is distinguished from the manually-set Hetzner PTR record.
    • CI artifacts / cache sections: artifacts are dual-cloud (public GCS + private Azure container); cache is Azure-only.
    • Container registry section: dropped (no registry resources are managed in this repo).
    • GPG private key section: Key Vault name uses the templated form ${var.key_vault_prefix}infraowners (currently fsruby2infraowners).
  • docs/infrastructure-overview.drawio.svg (deleted). Replaced by the Mermaid block in infrastructure-overview.md.

  • docs/editing-diagrams.md (deleted). Mermaid is edited inline; no diagrams.net round-trip is needed.

  • docs/deploy.md — replaces the gcloud container clusters get-credentials + kubectl apply -k ../kubernetes steps with a single ansible-playbook step matching Step 11 of the bootstrapping guide. Adds a callout that apiserver code changes deploy via the GitHub Actions workflow.

  • docs/infrastructure-as-code.md — drops Kustomize and the kubernetes/ directory bullet; adds Ansible to the tools list and an ansible/ directory bullet.

  • docs/infrastructure-bootstrapping.md — intro updated to mention Terraform + Ansible (not Kubernetes/Kustomize). The rest of the file already reflected the post-migration setup.

  • docs/pull_request_template.md — diagram-update checkbox now points to the Mermaid block.

  • README.md — drops the link to the deleted editing-diagrams.md.

  • .editorconfig — removes the duplicate [config.ru] block (the tab/4 one); only the correct space/2 rule remains.

Verification

  • eclint check $(git ls-files) passes.
  • grep -rin 'kubernetes\|kustomize\|kubectl\|gke\|nginx\|cloud run' docs/ returns only intentional historical mentions (e.g. "the previous GKE Autopilot setup was replaced by this VM in the July 2024 rearchitecture").
  • Each rewritten claim is traceable to current IaC: terraform/{dns,gcloud_auth,backend,repo_buckets,ci_storage}.tf, terraform-hisec/{gcloud_project,key_vault,backend}.tf, ansible/main.yml, ansible/files/{Caddyfile,apiserver.service}, .github/workflows/apiserver.yml.
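
The grep check in this list leaves the "intentional historical mention" judgment to the reader. A minimal sketch of how that judgment could be made explicit, assuming a small set of marker phrases identifies a historical mention. `find_stale_mentions` and `HISTORICAL_MARKERS` are hypothetical names, not something in this repo, and unlike `grep -r` the sketch only scans `*.md` files:

```python
import re
from pathlib import Path

# Same terms as the grep pattern in the verification step (case-insensitive).
STALE_TERMS = re.compile(r"kubernetes|kustomize|kubectl|gke|nginx|cloud run", re.IGNORECASE)

# Assumption: phrases that mark a line as an intentional historical mention.
HISTORICAL_MARKERS = ("previous", "replaced by", "rearchitecture")


def find_stale_mentions(docs_dir):
    """Return (path, line_number, line) for stale terms outside historical context."""
    hits = []
    for path in sorted(Path(docs_dir).rglob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if not STALE_TERMS.search(line):
                continue
            if any(marker in line.lower() for marker in HISTORICAL_MARKERS):
                continue  # treated as an intentional historical mention
            hits.append((str(path), number, line.strip()))
    return hits
```

An empty result from `find_stale_mentions("docs")` would correspond to the grep returning only historical mentions. The marker list is a heuristic and would still need a human check.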

Note: PAT-based CI bot

The "Github CI bot account" section describes a PAT-based bot. Retiring/converting that account is already tracked in #18 ("Change fullstaq-ruby-ci-bot account into a Github app") and is therefore intentionally not in scope here. The text remains as-is so the doc reflects the current state until #18 lands.

abtreece force-pushed the docs/post-hetzner-refresh branch from 66e915d to cfbe8a1 on April 29, 2026 02:56
abtreece requested a review from FooBarWidget on May 15, 2026 03:30
Closes fullstaq-ruby#55.

Brings docs/ in line with the post-July-2024 architecture (single
Hetzner VM running Caddy + Sinatra/Puma API server + Prometheus,
provisioned by Ansible), replacing references to the previous
GKE Autopilot + Nginx Ingress + Cloud Run apiserver setup.

Files changed:

- docs/infrastructure-overview.md — rewritten section by section.
  Every claim is grounded in current IaC. The two GCP-service-account
  sections are folded into a single "CI/CD authentication" section
  that splits per-caller: server-edition uses GCP WIF (APT/YUM repo
  buckets + GCS CI artifacts bucket) and Azure Federated Identity
  Credentials (Azure Blob CI artifacts + CI cache + Key Vault GPG
  key); infra repo's apiserver workflow only mints a GitHub OIDC JWT
  (audience backend.fullstaqruby.org) and POSTs to
  /admin/upgrade_apiserver — it does not authenticate to GCP or
  Azure APIs. The Caddy section is corrected: there is no
  backend.fullstaqruby.org vhost; both apt. and yum. vhosts handle
  /admin/* via reverse_proxy to the apiserver Unix socket. The
  "Google Cloud projects" claim of two projects is corrected — there
  is one project, fsruby-server-edition2, provisioned by
  terraform-hisec/gcloud_project.tf and populated by terraform/; the
  hisec/non-hisec separation lives at the Terraform-state and
  access-group layer. Container registry section dropped (no
  registry resources are managed in this repo). Key Vault name
  uses the templated form ${var.key_vault_prefix}infraowners
  (currently fsruby2infraowners). CI artifacts/cache split is now
  explicit (artifacts dual-cloud, cache Azure-only). VM section
  distinguishes Terraform-managed forward DNS from the
  manually-set Hetzner PTR record.

- docs/infrastructure-overview.drawio.svg — deleted. Replaced by an
  inline Mermaid diagram in infrastructure-overview.md so future
  diagram changes are reviewable as text diffs.

- docs/editing-diagrams.md — deleted (no longer needed without the
  drawio round-trip).

- docs/deploy.md — replaces the gcloud-clusters/kubectl steps with
  a single ansible-playbook step matching bootstrapping Step 11.
  Adds a callout that apiserver code changes deploy via the
  GitHub Actions workflow.

- docs/infrastructure-as-code.md — drops Kustomize and the
  kubernetes/ directory bullet; adds Ansible to the tool list and
  an ansible/ directory bullet.

- docs/infrastructure-bootstrapping.md — intro updated to mention
  Terraform + Ansible (not Kubernetes/Kustomize); the rest of the
  file already reflected the post-migration setup.

- docs/pull_request_template.md — diagram-update checkbox now
  points to the Mermaid block instead of the deleted drawio file.

- README.md — drops the link to the deleted editing-diagrams.md.

- .editorconfig — removes the duplicate [config.ru] block (the
  tab/4 one); only the correct space/2 rule remains.

Note: the "Github CI bot account" section is kept as-is. Retiring
that PAT-based bot is already tracked in fullstaq-ruby#18 and is therefore out
of scope here.
abtreece force-pushed the docs/post-hetzner-refresh branch from cfbe8a1 to 85b83bf on May 15, 2026 03:48
@FooBarWidget

Member

I'll have a good look. So far my first impression is that the new diagram lacks a lot of detail that was in the older diagram. I'm also not sure whether a detailed but automatically rendered diagram is still readable compared to a manually drawn one.

# Infrastructure bootstrapping

We try to codify infrastructure as much as possible using Terraform and Kubernetes YAML. However:
We try to codify infrastructure as much as possible using Terraform and Ansible. However:

Member

There should be an instruction step in this document for deploying the API server.

* `ansible/` — Configuration of the backend VM (Caddy, the API server, Prometheus, and OS hardening). Administered by [Infra Maintainers](roles.md) and applied manually; see [Deployment guide](deploy.md).

* `.github/workflows/apiserver.yml` — Deploys the API server.
* `.github/workflows/apiserver.yml` — Builds and deploys the API server.

Member

Nowadays it's .github/workflows/ (multiple workflows that together do the build and deployment).

Comment thread docs/deploy.md
~~~bash
kubectl apply --context=gke_fullstaq-ruby_us-east4_fullstaq-ruby-autopilot -k ../kubernetes
~~~
> The API server itself is not deployed by this playbook. Code changes under `apiserver/` are released by the `.github/workflows/apiserver.yml` workflow, which packages a tarball, attaches it to a GitHub Release, and triggers `POST /admin/upgrade_apiserver` on the live host.

Member

Nowadays it's the entire .github/workflows/ folder (multiple workflows that together do the build and deployment)


All Google Cloud resources live in a single project, `fsruby-server-edition2` (display name "Fullstaq Ruby Server Edition"). The `google_project` resource itself is provisioned in `terraform-hisec/gcloud_project.tf` so that creating/deleting the project requires Infra Owner access, but resources _inside_ the project (buckets, IAM, Workload Identity Federation) are managed in `terraform/` by Infra Maintainers.

The hisec / non-hisec separation is enforced at the **Terraform state and access-group layer**, not via separate GCP projects. See [Terraform state (normal)](#terraform-state-normal) and [Terraform state (hisec)](#terraform-state-hisec).

Member

The section is good but I don't really understand this latter sentence. Maybe it makes sense for a reader who was familiar with the previous situation (two GCP projects) but the docs should be optimized for readers who will only be familiar with the current situation going forward, with no regard to past state.

- Can deploy new versions of the API server.
- **`fullstaq-ruby/server-edition` → Google Cloud** uses [Workload Identity Federation](https://cloud.google.com/iam/docs/workload-identity-federation) (defined in `terraform/gcloud_auth.tf`). Two pools (`github-ci-test`, `github-ci-deploy`) gate access by GitHub repository owner and Actions environment. Through these pools, server-edition CI jobs gain write access to the APT/YUM repo buckets and the GCP CI artifacts bucket — see `terraform/repo_buckets.tf` and `terraform/ci_storage.tf`. The CI cache lives in Azure (see below), not on GCP.
- **`fullstaq-ruby/server-edition` → Azure** uses [Federated Identity Credentials](https://learn.microsoft.com/en-us/entra/workload-id/workload-identity-federation) on Entra ID applications (defined in `terraform-hisec/`). These authenticate workflows that read or write Azure Blob Storage (the CI artifacts and CI cache containers) and Azure Key Vault (the GPG signing key).
- **`fullstaq-ruby/infra` → API server** uses a GitHub-issued OIDC JWT (audience `backend.fullstaqruby.org`) sent as a bearer token to `POST /admin/upgrade_apiserver`. The infra repo's `apiserver.yml` workflow does **not** authenticate to GCP or Azure APIs — the rollout mechanism is entirely on the VM (see [API server](#api-server)). The same JWT mechanism is used by `server-edition` to call `/admin/restart_web_server` after a publish.
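
The third bullet above describes a bearer-token call. A minimal sketch of what that request could look like, assuming the workflow has already obtained an OIDC JWT with audience `backend.fullstaqruby.org` (how it is obtained is not shown), that the call goes through the `apt.` vhost as stated in the PR description, and that no request body is needed. `build_upgrade_request` is a hypothetical helper; the real workflow may well use a different client such as curl:

```python
import urllib.request

# From the PR description: CI reaches /admin/* through the apt. vhost.
UPGRADE_URL = "https://apt.fullstaqruby.org/admin/upgrade_apiserver"


def build_upgrade_request(oidc_jwt):
    """Build (but do not send) the POST that asks the API server to upgrade itself."""
    return urllib.request.Request(
        UPGRADE_URL,
        data=b"",  # assumption: the endpoint needs no request body
        method="POST",
        headers={"Authorization": f"Bearer {oidc_jwt}"},
    )
```

Sending it would be a matter of passing the result to `urllib.request.urlopen`. The sketch stops short of that so it can be inspected without touching the live host.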

Member

Accurate section. But note apiserver.yml has changed.

- Administered by role: Infra Maintainers

The Kubernetes cluster runs our Nginx web server. This cluster is in Autopilot mode.
A single Ubuntu (≥ 24.04) VPS hosted at Hetzner runs every backend service (Caddy, the API server, the API server deployer, Prometheus + node_exporter, fail2ban, AppArmor, unattended-upgrades, ufw). Its forward DNS records (`backend.fullstaqruby.org`, `apt.fullstaqruby.org`, `yum.fullstaqruby.org`) are managed in `terraform/dns.tf`; its static IPs are referenced from `terraform/variables.tf`. The PTR record (`backend.fullstaqruby.org`) is set manually at the Hetzner provider during bootstrapping (see [bootstrapping](infrastructure-bootstrapping.md) Step 7), not via Terraform.

Member

Calling it "every backend service" while including things like Prometheus, fail2ban, etc. is overstating it.

A single Ubuntu (≥ 24.04) VPS hosted at Hetzner runs every backend service (Caddy, the API server, the API server deployer, Prometheus + node_exporter, fail2ban, AppArmor, unattended-upgrades, ufw). Its forward DNS records (`backend.fullstaqruby.org`, `apt.fullstaqruby.org`, `yum.fullstaqruby.org`) are managed in `terraform/dns.tf`; its static IPs are referenced from `terraform/variables.tf`. The PTR record (`backend.fullstaqruby.org`) is set manually at the Hetzner provider during bootstrapping (see [bootstrapping](infrastructure-bootstrapping.md) Step 7), not via Terraform.

## DNS, static IPs, Ingresses
The VM is configured entirely by Ansible (`ansible/main.yml`). The playbook covers OS hardening (SSH, fail2ban, AppArmor, ufw, autoreboot, unattended-upgrades) and the service stack (Prometheus, Caddy, apiserver-deployer, apiserver). There is no Kubernetes — the previous GKE Autopilot setup was replaced by this VM in the July 2024 rearchitecture.

Member

We should exclude mentioning the specifics of OS hardening (SSH, fail2ban, etc) because that makes the text too easy to drift from the playbook. The specifics are also not that relevant in this document. As for the "service stack", just mentioning Caddy and API server are enough. Splitting the API server into 'apiserver' vs 'apiserver-deployer' is too fine-grained for this document. Should also not mention Kubernetes.

Comment thread docs/infrastructure-overview.md Outdated
The Server Edition's CI/CD system stores artifacts in this bucket, for the purpose of implementing [resumption](https://github.com/fullstaq-ruby/server-edition/blob/main/dev-handbook/ci-cd-resumption.md). Objects in this bucket only live for 30 days.
The Server Edition's CI/CD system stores artifacts for [CI/CD resumption](https://github.com/fullstaq-ruby/server-edition/blob/main/dev-handbook/ci-cd-resumption.md) in two buckets (see `terraform/ci_storage.tf`):

- A **GCS bucket** (`${var.gcloud_bucket_prefix}-server-edition-ci-artifacts`) — publicly readable; the `test` environment writes via WIF, the `deploy` environment reads. Objects expire after 30 days.

Member

We should just mention the interpolated name directly rather than putting the interpolation variable in this document.

WIF is not a commonly understood abbreviation so it should be spelled out.

The expiration policy should not be mentioned in detail to avoid drift from Terraform. Just saying that objects do expire is enough. Important: on the Azure Blob container we expire based on access time, not modification time.

If I recall correctly, the Azure Blob container here is not used... yet. We're still writing artifacts to the GCS bucket. The idea was to one day migrate that away to Azure, but no work has been done on that front so far. The network bandwidth and latency from GitHub-hosted runners to Azure are expected to be better than GCS, but it is unclear whether the tooling is fast enough. The Azure CLI's startup time is quite long.

Collaborator Author

Addressed 1, 2, and 4 in 2b7f036.

  • Interpolated names: now fsruby-server-edition-ci-artifacts and fsruby2seredci1.
  • WIF: spelled out as "Workload Identity Federation (WIF)" in this section.
  • Azure container not in use: verified against server-edition, where upload-artifact.sh and download-artifact.sh use gsutil cp gs://$CI_ARTIFACTS_BUCKET/... (GCS only), and grepping all 8 workflows finds zero azcopy / az storage blob / server-edition-ci-artifacts references. The bullet now states that the container is provisioned for a future migration but unused. Added a note that the cache container in the same storage account is in active use (sccache, via CACHE_CONTAINER consumed by internal-scripts/ci-cd/build-jemalloc-binaries/build.sh).

On expiration policy — want to confirm before changing. You wrote "we expire based on access time, not modification time", but in terraform/ci_storage.tf the artifacts container uses delete_after_days_since_creation_greater_than = "30" (creation time, line 46), while delete_after_days_since_last_access_time_greater_than = "90" is only applied to the cache container (line 60). So creation-vs-access differs per container today. Did you mean:

  • (a) Just remove the "30 days" detail from the doc (drift avoidance), keeping Terraform as-is?
  • (b) Also switch the artifacts container to access-time in Terraform?

Member

Just remove the "30 days" detail.

And what I meant is: the cache container expires based on access time. The artifacts container doesn't.
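
The distinction settled in this thread (creation time for the artifacts container, last-access time for the cache container) can be sketched as follows. This is an illustration only: the 30 and 90 day values are the ones quoted from `terraform/ci_storage.tf` earlier in the thread and may drift, `is_expired` is a hypothetical helper, and the real evaluation is done by the storage service's lifecycle rules rather than by code like this:

```python
from datetime import datetime, timedelta, timezone

# Values quoted from terraform/ci_storage.tf in this thread; they may drift.
ARTIFACTS_MAX_AGE = timedelta(days=30)  # measured from creation time
CACHE_MAX_IDLE = timedelta(days=90)     # measured from last access time


def is_expired(container, created_at, last_accessed_at, now):
    """Return True if a blob would be eligible for deletion under its container's rule."""
    if container == "artifacts":
        return now - created_at > ARTIFACTS_MAX_AGE
    if container == "cache":
        return now - last_accessed_at > CACHE_MAX_IDLE
    raise ValueError(f"unknown container: {container}")
```

A 45-day-old blob that was read yesterday would be expired in the artifacts container but kept in the cache container, which is why the doc only needs to say that the cache rule is access-based.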

- Administered by role: Infra Owners, Infra Maintainers

The GPG private key is used to sign APT and YUM repositories. We store the canonical copy in Secrets Manager in the `fullstaq-ruby-hisec` Google Cloud project. We store a secondary copy in the Secret Manager in the `fullstaq-ruby` Google Cloud project.
The GPG private key is used to sign APT and YUM repositories. It is stored in the Azure Key Vault for Infra Owners — `${var.key_vault_prefix}infraowners`, currently `fsruby2infraowners` (see `terraform-hisec/key_vault.tf`). GitHub Actions in the `test` and `deploy` environments are granted read access via Entra ID Federated Identity Credentials.

Member

For consistency with earlier text we should call this Github OIDC.

## Google Cloud service account for Infrastructure CI/CD
The following diagram shows the major infrastructure components and how they relate to each other. The role that administers each component is given in the section heading below.

```mermaid

Member

I rendered the Mermaid diagram but I don't think it's easier to read (nor better-looking) than a hand-drawn one, so I prefer an update based on the hand-drawn diagram.

- **`fullstaq-ruby/server-edition` → Azure** uses [Federated Identity Credentials](https://learn.microsoft.com/en-us/entra/workload-id/workload-identity-federation) on Entra ID applications (defined in `terraform-hisec/`). These authenticate workflows that read or write Azure Blob Storage (the CI artifacts and CI cache containers) and Azure Key Vault (the GPG signing key).
- **`fullstaq-ruby/infra` → API server** uses a GitHub-issued OIDC JWT (audience `backend.fullstaqruby.org`) sent as a bearer token to `POST /admin/upgrade_apiserver`. The infra repo's `apiserver.yml` workflow does **not** authenticate to GCP or Azure APIs — the rollout mechanism is entirely on the VM (see [API server](#api-server)). The same JWT mechanism is used by `server-edition` to call `/admin/restart_web_server` after a publish.

## API server

Member

apiserver/README.md also needs to be updated. It still mentions Google Cloud but it's no longer hosted there.

Collaborator Author

Addressed in 109b1b4. Rewrote apiserver/README.md end-to-end since most of it was stale, not just the GCP reference:

  • Hosting: now states the service is a systemd-managed Puma process on the Hetzner backend host, on /run/apiserver/server.sock, reverse-proxied by Caddy under the /admin/* paths of apt.fullstaqruby.org and yum.fullstaqruby.org — no dedicated apiserver.* hostname.
  • Endpoints: corrected the count (3, not 1) and documented GET /, POST /admin/upgrade_apiserver, and POST /admin/restart_web_server with their respective GitHub repo + environment scoping.
  • Auth: replaced the GCP identity-token section with the GitHub Actions OIDC mechanism (audience backend.fullstaqruby.org, with repository/sub/runner_environment/environment claim checks). Also noted that the endpoints are no longer callable from a human/local machine.
  • CD: replaced "deployed by the Infrastructure project's CI" with the actual flow — .github/workflows/apiserver.yml builds + releases, then calls /admin/upgrade_apiserver with an OIDC token; the on-host apiserver-deployer oneshot unit pulls the GH release and swaps the /opt/apiserver/versions/latest symlink.
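
The claim checks listed under "Auth" can be sketched as a pure function over an already-decoded token. This is a sketch under stated assumptions, not the apiserver's actual code (which is Ruby): signature and expiry verification are assumed to have happened elsewhere, the `github-hosted` requirement and the `repo:OWNER/REPO:environment:NAME` subject format are assumptions about what the checks enforce, and `claims_allow` is a hypothetical name:

```python
EXPECTED_AUDIENCE = "backend.fullstaqruby.org"


def claims_allow(claims, repository, environment):
    """Check already-verified OIDC claims against the caller an endpoint expects."""
    audience = claims.get("aud")
    audiences = audience if isinstance(audience, list) else [audience]
    # Assumption: subject format GitHub uses for jobs that run in an environment.
    expected_sub = f"repo:{repository}:environment:{environment}"
    return (
        EXPECTED_AUDIENCE in audiences
        and claims.get("repository") == repository
        and claims.get("environment") == environment
        and claims.get("sub") == expected_sub
        # Assumption: only GitHub-hosted runners are accepted.
        and claims.get("runner_environment") == "github-hosted"
    )
```

Each endpoint would call this with its own repository and environment pair, which is how `/admin/upgrade_apiserver` and `/admin/restart_web_server` can accept different callers while sharing one audience.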

Member

There's now a merge conflict on apiserver/README.md.

Member

As discussed in #47 (review), this document should be updated to cover the changes made in #47.

Collaborator Author

Going to handle this differently — rather than rolling #47's docs into #57, I'll add the docs to #47 itself before it merges. Rationale: each PR should ship with the docs that describe its own post-merge state, so a reader of main is never looking at docs that describe infra not yet on main, and merge ordering between feature PRs and doc PRs stays independent. #57 will stay scoped to true-ups for things that are already on main (post-Hetzner refresh).

Will close the loop here once #47's doc commit lands.

abtreece added 2 commits May 22, 2026 09:20
- Use interpolated bucket/storage account names instead of Terraform vars
- Spell out Workload Identity Federation on first use in the section
- Note that the Azure Blob artifacts container is provisioned but not yet
  used by CI; artifacts still go to the GCS bucket. The cache container
  in the same storage account is actively used (sccache).

Addresses FooBarWidget review on PR fullstaq-ruby#57.

The previous README described a Google Cloud Run-hosted API server
authenticated with Google Cloud identity tokens. Post-Hetzner, the
service runs as a systemd-managed Puma process on the backend host
behind Caddy on a Unix socket, authenticates via GitHub Actions OIDC,
and is deployed via the .github/workflows/apiserver.yml workflow plus
the apiserver-deployer oneshot unit. The README now reflects all four.

Addresses FooBarWidget review on PR fullstaq-ruby#57.
abtreece added a commit to abtreece/infra that referenced this pull request May 22, 2026
Documents the new APT/YUM archive infrastructure introduced in this PR:
the two new public-read GCS buckets, the deliberate absence of CI write
access (frozen-mirror invariant enforced in IAM), the Azure DNS zones
and apex NS delegation for the archive subdomains, and the 404 fallback
behavior in query-latest-repo-versions.rb that lets the web server
start cleanly before the first migration runs.

Addresses FooBarWidget's note on PR fullstaq-ruby#57 that fullstaq-ruby#47's changes should ship
with their own documentation.

Development

Successfully merging this pull request may close these issues.

Update documentation to reflect current architecture

2 participants