Commit 0d80ad5

cooper667 and Copilot authored
chore: Nightly prod to staging sync (#203)
* Add nightly prod->staging sync CronJob with Auth0 user import

Backup phase: pg_dump prod, azcopy sync prod LFS, Auth0 users-export all into adr-snapshots storage account.
Apply phase: pg_restore into adr-s Postgres, azcopy sync to adr-s-datalake, Auth0 users-imports into dev tenant (upsert=true), CKAN solr reindex.

Replaces the AWS-era adx_toolbox sync_development.sh + auth0_backup_users cron. The same artefacts now serve both staging refresh and DR backups, governed by Azure Blob lifecycle policy on adr-snapshots.

See deploy/sync/secrets.yaml.template for the secret shape that needs to exist in adr-s before the CronJob runs.

* Add docs/nightly-sync.md

Architecture, one-time Azure + Auth0 + k8s setup, ops runbook, failure modes, cost estimate. Cross-references the CKAN saml2auth process_user() flow and the retired adx_toolbox scripts that this work supersedes.

* Make pg_restore actually work; drop the users-imports step

pg_restore was silently failing because the refactor moved Postgres creds to libpq env vars but pg_restore (unlike pg_dump) requires -d/--dbname on the command line. Now passed explicitly. Restore is also now loud-fail rather than warn-and-continue; silent partial restores were how this pipeline could silently rot.

Prod and staging share one Auth0 tenant (canonical dev-udfgla0l.eu.auth0.com), so the nightly users-imports into a separate dev tenant was a no-op. Removed the function, the _sanitise_for_dev helper, the AUTH0_DEV_* config fields, and the SKIP_AUTH0_IMPORT flag. The Auth0 export still runs to produce a nightly backup blob.

Other fixes:
- _blob_url() helper builds proper container/prefix URLs so the destination LFS path resolves under snapshots/lfs/ rather than pointing at a non-existent lfs container.
- pg_env() exports PGHOST/PGUSER/PGPASSWORD/PGDATABASE so the password never appears on a subprocess command line (and thus not in CalledProcessError tracebacks).
- run() rebuilds CalledProcessError with redacted args before re-raising for the same reason.
- ckan search-index rebuild now passes --force and treats the rebuild result as advisory: individual datasets with serialization quirks shouldn't kill the whole sync.
- ckan -c path is /tmp/production.ini (where base.ini + secrets.ini are merged at container startup), not /etc/ckan/.
- SKIP_PG_BACKUP env var lets us iterate on apply-phase work without re-dumping prod for ~8 min each run.
- Derive the datastore admin URL from the ckan admin URL by path swap; drop the now-redundant PROD/STAGING_DATASTORE_PG_URL env vars. The limited 'datastore' role can't SELECT the UUID-named resource tables anyway.

* sync: scale ckan+datapusher to 0 around restore; clear Solr on reindex

- Apply phase now scales ckan + datapusher to 0 before DROP DATABASE, scales back to 1 after. Datapusher connects as the 'datastore' role, which the sync (running as ckan_admin) can't terminate from SQL.
- Drop+recreate filters pg_terminate_backend by usename=current_user so an Azure system superuser session can't abort the whole DROP.
- pg_restore now logs the ~7 expected errors (5 benign duplicate-index entries from ckanext-harvest + 2 real prod data dups) and continues instead of aborting, matching the AWS-era behaviour.
- search-index rebuild now uses --clear; without it stale Solr docs point at deleted rows and 500 the home page via ckanext-restricted.
- Bump activeDeadlineSeconds to 4h; datastore restore alone is ~90 min.
- RBAC: grant patch on deployments + deployments/scale.
- Doc updates: architecture (scale-down step), failure modes, one-time setup (RBAC note).

* sync: run ckan db upgrade + extension initdb after pg_restore

The AWS-era sync was data-only: it relied on dev and prod schemas being identical. When staging code adds a new extension or migration ahead of prod, the post-sync staging DB is missing those tables and CKAN logs CRITI 'requires database setup' messages until someone remembers to run initdb by hand.

Run ckan db upgrade plus the per-extension initdb commands (unaids/versions/validation) as part of the apply phase. All idempotent; failures are logged but don't abort the sync.

* sync: auto-discover plugin initdb commands

Replace the hardcoded list (unaids/versions/validation) with discovery from ckan.plugins in the live config. For each enabled plugin, try ckan <plugin> initdb then init-db; swallow exit 2 ('no such Click subcommand') so plugins without an initdb stay silent.

Adding a new extension that needs initdb now works without editing sync.py; just enable it in ckan.plugins.

Cost: ~5-7 min on a 138-min job (20 plugins x 2 attempts x ~10s CKAN CLI startup).

* update time and docs

* sync: fix namespace consistency in _scale; capture multi-line ckan.plugins

- _scale() now takes cfg and uses cfg.ckan_namespace instead of reading NAMESPACE env var. If CKAN_NAMESPACE is overridden the scale/wait calls now follow it instead of hard-coding adr-s.
- _enabled_plugins() previously read only the first line of ckan.plugins =; deploy/base.ini continues the value on an indented second line, so 6 plugins (incl. text_view, auth) were being skipped and their initdb steps never ran after a restore. Awk now captures the first match plus any indented continuation lines.

Both raised by Copilot review on PR #203.

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
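Reviewer note: the multi-line `ckan.plugins` fix described in the commit message can be illustrated with a small Python sketch. The commit itself says the capture is done with awk, and sync.py is not shown on this page, so the function name `parse_enabled_plugins` and the exact matching rules below are assumptions made for illustration only.

```python
import re


def parse_enabled_plugins(ini_text: str) -> list[str]:
    """Collect plugin names from `ckan.plugins = ...` plus indented continuation lines.

    Hypothetical helper: mirrors the behaviour described in the commit message,
    not the awk expression actually used by the sync job.
    """
    plugins: list[str] = []
    capturing = False
    for line in ini_text.splitlines():
        if not capturing:
            match = re.match(r"^ckan\.plugins\s*=\s*(.*)$", line)
            if match:
                # First line of the value, e.g. "envvars image_view".
                plugins.extend(match.group(1).split())
                capturing = True
        elif line[:1] in (" ", "\t") and line.strip():
            # An indented, non-empty line continues the previous value.
            plugins.extend(line.split())
        else:
            # First unindented or blank line ends the entry.
            break
    return plugins
```

Reading only the first matching line would return two plugins for a value that wraps onto a second indented line; following the continuation lines returns all of them, which is the failure mode the commit describes.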
1 parent 82ecc93 commit 0d80ad5

5 files changed

Lines changed: 787 additions & 0 deletions

deploy/sync/Dockerfile.sync

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
# Image for the nightly prod -> staging sync CronJob.
# Target: adracr.azurecr.io/adr-sync
FROM ubuntu:24.04

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update -qq && apt-get install -y -qq \
        ca-certificates curl gnupg lsb-release tar python3 python3-pip python3-requests \
    && install -d /usr/share/postgresql-common/pgdg \
    && curl -fsSL https://www.postgresql.org/media/keys/ACCC4CF8.asc \
        -o /usr/share/postgresql-common/pgdg/apt.postgresql.org.asc \
    && echo "deb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.asc] https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" \
        > /etc/apt/sources.list.d/pgdg.list \
    && apt-get update -qq \
    && apt-get install -y -qq postgresql-client-16 \
    && curl -fsSL https://aka.ms/downloadazcopy-v10-linux -o /tmp/azcopy.tgz \
    && tar -xzf /tmp/azcopy.tgz -C /tmp \
    && install /tmp/azcopy_linux_*/azcopy /usr/local/bin/azcopy \
    && curl -fsSL -o /usr/local/bin/kubectl \
        "https://dl.k8s.io/release/$(curl -fsSL https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" \
    && chmod +x /usr/local/bin/kubectl \
    && rm -rf /var/lib/apt/lists/* /tmp/azcopy*

COPY sync.py /usr/local/bin/sync.py
RUN chmod +x /usr/local/bin/sync.py

ENTRYPOINT ["python3", "/usr/local/bin/sync.py"]
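Reviewer note: the image above bundles the PostgreSQL client tools, azcopy and kubectl, and the job is expensive to re-run (`backoffLimit: 0`, multi-hour runtime). A start-up check along the following lines could fail fast if a binary is missing from the image. This is only a sketch: `REQUIRED_TOOLS` and `missing_tools` are hypothetical names, the tool list is inferred from what the Dockerfile installs, and nothing on this page shows sync.py doing such a check.

```python
import shutil
from typing import Callable, Iterable, Optional

# Assumed list, inferred from the packages and binaries installed in the image.
REQUIRED_TOOLS = ("pg_dump", "pg_restore", "azcopy", "kubectl")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools that cannot be found on PATH (empty list means all present)."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        raise SystemExit(f"missing required tools: {', '.join(missing)}")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.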

deploy/sync/cronjob.yaml

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
# Nightly prod -> staging sync CronJob.
# Apply with: kubectl apply -f deploy/sync/cronjob.yaml
# Manual one-off run: kubectl create job --from=cronjob/adr-sync adr-sync-manual-$(date +%s) -n adr-s
apiVersion: batch/v1
kind: CronJob
metadata:
  name: adr-sync
  namespace: adr-s
spec:
  schedule: "0 1 * * *" # 01:00 UTC daily
  timeZone: "Etc/UTC"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 7
  jobTemplate:
    spec:
      backoffLimit: 0 # don't retry; re-run manually after fixing
      activeDeadlineSeconds: 14400 # 4 hours; datastore restore alone is ~90 min
      template:
        spec:
          restartPolicy: Never
          serviceAccountName: adr-sync # needs `kubectl exec` on the ckan deployment
          containers:
            - name: sync
              image: adracr.azurecr.io/adr-sync:latest
              imagePullPolicy: Always
              resources:
                requests:
                  cpu: "500m"
                  memory: "1Gi"
                limits:
                  cpu: "2"
                  memory: "4Gi"
              envFrom:
                - secretRef:
                    name: adr-sync-secrets
              env:
                - name: CKAN_NAMESPACE
                  value: "adr-s"
                - name: CKAN_DEPLOYMENT
                  value: "deploy/ckan"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: adr-sync
  namespace: adr-s
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: adr-sync
  namespace: adr-s
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "patch"]
  - apiGroups: ["apps"]
    resources: ["deployments/scale"]
    verbs: ["get", "patch", "update"]
  - apiGroups: [""]
    resources: ["pods", "pods/exec"]
    verbs: ["get", "list", "create", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: adr-sync
  namespace: adr-s
subjects:
  - kind: ServiceAccount
    name: adr-sync
    namespace: adr-s
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: adr-sync
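Reviewer note: the `deployments/scale` rule in the Role above is what lets the job scale ckan and datapusher to 0 before the restore and back to 1 afterwards. A minimal sketch of that bracket is shown below. `scale_cmd`, `scaled_to_zero` and the injectable `run` parameter are hypothetical names, not taken from sync.py, and restoring the replicas in a `finally` block is a choice made for this sketch; the commit message only says the job "scales back to 1 after".

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence


def scale_cmd(namespace: str, deployment: str, replicas: int) -> list[str]:
    """Build the kubectl argv that sets a deployment's replica count."""
    return ["kubectl", "-n", namespace, "scale",
            f"deployment/{deployment}", f"--replicas={replicas}"]


def _run(cmd: Sequence[str]) -> None:
    # check=True turns a non-zero kubectl exit code into CalledProcessError.
    subprocess.run(list(cmd), check=True)


@contextmanager
def scaled_to_zero(
    namespace: str,
    deployments: Sequence[str],
    run: Callable[[Sequence[str]], None] = _run,
) -> Iterator[None]:
    """Scale the deployments to 0 for the duration of the block, then back to 1."""
    for name in deployments:
        run(scale_cmd(namespace, name, 0))
    try:
        yield
    finally:
        # Runs even if the body raises, so staging is not left scaled down.
        for name in deployments:
            run(scale_cmd(namespace, name, 1))
```

Usage would look like `with scaled_to_zero("adr-s", ["ckan", "datapusher"]): restore()`, where `restore` stands in for the drop, recreate and pg_restore steps.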

deploy/sync/secrets.yaml.template

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
# Template for the k8s Secret consumed by the adr-sync CronJob.
#
# Do NOT commit a populated copy. Either:
#   1. Use sealed-secrets / external-secrets and commit the encrypted form.
#   2. Apply manually with `kubectl apply -f` from a local copy that lives
#      only on the operator's machine.
#
# All values are strings. The SAS tokens should be issued with the minimum
# necessary permissions:
#   - PROD_LFS_SAS: read+list on adr-p-datalake
#   - SNAPSHOTS_SAS: read+list+write+delete on adr-snapshots (full)
#   - STAGING_LFS_SAS: read+list+write+delete on adr-s-datalake
#
# Auth0 M2M scopes: read:users create:users (for users-exports job).
# Use the canonical tenant domain (not the custom domain) for AUTH0_PROD_DOMAIN;
# the Management API token endpoint validates against the canonical hostname.
apiVersion: v1
kind: Secret
metadata:
  name: adr-sync-secrets
  namespace: adr-s
type: Opaque
stringData:
  # Postgres connection URLs: use the ckan_admin (or equivalent) role with
  # access to BOTH the ckan and datastore databases. The sync derives the
  # datastore URL by swapping the path component. Do NOT use the limited
  # `datastore` role here; it can't SELECT the UUID-named resource tables.
  PROD_CKAN_PG_URL: "postgresql://ckan_admin:PASS@adr-p-eun-db001.postgres.database.azure.com/ckan?sslmode=require"
  STAGING_CKAN_PG_URL: "postgresql://ckan_admin:PASS@adr-s-eun-db001.postgres.database.azure.com/ckan?sslmode=require"

  # Storage
  SNAPSHOTS_ACCOUNT: "adrsnapshotsta"
  SNAPSHOTS_CONTAINER: "snapshots"
  SNAPSHOTS_SAS: "sv=...&sig=..." # full rwl(d) on the container
  PROD_LFS_ACCOUNT: "adrpeunsta"
  PROD_LFS_CONTAINER: "adr-p-datalake"
  PROD_LFS_SAS: "sv=...&sig=..." # read+list only
  STAGING_LFS_ACCOUNT: "adrseunsta"
  STAGING_LFS_CONTAINER: "adr-s-datalake"
  STAGING_LFS_SAS: "sv=...&sig=..." # rwl(d)

  # Auth0: single tenant, M2M creds for the users-exports backup job.
  # Env var names are AUTH0_PROD_* for backward-compat with earlier secret revisions;
  # there is no AUTH0_DEV_* anymore because prod and staging share one tenant.
  AUTH0_PROD_DOMAIN: "dev-udfgla0l.eu.auth0.com" # canonical, not the vanity custom domain
  AUTH0_PROD_CLIENT_ID: "..."
  AUTH0_PROD_CLIENT_SECRET: "..."

  # Slack (optional; omit to disable notifications)
  SLACK_WEBHOOK_URL: "https://hooks.slack.com/services/..."
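Reviewer note: the template comments say the sync derives the datastore URL from the ckan URL by swapping the path component, and the commit message says `pg_env()` exports libpq variables so the password never reaches a command line. Both ideas can be sketched with the standard library as follows. The real `pg_env()` is not shown on this page, so the signatures here are assumptions; `datastore_url` is a hypothetical name, the database name `datastore` is inferred from the comment about "the ckan and datastore databases", and `PGPORT` and `PGSSLMODE` are additions of this sketch beyond the four variables the commit names.

```python
from urllib.parse import parse_qs, unquote, urlsplit, urlunsplit


def pg_env(url: str) -> dict[str, str]:
    """Translate a postgresql:// URL into libpq environment variables."""
    parts = urlsplit(url)
    env = {
        "PGHOST": parts.hostname or "",
        "PGUSER": unquote(parts.username or ""),
        # Percent-decoding lets passwords contain characters such as '@'.
        "PGPASSWORD": unquote(parts.password or ""),
        "PGDATABASE": parts.path.lstrip("/"),
    }
    if parts.port:
        env["PGPORT"] = str(parts.port)  # assumption: not listed in the commit
    sslmode = parse_qs(parts.query).get("sslmode")
    if sslmode:
        env["PGSSLMODE"] = sslmode[0]  # assumption: not listed in the commit
    return env


def datastore_url(ckan_url: str, database: str = "datastore") -> str:
    """Return the same URL with only the database (path) component replaced."""
    parts = urlsplit(ckan_url)
    return urlunsplit(parts._replace(path=f"/{database}"))
```

A caller would merge the result into the child environment, for example `subprocess.run(cmd, env={**os.environ, **pg_env(url)}, check=True)`, so that the argv captured in a `CalledProcessError` carries no credentials.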
