Commit 0d80ad5
chore: Nightly prod to staging sync (#203)
* Add nightly prod->staging sync CronJob with Auth0 user import
Backup phase: pg_dump prod, azcopy sync prod LFS, Auth0 users-export
all into adr-snapshots storage account.
Apply phase: pg_restore into adr-s Postgres, azcopy sync to adr-s-datalake,
Auth0 users-imports into dev tenant (upsert=true), CKAN solr reindex.
Replaces the AWS-era adx_toolbox sync_development.sh + auth0_backup_users
cron. The same artefacts now serve both staging refresh and DR backups,
governed by Azure Blob lifecycle policy on adr-snapshots.
See deploy/sync/secrets.yaml.template for the secret shape that needs to
exist in adr-s before the CronJob runs.
* Add docs/nightly-sync.md
Architecture, one-time Azure + Auth0 + k8s setup, ops runbook,
failure modes, cost estimate. Cross-references the CKAN
saml2auth process_user() flow and the retired adx_toolbox scripts
that this work supersedes.
* Make pg_restore actually work; drop the users-imports step
pg_restore was silently failing because the refactor moved Postgres
creds to libpq env vars but pg_restore (unlike pg_dump) requires
-d/--dbname on the command line. Now passed explicitly. Restore also
now fails loudly rather than warn-and-continue: silent partial
restores were how this pipeline could rot unnoticed.
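
A minimal sketch of the fix described above, not the actual sync.py code. The function names and signatures here are illustrative assumptions; only the `pg_restore` `--dbname` flag and the libpq `PG*` variables come from the commit text.

```python
import os
import subprocess


def pg_env(host, user, password, dbname):
    """Return a copy of the environment with libpq connection variables set."""
    env = dict(os.environ)
    env.update(PGHOST=host, PGUSER=user, PGPASSWORD=password, PGDATABASE=dbname)
    return env


def pg_restore_cmd(dump_path, dbname):
    """Build the pg_restore argv.

    Without --dbname, pg_restore writes a SQL script to stdout instead of
    connecting, so the target database is named explicitly. The password
    stays in the environment and never appears in argv.
    """
    return ["pg_restore", "--dbname", dbname, dump_path]


def restore(dump_path, env):
    """Run the restore; check=True turns a non-zero exit into an exception."""
    cmd = pg_restore_cmd(dump_path, env["PGDATABASE"])
    subprocess.run(cmd, env=env, check=True)
```

Host, user and database still come from `PGHOST`/`PGUSER`/`PGPASSWORD`; only the database name is repeated on the command line.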
Prod and staging share one Auth0 tenant (canonical
dev-udfgla0l.eu.auth0.com), so the nightly users-imports into a
separate dev tenant was a no-op. Removed the function, the
_sanitise_for_dev helper, the AUTH0_DEV_* config fields, and the
SKIP_AUTH0_IMPORT flag. The Auth0 export still runs to produce a
nightly backup blob.
Other fixes:
- _blob_url() helper builds proper container/prefix URLs so the
destination LFS path resolves under snapshots/lfs/ rather than
pointing at a non-existent lfs container.
- pg_env() exports PGHOST/PGUSER/PGPASSWORD/PGDATABASE so the
password never appears on a subprocess command line (and thus
not in CalledProcessError tracebacks).
- run() rebuilds CalledProcessError with redacted args before
re-raising for the same reason.
- ckan search-index rebuild now passes --force and treats the
rebuild result as advisory — individual datasets with
serialization quirks shouldn't kill the whole sync.
- ckan -c path is /tmp/production.ini (where base.ini + secrets.ini
are merged at container startup), not /etc/ckan/.
- SKIP_PG_BACKUP env var lets us iterate on apply-phase work
without re-dumping prod for ~8 min each run.
- Derive the datastore admin URL from the ckan admin URL by path
swap; drop the now-redundant PROD/STAGING_DATASTORE_PG_URL env
vars. The limited 'datastore' role can't SELECT the UUID-named
resource tables anyway.
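
Three of the fixes above can be sketched as small helpers. This is a sketch under assumptions rather than the code in sync.py: the signatures, the `secrets` parameter, and the name `swap_db` are hypothetical, and the account name in the usage note is a placeholder.

```python
import subprocess
from urllib.parse import urlsplit, urlunsplit


def blob_url(account, container, prefix=""):
    """Build https://<account>.blob.core.windows.net/<container>/<prefix>."""
    base = f"https://{account}.blob.core.windows.net/{container}"
    return f"{base}/{prefix.strip('/')}" if prefix else base


def redact(args, secrets):
    """Replace every occurrence of a secret in each argument with ***."""
    out = []
    for arg in args:
        for secret in secrets:
            if secret:
                arg = arg.replace(secret, "***")
        out.append(arg)
    return out


def run(args, secrets=(), **kwargs):
    """Run a command; on failure, re-raise with redacted args."""
    try:
        return subprocess.run(args, check=True, **kwargs)
    except subprocess.CalledProcessError as exc:
        # "from None" drops the original exception, which still holds raw args.
        raise subprocess.CalledProcessError(
            exc.returncode, redact(args, secrets), exc.output, exc.stderr
        ) from None


def swap_db(url, dbname):
    """Return the URL with its path (the database name) replaced by dbname."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path="/" + dbname))
```

For example, `blob_url("exampleaccount", "snapshots", "lfs")` yields a URL under the `snapshots` container rather than a separate `lfs` container, and `swap_db(admin_url, "datastore")` derives the datastore admin URL, assuming the datastore database is literally named `datastore`.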
* sync: scale ckan+datapusher to 0 around restore; clear Solr on reindex
- Apply phase now scales ckan + datapusher to 0 before DROP DATABASE,
scales back to 1 after. Datapusher connects as the 'datastore' role,
which the sync (running as ckan_admin) can't terminate from SQL.
- Drop+recreate filters pg_terminate_backend by usename=current_user
so an Azure system superuser session can't abort the whole DROP.
- pg_restore now logs the ~7 expected errors (5 benign duplicate-index
entries from ckanext-harvest + 2 real prod data dups) and continues
instead of aborting, matching the AWS-era behaviour.
- search-index rebuild now uses --clear; without it stale Solr docs
point at deleted rows and 500 the home page via ckanext-restricted.
- Bump activeDeadlineSeconds to 4h; datastore restore alone is ~90 min.
- RBAC: grant patch on deployments + deployments/scale.
- Doc updates: architecture (scale-down step), failure modes,
one-time setup (RBAC note).
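
The scale-down and session-termination steps above might look roughly like this. It is a sketch only: the deployment names, the `runner` callable, and the use of `finally` to scale back up are assumptions, not confirmed details of sync.py.

```python
from contextlib import contextmanager


def terminate_sql(dbname):
    """SQL that ends only the current role's other sessions on dbname.

    Filtering on usename = current_user avoids trying to terminate sessions
    owned by other roles, such as a platform superuser.
    """
    # dbname is interpolated naively; it is assumed to be trusted config.
    return (
        "SELECT pg_terminate_backend(pid) FROM pg_stat_activity "
        f"WHERE datname = '{dbname}' "
        "AND usename = current_user AND pid <> pg_backend_pid();"
    )


def scale_cmd(namespace, deployment, replicas):
    """Build a kubectl scale command for one deployment."""
    return [
        "kubectl", "-n", namespace, "scale",
        f"deployment/{deployment}", f"--replicas={replicas}",
    ]


@contextmanager
def scaled_down(namespace, deployments, runner):
    """Scale deployments to 0 for the duration of the block, then back to 1."""
    for name in deployments:
        runner(scale_cmd(namespace, name, 0))
    try:
        yield
    finally:
        # Assumption: scale back up even if the restore raised.
        for name in deployments:
            runner(scale_cmd(namespace, name, 1))
```

A caller would wrap the drop-and-restore in `with scaled_down(cfg.ckan_namespace, ["ckan", "datapusher"], run): ...`, which is why the job's service account needs `patch` on `deployments/scale`.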
* sync: run ckan db upgrade + extension initdb after pg_restore
The AWS-era sync was data-only — it relied on dev and prod schemas
being identical. When staging code adds a new extension or migration
ahead of prod, the post-sync staging DB is missing those tables and
CKAN logs CRITI 'requires database setup' messages until someone
remembers to run initdb by hand.
Run ckan db upgrade plus the per-extension initdb commands
(unaids/versions/validation) as part of the apply phase. All
idempotent; failures are logged but don't abort the sync.
* sync: auto-discover plugin initdb commands
Replace the hardcoded list (unaids/versions/validation) with
discovery from ckan.plugins in the live config. For each enabled
plugin, try ckan <plugin> initdb then init-db; swallow exit 2
('no such Click subcommand') so plugins without an initdb stay
silent. Adding a new extension that needs initdb now works without
editing sync.py — just enable it in ckan.plugins.
Cost: ~5-7 min on a 138-min job (20 plugins x 2 attempts x ~10s
CKAN CLI startup).
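
The try-then-fallback logic above can be sketched as follows. The function name, return values, and injectable `runner` are hypothetical, and the choice to stop at the first non-2 failure is an assumption. Exit code 2 is Click's usage-error status.

```python
import subprocess

NO_SUCH_COMMAND = 2  # Click exits with 2 on usage errors such as unknown commands


def init_plugin(plugin, config, runner=subprocess.run):
    """Try `ckan <plugin> initdb`, then `init-db`; report what happened.

    Returns "ok" if either spelling succeeded, "skipped" if neither
    subcommand exists, and "failed" on any other non-zero exit.
    """
    for sub in ("initdb", "init-db"):
        result = runner(
            ["ckan", "-c", config, plugin, sub],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            return "ok"
        if result.returncode != NO_SUCH_COMMAND:
            # A real failure: the caller logs it but does not abort the sync.
            return "failed"
    return "skipped"
```

The cost estimate follows from the worst case of two CLI startups per plugin: 20 x 2 x 10 s = 400 s, a little under 7 minutes.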
* update time and docs
* sync: fix namespace consistency in _scale; capture multi-line ckan.plugins
- _scale() now takes cfg and uses cfg.ckan_namespace instead of reading
NAMESPACE env var. If CKAN_NAMESPACE is overridden the scale/wait
calls now follow it instead of hard-coding adr-s.
- _enabled_plugins() previously read only the first line of
ckan.plugins =; deploy/base.ini continues the value on an indented
second line, so 6 plugins (incl. text_view, auth) were being skipped
and their initdb steps never ran after a restore. Awk now captures
the first match plus any indented continuation lines.
Both raised by Copilot review on PR #203.
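
The continuation-line handling is done in awk in the commit. The Python below is an equivalent sketch for illustration only, with a hypothetical function name, and assumes continuation lines are indented and contain only plugin names.

```python
def enabled_plugins(ini_text):
    """Return plugin names from `ckan.plugins = ...`, including indented
    continuation lines that follow the first match."""
    plugins = []
    capturing = False
    for line in ini_text.splitlines():
        if not capturing:
            key, sep, value = line.partition("=")
            if sep and key.strip() == "ckan.plugins":
                plugins.extend(value.split())
                capturing = True
        elif line[:1] in (" ", "\t") and line.strip():
            # Indented, non-blank line: the value continues here.
            plugins.extend(line.split())
        else:
            # First non-indented or blank line ends the value.
            break
    return plugins
```

Reading only the first line would return just the plugins listed before the line break, which is the bug described above.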
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
1 parent 82ecc93 commit 0d80ad5
5 files changed
Lines changed: 787 additions & 0 deletions