Add production deployment configuration and CI/CD #194

Draft
cooper667 wants to merge 63 commits into ckan211-python310-migration-staging-1 from ckan211-prod-deploy-pr

@cooper667

  • Add deploy/ folder with Dockerfile.prod, nginx, uwsgi configs
  • Add production.ini (secrets externalized to secrets.ini)
  • Add entrypoint that merges production.ini + secrets.ini at startup
  • Add build-deploy.yml GitHub Actions workflow
  • Add dependabot.yml
  • Update supervisor config with nginx and uwsgi programs

Previous commits were force-pushed away from upstream repos.

Change GitHub environment URL for staging deployments to reflect
the new domain.

…ploads

Updates ckanext-unaids to 5e557c3 which adds CSRF token to file upload
authorization requests, fixing 400 errors when uploading files in CKAN 2.11.

Support all package types (dataset, dataset-2, etc.) in download routes.
DataPusher was failing with 404 for resources using custom package types.

- Change staging domain from dev-adr to dev.adr.fjelltopp.org
- Enable saml2auth plugin and configure Auth0 IDP
- Re-enable login/register redirect to SAML2 login
- Update ckanext-unaids submodule URL to fork

Bake production.ini into image so config changes flow through CI/CD.
Secrets are still merged at runtime via entrypoint from secrets.ini.

After this deploys, run:
kubectl patch deployment ckan -n adr-s --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/volumes/3/projected/sources", "value": [
    {"secret": {"name": "jwt-keys"}},
    {"secret": {"name": "ckan-ini-secrets"}}
  ]}
]'

- Dockerfile bakes config as /etc/ckan/base.ini
- Entrypoint merges base.ini + secrets.ini → /etc/ckan/production.ini
- Allows subPath mounts for secrets without overwriting base config

After deploy, apply subPath mount patch (see commit message).

Config merge order at startup: base.ini < env.ini < secrets.ini

- deploy/base.ini: common config (baked into image)
- deploy/staging.ini: staging-specific (CI creates ConfigMap)
- deploy/production.ini: prod-specific (CI creates ConfigMap)
- Entrypoint merges all three into /tmp/production.ini
- CI workflow creates ckan-env-config ConfigMap per environment
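The merge order described above can be sketched with Python's `configparser`. This is a minimal illustration under stated assumptions, not the entrypoint shipped in this PR: the helper name `merge_ini` and the sample keys and values are hypothetical, and only the precedence `base.ini < env.ini < secrets.ini` comes from the commit message.

```python
import configparser


def merge_ini(*sources: str) -> configparser.ConfigParser:
    """Merge INI texts in order; keys in later sources override earlier ones."""
    # interpolation=None keeps values such as %(here)s verbatim, and
    # delimiters=("=",) stops URLs containing ':' from being split.
    merged = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    merged.optionxform = str  # preserve key case instead of lowercasing
    for text in sources:
        merged.read_string(text)
    return merged


# Hypothetical sample content, for illustration only.
BASE = "[app:main]\nckan.site_url = http://localhost\nckan.plugins = unaids\n"
ENV = "[app:main]\nckan.site_url = https://staging.example.org\n"
SECRETS = "[app:main]\nsqlalchemy.url = postgresql://user:pass@db/ckan\n"

# Precedence: base < env < secrets.
merged = merge_ini(BASE, ENV, SECRETS)
```

Reading the sources in sequence gives "last writer wins" per key, so an environment file only needs to list the settings that differ from the base.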
cooper667 force-pushed the ckan211-prod-deploy-pr branch from 10a5add to 21d7e3b on February 2, 2026 17:31

Point submodule back to fjelltopp/ckanext-unaids instead of fork,
using the same commit as the base branch.

…d_request_access

CKAN 2.11 does not register a site_read auth function. The previously
recorded submodule commit (1678265) had a stricter ValueError guard that
re-raised in non-testing environments, causing a 500 on any
restricted_request_access URL. The new commit (9b530ef) correctly
swallows the missing-auth-function ValueError unconditionally.
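The difference between the two guards can be illustrated with a self-contained sketch. Everything here is hypothetical: `AUTH_FUNCTIONS` and `get_auth_function` stand in for CKAN's auth lookup, `site_read_allowed` is an invented name, and returning `True` after swallowing the error is an assumption about the intended fallback rather than the code in ckanext-unaids.

```python
from typing import Callable, Dict

# Hypothetical stand-in for an auth-function registry.
AUTH_FUNCTIONS: Dict[str, Callable[[dict], bool]] = {}


def get_auth_function(name: str) -> Callable[[dict], bool]:
    """Mimic a lookup that raises ValueError for an unregistered function."""
    try:
        return AUTH_FUNCTIONS[name]
    except KeyError:
        raise ValueError(f"Authorization function not found: {name}") from None


def site_read_allowed(context: dict) -> bool:
    """Treat a missing site_read auth function as 'no restriction'."""
    try:
        auth = get_auth_function("site_read")
    except ValueError:
        # Swallowed unconditionally, not only when running under tests,
        # so a missing registration never surfaces as a 500.
        return True
    return auth(context)
```

A guard that re-raises outside of testing would turn the missing registration into an unhandled exception on every request that reaches this check.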
CKAN startup (60s readiness delay + migrations + uWSGI init) takes
~6-7 minutes. The 5m timeout was causing false-failure CI exits even
when the deploy succeeded.

The nightly sync replaces staging's DB with a prod snapshot, wiping any
api_token rows that only existed on staging. This breaks DataPusher —
it holds a ckan.datapusher.api_token JWT whose DB row no longer exists,
causing every push job to 401 on callback.

Fix: add refresh_datapusher_token() to the sync, called inside the
scale-to-0 window after both DB restores complete and before CKAN
scales back up:

  1. Generates a fresh HS256 JWT using the same secret CKAN uses
     (api_token.jwt.encode.secret, passed in as CKAN_JWT_SECRET).
  2. Inserts a matching row into staging's api_token table via psql.
  3. Patches ckan-ini-secrets with the new token so CKAN starts
     with a token that already exists in the restored DB.

No extra pod restart needed — CKAN picks up the correct secret on
its normal post-restore scale-up.

Also:
- Add secrets: get/patch RBAC rule to the adr-sync Role.
- Document CKAN_JWT_SECRET in secrets.yaml.template.

Move from the downgraded 9b530ef back to the tip of development
(1678265) plus one patch commit (058dd78) that fixes the site_read
ValueError on CKAN 2.11 without regressing any other changes.

Prevents stacked deploys when commits are pushed in quick succession.
A new push cancels the in-flight CI run before it reaches kubectl,
avoiding the window where both the old and new pods are 0/1.

pg_restore --no-privileges strips every GRANT from the restored datastore
DB. The datastore_ro role then can't SELECT from _table_metadata, so
DataPusher's datastore_search call 500s, push_to_datastore aborts before
pushing rows, and the 'complete' callback that creates datatables_view
never fires. Uploads appear to succeed but views never show up.

Run 'ckan datastore set-permissions' after the restore and pipe the
canonical GRANT script into psql as ckan_admin against the datastore DB.
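The post-restore step can be sketched as two chained commands driven from Python. This is a hedged illustration rather than the sync script itself: the function names, the ini path, and the connection URL are placeholders, and the sketch assumes `ckan` and `psql` are on the PATH of whatever runs the sync.

```python
import subprocess
from typing import List, Tuple


def build_grant_commands(ckan_ini: str, datastore_url: str) -> Tuple[List[str], List[str]]:
    """Return the command that emits the GRANT script and the one that applies it."""
    emit = ["ckan", "-c", ckan_ini, "datastore", "set-permissions"]
    # ON_ERROR_STOP makes psql exit non-zero on the first failing statement.
    apply = ["psql", "--set", "ON_ERROR_STOP=1", datastore_url]
    return emit, apply


def reapply_datastore_grants(ckan_ini: str, datastore_url: str) -> None:
    """Pipe the output of 'ckan datastore set-permissions' into psql."""
    emit, apply = build_grant_commands(ckan_ini, datastore_url)
    sql = subprocess.run(emit, check=True, capture_output=True, text=True).stdout
    subprocess.run(apply, input=sql, check=True, text=True)


# Placeholder values; the real ones come from the sync's environment.
emit_cmd, apply_cmd = build_grant_commands(
    "/etc/ckan/production.ini",
    "postgresql://ckan_admin@db.example.internal/datastore",
)
```

Using `check=True` on both calls means a failed GRANT aborts the sync instead of leaving the datastore readable only by its owner.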
Comment thread docs/nightly-sync.md

## Auth0 layout

There's one Auth0 tenant (canonical `dev-udfgla0l.eu.auth0.com`) with the custom domain `auth-hivtools.unaids.org` promoted on top.
Member

No. There are 2?!

Member

This one is the prod one. Have we dropped the dev one?

Comment thread .gitignore Outdated
Comment thread deploy/production.ini
Member

What about SMTP?

Author

cooper667 added 2 commits May 29, 2026 12:19

These predate the Azure/AKS + GitHub Actions pipeline and have no live
consumer:

- Jenkinsfile        — Jenkins pipeline; logged into AWS ECR, drove ci_setup/ci_test
- ci_setup.sh        — Jenkins-only stack bringup ($WORKSPACE/CHANGE_ID)
- ci_test.sh         — Jenkins daily test job
- build_ckan.yml     — published ckan_base to ghcr; prod Dockerfile is self-contained
- container_build_and_push.sh — manual Docker Hub base-image push, superseded by 'adx build'

Frozen since 2022-23 while the rest of the repo moved to CKAN 2.11/Py3.10.
Jenkins is decommissioned. Local dev (adx + docker-compose) is unaffected.

rebuild_solr_index.sh is tracked, so ignoring it was a no-op that just
hid it from git status. Drop the stale ignore lines.

* ci: gate staging build/deploy on extension test suite

Re-adds the CKAN extension tests that died with the Jenkins pipeline,
now as a 'test' job in build-deploy.yml that build/deploy depend on.

Brings up the docker-compose dev stack (the same flow the old
ci_setup.sh/ci_test.sh drove via adx) on pinned submodule commits and
runs the 5 Fjelltopp suites:
  unaids, validation, scheming, dhis2harvester, emailasusername

Gate semantics:
- push: tests must pass before build and deploy run
- redeploy (workflow_dispatch + image_tag): tests skipped, image
  already tested when built
- test failure blocks both build and deploy

Tests run against the dev image + bind-mounted submodules, matching
local 'adx test'. Not yet verified green under CKAN 2.11/Py3.10.

* ci: run tests on PRs + cache pipenv venv and React node_modules

- Add pull_request trigger so the extension-test job runs as a visible
  status check; build/deploy stay gated off pull_request events.
- Cache .adxvenv on Pipfile.lock — the dominant cost is bootstrap's
  'pipenv sync --dev' (CKAN + ~19 extensions), not the image build.
- Cache the unaids React node_modules on its yarn.lock.

Both caches are safe on miss, so the first run is a clean cold signal.

* ci: run all suites (no fail-fast) and save caches on failure

- Run all 5 extension suites and aggregate, so one run reports the full
  picture instead of stopping at the first failing suite.
- Split cache restore/save so .adxvenv and node_modules are saved even
  when tests fail, making triage runs warm instead of cold.
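The "run everything, then aggregate" behaviour from the first bullet can be sketched as follows. The runner is passed in as a callable so the example stays self-contained; `run_all_suites` and `overall_exit_code` are hypothetical names, and how the real job invokes each suite is not shown here.

```python
from typing import Callable, Dict, Iterable

SUITES = ("unaids", "validation", "scheming", "dhis2harvester", "emailasusername")


def run_all_suites(run_suite: Callable[[str], int],
                   suites: Iterable[str] = SUITES) -> Dict[str, int]:
    """Run every suite and record its exit code, never stopping early."""
    return {name: run_suite(name) for name in suites}


def overall_exit_code(results: Dict[str, int]) -> int:
    """Fail the job if any suite failed, after all of them have run."""
    return 0 if all(code == 0 for code in results.values()) else 1


# Fake runner for illustration: pretend only 'validation' fails.
results = run_all_suites(lambda name: 1 if name == "validation" else 0)
```

Because the exit code is computed only after the loop, a failure in one suite still leaves results for the other four in the same run.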

* test: fix extension test suite under CKAN 2.11/Py3.10

- Add 'mock' and 'pyfakefs' to dev-packages: ckanext-validation,
  -scheming and -emailasusername import the standalone 'mock' package
  (and pyfakefs in validation), which weren't installed, causing
  pytest collection errors. Dev-only — prod uses 'pipenv sync' without
  --dev, so these don't ship.
- run_tests: blank CKAN_SMTP_SERVER for test runs so suites never reach
  the dev stack's smtp4dev. Fixes ckanext-unaids'
  test_send_dataset_transfer_emails_errors, which asserts mail sending
  fails when no server is configured.

* deps: pin frictionless==5.13.1 and pyfakefs==4.6.* to match extensions

The validation/unaids extensions pin frictionless[ckan]==5.13.1 and
pyfakefs==4.6.* in their own requirements and pass their own CI against
those. Our merged Pipfile had frictionless>=5.0.0,<6.0.0 (drifted to a
newer 5.x that dropped Resource.__create__) and pyfakefs=* (6.x dropped
the CreateFile API), so ckanext-validation's test suite failed in our
stack only. Aligning the pins fixes it without touching the submodule;
frictionless is prod-facing but this matches the version the extensions
are built and tested against.

The Jenkins/Py2-era testing section referenced nosetests-2.7 and a
ckan-nosetests alias that no longer exist. adx test wraps ckan-pytest
(pytest in CKAN's venv); update the core-tests command to the pytest
equivalent against the mounted /etc/ckan/test-core.ini. Also drop the
dead nosetests.xml entry from .gitignore.
3 participants