|
| 1 | +# Spec: `seed-test-data` — dev-only test-data seeding component |
| 2 | + |
| 3 | +## Goal |
| 4 | + |
| 5 | +A new component that runs **inside the Kubernetes cluster on dev/E2E deployments only** and |
| 6 | +seeds a realistic slice of SeqSet-citation test data, so that the SeqSet + citation features |
| 7 | +(PR #6304) are visibly populated on a fresh dev deployment without a human clicking through the UI. |
| 8 | + |
| 9 | +On each fresh dev deploy it will: |
| 10 | + |
| 11 | +1. Register a seed user + create a group. |
| 12 | +2. Submit a handful of sequences to the **dummy organism**, drive them through the dummy |
| 13 | + preprocessing pipeline, and **release** them (so they get accessions). |
| 14 | +3. Create a **SeqSet** referencing those released accessions (focal + background). |
| 15 | +4. Insert a **manual ("CURATED") citation** of that SeqSet into the database. |
| 16 | + |
| 17 | +## Why reuse the integration tests |
| 18 | + |
| 19 | +The whole submit→preprocess→release→seqset flow is already implemented as Playwright page |
| 20 | +objects in `integration-tests/`. Rather than re-deriving the backend REST choreography, the |
| 21 | +seed component is a thin entrypoint that drives those same page objects against the in-cluster |
| 22 | +website. This keeps the seed path and the test path exercising identical code. |
| 23 | + |
| 24 | +Reused page objects (all under `integration-tests/tests/pages/`): |
| 25 | + |
| 26 | +| Page object | Method(s) used | Source | |
| 27 | +|---|---|---| |
| 28 | +| `AuthPage` | `createAccount` / `tryLoginOrRegister` | `auth.page.ts:12` | |
| 29 | +| `GroupPage` | `createGroup` | `group.page.ts:25` | |
| 30 | +| `SubmissionPage` | `fillSubmissionFormDummyOrganism`, `fillSequenceData`, `acceptTerms`, `completeSubmission` | `submission.page.ts:101,116,136` | |
| 31 | +| `ReviewPage` | `waitForAllProcessed`, `releaseAndGoToReleasedSequences` | `review.page.ts:121,160` | |
| 32 | +| `SeqSetPage` | `gotoList`, `createSeqSet` | `seqset.page.ts:15,31` | |
| 33 | + |
| 34 | +Reused helpers: `buildTestGroup` (`utils/testGroup.ts:23`), sequence constants in |
| 35 | +`test-helpers/test-data.ts`, and the dummy-organism display name **"Test Dummy Organism"** |
| 36 | +(`kubernetes/loculus/values.yaml:1776`). |
| 37 | + |
| 38 | +## The citation mechanism |
| 39 | + |
| 40 | +Citations land in the DB two ways (`backend/.../db/migration/V1.29__add_seqset_citations_table.sql`): |
| 41 | + |
| 42 | +- `origin = 'CROSSREF'` — written by the scheduled `SeqSetCrossRefCitationsTask` that polls the |
| 43 | + CrossRef cited-by API every 6h. Not reproducible on a dev cluster (no real CrossRef, no real DOIs). |
| 44 | +- `origin = 'CURATED'` — now written by the **`POST /create-curated-citation`** backend endpoint |
| 45 | + added in this branch (superuser-only). See `SeqSetCitationsController.createCuratedCitation`. |
| 46 | + |
| 47 | +The seed job creates the manual citation by calling that endpoint with a **super-user** token: |
| 48 | + |
| 49 | +``` |
| 50 | +POST /create-curated-citation (Authorization: Bearer <super-user JWT>) |
| 51 | +{ |
| 52 | + "seqSetId": "<from createSeqSet>", |
| 53 | + "seqSetVersion": 1, |
| 54 | + "source": { |
| 55 | + "sourceDOI": "10.0000/seed-citation-1", |
| 56 | + "title": "Seed reference publication", |
| 57 | + "year": 2024, |
| 58 | + "contributors": [{ "givenName": "Ada", "surname": "Lovelace" }] |
| 59 | + } |
| 60 | +} |
| 61 | +``` |
| 62 | + |
| 63 | +The endpoint enforces `authenticatedUser.isSuperUser` (else 403), validates the SeqSet exists |
| 64 | +(else 404), upserts the citation source (reusing an existing DOI row if present), then links it to |
| 65 | +the SeqSet version. The link is by `(seqset_id, seqset_version)`, so **no minted DOI is required**. |
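Because the seed job runs unattended, it helps if a failed citation call says which of the documented checks rejected it. The following is a minimal sketch under assumptions: `interpretCitationResponse` and `CitationOutcome` are hypothetical names, and any 2xx status is treated as success because the spec does not state the exact success code.

```typescript
// Hypothetical helper: translate the endpoint's documented status codes
// (403 = caller is not a superuser, 404 = SeqSet not found) into log-friendly reasons.
type CitationOutcome = { ok: true } | { ok: false; reason: string };

function interpretCitationResponse(status: number): CitationOutcome {
  // Assumption: any 2xx counts as success; the spec does not name the exact code.
  if (status >= 200 && status < 300) {
    return { ok: true };
  }
  if (status === 403) {
    return { ok: false, reason: 'token lacks the super_user role' };
  }
  if (status === 404) {
    return { ok: false, reason: 'SeqSet id/version not found' };
  }
  return { ok: false, reason: `unexpected status ${status}` };
}
```

In `seed.ts` this could wrap the result of the `page.request.post(...)` call, exiting non-zero when `ok` is false so the Job is marked failed.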
| 66 | + |
| 67 | +> **Implications for the seed job:** no DB secret needed — it's a plain authenticated HTTP call via |
| 68 | +> Playwright's `page.request` (or `fetch`). The citation step must use a token with the `super_user` |
| 69 | +> realm role. Recommend logging in as the existing dev superuser (`superuser`/`superuser`, created |
| 70 | +> when `createTestAccounts: true`) for that one call, while the submit/seqset steps use the seed user. |
| 71 | + |
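The super-user token mentioned above could be obtained with a Keycloak password grant. The sketch below only builds the request, so it can be sent with either `page.request.post` or `fetch`. Assumptions: `buildPasswordGrantRequest` and `TokenRequest` are hypothetical names, the realm and client id are placeholders that the spec does not name, and the path is Keycloak's standard OpenID Connect token endpoint.

```typescript
// Hypothetical request descriptor for a Keycloak password grant.
interface TokenRequest {
  url: string;
  headers: { [name: string]: string };
  body: string;
}

function buildPasswordGrantRequest(
  keycloakBaseUrl: string, // assumption: in-cluster Keycloak base URL
  realm: string,           // assumption: realm name is not given in the spec
  clientId: string,        // assumption: client id is not given in the spec
  username: string,
  password: string,
): TokenRequest {
  const fields: Array<[string, string]> = [
    ['grant_type', 'password'],
    ['client_id', clientId],
    ['username', username],
    ['password', password],
  ];
  // Form-encode the fields by hand to keep the sketch dependency-free.
  const body = fields
    .map((pair) => `${encodeURIComponent(pair[0])}=${encodeURIComponent(pair[1])}`)
    .join('&');
  const base = keycloakBaseUrl.replace(/\/+$/, ''); // drop trailing slashes
  return {
    url: `${base}/realms/${encodeURIComponent(realm)}/protocol/openid-connect/token`,
    headers: { 'content-type': 'application/x-www-form-urlencoded' },
    body,
  };
}
```

The JSON response of that endpoint carries the JWT in its `access_token` field, which is the value the citation call would send as the bearer token.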
| 72 | +## Component shape |
| 73 | + |
| 74 | +A Kubernetes **Job** (not a long-running Deployment) gated on a new dev-only value. Built from the |
| 75 | +`integration-tests/` image (Playwright + node_modules already present) with a non-test entrypoint. |
| 76 | + |
| 77 | +``` |
| 78 | +integration-tests/ |
| 79 | + seed/ |
| 80 | + SPEC.md <- this file |
| 81 | + seed.ts <- standalone entrypoint (launches chromium, composes page objects, then calls the curated-citation endpoint) |
| 82 | + Dockerfile <- (new or extended) builds an image usable as both test-runner and seeder |
| 83 | +``` |
| 84 | + |
| 85 | +`seed.ts` outline (all calls are existing page-object methods unless noted): |
| 86 | + |
| 87 | +```ts |
| 88 | +const browser = await chromium.launch({ headless: true }); |
| 89 | +const page = await browser.newPage({ baseURL: process.env.PLAYWRIGHT_TEST_BASE_URL }); |
| 90 | + |
| 91 | +// idempotency: bail if the seed user already exists (login succeeds) |
| 92 | +if (await new AuthPage(page).login(SEED_USER, SEED_PW)) { log('already seeded'); await browser.close(); process.exit(0); } |
| 93 | + |
| 94 | +await new AuthPage(page).createAccount(seedAccount); |
| 95 | +const groupId = await new GroupPage(page).createGroup(buildTestGroup('seed-group')); |
| 96 | + |
| 97 | +const accessions: string[] = []; |
| 98 | +for (const s of SEED_SEQUENCES) { // ~3 sequences |
| 99 | + const review = await new SubmissionPage(page).completeSubmission( |
| 100 | + { ...s, groupId: String(groupId) }, s.sequenceData); // dummy-organism form |
| 101 | + await review.waitForAllProcessed(); // dummy pipeline runs in-cluster |
| 102 | + await review.releaseAndGoToReleasedSequences(); |
| 103 | + accessions.push(await readAccession(page)); // small helper (parse released table/URL) |
| 104 | +} |
| 105 | + |
| 106 | +const { seqSetId, seqSetVersion } = // createSeqSet returns id+version (parse from URL) |
| 107 | + await new SeqSetPage(page).createSeqSet({ |
| 108 | + name: 'Seed SeqSet', description: 'Auto-seeded for dev', |
| 109 | + focalAccessions: [accessions[0]], backgroundAccessions: accessions.slice(1), |
| 110 | + }); |
| 111 | + |
| 112 | +// citation: call the superuser-only endpoint with a super-user token |
| 113 | +const superUserToken = await getToken('superuser', 'superuser'); // keycloak password grant |
| 114 | +await page.request.post(`${BACKEND_URL}/create-curated-citation`, { |
| 115 | + headers: { authorization: `Bearer ${superUserToken}` }, |
| 116 | + data: { |
| 117 | + seqSetId, seqSetVersion, |
| 118 | + source: { |
| 119 | + sourceDOI: '10.0000/seed-citation-1', title: 'Seed reference publication', |
| 120 | + year: 2024, contributors: [{ givenName: 'Ada', surname: 'Lovelace' }], |
| 121 | + }, |
| 122 | + }, |
| 123 | +}); |
| 124 | +await browser.close(); |
| 125 | +``` |
| 126 | + |
| 127 | +Two small additions to the page-object layer are needed (both trivial, reusable by future tests): |
| 128 | +- `SeqSetPage.createSeqSet` should return `{ seqSetId, seqSetVersion }` (parse from the post-create URL). |
| 129 | +- a `readAccession(page)` helper to pull the accession of a just-released sequence. |
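The first addition, parsing the id and version out of the post-create URL, might look like the sketch below. Assumptions: the URL shape `/seqsets/<seqSetId>.<version>` is a guess that the spec does not confirm, the id is assumed to contain no dots, and `parseSeqSetUrl` and `SeqSetRef` are hypothetical names.

```typescript
// Hypothetical return shape for SeqSetPage.createSeqSet.
interface SeqSetRef {
  seqSetId: string;
  seqSetVersion: number;
}

// Assumption: the post-create URL ends in /seqsets/<seqSetId>.<version>,
// optionally followed by a slash, query string, or fragment.
function parseSeqSetUrl(url: string): SeqSetRef | null {
  const match = /\/seqsets\/([^/.?#]+)\.(\d+)(?:[/?#]|$)/.exec(url);
  if (match === null) {
    return null; // not a SeqSet detail URL; the caller decides how to fail
  }
  return { seqSetId: String(match[1]), seqSetVersion: Number(match[2]) };
}
```

Returning `null` instead of throwing leaves the failure policy to the caller: the seeder can abort, while a test can assert on the value directly.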
| 130 | + |
| 131 | +## Kubernetes wiring |
| 132 | + |
| 133 | +New template `kubernetes/loculus/templates/seed-test-data-job.yaml`: |
| 134 | + |
| 135 | +- `kind: Job`, gated: `{{- if .Values.seedTestData.enabled }}` (whole file). |
| 136 | +- Image: `ghcr.io/loculus-project/integration-tests:{{ $dockerTag }}` (new image built in CI from |
| 137 | + `integration-tests/Dockerfile`), `command: ["node", "seed/seed.js"]`. |
| 138 | +- Env: |
| 139 | + - `PLAYWRIGHT_TEST_BASE_URL: http://loculus-website-service:3000` (verified service name, |
| 140 | + `templates/website-service.yaml`). |
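| 141 | + - `BACKEND_URL` for the citation call (as used in `seed.ts`). No `DB_URL` / `DB_USERNAME` / `DB_PASSWORD` are needed, since the citation goes through the backend endpoint rather than a direct insert. |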
| 142 | +- **Ordering / readiness:** website + backend + dummy-preprocessing must be up before it runs. |
| 143 | + Two viable mechanisms (pick one): |
| 144 | + 1. **ArgoCD PostSync hook** (mirror `templates/ingest.yaml:127` `loculus-ingest-trigger`): |
| 145 | + `argocd.argoproj.io/hook: PostSync`, `backoffLimit`, `ttlSecondsAfterFinished: 600`. |
| 146 | + Cleanest fit with how this repo already bootstraps post-deploy work. |
| 147 | + 2. Plain Job + an init-container that curls `…/website` and `…/backend` health until ready. |
| 148 | + > **Recommendation:** PostSync hook (option 1) — consistent with `ingest-trigger`. |
| 149 | +- `backoffLimit: 1`, `ttlSecondsAfterFinished: 600`, `restartPolicy: Never`. |
| 150 | + |
| 151 | +### Values |
| 152 | + |
| 153 | +`kubernetes/loculus/values.yaml` (default OFF, production-safe): |
| 154 | +```yaml |
| 155 | +seedTestData: |
| 156 | + enabled: false |
| 157 | + user: { username: seed_user, password: seed_user } |
| 158 | + organism: dummy-organism |
| 159 | + sequenceCount: 3 |
| 160 | +``` |
| 161 | +`kubernetes/loculus/values_e2e_and_dev.yaml` (turn ON for dev/E2E): |
| 162 | +```yaml |
| 163 | +seedTestData: |
| 164 | + enabled: true |
| 165 | +``` |
| 166 | +Add the `seedTestData` object to `values.schema.json`, then: |
| 167 | +`npx prettier@3.6.2 --write kubernetes/loculus/values.schema.json` and |
| 168 | +`helm lint kubernetes/loculus -f kubernetes/loculus/values.yaml` (per `kubernetes/AGENTS.md`). |
| 169 | + |
| 170 | +## Idempotency & safety |
| 171 | + |
| 172 | +- Re-running on an already-seeded cluster is a no-op (seed user login check up front). |
| 173 | +- `enabled: false` by default → never runs in production. The Job and its superuser call to |
| 174 | + `POST /create-curated-citation` are only rendered on dev because the whole template is gated. |
| 175 | +- Uses the dummy organism only, so no real pathogen data or real DOIs/CrossRef calls. |
| 176 | + |
| 177 | +## Decisions |
| 178 | + |
| 179 | +1. **Citation mechanism — DECIDED: new superuser-only `POST /create-curated-citation` endpoint** |
| 180 | + (implemented in this branch). Seed job calls it with a super-user token; no DB secret needed. |
| 181 | +2. **Submission driver — DECIDED: Playwright UI**, reusing the integration-test page objects. |
| 182 | + |
| 183 | +## Open questions for reviewer |
| 184 | + |
| 185 | +1. **Trigger:** ArgoCD PostSync hook (recommended) vs. readiness-gated plain Job. |
| 186 | +2. **Image:** extend the existing `integration-tests` image with a `seed/` entrypoint |
| 187 | + (recommended) vs. a separate slimmer image. |
| 188 | +``` |