You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: docs/adr/0001-oss-packages-design-decisions.md
+10-7Lines changed: 10 additions & 7 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -226,10 +226,10 @@ Five sub-workers run concurrently (npm, Maven, OSV, GitHub, Docker Hub), all wri
226
226
|`packages`| Upsert on `purl`. Each worker only writes columns it owns; ecosystem isolation means column-level conflicts cannot occur in practice. |
227
227
|`packages_universe`| Incremental upsert keyed on `purl`. The deps.dev import only touches rows whose underlying deps.dev snapshot date has advanced since the previous import (initial run is a one-time full backfill). |
228
228
|`versions`| Append-only via `INSERT … ON CONFLICT DO NOTHING`. Yanked/deprecated status is a separate targeted `UPDATE (is_yanked = true) WHERE …`. |
229
-
|`repos`| Registry workers (npm, Maven) do **not** write directly to `repos`. They write `package_repos` rows. The GitHub enricher — triggered when `repos.last_synced_at IS NULL` — upserts `repos` with metadata. Docker Hub worker adds `docker_*` columns on top. |
229
+
|`repos`| Registry workers (npm, Maven) do **not** write `repos` enrichment metadata. They INSERT a minimal `repos(url, host)` row — `url` (canonical) and `host` (coarse classification) are both derived from the declared repository URL — solely to create the FK target their `package_repos`link needs. `owner`/`name`/`stars`/`description` and all other metadata stay NULL and remain enricher-owned; existing rows are never updated by registry workers. The GitHub enricher — triggered when `repos.last_synced_at IS NULL` — upserts `repos` with metadata. Docker Hub worker adds `docker_*` columns on top. |
230
230
|`package_repos`| Composite PK `(package_id, repo_url)`. Each `source` value ('declared', 'deps_dev', 'heuristic', 'manual') is a separate row — sources do not overwrite each other. |
231
231
|`advisories`| Upsert on `osv_id`. OSV is the single source of truth; no other worker writes to this table. |
232
-
|`maintainers` / `package_maintainers`|Upsert on `(ecosystem, username)`. Never delete — history is preserved. |
232
+
|`maintainers` / `package_maintainers`|`maintainers`: upsert on `(ecosystem, username)`, never deleted — the identity history is preserved. `package_maintainers`: reflects the **current** link set — the npm worker replaces a package's links each ingest (delete + reinsert), so prior link rows are not retained. |
233
233
|`downloads_daily`| Append-only time-series. Each `(package_id, date)` row is written once. npm and Maven workers own disjoint rows by ecosystem. Historical timelines are preserved — workers do not overwrite past dates. |
234
234
|`downloads_last_30d`| Upsert on `(purl, end_date)`. Written by the weekly ranking worker only. The cached `packages_universe.downloads_last_30d` column must be updated in the same pass. |
235
235
@@ -375,10 +375,13 @@ Local verification against the live OSV dataset (2026-05-28) showed the multi-ra
375
375
376
376
**Strategy**: daily delta poll via the CouchDB changes feed, not a full sync.
377
377
378
-
1. Call `replicate.npmjs.com/_changes?since=<last_seq>&limit=<batch>` to get package names changed in the last 24h.
379
-
2. For each changed name, fetch the full document from `registry.npmjs.com/<package>`.
380
-
3. Normalize into `packages`, `versions`, `maintainers`, and `package_maintainers` using the write rules above.
381
-
4. Downloads (timeline): call `api.npmjs.org/downloads/range/<start>:<end>/<package>` in batches of **128 packages per call** against the critical list. Write per-day rows to `downloads_daily`. Update `packages_universe.downloads_last_30d` at the end of each pass.
378
+
1. Call `replicate.npmjs.com/_changes?since=<last_seq>&limit=<batch>` to get package names changed since the last run.
379
+
2. Filter changed names against the `packages` table (~700k packages). Only packages already tracked in Tier 2 are re-ingested — unknown packages in the changes feed are ignored.
380
+
3. For each matching changed name, fetch the full document from `registry.npmjs.com/<package>`.
381
+
4. Normalize into `packages`, `versions`, `maintainers`, and `package_maintainers` using the write rules above.
382
+
4. Downloads: two Temporal workflows — `backfillDailyDownloads` (per-day rows into `downloads_daily`) and `refreshLast30dDownloads` (rolling 30-day windows into `downloads_last_30d`). Both are self-healing: they detect and fill missing windows on each run rather than assuming continuity. Both currently source packages from a static watch list. Once the deps.dev BQ import is operational, `backfillDailyDownloads` will source from `packages` (Tier 2 critical slice) and `refreshLast30dDownloads` will source from `packages_universe` (full Tier 3 population).
383
+
384
+
**Rolling-30-day window shape**: each `downloads_last_30d` window uses `end_date = 1st of calendar month, start_date = end_date − 30 days` — not a true calendar month. This ensures every window covers exactly 30 days, making download counts directly comparable across months for criticality scoring. Calendar months (28–31 days) would skew comparisons.
382
385
383
386
The `_changes``seq` cursor is persisted between runs (in a `worker_state` row or env var) so each poll starts where the last one ended. If the cursor is lost the worker falls back to a full re-sync of the critical list.
384
387
@@ -557,7 +560,7 @@ SET count = EXCLUDED.count,
557
560
start_date =EXCLUDED.start_date;
558
561
```
559
562
560
-
The weekly ranking worker must write both the `downloads_last_30d` row and the cached `packages_universe.downloads_last_30d` column in the same pass — if only one write happens, the ranking function silently uses a stale value. Code review must enforce this.
563
+
The writer of a `downloads_last_30d` row must also update the cached `packages_universe.downloads_last_30d` column in the same pass when writing the latest window — if only one write happens, the ranking function silently uses a stale value. Code review must enforce this. For npm, this responsibility belongs to the npm worker's `refreshLast30dDownloads` workflow.
561
564
562
565
A package promoted from Tier 3 to Tier 2 (becomes critical) will have rolling-window history but no daily history; the daily timeline starts from the promotion date forward.
0 commit comments