feat: store multiple repo licenses as array IN-1099 #4105
Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
Pull request overview
This PR upgrades repository license storage from a single SPDX string to a multi-value array and threads that change through the git integration ingestion path, the TypeScript DAL, and Tinybird analytics so multiple detected licenses can be persisted and queried.
Changes:
- Replace `public.repositories.license` with `public.repositories.licenses` (`varchar[]`) and update git integration to write detected license arrays.
- Update DAL repository type + queries to select `licenses`.
- Extend Tinybird repository/project datasets and pipes to ingest `licenses` and expose flattened `(repoUrl, licenseId)` pairs as `repoLicenses`.
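To make the flattened `(repoUrl, licenseId)` shape concrete, here is a minimal Python sketch of the same reshaping that the Tinybird pipe is described as doing. It is only an illustration of the data shape, not the pipe itself: the function name `flatten_repo_licenses`, the `project_id` grouping key, and the example URLs are hypothetical.

```python
from collections import defaultdict


def flatten_repo_licenses(rows):
    """Group repositories by project and flatten their license arrays.

    `rows` is an iterable of (project_id, repo_url, licenses) tuples, where
    `licenses` is a list of SPDX IDs or None. The grouping key and the row
    layout are assumptions made for this illustration.
    """
    by_project = defaultdict(list)
    for project_id, repo_url, licenses in rows:
        # One (repo_url, license_id) pair per license; a repository with no
        # licenses contributes nothing but still creates the project entry.
        by_project[project_id].extend(
            (repo_url, license_id) for license_id in (licenses or [])
        )
    return dict(by_project)


rows = [
    ("p1", "https://example.com/a", ["MIT", "Apache-2.0"]),
    ("p1", "https://example.com/b", []),
    ("p2", "https://example.com/c", None),
]
repo_licenses = flatten_repo_licenses(rows)
```

Keeping the repository URL in every pair means a consumer can still tell which repository a license came from after the per-project aggregation.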
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| services/libs/tinybird/pipes/insightsProjects_filtered.pipe | Exposes repoLicenses in the filtered insights projects output. |
| services/libs/tinybird/pipes/insights_projects_populated_copy.pipe | Pulls repo licenses and produces flattened repoLicenses tuples per project. |
| services/libs/tinybird/datasources/repositories.datasource | Adds licenses Array(String) to the repositories datasource schema. |
| services/libs/tinybird/datasources/insights_projects_populated_ds.datasource | Adds repoLicenses Array(Tuple(String, String)) to populated projects schema. |
| services/libs/data-access-layer/src/repositories/index.ts | Renames repository field to licenses and updates selects accordingly. |
| services/apps/git_integration/src/crowdgit/worker/repository_worker.py | Writes detected licenses array to Postgres during first-batch processing. |
| services/apps/git_integration/src/crowdgit/services/license/license_service.py | Returns a list of SPDX IDs (or [] / ['NOASSERTION']) instead of a single value. |
| services/apps/git_integration/src/crowdgit/database/crud.py | Renames and updates CRUD method to persist licenses array column. |
| backend/src/database/migrations/V1778600068__removeLicenseAddLicensesToRepositories.sql | Drops license column and adds licenses array column. |
| backend/src/database/migrations/U1778600068__removeLicenseAddLicensesToRepositories.sql | Rollback migration to drop licenses and re-add license. |
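backend/src/database/migrations/V1778600068__removeLicenseAddLicensesToRepositories.sql
ALTER TABLE public.repositories DROP COLUMN license;
ALTER TABLE public.repositories ADD COLUMN licenses VARCHAR(255)[];

backend/src/database/migrations/U1778600068__removeLicenseAddLicensesToRepositories.sql
ALTER TABLE public.repositories DROP COLUMN licenses;
ALTER TABLE public.repositories ADD COLUMN license VARCHAR(255);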
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit c471e3b.
# licensee puts per-file confidence inside each matched_file's matcher object.
confidence_by_spdx: dict[str, float] = {}
for mf in matched_files:
    spdx = (mf.get("matched_license") or {}).get("spdx_id") or ""
Confidence map may never populate due to field access
Medium Severity
The code accesses (mf.get("matched_license") or {}).get("spdx_id") assuming matched_license is a dict. If the licensee gem serializes matched_license as a plain string (the SPDX ID directly, e.g. "MIT"), the or {} fallback won't trigger (since a non-empty string is truthy), and calling .get("spdx_id") on a string raises AttributeError. This would be caught by the outer except Exception on line 83, causing the function to always return [] — silently disabling license detection for all repositories. Even if matched_license is correctly a dict, the confidence_by_spdx map remains empty if the key differs between matched_files and licenses entries, causing all licenses to bypass confidence filtering.
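One way to guard against both shapes raised in this comment is to check the type of `matched_license` before reading from it. The sketch below is an assumption-laden illustration, not the PR's code: the helper names `extract_spdx_id` and `build_confidence_map` are hypothetical, and the `matcher` / `confidence` keys are inferred from the code comment in the diff rather than confirmed against licensee's JSON output.

```python
from typing import Any


def extract_spdx_id(matched_file: dict[str, Any]) -> str:
    """Return the SPDX ID whether matched_license is a dict or a plain string."""
    matched = matched_file.get("matched_license")
    if isinstance(matched, dict):
        return str(matched.get("spdx_id") or "")
    if isinstance(matched, str):
        return matched
    return ""


def build_confidence_map(matched_files: list[dict[str, Any]]) -> dict[str, float]:
    """Map each SPDX ID to the highest confidence seen across matched files."""
    confidence_by_spdx: dict[str, float] = {}
    for mf in matched_files:
        spdx = extract_spdx_id(mf)
        if not spdx:
            continue
        # Assumed layout: {"matcher": {"confidence": <number>}}.
        matcher = mf.get("matcher")
        confidence = matcher.get("confidence") if isinstance(matcher, dict) else None
        if isinstance(confidence, (int, float)):
            previous = confidence_by_spdx.get(spdx, 0.0)
            confidence_by_spdx[spdx] = max(previous, float(confidence))
    return confidence_by_spdx


sample = [
    {"matched_license": {"spdx_id": "MIT"}, "matcher": {"confidence": 100}},
    {"matched_license": "Apache-2.0", "matcher": {"confidence": 95.5}},
    {"matched_license": None},
    {"matched_license": {"spdx_id": "MIT"}, "matcher": {"confidence": 90}},
]
confidence_map = build_confidence_map(sample)
```

With explicit type checks, an unexpected payload shape yields an empty SPDX ID for that file instead of an `AttributeError` that the outer handler would swallow.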


Summary
- Replace `license VARCHAR(255)` column with `licenses VARCHAR(255)[]` array on `public.repositories`
- Update `license_service.py` to return all detected licenses (applying the 98% confidence threshold per-license; returns `["NOASSERTION"]` when files exist but none pass, `[]` when nothing found)
- Rename `update_repository_license` → `update_repository_licenses` in `crud.py` to write the full array
- Update `IRepository` TypeScript type and both `SELECT` queries to use `licenses`
- Add `licenses Array(String)` to Tinybird `repositories.datasource` (Sequin sync)
- Add `repoLicenses Array(Tuple(String, String))`, flat `(repoUrl, licenseId)` pairs, in `insights_projects_populated_copy.pipe` via `arrayFlatten(groupArray(arrayMap(...)))`
- Add `repoLicenses` to `insights_projects_populated_ds.datasource` and expose through `insightsProjects_filtered.pipe`

Changes
- `V1778600068__removeLicenseAddLicensesToRepositories.sql` / `U*.sql`
- `license_service.py`, `crud.py`, `repository_worker.py`
- `services/libs/data-access-layer/src/repositories/index.ts`
- `repositories.datasource`, `insights_projects_populated_ds.datasource`
- `insights_projects_populated_copy.pipe`, `insightsProjects_filtered.pipe`

Ticket
https://linuxfoundation.atlassian.net/browse/IN-1099
Note
Medium Risk
Medium risk due to a Postgres schema change and downstream contract updates (git worker writes, TS DAL reads, Tinybird schemas/pipes) that can break deployments if not migrated and synced in lockstep.
Overview
Stores multiple licenses per repository. Replaces `public.repositories.license` with `licenses` (`VARCHAR[]`) via forward/backward migrations.

Updates git integration license detection to return all SPDX IDs above the confidence threshold (otherwise `[]` or `["NOASSERTION"]`), and writes the array via renamed `update_repository_licenses`.

Propagates the new `licenses` field through the TypeScript DAL selects/types and Tinybird ingestion/analytics by adding `licenses` to the `repositories` datasource and emitting/exposing `repoLicenses` (repo URL, license) tuples in populated project pipes.
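The three return shapes described for license detection (all passing IDs, `["NOASSERTION"]`, or `[]`) can be sketched as follows. This is a hypothetical illustration, not the code in `license_service.py`: the function name `select_licenses`, the `(spdx_id, confidence)` input layout, and the choice of an inclusive `>=` comparison are all assumptions; only the 98% threshold and the three outcomes come from the PR description.

```python
CONFIDENCE_THRESHOLD = 98.0  # percent, as stated in the PR summary


def select_licenses(
    candidates: list[tuple[str, float]],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> list[str]:
    """Pick the SPDX IDs to store from (spdx_id, confidence) candidates.

    - no candidates at all         -> []
    - candidates, none pass        -> ["NOASSERTION"]
    - one or more candidates pass  -> passing IDs, de-duplicated in order
    """
    if not candidates:
        return []
    # Assumption: a confidence equal to the threshold counts as passing.
    passing = [spdx for spdx, confidence in candidates if confidence >= threshold]
    # dict.fromkeys keeps first-seen order while removing duplicates.
    unique = list(dict.fromkeys(passing))
    return unique or ["NOASSERTION"]


nothing_found = select_licenses([])
multi = select_licenses([("MIT", 100.0), ("Apache-2.0", 98.0), ("MIT", 99.0)])
none_pass = select_licenses([("GPL-3.0-only", 70.0)])
```

Returning `[]` and `["NOASSERTION"]` as distinct values lets downstream consumers tell "no license files found" apart from "license files found but not confidently identified".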