feat: fetch and store repo license via licensee IN-1105 #4095

Merged
gaspergrom merged 8 commits into main from feat/IN-1105-fetch-store-repo-license-via-licensee
May 11, 2026

@gaspergrom gaspergrom commented May 8, 2026

Summary

  • Adds license column (VARCHAR(255)) to public.repositories via a new migration
  • Installs the licensee Ruby gem (v9.15.3, the last version compatible with Ruby 2.7 on Debian Bullseye) in the git integration Docker image, along with libgit2 build and runtime deps required by the rugged gem
  • Implements LicenseService that runs licensee detect --json <repo_path> and extracts the SPDX identifier from the JSON output
  • Wires the service into the repository worker's first-batch hook, alongside the existing software-value and vulnerability-scanner calls
  • Persists the detected SPDX ID (e.g. MIT, Apache-2.0, BSD-3-Clause) to public.repositories.license via a new update_repository_license CRUD helper
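
The SPDX extraction step described above might look roughly like the sketch below. This is not the PR's code: the helper name `extract_spdx_id` is hypothetical, and the JSON shape (a top-level `licenses` array whose entries carry an `spdx_id` field) is an assumption about licensee's output. SPDX reserves `NONE` and `NOASSERTION` as special values, and the sketch chooses to treat both as "no license".

```python
import json
from typing import Optional

# SPDX reserves these two values; neither names an actual license.
NON_LICENSE_IDS = {"NONE", "NOASSERTION"}


def extract_spdx_id(raw_json: str) -> Optional[str]:
    """Return the first usable SPDX ID from licensee-style JSON, or None.

    Assumed (not confirmed by the PR) output shape:
        {"licenses": [{"spdx_id": "MIT", ...}], ...}
    """
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    licenses = data.get("licenses")
    if not isinstance(licenses, list):
        return None
    for entry in licenses:
        spdx_id = entry.get("spdx_id") if isinstance(entry, dict) else None
        if isinstance(spdx_id, str) and spdx_id and spdx_id not in NON_LICENSE_IDS:
            return spdx_id
    return None
```

Keeping the parsing separate from the subprocess call makes it testable without the licensee binary installed.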

Changes

  • backend/src/database/migrations/V1778154987__addLicenseToRepositories.sql — add license column
  • backend/src/database/migrations/U1778154987__addLicenseToRepositories.sql — undo migration
  • scripts/services/docker/Dockerfile.git_integration — install licensee v9.15.3 + libgit2 deps
  • services/apps/git_integration/src/crowdgit/services/license/license_service.py — new service
  • services/apps/git_integration/src/crowdgit/services/license/__init__.py — module init
  • services/apps/git_integration/src/crowdgit/services/__init__.py — export LicenseService
  • services/apps/git_integration/src/crowdgit/worker/repository_worker.py — wire service
  • services/apps/git_integration/src/crowdgit/database/crud.py — add update_repository_license

Note

Medium Risk
Adds a new license column and a new external dependency (licensee Ruby gem) executed during repository processing, which may impact worker runtime and database migrations if detection is slow or the tool behaves unexpectedly.

Overview
Adds a new public.repositories.license (VARCHAR(255)) column with forward/undo migrations and updates the data-access layer to select this field.

Extends the git-integration worker to run a new LicenseService on the first clone batch, using the licensee gem to detect an SPDX ID and persist it via a new update_repository_license DB helper.
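
As an illustration of the first-batch wiring, here is a small self-contained sketch. Everything in it is a stand-in: `BatchInfo`, its `is_first_batch` field, `FakeLicenseService`, and the in-memory `LICENSES` dict are hypothetical and only mimic the shape of the worker, the service, and the `public.repositories.license` column.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class BatchInfo:
    """Hypothetical stand-in for the worker's batch descriptor."""
    repo_path: str
    is_first_batch: bool


class FakeLicenseService:
    """Stub that returns a fixed SPDX ID instead of calling licensee."""

    async def detect(self, repo_path: str) -> Optional[str]:
        return "MIT"


# In-memory stand-in for public.repositories.license, keyed by repository ID.
LICENSES: Dict[str, Optional[str]] = {}


async def update_repository_license(repo_id: str, spdx_id: Optional[str]) -> None:
    """Stand-in for the DB helper; the real one would issue an UPDATE."""
    LICENSES[repo_id] = spdx_id


async def process_batch(repo_id: str, batch: BatchInfo, service: FakeLicenseService) -> None:
    # License detection runs once per clone, on the first batch only.
    if batch.is_first_batch:
        spdx_id = await service.detect(batch.repo_path)
        await update_repository_license(repo_id, spdx_id)
    # Per-batch commit processing would follow here in a real worker.
```

Running detection only on the first batch keeps the cost of the external tool to one invocation per clone rather than one per batch.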

Updates the git-integration Docker image to install Ruby + licensee (and required libgit2 build/runtime deps) so license detection can run in production.

Reviewed by Cursor Bugbot for commit ef58578.

@gaspergrom gaspergrom self-assigned this May 8, 2026
@gaspergrom gaspergrom requested review from Copilot and themarolt May 8, 2026 09:11
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

themarolt previously approved these changes May 8, 2026

Copilot AI left a comment

Pull request overview

This PR adds repository license detection to the git integration pipeline and persists the detected SPDX identifier into the main public.repositories table, enabling downstream consumers to query repository license metadata.

Changes:

  • Adds a license column to public.repositories (with rollback migration).
  • Extends the git integration Docker image to install the licensee gem and its libgit2 build/runtime dependencies.
  • Introduces LicenseService (invokes licensee detect --json) and wires it into the repository worker’s first-batch processing, persisting results via a new CRUD helper.
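
The `licensee detect --json` invocation goes through a shell-command helper with a timeout. The repository's own helper is not shown in this conversation, so the following is only a sketch of how such a helper could be built on `asyncio`; the names `run_command` and `CommandFailed` are hypothetical.

```python
import asyncio
from typing import List


class CommandFailed(Exception):
    """Hypothetical error type: missing binary, non-zero exit, or timeout."""


async def run_command(argv: List[str], timeout: float) -> str:
    """Run argv without a shell and return its stdout, or raise CommandFailed."""
    try:
        proc = await asyncio.create_subprocess_exec(
            *argv,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
    except FileNotFoundError as exc:
        raise CommandFailed(f"binary not found: {argv[0]}") from exc
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()  # do not leave the child process running
        await proc.wait()
        raise CommandFailed(f"timed out after {timeout}s: {argv[0]}")
    if proc.returncode != 0:
        detail = stderr.decode(errors="replace").strip()
        raise CommandFailed(f"exit code {proc.returncode}: {detail}")
    return stdout.decode()
```

A caller would use it as `await run_command(["licensee", "detect", "--json", repo_path], timeout=60)`. Passing the arguments as a list avoids shell interpolation of the repository path.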

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| backend/src/database/migrations/V1778154987__addLicenseToRepositories.sql | Adds license column to public.repositories. |
| backend/src/database/migrations/U1778154987__addLicenseToRepositories.sql | Drops license column on rollback. |
| scripts/services/docker/Dockerfile.git_integration | Installs Ruby + licensee and required libgit2/toolchain deps in the git integration image. |
| services/apps/git_integration/src/crowdgit/services/license/license_service.py | New async service to execute licensee and parse SPDX from JSON output. |
| services/apps/git_integration/src/crowdgit/services/license/__init__.py | Exports LicenseService from the license service module. |
| services/apps/git_integration/src/crowdgit/services/__init__.py | Re-exports LicenseService at the services package level. |
| services/apps/git_integration/src/crowdgit/worker/repository_worker.py | Runs license detection on first clone batch and writes the result to DB. |
| services/apps/git_integration/src/crowdgit/database/crud.py | Adds update_repository_license helper to persist SPDX ID. |

Comment thread services/apps/git_integration/src/crowdgit/services/license/license_service.py Outdated
Comment thread services/apps/git_integration/src/crowdgit/database/crud.py Outdated
Comment thread services/libs/data-access-layer/src/repositories/index.ts
gaspergrom added 3 commits May 8, 2026 10:37
Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
…N-1105

Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
@gaspergrom gaspergrom force-pushed the feat/IN-1105-fetch-store-repo-license-via-licensee branch from b02ba60 to 58d4968 Compare May 8, 2026 09:37
Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
Copilot AI review requested due to automatic review settings May 8, 2026 09:39

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comment thread services/apps/git_integration/src/crowdgit/database/crud.py Outdated
Comment thread services/libs/data-access-layer/src/repositories/index.ts
Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
Comment thread services/apps/git_integration/src/crowdgit/database/crud.py
@gaspergrom gaspergrom requested a review from themarolt May 8, 2026 10:18
Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
Copilot AI review requested due to automatic review settings May 8, 2026 10:59

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
themarolt previously approved these changes May 8, 2026
Copilot AI review requested due to automatic review settings May 11, 2026 08:00

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comment on lines +11 to +19
async def detect(self, repo_path: str) -> str | None:
    """Run licensee against repo_path and return the SPDX identifier, or None."""
    try:
        output = await run_shell_command(
            ["licensee", "detect", "--json", repo_path], timeout=60
        )
    except CommandExecutionError:
        self.logger.info(f"licensee found no license in {repo_path}")
        return None

)
await self.maintainer_service.process_maintainers(repository, batch_info)
license_spdx = await self.license_service.detect(batch_info.repo_path)
await update_repository_license(repository.id, license_spdx)
Comment on lines +11 to +16
async def detect(self, repo_path: str) -> str | None:
    """Run licensee against repo_path and return the SPDX identifier, or None."""
    try:
        output = await run_shell_command(
            ["licensee", "detect", "--json", repo_path], timeout=60
        )

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit ef58578.

        return None
    except Exception as e:
        self.logger.warning(f"licensee failed: {repr(e)}")
        return None

Transient detection errors silently clear existing license data

Medium Severity

LicenseService.detect() returns None for both "no license exists" and all error conditions (timeout, binary not found, parse failure, etc.). The caller in repository_worker.py unconditionally passes this result to update_repository_license, which will overwrite a previously valid license (e.g. "MIT") with NULL when the tool fails transiently. The IS DISTINCT FROM guard won't help because 'MIT' IS DISTINCT FROM NULL evaluates to TRUE. A persistent issue like a missing licensee binary would gradually erase all stored license data.
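
One possible way to address this, sketched under assumptions rather than taken from the PR: have detection return an explicit outcome so the caller can tell "no license" from "detection failed", and skip the write on failure. The names `Outcome`, `Detection`, `is_distinct_from`, and `apply_detection` are hypothetical, and a plain dict stands in for the `license` column.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Outcome(Enum):
    FOUND = "found"            # the tool reported an SPDX ID
    NO_LICENSE = "no_license"  # the tool ran cleanly and found nothing
    ERROR = "error"            # timeout, missing binary, parse failure, ...


@dataclass(frozen=True)
class Detection:
    outcome: Outcome
    spdx_id: Optional[str] = None


def is_distinct_from(a: Optional[str], b: Optional[str]) -> bool:
    """Null-safe inequality, mirroring SQL's IS DISTINCT FROM."""
    return a != b


def apply_detection(store: Dict[str, Optional[str]], repo_id: str, result: Detection) -> bool:
    """Write the license unless detection errored; return True if a write happened."""
    if result.outcome is Outcome.ERROR:
        # Keep whatever is stored: a transient failure is not "no license".
        return False
    if not is_distinct_from(store.get(repo_id), result.spdx_id):
        return False  # value unchanged, nothing to write
    store[repo_id] = result.spdx_id
    return True
```

With this shape, a stored `"MIT"` survives an `ERROR` result and is cleared only when the tool ran successfully and found no license.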


@gaspergrom gaspergrom merged commit 2537232 into main May 11, 2026
19 checks passed
@gaspergrom gaspergrom deleted the feat/IN-1105-fetch-store-repo-license-via-licensee branch May 11, 2026 08:14
4 participants