Commit d2e767e

feat: vulnerability scanning within git integration (#3892)
Signed-off-by: anil <epipav@gmail.com>
1 parent 76747ce commit d2e767e

25 files changed

Lines changed: 2198 additions & 12 deletions

backend/.env.dist.composed

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ CROWD_S3_HOST="s3"
 # Db settings
 CROWD_DB_READ_HOST="db"
 CROWD_DB_WRITE_HOST="db"
+INSIGHTS_DB_WRITE_HOST="db"

 # Product DB settings
 PRODUCT_DB_HOST=product

backend/.env.dist.local

Lines changed: 8 additions & 0 deletions
@@ -38,6 +38,14 @@ CROWD_DB_USERNAME=postgres
 CROWD_DB_PASSWORD=example
 CROWD_DB_DATABASE=crowd-web

+INSIGHTS_DB_WRITE_HOST=localhost
+INSIGHTS_DB_USERNAME=postgres
+INSIGHTS_DB_PASSWORD=example
+INSIGHTS_DB_DATABASE=insights
+INSIGHTS_DB_PORT=5432
+INSIGHTS_DB_POOL_MAX=10
+INSIGHTS_DB_SSLMODE=disable
+
 # Product DB settings
 PRODUCT_DB_HOST=localhost
 PRODUCT_DB_PORT=5433

scripts/services/docker/Dockerfile.git_integration

Lines changed: 22 additions & 1 deletion
@@ -22,6 +22,23 @@ COPY ./services/apps/git_integration/src/crowdgit/services/software_value/ ./
 # Build the binary
 RUN CGO_ENABLED=0 GOOS=linux go build -ldflags "-w -s" -o software-value ./

+# Go builder stage 2: build the vulnerability-scanner binary
+FROM golang:1.25-alpine AS go-vuln-builder
+
+WORKDIR /go/src/vulnerability-scanner
+
+# Copy module files first for dependency caching
+COPY ./services/apps/git_integration/src/crowdgit/services/vulnerability_scanner/go.mod ./
+COPY ./services/apps/git_integration/src/crowdgit/services/vulnerability_scanner/go.sum ./
+
+# Copy source code
+COPY ./services/apps/git_integration/src/crowdgit/services/vulnerability_scanner/ ./
+
+# Download dependencies and build, using cache mounts to avoid re-downloading on every build
+RUN --mount=type=cache,target=/go/pkg/mod \
+    --mount=type=cache,target=/root/.cache/go-build \
+    go mod download && CGO_ENABLED=0 GOOS=linux go build -ldflags "-w -s" -o vulnerability-scanner ./
+
 # Builder stage: install build dependencies, uv, and dependencies
 FROM base AS builder

@@ -84,13 +101,17 @@ COPY --from=builder /usr/crowd/app /usr/crowd/app
 COPY --from=go-builder /go/src/software-value/software-value /usr/local/bin/software-value
 COPY --from=go-builder /go/bin/scc /usr/local/bin/scc

+# Copy vulnerability-scanner binary from go-vuln-builder stage
+COPY --from=go-vuln-builder /go/src/vulnerability-scanner/vulnerability-scanner /usr/local/bin/vulnerability-scanner
+
 # Add virtual environment bin to PATH
 ENV PATH="/usr/crowd/app/.venv/bin:$PATH"

 # Make runner script and binaries executable
 RUN chmod +x ./src/runner.sh \
     && chmod +x /usr/local/bin/software-value \
-    && chmod +x /usr/local/bin/scc
+    && chmod +x /usr/local/bin/scc \
+    && chmod +x /usr/local/bin/vulnerability-scanner

 EXPOSE 8085

services/apps/git_integration/src/crowdgit/enums.py

Lines changed: 1 addition & 0 deletions
@@ -70,3 +70,4 @@ class OperationType(str, Enum):
     COMMIT = "Commit"
     MAINTAINER = "Maintainer"
     SOFTWARE_VALUE = "SoftwareValue"
+    VULNERABILITY_SCAN = "VulnerabilityScanner"

services/apps/git_integration/src/crowdgit/errors.py

Lines changed: 1 addition & 0 deletions
@@ -56,6 +56,7 @@ class PermissionError(CrowdGitError):
 class CommandExecutionError(CrowdGitError):
     error_message: str = "Command execution failed"
     error_code: ErrorCode = ErrorCode.SHELL_COMMAND_FAILED
+    returncode: int | None = None


 @dataclass

services/apps/git_integration/src/crowdgit/server.py

Lines changed: 3 additions & 0 deletions
@@ -11,6 +11,7 @@
     MaintainerService,
     QueueService,
     SoftwareValueService,
+    VulnerabilityScannerService,
 )
 from crowdgit.settings import WORKER_SHUTDOWN_TIMEOUT_SEC
 from crowdgit.worker.repository_worker import RepositoryWorker
@@ -25,13 +26,15 @@ async def lifespan(app: FastAPI) -> AsyncIterator[None]:
     queue_service = QueueService()
     commit_service = CommitService(queue_service=queue_service)
     software_value_service = SoftwareValueService()
+    vulnerability_scanner_service = VulnerabilityScannerService()
     maintainer_service = MaintainerService()

     worker_task = None
     worker = RepositoryWorker(
         clone_service=clone_service,
         commit_service=commit_service,
         software_value_service=software_value_service,
+        vulnerability_scanner_service=vulnerability_scanner_service,
         maintainer_service=maintainer_service,
         queue_service=queue_service,
     )

services/apps/git_integration/src/crowdgit/services/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -4,12 +4,16 @@
 from crowdgit.services.maintainer.maintainer_service import MaintainerService
 from crowdgit.services.queue.queue_service import QueueService
 from crowdgit.services.software_value.software_value_service import SoftwareValueService
+from crowdgit.services.vulnerability_scanner.vulnerability_scanner_service import (
+    VulnerabilityScannerService,
+)

 __all__ = [
     "BaseService",
     "CloneService",
     "CommitService",
     "SoftwareValueService",
+    "VulnerabilityScannerService",
     "MaintainerService",
     "QueueService",
 ]

services/apps/git_integration/src/crowdgit/services/utils.py

Lines changed: 37 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
11
import asyncio
2+
import logging
23
import re
34
from urllib.parse import urlparse
45

@@ -167,6 +168,7 @@ async def run_shell_command(
167168
cwd: str = None,
168169
timeout: float | None = None,
169170
input_text: str | bytes | None = None,
171+
stderr_logger: logging.Logger | None = None,
170172
) -> str:
171173
"""
172174
Run shell command asynchronously and return output on success, raise exception on failure.
@@ -176,6 +178,7 @@ async def run_shell_command(
176178
cwd: Working directory
177179
timeout: Command timeout in seconds
178180
input_text: Text (str) or bytes to send to stdin (will automatically append newline if not present)
181+
stderr_logger: If provided, a logger whose .info() method is called with each stderr line in real-time
179182
180183
Returns:
181184
str: Command stdout output
@@ -219,17 +222,37 @@ async def run_shell_command(
                     input_text += "\n"
                 stdin_input = input_text.encode("utf-8")

-        # Wait for completion with optional timeout
-        if timeout:
-            stdout, stderr = await asyncio.wait_for(
-                process.communicate(input=stdin_input), timeout=timeout
-            )
+        if stderr_logger:
+            stderr_lines: list[str] = []
+
+            async def _run_with_stderr_logging() -> bytes:
+                async def _stream() -> None:
+                    async for raw_line in process.stderr:
+                        line = _safe_decode(raw_line).rstrip()
+                        if line:
+                            stderr_logger.info(line)
+                            stderr_lines.append(line)
+
+                stdout, _ = await asyncio.gather(process.stdout.read(), _stream())
+                await process.wait()
+                return stdout
+
+            coro = _run_with_stderr_logging()
+            stdout = await (asyncio.wait_for(coro, timeout=timeout) if timeout else coro)
+            stdout_text = _safe_decode(stdout).strip() if stdout else ""
+            stderr_text = "\n".join(stderr_lines)
         else:
-            stdout, stderr = await process.communicate(input=stdin_input)
+            # Wait for completion with optional timeout
+            if timeout:
+                stdout, stderr = await asyncio.wait_for(
+                    process.communicate(input=stdin_input), timeout=timeout
+                )
+            else:
+                stdout, stderr = await process.communicate(input=stdin_input)

-        # Handle potentially non-UTF-8 encoded output from git commands
-        stdout_text = _safe_decode(stdout).strip() if stdout else ""
-        stderr_text = _safe_decode(stderr).strip() if stderr else ""
+            # Handle potentially non-UTF-8 encoded output from git commands
+            stdout_text = _safe_decode(stdout).strip() if stdout else ""
+            stderr_text = _safe_decode(stderr).strip() if stderr else ""

         # Check return code
         if process.returncode == 0:
@@ -248,8 +271,11 @@ async def run_shell_command(
                 logger.error(f"Permission error: {stderr_text}")
                 raise PermissionError(f"Permission denied while running: {command_str}")
             else:
-                logger.error(f"Command error: {stderr_text}")
-                raise CommandExecutionError(f"Command failed: {command_str} - {stderr_text}")
+                logger.error(f"Command failed (exit {process.returncode}): {stderr_text}")
+                raise CommandExecutionError(
+                    f"Command failed (exit {process.returncode}): {command_str} - {stderr_text}",
+                    returncode=process.returncode,
+                )

     except asyncio.TimeoutError:
         logger.error(f"Command timed out after {timeout}s: {command_str}")
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+# Binaries
+*.exe
+*.exe~
+*.dll
+*.so
+*.dylib
+vulnerability-scanner
+
+# Test binary
+*.test
+
+# Coverage
+*.out
+
+# Go workspace
+go.work
+
+# Go dependencies (use vendor/ instead)
+vendor/
+
+# Environment
+.env
Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
# Vulnerability Scanner

Scans a cloned git repository for known vulnerabilities using the [Google OSV Scanner SDK](https://pkg.go.dev/github.com/google/osv-scanner/v2/pkg/osvscanner) and persists the results to the insights database.

## How it fits into the pipeline

The repository worker processes repos through a fixed sequence of services per clone batch:

```
Clone → [first batch only] → SoftwareValue → VulnerabilityScanner → Maintainer → Commits → ...
```

The vulnerability scanner runs on the first batch only, when the full working directory is available. The Python worker calls the scanner via subprocess, passing the local repo path and the canonical repo URL. The Go binary does all the scanning, writes results directly to the database, and then exits. The Python side only tracks whether it succeeded or failed.

## Architecture: why a Go binary wrapped in Python

The rest of the git integration is Python, but the OSV Scanner is a Go library with no Python bindings. Rather than shelling out to the `osv-scanner` CLI (which is fragile and adds an extra process layer), we embed the scanner as a Go SDK dependency and call it programmatically. This gives us full control over scan parameters, result access, and error handling.

The binary follows the same **subprocess + JSON stdout** pattern used by the `software-value` service. The Python wrapper calls it, reads the JSON response from stdout, and treats it as a black box. The binary exits with code 0 under normal operation; errors are communicated through the JSON payload. The one exception is scan timeout: if the scan exceeds the time limit, the binary calls `os.Exit(124)` without emitting JSON. The Python wrapper handles this by catching exit code 124 and retrying with `--no-transitive` to skip transitive dependency resolution (see the transitive scanning and OOM handling sections below).
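
A minimal sketch of the wrapper side of this contract may help. It is not the project's actual wrapper: `run_scan`, `subprocess_runner`, and the injectable `run` parameter are hypothetical names, and the argument layout of the binary (positional repo path and URL, flag appended at the end) is an assumption.

```python
import json
import subprocess
from typing import Callable

TIMEOUT_EXIT_CODE = 124  # per the text above: the binary exits with 124 on scan timeout

# A runner takes an argv list and returns (returncode, stdout).
Runner = Callable[[list[str]], tuple[int, str]]


def subprocess_runner(argv: list[str]) -> tuple[int, str]:
    """Run the command for real and capture its stdout."""
    proc = subprocess.run(argv, capture_output=True, text=True, check=False)
    return proc.returncode, proc.stdout


def run_scan(repo_path: str, repo_url: str, run: Runner = subprocess_runner) -> dict:
    """Invoke the scanner once; on exit code 124, retry once with --no-transitive."""
    # Assumption: positional arguments in this order; the real CLI may differ.
    argv = ["vulnerability-scanner", repo_path, repo_url]
    returncode, stdout = run(argv)
    if returncode == TIMEOUT_EXIT_CODE:
        # A timeout produces no JSON, so retry with direct dependencies only.
        returncode, stdout = run(argv + ["--no-transitive"])
    if returncode != 0:
        raise RuntimeError(f"scanner exited with code {returncode}")
    # Exit code 0: the outcome (including errors) is carried in the JSON payload.
    return json.loads(stdout)
```

Injecting the runner keeps the retry logic testable without the real binary: a fake runner can return `(124, "")` on the first call and a JSON payload on the second.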

## Key design decisions

### OSV Scanner uses sentinel errors for expected outcomes

OSV Scanner returns sentinel errors for two non-failure cases:

- `ErrVulnerabilitiesFound`: the scanner found vulnerabilities. Like `grep` returning exit code 1 on a match, this is an expected outcome. We treat it as success and surface the results normally.
- `ErrNoPackagesFound`: the repo contains no scannable package manifests (e.g. a pure C or shell project). This is not a failure: the scanner ran correctly and determined there is nothing to check. The scan record gets status `no_packages_found` and the Python side does not treat it as an error.

Any other non-nil error from the scanner is a real failure.

### Vulnerability identity: (repo_url, vulnerability_id, package_name, source_path)

Each vulnerability is uniquely identified by these four fields. The same CVE can appear in multiple packages and multiple lockfiles, so all four are needed to distinguish records. This is the unique key for upserts.

### vulnerability_id and the ID columns

Every vulnerability record has a `vulnerability_id`, the primary identifier OSV assigns to the advisory (e.g. `GHSA-xxxx-xxxx-xxxx`). GHSA is OSV's preferred canonical ID: even when a CVE exists, OSV typically uses the GHSA ID as the primary and lists the CVE as an alias. This is because GitHub's Advisory Database is OSV's main upstream source and GHSA IDs are stable and consistently present.

Beyond the primary ID, each advisory carries a list of aliases (cross-references to the same vulnerability in other databases). We take both the primary ID and all aliases and classify them into three `TEXT[]` columns by prefix:

- `cve_ids`: IDs starting with `CVE-`
- `ghsa_ids`: IDs starting with `GHSA-` (usually includes the primary ID)
- `other_ids`: everything else (e.g. `PYSEC-`, `GO-`, `RUSTSEC-`, `MAL-`)

Each column holds an array, so multiple IDs of the same type are stored as proper discrete values.
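
The prefix classification can be sketched as follows. The real logic lives in the Go binary; this Python version only illustrates the rule, and the function name `classify_ids` and the de-duplication step are assumptions.

```python
def classify_ids(primary_id: str, aliases: list[str]) -> dict[str, list[str]]:
    """Split the primary ID and its aliases into the three columns by prefix."""
    columns: dict[str, list[str]] = {"cve_ids": [], "ghsa_ids": [], "other_ids": []}
    seen: set[str] = set()
    for identifier in [primary_id, *aliases]:
        if identifier in seen:  # assumption: repeated IDs are stored once
            continue
        seen.add(identifier)
        if identifier.startswith("CVE-"):
            columns["cve_ids"].append(identifier)
        elif identifier.startswith("GHSA-"):
            columns["ghsa_ids"].append(identifier)
        else:
            columns["other_ids"].append(identifier)
    return columns
```

Because the primary ID goes through the same classification as the aliases, a GHSA primary lands in `ghsa_ids` alongside any GHSA aliases.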

### Severity derived from CVSS score, not OSV severity strings

OSV records include a `MaxSeverity` field as a CVSS numeric score. We map it to our own four-tier scale (CRITICAL / HIGH / MEDIUM / LOW) using standard CVSS thresholds rather than trusting the advisory's own severity label, which is inconsistently populated across ecosystems.
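
As an illustration, the mapping might look like the sketch below. It uses the CVSS v3.x qualitative bands (9.0 and above critical, 7.0 to 8.9 high, 4.0 to 6.9 medium); treating everything below 4.0, including a score of 0.0, as `LOW` is an assumption, since the four-tier scale has no "none" tier. The function name is hypothetical.

```python
def severity_from_cvss(score: float) -> str:
    """Map a CVSS base score (0.0 to 10.0) to the four-tier scale."""
    if score >= 9.0:
        return "CRITICAL"
    if score >= 7.0:
        return "HIGH"
    if score >= 4.0:
        return "MEDIUM"
    # Assumption: scores below 4.0, including 0.0, are reported as LOW.
    return "LOW"
```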

### Status: OPEN vs FIX_AVAILABLE vs RESOLVED

A vulnerability is `FIX_AVAILABLE` if the OSV record contains a fixed version in any of the affected ranges, meaning a patch exists but the repo is still on the vulnerable version. It's `OPEN` if no fix is known. `RESOLVED` is set automatically by the database logic (see below) for findings that were present in a previous scan but are no longer detected.
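
A sketch of the `OPEN` / `FIX_AVAILABLE` decision, written against the OSV JSON layout (`affected[].ranges[].events[]`, where an event may carry a `fixed` version). The function name is hypothetical and the Go binary works on SDK types rather than raw dictionaries.

```python
def finding_status(osv_record: dict) -> str:
    """Return FIX_AVAILABLE if any affected range lists a fixed version, else OPEN."""
    for affected in osv_record.get("affected", []):
        for version_range in affected.get("ranges", []):
            for event in version_range.get("events", []):
                if event.get("fixed"):
                    return "FIX_AVAILABLE"
    return "OPEN"
```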

The OSV scanner's flatten operation can occasionally produce multiple entries for the same (vulnerability_id, package_name, source_path) with different package versions, for example when a package appears more than once in the same lockfile. When that happens, we keep the smallest version, since that represents the worst-case exposure for that specific lockfile.
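
The de-duplication rule could be sketched like this. The field names and the `version_key` ordering are assumptions: real version strings vary by ecosystem, and this naive comparison only handles dotted numeric versions such as `1.10.2`.

```python
def version_key(version: str) -> tuple[int, ...]:
    """Naive ordering for dotted numeric versions (an assumption for this sketch)."""
    return tuple(int(part) for part in version.split("."))


def dedupe_findings(findings: list[dict]) -> list[dict]:
    """Keep one entry per (vulnerability_id, package_name, source_path): the smallest version."""
    kept: dict[tuple[str, str, str], dict] = {}
    for finding in findings:
        key = (finding["vulnerability_id"], finding["package_name"], finding["source_path"])
        current = kept.get(key)
        if current is None or version_key(finding["package_version"]) < version_key(
            current["package_version"]
        ):
            kept[key] = finding
    return list(kept.values())
```

Comparing parsed tuples rather than raw strings matters here: as strings, `"1.10.0"` sorts before `"1.9.0"`, which would keep the wrong entry.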

### Database strategy: upsert + mark-resolved, not delete + insert

On each scan, rather than deleting all previous findings and inserting fresh ones, we:

1. Mark all currently active findings for the repo as `RESOLVED`.
2. Upsert the current scan results. This re-activates any finding that is still present and inserts new ones. Findings that genuinely disappeared remain `RESOLVED` with a timestamp.

This preserves history. You can tell when a vulnerability was first detected, when it was last seen, and when it went away, without needing a separate audit log. It also makes it cheap to compute `new_count` (rows that were truly inserted, not updated) and `resolved_count` (rows still carrying the newly-set `resolved_at` after the upsert pass).
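
The two-step pass can be demonstrated with an in-memory SQLite database. This is only a sketch: the table and column names are hypothetical, the production database and its SQL are not shown here, and `new_count` is derived by tagging rows with a per-scan timestamp, which assumes that timestamp is unique per scan.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE vulnerabilities (
    repo_url TEXT NOT NULL,
    vulnerability_id TEXT NOT NULL,
    package_name TEXT NOT NULL,
    source_path TEXT NOT NULL,
    status TEXT NOT NULL,
    first_seen_at TEXT NOT NULL,
    last_seen_at TEXT NOT NULL,
    resolved_at TEXT,
    UNIQUE (repo_url, vulnerability_id, package_name, source_path)
)
"""

UPSERT = """
INSERT INTO vulnerabilities
    (repo_url, vulnerability_id, package_name, source_path,
     status, first_seen_at, last_seen_at, resolved_at)
VALUES (?, ?, ?, ?, ?, ?, ?, NULL)
ON CONFLICT (repo_url, vulnerability_id, package_name, source_path)
DO UPDATE SET status = excluded.status,
              last_seen_at = excluded.last_seen_at,
              resolved_at = NULL
"""


def store_scan(conn, repo_url, findings, scan_time):
    """Apply one scan and return (new_count, resolved_count).

    Assumes scan_time is unique per scan, so it can tag rows touched by this pass.
    """
    with conn:  # one transaction for the whole pass
        # Step 1: mark every active finding for the repo as RESOLVED.
        conn.execute(
            "UPDATE vulnerabilities SET status = 'RESOLVED', resolved_at = ? "
            "WHERE repo_url = ? AND status != 'RESOLVED'",
            (scan_time, repo_url),
        )
        # Step 2: upsert current results; a conflict re-activates the existing row.
        for f in findings:
            conn.execute(
                UPSERT,
                (repo_url, f["vulnerability_id"], f["package_name"],
                 f["source_path"], f["status"], scan_time, scan_time),
            )
        # Rows inserted by this pass keep first_seen_at == scan_time.
        new_count = conn.execute(
            "SELECT COUNT(*) FROM vulnerabilities WHERE repo_url = ? AND first_seen_at = ?",
            (repo_url, scan_time),
        ).fetchone()[0]
        # Rows not re-activated still carry the resolved_at set in step 1.
        resolved_count = conn.execute(
            "SELECT COUNT(*) FROM vulnerabilities WHERE repo_url = ? AND resolved_at = ?",
            (repo_url, scan_time),
        ).fetchone()[0]
    return new_count, resolved_count
```

Running two scans where the second drops one finding and adds another yields one new row and one resolved row, while the finding present in both scans keeps its original `first_seen_at`.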

### Transitive dependency scanning

By default the scanner resolves the full transitive dependency graph, not just direct dependencies declared in lockfiles. This catches vulnerabilities in indirect deps that the project never explicitly references.

Transitive scanning is expensive for repos with large or deep package ecosystems: projects like ML frameworks with hundreds of transitive Python dependencies can exceed the 3-minute scan timeout. The scanner handles this automatically:

1. **First scan**: attempt with transitive=true. If it times out, retry immediately with transitive=false (direct deps only). The result of whichever mode succeeded is stored.
2. **Subsequent scans**: reuse the same mode as the previous completed scan (`transitive_deps_scanned` stored per scan record). This avoids flip-flopping between modes and keeps results comparable across scans.
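
The mode selection above can be sketched as a small function. `ScanTimeout`, `scan_with_mode_selection`, and the `scan` callable are hypothetical stand-ins; in the actual service the timeout surfaces as exit code 124 from the Go binary rather than as a Python exception.

```python
from typing import Callable, Optional


class ScanTimeout(Exception):
    """Hypothetical signal that a scan exceeded its time limit."""


def scan_with_mode_selection(
    scan: Callable[[bool], dict],
    previous_transitive: Optional[bool],
) -> tuple[dict, bool]:
    """Return (result, transitive_deps_scanned) following the two rules above.

    previous_transitive is the mode of the last completed scan, or None on a first scan.
    """
    if previous_transitive is not None:
        # Subsequent scan: reuse the previous mode so results stay comparable.
        return scan(previous_transitive), previous_transitive
    try:
        # First scan: try the full transitive graph.
        return scan(True), True
    except ScanTimeout:
        # Timed out: fall back to direct dependencies only.
        return scan(False), False
```

The returned boolean is what would be stored as `transitive_deps_scanned`, so the next scan of the same repo picks up the same mode.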

### OOM handling and stale scan cleanup

If the Go binary is killed by the OOM killer (SIGKILL, which Python reports as return code -9), the scan record it created stays stuck as `running`, since the process never got a chance to finalize it. The Python wrapper detects this:

1. On **any** `CommandExecutionError`, it connects to the insights DB and marks all `running` scans for that repo as `failure` with the error message. This cleans up stale records regardless of the failure reason.
2. On **OOM specifically** (returncode -9), it retries the scan with the `--no-transitive` flag, which forces the Go binary to skip transitive dependency resolution, the most memory-intensive part of scanning. This gives large repos a second chance to complete within memory limits.

The `--no-transitive` flag is parsed by the Go binary and overrides the normal transitive mode selection logic.
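
A sketch of this cleanup-then-retry flow, under stated assumptions: `run_scanner` and `mark_running_as_failed` are hypothetical callables standing in for the subprocess call and the insights DB update, and the `CommandExecutionError` class below is a simplified stand-in for the project's own error type.

```python
from typing import Callable, Optional

OOM_RETURNCODE = -9  # Python reports a child killed by signal 9 (SIGKILL) as returncode -9


class CommandExecutionError(Exception):
    """Simplified stand-in for the project's error class, which carries the return code."""

    def __init__(self, message: str, returncode: Optional[int] = None):
        super().__init__(message)
        self.returncode = returncode


def scan_with_oom_retry(
    run_scanner: Callable[[list[str]], str],
    mark_running_as_failed: Callable[[str, str], None],
    repo_url: str,
) -> str:
    """Run the scanner; clean up stale rows on any failure, retry once on OOM."""
    try:
        return run_scanner([])
    except CommandExecutionError as error:
        # Any failure: the scan row may be stuck in "running", so mark it failed.
        mark_running_as_failed(repo_url, str(error))
        if error.returncode != OOM_RETURNCODE:
            raise
    try:
        # OOM only: second attempt without transitive dependency resolution.
        return run_scanner(["--no-transitive"])
    except CommandExecutionError as error:
        mark_running_as_failed(repo_url, str(error))
        raise
```

Cleaning up before deciding whether to retry means the retry starts from a consistent state: the killed attempt is already recorded as `failure` when the second scan creates its own `running` row.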

### Scan tracking

Every invocation creates a row in `vulnerability_scans` before the scan starts (status: `running`) and updates it on completion with duration, counts, and any error. Terminal statuses are `success`, `no_packages_found`, and `failure`. This makes it possible to detect stalled or crashed scans and gives a simple history of scan health per repo.

## Building

The binary is built during Docker image construction (see `Dockerfile.git_integration`). To build locally:

```bash
cd services/apps/git_integration/src/crowdgit/services/vulnerability_scanner
go build -o vulnerability-scanner .
```

Output is a JSON object on stdout:

```json
{ "status": "success", "error_code": null, "error_message": null }
```

On repos with no scannable package manifests:

```json
{ "status": "no_packages_found", "error_code": null, "error_message": null }
```
