How exposed is a project to security failures? The security component reads two
independent signals — the project's OpenSSF Scorecard (lower score → more
risk) and its count of distinct CVEs over 2021–2025 (more CVEs → more risk)
— and distils them into one security-risk score (score, 0–100) that feeds
data/risk/risk.csv as the column security. It also carries informational
signals (semgrep SAST findings, OSS-Fuzz enrollment, OpenSSF Best Practices
badge) that do not enter the score.
Scope: the 897 A/B value-class repos in the risk pipeline (see
value.md). Build step: src/risk/build_security.py.
Each leaf is one column with its data source and the period it represents.
[2025 EOY] = the snapshot pinned to each repo's latest 2025 commit (year
priority 2025→2021); [2021–2025] = a 5-year window; [most recent] = the
latest pull of that source. Raw signals are fetched per-source under
data/sources/; derived columns are computed by build_security.py.
Security → data/risk/security.csv (897 A/B risk repos) → risk.csv col `security`
│
├── OpenSSF Scorecard
│ ├── openssf_score ← scorecard `score` (0–10); local first, deps.dev fallback [2025 EOY]
│ ├── openssf_score_source ← derived ("openssf_local" | "depsdev" | "") [2025 EOY]
│ └── openssf_score_p ← derived (risk pctl; lower Scorecard → higher pctl) [2025 EOY]
│
├── CVEs (OSV.dev)
│ ├── cve_count_5y ← distinct CVE ids mapped to the repo in 2021–2025 [2021–2025]
│ └── cve_count_5y_p ← derived (risk pctl; more CVEs → higher pctl) [2021–2025]
│
├── SAST (semgrep p/default — INFORMATIONAL, not scored)
│ ├── sast_findings_total / _p ← semgrep p/default findings_total + pctl [2025 EOY]
│ ├── sast_findings_error / _p ← high-severity (ERROR) findings + pctl [2025 EOY]
│ └── sast_findings_security / _p ← security-category findings + pctl [2025 EOY]
│
├── ossfuzz_enrolled ← OSS-Fuzz projects index ("True"/"False") [most recent]
├── bestpractices_badge_id ← deps.dev (OpenSSF Best Practices badge tier) [most recent]
├── fetched_at ← checked_at of the OpenSSF score row used [2025 EOY]
│
└── score (the score) ← derived (geometric mean of openssf_score_p, cve_count_5y_p) [composite]
└─ carried into risk.csv as column `security`
- Collect: fetchers pull raw signals into `data/sources/`: OpenSSF Scorecard (`git/openssf.csv`), its deps.dev mirror (`git/depsdev.csv`), semgrep findings (`git/semgrep.csv`), OSV CVEs (`osv/cves.csv`), OSS-Fuzz enrollment (`ossfuzz/projects.csv`), and the Best Practices badge (`depsdev/repos.csv`). Each is TTL/sha-gated so re-runs only fetch what's missing or stale.
- Snapshot-join: the three sha-pinned long files (openssf, depsdev, semgrep) are keyed by `(repo, sha)`. For each repo, `build_security.py` walks the per-year `last_sha` priority from `commits-years.csv` (2025→2024→…→2021) and picks the first sha that has rows in that file. If no year matches, it falls back to any sha present for the repo (deterministic lexicographic pick). This is the same snapshot convention `build_complexity` uses.
- Derive: read `openssf_score` (with the local→deps.dev fallback below), count distinct CVEs, read semgrep findings, set `ossfuzz_enrolled` and `bestpractices_badge_id`.
- Score: `add_percentiles(...)` ranks the population, then `score` = geometric mean of `openssf_score_p` and `cve_count_5y_p`.
- Aggregate: `aggregate_risk.py` carries only this component's `score` into `risk.csv` as the column `security`.
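The snapshot-join step can be sketched as follows. This is a minimal illustration, not the code in `build_security.py`: the function name `pick_snapshot_sha` and its argument shapes are hypothetical, and using the lexicographically smallest sha as the fallback is an assumption (the text above says only that the pick is deterministic and lexicographic).

```python
from typing import Dict, Optional, Set

# Year priority described above: newest in-window year first.
YEAR_PRIORITY = (2025, 2024, 2023, 2022, 2021)


def pick_snapshot_sha(
    last_sha_by_year: Dict[int, str],
    shas_with_rows: Set[str],
) -> Optional[str]:
    """Pick the sha to join on for one repo in one long file.

    last_sha_by_year: per-year last_sha for the repo (from commits-years.csv).
    shas_with_rows:   shas for which the long file has rows for this repo.
    """
    # Walk the per-year priority and take the first sha that has rows.
    for year in YEAR_PRIORITY:
        sha = last_sha_by_year.get(year)
        if sha and sha in shas_with_rows:
            return sha
    # Fallback: any sha present for the repo, chosen deterministically.
    # ASSUMPTION: smallest sha in lexicographic order.
    if shas_with_rows:
        return min(shas_with_rows)
    return None
```

Because the picker runs once per long file, the same repo can resolve to different shas in `openssf.csv` and `semgrep.csv` when one source is missing rows for the newest commit.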
openssf_score is taken from the locally-run Scorecard
(data/sources/git/openssf.csv, openssf_score_source = "openssf_local").
When the snapshot picker finds no usable local row for a repo, the build falls
back to the deps.dev-mirrored Scorecard score
(data/sources/git/depsdev.csv, openssf_score_source = "depsdev"). If neither
yields a score, openssf_score and openssf_score_source are empty. (In the
current build, 893 repos use the local score and 4 fall back to deps.dev.)
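A minimal sketch of the local→deps.dev fallback, assuming each source has already been reduced to one score (or `None`) for the repo at its snapshot sha. The function name `resolve_openssf_score` is hypothetical; only the two source labels and the empty-string convention come from the description above.

```python
from typing import Optional, Tuple, Union


def resolve_openssf_score(
    local_score: Optional[float],
    depsdev_score: Optional[float],
) -> Tuple[Union[float, str], str]:
    """Return (openssf_score, openssf_score_source) for one repo."""
    # Prefer the locally-run Scorecard. Compare against None so that a
    # legitimate score of 0.0 is not mistaken for "missing".
    if local_score is not None:
        return local_score, "openssf_local"
    # Otherwise fall back to the deps.dev-mirrored score.
    if depsdev_score is not None:
        return depsdev_score, "depsdev"
    # Neither source yielded a score: both columns stay empty.
    return "", ""
```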
Pipeline order (src/risk/run_risk_pipeline.py, fetchers run with
--with-fetchers):
commits-years → … → semgrep → … → cves → scorecard → depsdev → … → security-build → aggregate
Eight source files feed the build. The three Git-snapshot long files
(git/openssf.csv, git/depsdev.csv, git/semgrep.csv) carry
repo, repo_id, commit_sha, metric, value, checked_at — one row per check/finding
metric per sha — and are joined on the snapshot sha. The rest join on repo.
| Source file (data/sources/) | Fetcher | Collects | Key |
|---|---|---|---|
| `value/value.csv` | (value stage) | A/B value-class scope | repo |
| `github/git/commits-years.csv` | `src.sources.git.commits_years` | per-(repo, year) `last_sha`, the snapshot pin | repo, year |
| `git/openssf.csv` | `src/sources/openssf/scorecard.py` | OpenSSF Scorecard score + 18 checks per (repo, sha); see openssf.md | repo, sha |
| `git/depsdev.csv` | `src/sources/depsdev/fetch.py` | deps.dev-mirrored Scorecard score + checks (fallback when local row missing) | repo, sha |
| `git/semgrep.csv` | `src/sources/github/fetch_semgrep.py` | semgrep findings per (repo, sha, rulepack-prefixed metric); locked to `p_default` | repo, sha |
| `osv/cves.csv` | `src/sources/osv/fetch_cves.py` | per-CVE rows (repo, date, cve, package-source) | repo |
| `osv/queried.csv` | `src/sources/osv/fetch_cves.py` | sidecar: repos OSV was queried for (confirms true zeros) | repo |
| `ossfuzz/projects.csv` | `src/sources/ossfuzz/fetch_ossfuzz_data.py` | OSS-Fuzz enrollment; see ossfuzz.md | github_repo |
| `depsdev/repos.csv` | `src/sources/depsdev/fetch.py` | non-sha enrichment: `bestpractices_badge_id` | repo |
Each row in osv/cves.csv is one (repo, cve, package-source) tuple — multiple
package mappings can produce duplicate (repo, cve) pairs, so the build
dedupes on the CVE id within a repo. The date filter keeps only CVEs whose
date[:4] falls in 2021–2025. Resolution is three-way: a count if the repo
appears in cves.csv; 0 if it's absent but present in queried.csv (a
confirmed zero); "" if it was never queried (unknown — keeps a failed/skipped
fetch from masquerading as zero).
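The three-way resolution can be sketched like this. It is an illustration under assumptions, not the build's code: the function name `cve_count_5y` and the `(repo, date, cve)` tuple shape are hypothetical simplifications of the `cves.csv` rows, and returning `0` for a repo whose only CVEs fall outside the window is an inference from "a count if the repo appears in cves.csv".

```python
from typing import Iterable, Set, Tuple, Union


def cve_count_5y(
    repo: str,
    cve_rows: Iterable[Tuple[str, str, str]],  # (repo, date, cve) per row
    queried_repos: Set[str],                   # repos listed in queried.csv
) -> Union[int, str]:
    """Distinct in-window CVE ids for a repo, 0 if confirmed zero, "" if unknown."""
    repo_rows = [row for row in cve_rows if row[0] == repo]
    if repo_rows:
        # A set dedupes (repo, cve) pairs repeated across package mappings;
        # date[:4] is the year, compared as a string against the window.
        in_window = {cve for _, date, cve in repo_rows
                     if "2021" <= date[:4] <= "2025"}
        return len(in_window)
    if repo in queried_repos:
        return 0   # queried and nothing mapped: a confirmed zero
    return ""      # never queried: unknown, not zero
```

Keeping `""` distinct from `0` matters downstream, because `score` is only populated when `cve_count_5y` is present.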
The build reads only the p_default. rulepack prefix from semgrep.csv,
surfacing three counts at the snapshot sha — findings_total (all),
findings_error (high-severity ERROR only), findings_security
(security-category only) — each with an _p percentile. These sast_*_p
percentiles are informational and are NOT inputs to score.
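A rough sketch of the prefix filter, assuming the metric names in `semgrep.csv` take the form `p_default.findings_total` (inferred from the prefix and count names above) and that `value` holds an integer count. The function name `sast_counts` is hypothetical, and the rows are plain dicts as `csv.DictReader` would yield them.

```python
from typing import Dict, Iterable, Union

PREFIX = "p_default."  # the only rulepack prefix the build reads
WANTED = ("findings_total", "findings_error", "findings_security")


def sast_counts(
    rows: Iterable[Dict[str, str]], repo: str, sha: str
) -> Dict[str, Union[int, str]]:
    """Collect the three p_default counts for one repo at its snapshot sha."""
    # Start empty so a repo without semgrep rows keeps "" rather than 0.
    out: Dict[str, Union[int, str]] = {f"sast_{name}": "" for name in WANTED}
    for row in rows:
        if row["repo"] != repo or row["commit_sha"] != sha:
            continue
        metric = row["metric"]
        if not metric.startswith(PREFIX):
            continue  # other rulepacks are ignored
        name = metric[len(PREFIX):]
        if name in WANTED:
            out[f"sast_{name}"] = int(row["value"])  # ASSUMPTION: integer counts
    return out
```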
add_percentiles(...) computes direction-aware population percentiles
(0–100) over all 897 repos:
| Column | Basis | Direction (asc) |
|---|---|---|
| `openssf_score_p` | `openssf_score` | False: lower Scorecard score → higher risk pctl |
| `cve_count_5y_p` | `cve_count_5y` | True: more CVEs → higher risk pctl |
| `sast_findings_total_p` | `sast_findings_total` | True (info only) |
| `sast_findings_error_p` | `sast_findings_error` | True (info only) |
| `sast_findings_security_p` | `sast_findings_security` | True (info only) |
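To make "direction-aware" concrete, here is one possible percentile scheme. It is not the actual `add_percentiles(...)` implementation, whose tie-handling and scaling are not described here: the function name `risk_percentiles` is hypothetical, and the mid-rank convention (tied values share the midpoint of their rank range) is an assumption chosen for illustration.

```python
from typing import List, Optional, Sequence


def risk_percentiles(
    values: Sequence[Optional[float]], ascending: bool
) -> List[Optional[float]]:
    """Map raw values to 0-100 risk percentiles; None stays None.

    ascending=True  -> larger raw value means higher risk (e.g. CVE count).
    ascending=False -> smaller raw value means higher risk (e.g. Scorecard).
    """
    present = [v for v in values if v is not None]
    n = len(present)
    out: List[Optional[float]] = []
    for v in values:
        if v is None:
            out.append(None)
            continue
        # Count how many repos are "less risky" than this one.
        if ascending:
            less_risky = sum(1 for x in present if x < v)
        else:
            less_risky = sum(1 for x in present if x > v)
        tied = sum(1 for x in present if x == v)
        # ASSUMPTION: mid-rank percentile, so all ties get the same value.
        out.append(100.0 * (less_risky + 0.5 * tied) / n)
    return out
```

Under this scheme every repo with the same raw value receives the same percentile, which is why a large block of zero-CVE repos ends up sharing a single `cve_count_5y_p`.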
score = geometric_mean(openssf_score_p, cve_count_5y_p)
The composite inputs are composite_cols = ["openssf_score_p", "cve_count_5y_p"].
score is populated only when both openssf_score and cve_count_5y are present;
otherwise it is "". With a geometric mean, a repo that is bad on either axis (low
Scorecard or many CVEs) carries elevated risk. Because ~78% of risk-scope repos
have zero CVEs and thus share one identical cve_count_5y_p, for the majority
score effectively tracks the OpenSSF axis, with the CVE axis re-ranking only
the minority that carry CVEs.
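A small worked sketch of the composite, assuming the two percentiles arrive either as numbers or as `""` when the underlying signal is missing. The function name `security_score` is hypothetical, and any rounding the build applies is not shown.

```python
import math
from typing import Union

Number = Union[int, float]


def security_score(
    openssf_score_p: Union[Number, str],
    cve_count_5y_p: Union[Number, str],
) -> Union[float, str]:
    """Geometric mean of the two risk percentiles, or "" if either is missing."""
    if openssf_score_p == "" or cve_count_5y_p == "":
        return ""
    # Geometric mean of two values is the square root of their product.
    return math.sqrt(float(openssf_score_p) * float(cve_count_5y_p))


# Example: risk percentiles 90 (OpenSSF axis) and 40 (CVE axis).
example = security_score(90, 40)  # sqrt(3600) = 60.0
```

For comparison, the arithmetic mean of 90 and 40 would be 65; the geometric mean of the same pair is 60.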
17 columns, one row per risk repo. Per-signal timestamps stay in each source
file; fetched_at here is the checked_at of the OpenSSF score row that was used.
| Column | Description |
|---|---|
| `repo`, `repo_id` | identity |
| `openssf_score` | OpenSSF Scorecard score (0–10), local or deps.dev mirror |
| `openssf_score_source` | `"openssf_local"`, `"depsdev"`, or `""` |
| `cve_count_5y` | distinct CVE ids 2021–2025 (0 confirmed-zero; "" unknown) |
| `ossfuzz_enrolled` | "True"/"False": enrolled in OSS-Fuzz |
| `sast_findings_total` / `_p` | semgrep p/default total findings + pctl (info) |
| `sast_findings_error` / `_p` | high-severity (ERROR) findings + pctl (info) |
| `sast_findings_security` / `_p` | security-category findings + pctl (info) |
| `bestpractices_badge_id` | `passing`, `silver`, `gold`, `in_progress`, or `""` |
| `openssf_score_p` | risk pctl of `openssf_score` (lower-is-worse) |
| `cve_count_5y_p` | risk pctl of `cve_count_5y` (more-is-worse) |
| `score` | security-risk score (geom-mean of the two `_p`; "" if either missing) |
| `fetched_at` | `checked_at` of the OpenSSF score row used |
aggregate_risk.py carries only this component's score into risk.csv,
under the column name security — every other column above stays in
security.csv. The full risk.csv schema is:
repo, repo_id, concentration, complexity, security, funding, workload, score
where security is this component's score, and the final score is the
geometric mean of the present component scores.
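The "geometric mean of the present component scores" can be sketched as below. This is an illustration rather than the code in `aggregate_risk.py`: the function name `final_score` is hypothetical, a missing component is assumed to be stored as `""`, and returning `""` when no component is present is an assumption.

```python
import math
from typing import Dict, Union

# Component columns of risk.csv, in schema order.
COMPONENTS = ("concentration", "complexity", "security", "funding", "workload")


def final_score(row: Dict[str, Union[str, float]]) -> Union[float, str]:
    """Geometric mean over whichever component scores are present in the row."""
    present = [float(row[c]) for c in COMPONENTS if row.get(c, "") != ""]
    if not present:
        return ""  # ASSUMPTION: no components means no final score
    # n-th root of the product of the n present scores.
    return math.prod(present) ** (1.0 / len(present))
```

With only `concentration = 50` and `security = 72` present, the result is the square root of 3600, about 60; the missing components are skipped rather than treated as zero.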
Of the 897 A/B risk repos:
| Signal | Repos | % |
|---|---|---|
| `openssf_score` present | 897 | 100.0% |
| `openssf_score` via local Scorecard | 893 | 99.6% |
| `openssf_score` via deps.dev fallback | 4 | 0.4% |
| `cve_count_5y` known | 893 | 99.6% |
| `score` populated | 893 | 99.6% |
| semgrep SAST findings present | 884 | 98.6% |
| OSS-Fuzz enrolled | 130 | 14.5% |
| Best Practices badge (any tier) | 30 | 3.3% |
CVE distribution: 695 repos with zero CVEs, 198 with ≥1 (max 10,602). Best
Practices badge tiers: 18 passing, 10 in_progress, 1 gold, 1 silver.
score quartiles: p25 46 · p50 64 · p75 78 (max 95).
- Two-axis score. Only `openssf_score_p` and `cve_count_5y_p` enter `score`. Semgrep SAST, OSS-Fuzz, and the Best Practices badge are collected and surfaced but not scored; they are context, not inputs.
- CVE mapping is package-name-bound. CVE counts depend on OSV mapping a CVE to the repo via its published package names. C/Debian-mapped repos with package-name mismatches under-count (e.g. `cpython`→0, `linux`→7); a `0` reflects "no mapped CVEs", not necessarily "no vulnerabilities".
- CVE axis is coarse for the majority. ~78% of repos have zero CVEs and share one `cve_count_5y_p`, so for them `score` mostly tracks the OpenSSF axis; the CVE axis only re-ranks the ~22% that carry CVEs.
- Snapshot pinning, not live. The Scorecard/semgrep signals are pinned to the repo's latest in-window commit (2025→2021), not re-run live, so they reflect the snapshot sha rather than `HEAD`.
- `score` is not a class. It's a 0–100 risk percentile, not an A–D tier; the A–D `security_class` tiering lives in the downstream risk-class layer (see risk.md), not in this component CSV.