The PyPI slice of the Value pipeline: how PyPI download and
dependency data becomes a download-weighted PageRank and an A/B/C/D value class for
every Python package. This page covers the pipeline assembly; for raw-fetch
mechanics (the BigQuery export, the JSON API, fetch scripts) see the source
reference sources/pypi.md.
| Source | Data collected | Raw file (data/sources/pypi/) |
|---|---|---|
| BigQuery PyPI dataset | per-package annual downloads 2021–2025 — manual export (~47 TB / ~$235; mirror installers excluded) | bigquery/bq-package-downloads.csv |
| PyPI JSON API | runtime deps from info.requires_dist (PEP 508 specifiers; only runtime kept) | raw/package-dependencies.csv |
| External dataset (manually sourced) | package → GitHub URL mapping | raw/package-github-mapping.csv |
No authentication is required, except for BigQuery (used for the download export).
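
As a rough illustration of the "only runtime kept" rule, the sketch below filters a `requires_dist` list down to runtime dependency names. It is not the pipeline's actual code: the function names are hypothetical, and it assumes "runtime" means the environment marker does not mention `extra`. Name normalization follows PEP 503.

```python
import re

# A PEP 508 requirement starts with the project name (letters, digits, . _ -).
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def normalize(name):
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def runtime_deps(requires_dist):
    """Return normalized names of requirements that are not gated by an extra.

    Assumption: a requirement is "runtime" when its marker does not mention
    `extra`; the real pipeline may use a different rule.
    """
    deps = []
    for spec in requires_dist or []:  # tolerate a missing/None field
        requirement, _, marker = spec.partition(";")
        if re.search(r"\bextra\b", marker):
            continue  # optional dependency, e.g. extra == "test"
        match = _NAME.match(requirement)
        if match:
            name = normalize(match.group(1))
            if name not in deps:
                deps.append(name)
    return deps
```

For example, `runtime_deps(["typing_extensions (>=3.7)", "PySocks>=1.5.6; extra == 'socks'"])` keeps only `typing-extensions`.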
PyPI data flows through the shared Value mechanics (full description in
value.md):
- Load downloads from the BigQuery export (~849K packages).
- Top packages — keep packages covering 95% of the ecosystem-wide download total.
- Dependency tree — follow transitive runtime deps from the top set.
- package → repo — parse GitHub URLs from the mapping file.
- PageRank — download-weighted personalized PageRank (α = 0.85) over the dep graph.
- Value class — sort by PageRank desc; cumulative-share cutoffs assign A (≤50%) / B (≤75%) / C (≤90%) / D (rest).
Orchestrated by src.value.pypi_pipeline (fetch-data → fetch-urls → process).
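
The dependency-tree, PageRank, and value-class steps above can be sketched in plain Python as follows. This is a minimal illustration, not the code in src.value.pypi_pipeline: all function names are hypothetical, edges are assumed to run package → dependency, packages without dependencies are assumed to hand their rank back through the download-weighted teleport vector, and the class cutoffs are read as inclusive cumulative shares.

```python
from collections import deque


def dependency_closure(top, deps):
    """Breadth-first walk of runtime deps starting from the top set."""
    seen, queue = set(top), deque(top)
    while queue:
        for dep in deps.get(queue.popleft(), ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def personalized_pagerank(nodes, deps, downloads, alpha=0.85, iters=100):
    """Power iteration with a download-weighted teleport vector."""
    nodes = sorted(nodes)
    total = sum(downloads.get(n, 0) for n in nodes)
    # Teleport probability proportional to downloads (uniform if none known).
    tele = {n: (downloads.get(n, 0) / total if total else 1 / len(nodes))
            for n in nodes}
    rank = dict(tele)
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        dangling = 0.0
        for n in nodes:
            out = [d for d in deps.get(n, ()) if d in nxt]
            if out:
                share = rank[n] / len(out)
                for d in out:
                    nxt[d] += share  # rank flows from package to dependency
            else:
                dangling += rank[n]  # no deps: redistribute via teleport
        rank = {n: alpha * (nxt[n] + dangling * tele[n]) + (1 - alpha) * tele[n]
                for n in nodes}
    return rank


def value_classes(rank, cutoffs=(("A", 0.50), ("B", 0.75), ("C", 0.90))):
    """Sort by PageRank descending and label by cumulative share."""
    total = sum(rank.values())
    classes, cum = {}, 0.0
    for name, score in sorted(rank.items(), key=lambda kv: (-kv[1], kv[0])):
        cum += score / total
        # Small tolerance so a share of exactly 50% is not lost to rounding.
        classes[name] = next((c for c, lim in cutoffs if cum <= lim + 1e-12), "D")
    return classes
```

Because rank flows toward dependencies, a rarely downloaded library that many popular packages depend on can outrank the packages that pull it in.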
Metric lineage (← = data source, […] = period):
Python (PyPI)
├── downloads_2021..2025 ← BigQuery PyPI dataset [2021–2025]
├── avg_downloads ← derived [2021–2025]
├── avg_downloads_share ← derived [2021–2025]
├── top ← derived (95% cum-dl) [2021–2025]
├── dep edges (package→dep) ← pypi.org/pypi/{p}/json [most recent]
├── pagerank ← derived [2021–2025]
├── value_class ← derived [2021–2025]
└── package→repo ← BigQuery github mapping [most recent]
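
The derived download columns in the lineage above could be computed along these lines. This is a sketch under stated assumptions: `top_by_downloads` is a hypothetical helper, `avg_downloads` is taken as the plain mean of the yearly columns, and the package that crosses the 95% line is still counted as top.

```python
def top_by_downloads(yearly, coverage=0.95):
    """Map {package: [downloads per year]} to per-package derived columns."""
    # Assumption: every package has at least one yearly value.
    avg = {pkg: sum(values) / len(values) for pkg, values in yearly.items()}
    total = sum(avg.values())
    rows, cum = {}, 0.0
    # Walk packages from most to least downloaded (name breaks ties).
    for pkg in sorted(avg, key=lambda p: (-avg[p], p)):
        share = avg[pkg] / total if total else 0.0
        rows[pkg] = {
            "avg_downloads": avg[pkg],
            "avg_downloads_share": share,
            # Flag while the coverage target had not been reached before
            # this package, so the package crossing the line is included.
            "top": cum < coverage,
        }
        cum += share
    return rows
```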
- Value — each package's value_class is grouped by repo into data/value/value.csv
as the class_pypi column; the strongest class across ecosystems becomes class.
- Risk — A/B-class PyPI repos enter src.risk.run_risk_pipeline (scope set by
risk_input.value_classes in src/settings.json).
- Eligibility — A/B repos that also pass the OSI-license and non-EOL gates
reach data/eligibility/eligibility.csv.
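
The roll-up from packages to repos might look like the sketch below. It assumes a repo with several packages takes the strongest of their classes (the rule within a repo is not stated here), and the helper names and the `class_npm` column are hypothetical.

```python
ORDER = "ABCD"  # A is the strongest class, D the weakest


def strongest(classes):
    """Return the strongest class from a non-empty iterable."""
    return min(classes, key=ORDER.index)


def repo_classes(packages):
    """Collapse (github_repo, value_class) pairs into one class per repo."""
    by_repo = {}
    for repo, cls in packages:
        if not repo:
            continue  # packages without a repo mapping cannot be grouped
        by_repo[repo] = strongest([cls, by_repo.get(repo, cls)])
    return by_repo


def overall_class(row):
    """Pick the strongest class across per-ecosystem columns, skipping blanks."""
    present = [cls for cls in row.values() if cls]
    return strongest(present) if present else ""
```

For instance, `overall_class({"class_pypi": "B", "class_npm": "A"})` yields `"A"`.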
results.csv (data/sources/pypi/) — one row per dep-tree package, with
package, github_repo, avg_downloads, the 2021–2025 columns, top,
pagerank, and value_class.
Carried from the cross-ecosystem tables in value.md:
| Stage | Count |
|---|---|
| Top packages (95% downloads) | 2,460 |
| After dep tree | 3,139 |
| Results | 3,139 |
| With GitHub repo | 1,728 (55%) |

| Class | A | B | C | D | Total |
|---|---|---|---|---|---|
| Packages | 54 | 157 | 414 | 2,514 | 3,139 |
| Repos (value.csv) | 53 | 151 | 389 | 2,347 | — |
A+B packages with a GitHub repo: 76%.
- 55% GitHub coverage — the lowest of the four ecosystems. The BigQuery extract
carried only GitHub URLs at fetch time, so non-GitHub upstreams (GitLab,
self-hosted) have no package → repo link. Because Risk and Eligibility key off
github_repo, this caps how many PyPI repos those stages can score, even for
A/B-class packages (A+B GitHub coverage is 76%, not ~100% like npm/crates).
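
The coverage figures can be recomputed from results.csv-style rows with a small helper such as the one below. It is a sketch: `github_coverage` is a hypothetical name, and it assumes a missing mapping is stored as an empty github_repo value.

```python
def github_coverage(rows, classes=None):
    """Fraction of rows with a github_repo, optionally limited to some classes."""
    selected = [r for r in rows
                if classes is None or r["value_class"] in classes]
    if not selected:
        return 0.0
    mapped = sum(1 for r in selected if r["github_repo"])
    return mapped / len(selected)


# Overall figure from the counts table: 1,728 of 3,139 packages.
overall_pct = round(1728 / 3139 * 100)
```

Calling `github_coverage(rows, classes={"A", "B"})` gives the A+B figure for whatever rows are loaded.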