Commit 9f06a87

Add RFC for graph-based popularity metric (refs #833)
1 parent 891ce72 commit 9f06a87

1 file changed

Lines changed: 116 additions & 0 deletions

docs/popularity-ranking-rfc.md

@@ -0,0 +1,116 @@
1+
# RFC: Graph-Based Popularity Metric for Packages
2+
3+
Related to Issue #833
4+
5+
## 1. Motivation
6+
7+
PurlDB aggregates metadata for software packages identified by PURLs (Package URLs).
8+
However, not all packages have equal importance within an ecosystem.
9+
10+
A structured popularity metric could help:
11+
12+
- Prioritize indexing and mining operations
13+
- Identify critical and widely used packages
14+
- Improve API filtering and ranking
15+
- Focus computational resources on highly connected packages
16+
17+
This RFC proposes a graph-based popularity metric derived from dependency relationships.
18+
19+
20+
## 2. Current Data Model Observations
21+
22+
After setting up PurlDB locally and reviewing `packagedb/models.py`, I observed:
23+
24+
- Dependencies are represented by the `DependentPackage` model.
25+
- Each `DependentPackage` links to a source `Package` via a ForeignKey.
26+
- The dependency target is stored as a PURL string (`purl` field), not as a ForeignKey to another `Package`.
27+
28+
This effectively forms a directed dependency graph:
29+
30+
Package A ---> Package B
31+
32+
where B must be resolved from its PURL.
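
A minimal sketch of that resolution step, written in plain Python rather than against the Django ORM. The `DependencyRow` tuple and `build_edges` helper are hypothetical stand-ins for `DependentPackage` rows, not existing PurlDB code, and exact string matching is only the simplest possible resolution rule:

```python
from typing import Iterable, NamedTuple


class DependencyRow(NamedTuple):
    """Hypothetical stand-in for one DependentPackage row."""
    source_package: str  # PURL of the Package that owns the row (the ForeignKey side)
    purl: str            # dependency target, stored as a plain PURL string


def build_edges(rows: Iterable[DependencyRow], known_packages: set[str]):
    """Split rows into resolved edges (A, B) and rows whose target is not indexed."""
    edges, unresolved = [], []
    for row in rows:
        if row.purl in known_packages:
            edges.append((row.source_package, row.purl))  # A depends on B
        else:
            unresolved.append(row)
    return edges, unresolved


rows = [
    DependencyRow("pkg:pypi/django@4.2", "pkg:pypi/asgiref@3.7.2"),
    DependencyRow("pkg:pypi/django@4.2", "pkg:pypi/sqlparse"),
]
known = {"pkg:pypi/django@4.2", "pkg:pypi/asgiref@3.7.2"}
edges, unresolved = build_edges(rows, known)
```

Here the second row stays unresolved because its target string matches no indexed package, which is the case a real resolver would have to handle.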
33+
34+
35+
## 3. Proposed Approach
36+
37+
### 3.1 Graph Construction (Initial PoC)
38+
39+
- Nodes: Canonical package identities (ignoring version initially)
40+
- Edges: Dependency relationships
41+
- Direction: A → B if A depends on B
42+
43+
For the initial proof of concept:
44+
45+
- Resolve dependency PURLs to canonical package identities.
46+
- Ignore versions so that each release of a package does not become a separate node (graph fragmentation).
47+
- Restrict computation to a single ecosystem (e.g., PyPI).
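
The three PoC steps above could be sketched as follows. This is a deliberately simplified string-based parser, not a full implementation of the PURL specification; a real implementation would more likely rely on an existing PURL parsing library. The function names `canonical_identity`, `in_ecosystem`, and `collapse_edges` are illustrative assumptions:

```python
def canonical_identity(purl: str) -> str:
    """Drop version, qualifiers, and subpath from a PURL string (simplified)."""
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a PURL: {purl!r}")
    # Remove the subpath ("#...") and the qualifiers ("?...") first.
    core = purl.split("#", 1)[0].split("?", 1)[0]
    # The version, if present, follows "@" in the last path segment.
    head, sep, tail = core.rpartition("/")
    name = tail.split("@", 1)[0]
    return f"{head}{sep}{name}"


def in_ecosystem(purl: str, purl_type: str = "pypi") -> bool:
    """True if the PURL belongs to the given package type."""
    return purl.lower().startswith(f"pkg:{purl_type}/")


def collapse_edges(edges, purl_type: str = "pypi") -> set:
    """Map versioned edges to version-less edges within one ecosystem."""
    collapsed = set()
    for src, dst in edges:
        if in_ecosystem(src, purl_type) and in_ecosystem(dst, purl_type):
            a, b = canonical_identity(src), canonical_identity(dst)
            if a != b:  # assumption: self-loops carry no popularity signal
                collapsed.add((a, b))
    return collapsed
```

Using a set means that many versioned edges between the same two packages collapse into a single edge.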
48+
49+
50+
## 4. Popularity Signals
51+
52+
### 4.1 Dependency Centrality (Primary Signal)
53+
54+
- In-degree (number of reverse dependencies)
55+
- PageRank-style centrality over the dependency graph
56+
57+
PageRank lets a package that important packages depend on receive a higher score than its raw in-degree alone would give it.
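
To make the two signals concrete, here is a small pure-Python sketch of in-degree and a PageRank-style power iteration. The damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from packages without dependencies are common textbook choices assumed here, not decisions made by this RFC:

```python
def in_degree(edges):
    """Count reverse dependencies: how many packages depend on each node."""
    degree = {}
    for src, dst in edges:
        degree.setdefault(src, 0)
        degree[dst] = degree.get(dst, 0) + 1
    return degree


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over edges (A, B) meaning 'A depends on B'."""
    nodes = sorted({node for edge in edges for node in edge})
    out = {node: set() for node in nodes}
    for src, dst in edges:
        out[src].add(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out[node])
        new = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for src, targets in out.items():
            if not targets:
                continue
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new[dst] += share
        rank = new
    return rank


# a and b depend on c; c depends on d.
edges = [("a", "c"), ("b", "c"), ("c", "d")]
degrees = in_degree(edges)
ranks = pagerank(edges)
```

In this toy graph `d` has a lower in-degree than `c` but ends up with a higher PageRank, because its single dependent is itself heavily depended upon.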
58+
59+
### 4.2 Possible Enhancements (Future Work)
60+
61+
- Freshness factor (based on `release_date`)
62+
- Mining depth (`mining_level`)
63+
- Optional decay for inactive packages
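
One possible shape for a freshness factor or inactivity decay is an exponential half-life applied to the base score. The function name `freshness_factor` and the one-year default half-life are purely illustrative assumptions:

```python
from datetime import date


def freshness_factor(release_date: date, today: date, half_life_days: float = 365.0) -> float:
    """Return a multiplier in (0, 1] that halves every `half_life_days` since release."""
    age_days = max((today - release_date).days, 0)  # future dates count as age 0
    return 0.5 ** (age_days / half_life_days)


# A package last released exactly 365 days ago keeps half of its base score.
factor = freshness_factor(date(2023, 1, 1), date(2024, 1, 1))
```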
64+
65+
66+
## 5. Computation Strategy
67+
68+
Two possible strategies:
69+
70+
### Option A: Batch Computation (Recommended)
71+
72+
- Compute popularity periodically via scheduled task
73+
- Store result in database (e.g., `popularity_score` field)
74+
- Expose score via REST API
75+
76+
### Option B: On-Demand Computation
77+
78+
- Compute dynamically during API requests
79+
- Likely too expensive for large graphs
80+
81+
Batch computation appears more scalable: the graph is traversed once per run, and API requests only read a stored value.
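
Option A could look roughly like the sketch below, which uses an in-memory SQLite database in place of the PurlDB database and plain in-degree as a placeholder score. The table name `package_popularity` and the helpers `run_batch` and `get_popularity` are hypothetical; only the `popularity_score` name comes from the list above:

```python
import sqlite3
from collections import Counter


def run_batch(conn: sqlite3.Connection, edges) -> None:
    """One batch run: recompute every score and replace the stored values."""
    scores = Counter()
    for src, dst in edges:
        scores[src] += 0  # ensure packages without dependents still get a row
        scores[dst] += 1
    conn.execute(
        "CREATE TABLE IF NOT EXISTS package_popularity ("
        "identity TEXT PRIMARY KEY, popularity_score REAL NOT NULL)"
    )
    with conn:  # commit the delete and the inserts together
        conn.execute("DELETE FROM package_popularity")
        conn.executemany(
            "INSERT INTO package_popularity (identity, popularity_score) VALUES (?, ?)",
            list(scores.items()),
        )


def get_popularity(conn: sqlite3.Connection, identity: str):
    """Request-time path: a single indexed lookup, no graph traversal."""
    row = conn.execute(
        "SELECT popularity_score FROM package_popularity WHERE identity = ?",
        (identity,),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
run_batch(conn, [("pkg:pypi/a", "pkg:pypi/c"), ("pkg:pypi/b", "pkg:pypi/c")])
```

The scheduled task would call something like `run_batch`, while the REST API would only ever call something like `get_popularity`.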
82+
83+
84+
## 6. Scaling Considerations
85+
86+
- Large ecosystems may contain millions of nodes.
87+
- Version-level graphs may introduce excessive fragmentation.
88+
- Initial implementation should:
89+
- Operate at package identity level
90+
- Be ecosystem-scoped
91+
- Store precomputed scores
92+
93+
Future improvements may include:
94+
- Strongly connected component analysis
95+
- Weighted edges
96+
- Version-aware ranking
97+
98+
99+
## 7. Open Questions
100+
101+
1. Should popularity be computed per ecosystem or globally?
102+
2. Should dependency resolution be materialized in a normalized table?
103+
3. Is ignoring version acceptable for the initial PoC?
104+
4. Should optional dependencies be weighted differently?
105+
106+
107+
## 8. Next Steps
108+
109+
If this direction aligns with project goals:
110+
111+
- Implement a minimal PoC for one ecosystem
112+
- Validate ranking quality
113+
- Iterate on scoring methodology
114+
- Integrate into the PurlDB API
115+
116+
Feedback before implementation would be greatly appreciated.
