Add RFC for graph-based popularity metric (refs #833)

mohityadav8 · mohityadav8 · commit 9f06a873f1bb · 2026-03-03T01:08:03.000+05:30
diff --git a/docs/popularity-ranking-rfc.md b/docs/popularity-ranking-rfc.md
@@ -0,0 +1,116 @@
+# RFC: Graph-Based Popularity Metric for Packages
+
+Related to Issue #833
+
+## 1. Motivation
+
+PurlDB aggregates metadata for software packages identified by PURLs (Package URLs).  
+However, not all packages have equal importance within an ecosystem.
+
+A structured popularity metric could help:
+
+- Prioritize indexing and mining operations
+- Identify critical and widely used packages
+- Improve API filtering and ranking
+- Focus computational resources on highly connected packages
+
+This RFC proposes a graph-based popularity metric derived from dependency relationships.
+
+
+## 2. Current Data Model Observations
+
+After setting up PurlDB locally and reviewing `packagedb/models.py`, I observed:
+
+- Dependencies are represented by the `DependentPackage` model.
+- Each `DependentPackage` links to a source `Package` via a ForeignKey.
+- The dependency target is stored as a PURL string (`purl` field), not as a ForeignKey to another `Package`.
+
+This effectively forms a directed dependency graph:
+
+    Package A  --->  Package B
+
+where B must be resolved from its PURL.
+
+
+## 3. Proposed Approach
+
+### 3.1 Graph Construction (Initial PoC)
+
+- Nodes: Canonical package identities (ignoring version initially)
+- Edges: Dependency relationships
+- Direction: A → B if A depends on B
+
+For the initial proof of concept:
+
+- Resolve dependency PURLs to canonical package identities.
+- Ignore version to avoid graph fragmentation.
+- Restrict computation to a single ecosystem (e.g., PyPI).
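
A minimal sketch of the graph construction in 3.1, assuming dependency rows are available as `(source_purl, dependency_purl)` string pairs. The helper names `canonical_identity` and `build_graph` are hypothetical and not part of PurlDB. The parser is deliberately simplified: it takes the version from the last path segment, and the PyPI name normalization (lowercase, `_` to `-`) follows the purl-spec convention as I understand it. A real implementation would more likely rely on the `packageurl-python` library.

```python
from collections import defaultdict
from urllib.parse import unquote


def canonical_identity(purl):
    """Reduce a PURL string to a version-less 'type/namespace/name' key."""
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a PURL: {purl!r}")
    # Drop the scheme, then the optional subpath (#...) and qualifiers (?...).
    remainder = purl[len("pkg:"):].split("#", 1)[0].split("?", 1)[0].strip("/")
    segments = remainder.split("/")
    if len(segments) < 2:
        raise ValueError(f"PURL has no name component: {purl!r}")
    pkg_type = segments[0].lower()
    # Assumption: the version, if any, is the '@...' suffix of the last segment.
    name, _, _version = segments[-1].partition("@")
    if not name:
        raise ValueError(f"PURL has an empty name: {purl!r}")
    parts = [unquote(s) for s in segments[1:-1]] + [unquote(name)]
    if pkg_type == "pypi":
        # PyPI names are case-insensitive; treat '_' and '-' as equivalent.
        parts[-1] = parts[-1].lower().replace("_", "-")
    return "/".join([pkg_type] + parts)


def build_graph(dependency_rows):
    """Build {package identity: set of identities it depends on}."""
    graph = defaultdict(set)
    for source_purl, target_purl in dependency_rows:
        source = canonical_identity(source_purl)
        target = canonical_identity(target_purl)
        # Touch both keys so that leaf packages also appear as nodes.
        graph[source]
        graph[target]
        # Collapsing versions can create self-loops; skip them.
        if source != target:
            graph[source].add(target)
    return dict(graph)
```

Because versions are dropped, `pkg:pypi/Flask@2.0.0` and `pkg:pypi/flask@2.1.0` collapse into the single node `pypi/flask`, which is the fragmentation-avoidance behaviour described above.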
+
+
+## 4. Popularity Signals
+
+### 4.1 Dependency Centrality (Primary Signal)
+
+- In-degree (number of reverse dependencies)
+- PageRank-style centrality over the dependency graph
+
+In-degree counts direct dependents only; PageRank additionally lets packages that important packages depend on receive higher scores.
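
To make the difference between the two signals concrete, here is a small pure-Python sketch. It assumes the graph is a dict mapping each package identity to the set of identities it depends on. The function names, the damping factor of 0.85, and the fixed iteration count are illustrative assumptions, not values taken from this RFC; a PoC could equally use an existing graph library.

```python
def in_degree(graph):
    """Count reverse dependencies (incoming edges) for every node."""
    counts = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def pagerank(graph, damping=0.85, iterations=50):
    """PageRank by power iteration; rank flows from dependents to dependencies."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Nodes with no dependencies spread their rank evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: two applications depend on "lib", which depends on "core".
example_graph = {
    "app1": {"lib"},
    "app2": {"lib"},
    "lib": {"core"},
    "core": set(),
}
example_in_degree = in_degree(example_graph)
example_rank = pagerank(example_graph)
```

In this toy graph `lib` has the higher in-degree (two dependents against one for `core`), yet `core` ends up with the higher PageRank because it inherits the rank of `lib`.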
+
+### 4.2 Possible Enhancements (Future Work)
+
+- Freshness factor (based on `release_date`)
+- Mining depth (`mining_level`)
+- Optional decay for inactive packages
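
The RFC does not specify how freshness or decay would be computed. One possible form, shown purely as an assumption, is an exponential half-life multiplier applied to the centrality score. The function name `freshness_factor` and the one-year default half-life are hypothetical.

```python
from datetime import date


def freshness_factor(release_date, today, half_life_days=365.0):
    """Return a multiplier in (0, 1] that halves every `half_life_days`."""
    # Clamp negative ages (release dates in the future) to zero.
    age_days = max((today - release_date).days, 0)
    return 0.5 ** (age_days / half_life_days)
```

A combined score could then be `centrality * freshness_factor(release_date, today)`, though whether decay should be multiplicative at all is a design choice left open here.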
+
+
+## 5. Computation Strategy
+
+Two possible strategies:
+
+### Option A: Batch Computation (Recommended)
+
+- Compute popularity periodically via a scheduled task
+- Store the result in the database (e.g., a `popularity_score` field)
+- Expose score via REST API
+
+### Option B: On-Demand Computation
+
+- Compute dynamically during API requests
+- Likely too expensive for large graphs
+
+Batch computation appears more scalable because the graph is traversed once per run rather than on every API request.
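
The batch flow of Option A can be sketched as below. This uses the standard-library `sqlite3` module as a stand-in so the example is self-contained; PurlDB itself is a Django project, and the table name `package`, the `identity` column, and the helper names are all hypothetical. Only the `popularity_score` field name comes from the proposal above.

```python
import sqlite3


def create_schema(conn):
    """Create a stand-in table holding one row per package identity."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS package ("
        "identity TEXT PRIMARY KEY, "
        "popularity_score REAL NOT NULL DEFAULT 0.0)"
    )


def store_scores(conn, scores):
    """Write precomputed scores in one batch; run from the scheduled task."""
    conn.executemany(
        "UPDATE package SET popularity_score = ? WHERE identity = ?",
        [(score, identity) for identity, score in scores.items()],
    )
    conn.commit()


def top_packages(conn, limit=10):
    """Cheap read path: an API view would only sort on the stored column."""
    cursor = conn.execute(
        "SELECT identity, popularity_score FROM package "
        "ORDER BY popularity_score DESC, identity LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()
```

The expensive graph computation happens only inside the scheduled task that calls `store_scores`, while request handling reduces to an ordered lookup on a precomputed column.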
+
+
+## 6. Scaling Considerations
+
+- Large ecosystems may contain millions of nodes.
+- Version-level graphs may introduce excessive fragmentation.
+- Initial implementation should:
+  - Operate at package identity level
+  - Be ecosystem-scoped
+  - Store precomputed scores
+
+Future improvements may include:
+- Strongly connected component analysis
+- Weighted edges
+- Version-aware ranking
+
+
+## 7. Open Questions
+
+1. Should popularity be computed per ecosystem or globally?
+2. Should dependency resolution be materialized in a normalized table?
+3. Is ignoring version acceptable for the initial PoC?
+4. Should optional dependencies be weighted differently?
+
+
+## 8. Next Steps
+
+If this direction aligns with project goals:
+
+- Implement a minimal PoC for one ecosystem
+- Validate ranking quality
+- Iterate on scoring methodology
+- Integrate into PurlDB API
+
+Feedback before implementation would be greatly appreciated.