| 1 | +# RFC: Graph-Based Popularity Metric for Packages |
| 2 | + |
| 3 | +Related to Issue #833 |
| 4 | + |
| 5 | +## 1. Motivation |
| 6 | + |
| 7 | +PurlDB aggregates metadata for software packages identified by PURLs (Package URLs). |
| 8 | +However, not all packages have equal importance within an ecosystem. |
| 9 | + |
| 10 | +A structured popularity metric could help: |
| 11 | + |
| 12 | +- Prioritize indexing and mining operations |
| 13 | +- Identify critical and widely used packages |
| 14 | +- Improve API filtering and ranking |
| 15 | +- Focus computational resources on highly connected packages |
| 16 | + |
| 17 | +This RFC proposes a graph-based popularity metric derived from dependency relationships. |
| 18 | + |
| 19 | + |
| 20 | +## 2. Current Data Model Observations |
| 21 | + |
| 22 | +After setting up PurlDB locally and reviewing `packagedb/models.py`, I observed: |
| 23 | + |
| 24 | +- Dependencies are represented by the `DependentPackage` model. |
| 25 | +- Each `DependentPackage` links to a source `Package` via a ForeignKey. |
| 26 | +- The dependency target is stored as a PURL string (`purl` field), not as a ForeignKey to another `Package`. |
| 27 | + |
| 28 | +This effectively forms a directed dependency graph: |
| 29 | + |
| 30 | + Package A ---> Package B |
| 31 | + |
| 32 | +where B must be resolved from its PURL. |
| 33 | + |
| 34 | + |
| 35 | +## 3. Proposed Approach |
| 36 | + |
| 37 | +### 3.1 Graph Construction (Initial PoC) |
| 38 | + |
| 39 | +- Nodes: Canonical package identities (ignoring version initially) |
| 40 | +- Edges: Dependency relationships |
| 41 | +- Direction: A → B if A depends on B |
| 42 | + |
| 43 | +For the initial proof of concept: |
| 44 | + |
| 45 | +- Resolve dependency PURLs to canonical package identities. |
| 46 | +- Ignore version to avoid graph fragmentation. |
| 47 | +- Restrict computation to a single ecosystem (e.g., PyPI). |
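The graph construction steps above can be sketched in a few lines. This is a simplified illustration, not PurlDB code: `canonical_identity` and `build_edges` are hypothetical helpers, the string handling covers only the common PURL shapes, and lowercasing assumes case-insensitive names as on PyPI. A real implementation would more likely use a dedicated PURL parser such as packageurl-python.

```python
from collections import defaultdict


def canonical_identity(purl: str) -> str:
    """Return a version-less identity such as 'pkg:pypi/requests'.

    Simplified sketch: drops the subpath ('#...'), the qualifiers ('?...')
    and the version ('@...'). Hypothetical helper, not part of PurlDB.
    """
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a PURL: {purl!r}")
    base = purl.split("#", 1)[0].split("?", 1)[0]
    # Treat the last '@' as the version separator only if no '/' follows it,
    # so a raw '@scope' namespace segment is not mistaken for a version.
    at = base.rfind("@")
    if at != -1 and "/" not in base[at:]:
        base = base[:at]
    # Assumption: names are case-insensitive (true for PyPI).
    return base.rstrip("/").lower()


def build_edges(rows):
    """Build adjacency sets from (source_purl, dependency_purl) pairs.

    An edge A -> B means "A depends on B".
    """
    edges = defaultdict(set)
    for source, target in rows:
        a, b = canonical_identity(source), canonical_identity(target)
        if a != b:  # collapsing versions can create self-loops; skip them
            edges[a].add(b)
    return edges


rows = [
    ("pkg:pypi/flask@2.0.0", "pkg:pypi/werkzeug@2.0.1"),
    ("pkg:pypi/flask@3.0.0", "pkg:pypi/werkzeug"),
    ("pkg:pypi/flask@3.0.0", "pkg:pypi/Jinja2@3.1.0?x=y"),
]
edges = build_edges(rows)
```

Because both Flask versions collapse into one node, the three rows yield a single source node with two outgoing edges, which is the fragmentation-avoiding behaviour the PoC aims for.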
| 48 | + |
| 49 | + |
| 50 | +## 4. Popularity Signals |
| 51 | + |
| 52 | +### 4.1 Dependency Centrality (Primary Signal) |
| 53 | + |
| 54 | +- In-degree (number of reverse dependencies) |
| 55 | +- PageRank-style centrality over the dependency graph |
| 56 | + |
| 57 | +PageRank lets a package score highly when it is depended upon by packages that are themselves highly ranked, which raw in-degree cannot capture. |
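Both signals can be computed from a plain adjacency mapping. The following is a minimal sketch using power iteration; the function names, the damping factor of 0.85, and the fixed iteration count are illustrative assumptions, not choices made by this RFC.

```python
def in_degree(edges):
    """Count reverse dependencies for every node in the graph."""
    counts = {}
    for source, targets in edges.items():
        counts.setdefault(source, 0)
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def pagerank(edges, damping=0.85, iterations=50):
    """PageRank by power iteration over {node: set_of_dependencies}.

    Rank flows from a package to the packages it depends on.
    """
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Nodes without outgoing edges spread their rank evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not edges.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source, targets in edges.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


graph = {"app": {"lib"}, "lib": {"core"}, "tool": {"core"}}
degrees = in_degree(graph)
scores = pagerank(graph)
```

In this toy graph `core` ranks above `lib`, and `lib` ranks above `app` and `tool`, because rank accumulates along dependency chains.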
| 58 | + |
| 59 | +### 4.2 Possible Enhancements (Future Work) |
| 60 | + |
| 61 | +- Freshness factor (based on `release_date`) |
| 62 | +- Mining depth (`mining_level`) |
| 63 | +- Optional decay for inactive packages |
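One possible shape for the freshness and decay ideas is an exponential multiplier applied to the centrality score. This is only a sketch under assumptions: `freshness_factor` and the one-year half-life are hypothetical, and the RFC does not commit to any particular decay function.

```python
from datetime import date


def freshness_factor(release_date, today, half_life_days=365.0):
    """Return a multiplier in (0, 1] that halves every `half_life_days`.

    Hypothetical helper; release dates in the future are treated as age 0.
    """
    age_days = max((today - release_date).days, 0)
    return 0.5 ** (age_days / half_life_days)


fresh = freshness_factor(date(2024, 1, 1), date(2024, 1, 1))
one_year_old = freshness_factor(date(2023, 1, 1), date(2024, 1, 1))
```

A package released today keeps its full score, while one whose latest release is a year old would keep half of it under these assumed parameters.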
| 64 | + |
| 65 | + |
| 66 | +## 5. Computation Strategy |
| 67 | + |
| 68 | +Two possible strategies: |
| 69 | + |
| 70 | +### Option A: Batch Computation (Recommended) |
| 71 | + |
| 72 | +- Compute popularity periodically via scheduled task |
| 73 | +- Store result in database (e.g., `popularity_score` field) |
| 74 | +- Expose score via REST API |
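The batch flow (compute, store, read back) could look roughly like the sketch below. An in-memory SQLite database stands in for PurlDB's real database layer, and the `package_popularity` table and both function names are hypothetical; only the `popularity_score` column name comes from this RFC.

```python
import sqlite3


def store_scores(conn, scores):
    """Persist precomputed scores, overwriting values from earlier runs."""
    with conn:  # commit on success, roll back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS package_popularity ("
            "identity TEXT PRIMARY KEY, popularity_score REAL NOT NULL)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO package_popularity "
            "(identity, popularity_score) VALUES (?, ?)",
            list(scores.items()),
        )


def top_packages(conn, limit=10):
    """Read the highest-scoring packages, as an API endpoint might."""
    cursor = conn.execute(
        "SELECT identity, popularity_score FROM package_popularity "
        "ORDER BY popularity_score DESC, identity LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
store_scores(conn, {"pkg:pypi/a": 0.2, "pkg:pypi/b": 0.5})
store_scores(conn, {"pkg:pypi/a": 0.9})  # a later scheduled run
best = top_packages(conn, limit=1)
```

Reads stay cheap because they touch only the stored column; the graph itself is never traversed at request time.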
| 75 | + |
| 76 | +### Option B: On-Demand Computation |
| 77 | + |
| 78 | +- Compute dynamically during API requests |
| 79 | +- Likely too expensive for large graphs |
| 80 | + |
| 81 | +Batch computation appears more scalable, because the graph is processed once per scheduled run instead of once per request. |
| 82 | + |
| 83 | + |
| 84 | +## 6. Scaling Considerations |
| 85 | + |
| 86 | +- Large ecosystems may contain millions of nodes. |
| 87 | +- Version-level graphs may introduce excessive fragmentation. |
| 88 | +- Initial implementation should: |
| 89 | + - Operate at package identity level |
| 90 | + - Be ecosystem-scoped |
| 91 | + - Store precomputed scores |
| 92 | + |
| 93 | +Future improvements may include: |
| 94 | +- Strongly connected component analysis |
| 95 | +- Weighted edges |
| 96 | +- Version-aware ranking |
| 97 | + |
| 98 | + |
| 99 | +## 7. Open Questions |
| 100 | + |
| 101 | +1. Should popularity be computed per ecosystem or globally? |
| 102 | +2. Should dependency resolution be materialized in a normalized table? |
| 103 | +3. Is ignoring version acceptable for the initial PoC? |
| 104 | +4. Should optional dependencies be weighted differently? |
| 105 | + |
| 106 | + |
| 107 | +## 8. Next Steps |
| 108 | + |
| 109 | +If this direction aligns with project goals: |
| 110 | + |
| 111 | +- Implement a minimal PoC for one ecosystem |
| 112 | +- Validate ranking quality |
| 113 | +- Iterate on scoring methodology |
| 114 | +- Integrate into PurlDB API |
| 115 | + |
| 116 | +Feedback before implementation would be greatly appreciated. |