Recommend a list of users that you may want to connect with
- Clarifying questions
- What is the primary business objective of the system?
- What's the primary use case of the system?
- Are there specific factors needs to be considered for recommendations?
- Are friendships/connections symmetrical?
- What is the scale of the system? (users, connections)
- can we assume the social graph is not very dynamic?
- Do we need continual training?
- How do we collect negative samples? (not clicked, negative feedback).
- How fast the system needs to be?
- Is personalization needed? Yes
-
Use case(s) and business goal
- use case: PYMMK: recommend a list of users to connect with on social media app (e.g. facebook, linkedin)
- business objective: maximize number of formed connections
-
Requirements;
- Scalability: 1 B total users, on avg. 10000 connection per user
-
Constraints:
- Privacy and compliance with data protection regulations.
-
Data: Sources and Availability:
-
Assumptions:
- symmetric firendships
-
ML Formulation:
- Objective:
- maximize number of formed connections
- I/O: I: user_id, O: ranked list of recommended users sorted by the relevance to the user
- ML Category: two options:
- Ranking problem:
- pointwise LTR - binary classifier (user_i, user_j) -> p(connection)
- cons: doesn't capture social connections
- Graph representation (edge prediction)
- supplement with graph info (nodes, edges)
- input: social graph, predict edge b/w nodes
- Ranking problem:
- Objective:
-
Offline
- GNN model: binary classification -> ROC-AUC
- Recommendation system: binary relationships -> mAP
-
Online
- No of friend requests sent over X time
- No of friend requests accepted over X time
- High level architecture
- Node-level predictions
- Edge-level predictions
-
Data Sources
- Users,
- demographics, edu and work backgrounds, skills, etc
- note: standardized data (e.g. cs / computer science)
- User-user connections,
- User-user interactions,
- Users,
-
Labelling
-
Feature selection
-
User:
- ID, username
- Demographics (Age, gender, location)
- Account/Network info: No of connections, followers, following, requests, etc, account age
- Interaction history (No of likes, shares, comments)
- Context (device, time of day, etc)
-
User-user connections:
- Connection: IDs(user1, user2), connection type, timestamp, location
- edu and work affinity: major similarity, companies in common, industry similarity, etc
- social affinity: No. mutual connections (time discounted)
-
User-user interactions:
- IDs(u user1, user2), interaction type, timestamp
-
-
Model selection
- We choose GNN
- operate on graph data
- predict prob of edge
- input: graph (node and edge features)
- output: embedding of each node
- use similarities b/w node embeddings for edge prediction
- We choose GNN
-
Model Training
- snapshot of G at t. model predict connections at t+1
- Dataset
- create a snapshot at time t
- compute node and edge features
- create labels using snapshot at t + 1 (if connection formed, positive)
- Model eval and HP tuning
- Iterations
- Prediction pipeline
- Candidate generation
- Friends of Friends (FoF) - rule based - from 1B to 1K.1K = 1M candidates -> FoF service
- Scoring service (using GNN model -> embeddings -> similarity scores)
- sort by score
- Candidate generation
- pre-compute PYMK tables for each / active users and store in DB
- re-rank based on business logic
- A/B Test
- Deployment and release
- Scaling (SW and ML systems)
- Monitoring
- Updates
- add a lightweight ranker
- bias problem
- delayed feedback problem (user accepts after days)
- personalized random walk (for baseline)