-
Clarifying questions
- Is it a generalized search engine (like Google) or a specialized one (like Amazon product search)?
- What is the primary (business) objective of the search system?
- What are the specific use cases and scenarios where it will be applied?
- What are the system requirements (such as response time, accuracy, scalability, and integration with existing systems or platforms)?
- What is the expected scale of the system in terms of data and user interactions?
- Is there any data available? In what format?
- Should results be personalized? (assume: not required)
- How many languages need to be supported?
- What types of items (products) are available on the platform, and what attributes are associated with them?
- What are the common user search behaviors and patterns? Do users frequently use filters, sort options, or advanced search features?
- Are there search-related challenges unique to the use case (e-commerce), such as handling product availability, pricing, and customer reviews?
-
Use case(s) and business goal
- Use case: user enters text query into search box, system shows the most relevant items (products)
- Business goal: increase CTR, conversion rate, etc.
-
Requirements
- response time, accuracy, scalability (50M DAU)
-
Constraints
- budget limitations, hardware limitations, or legal and privacy constraints
-
Data: sources and availability
- Sources: see Data sources below (users, queries, items, context)
-
Assumptions
-
ML formulation:
- ML Objective: retrieve items that are most relevant to a text query
- we can define relevance as a weighted combination of engagement signals such as click, successful session, conversion, etc. (see the sketch below)
- ML I/O: I: text query from a user, O: ranked list of most relevant items on an e-commerce platform
- ML category: MM input search system -> retrieval and ranking
- ranking: MM input -> multi-label classification (click, success, convert, etc)
- we can use a multi-task classifier
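A minimal sketch of such a graded relevance target; the weights are illustrative assumptions, not tuned values:

```latex
\mathrm{rel}(q, i) = w_{1}\,\mathrm{click} + w_{2}\,\mathrm{successful\_session} + w_{3}\,\mathrm{conversion},
\qquad \text{e.g. } (w_{1}, w_{2}, w_{3}) = (0.2, 0.3, 0.5)
```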
- Offline
- Precision@k, Recall@k, MRR, mAP, NDCG
- we choose NDCG (handles non-binary, graded relevance; worked example after this list)
- Online
- CTR: problem: doesn't capture relevance and is susceptible to clickbait
- successful session rate: share of sessions with dwell time > T or an add-to-cart
- total dwell time
- conversion rate
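A minimal sketch of NDCG@k for graded relevance, assuming the exponential-gain formulation; the labels and cutoff below are illustrative:

```python
import numpy as np

def dcg(relevances, k):
    """Discounted cumulative gain at cutoff k for graded relevance labels."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(rank + 1), ranks start at 1
    return float(np.sum((2.0 ** rel - 1.0) / discounts))

def ndcg(relevances, k):
    """NDCG@k: DCG of the ranking, normalized by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Graded labels of returned items in ranked order (0 = no click ... 4 = purchase)
print(ndcg([3, 2, 0, 1], k=4))  # ~0.99: close to the ideal ordering [3, 2, 1, 0]
```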
- Multimodal search (text, photo, video) for product content from text query:
- Multi-layer architecture
- Query Understanding -> Candidate generation -> stage 1 Ranker -> stage 2 Ranker -> Blender -> Filter
- Query understanding
- spell checker
- query normalization
- query expansion (e.g. add alternatives) / relaxation (e.g. drop low-signal words like "good") (sketch after this list)
- Intent/Domain classification
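A toy sketch of the normalization / expansion / relaxation steps; the synonym table and low-signal word list are made-up placeholders, not a production vocabulary:

```python
# Hypothetical vocabulary; a real system would learn these from logs.
SYNONYMS = {"pants": ["trousers", "suit-pants"]}
LOW_SIGNAL = {"good", "nice", "best"}  # terms dropped by query relaxation

def understand_query(query: str) -> list:
    tokens = query.lower().split()                        # normalization
    tokens = [t for t in tokens if t not in LOW_SIGNAL]   # relaxation
    expanded = []
    for t in tokens:                                      # expansion
        expanded.append(t)
        expanded.extend(SYNONYMS.get(t, []))
    return expanded

print(understand_query("good black pants"))  # ['black', 'pants', 'trousers', 'suit-pants']
```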
- Candidate generation
- focus on recall; narrow millions/billions of items down to ~10Ks
- Ranking
- ML based
- multi-stage ranker: use when there are more than ~10k items to select from or QPS > 10k
- e.g. 100k items: stage 1 (linear model) -> stage 2 (DNN model) -> 500 items
- Blender:
- outputs a SERP (search engine result page)
- blends results from multiple sources e.g. textual (inverted index, semantic) search, visual search, etc.
- overall, the earlier stages narrow ~100B indexed documents down to ~100k candidates
- IR: compares query text with document text
- Document types:
- item (product) title
- item description
- item reviews
- item category
- inverted index:
- an index data structure mapping each term to its locations in a set of documents (e.g. "ABC" -> documents 1, 7)
- after query expansion (e.g. "black pants" -> black AND pants, or suit-pants, trousers, etc.), search the inverted index DB and find relevant items with a relevance score (see the sketch at the end of this IR section)
- relevance score
- weighted linear combination of:
- term match (e.g. TF-IDF score) (e.g. w = 0.5),
- item popularity (e.g. number of reviews or purchases) (e.g. w = 0.125),
- intent match score (e.g. w = 0.0625),
- domain match score,
- personalization score (e.g. age, gender, location, interests)
- see the next sections.
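A toy sketch tying the inverted-index lookup and the weighted relevance score together; the catalog, feature values, and the weights not specified above are illustrative assumptions:

```python
from collections import defaultdict

# Toy catalog: doc id -> title (real documents also cover description, reviews, category).
DOCS = {1: "black suit pants", 7: "black cotton trousers", 9: "red dress"}

# Inverted index: term -> set of doc ids containing it.
index = defaultdict(set)
for doc_id, title in DOCS.items():
    for term in title.split():
        index[term].add(doc_id)

def candidates(query_terms):
    """Union of postings lists for the expanded query terms."""
    hits = set()
    for term in query_terms:
        hits |= index.get(term, set())
    return hits

def relevance(doc_id, query_terms, popularity, intent, domain, personal):
    """Weighted linear combination; weights mirror the examples above."""
    title_terms = set(DOCS[doc_id].split())
    term_match = sum(t in title_terms for t in query_terms) / len(query_terms)
    return (0.5 * term_match + 0.125 * popularity + 0.0625 * intent
            + 0.0625 * domain + 0.25 * personal)

query = ["black", "pants", "trousers"]  # after expansion of "black pants"
for doc_id in candidates(query):
    print(doc_id, relevance(doc_id, query, popularity=0.5, intent=1, domain=1, personal=0.5))
```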
- Data sources:
- Users
- Queries
- Items (products)
- Context
- Labeling:
- use online user engagement data to generate positive and negative labels
- Feature selection
- User:
- ID, username,
- Demographics (age, gender, location)
- User interaction history (click rate, purchase rate, etc)
- User interests (e.g. categories)
- Context:
- device,
- time of the day,
- recently trending results
- previous queries
- Query features:
- query historical engagement (by other users)
- query intent / domain
- query embeddings
- Item (product) features
- Title (exact text + embeddings)
- Description (exact text + embeddings)
- Reviews data (avg rating, number of reviews, review text (raw text + embeddings))
- category
- page rank
- engagement radius
- User-Item(product) features
- distance (e.g. for shipment)
- historical engagement by the user (e.g. document type)
- Query-Item(product) features
- text match (title, description, category)
- unigram or bigram match (title, description, category) - TF-IDF score (sketch after this list)
- historical engagement (e.g. click rate of Item for that query)
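A sketch of the unigram/bigram text-match feature using scikit-learn's TF-IDF; the item titles and query are toy examples:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = ["black suit pants", "black cotton trousers", "red dress"]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
title_vectors = vectorizer.fit_transform(titles)

query_vector = vectorizer.transform(["black pants"])
scores = cosine_similarity(query_vector, title_vectors)[0]
print(scores.round(2))  # highest score for "black suit pants"
```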
-
Model Selection
- Two options:
- Pointwise LTR model: <user, item> -> relevance score
- approximate it as a binary classification problem: predict p(relevant)
- Pairwise LTR model: <user, item1, item2> -> item1 score > item2 score ?
- loss penalizes pairs whose predicted order is wrong
- a more natural fit for ranking, but more complex
- Multi - Stage ranking
- 100k items (focus on recall) -> 500 items (focus on precision) -> 500 items in correct order
- Stage 1: We use a pointwise LTR -> binary classifier
- latency: microseconds
- suggestion: LR or small MART (multiple additive regression trees)
- use ROC AUC for metric
- Stage 2: Pairwise LTR model
- Two options (choose based on train data availability and capacity):
- LambdaMART: a variation of MART with the objective function changed to improve pairwise ranking
- LambdaRank: NN-based model with a pairwise loss (minimizes inversions in the ranking; sketch below)
- use NDCG as the metric
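A minimal numpy sketch of the pairwise (RankNet-style) loss idea behind these models; LambdaMART/LambdaRank additionally scale each pair's gradient by the NDCG change from swapping the pair:

```python
import numpy as np

def pairwise_logistic_loss(score_preferred, score_other):
    """Small when the preferred item outscores the other; large on inversions."""
    return np.log1p(np.exp(-(score_preferred - score_other)))

print(pairwise_logistic_loss(2.0, 0.5))  # ~0.20: correctly ordered pair
print(pairwise_logistic_loss(0.5, 2.0))  # ~1.70: inversion is penalized
```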
-
Training Dataset
- Pointwise approach
- positive samples: user engaged (e.g. click, spent time > T, add to cart, purchased)
- negative samples: impressions with no user engagement, plus random negatives, e.g. items from page 10 and beyond
- 5 million Q/day -> one positive one negative sample from each query -> 10 million samples a day
- use at least a whole week's data to capture daily and day-of-week patterns
- also capture and handle seasonal and holiday effects
- train-valid/test split: 70/30 (of 70 million)
- temporal effect: e.g. with 3 weeks of data, train on the first 2 weeks and validate/test on the last week
- Pairwise approach:
- ranks items according to their relative order, which is closer to the nature of ranking
- predicts document scores in a way that minimizes the number of inversions in the final ranked result
- Two options for training data generation for the pairwise approach
- human raters: each rater labels 10 results per query; 100K queries * 10 results * 10 raters = 10M examples
- expensive, doesn't scale
- online engagement data
- assign graded scores to each engagement type, e.g. (labeling sketch after this list):
- impression with no click -> label/score 0
- click only -> score 1
- spent time after click > T : score 2
- add to cart : score 3
- purchase: score 4
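A sketch of mapping logged engagement events to the graded scores above; the event field names and dwell threshold are assumptions:

```python
DWELL_THRESHOLD_SEC = 30  # the threshold "T" above; illustrative value

def engagement_score(event: dict) -> int:
    """Graded relevance label from a logged engagement event."""
    if event.get("purchased"):
        return 4
    if event.get("added_to_cart"):
        return 3
    if event.get("clicked") and event.get("dwell_sec", 0) > DWELL_THRESHOLD_SEC:
        return 2
    if event.get("clicked"):
        return 1
    return 0  # impression with no click

print(engagement_score({"clicked": True, "dwell_sec": 45}))  # 2
```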
- Re-ranking
- business-level logic and policies:
- filtering inappropriate items
- diversity (exploration/exploitation)
- etc
- Two ways:
- rule-based filters and aggregators (sketch at the end of this section)
- ML model
- Binary Classification (P(inappropriate))
- Data sources: human raters, user feedback (report, review)
- Features: same as product features in ranker
- Models: LR, MART, or DNN (depending on data size, capacity, experiments)
- More details on harmful content classification
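A sketch of the rule-based variant: filter items the classifier flags as likely inappropriate, then cap items per category for diversity; the field names and thresholds are assumptions:

```python
def rerank(ranked_items, max_per_category=2, inappropriate_threshold=0.9):
    """Apply policy filtering and a simple diversity cap to ranker output."""
    result, per_category = [], {}
    for item in ranked_items:  # assumed sorted by ranker score, best first
        if item["p_inappropriate"] > inappropriate_threshold:
            continue  # policy filter
        count = per_category.get(item["category"], 0)
        if count >= max_per_category:
            continue  # diversity cap (exploration across categories)
        per_category[item["category"]] = count + 1
        result.append(item)
    return result
```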
- Positional bias
- top-ranked items get more clicks regardless of relevance, which skews engagement-derived labels (especially in the pointwise approach); mitigate by logging display position as a feature (set to a constant at serving time) or via inverse propensity weighting