Add cookbook: Cross-lingual Hybrid Retrieval with BM25 + Multilingual Embeddings #273

Open
geyuxu wants to merge 2 commits into deepset-ai:main from geyuxu:add-cross-lingual-hybrid-retrieval

@geyuxu commented Feb 23, 2026

Description

This cookbook demonstrates how to build a cross-lingual hybrid retrieval pipeline
using Haystack, combining BM25 keyword search with multilingual dense embeddings
to search across Chinese and English documents.

What it covers:

  • Why BM25 alone fails for cross-lingual queries
  • Using paraphrase-multilingual-MiniLM-L12-v2 for cross-lingual dense retrieval
  • Combining BM25 + dense retrieval with DocumentJoiner (Reciprocal Rank Fusion)
  • Optional cross-encoder re-ranking with TransformersRanker
  • A comparison table showing retrieval coverage across methods
  • A RAG prompt template for multilingual document synthesis
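The first point can be illustrated without any retrieval library. BM25 scores a document by summing contributions from query terms that also occur in the document, so a query with no shared tokens scores zero. The sketch below is a simplification: the `tokenize` helper and the sample sentences are made up for illustration, and the tokenizer used by `InMemoryBM25Retriever` may behave differently.

```python
import re


def tokenize(text: str) -> set[str]:
    """Hypothetical tokenizer: lowercase ASCII words, one token per CJK character."""
    return set(re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text.lower()))


def lexical_overlap(query: str, document: str) -> set[str]:
    """Tokens shared by query and document; BM25 can only score these."""
    return tokenize(query) & tokenize(document)


# Made-up sample text, not taken from the notebook.
english_doc = "Solar panels convert sunlight into electricity."
english_query = "solar panels"
chinese_query = "太阳能电池板"  # roughly "solar panels"

overlap_en = lexical_overlap(english_query, english_doc)
overlap_zh = lexical_overlap(chinese_query, english_doc)

print("English query overlap:", sorted(overlap_en))
print("Chinese query overlap:", sorted(overlap_zh))
```

With an empty overlap, the Chinese query gives BM25 nothing to match against the English document, which is why a dense multilingual retriever is needed alongside it.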

Components used:

  • InMemoryDocumentStore
  • InMemoryBM25Retriever
  • InMemoryEmbeddingRetriever
  • SentenceTransformersDocumentEmbedder / SentenceTransformersTextEmbedder
  • DocumentJoiner (reciprocal_rank_fusion)
  • TransformersRanker
  • PromptBuilder
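For readers unfamiliar with the `reciprocal_rank_fusion` join mode listed above, here is a minimal pure-Python sketch of the general technique. It assumes the commonly used smoothing constant `k = 60` and plain document IDs; the function name and the sample rankings are hypothetical, and `DocumentJoiner` may use a different constant or rescale the scores.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each document earns 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from a keyword retriever and a dense retriever.
bm25_ranking = ["en_1", "en_2", "en_3"]
dense_ranking = ["zh_1", "en_1", "zh_2"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Because the fusion depends only on ranks, not on raw scores, BM25 and cosine-similarity scores never need to be put on a common scale. A document found by both retrievers (`en_1` here) rises to the top.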

Why this is useful:

Haystack's cookbook currently has excellent examples for hybrid retrieval
(BM42, SPLADE) but none specifically addressing cross-lingual scenarios.
Many real-world applications involve multilingual document collections
(e.g., international enterprises, academic research, e-commerce across markets).
This cookbook fills that gap.

Tested with:

  • Haystack 2.x (latest)
  • Python 3.10+
  • No external services required (uses InMemoryDocumentStore)

@geyuxu requested a review from a team as a code owner on February 23, 2026 08:35
@kacperlukawski self-requested a review on February 23, 2026 14:57
@kacperlukawski left a comment


Hey @geyuxu! Thanks for contributing a notebook! I have some comments, and I'm happy to help polish the cookbook.

@kacperlukawski commented Apr 23, 2026


I tried playing with the pipeline a bit, but it consistently returns English documents only, so the cross-lingual retrieval isn't working as described. Even the current output shows that:

Retrieved documents:
[1] (lang=en) The European Union has set ambitious carbon neutrality targe...
[2] (lang=en) Urban green spaces play a crucial role in reducing heat isla...
[3] (lang=en) Solar panel efficiency has improved significantly in recent ...


@kacperlukawski commented Apr 23, 2026


Could we remove the mention of the BM42 cookbook?

⚠️ Recent evaluations have raised questions about the validity of BM42. Future developments may address these concerns. Please keep this in mind while reviewing the content.

