5 changes: 5 additions & 0 deletions index.toml
@@ -133,6 +133,11 @@ title = "RAG Evaluation with Prometheus 2"
notebook = "prometheus2_evaluation.ipynb"
topics = ["Evaluation"]

[[cookbook]]
title = "Ground-Truth Retrieval Evaluation with Recall and MRR"
notebook = "ground_truth_retrieval_evaluation.ipynb"
topics = ["Evaluation", "RAG", "Advanced Retrieval"]

[[cookbook]]
title = "Advanced Prompt Customization for Anthropic"
notebook = "prompt_customization_for_Anthropic.ipynb"
281 changes: 281 additions & 0 deletions notebooks/ground_truth_retrieval_evaluation.ipynb
@@ -0,0 +1,281 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Ground-Truth Retrieval Evaluation with Recall and MRR\n",
"\n",
"This cookbook shows how to evaluate a retriever before adding answer generation to a RAG pipeline. We will build a small in-memory corpus, define labeled query-document pairs, retrieve documents for each query, and evaluate the retrieved documents with Haystack's `DocumentRecallEvaluator` and `DocumentMRREvaluator`.\n",
"\n",
"The example uses local in-memory components only. It does not require an LLM provider, a vector database, or any external API keys."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install haystack-ai"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"\n",
"Retrieval evaluation isolates the retriever from the generator. This helps diagnose whether weak RAG answers come from missing context or from generation problems.\n",
"\n",
"This kind of evaluation is useful when you have labeled query-document pairs: for each test query, you know which document or documents should be retrieved. That makes it different from LLM-as-judge or faithfulness evaluation, which usually inspect generated answers. Those end-to-end checks are still valuable, but they answer a different question: whether the final answer is good and grounded. Here, we ask whether the retriever returned the expected evidence in the first place.\n",
"\n",
"We will use two retrieval metrics:\n",
"\n",
"- **Recall** answers: What fraction of relevant documents did the retriever return?\n",
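"- **MRR** answers: How early did the first relevant document appear? Each query scores 1/rank of its first relevant document (0 if none is retrieved), and MRR is the mean of these scores over all queries."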
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a small document store\n",
"\n",
"First, create a tiny corpus about retrieval and RAG evaluation concepts. We assign stable document IDs so the ground-truth labels are easy to read."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from haystack import Document, Pipeline\n",
"from haystack.document_stores.in_memory import InMemoryDocumentStore\n",
"from haystack.components.retrievers.in_memory import InMemoryBM25Retriever\n",
"\n",
"\n",
"documents = [\n",
" Document(\n",
" id=\"doc-vector-databases\",\n",
" content=(\n",
" \"Vector databases store embedding vectors and support similarity search. \"\n",
" \"They are often used in RAG systems to find semantically related documents.\"\n",
" ),\n",
" ),\n",
" Document(\n",
" id=\"doc-chunk-size\",\n",
" content=(\n",
" \"Chunk size affects retrieval quality. Smaller chunks can be precise, \"\n",
" \"while larger chunks may preserve more context for generation.\"\n",
" ),\n",
" ),\n",
" Document(\n",
" id=\"doc-mrr\",\n",
" content=(\n",
" \"Mean Reciprocal Rank, or MRR, evaluates ranking quality by looking at \"\n",
" \"the position of the first relevant result returned by a retriever.\"\n",
" ),\n",
" ),\n",
" Document(\n",
" id=\"doc-faithfulness\",\n",
" content=(\n",
" \"Faithfulness evaluation checks whether generated answers are grounded \"\n",
" \"in the retrieved context rather than unsupported model knowledge.\"\n",
" ),\n",
" ),\n",
" Document(\n",
" id=\"doc-recall\",\n",
" content=(\n",
" \"Recall measures what fraction of relevant documents appears in the retrieved \"\n",
" \"set. High recall means the retriever is less likely to miss needed evidence.\"\n",
" ),\n",
" ),\n",
"]\n",
"\n",
"document_store = InMemoryDocumentStore()\n",
"document_store.write_documents(documents)\n",
"# Index documents by ID so ground-truth labels can be resolved to Document objects later\n",
"doc_by_id = {document.id: document for document in documents}\n",
"\n",
"document_store.count_documents()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define evaluation queries and ground-truth documents\n",
"\n",
"Each example contains a query and the IDs of the documents that should be considered relevant; in this toy set, each query has exactly one relevant document. In a real project, these labels often come from human annotation, search logs, or an existing benchmark dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"eval_examples = [\n",
" {\n",
" \"query\": \"How do vector databases help retrieval augmented generation?\",\n",
" \"ground_truth_ids\": [\"doc-vector-databases\"],\n",
" },\n",
" {\n",
" \"query\": \"What metric rewards putting the first relevant document near the top?\",\n",
" \"ground_truth_ids\": [\"doc-mrr\"],\n",
" },\n",
" {\n",
" \"query\": \"Why does chunk size matter for retrieval quality?\",\n",
" \"ground_truth_ids\": [\"doc-chunk-size\"],\n",
" },\n",
" {\n",
" \"query\": \"How can I tell whether the retriever missed needed evidence?\",\n",
" \"ground_truth_ids\": [\"doc-recall\"],\n",
" },\n",
"]\n",
"\n",
"queries = [example[\"query\"] for example in eval_examples]\n",
"ground_truth_documents = [\n",
" [doc_by_id[doc_id] for doc_id in example[\"ground_truth_ids\"]] for example in eval_examples\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build and run a retrieval pipeline\n",
"\n",
"Now build a minimal retrieval pipeline with `InMemoryBM25Retriever`. For each query, collect the retrieved `Document` objects. The evaluators compare these retrieved documents with the ground-truth documents."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"retrieval_pipeline = Pipeline()\n",
"retrieval_pipeline.add_component(\n",
" \"retriever\", InMemoryBM25Retriever(document_store=document_store, top_k=3)\n",
")\n",
"\n",
"retrieved_documents = []\n",
"\n",
"for query in queries:\n",
" result = retrieval_pipeline.run({\"retriever\": {\"query\": query}})\n",
" retrieved_documents.append(result[\"retriever\"][\"documents\"])\n",
"\n",
"for example, documents_for_query in zip(eval_examples, retrieved_documents):\n",
" print(f\"Query: {example['query']}\")\n",
" print(f\"Ground truth: {example['ground_truth_ids']}\")\n",
" print(\"Retrieved:\")\n",
" for rank, document in enumerate(documents_for_query, start=1):\n",
" snippet = document.content[:90].rstrip()\n",
" print(f\" {rank}. {document.id} | {snippet}...\")\n",
" print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluate with Recall and MRR\n",
"\n",
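"`DocumentRecallEvaluator` checks whether the expected documents appear in the retrieved set. In its default single-hit mode, a query scores 1.0 if at least one expected document is retrieved; with `mode=\"multi_hit\"`, it scores the fraction of expected documents retrieved. Because each query here has exactly one labeled document, both modes give the same result. `DocumentMRREvaluator` also considers rank: retrieving a relevant document at rank 1 scores better than retrieving it at rank 3."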
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from haystack.components.evaluators import DocumentMRREvaluator, DocumentRecallEvaluator\n",
"\n",
"\n",
"recall_evaluator = DocumentRecallEvaluator()\n",
"mrr_evaluator = DocumentMRREvaluator()\n",
"\n",
"recall_result = recall_evaluator.run(\n",
" ground_truth_documents=ground_truth_documents,\n",
" retrieved_documents=retrieved_documents,\n",
")\n",
"mrr_result = mrr_evaluator.run(\n",
" ground_truth_documents=ground_truth_documents,\n",
" retrieved_documents=retrieved_documents,\n",
")\n",
"\n",
"print(f\"Average Recall: {recall_result['score']:.2f}\")\n",
"print(f\"Average MRR: {mrr_result['score']:.2f}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for example, documents_for_query, recall, mrr in zip(\n",
" eval_examples,\n",
" retrieved_documents,\n",
" recall_result[\"individual_scores\"],\n",
" mrr_result[\"individual_scores\"],\n",
"):\n",
" retrieved_ids = [document.id for document in documents_for_query]\n",
" print(f\"Query: {example['query']}\")\n",
" print(f\" Expected: {example['ground_truth_ids']}\")\n",
" print(f\" Retrieved: {retrieved_ids}\")\n",
" print(f\" Recall: {recall:.2f} | MRR: {mrr:.2f}\")\n",
" print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Interpret the results\n",
"\n",
"In this toy corpus, a high Recall score means the retriever usually includes the labeled document in the top results. A lower Recall score would mean the retriever missed relevant evidence, so changing the retriever, query formulation, filters, chunking strategy, or `top_k` would be worth investigating before tuning generation.\n",
"\n",
"MRR adds ranking sensitivity. If Recall is high but MRR is lower, the relevant document is present but often ranked below less useful documents. That can still hurt RAG quality: generators may overweight the first retrieved contexts, and in larger pipelines lower-ranked contexts may be cut off when the context window fills up."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"To adapt this pattern to a larger RAG system:\n",
"\n",
"- Replace the toy corpus with your indexed documents or document chunks.\n",
"- Build a labeled evaluation set of queries and relevant document IDs.\n",
"- Run the same evaluator components for each retriever or retrieval configuration you want to compare.\n",
"- Combine retrieval metrics with answer-level checks, such as faithfulness or LLM-as-judge evaluation, once the retriever is returning the expected evidence."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
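
As a cross-check on the scores the notebook reports, the two metrics can be sketched in plain Python. This is a minimal illustration, not Haystack's implementation: the helper names (`recall_for_query`, `reciprocal_rank`, `mean`) and the three rankings are hypothetical, and the sketch compares document IDs directly, whereas the notebook passes `Document` objects to the evaluators. Recall here is the fraction-of-relevant-documents form described in the notebook's introduction.

```python
from typing import Iterable, Sequence


def recall_for_query(relevant_ids: Sequence[str], retrieved_ids: Sequence[str]) -> float:
    """Fraction of relevant IDs that appear anywhere in the retrieved list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids)) / len(relevant)


def reciprocal_rank(relevant_ids: Sequence[str], retrieved_ids: Sequence[str]) -> float:
    """1/rank of the first relevant ID (ranks start at 1), or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean(values: Iterable[float]) -> float:
    """Arithmetic mean, defined as 0.0 for an empty input."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Hypothetical labels and rankings, made up for illustration only.
labels = [["doc-mrr"], ["doc-recall"], ["doc-chunk-size"]]
rankings = [
    ["doc-mrr", "doc-recall", "doc-faithfulness"],      # relevant document at rank 1
    ["doc-faithfulness", "doc-recall", "doc-mrr"],      # relevant document at rank 2
    ["doc-vector-databases", "doc-mrr", "doc-recall"],  # relevant document missing
]

recalls = [recall_for_query(truth, ranked) for truth, ranked in zip(labels, rankings)]
reciprocal_ranks = [reciprocal_rank(truth, ranked) for truth, ranked in zip(labels, rankings)]

average_recall = mean(recalls)
mrr = mean(reciprocal_ranks)

print(f"Per-query recall: {recalls}")
print(f"Per-query reciprocal rank: {reciprocal_ranks}")
print(f"Average Recall: {average_recall:.2f} | MRR: {mrr:.2f}")
```

In this made-up example the second query contributes fully to Recall but only 0.5 to MRR, which is the "present but not first" case the notebook's interpretation section describes; the third query contributes 0 to both.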