From b73a2a738c54180e476641d1179cf55046ddf2f0 Mon Sep 17 00:00:00 2001 From: Aditya Tiwari Date: Fri, 26 Jun 2026 11:08:16 -0700 Subject: [PATCH 1/4] docs: add retrieval evaluation cookbook skeleton --- .../ground_truth_retrieval_evaluation.ipynb | 94 +++++++++++++++++++ 1 file changed, 94 insertions(+) create mode 100644 notebooks/ground_truth_retrieval_evaluation.ipynb diff --git a/notebooks/ground_truth_retrieval_evaluation.ipynb b/notebooks/ground_truth_retrieval_evaluation.ipynb new file mode 100644 index 0000000..0edca2f --- /dev/null +++ b/notebooks/ground_truth_retrieval_evaluation.ipynb @@ -0,0 +1,94 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Ground-Truth Retrieval Evaluation with Recall and MRR\n", + "\n", + "This cookbook shows how to evaluate a retriever before adding answer generation to a RAG pipeline. We will build a small in-memory corpus, define labeled query-document pairs, retrieve documents for each query, and evaluate the retrieved documents with Haystack's `DocumentRecallEvaluator` and `DocumentMRREvaluator`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install haystack-ai" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Introduction\n", + "\n", + "Retrieval evaluation isolates the retriever from the generator. This helps diagnose whether weak RAG answers come from missing context or from generation problems." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create a small document store" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define evaluation queries and ground-truth documents" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Build and run a retrieval pipeline" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Evaluate with Recall and MRR" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Interpret the results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Next steps" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 708478701c872b1275dbaad90203c33b741229fc Mon Sep 17 00:00:00 2001 From: Aditya Tiwari Date: Fri, 26 Jun 2026 11:12:23 -0700 Subject: [PATCH 2/4] docs: add sample retrieval pipeline for evaluation cookbook --- .../ground_truth_retrieval_evaluation.ipynb | 114 ++++++++++++++++++ 1 file changed, 114 insertions(+) diff --git a/notebooks/ground_truth_retrieval_evaluation.ipynb b/notebooks/ground_truth_retrieval_evaluation.ipynb index 0edca2f..14cd1bb 100644 --- a/notebooks/ground_truth_retrieval_evaluation.ipynb +++ b/notebooks/ground_truth_retrieval_evaluation.ipynb @@ -34,6 +34,62 @@ "## Create a small document store" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from haystack import Document, Pipeline\n", + "from haystack.document_stores.in_memory import InMemoryDocumentStore\n", + "from haystack.components.retrievers.in_memory 
import InMemoryBM25Retriever\n", + "\n", + "\n", + "documents = [\n", + " Document(\n", + " id=\"doc-vector-databases\",\n", + " content=(\n", + " \"Vector databases store embedding vectors and support similarity search. \"\n", + " \"They are often used in RAG systems to find semantically related documents.\"\n", + " ),\n", + " ),\n", + " Document(\n", + " id=\"doc-chunk-size\",\n", + " content=(\n", + " \"Chunk size affects retrieval quality. Smaller chunks can be precise, \"\n", + " \"while larger chunks may preserve more context for generation.\"\n", + " ),\n", + " ),\n", + " Document(\n", + " id=\"doc-mrr\",\n", + " content=(\n", + " \"Mean Reciprocal Rank, or MRR, evaluates ranking quality by looking at \"\n", + " \"the position of the first relevant result returned by a retriever.\"\n", + " ),\n", + " ),\n", + " Document(\n", + " id=\"doc-faithfulness\",\n", + " content=(\n", + " \"Faithfulness evaluation checks whether generated answers are grounded \"\n", + " \"in the retrieved context rather than unsupported model knowledge.\"\n", + " ),\n", + " ),\n", + " Document(\n", + " id=\"doc-recall\",\n", + " content=(\n", + " \"Recall measures whether all relevant documents appear in the retrieved \"\n", + " \"set. 
High recall means the retriever is less likely to miss needed evidence.\"\n", + " ),\n", + " ),\n", + "]\n", + "\n", + "document_store = InMemoryDocumentStore()\n", + "document_store.write_documents(documents)\n", + "doc_by_id = {document.id: document for document in documents}\n", + "\n", + "document_store.count_documents()" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -41,6 +97,37 @@ "## Define evaluation queries and ground-truth documents" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "eval_examples = [\n", + " {\n", + " \"query\": \"How do vector databases help retrieval augmented generation?\",\n", + " \"ground_truth_ids\": [\"doc-vector-databases\"],\n", + " },\n", + " {\n", + " \"query\": \"What metric rewards putting the first relevant document near the top?\",\n", + " \"ground_truth_ids\": [\"doc-mrr\"],\n", + " },\n", + " {\n", + " \"query\": \"Why does chunk size matter for retrieval quality?\",\n", + " \"ground_truth_ids\": [\"doc-chunk-size\"],\n", + " },\n", + " {\n", + " \"query\": \"How can I tell whether the retriever missed needed evidence?\",\n", + " \"ground_truth_ids\": [\"doc-recall\"],\n", + " },\n", + "]\n", + "\n", + "queries = [example[\"query\"] for example in eval_examples]\n", + "ground_truth_documents = [\n", + " [doc_by_id[doc_id] for doc_id in example[\"ground_truth_ids\"]] for example in eval_examples\n", + "]" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -48,6 +135,33 @@ "## Build and run a retrieval pipeline" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "retrieval_pipeline = Pipeline()\n", + "retrieval_pipeline.add_component(\n", + " \"retriever\", InMemoryBM25Retriever(document_store=document_store, top_k=3)\n", + ")\n", + "\n", + "retrieved_documents = []\n", + "\n", + "for query in queries:\n", + " result = retrieval_pipeline.run({\"retriever\": {\"query\": query}})\n", + " 
retrieved_documents.append(result[\"retriever\"][\"documents\"])\n", + "\n", + "for example, documents_for_query in zip(eval_examples, retrieved_documents):\n", + " print(f\"Query: {example['query']}\")\n", + " print(f\"Ground truth: {example['ground_truth_ids']}\")\n", + " print(\"Retrieved:\")\n", + " for rank, document in enumerate(documents_for_query, start=1):\n", + " snippet = document.content[:90].rstrip()\n", + " print(f\" {rank}. {document.id} | {snippet}...\")\n", + " print()" + ] + }, { "cell_type": "markdown", "metadata": {}, From 33b847f271d2078a719716bac4e7d8371ab41bb5 Mon Sep 17 00:00:00 2001 From: Aditya Tiwari Date: Fri, 26 Jun 2026 11:14:23 -0700 Subject: [PATCH 3/4] docs: evaluate retrieval with recall and mrr --- .../ground_truth_retrieval_evaluation.ipynb | 45 +++++++++++++++++++ 1 file changed, 45 insertions(+) diff --git a/notebooks/ground_truth_retrieval_evaluation.ipynb b/notebooks/ground_truth_retrieval_evaluation.ipynb index 14cd1bb..00d46da 100644 --- a/notebooks/ground_truth_retrieval_evaluation.ipynb +++ b/notebooks/ground_truth_retrieval_evaluation.ipynb @@ -169,6 +169,51 @@ "## Evaluate with Recall and MRR" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from haystack.components.evaluators import DocumentMRREvaluator, DocumentRecallEvaluator\n", + "\n", + "\n", + "recall_evaluator = DocumentRecallEvaluator()\n", + "mrr_evaluator = DocumentMRREvaluator()\n", + "\n", + "recall_result = recall_evaluator.run(\n", + " ground_truth_documents=ground_truth_documents,\n", + " retrieved_documents=retrieved_documents,\n", + ")\n", + "mrr_result = mrr_evaluator.run(\n", + " ground_truth_documents=ground_truth_documents,\n", + " retrieved_documents=retrieved_documents,\n", + ")\n", + "\n", + "print(f\"Average Recall: {recall_result['score']:.2f}\")\n", + "print(f\"Average MRR: {mrr_result['score']:.2f}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": 
{}, + "outputs": [], + "source": [ + "for example, documents_for_query, recall, mrr in zip(\n", + " eval_examples,\n", + " retrieved_documents,\n", + " recall_result[\"individual_scores\"],\n", + " mrr_result[\"individual_scores\"],\n", + "):\n", + " retrieved_ids = [document.id for document in documents_for_query]\n", + " print(f\"Query: {example['query']}\")\n", + " print(f\" Expected: {example['ground_truth_ids']}\")\n", + " print(f\" Retrieved: {retrieved_ids}\")\n", + " print(f\" Recall: {recall:.2f} | MRR: {mrr:.2f}\")\n", + " print()" + ] + }, { "cell_type": "markdown", "metadata": {}, From 53959d848d9ec760294f6db3e6ed2c8a79373053 Mon Sep 17 00:00:00 2001 From: Aditya Tiwari Date: Fri, 26 Jun 2026 11:16:21 -0700 Subject: [PATCH 4/4] docs: polish retrieval evaluation cookbook --- index.toml | 5 ++ .../ground_truth_retrieval_evaluation.ipynb | 46 +++++++++++++++---- 2 files changed, 42 insertions(+), 9 deletions(-) diff --git a/index.toml b/index.toml index 88948ba..90897e7 100644 --- a/index.toml +++ b/index.toml @@ -133,6 +133,11 @@ title = "RAG Evaluation with Prometheus 2" notebook = "prometheus2_evaluation.ipynb" topics = ["Evaluation"] +[[cookbook]] +title = "Ground-Truth Retrieval Evaluation with Recall and MRR" +notebook = "ground_truth_retrieval_evaluation.ipynb" +topics = ["Evaluation", "RAG", "Advanced Retrieval"] + [[cookbook]] title = "Advanced Prompt Customization for Anthropic" notebook = "prompt_customization_for_Anthropic.ipynb" diff --git a/notebooks/ground_truth_retrieval_evaluation.ipynb b/notebooks/ground_truth_retrieval_evaluation.ipynb index 00d46da..d1bd2c2 100644 --- a/notebooks/ground_truth_retrieval_evaluation.ipynb +++ b/notebooks/ground_truth_retrieval_evaluation.ipynb @@ -6,7 +6,9 @@ "source": [ "# Ground-Truth Retrieval Evaluation with Recall and MRR\n", "\n", - "This cookbook shows how to evaluate a retriever before adding answer generation to a RAG pipeline. 
We will build a small in-memory corpus, define labeled query-document pairs, retrieve documents for each query, and evaluate the retrieved documents with Haystack's `DocumentRecallEvaluator` and `DocumentMRREvaluator`." + "This cookbook shows how to evaluate a retriever before adding answer generation to a RAG pipeline. We will build a small in-memory corpus, define labeled query-document pairs, retrieve documents for each query, and evaluate the retrieved documents with Haystack's `DocumentRecallEvaluator` and `DocumentMRREvaluator`.\n", + "\n", + "The example uses local in-memory components only. It does not require an LLM provider, a vector database, or any external API keys." ] }, { @@ -24,14 +26,23 @@ "source": [ "## Introduction\n", "\n", - "Retrieval evaluation isolates the retriever from the generator. This helps diagnose whether weak RAG answers come from missing context or from generation problems." + "Retrieval evaluation isolates the retriever from the generator. This helps diagnose whether weak RAG answers come from missing context or from generation problems.\n", + "\n", + "This kind of evaluation is useful when you have labeled query-document pairs: for each test query, you know which document or documents should be retrieved. That makes it different from LLM-as-judge or faithfulness evaluation, which usually inspect generated answers. Those end-to-end checks are still valuable, but they answer a different question: whether the final answer is good and grounded. Here, we ask whether the retriever returned the expected evidence in the first place.\n", + "\n", + "We will use two retrieval metrics:\n", + "\n", + "- **Recall** answers: What fraction of relevant documents did the retriever return?\n", + "- **MRR** answers: How early did the first relevant document appear?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Create a small document store" + "## Create a small document store\n", + "\n", + "First, create a tiny corpus about retrieval and RAG evaluation concepts. We assign stable document IDs so the ground-truth labels are easy to read." ] }, { @@ -77,7 +88,7 @@ " Document(\n", " id=\"doc-recall\",\n", " content=(\n", - " \"Recall measures whether all relevant documents appear in the retrieved \"\n", + " \"Recall measures what fraction of relevant documents appears in the retrieved \"\n", " \"set. High recall means the retriever is less likely to miss needed evidence.\"\n", " ),\n", " ),\n", @@ -94,7 +105,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Define evaluation queries and ground-truth documents" + "## Define evaluation queries and ground-truth documents\n", + "\n", + "Each example contains a query and the ID of the document that should be considered relevant. In a real project, these labels often come from human annotation, search logs, or an existing benchmark dataset." ] }, { @@ -132,7 +145,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Build and run a retrieval pipeline" + "## Build and run a retrieval pipeline\n", + "\n", + "Now build a minimal retrieval pipeline with `InMemoryBM25Retriever`. For each query, collect the retrieved `Document` objects. The evaluators compare these retrieved documents with the ground-truth documents." ] }, { @@ -166,7 +181,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Evaluate with Recall and MRR" + "## Evaluate with Recall and MRR\n", + "\n", + "In its default single-hit mode, `DocumentRecallEvaluator` checks whether at least one expected document appears in the retrieved set; pass `mode=\"multi_hit\"` to score the fraction of expected documents that were retrieved. Each query here has exactly one ground-truth document, so both modes give the same score. `DocumentMRREvaluator` also considers rank: retrieving a relevant document at rank 1 scores better than retrieving it at rank 3." 
] }, { @@ -218,14 +235,25 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Interpret the results" + "## Interpret the results\n", + "\n", + "In this toy corpus, a high Recall score means the retriever usually includes the labeled document in the top results. A lower Recall score would mean the retriever missed relevant evidence, so changing the retriever, query formulation, filters, chunking strategy, or `top_k` would be worth investigating before tuning generation.\n", + "\n", + "MRR adds ranking sensitivity. If Recall is high but MRR is lower, the relevant document is present but often buried below less useful documents. That can still hurt RAG quality because generators may overweight the first retrieved contexts or run out of context window on larger pipelines." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Next steps" + "## Next steps\n", + "\n", + "To adapt this pattern to a larger RAG system:\n", + "\n", + "- Replace the toy corpus with your indexed documents or document chunks.\n", + "- Build a labeled evaluation set of queries and relevant document IDs.\n", + "- Run the same evaluator components for each retriever or retrieval configuration you want to compare.\n", + "- Combine retrieval metrics with answer-level checks, such as faithfulness or LLM-as-judge evaluation, once the retriever is returning the expected evidence." ] } ],
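As a rough illustration of what the two metrics in this notebook measure, the sketch below recomputes them in plain Python over document IDs. It is not Haystack's implementation: the helper names (`recall_single_hit`, `recall_multi_hit`, `reciprocal_rank`) and the retrieved lists are hypothetical, chosen only to show how a missed document and a lower-ranked hit change the scores.

```python
# Minimal sketch of Recall and MRR over document IDs.
# All helper names and the sample retrieved lists are hypothetical;
# they are not taken from Haystack or from a real BM25 run.


def recall_single_hit(ground_truth_ids, retrieved_ids):
    """Return 1.0 if at least one relevant ID was retrieved, else 0.0."""
    return 1.0 if set(ground_truth_ids) & set(retrieved_ids) else 0.0


def recall_multi_hit(ground_truth_ids, retrieved_ids):
    """Return the fraction of relevant IDs that were retrieved."""
    relevant = set(ground_truth_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids)) / len(relevant)


def reciprocal_rank(ground_truth_ids, retrieved_ids):
    """Return 1/rank of the first relevant ID, or 0.0 if none was retrieved."""
    relevant = set(ground_truth_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical (ground truth, retrieved) pairs for three queries.
samples = [
    (["doc-vector-databases"], ["doc-vector-databases", "doc-chunk-size", "doc-recall"]),
    (["doc-mrr"], ["doc-recall", "doc-mrr", "doc-chunk-size"]),
    (["doc-recall"], ["doc-chunk-size", "doc-faithfulness", "doc-mrr"]),
]

recalls = [recall_multi_hit(truth, retrieved) for truth, retrieved in samples]
reciprocal_ranks = [reciprocal_rank(truth, retrieved) for truth, retrieved in samples]

mean_recall = sum(recalls) / len(recalls)
mean_rr = sum(reciprocal_ranks) / len(reciprocal_ranks)

print(f"Average Recall: {mean_recall:.2f}")
print(f"Average MRR: {mean_rr:.2f}")
```

In this made-up data, the second query still counts as a full recall hit but contributes only 0.5 to MRR because the relevant document sits at rank 2, while the third query scores 0 on both metrics because the labeled document is missing from the top 3.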