Commit 215bc81

RAG Embeddings recipe (#184)
* updating readme and adding rag embeddings recipe
* updating readme
* resolving gemini suggestions
* resolving gemini typo suggestions
1 parent ca6be61 commit 215bc81

12 files changed

Lines changed: 481 additions & 1 deletion

genai-on-vertex-ai/gemini/model_upgrades/README.md

Lines changed: 1 addition & 1 deletion
@@ -20,4 +20,4 @@ The goal is to accelerate the process of upgrading to the latest version of Gemi
  | Multi-turn Chat | [view](./multiturn_chat/vertex_colab/multiturn_chat_eval.ipynb) | [view](./multiturn_chat/vertex_script/README.md) | [view](./multiturn_chat/promptfoo/README.md) |
  | Instruction Following | [view](./instruction_following/vertex_colab/instruction_following_eval.ipynb) | [view](./instruction_following/vertex_script/README.md) | [view](./instruction_following/promptfoo/README.md) |
  | Image-Prompt Alignment | [view](./image_prompt_alignment/vertex_colab/image_prompt_alignment_eval.ipynb) | [view](./image_prompt_alignment/vertex_script/README.md) | [view](./image_prompt_alignment/promptfoo/README.md) |
+ | RAG Embeddings | [view](./rag_embeddings/vertex_colab/rag_embeddings_eval.ipynb) | [view](./rag_embeddings/vertex_script/README.md) | |

genai-on-vertex-ai/gemini/model_upgrades/rag_embeddings/vertex_colab/baseline_dataset.jsonl

Lines changed: 8 additions & 0 deletions
Large diffs are not rendered by default.

genai-on-vertex-ai/gemini/model_upgrades/rag_embeddings/vertex_colab/candidate_dataset.jsonl

Lines changed: 8 additions & 0 deletions
Large diffs are not rendered by default.
genai-on-vertex-ai/gemini/model_upgrades/rag_embeddings/vertex_colab/prompt_template.txt

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
You are a professional RAG evaluator.

You will be assessing context retrieval quality, which measures how accurately the retrieved context covers the key information present in the reference answer.

You will assign the retrieved context a score on a scale of 1-5, following the INDIVIDUAL RATING RUBRIC and EVALUATION STEPS.

CRITERIA DEFINITIONS:
Context Retrieval quality: The retrieved context should contain all the key information that is present in the reference. The context can contain additional information that is not found in the reference, but it should not miss any key information found in the reference.

INDIVIDUAL RATING RUBRIC:
1 : The retrieved context does not contain any of the key information from the reference.
2 : The retrieved context covers just a little of the key information from the reference.
3 : The retrieved context covers about half of the key information from the reference.
4 : The retrieved context covers most of the key information from the reference.
5 : The retrieved context covers all of the key information from the reference.

EVALUATION STEPS:
STEP 1: Assess the context retrieval quality based on the criteria.
STEP 2: Score based on the rubric.

Give step-by-step explanations for your scoring, and only choose scores from the individual rating rubric above.

CONTEXT: {retrieved_context}

REFERENCE: {reference}
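
A minimal sketch of how the evaluation service effectively fills the two placeholders from a dataset row. This is illustrative only: it assumes the template above is saved locally as `prompt_template.txt`, and the record values are made up rather than taken from the committed datasets.

```python
# Minimal sketch: fill {retrieved_context} and {reference} the way the
# evaluation service maps template variables to dataset columns.
# Assumes prompt_template.txt is available locally; values are illustrative.
with open("prompt_template.txt") as f:
    template = f.read()

record = {
    "reference": "The capital of France is Paris.",
    "retrieved_context": "Paris has been the capital of France since 508 AD.",
}

rendered = template.format(**record)  # str.format mimics the substitution
print(rendered)
```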
genai-on-vertex-ai/gemini/model_upgrades/rag_embeddings/vertex_colab/rag_embeddings_eval.ipynb

Lines changed: 250 additions & 0 deletions
@@ -0,0 +1,250 @@
{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Copyright 2025 Google LLC\n",
        "#\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## **RAG Embeddings Retrieval Eval Recipe**\n",
        "\n",
        "This Eval Recipe demonstrates how to compare the performance of two embedding models on a RAG dataset using the [Vertex AI Evaluation Service](https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview).\n",
        "\n",
        "We will be looking at `text-embedding-004` as our baseline model and `text-embedding-005` as our candidate model. Please follow the [documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings) to get an understanding of the various text embedding models."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<table align=\"left\">\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://art-analytics.appspot.com/r.html?uaid=G-FHXEFWTT4E&utm_source=aRT-rag_retrieval&utm_medium=aRT-clicks&utm_campaign=rag_retrieval&destination=rag_retrieval&url=https%3A%2F%2Fcolab.research.google.com%2Fgithub%2FGoogleCloudPlatform%2Fapplied-ai-engineering-samples%2Fblob%2Fmain%2Fgenai-on-vertex-ai%2Fgemini%2Fmodel_upgrades%2Frag_embeddings%2Fvertex_colab%2Frag_embeddings_eval.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://art-analytics.appspot.com/r.html?uaid=G-FHXEFWTT4E&utm_source=aRT-rag_retrieval&utm_medium=aRT-clicks&utm_campaign=rag_retrieval&destination=rag_retrieval&url=https%3A%2F%2Fconsole.cloud.google.com%2Fvertex-ai%2Fcolab%2Fimport%2Fhttps%3A%252F%252Fraw.githubusercontent.com%252FGoogleCloudPlatform%252Fapplied-ai-engineering-samples%252Fmain%252Fgenai-on-vertex-ai%252Fgemini%252Fmodel_upgrades%252Frag_embeddings%252Fvertex_colab%252Frag_embeddings_eval.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://art-analytics.appspot.com/r.html?uaid=G-FHXEFWTT4E&utm_source=aRT-rag_retrieval&utm_medium=aRT-clicks&utm_campaign=rag_retrieval&destination=rag_retrieval&url=https%3A%2F%2Fconsole.cloud.google.com%2Fvertex-ai%2Fworkbench%2Fdeploy-notebook%3Fdownload_url%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fapplied-ai-engineering-samples%2Fmain%2Fgenai-on-vertex-ai%2Fgemini%2Fmodel_upgrades%2Frag_embeddings%2Fvertex_colab%2Frag_embeddings_eval.ipynb\">\n",
        "      <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples/blob/main/genai-on-vertex-ai/gemini/model_upgrades/rag_embeddings/vertex_colab/rag_embeddings_eval.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg\" alt=\"GitHub logo\"><br> View on GitHub\n",
        "    </a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "- Use case: RAG retrieval\n",
        "\n",
        "- Metric: This eval uses a Pointwise Retrieval quality template to evaluate the responses and pick an embedding model as the winner. We will define `retrieval quality` as the metric here. It checks whether the `retrieved_context` contains all the key information present in `reference`.\n",
        "\n",
        "- Evaluation Datasets are based on the [RAG Dataset](https://www.kaggle.com/datasets/samuelmatsuoharris/single-topic-rag-evaluation-dataset) in compliance with the following [license](https://www.mit.edu/~amini/LICENSE.md). They include 8 randomly sampled prompts in the JSONL files `baseline_dataset.jsonl` and `candidate_dataset.jsonl` with the following structure:\n",
        "    - `question`: User-inputted question\n",
        "    - `reference`: The ground-truth answer for the question\n",
        "    - `retrieved_context`: The context retrieved from the model\n",
        "\n",
        "- Prompt Template is a zero-shot prompt located in [`prompt_template.txt`](./prompt_template.txt) with two prompt variables (`reference` and `retrieved_context`) that are automatically populated from our dataset.\n",
        "\n",
        "- This eval recipe uses an LLM judge model (gemini-2.0-flash) to evaluate the retrieval quality of the embedding models."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## **Prerequisite**\n",
        "\n",
        "This recipe assumes that the user has already created datasets for the baseline (text-embedding-004) and candidate (text-embedding-005) embedding models. Please refer to the [RAG Engine generation notebook](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/rag-engine/rag_engine_eval_service_sdk.ipynb) to create two separate RAG engines and set up the corresponding datasets. The `retrieved_context` column in each dataset is the context retrieved from the respective RAG engine for each question."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Configure Eval Settings"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "%%writefile .env\n",
        "PROJECT_ID=your-project-id # Google Cloud Project ID\n",
        "LOCATION=us-central1 # Region for all required Google Cloud services\n",
        "EXPERIMENT_NAME=rag-embeddings-eval-recipe-demo # Creates Vertex AI Experiment to track the eval runs\n",
        "BASELINE_EMBEDDING_MODEL=text-embedding-004\n",
        "CANDIDATE_EMBEDDING_MODEL=text-embedding-005\n",
        "MODEL=gemini-2.0-flash # This model will be the judge for performing evaluations\n",
        "BASELINE_DATASET_URI=\"gs://gemini_assets/rag_embeddings/baseline_dataset.jsonl\" # Baseline embedding model dataset in Google Cloud Storage\n",
        "CANDIDATE_DATASET_URI=\"gs://gemini_assets/rag_embeddings/candidate_dataset.jsonl\" # Candidate embedding model dataset in Google Cloud Storage\n",
        "PROMPT_TEMPLATE_URI=\"gs://gemini_assets/rag_embeddings/prompt_template.txt\" # Text file in Google Cloud Storage\n",
        "METRIC_NAME=\"retrieval_quality\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Install Python Libraries"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "%pip install --upgrade --quiet google-cloud-aiplatform[evaluation] python-dotenv\n",
        "# The error \"session crashed\" is expected. Please ignore it and proceed to the next cell.\n",
        "import IPython\n",
        "IPython.Application.instance().kernel.do_shutdown(True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import os\n",
        "import json\n",
        "import pandas as pd\n",
        "import sys\n",
        "import vertexai\n",
        "from dotenv import load_dotenv\n",
        "from google.cloud import storage\n",
        "\n",
        "from datetime import datetime\n",
        "from IPython.display import clear_output\n",
        "from vertexai.evaluation import EvalTask, EvalResult, PointwiseMetric"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Authenticate to Google Cloud (requires permission to open a popup window)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "load_dotenv(override=True)\n",
        "if os.getenv(\"PROJECT_ID\") == \"your-project-id\":\n",
        "    raise ValueError(\"Please configure your Google Cloud Project ID in the first cell.\")\n",
        "if \"google.colab\" in sys.modules:\n",
        "    from google.colab import auth\n",
        "    auth.authenticate_user()\n",
        "vertexai.init(project=os.getenv('PROJECT_ID'), location=os.getenv('LOCATION'))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Run the eval on both models with the Pointwise Autorater"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def load_file(gcs_uri: str) -> str:\n",
        "    blob = storage.Blob.from_string(gcs_uri, storage.Client())\n",
        "    return blob.download_as_string().decode('utf-8')\n",
        "\n",
        "def load_dataset(dataset_uri: str):\n",
        "    jsonl = load_file(dataset_uri)\n",
        "    samples = [json.loads(line) for line in jsonl.splitlines() if line.strip()]\n",
        "    df = pd.DataFrame(samples)\n",
        "    return df\n",
        "\n",
        "def load_prompt_template() -> str:\n",
        "    blob = storage.Blob.from_string(os.getenv(\"PROMPT_TEMPLATE_URI\"), storage.Client())\n",
        "    return blob.download_as_string().decode('utf-8')\n",
        "\n",
        "def run_eval(model: str, embedding_model: str, dataset_uri: str) -> EvalResult:\n",
        "    timestamp = f\"{datetime.now().strftime('%b-%d-%H-%M-%S')}\".lower()\n",
        "    return EvalTask(\n",
        "        dataset=dataset_uri,\n",
        "        metrics=[PointwiseMetric(\n",
        "            metric=os.getenv('METRIC_NAME'),\n",
        "            metric_prompt_template=load_prompt_template()\n",
        "        )],\n",
        "        experiment=os.getenv('EXPERIMENT_NAME')\n",
        "    ).evaluate(\n",
        "        response_column_name='retrieved_context',\n",
        "        experiment_run_name=f\"{timestamp}-{embedding_model}-{model.replace('.', '-')}\"\n",
        "    )\n",
        "\n",
        "baseline_metrics = run_eval(os.getenv(\"MODEL\"), os.getenv(\"BASELINE_EMBEDDING_MODEL\"), os.getenv(\"BASELINE_DATASET_URI\"))\n",
        "candidate_metrics = run_eval(os.getenv(\"MODEL\"), os.getenv(\"CANDIDATE_EMBEDDING_MODEL\"), os.getenv(\"CANDIDATE_DATASET_URI\"))\n",
        "clear_output()\n",
        "print(\"Average score for baseline model retrieval quality:\", round(baseline_metrics.summary_metrics[f'{os.getenv(\"METRIC_NAME\")}/mean'], 3))\n",
        "print(\"Average score for candidate model retrieval quality:\", round(candidate_metrics.summary_metrics[f'{os.getenv(\"METRIC_NAME\")}/mean'], 3))"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": ".venv",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.12"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}
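
After the final cell runs, the two `EvalResult` objects can also be compared per question. The following is a hedged sketch of a possible follow-up cell, not part of this commit; it assumes the SDK exposes a `metrics_table` DataFrame with per-row `<metric>/score` columns (mirroring the `<metric>/mean` summary key used above) and that both datasets list the same questions in the same order.

```python
# Hedged sketch of an optional follow-up cell: per-question score comparison.
# Assumes EvalResult.metrics_table exists and exposes "<metric>/score" columns,
# and that both datasets contain the same 8 questions in the same order.
import os
import pandas as pd

metric = os.getenv("METRIC_NAME")
comparison = pd.DataFrame({
    "question": baseline_metrics.metrics_table["question"].values,
    "baseline_score": baseline_metrics.metrics_table[f"{metric}/score"].values,
    "candidate_score": candidate_metrics.metrics_table[f"{metric}/score"].values,
})
comparison["delta"] = comparison["candidate_score"] - comparison["baseline_score"]
comparison.sort_values("delta")
```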
genai-on-vertex-ai/gemini/model_upgrades/rag_embeddings/vertex_script/README.md

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
# RAG Retrieval

### _Eval Recipe for model migration_

This Eval Recipe demonstrates how to compare the performance of two embedding models on a RAG dataset using the [Vertex AI Evaluation Service](https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview).

We will be looking at `text-embedding-004` as our baseline model and `text-embedding-005` as our candidate model. Please follow the [documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings) to get an understanding of the various text embedding models.

- Use case: RAG retrieval

- Metric: This eval uses a Pointwise Retrieval quality template to evaluate the responses and pick a model as the winner. We will define `retrieval quality` as the metric here. It checks whether the `retrieved_context` contains all the key information present in `reference`.

- Evaluation Datasets are based on the [RAG Dataset](https://www.kaggle.com/datasets/samuelmatsuoharris/single-topic-rag-evaluation-dataset) in compliance with the following [license](https://www.mit.edu/~amini/LICENSE.md). They include 8 randomly sampled prompts in the JSONL files `baseline_dataset.jsonl` and `candidate_dataset.jsonl` with the following structure (a sample record is sketched after this list):
    - `question`: User-inputted question
    - `reference`: The ground-truth answer for the question
    - `retrieved_context`: The context retrieved from the model

- Prompt Template is a zero-shot prompt located in [`prompt_template.txt`](./prompt_template.txt) with two prompt variables (`reference` and `retrieved_context`) that are automatically populated from our dataset.

- This eval recipe uses an LLM judge model (gemini-2.0-flash) to evaluate the retrieval quality of the embedding models.

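As referenced above, here is a hedged sketch of what one record in `baseline_dataset.jsonl` or `candidate_dataset.jsonl` could look like. The field values and the `sample_dataset.jsonl` file name are purely illustrative and are not taken from the committed datasets.

```python
import json

# Illustrative only: one record following the question / reference / retrieved_context
# schema described above. The committed datasets contain 8 such records each.
record = {
    "question": "What is the capital of France?",
    "reference": "The capital of France is Paris.",
    "retrieved_context": "Paris has been the capital of France since 508 AD.",
}

# Each line of the JSONL dataset is one JSON object.
with open("sample_dataset.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```
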
### _Prerequisite_

This recipe assumes that the user has already created datasets for the baseline (text-embedding-004) and candidate (text-embedding-005) embedding models. Please refer to the [RAG Engine generation notebook](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/rag-engine/rag_engine_eval_service_sdk.ipynb) to create two separate RAG engines and set up the corresponding datasets. The `retrieved_context` column in each dataset is the context retrieved from the respective RAG engine for each question.

- Python script [`eval.py`](./eval.py) configures the evaluation:
    - `run_eval`: configures the evaluation task, runs it on the two models and prints the results.
    - `load_dataset`: loads the dataset, including the contents of all documents.

- Shell script [`run.sh`](./run.sh) installs the required Python libraries and runs `eval.py`.

- Google Cloud Shell is the easiest option, as it automatically clones our GitHub repo:

<a href="https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples&cloudshell_git_branch=main&cloudshell_workspace=genai-on-vertex-ai/gemini/model_upgrades">
<img alt="Open in Cloud Shell" src="http://gstatic.com/cloudssh/images/open-btn.png">
</a>

- Alternatively, you can use the following command to clone this repo to any Linux environment with a configured [Google Cloud Environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment):

``` bash
git clone --filter=blob:none --sparse https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples.git && \
cd applied-ai-engineering-samples && \
git sparse-checkout init && \
git sparse-checkout set genai-on-vertex-ai/gemini/model_upgrades && \
git pull origin main
cd genai-on-vertex-ai/gemini/model_upgrades
```

1. Navigate to the Eval Recipe directory in the terminal, set your Google Cloud Project ID, and run the shell script `run.sh`.

``` bash
cd rag_embeddings/vertex_script
export PROJECT_ID="[your-project-id]"
./run.sh
```

1. The resulting metrics will be displayed in the script output.

1. You can use [Vertex AI Experiments](https://console.cloud.google.com/vertex-ai/experiments) to view the history of evaluations for each experiment, including the final metric scores.

## How to customize this Eval Recipe:

There will be two evaluation runs: one for the baseline model and one for the candidate model.

1. Edit the Python script `eval.py` (a minimal sketch of such a script is shown after this list):
    - set the `project` parameter of `vertexai.init` to your Google Cloud Project ID.
    - set the parameter `model` in the `run_eval` calls (e.g., 'gemini-2.0-flash') to the LLM you want to use for performing the evaluation task.
    - set the parameter `embedding_model` to the embedding model that you want to run the evaluation for.
    - configure a unique `experiment_name` for tracking purposes.
    - set the parameter `dataset_local_path` to the file you are running the evaluations for.
1. Replace the contents of `prompt_template.txt` with your custom prompt template. Make sure that the prompt template variables map to the dataset attributes.
1. Please refer to our [documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval) if you want to further customize your evaluation. Vertex AI Evaluation Service has many features that are not included in this recipe.
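
For reference, here is a minimal sketch of what such an `eval.py` could look like, modeled on the notebook in this commit. The committed `eval.py` is not rendered in this diff, so everything below is an assumption rather than the actual script; in particular, it reads `prompt_template.txt` and the datasets from local files instead of Google Cloud Storage.

```python
# Hedged sketch of eval.py, modeled on the notebook in this commit; not the actual file.
import json
from datetime import datetime

import pandas as pd
import vertexai
from vertexai.evaluation import EvalResult, EvalTask, PointwiseMetric


def load_dataset(dataset_local_path: str) -> pd.DataFrame:
    # Each JSONL line holds one record with question / reference / retrieved_context.
    with open(dataset_local_path) as f:
        return pd.DataFrame([json.loads(line) for line in f if line.strip()])


def run_eval(model: str, embedding_model: str, dataset_local_path: str,
             experiment_name: str, metric_name: str = "retrieval_quality") -> EvalResult:
    timestamp = datetime.now().strftime("%b-%d-%H-%M-%S").lower()
    prompt_template = open("prompt_template.txt").read()
    result = EvalTask(
        dataset=load_dataset(dataset_local_path),
        metrics=[PointwiseMetric(metric=metric_name, metric_prompt_template=prompt_template)],
        experiment=experiment_name,
    ).evaluate(
        response_column_name="retrieved_context",
        experiment_run_name=f"{timestamp}-{embedding_model}-{model.replace('.', '-')}",
    )
    print(embedding_model, result.summary_metrics)
    return result


if __name__ == "__main__":
    vertexai.init(project="your-project-id", location="us-central1")  # set your project
    run_eval("gemini-2.0-flash", "text-embedding-004", "baseline_dataset.jsonl",
             "rag-embeddings-eval-recipe-demo")
    run_eval("gemini-2.0-flash", "text-embedding-005", "candidate_dataset.jsonl",
             "rag-embeddings-eval-recipe-demo")
```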
